codebasics
diff --git a/9_bag_of_words/bag_of_words_exercise_questions.ipynb b/9_bag_of_words/bag_of_words_exercise_questions.ipynb
@@ -0,0 +1,315 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "KYoVrnewenmh"
+   },
+   "source": [
+    "### Bag of words: Exercises\n",
+    "\n",
+    "\n",
+    "- In this exercise, you are going to classify whether a given movie review is **positive or negative**.\n",
+    "- You are going to use Bag of Words to pre-process the text and then apply different classification algorithms.\n",
+    "- sklearn's CountVectorizer provides a built-in implementation of Bag of Words."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 24,
+   "metadata": {
+    "id": "JW6MPIjib_4G"
+   },
+   "outputs": [],
+   "source": [
+    "#Import necessary libraries\n",
+    "\n",
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.feature_extraction.text import CountVectorizer\n",
+    "from sklearn.ensemble import RandomForestClassifier\n",
+    "from sklearn.neighbors import KNeighborsClassifier\n",
+    "from sklearn.naive_bayes import MultinomialNB\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "from sklearn.metrics import classification_report"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "kDATDCL8NMML"
+   },
+   "source": [
+    "### **About Data: IMDB Dataset**\n",
+    "\n",
+    "Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download\n",
+    "\n",
+    "\n",
+    "- The data consists of two columns:\n",
+    "    - review\n",
+    "    - sentiment\n",
+    "- Reviews are the statements given by users after watching the movie.\n",
+    "- The sentiment column tells whether the given review is positive or negative."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/",
+     "height": 224
+    },
+    "id": "beL29JwEb_7O",
+    "outputId": "cf0a9e1e-b80b-4447-d759-0828baba2620"
+   },
+   "outputs": [],
+   "source": [
+    "#1. read the data from 'movies_sentiment_data.csv' (provided in the same directory) and store it in the variable df\n",
+    "\n",
+    "\n",
+    "\n",
+    "#2. print the shape of the data\n",
+    "\n",
+    "\n",
+    "#3. print the top 5 data points\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#create a new column \"Category\" that is 1 if the sentiment is positive and 0 if it is negative\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "OSwPM7mub_9S",
+    "outputId": "2b68719c-b7f4-48b8-a41e-3f95cca9f2f2"
+   },
+   "outputs": [],
+   "source": [
+    "#check the distribution of 'Category' and see whether the Target labels are balanced or not.\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {
+    "id": "IB97QiFCcAAe"
+   },
+   "outputs": [],
+   "source": [
+    "#Do the 'train-test' splitting with test size of 20%\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "mtr4mSLEMWiU"
+   },
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "J-pUGPqwMrDQ"
+   },
+   "source": [
+    "**Exercise-1**\n",
+    "\n",
+    "1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.\n",
+    "\n",
+    "**Note:**\n",
+    "- use CountVectorizer for pre-processing the text.\n",
+    "\n",
+    "- use **Random Forest** as the classifier with 50 estimators (`n_estimators=50`) and `criterion='entropy'`.\n",
+    "- print the classification report.\n",
+    "\n",
+    "**References**:\n",
+    "\n",
+    "- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n",
+    "\n",
+    "- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "CbldZv03MWkB",
+    "outputId": "cf70d361-da12-46a9-8d59-73cdba9bad91"
+   },
+   "outputs": [],
+   "source": [
+    "#1. create a pipeline object\n",
+    "\n",
+    "\n",
+    "\n",
+    "\n",
+    "#2. fit with X_train and y_train\n",
+    "\n",
+    "\n",
+    "\n",
+    "#3. get the predictions for X_test and store it in y_pred\n",
+    "\n",
+    "\n",
+    "\n",
+    "#4. print the classification report\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "WMVvGzqXSFYr"
+   },
+   "source": [
+    "**Exercise-2**\n",
+    "\n",
+    "1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.\n",
+    "\n",
+    "**Note:**\n",
+    "- use CountVectorizer for pre-processing the text.\n",
+    "- use **KNN** as the classifier with n_neighbors of 10 and metric as 'euclidean'.\n",
+    "- print the classification report.\n",
+    "\n",
+    "**References**:\n",
+    "\n",
+    "- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n",
+    "- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "tYkY77S6MWng",
+    "outputId": "53275bdc-4629-464c-d26f-00075b080174"
+   },
+   "outputs": [],
+   "source": [
+    "\n",
+    "#1. create a pipeline object\n",
+    "\n",
+    "\n",
+    "#2. fit with X_train and y_train\n",
+    "\n",
+    "\n",
+    "\n",
+    "#3. get the predictions for X_test and store it in y_pred\n",
+    "\n",
+    "\n",
+    "#4. print the classification report\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Exercise-3**\n",
+    "\n",
+    "1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.\n",
+    "\n",
+    "**Note:**\n",
+    "- use CountVectorizer for pre-processing the text.\n",
+    "- use **Multinomial Naive Bayes** as the classifier.\n",
+    "- print the classification report.\n",
+    "\n",
+    "**References**:\n",
+    "\n",
+    "- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n",
+    "- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "#1. create a pipeline object\n",
+    "\n",
+    "\n",
+    "\n",
+    "#2. fit with X_train and y_train\n",
+    "\n",
+    "\n",
+    "\n",
+    "#3. get the predictions for X_test and store it in y_pred\n",
+    "\n",
+    "\n",
+    "\n",
+    "#4. print the classification report\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Can you write some observations on why a model like KNN fails to produce good results, unlike RandomForest and MultinomialNB?\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## [**Solution**](./bag_of_words_exercise_solutions.ipynb)"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "collapsed_sections": [],
+   "name": "BOW_exercise.ipynb",
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
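
The notebook relies on CountVectorizer for the Bag of Words step without showing what that step produces. As a rough illustration only, the sketch below builds count vectors by hand with the standard library. It is a simplified imitation, not sklearn's implementation: the helper names (`tokenize`, `build_vocabulary`, `to_count_vector`) and the two toy reviews are assumptions made for this example and are not taken from the IMDB dataset.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters.
    # This loosely mirrors CountVectorizer's default tokenization.
    return re.findall(r"\b\w\w+\b", text.lower())


def build_vocabulary(documents):
    # Assign each distinct token a column index, in sorted order.
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(text, vocabulary):
    # Count how often each vocabulary token occurs in the text.
    # Tokens that are not in the vocabulary are ignored.
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical toy reviews, used only to show the shape of the output.
reviews = [
    "A great movie, great acting",
    "A boring movie",
]

vocabulary = build_vocabulary(reviews)
vectors = [to_count_vector(r, vocabulary) for r in reviews]

print(vocabulary)  # {'acting': 0, 'boring': 1, 'great': 2, 'movie': 3}
print(vectors)     # [[1, 0, 2, 1], [0, 1, 0, 1]]
```

Each review becomes one row of word counts, and word order is discarded. On a real corpus the vocabulary grows to many thousands of columns and most entries in each row are zero, which is worth keeping in mind when comparing how the three classifiers in the exercises behave.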