Commit f651863

Add files via upload (#9)
1 parent 8c6117d commit f651863

3 files changed

+19814
-0
lines changed

Lines changed: 315 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,315 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {
6+
"id": "KYoVrnewenmh"
7+
},
8+
"source": [
9+
"### Bag of words: Exercises\n",
10+
"\n",
11+
"\n",
12+
"- In this exercise, you will classify a given movie review as **positive or negative**.\n",
13+
"- You will use Bag of Words to pre-process the text and then apply different classification algorithms.\n",
14+
"- scikit-learn's CountVectorizer provides a built-in implementation of Bag of Words."
15+
]
16+
},
17+
{
18+
"cell_type": "code",
19+
"execution_count": 24,
20+
"metadata": {
21+
"id": "JW6MPIjib_4G"
22+
},
23+
"outputs": [],
24+
"source": [
25+
"#Import necessary libraries\n",
26+
"\n",
27+
"import pandas as pd\n",
28+
"import numpy as np\n",
29+
"from sklearn.model_selection import train_test_split\n",
30+
"from sklearn.feature_extraction.text import CountVectorizer\n",
31+
"from sklearn.ensemble import RandomForestClassifier\n",
32+
"from sklearn.neighbors import KNeighborsClassifier\n",
33+
"from sklearn.naive_bayes import MultinomialNB\n",
34+
"from sklearn.pipeline import Pipeline\n",
35+
"from sklearn.metrics import classification_report"
36+
]
37+
},
38+
{
39+
"cell_type": "markdown",
40+
"metadata": {
41+
"id": "kDATDCL8NMML"
42+
},
43+
"source": [
44+
"### **About Data: IMDB Dataset**\n",
45+
"\n",
46+
"Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download\n",
47+
"\n",
48+
"\n",
49+
"- This data consists of two columns.\n",
50+
" - review\n",
51+
" - sentiment\n",
52+
"- Reviews are the statements given by users after watching the movie.\n",
53+
"- The sentiment column indicates whether the given review is positive or negative."
54+
]
55+
},
56+
{
57+
"cell_type": "code",
58+
"execution_count": 1,
59+
"metadata": {
60+
"colab": {
61+
"base_uri": "https://localhost:8080/",
62+
"height": 224
63+
},
64+
"id": "beL29JwEb_7O",
65+
"outputId": "cf0a9e1e-b80b-4447-d759-0828baba2620"
66+
},
67+
"outputs": [],
68+
"source": [
69+
"#1. read the file 'movies_sentiment_data.csv' from the current directory and store it in the variable df\n",
70+
"\n",
71+
"\n",
72+
"\n",
73+
"#2. print the shape of the data\n",
74+
"\n",
75+
"\n",
76+
"#3. print the first 5 rows\n"
77+
]
78+
},
79+
{
80+
"cell_type": "code",
81+
"execution_count": 26,
82+
"metadata": {},
83+
"outputs": [],
84+
"source": [
85+
"#create a new column \"Category\" that is 1 if the sentiment is positive and 0 if it is negative\n"
86+
]
87+
},
88+
{
89+
"cell_type": "code",
90+
"execution_count": 2,
91+
"metadata": {
92+
"colab": {
93+
"base_uri": "https://localhost:8080/"
94+
},
95+
"id": "OSwPM7mub_9S",
96+
"outputId": "2b68719c-b7f4-48b8-a41e-3f95cca9f2f2"
97+
},
98+
"outputs": [],
99+
"source": [
100+
"#check the distribution of 'Category' to see whether the target labels are balanced\n",
101+
"\n"
102+
]
103+
},
104+
{
105+
"cell_type": "code",
106+
"execution_count": 3,
107+
"metadata": {
108+
"id": "IB97QiFCcAAe"
109+
},
110+
"outputs": [],
111+
"source": [
112+
"#Split the data into train and test sets with a test size of 20%\n",
113+
"\n"
114+
]
115+
},
116+
{
117+
"cell_type": "code",
118+
"execution_count": null,
119+
"metadata": {
120+
"id": "mtr4mSLEMWiU"
121+
},
122+
"outputs": [],
123+
"source": []
124+
},
125+
{
126+
"cell_type": "markdown",
127+
"metadata": {
128+
"id": "J-pUGPqwMrDQ"
129+
},
130+
"source": [
131+
"**Exercise-1**\n",
132+
"\n",
133+
"1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.\n",
134+
"\n",
135+
"**Note:**\n",
136+
"- use CountVectorizer for pre-processing the text.\n",
137+
"\n",
138+
"- use **Random Forest** as the classifier with n_estimators set to 50 and criterion set to 'entropy'.\n",
139+
"- print the classification report.\n",
140+
"\n",
141+
"**References**:\n",
142+
"\n",
143+
"- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html\n",
144+
"\n",
145+
"- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html"
146+
]
147+
},
148+
{
149+
"cell_type": "code",
150+
"execution_count": 4,
151+
"metadata": {
152+
"colab": {
153+
"base_uri": "https://localhost:8080/"
154+
},
155+
"id": "CbldZv03MWkB",
156+
"outputId": "cf70d361-da12-46a9-8d59-73cdba9bad91"
157+
},
158+
"outputs": [],
159+
"source": [
160+
"#1. create a pipeline object\n",
161+
"\n",
162+
"\n",
163+
"\n",
164+
"\n",
165+
"#2. fit with X_train and y_train\n",
166+
"\n",
167+
"\n",
168+
"\n",
169+
"#3. get the predictions for X_test and store it in y_pred\n",
170+
"\n",
171+
"\n",
172+
"\n",
173+
"#4. print the classification report\n"
174+
]
175+
},
176+
{
177+
"cell_type": "markdown",
178+
"metadata": {
179+
"id": "WMVvGzqXSFYr"
180+
},
181+
"source": [
182+
"**Exercise-2**\n",
183+
"\n",
184+
"1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.\n",
185+
"\n",
186+
"**Note:**\n",
187+
"- use CountVectorizer for pre-processing the text.\n",
188+
"- use **KNN** as the classifier with n_neighbors of 10 and metric as 'euclidean'.\n",
189+
"- print the classification report.\n",
190+
"\n",
191+
"**References**:\n",
192+
"\n",
193+
"- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n",
194+
"- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html\n",
195+
"\n"
196+
]
197+
},
198+
{
199+
"cell_type": "code",
200+
"execution_count": 5,
201+
"metadata": {
202+
"colab": {
203+
"base_uri": "https://localhost:8080/"
204+
},
205+
"id": "tYkY77S6MWng",
206+
"outputId": "53275bdc-4629-464c-d26f-00075b080174"
207+
},
208+
"outputs": [],
209+
"source": [
210+
"\n",
211+
"#1. create a pipeline object\n",
212+
"\n",
213+
"\n",
214+
"#2. fit with X_train and y_train\n",
215+
"\n",
216+
"\n",
217+
"\n",
218+
"#3. get the predictions for X_test and store it in y_pred\n",
219+
"\n",
220+
"\n",
221+
"#4. print the classification report\n"
222+
]
223+
},
224+
{
225+
"cell_type": "markdown",
226+
"metadata": {},
227+
"source": [
228+
"**Exercise-3**\n",
229+
"\n",
230+
"1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.\n",
231+
"\n",
232+
"**Note:**\n",
233+
"- use CountVectorizer for pre-processing the text.\n",
234+
"- use **Multinomial Naive Bayes** as the classifier.\n",
235+
"- print the classification report.\n",
236+
"\n",
237+
"**References**:\n",
238+
"\n",
239+
"- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html\n",
240+
"- https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html\n",
241+
"\n"
242+
]
243+
},
244+
{
245+
"cell_type": "code",
246+
"execution_count": 6,
247+
"metadata": {},
248+
"outputs": [],
249+
"source": [
250+
"\n",
251+
"#1. create a pipeline object\n",
252+
"\n",
253+
"\n",
254+
"\n",
255+
"#2. fit with X_train and y_train\n",
256+
"\n",
257+
"\n",
258+
"\n",
259+
"#3. get the predictions for X_test and store it in y_pred\n",
260+
"\n",
261+
"\n",
262+
"\n",
263+
"#4. print the classification report\n"
264+
]
265+
},
266+
{
267+
"cell_type": "code",
268+
"execution_count": null,
269+
"metadata": {},
270+
"outputs": [],
271+
"source": []
272+
},
273+
{
274+
"cell_type": "markdown",
275+
"metadata": {},
276+
"source": [
277+
"### Can you write some observations on why a model like KNN fails to produce good results, unlike RandomForest and MultinomialNB?\n",
278+
"\n"
279+
]
280+
},
281+
{
282+
"cell_type": "markdown",
283+
"metadata": {},
284+
"source": [
285+
"## [**Solution**](./bag_of_words_exercise_solutions.ipynb)"
286+
]
287+
}
288+
],
289+
"metadata": {
290+
"colab": {
291+
"collapsed_sections": [],
292+
"name": "BOW_exercise.ipynb",
293+
"provenance": []
294+
},
295+
"kernelspec": {
296+
"display_name": "Python 3 (ipykernel)",
297+
"language": "python",
298+
"name": "python3"
299+
},
300+
"language_info": {
301+
"codemirror_mode": {
302+
"name": "ipython",
303+
"version": 3
304+
},
305+
"file_extension": ".py",
306+
"mimetype": "text/x-python",
307+
"name": "python",
308+
"nbconvert_exporter": "python",
309+
"pygments_lexer": "ipython3",
310+
"version": "3.8.10"
311+
}
312+
},
313+
"nbformat": 4,
314+
"nbformat_minor": 1
315+
}
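
The notebook's first markdown cell says the reviews are pre-processed with Bag of Words via CountVectorizer. As a rough illustration of what that representation looks like, here is a minimal pure-Python sketch. It is a stand-in, not scikit-learn's implementation: the helper names `tokenize`, `fit_vocabulary` and `transform` and the two toy reviews are hypothetical, and the tokenization only approximates CountVectorizer's defaults (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Turn each document into a list of per-token counts."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:  # tokens not seen while fitting are ignored
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


# Toy reviews, made up for illustration (not taken from the IMDB dataset).
reviews = [
    "A great movie, great acting",
    "A boring movie",
]

vocabulary = fit_vocabulary(reviews)
matrix = transform(reviews, vocabulary)

print(vocabulary)  # {'acting': 0, 'boring': 1, 'great': 2, 'movie': 3}
print(matrix)      # [[1, 0, 2, 1], [0, 1, 0, 1]]
```

Each review becomes one row of counts with one column per vocabulary word, so word order is discarded and only frequencies remain. On the full dataset the vocabulary is far larger than in this toy case, which is why such matrices are normally stored in sparse form.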
