Bag of words: Exercises
1. In this exercise, you will classify whether a given movie review is positive or negative.
2. You will use Bag of Words to pre-process the text and then apply different classification algorithms.
3. Sklearn's CountVectorizer provides a built-in implementation of Bag of Words.
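
To make the Bag of Words idea concrete, here is a minimal pure-Python sketch of the counting step. It is a simplified stand-in, not sklearn's implementation: it assumes CountVectorizer's default behaviour of lowercasing the text, keeping tokens of two or more word characters, and ordering the vocabulary alphabetically. The `bag_of_words` helper and the toy `corpus` are hypothetical names used only for illustration.

```python
import re
from collections import Counter


def bag_of_words(corpus):
    """Return (vocabulary, count_matrix) for a list of strings."""
    # Lowercase each document and keep tokens of two or more word characters,
    # so single-letter words such as "a" are dropped.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
    # The vocabulary is every distinct token, sorted alphabetically.
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # One column per vocabulary word; a missing word counts as 0.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Toy corpus (illustrative only, not taken from the IMDB data).
corpus = [
    "A wonderful little production",
    "A wonderful, wonderful way to spend time",
    "Basically a boring film",
]
vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Each review becomes one row with as many columns as there are distinct words, so the vectors grow with the vocabulary and are mostly zeros. Word order is discarded; only the counts remain.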

In [17]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score

#### About Data: IMDB Dataset
Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download

1. This data consists of two columns: review and sentiment.
2. Each review is the statement a user wrote after watching a movie.
3. The sentiment column tells whether the given review is positive or negative.

In [2]:
#1. read the data from 'IMDB Dataset.csv' in the same directory and store it in the df variable
df = pd.read_csv("IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
#2. print the shape of the data
df.shape

(50000, 2)

In [4]:
#3. print top 5 datapoints
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [5]:
#create a new column "Category" that is 1 if the sentiment is positive and 0 if it is negative
df['Category'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
df.head()

Unnamed: 0,review,sentiment,Category
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1


In [9]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.

df.Category.value_counts()

1    25000
0    25000
Name: Category, dtype: int64

In [10]:
#Do the 'train-test' splitting with test size of 20%

x_train, x_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2)
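
The split above is unseeded, so re-running the notebook puts different reviews in the test set and the scores reported below will shift slightly. As a hedged sketch of one way to make a split repeatable, the example below passes `random_state` and `stratify` to `train_test_split`. The toy `reviews` and `labels` lists and the seed value 42 are assumptions for illustration; they are not what produced the results in this notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-in for df.review and df.Category (hypothetical data).
reviews = [f"review {i}" for i in range(10)]
labels = [1, 0] * 5  # balanced classes, like the IMDB data

x_tr, x_te, y_tr, y_te = train_test_split(
    reviews,
    labels,
    test_size=0.2,
    random_state=42,   # fixed seed: the same rows land in the test set on every run
    stratify=labels,   # keep the class ratio the same in the train and test sets
)

print(len(x_tr), len(x_te))
print(sorted(y_te))
```

With `stratify`, the test set keeps the same balance of positive and negative labels as the full data instead of drifting to something like the 5010/4990 supports seen in the reports below.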

Exercise-1

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

Note:

1. use CountVectorizer for pre-processing the text.

2. use Random Forest as the classifier with n_estimators as 50 and criterion as 'entropy'.

3. print the classification report.

References:

1. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

2. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [20]:
#1. create a pipeline object
clf_rf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('rfc', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

In [22]:
#2. fit with x_train and y_train
clf_rf.fit(x_train, y_train)

In [26]:
#3. get the predictions for x_test and store them in y_pred
y_pred = clf_rf.predict(x_test)


In [27]:
#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.85      0.84      5010
           1       0.84      0.84      0.84      4990

    accuracy                           0.84     10000
   macro avg       0.84      0.84      0.84     10000
weighted avg       0.84      0.84      0.84     10000



In [28]:
print(accuracy_score(y_test, y_pred))

0.8414


Exercise-2

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

Note:

1. use CountVectorizer for pre-processing the text.
2. use KNN as the classifier with n_neighbors of 10 and metric as 'euclidean'.
3. print the classification report.

References:

1. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
2. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [29]:
#1. create a pipeline object
clf_knn = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

In [30]:
#2. fit with x_train and y_train
clf_knn.fit(x_train, y_train)

In [31]:
#3. get the predictions for x_test and store them in y_pred
y_pred = clf_knn.predict(x_test)

In [32]:
#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.66      0.67      0.67      5010
           1       0.67      0.65      0.66      4990

    accuracy                           0.66     10000
   macro avg       0.66      0.66      0.66     10000
weighted avg       0.66      0.66      0.66     10000



In [33]:
print(accuracy_score(y_test, y_pred))

0.6629


Exercise-3

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

Note:

1. use CountVectorizer for pre-processing the text.
2. use Multinomial Naive Bayes as the classifier.
3. print the classification report.

References:

1. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
2. https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html

In [34]:
#1. create a pipeline object
clf_nb = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [36]:
#2. fit with x_train and y_train
clf_nb.fit(x_train, y_train)

In [37]:
#3. get the predictions for x_test and store them in y_pred
y_pred = clf_nb.predict(x_test)

In [38]:
#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.88      0.85      5010
           1       0.87      0.82      0.84      4990

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000



In [39]:
print(accuracy_score(y_test, y_pred))

0.8491


**Can you write some observations on why a model like KNN fails to produce good results, unlike RandomForest and MultinomialNB?**

1. ML algorithms do not work on text data directly, so we need to convert the text into numeric vectors and feed those to the models during training.

2. In the process, we convert the text into a **very high dimensional numeric** vector using the **Bag of Words** technique.

3. A model like KNN does not work well with high dimensional data. With a large number of dimensions, distances between points become less informative, so the nearest neighbors are barely closer than any other point. Calculating distances across that many dimensions also becomes expensive, which hurts the performance of the model.

4. Multinomial Naive Bayes only needs the per-class probabilities of the words in the corpus, which are easy to calculate from word counts and store in a contingency table. This is the major reason it is a text-classification-friendly algorithm.

5. Random Forest uses bootstrapping (row and column sampling) with many decision trees, which helps it overcome the high variance and over-fitting that come with high dimensional data. It also uses the feature importance of words to better classify the categories.

6. ML is like a trial-and-error scientific method: we keep trying the algorithms we have and select the one that gives good results and satisfies requirements such as latency and interpretability.

    * stackabuse.com
    * analyticsindiamab.com
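
The KNN observation can be illustrated with a small experiment. The sketch below is an assumption-laden toy, not a measurement on the IMDB vectors: it draws uniformly random points (not Bag of Words counts) and reports how much farther the farthest point is than the nearest one, relative to the nearest distance. The `relative_contrast` helper, the point count, and the seed are hypothetical choices for illustration.

```python
import math
import random


def relative_contrast(n_dims, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    # Euclidean distance from the query to each randomly drawn point.
    dists = [
        math.dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as dimensions are added: in high dimensions the
# "nearest" neighbor is only marginally closer than the farthest point.
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(n_dims), 3))
```

When the contrast is close to zero, a vote among the 10 nearest neighbors is based on points that are hardly more similar to the query than the rest of the training set, which is consistent with KNN scoring well below the other two pipelines here.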