# Bag Of Words: Exercises
*   In this exercise, you are going to classify whether a given movie review is __positive__ or __negative__.
*   You are going to use Bag of Words (BOW) to pre-process the text and then apply different classification algorithms.
*   Sklearn's CountVectorizer provides a built-in implementation of BOW.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from  sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

##  About Data: IMDB Dataset
Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download

* This data consists of two columns: `review` and `sentiment`.
* Each review is the text a user wrote after watching a movie.
* The sentiment column tells whether the given review is positive or negative.

In [None]:
#1. read the data provided in the same directory with name 'movies_sentiment_data.csv' and store it in df variable
df = pd.read_csv('movies_sentiment_data.csv')

#2. print the shape of the data
print(df.shape)

#3. print top 5 datapoints
df.head()

(19000, 2)


Unnamed: 0,review,sentiment
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive
1,I enjoyed the movie and the story immensely! I...,positive
2,I had a hard time sitting through this. Every ...,negative
3,It's hard to imagine that anyone could find th...,negative
4,This is one military drama I like a lot! Tom B...,positive


In [None]:
#create a new column "Category" that is 1 if the sentiment is positive and 0 if it is negative
df['Category'] = df['sentiment'].apply(lambda x:1 if x=='positive' else 0)
df.head()

Unnamed: 0,review,sentiment,Category
0,I first saw Jake Gyllenhaal in Jarhead (2005) ...,positive,1
1,I enjoyed the movie and the story immensely! I...,positive,1
2,I had a hard time sitting through this. Every ...,negative,0
3,It's hard to imagine that anyone could find th...,negative,0
4,This is one military drama I like a lot! Tom B...,positive,1


In [None]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.
df.Category.value_counts()

Category,count
1,9500
0,9500


In [None]:
#Do the 'train-test' splitting with test size of 20%
X_train, X_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2)
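
The split above is random, so the exact numbers in the reports below can change from run to run. If repeatable results are wanted, one option is to pass `random_state` and `stratify`. The sketch below uses a hypothetical `toy_df` with the same column names, and the seed 42 is an arbitrary choice, not something the exercise prescribes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df, with the same column names as the real data
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "Category": [1] * 10 + [0] * 10,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df.review,
    toy_df.Category,
    test_size=0.2,
    random_state=42,            # arbitrary seed, only for repeatability
    stratify=toy_df.Category,   # keep the 50/50 class ratio in both splits
)

print(len(X_tr), len(X_te))
print(y_te.value_counts().to_dict())
```

With stratification, both the training and test parts keep the same share of positive and negative labels as the full data.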

## Exercise 1:
1. Using sklearn's Pipeline module, create a pipeline to classify movie reviews as positive or negative.

### Note:
* Use CountVectorizer for pre-processing the text.
* Use __Random Forest__ as the classifier with `n_estimators=50` and `criterion='entropy'`.
* Print the classification report.

In [None]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('rf', RandomForestClassifier(n_estimators=50, criterion='entropy'))
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.83      0.83      0.83      1904
           1       0.83      0.83      0.83      1896

    accuracy                           0.83      3800
   macro avg       0.83      0.83      0.83      3800
weighted avg       0.83      0.83      0.83      3800
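
To make clear what the pipeline does behind the scenes, the sketch below compares it with the equivalent manual steps: fit the vectorizer on the training text, transform both splits, then fit and predict with the classifier. The toy texts and the fixed `random_state=0` are assumptions added here so that both versions give identical predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only for illustration
train_texts = ["loved it", "great fun", "hated it", "awful bore"]
train_labels = [1, 1, 0, 0]
test_texts = ["great", "awful"]

# Pipeline version: vectorizing and classifying happen in one fit/predict call
pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0)),
])
pipe.fit(train_texts, train_labels)
pipe_pred = pipe.predict(test_texts)

# Manual version: the same steps written out one by one
vec = CountVectorizer()
train_counts = vec.fit_transform(train_texts)  # learn the vocabulary from training text only
test_counts = vec.transform(test_texts)        # reuse that vocabulary on the test text
rf = RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0)
rf.fit(train_counts, train_labels)
manual_pred = rf.predict(test_counts)

print(np.array_equal(pipe_pred, manual_pred))
```

The pipeline mainly saves bookkeeping: the test text is always transformed with the vocabulary learned from the training text, never refitted.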



## Exercise 2:
1. Using sklearn's Pipeline module, create a pipeline to classify movie reviews as positive or negative.

### Note:
* Use CountVectorizer for pre-processing the text.
* Use __KNN__ as the classifier with `n_neighbors=10` and `metric='euclidean'`.
* Print the classification report.

In [None]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('knn', KNeighborsClassifier(n_neighbors=10, metric='euclidean'))
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.64      0.66      0.65      1904
           1       0.65      0.63      0.64      1896

    accuracy                           0.65      3800
   macro avg       0.65      0.64      0.64      3800
weighted avg       0.65      0.65      0.64      3800
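
One possible contributor to the weaker KNN scores is that Euclidean distance on raw word counts is sensitive to review length. This is offered as an assumption, not something measured on this data. The sketch below uses three made-up texts to show the effect and contrasts it with cosine distance.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Hypothetical texts: the second repeats the first three times, the third shares no words
short_text = "great movie"
long_text = "great movie great movie great movie"
other_text = "boring plot"

X = CountVectorizer().fit_transform([short_text, long_text, other_text])

euc = euclidean_distances(X)  # pairwise Euclidean distances between count vectors
cos = cosine_distances(X)     # pairwise cosine distances (1 - cosine similarity)

print("euclidean short vs long :", euc[0, 1])
print("euclidean short vs other:", euc[0, 2])
print("cosine    short vs long :", cos[0, 1])
print("cosine    short vs other:", cos[0, 2])
```

Under Euclidean distance the short text ends up closer to the unrelated text than to its own longer copy, while cosine distance treats the two copies as pointing in the same direction.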



## Exercise 3:
1. Using sklearn's Pipeline module, create a pipeline to classify movie reviews as positive or negative.

### Note:
* Use CountVectorizer for pre-processing the text.
* Use __Multinomial Naive Bayes__ as the classifier.
* Print the classification report.

In [None]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('nb', MultinomialNB())
])

#2. fit with X_train and y_train
clf.fit(X_train, y_train)

#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)

#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.87      0.84      1904
           1       0.86      0.80      0.83      1896

    accuracy                           0.83      3800
   macro avg       0.84      0.83      0.83      3800
weighted avg       0.84      0.83      0.83      3800
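
Since the three exercises differ only in the classifier, the comparison can also be written as a loop. The sketch below is one way to do that on made-up toy data. The texts, the `candidates` and `scores` names, and `n_neighbors=3` (the toy set is too small for 10 neighbours) are assumptions for illustration only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data, only for illustration
train_texts = [
    "loved this great movie", "wonderful acting and story",
    "great fun to watch", "a wonderful film",
    "hated this boring movie", "awful acting and plot",
    "boring and awful", "a terrible film",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["great wonderful story", "boring terrible plot"]
test_labels = [1, 0]

# Same classifier settings as the exercises, except n_neighbors (see note above)
candidates = {
    "rf": RandomForestClassifier(n_estimators=50, criterion="entropy", random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "nb": MultinomialNB(),
}

scores = {}
for name, model in candidates.items():
    pipe = Pipeline([("vectorizer", CountVectorizer()), (name, model)])
    pipe.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, pipe.predict(test_texts))

print(scores)
```

On the real data, replacing the toy lists with `X_train`, `y_train`, `X_test`, and `y_test` and restoring `n_neighbors=10` would rerun the three exercises in one pass.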

