### **Bag of words: Solutions**

In [4]:
#Import necessary libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report

### **About Data: IMDB Dataset**

Credits: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?resource=download


- This data consists of 2 features:
  - review
  - sentiment
- Reviews are the statements given by users after watching the movie.
- The sentiment feature tells whether the given review is positive or negative.

In [5]:
#1. read the data from 'movies_sentiment_data.csv' (provided in the same directory) and store it in the df variable
df = pd.read_csv("movies_sentiment_data.csv")


#2. print the shape of the data
print(df.shape)

#3. print top 5 datapoints
df.head()

(4400, 2)


Unnamed: 0,review,sentiment
0,This film is mildly entertaining if one neglec...,negative
1,"This is, without doubt, one of my favourite ho...",positive
2,Stupid! Stupid! Stupid! I can not stand Ben st...,negative
3,There is not one character on this sitcom with...,negative
4,"""Checking Out"" is a very witty and honest port...",positive


In [None]:
#create a new column "Category" that is 1 if the sentiment is positive and 0 if it is negative
df['Category'] = df['sentiment'].apply(lambda x: 1 if x =='positive' else 0)

In [8]:
#check the distribution of 'Category' and see whether the Target labels are balanced or not.

df['Category'].value_counts()

0    2200
1    2200
Name: Category, dtype: int64

In [9]:
#Do the 'train-test' splitting with test size of 20%

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.review, df.Category, test_size=0.2)
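
The split above draws a different random sample on every run, so the scores reported in the exercises can shift slightly between runs. As a sketch of one possible variation (not what the cell above does), `random_state` pins the shuffle and `stratify` keeps the class ratio the same in both parts. The toy DataFrame and the seed value 42 are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real DataFrame (hypothetical data, same column names)
toy_df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "Category": [0, 1] * 5,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_df.review,
    toy_df.Category,
    test_size=0.2,
    random_state=42,           # assumed seed: makes the split repeatable
    stratify=toy_df.Category,  # keep the 50/50 class ratio in both parts
)

print(len(X_tr), len(X_te))
print(sorted(y_te.tolist()))
```

With a stratified split the test set would hold equal numbers of each class, so the `support` column of the classification report would differ from the outputs shown in this notebook.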

**Exercise-1**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- use CountVectorizer for pre-processing the text.

- use **Random Forest** as the classifier with `n_estimators` set to 50 and `criterion` set to 'entropy'.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

In [12]:
#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer()),  #initializing the vectorizer
    ('random_forest', RandomForestClassifier(n_estimators=50, criterion='entropy')),  #using the RandomForest classifier
])



#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.83      0.82       435
           1       0.83      0.80      0.81       445

    accuracy                           0.82       880
   macro avg       0.82      0.82      0.82       880
weighted avg       0.82      0.82      0.82       880



**Exercise-2**

1. Using the sklearn pipeline module, create a classification pipeline to classify movie reviews as positive or negative.

**Note:**
- use CountVectorizer with max_df value of 0.8 and min_df value of 0.2.
- use **KNN** as the classifier with n_neighbors of 10 and metric as 'cosine'.
- print the classification report.

**References**:

- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
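
When `max_df` and `min_df` are given as floats, `CountVectorizer` treats them as proportions of documents. The sketch below, on a made-up six-review corpus, compares the default vocabulary with the one left after applying the thresholds from this exercise. The corpus is an assumption chosen so that no word sits exactly on a threshold.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Six made-up reviews (not taken from the dataset)
toy_reviews = [
    "the movie was good",
    "the movie was bad",
    "the plot was good",
    "the acting was good",
    "the movie was dull",
    "the ending was fine",
]

# Default settings keep every word
full = CountVectorizer().fit(toy_reviews)
# Drop words found in more than 80% or fewer than 20% of the reviews
pruned = CountVectorizer(max_df=0.8, min_df=0.2).fit(toy_reviews)

full_vocab = sorted(full.vocabulary_)
pruned_vocab = sorted(pruned.vocabulary_)

print(len(full_vocab), full_vocab)
print(pruned_vocab)  # ['good', 'movie']
```

Here "the" and "was" occur in every review and are removed by `max_df`, while words that occur in a single review are removed by `min_df`. A `min_df` of 0.2 is a strict filter on real reviews, so the pipeline in this exercise likely works with a small vocabulary.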



In [21]:

#1. create a pipeline object
clf = Pipeline([
    ('vectorizer', CountVectorizer(
        max_df=0.8,  #ignore terms that have a document frequency higher than 0.8
        min_df=0.2,  #ignore terms that have a document frequency lower than 0.2
    )),
    ('KNN', KNeighborsClassifier(n_neighbors=10, metric='cosine')),  #using the KNN classifier with 10 neighbors and cosine metric
])


#2. fit with X_train and y_train
clf.fit(X_train, y_train)


#3. get the predictions for X_test and store it in y_pred
y_pred = clf.predict(X_test)


#4. print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.56      0.70      0.63       435
           1       0.62      0.47      0.54       445

    accuracy                           0.59       880
   macro avg       0.59      0.59      0.58       880
weighted avg       0.59      0.59      0.58       880

