# **Movie Review Classification**

In [113]:
#Loading Required Libraries

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

**Loading Dataset**

In [114]:
Dataset = pd.read_csv("/content/drive/MyDrive/Machine_Learning/NLP/Datasets/IMDB_Dataset.csv")
Dataset.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [115]:
Dataset.shape

(50000, 2)

**Preprocessing the Text**

In [116]:
Dataset['sentiment_Num'] = Dataset.sentiment.map(
    {
        "positive": 1,
        "negative": 0
    }
)
Dataset.head()

Unnamed: 0,review,sentiment,sentiment_Num
0,One of the other reviewers has mentioned that ...,positive,1
1,A wonderful little production. <br /><br />The...,positive,1
2,I thought this was a wonderful way to spend ti...,positive,1
3,Basically there's a family where a little boy ...,negative,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1


In [117]:
Dataset.sentiment_Num.value_counts()

1    25000
0    25000
Name: sentiment_Num, dtype: int64

*Both classes have 25,000 reviews each, so our dataset doesn't contain any class imbalance.*

In [118]:
#The class labels look fine; now let's inspect a sample of the predictor (review) text.
Dataset.review[1]
#The text contains a lot of noise (HTML tags, punctuation), so we have to clean it using regular expressions.

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

**Match a single character not present in the list below - [^\w\s\']**

* \w matches any word character (equivalent to [a-zA-Z0-9_] for ASCII text; with Python 3 string patterns it also matches Unicode letters and digits)
* \s matches any whitespace character (equivalent to [\r\n\t\f\v ] for ASCII text)
* \' matches the character ' (ASCII code 39 decimal, 27 hex, 47 octal) literally (case sensitive)

**Match a single character present in the list below, one or more times - [ ]+**
* '+' matches the previous token between one and unlimited times, as many times as possible, giving back as needed (greedy)
* ' ' (space) matches the space character (ASCII code 32 decimal, 20 hex, 40 octal) literally (case sensitive)

**re.sub(r'Pattern', 'Replacer', 'Text')**
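
As a quick illustration of the two patterns described above, here is a minimal sketch on a short made-up string. The sample sentence is an assumption for illustration and is not taken from the dataset. Inside a double-quoted raw string the apostrophe needs no backslash, so `[^\w\s']` is the same character class as `[^\w\s\']`.

```python
import re

sample = "Great film!!! <br />Don't   miss it."

# Step 1: replace every character that is not a word character,
# whitespace, or an apostrophe with a single space.
step1 = re.sub(r"[^\w\s']", " ", sample)

# Step 2: collapse each run of spaces into one space.
step2 = re.sub(r"[ ]+", " ", step1)

print(repr(step2))  # → "Great film br Don't miss it "
```

Note that the letters `br` from the HTML tag survive, because only the surrounding `<`, `/` and `>` are replaced.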

In [119]:
#Testing on a sample Record.
text = """
A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well done.
"""

In [120]:
import re

text = re.sub(r'[^\w\s\']', ' ', text)  # replace every character that is not a word character, whitespace, or apostrophe with a space
text = re.sub(r'[ ]+', ' ', text)  # collapse each run of one or more spaces into a single space
text

"\nA wonderful little production br br The filming technique is very unassuming very old time BBC fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece br br The actors are extremely well chosen Michael Sheen not only has got all the polari but he has all the voices down pat too You can truly see the seamless editing guided by the references to Williams' diary entries not only is it well worth the watching but it is a terrificly written and performed piece A masterful production about one of the great master's of comedy and his life br br The realism really comes home with the little things the fantasy of the guard which rather than use the traditional 'dream' techniques remains solid then disappears It plays on our knowledge and our senses particularly with the scenes concerning Orton and Halliwell and the sets particularly of their flat with Halliwell's murals decorating every surface are terribly well done \n"

In [121]:
#preprocess_text() - cleans a single review string.
import re

def preprocess_text(text):
  # Replace everything except word characters, whitespace, and apostrophes with a space.
  text = re.sub(r'[^\w\s\']', ' ', text)
  # Collapse each run of spaces into a single space.
  text = re.sub(r'[ ]+', ' ', text)
  # Trim leading/trailing whitespace and lowercase the result.
  return text.strip().lower()

In [122]:
Dataset['Preprocessed_Review'] = Dataset.review.apply(preprocess_text)
Dataset.head()

Unnamed: 0,review,sentiment,sentiment_Num,Preprocessed_Review
0,One of the other reviewers has mentioned that ...,positive,1,one of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,positive,1,a wonderful little production br br the filmin...
2,I thought this was a wonderful way to spend ti...,positive,1,i thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,negative,0,basically there's a family where a little boy ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,1,petter mattei's love in the time of money is a...


In [123]:
#Original review, shown for comparison with its preprocessed version.
Dataset.review[1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [124]:
Dataset.Preprocessed_Review[1]

"a wonderful little production br br the filming technique is very unassuming very old time bbc fashion and gives a comforting and sometimes discomforting sense of realism to the entire piece br br the actors are extremely well chosen michael sheen not only has got all the polari but he has all the voices down pat too you can truly see the seamless editing guided by the references to williams' diary entries not only is it well worth the watching but it is a terrificly written and performed piece a masterful production about one of the great master's of comedy and his life br br the realism really comes home with the little things the fantasy of the guard which rather than use the traditional 'dream' techniques remains solid then disappears it plays on our knowledge and our senses particularly with the scenes concerning orton and halliwell and the sets particularly of their flat with halliwell's murals decorating every surface are terribly well done"

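The preprocessed review still contains stray `br` tokens, because only the angle brackets and slash of each `<br />` tag are replaced with spaces. One possible refinement, sketched here as an assumption rather than as part of the pipeline used in this notebook, is to drop HTML tags before removing punctuation. The function name `preprocess_text_no_html` and the tag pattern `<[^>]+>` are illustrative choices.

```python
import re

def preprocess_text_no_html(text):
    # Remove anything that looks like an HTML tag, e.g. <br />.
    text = re.sub(r"<[^>]+>", " ", text)
    # From here on, the same steps as preprocess_text.
    text = re.sub(r"[^\w\s']", " ", text)
    text = re.sub(r"[ ]+", " ", text)
    return text.strip().lower()

cleaned = preprocess_text_no_html(
    "A wonderful little production. <br /><br />The filming technique"
)
print(cleaned)  # → a wonderful little production the filming technique
```

Using this variant would change the vocabulary seen by the vectorizer, so the scores reported later in this notebook would not necessarily be reproduced.
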
**Splitting the Dataset**

In [125]:
X_train,  X_test, y_train, y_test = train_test_split(Dataset.Preprocessed_Review, Dataset.sentiment_Num, test_size = 0.2, stratify = Dataset.sentiment_Num)

In [126]:
print(f'Training Dataset Size - {X_train.shape}')
print(f'Testing Dataset Size - {X_test.shape}')

Training Dataset Size - (40000,)
Testing Dataset Size - (10000,)


In [127]:
print(y_train.value_counts())
print(y_test.value_counts())
#Labels 0 and 1 are equally distributed in both the training and testing sets, because we passed stratify to train_test_split().

1    20000
0    20000
Name: sentiment_Num, dtype: int64
0    5000
1    5000
Name: sentiment_Num, dtype: int64


Machine learning models work only with numbers and cannot interpret raw text, so we use the bag of words technique to convert the text into numerical vectors.

The **Bag of Words (BoW)** model is a simple and commonly used technique in natural language processing (NLP) for text analysis. It represents text data as a collection of words, disregarding grammar and word order, and focusing on word frequencies. Here's how it works:

**Tokenization:** First, you need to tokenize the text, which means breaking it into individual words or terms. This can be done by splitting the text at spaces or punctuation marks.

**Vocabulary Creation:** Create a vocabulary of all unique words in the text data. Each word is assigned a unique index or ID.

**Counting Word Occurrences:** For each document (piece of text), count how many times each word from the vocabulary appears in that document. This results in a "term frequency" vector for each document, where each element of the vector represents the count of a specific word.

**Creating the Bag of Words:** Create a matrix where each row corresponds to a document, and each column corresponds to a word in the vocabulary. The matrix elements are the word counts for each document.

Here's a simplified example to illustrate the concept:

Suppose you have two movie reviews:

**Review_1:** "The movie was fantastic."

**Review_2:** "I thought the movie was terrible."

You can create a Bag of Words representation as follows:

Vocabulary (after lowercasing, so that "The" and "the" count as the same word): ["the", "movie", "was", "fantastic", "i", "thought", "terrible"]

Bag of Words representation:

                the  movie  was  fantastic  i  thought  terrible
      Review_1    1      1    1          1  0        0         0
      Review_2    1      1    1          0  1        1         1

In this representation, each row represents a review, and each column represents a word from the vocabulary. The values in the matrix indicate how many times each word appears in the corresponding review.

The Bag of Words model is a fundamental concept in text analysis and is used for various NLP tasks, including text classification, sentiment analysis, and information retrieval. It's important to note that BoW does not capture the semantic meaning of words or the word order in the text, which can limit its effectiveness for some NLP tasks.
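
The four steps above (tokenization, vocabulary creation, counting, and building the matrix) can be sketched in plain Python for the two example reviews. This is a minimal illustration, not how scikit-learn implements it; the helper name `tokenize` and the lowercasing step are assumptions made for the example.

```python
import re

reviews = {
    "Review_1": "The movie was fantastic.",
    "Review_2": "I thought the movie was terrible.",
}

def tokenize(text):
    # Lowercase the text and keep runs of word characters as tokens.
    return re.findall(r"\w+", text.lower())

# Vocabulary creation: give each unique word an index in first-seen order.
vocabulary = {}
for text in reviews.values():
    for token in tokenize(text):
        vocabulary.setdefault(token, len(vocabulary))

# Counting: one row per review, one column per vocabulary word.
bow = {}
for name, text in reviews.items():
    row = [0] * len(vocabulary)
    for token in tokenize(text):
        row[vocabulary[token]] += 1
    bow[name] = row

print(list(vocabulary))
# → ['the', 'movie', 'was', 'fantastic', 'i', 'thought', 'terrible']
for name, row in bow.items():
    print(name, row)
# → Review_1 [1, 1, 1, 1, 0, 0, 0]
# → Review_2 [1, 1, 1, 0, 1, 1, 1]
```

scikit-learn's `CountVectorizer` follows the same idea, but with its default settings it also lowercases the text, ignores single-character tokens such as "I", and orders the columns alphabetically, so its matrix for these two reviews would have six columns rather than seven.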

Using sklearn's pipeline module, we create a classification pipeline that labels each movie review as positive or negative.

We'll use the following classification algorithms in the pipeline:
* k-NN
* Naive Bayes
* Logistic Regression
* Random Forest

**1. k-NN**

In [128]:
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier

KNN = Pipeline([
    ('Bag of words vectorizer', CountVectorizer()),
    ('KNN', KNeighborsClassifier())
])

In [129]:
KNN.fit(X_train, y_train)

In [130]:
y_pred = KNN.predict(X_test)
y_pred

array([1, 1, 0, ..., 0, 1, 0])

In [131]:
#Metrics of k-NN model.
from sklearn.metrics import accuracy_score, classification_report

print(classification_report(y_test, y_pred))
print("\nAccuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.66      0.58      0.62      5000
           1       0.63      0.70      0.66      5000

    accuracy                           0.64     10000
   macro avg       0.65      0.64      0.64     10000
weighted avg       0.65      0.64      0.64     10000


Accuracy Score - 64.35%


*Accuracy is quite low. Let's compare with other models to find out why.*

**2. Naive Bayes - Multinomial NB**

In [132]:
from sklearn.naive_bayes import MultinomialNB

MNB = Pipeline([
        ("bag_of_words_vectorizer", CountVectorizer()),
        ("Multinomial NB", MultinomialNB())
])

In [133]:
MNB.fit(X_train, y_train)

In [134]:
y_pred = MNB.predict(X_test)

In [135]:
#Metrics of MultinomialNB model.
from sklearn.metrics import accuracy_score, classification_report

print(classification_report(y_test, y_pred))
print("\nAccuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.82      0.88      0.85      5000
           1       0.87      0.81      0.84      5000

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000


Accuracy Score - 84.62%


*Accuracy, precision and recall scores are all good for the Naive Bayes model.*
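
To make the idea behind Multinomial Naive Bayes more concrete, here is a from-scratch sketch on a tiny made-up corpus. It estimates per-class word probabilities from word counts with add-one (Laplace) smoothing and picks the class with the highest log score. The function names, the toy reviews, and the whitespace tokenizer are all assumptions for illustration; this is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # class -> word -> count
    vocabulary = set()
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    log_prior = {c: math.log(n / len(labels)) for c, n in class_doc_counts.items()}
    log_likelihood = {}
    for c in class_doc_counts:
        total_words = sum(word_counts[c].values())
        denom = total_words + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for token in doc.lower().split():
            # Words never seen during training are skipped.
            if token in log_likelihood[c]:
                score += log_likelihood[c][token]
        scores[c] = score
    return max(scores, key=scores.get)

docs = ["great wonderful film", "wonderful acting", "boring terrible film", "terrible plot"]
labels = [1, 1, 0, 0]
log_prior, log_likelihood = train_multinomial_nb(docs, labels)

pred_pos = predict("wonderful film", log_prior, log_likelihood)
pred_neg = predict("terrible boring plot", log_prior, log_likelihood)
print(pred_pos, pred_neg)  # → 1 0
```

Training only requires counting words per class, which is one reason this family of models is fast to fit on bag-of-words features.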

**3. Logistic Regression**

In [136]:
from sklearn.linear_model import LogisticRegression

LGR = Pipeline([
    ("Bag_of_Words_Vectorizer", CountVectorizer()),
    ("LGR", LogisticRegression())
])

In [137]:
LGR.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
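
The message above is scikit-learn's convergence warning: the solver reached its iteration limit before converging. One remedy that the warning itself suggests is to raise `max_iter`. The sketch below shows the change on a tiny made-up corpus; the value 1000, the name `LGR_more_iter`, and the toy reviews are assumptions for illustration, and its effect on the accuracy for the full dataset has not been checked here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data standing in for X_train / y_train (hypothetical examples).
toy_reviews = [
    "great wonderful film",
    "wonderful acting great",
    "boring terrible film",
    "terrible plot boring",
]
toy_labels = [1, 1, 0, 0]

LGR_more_iter = Pipeline([
    ("Bag_of_Words_Vectorizer", CountVectorizer()),
    # max_iter defaults to 100; a larger limit gives the solver more room to converge.
    ("LGR", LogisticRegression(max_iter=1000)),
])
LGR_more_iter.fit(toy_reviews, toy_labels)

predictions = LGR_more_iter.predict(["great wonderful", "boring terrible"])
print(predictions)
```

On the real data, the same pipeline would be fitted with `X_train` and `y_train` in place of the toy lists.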


In [138]:
y_pred = LGR.predict(X_test)

In [139]:
#Metrics of Logistic Regression model.
from sklearn.metrics import accuracy_score, classification_report

print(classification_report(y_test, y_pred))
print("\nAccuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.89      0.89      0.89      5000
           1       0.89      0.89      0.89      5000

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000


Accuracy Score - 88.87%


**4. Random Forest**

In [140]:
from sklearn.ensemble import RandomForestClassifier

RF = Pipeline([
    ("BOW_vectorizer", CountVectorizer()),
    ("RandomForest", RandomForestClassifier())
])

In [141]:
RF.fit(X_train, y_train)

In [142]:
y_pred = RF.predict(X_test)

In [143]:
#Metrics of Random Forest Classifier model.
from sklearn.metrics import accuracy_score, classification_report

print(classification_report(y_test, y_pred))
print("\nAccuracy Score - %.2f%%"%(accuracy_score(y_test, y_pred)*100))

              precision    recall  f1-score   support

           0       0.85      0.85      0.85      5000
           1       0.85      0.85      0.85      5000

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000


Accuracy Score - 85.17%


**Conclusion**

*All models performed well except k-NN, which lagged far behind the others.*

* *Machine learning algorithms don't work directly on textual data, so we convert the text into high-dimensional numeric vectors using the BoW technique.*

* *Models like k-Nearest Neighbours (k-NN) tend to struggle with high-dimensional data. With a very large number of dimensions, distances between points become less informative, and computing them becomes expensive. This is the likely reason the k-NN model performed poorly on our data.*

* *Multinomial Naive Bayes is well suited to text classification because the per-class word probabilities are easy to compute from the bag-of-words counts and can be stored in a simple contingency table.*

* *Random Forest uses bootstrapping (row and column sampling) across many decision trees, which reduces the high variance and overfitting that a single tree suffers on high-dimensional data. It can also rely on the most informative words to separate the two classes.*

* *From our experiments, we can conclude that Logistic Regression is the winner for this dataset. It achieved the highest accuracy (88.87%), and its precision and recall scores are also higher than those of the other models. Logistic regression is a natural fit for binary classification, and our task is binary, which may be one reason it performed well.*

* *Machine learning is like a trial-and-error scientific method: we try the candidate algorithms available to us and select the one that gives good results and satisfies requirements such as latency and interpretability.*
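
The point about k-NN and high-dimensional data can be illustrated with a small experiment. The sketch below draws uniformly random points and measures how much farther the farthest point is than the nearest one, relative to the nearest distance. Uniform random points are a simplifying assumption (bag-of-words vectors are sparse counts, not uniform), and the function name `relative_contrast` is hypothetical, so this only shows the general tendency rather than what happened on the IMDB data.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)

low = relative_contrast(dim=2)
high = relative_contrast(dim=1000)
print(f"relative contrast in 2 dimensions:    {low:.2f}")
print(f"relative contrast in 1000 dimensions: {high:.2f}")
```

In the high-dimensional case the nearest and farthest points end up at nearly the same distance, which makes "nearest neighbours" a much weaker signal for a distance-based classifier.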