# Sentiment Analysis: Movie Review Classification
Sentiment analysis is a powerful tool that allows computers to understand the underlying subjective tone of a piece of writing. 

Sentiment analysis is used to analyze raw text and derive objective, quantitative results using natural language processing, machine learning, and other data analytics techniques. It is used to detect positive or negative sentiment in text, and businesses often use it to gauge brand reputation among their customers.

There are several types of sentiment analysis, with models that focus on polarity, feelings and emotions, urgency, or even intentions. The most popular types of sentiment analysis are:

1. Fine-grained sentiment analysis
2. Emotion detection
3. Aspect-based sentiment analysis
4. Multilingual sentiment analysis

Sentiment analysis is critical because it helps businesses understand the emotions and sentiments of their customers.


# Benefits of sentiment analysis
1. **Sorting Data at Scale**: With sentiment analysis, companies don't have to manually sort through thousands of tweets, surveys, and customer support conversations. Sentiment analysis helps businesses process vast amounts of data efficiently.
2. **Real-Time Analysis**: It helps to identify critical issues in real time. For example, is a crisis on social media escalating? Is an angry customer about to churn? With sentiment analysis models, businesses can identify customer pain points immediately and take action right away.
3. **Consistent criteria**: Tagging text by sentiment is highly subjective, influenced by personal experiences, thoughts, and beliefs. A centralized sentiment analysis system applies the same criteria to all of the data, which can improve accuracy and deliver better insights.

# Project Description

IMDB is an entertainment review website where people leave their **opinions on various movies and TV series**. We can perform sentiment analysis on the reviews to find whether the viewers liked/disliked the show.

A movie review, in any language, generally contains some common words (articles, prepositions, pronouns, conjunctions, etc.). **These frequently repeated words are called stopwords, and they do not add much information to the text. NLP libraries like spaCy, as well as vectorizers that accept a stop-word list, can remove stopwords** from the reviews during text processing. This reduces the size of the data and can improve the performance of our binary classification models, because the data then contains mostly meaningful words.

**These results are useful for production companies to understand why their title succeeded or failed.**


In [1]:
import numpy as np
import pandas as pd

# Dataset Description
In this project, we’ll use an IMDB dataset of 50k movie reviews available on Kaggle. The dataset contains 2 columns (review and sentiment) that will help us identify whether a review is positive or negative.

Dataset link: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Our goal is to find which machine learning model is best suited to predict sentiment (output) given a movie review (input).

In [2]:
file = '/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv'
df_review = pd.read_csv(file)
print(df_review)

                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


# Creating two separate datasets for positive and negative reviews

The IMDB dataset contains 50,000 rows: 25,000 rows contain positive reviews and 25,000 rows contain negative reviews.

We create separate datasets for positive and negative reviews so that we can later draw a smaller sample containing an equal number of each type of review, which will help us train our models faster.

In [3]:
df_positive = df_review[df_review['sentiment']=='positive']
print(df_positive.head())

df_negative = df_review[df_review['sentiment']=='negative']
print(df_negative.head())

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
4  Petter Mattei's "Love in the Time of Money" is...  positive
5  Probably my all-time favorite movie, a story o...  positive
                                               review sentiment
3   Basically there's a family where a little boy ...  negative
7   This show was an amazing, fresh & innovative i...  negative
8   Encouraged by the positive comments about this...  negative
10  Phil the Alien is one of those quirky films wh...  negative
11  I saw this movie when I was about 12 when it c...  negative


# Random sampling to generate smaller dataset

This dataset contains 50,000 rows; however, to train our models faster in the following steps, we’re going to take a smaller sample of 20,000 rows.

Our new dataset will contain 10,000 positive reviews and 10,000 negative reviews.

***dataset_name.sample(n = no_of_rows)***

This will help us to generate 10,000 random rows from each dataset we created above.

In [4]:
pos_review = df_positive.sample(n = 10000)
neg_review = df_negative.sample(n = 10000)

df_review_bal = pd.concat([pos_review, neg_review])
print(df_review_bal)

                                                  review sentiment
44636  I just watched Holly along with another movie ...  positive
29995  New York, I Love You finally makes it to our s...  positive
6528   This film is a knockout, Fires on the plain re...  positive
39038  If you enjoy romantic comedies then you will f...  positive
65     DON'T TORTURE A DUCKLING is one of Fulci's ear...  positive
...                                                  ...       ...
39805  Am I wrong,or is the 2007 version just a rip-o...  negative
48295  What I found so curious about this film--I saw...  negative
5363   It is apparent that director, writers and ever...  negative
47090  I have seen three other movies that are worse ...  negative
31545  I saw this film opening weekend in Australia, ...  negative

[20000 rows x 2 columns]


# Splitting dataset into training and test set

Before we work with our data, we need to split it into a train and test set. The train dataset will be used to fit the model, while the test dataset will be used to provide an unbiased evaluation of a final model fit on the training dataset.

We’ll use ***sklearn’s train_test_split*** to do the job. In this case, we allocate 33% of the data to the test set.

In [5]:
#splitting dataset into training and test set
from sklearn.model_selection import train_test_split
train, test = train_test_split(df_review_bal, test_size=0.33, random_state = 1)

train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']



# Natural language processing pipeline:
1. Tokenizing text to break it down into sentences, words, or other units.
2. Removing stop words like “if,” “but,” “or,” and so on.
3. Normalizing words by condensing all forms of a word into a single form.
4. Vectorizing text by turning the text into a numerical representation for consumption by your classifier.
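
The four steps above can be sketched in plain Python. This is only a minimal illustration, not what this notebook runs (the notebook delegates these steps to TfidfVectorizer). The tiny stop-word set, the crude plural-stripping `normalize` helper, and all function names are hypothetical choices made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "if", "but", "or", "and", "of", "was", "is"}

def tokenize(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def remove_stop_words(tokens):
    """Step 2: drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

def normalize(token):
    """Step 3: crude normalization that strips a trailing plural 's'.
    Real pipelines use a stemmer or lemmatizer instead."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def vectorize(tokens, vocabulary):
    """Step 4: count how often each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

review = "The movies were great, but the plot of the movie was thin."
tokens = [normalize(t) for t in remove_stop_words(tokenize(review))]
vocabulary = sorted(set(tokens))
vector = vectorize(tokens, vocabulary)

print(tokens)      # ['movie', 'were', 'great', 'plot', 'movie', 'thin']
print(vocabulary)  # ['great', 'movie', 'plot', 'thin', 'were']
print(vector)      # [1, 2, 1, 1, 1]
```

Note how "movies" and "movie" end up as the same token after normalization, so the count vector records two occurrences of one word instead of one occurrence each of two different words.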

# TF-IDF Vectorizer
Term frequency-inverse document frequency (TF-IDF) is a weighting scheme that **transforms text into a usable numeric vector.**

The term frequency (TF) is the number of occurrences of a specific term in a document.
The document frequency (DF) is the number of documents containing a specific term.

Inverse document frequency (IDF) is a weight assigned to each term. It reduces the weight of a term whose occurrences are spread across many of the documents.

When DF equals n, the total number of documents, the term appears in every document. Under the textbook definition, IDF = log(n / DF), the IDF of such a term is zero. The term provides little information, so when in doubt it can simply be added to the stopword list. (scikit-learn's TfidfVectorizer uses a smoothed variant of IDF by default, so there such a term gets the lowest possible weight rather than exactly zero.)

The TF-IDF score, as the name suggests, is the product of a term's frequency and its IDF.
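
As a worked illustration of these definitions, the sketch below computes textbook TF-IDF by hand on three made-up documents. It assumes the unsmoothed formula IDF = log(n / DF). scikit-learn's TfidfVectorizer smooths the IDF and L2-normalizes each row by default, so its numbers will differ from the ones printed here. The documents and function names are hypothetical.

```python
import math

# Three made-up, already tokenized documents (illustrative only).
docs = [
    ["great", "movie", "great", "cast"],
    ["boring", "movie"],
    ["great", "plot", "boring", "movie"],
]
n = len(docs)

def term_frequency(term, doc):
    """Number of times `term` occurs in `doc`."""
    return doc.count(term)

def document_frequency(term):
    """Number of documents that contain `term`."""
    return sum(1 for doc in docs if term in doc)

def idf(term):
    """Textbook inverse document frequency: log(n / DF)."""
    return math.log(n / document_frequency(term))

def tf_idf(term, doc):
    """TF-IDF score: term frequency multiplied by IDF."""
    return term_frequency(term, doc) * idf(term)

# "movie" appears in every document, so its IDF is zero.
print(idf("movie"))                        # 0.0
# "great" occurs twice in the first document and in 2 of the 3 documents.
print(round(tf_idf("great", docs[0]), 3))  # 0.811
# "cast" occurs once and only in 1 of the 3 documents.
print(round(tf_idf("cast", docs[0]), 3))   # 1.099
```

The rare word "cast" scores higher than the more frequent word "great" because it appears in fewer documents, which is exactly the effect the IDF weight is meant to produce.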



In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
train_x_vector = tfidf.fit_transform(train_x)
#print(train_x_vector)

pd.DataFrame.sparse.from_spmatrix(train_x_vector,
                                  index=train_x.index,
                                  columns=tfidf.get_feature_names_out())

test_x_vector = tfidf.transform(test_x)
#print(test_x_vector)

# Training and Classification using Support Vector Machine (SVM)

The goal of the SVM algorithm is to find the best decision boundary that separates an n-dimensional feature space into classes, so that new data points can easily be assigned to the correct category in the future. This best decision boundary is called a hyperplane.
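
To make the idea of a hyperplane concrete: for a two-class linear model, a prediction comes down to checking which side of the hyperplane w · x + b = 0 a point falls on. The sketch below assumes two features and uses hypothetical weights chosen by hand. They are not the weights learned by the SVC in this notebook.

```python
# Hypothetical weights and bias for a 2-feature linear classifier
# (chosen by hand for illustration, not learned from the IMDB data).
weights = [1.5, -2.0]
bias = 0.25

def decision_value(x):
    """Signed score w . x + b; its sign tells which side of the hyperplane x is on."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def classify(x):
    """Assign a class according to the side of the hyperplane."""
    return "positive" if decision_value(x) >= 0 else "negative"

print(classify([1.0, 0.2]))  # score is about 1.35, so: positive
print(classify([0.1, 0.9]))  # score is about -1.4, so: negative
```

Training an SVM amounts to choosing the weights and bias so that this boundary separates the training classes with the widest possible margin.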

In [7]:
from sklearn.svm import SVC
svc = SVC(kernel= 'linear')
svc.fit(train_x_vector, train_y)

print(svc.predict(tfidf.transform(['A good movie'])))
print(svc.predict(tfidf.transform(['An excellent movie'])))
print(svc.predict(tfidf.transform(['I did not like this movie at all'])))

['positive']
['positive']
['negative']


# Training and Classification using Decision Tree Classifier
The intuition behind Decision Trees is that you use the dataset features to create yes/no questions and continually split the dataset until you isolate all data points belonging to each class.

With this process you’re organizing the data in a tree structure.

Every time you ask a question you’re adding a node to the tree. And the first node is called the root node.

The result of asking a question splits the dataset based on the value of a feature, and creates new nodes.

If you decide to stop the process after a split, the last nodes created are called leaf nodes.
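
One common way to decide which yes/no question to ask at a node is to pick the question whose child nodes are the purest. The sketch below scores candidate questions with Gini impurity, which is the default criterion of scikit-learn's DecisionTreeClassifier. The toy reviews and helper names are hypothetical, and this is only an illustration of a single split, not a full tree-building algorithm.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions (0 means pure)."""
    total = len(labels)
    if total == 0:
        return 0.0
    proportions = [labels.count(c) / total for c in set(labels)]
    return 1.0 - sum(p * p for p in proportions)

def split_impurity(rows, question):
    """Weighted impurity of the two child nodes produced by a yes/no question."""
    yes = [label for features, label in rows if question(features)]
    no = [label for features, label in rows if not question(features)]
    total = len(rows)
    return len(yes) / total * gini(yes) + len(no) / total * gini(no)

# Made-up toy data: (set of words in the review, label).
rows = [
    ({"great", "fun"}, "positive"),
    ({"great", "cast"}, "positive"),
    ({"boring", "plot"}, "negative"),
    ({"boring", "cast"}, "negative"),
]

print(gini([label for _, label in rows]))                    # 0.5 before any split
print(split_impurity(rows, lambda words: "great" in words))  # 0.0, a perfect split
print(split_impurity(rows, lambda words: "cast" in words))   # 0.5, no improvement
```

Asking "does the review contain 'great'?" isolates each class completely, so it would be chosen as the root node for this toy data.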

In [8]:
from sklearn.tree import DecisionTreeClassifier
dec_tree = DecisionTreeClassifier()
dec_tree.fit(train_x_vector, train_y)

print(dec_tree.predict(tfidf.transform(['A good movie'])))
print(dec_tree.predict(tfidf.transform(['An excellent movie'])))
print(dec_tree.predict(tfidf.transform(['I did not like this movie at all'])))

['positive']
['positive']
['positive']


# Training and Classification using Gaussian Naive Bayes Classifier

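Naïve Bayes is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
It is widely used in text classification, where the training data is high-dimensional.
It is a probabilistic classifier, which means it predicts the class that is most probable given the features of an input.
GaussianNB, the variant used here, models each feature with a Gaussian distribution and does not accept sparse input, which is why the TF-IDF matrix is converted to a dense array with toarray() below.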
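
The probabilistic reasoning can be sketched with a few lines of arithmetic. The example below applies Bayes' theorem with the "naive" assumption that words are independent given the class. It uses simple per-word probabilities rather than the Gaussian feature model of GaussianNB, and every number in it is made up for illustration.

```python
# Made-up probabilities for illustration; not estimated from the IMDB data.
prior = {"positive": 0.5, "negative": 0.5}
likelihood = {
    "positive": {"great": 0.30, "boring": 0.05},
    "negative": {"great": 0.10, "boring": 0.25},
}

def posterior(words):
    """Bayes' theorem, assuming words are independent given the class."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word in words:
            score *= likelihood[label][word]
        scores[label] = score
    # Normalize so the class probabilities sum to 1.
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

probs = posterior(["great"])
print({label: round(p, 2) for label, p in probs.items()})
print(max(probs, key=probs.get))  # positive
```

For a review containing only the word "great", the positive class ends up with a probability of about 0.75, so it is the predicted label.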

In [9]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(train_x_vector.toarray(), train_y)

print(gnb.predict(tfidf.transform(['A good movie']).toarray()))
print(gnb.predict(tfidf.transform(['An excellent movie']).toarray()))
print(gnb.predict(tfidf.transform(['I did not like this movie at all']).toarray()))

['negative']
['negative']
['negative']


# Training and Classification using Logistic Regression
Logistic regression predicts the value of a categorical dependent variable, so the outcome must be a categorical or discrete value such as Yes or No, 0 or 1, or True or False. Instead of returning an exact value of 0 or 1, however, the model outputs a probability that lies between 0 and 1.

In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function whose output is bounded between 0 and 1.
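
The "S" shaped curve is the logistic (sigmoid) function. The sketch below shows how it squashes a linear score z into a probability and how a cutoff turns that probability into a label. The 0.5 threshold and the function names are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Turn the probability into a class label using a cutoff (assumed to be 0.5)."""
    return "positive" if sigmoid(z) >= threshold else "negative"

for z in (-4.0, 0.0, 4.0):
    print(z, round(sigmoid(z), 3), predict_label(z))
# -4.0 0.018 negative
# 0.0 0.5 positive
# 4.0 0.982 positive
```

Large negative scores map close to 0, large positive scores map close to 1, and a score of 0 sits exactly at 0.5.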


In [10]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(train_x_vector, train_y)

print(log_reg.predict(tfidf.transform(['A good movie'])))
print(log_reg.predict(tfidf.transform(['An excellent movie'])))
print(log_reg.predict(tfidf.transform(['I did not like this movie at all'])))

['positive']
['positive']
['negative']


# Comparing models' performance
Each scikit-learn classifier has a score method that returns the mean accuracy on the given test data and labels. We use it to compare our models and choose the algorithm with the highest score to predict our output.

In [11]:
print(svc.score(test_x_vector, test_y))
print(dec_tree.score(test_x_vector, test_y))
print(gnb.score(test_x_vector.toarray(), test_y))
print(log_reg.score(test_x_vector, test_y))

0.8834848484848485
0.7215151515151516
0.65
0.8784848484848485


# F1 score and confusion matrix for our highest performance model
The F1 score is a machine learning evaluation metric that summarizes a model's predictive performance. It combines the precision and recall scores of a model.

Precision: Of everything that was predicted as positive, the percentage that is actually positive.

Recall: Of everything that actually is positive, the percentage that the model succeeded in finding.

The F1 score is defined as the harmonic mean of precision and recall.
It is often preferred over plain accuracy on imbalanced data, because it does not reward a model for simply predicting the majority class.
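
These definitions can be checked by hand. The sketch below plugs in the four counts from the SVM confusion matrix printed at the end of this notebook, treating "positive" as the class of interest. In scikit-learn's output the rows are the actual classes and the columns are the predicted classes. The variable names are chosen for this illustration.

```python
# Counts for the "positive" class, read from the SVM confusion matrix
# printed at the end of this notebook (rows = actual, columns = predicted).
tp, fn = 2983, 346   # actual positive: predicted positive / predicted negative
fp, tn = 423, 2848   # actual negative: predicted positive / predicted negative

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(round(precision, 2), round(recall, 2))  # 0.88 0.9
print(round(f1, 4))                           # 0.8858
print(round(accuracy, 4))                     # 0.8835
```

The results agree with the "positive" row of the classification report, the first value returned by f1_score, and the SVM accuracy reported in the model comparison.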

In [12]:
from sklearn.metrics import f1_score
f1_score(test_y, svc.predict(test_x_vector),
         labels=['positive', 'negative'],
         average=None)

array([0.88582034, 0.88105182])

In [13]:
from sklearn.metrics import classification_report
print(classification_report(test_y, 
                            svc.predict(test_x_vector),
                            labels=['positive', 'negative']))

              precision    recall  f1-score   support

    positive       0.88      0.90      0.89      3329
    negative       0.89      0.87      0.88      3271

    accuracy                           0.88      6600
   macro avg       0.88      0.88      0.88      6600
weighted avg       0.88      0.88      0.88      6600



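**A confusion matrix is a table that allows visualization of the performance of an algorithm. For a two-class problem, this table has two rows and two columns that report the number of true positives, false negatives, false positives, and true negatives.**

With labels=['positive', 'negative'], scikit-learn puts the actual classes in rows and the predicted classes in columns, so the array represents:

TP, FN

FP, TN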

In [14]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(test_y, 
                            svc.predict(test_x_vector), 
                            labels=['positive', 'negative']))

[[2983  346]
 [ 423 2848]]
