# Rating Movie Reviews as Positive or Negative: Text Classification & Sentiment Analysis Project
by Deepikaa Sriram

For this project I will be using the Cornell University Movie Review polarity dataset v2.0 obtained from http://www.cs.cornell.edu/people/pabo/movie-review-data/

In this exercise I will develop a classification model to predict the Positive/Negative labels based on text content alone. Then, I will apply sentiment analysis with VADER to determine whether it improves on the accuracy of the model.

## Perform imports and load the dataset
The dataset contains the text of 2000 movie reviews. 1000 are positive, 1000 are negative, and the text has been preprocessed as a tab-delimited file.

In [2]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\t')
df.head()

| | label | review |
|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


In [4]:
len(df)

2000

## Check for missing values:
NaN values, or missing values in the dataset, may occur when a record was submitted without any review text.

In [5]:
df.isnull().sum()

label      0
review    35
dtype: int64

35 records have **NaN** in the `review` column (NaN stands for "not a number" and is how pandas marks missing values, similar to *None*). These can be removed with the pandas `.dropna()` method.

In [6]:
df.dropna(inplace=True)

len(df)

1965
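As a minimal sketch of what `.isnull().sum()` and `.dropna()` do, the example below uses a small made-up DataFrame (the rows are hypothetical, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with one missing review.
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos'],
    'review': ['great film', np.nan, 'loved it'],
})

missing_per_column = toy.isnull().sum()  # count of NaN values per column
cleaned = toy.dropna()                   # new frame without the NaN row; toy is unchanged

print(missing_per_column['review'])
print(len(toy), len(cleaned))
print(list(cleaned.index))
```

The surviving rows keep their original index labels, which is why a later `df.drop(blanks)` call can still refer to rows by their original index.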

### Detect & remove empty strings
pandas reads truly empty fields as NaN, but some reviews are "whitespace only" strings, which `.dropna()` does not catch. To detect them, the loop below iterates over each row in the DataFrame. The **.itertuples()** pandas method provides access to every field; the names `i`, `lb` and `rv` are assigned to the `index`, `label` and `review` values of each row.

In [7]:
blanks = [] 

for i,lb,rv in df.itertuples():               # iterate over the DataFrame
    if isinstance(rv, str) and rv.isspace():  # skip NaN values; test 'review' for whitespace
        blanks.append(i)                      # add matching index numbers to the list
        
print(len(blanks), 'blanks: ', blanks)

27 blanks:  [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]


In [7]:
df.drop(blanks, inplace=True)

len(df)

1938

The 27 whitespace-only records have now been dropped as well, leaving 1938 of the original 2000 reviews.
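The same whitespace check can also be written without an explicit loop. The sketch below is an alternative to the loop used in this notebook, not the approach it actually takes. It uses a hypothetical toy DataFrame and assumes NaN rows have already been dropped, since `.str.isspace()` returns NaN for missing values.

```python
import pandas as pd

# Toy frame (hypothetical data) with two whitespace-only reviews.
toy = pd.DataFrame({
    'label': ['pos', 'neg', 'pos', 'neg'],
    'review': ['great film', '   ', 'loved it', '\t\n'],
})

# True only for non-empty strings made up entirely of whitespace.
is_blank = toy['review'].str.isspace()

blank_index = toy.index[is_blank].tolist()  # index labels of the blank rows
cleaned = toy[~is_blank]                    # keep everything else

print(blank_index)
print(len(cleaned))
```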

## Take a quick look at the `label` column:

In [8]:
df['label'].value_counts()

neg    983
pos    982
Name: label, dtype: int64

## Split the data into train & test sets:

In [9]:
from sklearn.model_selection import train_test_split

X = df['review']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
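To see how many rows `test_size=0.33` sets aside, the sketch below splits a plain list of 1938 integers (a stand-in for the cleaned DataFrame's rows) with the same parameters. With a fractional `test_size`, scikit-learn rounds the test set size up.

```python
from sklearn.model_selection import train_test_split

n_rows = 1938                   # rows left after removing NaN and whitespace-only reviews
indices = list(range(n_rows))   # stand-in for the real rows

train_idx, test_idx = train_test_split(indices, test_size=0.33, random_state=42)

# 0.33 * 1938 = 639.54, which is rounded up for the test set.
print(len(train_idx), len(test_idx))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.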

## Build pipelines to vectorize the data, then train and fit a model

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Naïve Bayes:
text_clf_nb = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', MultinomialNB()),
])

# Linear SVC:
text_clf_lsvc = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])
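The sketch below illustrates what the two pipeline steps do, using four made-up sentences rather than the real reviews. The vectorizer turns each document into a row of TF-IDF weights, and the pipeline applies that same transformation automatically before handing new text to the classifier.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy training data, for illustration only.
train_texts = ['wonderful moving film',
               'great script great acting',
               'dull boring plot',
               'terrible waste of time']
train_labels = ['pos', 'pos', 'neg', 'neg']

# Step 1 on its own: one row per document, one column per vocabulary term.
matrix = TfidfVectorizer().fit_transform(train_texts)
print(matrix.shape)

# Steps 1 and 2 together: raw strings go in, labels come out.
toy_clf = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC())])
toy_clf.fit(train_texts, train_labels)
toy_predictions = toy_clf.predict(['great film', 'boring waste'])
print(toy_predictions)
```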

## Feed the training data through the first pipeline

In [13]:
text_clf_nb.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', MultinomialNB())])

## Naive Bayes Model
I chose to run the predictions through a Naive Bayes model first. 

In [14]:
predictions = text_clf_nb.predict(X_test)

**Confusion Matrix**

In [15]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[303  19]
 [114 213]]


**Print a Classification Report**

In [16]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.73      0.94      0.82       322
         pos       0.92      0.65      0.76       327

    accuracy                           0.80       649
   macro avg       0.82      0.80      0.79       649
weighted avg       0.82      0.80      0.79       649



**Overall Accuracy**

In [15]:
print(metrics.accuracy_score(y_test,predictions))

0.7640625


The Naïve Bayes Model identified reviews as positive/negative based on text alone with 76.4% accuracy.

## Linear SVC Model

In [17]:
text_clf_lsvc.fit(X_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [19]:
predictions = text_clf_lsvc.predict(X_test)

**Confusion Matrix**

In [18]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[259  49]
 [ 49 283]]


**Print a Classification Report**

In [20]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.83      0.87      0.85       322
         pos       0.87      0.83      0.85       327

    accuracy                           0.85       649
   macro avg       0.85      0.85      0.85       649
weighted avg       0.85      0.85      0.85       649



**Print the Overall Accuracy**

In [21]:
print(metrics.accuracy_score(y_test,predictions))

0.8505392912172574


The Linear SVC Model identified reviews as positive/negative based on text alone with **85.05% accuracy**. 

## Adding Stopwords to TfidfVectorizer
By default, **CountVectorizer** and **TfidfVectorizer** do *not* filter stopwords.

Scikit-learn's built-in English list contains 318 stopwords. However, some of those words may carry meaning that matters when classifying movie reviews. With this in mind, I have culled the list down to just 60 words:

In [22]:
stopwords = ['a', 'about', 'an', 'and', 'are', 'as', 'at', 'be', 'been', 'but', 'by', 'can',
             'even', 'ever', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'hers', 'his',
             'how', 'i', 'if', 'in', 'into', 'is', 'it', 'its', 'just', 'me', 'my', 'of', 'on', 'or',
             'see', 'seen', 'she', 'so', 'than', 'that', 'the', 'their', 'there', 'they', 'this',
             'to', 'was', 'we', 'were', 'what', 'when', 'which', 'who', 'will', 'with', 'you']
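As a small illustration of why a trimmed list can be preferable, the sketch below compares the vocabulary learned with no stop words, with a short custom list, and with scikit-learn's built-in English list. The two sentences are made up, and the two-word custom list is a stand-in for the 60-word list above.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

docs = ['the plot was not good', 'the acting was good']  # hypothetical examples

default_vocab = set(TfidfVectorizer().fit(docs).vocabulary_)
custom_vocab = set(TfidfVectorizer(stop_words=['the', 'was']).fit(docs).vocabulary_)
english_vocab = set(TfidfVectorizer(stop_words='english').fit(docs).vocabulary_)

print(len(ENGLISH_STOP_WORDS))       # size of the built-in list
print('not' in ENGLISH_STOP_WORDS)   # the built-in list includes the negation 'not'
print(sorted(default_vocab))
print(sorted(custom_vocab))
print(sorted(english_vocab))
```

With the built-in list, "not good" and "good" reduce to the same token, whereas the custom list keeps the negation in the vocabulary.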

Now let's repeat the process above and see if the removal of stopwords improves or impairs our score.

In [23]:
import numpy as np
import pandas as pd

df = pd.read_csv('../TextFiles/moviereviews.tsv', sep='\t')
df.dropna(inplace=True)
blanks = []
for i,lb,rv in df.itertuples():
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)
df.drop(blanks, inplace=True)
from sklearn.model_selection import train_test_split
X = df['review']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import metrics

**Add Stop Words to the Linear SVC Pipeline**

In [24]:
text_clf_lsvc2 = Pipeline([('tfidf', TfidfVectorizer(stop_words=stopwords)),
                     ('clf', LinearSVC()),
])
text_clf_lsvc2.fit(X_train, y_train)

Pipeline(steps=[('tfidf',
                 TfidfVectorizer(stop_words=['a', 'about', 'an', 'and', 'are',
                                             'as', 'at', 'be', 'been', 'but',
                                             'by', 'can', 'even', 'ever', 'for',
                                             'from', 'get', 'had', 'has',
                                             'have', 'he', 'her', 'hers', 'his',
                                             'how', 'i', 'if', 'in', 'into',
                                             'is', ...])),
                ('clf', LinearSVC())])

In [25]:
predictions = text_clf_lsvc2.predict(X_test)
print(metrics.confusion_matrix(y_test,predictions))

[[256  52]
 [ 48 284]]


In [26]:
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         neg       0.84      0.83      0.84       308
         pos       0.85      0.86      0.85       332

    accuracy                           0.84       640
   macro avg       0.84      0.84      0.84       640
weighted avg       0.84      0.84      0.84       640



In [27]:
print(metrics.accuracy_score(y_test,predictions))

0.84375


Filtering out the stop words decreased the accuracy of the model slightly, to **84.4%**. With only about 2000 movie reviews in the dataset, filtering out stop words offers no benefit for this model. On a much larger corpus, however, removing stopwords shrinks the vocabulary and can reduce overall processing time.

## Test the model with new data
To test my model, I have copied in two reviews of the recent movie "Murder Mystery 2", one negative and one positive.

### Next, feed new data to the model's `predict()` method

Review 1 (Negative): "This is yet another instance of Adam Sandler aiming for low-hanging fruit, and Netflix seems quite happy indirectly funding these. With Hustle and Uncut Gems, we know there's still a side to Sandler that craves a well-written, layered character. But lightweight action comedies like these only reinstate a couple of things: a) The banality of getting a pretty worthwhile cast together in exotic places to do silly things, and b) Sandler and Aniston can handle such roles without breaking a sweat. As such, Murder Mystery ends up being a film you can play on any randomly exhausting weeknight, offering a few harmless laughs here and there. There's also the undeniable fun of seeing Sandler and Aniston in Indian attires, dancing away to Bollywood numbers."

In [41]:
myreview = "This is yet another instance of Adam Sandler aiming for low-hanging fruit, and Netflix seems quite happy indirectly funding these. With Hustle and Uncut Gems, we know there's still a side to Sandler that craves a well-written, layered character. But lightweight action comedies like these only reinstate a couple of things: a) The banality of getting a pretty worthwhile cast together in exotic places to do silly things, and b) Sandler and Aniston can handle such roles without breaking a sweat. As such, Murder Mystery ends up being a film you can play on any randomly exhausting weeknight, offering a few harmless laughs here and there. There's also the undeniable fun of seeing Sandler and Aniston in Indian attires, dancing away to Bollywood numbers."

In [30]:
print(text_clf_lsvc.predict([myreview]))

['neg']


Review 2 (Positive): "Loved it and it's truest a wonderful piece. The movie is totally worth your time. Jennifer Aniston and Adam Sandler have come together in this very good-looking, intriguing and hilarious ride through France and Italy and they're both SO GOOD, especially Aniston who makes a meal of her part and owns every scene she's in. Audrey Spitz on her own is a very forgettable character but it's a testament to Aniston's talent and charm what she brings to the fore. The murder mystery part is more interesting than you'd expect, and a few twists are actually good. Apart from that and the beautiful locations, however, there's nothing too special about the movie. The conflict between our couple is half-baked and insignificant, though the chemistry is real. The plot too is laden with cliches. If it didn't have the star power it does, I would have given it a thumbs down. Anyhow, the best scene for me in the movie is the car chase sequence which sums up what the movie is: sexy, exciting and illogical."

In [40]:
myreview2= "Loved it and it's truest a wonderful piece. The movie is totally worth your time. Jennifer Aniston and Adam Sandler have come together in this very good-looking, intriguing and hilarious ride through France and Italy and they're both SO GOOD, especially Aniston who makes a meal of her part and owns every scene she's in. Audrey Spitz on her own is a very forgettable character but it's a testament to Aniston's talent and charm what she brings to the fore. The murder mystery part is more interesting than you'd expect, and a few twists are actually good. Apart from that and the beautiful locations, however, there's nothing too special about the movie. The conflict between our couple is half-baked and insignificant, though the chemistry is real. The plot too is laden with cliches. If it didn't have the star power it does, I would have given it a thumbs down. Anyhow, the best scene for me in the movie is the car chase sequence which sums up what the movie is: sexy, exciting and illogical."

In [38]:
print(text_clf_lsvc.predict([myreview2]))

['pos']


### The model correctly predicted whether each review was positive or negative based on its content alone.

# Sentiment Analysis

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool that assesses the sentiment of text by assigning polarity scores based on a pre-built lexicon of words and phrases, taking into account intensity modifiers and punctuation.

In [48]:
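import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# nltk.download('vader_lexicon')  # uncomment on first run if the VADER lexicon is not installed

sid = SentimentIntensityAnalyzer()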

### Append the comp_score to the dataset

In [49]:
df['scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))

df['compound']  = df['scores'].apply(lambda score_dict: score_dict['compound'])

df['comp_score'] = df['compound'].apply(lambda c: 'pos' if c >=0 else 'neg')

df.head()

| | label | review | scores | compound | comp_score |
|---|---|---|---|---|---|
| 0 | neg | how do films like mouse hunt get into theatres... | {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... | -0.9125 | neg |
| 1 | neg | some talented actresses are blessed with a dem... | {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... | -0.8618 | neg |
| 2 | pos | this has been an extraordinary year for austra... | {'neg': 0.068, 'neu': 0.781, 'pos': 0.15, 'com... | 0.9951 | pos |
| 3 | pos | according to hollywood movies made in last few... | {'neg': 0.071, 'neu': 0.782, 'pos': 0.147, 'co... | 0.9972 | pos |
| 4 | neg | my first press screening of 1998 and already i... | {'neg': 0.091, 'neu': 0.817, 'pos': 0.093, 'co... | -0.2484 | neg |
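The mapping from a compound score to a label is a simple threshold rule. The sketch below applies it to the five compound values shown above. The helper name `label_from_compound` and its `threshold` parameter are hypothetical and only mirror the lambda used in this notebook.

```python
def label_from_compound(compound, threshold=0.0):
    """Return 'pos' when the compound score reaches the threshold, else 'neg'."""
    return 'pos' if compound >= threshold else 'neg'

# Compound scores of the first five reviews, copied from the output above.
compounds = [-0.9125, -0.8618, 0.9951, 0.9972, -0.2484]
comp_labels = [label_from_compound(c) for c in compounds]

print(comp_labels)
print(label_from_compound(0.0))  # a perfectly neutral score falls on the 'pos' side
```

Because the comparison is `>=`, a review scored exactly 0 counts as positive.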


### Comparison analysis between the original `label` and `comp_score`

In [50]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

In [53]:
print(confusion_matrix(df['label'],df['comp_score']))

[[427 542]
 [164 805]]


In [52]:
print(classification_report(df['label'],df['comp_score']))

              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

    accuracy                           0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938



In [54]:
accuracy_score(df['label'],df['comp_score'])

0.6357069143446853
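The figures in the classification report can be re-derived from the confusion matrix by hand. The sketch below copies the matrix printed above and assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in the order `neg`, `pos`.

```python
import numpy as np

# Rows: true neg, true pos. Columns: predicted neg, predicted pos.
cm = np.array([[427, 542],
               [164, 805]])

correct = np.diag(cm)                  # correctly classified reviews per class
accuracy = correct.sum() / cm.sum()    # share of all reviews labelled correctly
recall = correct / cm.sum(axis=1)      # per true class: how many were found
precision = correct / cm.sum(axis=0)   # per predicted class: how many were right

print(round(accuracy, 4))
print(recall.round(2), precision.round(2))
```

The low recall for `neg` reflects the 542 negative reviews that VADER scored as positive.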

**The VADER Sentiment Analyzer shows poor prediction accuracy (63.6%) when assessing whether a movie review is negative or positive. Several factors may contribute:**
* First, the **complexity and subjectivity of language**. Sentiment analysis relies on understanding the nuances of sentiment expressed in text, which is challenging because language is rich in context, sarcasm, and figurative expressions.
* Additionally, **VADER is not trained on this dataset**. It scores text against a fixed, pre-built lexicon and rule set, so it cannot learn patterns specific to these reviews the way the supervised classifiers above do. The confusion matrix shows where this hurts most: 542 of the 969 negative reviews were scored as positive.
* Lastly, VADER may be unable to **capture domain-specific knowledge and cultural references within movie reviews**, which could contribute to inaccuracies.