# Sentiment Analysis Project
For this project, we'll perform the same type of NLTK VADER sentiment analysis, this time on our movie reviews dataset.

The 2,000-record IMDb movie review dataset is available directly through NLTK with
<pre>from nltk.corpus import movie_reviews</pre>
This notebook loads the reviews from a local tab-separated file instead.


## Import the tools and load the data


In [1]:
import numpy as np
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia=SentimentIntensityAnalyzer()



In [2]:
# load the data
df=pd.read_csv("../TextFiles/moviereviews.tsv",sep="\t")
df.head()

|   | label | review |
|---|-------|--------|
| 0 | neg | how do films like mouse hunt get into theatres... |
| 1 | neg | some talented actresses are blessed with a dem... |
| 2 | pos | this has been an extraordinary year for austra... |
| 3 | pos | according to hollywood movies made in last few... |
| 4 | neg | my first press screening of 1998 and already i... |


## Check for missing values in the data

In [3]:
df.isnull().sum()

label      0
review    35
dtype: int64

In [4]:
df.dropna(inplace=True)

## Check for blank reviews and drop any that are found

In [6]:
blanks = []
for i, lb, rv in df.itertuples():
    # Record the index of any review that contains only whitespace
    if isinstance(rv, str) and rv.isspace():
        blanks.append(i)

print(len(blanks), blanks)

27 [57, 71, 147, 151, 283, 307, 313, 323, 343, 351, 427, 501, 633, 675, 815, 851, 977, 1079, 1299, 1455, 1493, 1525, 1531, 1763, 1851, 1905, 1993]


In [7]:
df.drop(blanks,inplace=True)

In [8]:
len(df)

1938
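The count is consistent with the steps above: 2,000 − 35 missing − 27 blank = 1,938 reviews. As a rough sketch, both cleaning steps could also be expressed as one vectorized mask instead of an explicit loop. The `toy` frame below is a hypothetical stand-in for the real data, and note one assumption that differs from the loop: `str.strip().eq("")` also flags empty strings, which `isspace()` does not.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real reviews data.
toy = pd.DataFrame({
    "label": ["pos", "neg", "pos", "neg"],
    "review": ["a fine film", None, "   ", "dull and overlong"],
})

# Missing reviews (None/NaN).
is_missing = toy["review"].isna()

# Reviews that are empty once surrounding whitespace is removed.
# Missing values compare unequal to "", so they come out False here.
is_blank = toy["review"].str.strip().eq("")

# Keep only rows that are neither missing nor blank.
cleaned = toy[~(is_missing | is_blank)]
print(cleaned)
```

Applied to the real `df`, the same mask would be expected to remove the same 62 rows, provided the file contains no empty-string reviews that the original loop would have kept.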

## Check labels

In [9]:
df["label"].value_counts()

pos    969
neg    969
Name: label, dtype: int64
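Because the two classes are perfectly balanced, a classifier that always predicts the same label would be right half the time. A minimal sketch of that baseline, using the counts from the output above (the variable names are illustrative only):

```python
from collections import Counter

# Label counts copied from the value_counts() output above.
counts = Counter({"pos": 969, "neg": 969})

# Accuracy of always predicting the most common label.
baseline_accuracy = max(counts.values()) / sum(counts.values())
print(baseline_accuracy)  # 0.5
```

Any accuracy reported for this dataset is best read against that 0.5 reference point.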

## Now let's perform sentiment analysis on the reviews and add the scores to our dataframe

In [10]:
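# Full VADER score dictionary (neg, neu, pos, compound) for each review
df['scores'] = df['review'].apply(lambda x: sia.polarity_scores(x))
# Pull out the compound score
df['compound'] = df['scores'].apply(lambda x: x['compound'])
# Predicted label: a compound score of 0 or above counts as positive
df['compound_scores'] = df['compound'].apply(lambda score: "pos" if score >= 0 else "neg")
df.head()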

|   | label | review | scores | compound | compound_scores |
|---|-------|--------|--------|----------|-----------------|
| 0 | neg | how do films like mouse hunt get into theatres... | {'neg': 0.121, 'neu': 0.778, 'pos': 0.101, 'co... | -0.9125 | neg |
| 1 | neg | some talented actresses are blessed with a dem... | {'neg': 0.12, 'neu': 0.775, 'pos': 0.105, 'com... | -0.8618 | neg |
| 2 | pos | this has been an extraordinary year for austra... | {'neg': 0.067, 'neu': 0.783, 'pos': 0.15, 'com... | 0.9953 | pos |
| 3 | pos | according to hollywood movies made in last few... | {'neg': 0.069, 'neu': 0.786, 'pos': 0.145, 'co... | 0.9972 | pos |
| 4 | neg | my first press screening of 1998 and already i... | {'neg': 0.09, 'neu': 0.822, 'pos': 0.088, 'com... | -0.7264 | neg |
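The labeling rule used above can be isolated as a small function, which makes the cutoff explicit. This is only a sketch: `label_from_compound` and its `threshold` parameter are hypothetical names introduced here, with the default of 0 matching the rule in the cell above.

```python
def label_from_compound(compound, threshold=0.0):
    """Map a compound score to 'pos' or 'neg' using a single cutoff."""
    return "pos" if compound >= threshold else "neg"

# Compound scores copied from the first five rows shown above.
compounds = [-0.9125, -0.8618, 0.9953, 0.9972, -0.7264]
labels = [label_from_compound(c) for c in compounds]
print(labels)  # ['neg', 'neg', 'pos', 'pos', 'neg']
```

Note that a score of exactly 0 is labeled `pos` under this rule, so any ties go to the positive class.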


## Evaluate the output of the VADER model

In [11]:
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix

## Accuracy of the model

In [12]:
accuracy_score(df["label"],df["compound_scores"])

0.6367389060887513

In [13]:
print(classification_report(df["label"],df["compound_scores"]))

              precision    recall  f1-score   support

         neg       0.72      0.44      0.55       969
         pos       0.60      0.83      0.70       969

   micro avg       0.64      0.64      0.64      1938
   macro avg       0.66      0.64      0.62      1938
weighted avg       0.66      0.64      0.62      1938



In [14]:
confusion_matrix(df["label"],df["compound_scores"])

array([[427, 542],
       [162, 807]], dtype=int64)
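The figures in the classification report can be recomputed by hand from this matrix, which helps show where the errors sit. The sketch below assumes scikit-learn's usual layout: rows are true labels and columns are predicted labels, both in sorted order (`neg`, then `pos`).

```python
# Confusion matrix copied from the output above.
cm = [[427, 542],
      [162, 807]]

neg_as_neg, neg_as_pos = cm[0]  # true "neg" reviews
pos_as_neg, pos_as_pos = cm[1]  # true "pos" reviews

total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos
accuracy = (neg_as_neg + pos_as_pos) / total

# Recall: share of each true class that was labeled correctly.
recall_neg = neg_as_neg / (neg_as_neg + neg_as_pos)
recall_pos = pos_as_pos / (pos_as_pos + pos_as_neg)

# Precision: share of each predicted class that was correct.
precision_neg = neg_as_neg / (neg_as_neg + pos_as_neg)
precision_pos = pos_as_pos / (pos_as_pos + neg_as_pos)

# F1: harmonic mean of precision and recall.
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.4f}")
print(f"neg: precision={precision_neg:.2f} recall={recall_neg:.2f} f1={f1_neg:.2f}")
print(f"pos: precision={precision_pos:.2f} recall={recall_pos:.2f} f1={f1_pos:.2f}")
```

The low recall for `neg` comes from the 542 negative reviews that were labeled positive, more than half of that class.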

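So, it looks like VADER couldn't judge the movie reviews very accurately: its accuracy of about 64% is only modestly better than the 50% expected from guessing on this balanced dataset. The confusion matrix shows where it fails most often: 542 of the 969 negative reviews were labeled positive. This demonstrates one of the biggest challenges in sentiment analysis: understanding human semantics. Many of the reviews had positive things to say about a movie, reserving final judgement for the last sentence.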