# Sentiment Analysis on Movie Reviews
#### By Vishaal Rao
 Classify the sentiment of sentences from the Rotten Tomatoes dataset

## Introduction

The dataset consists of tab-separated files with phrases from the Rotten Tomatoes dataset. The train/test split has been preserved for the purposes of benchmarking, but the sentences have been shuffled from their original order. Each sentence has been parsed into many phrases by the Stanford parser. Each phrase has a PhraseId, and each sentence has a SentenceId. Phrases that are repeated (such as short/common words) are only included once in the data.

Dataset taken from:
 https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data 

## Important libraries used in this project

1. **NLTK**: The Natural Language Toolkit, more commonly NLTK, is a suite of libraries and programs, written in Python, for symbolic and statistical natural language processing of English.


2. **pandas**: pandas is a software library written for the Python programming language for data manipulation and analysis.


3. **sklearn.feature_extraction.text**: This module is used to perform count vectorization.


4. **sklearn.linear_model**: This module is used to perform logistic regression on the given data set.


5. **re**: Python's regular expression module. A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

## Methodology



1. **CountVectorizer** provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.



2. **Logistic Regression** is a statistical model that in its basic form uses a logistic function to model a binary dependent variable. This project uses its multinomial form, because the target has more than two classes.



3. **Naive Bayes classifiers** are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features.
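
As a rough illustration of what count vectorization does, the sketch below builds a vocabulary from a few toy documents and encodes each document as a row of token counts. It is a pure-Python approximation, not scikit-learn's implementation: the helper names `fit_vocabulary` and `transform` and the toy documents are made up for this example, and the real `CountVectorizer` tokenizes differently and returns a sparse matrix.

```python
from collections import Counter


def fit_vocabulary(documents):
    """Map each distinct whitespace token to a column index (sorted for stability)."""
    tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Encode each document as a list of token counts; unseen tokens are dropped."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # Dicts keep insertion order, so columns follow the vocabulary indices.
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Toy documents, invented for illustration only.
train_docs = ["good movie", "bad movie", "good good story"]
vocab = fit_vocabulary(train_docs)
train_matrix = transform(train_docs, vocab)
new_matrix = transform(["good plot"], vocab)  # "plot" is not in the vocabulary

print(vocab)
print(train_matrix)
print(new_matrix)
```

The last call shows the key property used later in the notebook: new documents are encoded with the vocabulary learned from the training data, so words that were never seen during fitting contribute nothing.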



## Metrics used

1. **Accuracy score**: The ratio of correctly predicted observations to the total number of observations.

     *Accuracy = (TP + TN) / (TP + FP + FN + TN)*


2. **F1-Score**: The F1 score is the harmonic mean of precision and recall. Therefore, this score takes both false positives and false negatives into account. Because the target here has several classes, the per-class F1 scores are averaged, weighted by the number of observations in each class (`average='weighted'`).

     *F1 Score = 2 × (Recall × Precision) / (Recall + Precision)*

     *Recall = TP / (TP + FN)*

     *Precision = TP / (TP + FP)*
    
    <font color=green>*TP-True Positive,*</font>
    <font color=green>*TN-True Negative,*</font>
  <font color=green>*FP-False Positive,*</font>
    <font color=green>*FN-False Negative*</font>
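
To make the two metrics concrete, here is a small pure-Python sketch that computes accuracy and a support-weighted F1 score directly from the definitions above. The function names and the toy labels are assumptions made for this example. The notebook itself relies on `accuracy_score` and `f1_score` from scikit-learn, which this sketch only aims to mirror.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating that class as 'positive' and the rest as 'negative'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighted by each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class_f1(y_true, y_pred, label) * n / total
               for label, n in support.items())


# Toy labels, invented for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy(y_true, y_pred)
f1w = weighted_f1(y_true, y_pred)
print(acc, f1w)
```

In this toy case the per-class F1 scores are 0.5, 0.8 and 2/3. Each class has the same support, so the weighted score is simply their mean.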

In [1]:
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [2]:
import os
# Raw string so the backslashes in the Windows path are not read as escape sequences.
os.chdir(r'D:\Data science\strings_73')

In [3]:
Train=pd.read_table('train.tsv')
Test=pd.read_table('test.tsv')

In [4]:
Train.head()

| | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |


In [5]:
Test.head(10)

| | PhraseId | SentenceId | Phrase |
|---|---|---|---|
| 0 | 156061 | 8545 | An intermittently pleasing but mostly routine ... |
| 1 | 156062 | 8545 | An intermittently pleasing but mostly routine ... |
| 2 | 156063 | 8545 | An |
| 3 | 156064 | 8545 | intermittently pleasing but mostly routine effort |
| 4 | 156065 | 8545 | intermittently pleasing but mostly routine |
| 5 | 156066 | 8545 | intermittently pleasing but |
| 6 | 156067 | 8545 | intermittently pleasing |
| 7 | 156068 | 8545 | intermittently |
| 8 | 156069 | 8545 | pleasing |
| 9 | 156070 | 8545 | but |


## Cleaning and preprocessing the text

STEPS INVOLVED IN PREPROCESSING

1. **Removing punctuation marks from the corpus.**


2. **Lemmatization**: Lemmatization means reducing a word to its base form (lemma). For example, “run” is the base form of words like “running” and “ran”, and the words “better” and “good” share the same lemma, so they are treated as the same word.


3. **Removing stop words**: A stop word is a commonly used word (such as "the") that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.


4. **Removing all words with a frequency of 10 or less.**

In [6]:
xTrain=Train['Phrase']
yTrain=Train['Sentiment']
xTest=Test['Phrase']

### Removing punctuation marks from the corpus.

In [7]:
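import re

# Characters that are deleted outright, with no replacement space: . ; : ! ' ? , " ( ) [ ]
# The pattern is a raw string so that \[ and \] reach the regex engine unchanged.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")

def preprocess_reviews(reviews):
    """Lowercase each phrase and strip the punctuation listed above."""
    return [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]

Phrase_train_clean = preprocess_reviews(xTrain)
Phrase_test_clean = preprocess_reviews(xTest)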

In [8]:
Phrase_train_clean[0:5]

['a series of escapades demonstrating the adage that what is good for the goose is also good for the gander  some of which occasionally amuses but none of which amounts to much of a story ',
 'a series of escapades demonstrating the adage that what is good for the goose',
 'a series',
 'a',
 'series']
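
The character class used above covers only a handful of punctuation marks, and deleting them can leave double spaces behind, as the first phrase in the output shows. The sketch below contrasts that pattern with a different technique, `str.translate` combined with `string.punctuation`, which removes every ASCII punctuation character. This is an alternative offered for comparison, not what the notebook does; the sample sentence and function names are invented for the example.

```python
import re
import string

# Same character class as the notebook's REPLACE_NO_SPACE pattern.
LIMITED_PUNCT = re.compile(r"[.;:!'?,\"()\[\]]")
# Translation table that deletes every character in string.punctuation.
ALL_PUNCT = str.maketrans("", "", string.punctuation)


def strip_limited(text):
    """Lowercase and remove only the characters in the notebook's pattern."""
    return LIMITED_PUNCT.sub("", text.lower())


def strip_all(text):
    """Lowercase, remove all ASCII punctuation, and collapse repeated spaces."""
    return " ".join(text.lower().translate(ALL_PUNCT).split())


sample = "A so-so plot -- but good , fun !"  # made-up sentence
limited = strip_limited(sample)
full = strip_all(sample)

print(repr(limited))
print(repr(full))
```

The broader variant also merges hyphenated words ("so-so" becomes "soso"), so which behaviour is preferable depends on the corpus.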

### Lemmatization

In [9]:
from nltk.stem import WordNetLemmatizer

def get_lemmatized_text(corpus):
    # lemmatize() defaults to pos='n', so only noun forms are reduced
    # (e.g. "escapades" -> "escapade"), while verb forms such as "amuses" are left as they are.
    lemmatizer = WordNetLemmatizer()
    return [' '.join(lemmatizer.lemmatize(word) for word in review.split()) for review in corpus]

lemmatized_phrase_train = get_lemmatized_text(Phrase_train_clean)
lemmatized_phrase_test = get_lemmatized_text(Phrase_test_clean)

In [10]:
lemmatized_phrase_train[0:5]

['a series of escapade demonstrating the adage that what is good for the goose is also good for the gander some of which occasionally amuses but none of which amount to much of a story',
 'a series of escapade demonstrating the adage that what is good for the goose',
 'a series',
 'a',
 'series']

### Removing stop words

In [11]:
import nltk
from nltk.corpus import stopwords
print('The version of NLTK used is:', nltk.__version__)

# A set gives fast membership tests; the stop-word list itself is unchanged.
english_stop_words = set(stopwords.words('english'))

def remove_stop_words(corpus):
    """Drop English stop words from every phrase."""
    return [
        ' '.join(word for word in review.split() if word not in english_stop_words)
        for review in corpus
    ]

no_stop_words_Train = remove_stop_words(lemmatized_phrase_train)
no_stop_words_Test = remove_stop_words(lemmatized_phrase_test)

The version of NLTK used is: 3.4


### Removing all words with a frequency of 10 or less.

In [12]:
from nltk.probability import FreqDist

# Count word frequencies separately for the train and test phrases.
fdist1 = FreqDist()
fdist2 = FreqDist()
for phrase in no_stop_words_Train:
    for sentence in nltk.tokenize.sent_tokenize(phrase):
        for word in nltk.tokenize.word_tokenize(sentence):
            fdist1[word] += 1
for phrase in no_stop_words_Test:
    for sentence in nltk.tokenize.sent_tokenize(phrase):
        for word in nltk.tokenize.word_tokenize(sentence):
            fdist2[word] += 1

In [13]:
# Words that occur 10 times or fewer (note the <=) are marked for removal.
Val_train_10 = {key for key, value in fdist1.items() if value <= 10}
Val_test_10 = {key for key, value in fdist2.items() if value <= 10}

In [14]:
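final_train = []
final_test = []
# Note: the filter runs over the lemmatized phrases, not over no_stop_words_Train/Test,
# so stop words that are frequent enough remain in the final text.
# The test phrases are filtered with test-set frequencies (Val_test_10), not the training ones.
for sent in lemmatized_phrase_train:
    final_train.append(' '.join([word for word in sent.split() if word not in Val_train_10]))
for sent in lemmatized_phrase_test:
    final_test.append(' '.join([word for word in sent.split() if word not in Val_test_10]))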
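
One common variation on this step is to decide which words to keep from the training data alone and then apply that same word list to both splits, so that train and test are filtered consistently. The sketch below shows the idea with `collections.Counter`. It is an illustration under assumptions, not the notebook's method: the helper names, the toy phrases, and the toy threshold of 2 are invented (dropping words seen 10 times or fewer would correspond to `min_count=11`).

```python
from collections import Counter


def frequent_words(documents, min_count):
    """Return the set of tokens that occur at least min_count times in documents."""
    counts = Counter(tok for doc in documents for tok in doc.split())
    return {tok for tok, n in counts.items() if n >= min_count}


def keep_words(documents, allowed):
    """Keep only the tokens found in the allowed set, preserving word order."""
    return [" ".join(tok for tok in doc.split() if tok in allowed)
            for doc in documents]


# Toy phrases, invented for illustration only.
train = ["good movie", "good story", "odd movie"]
test = ["good odd plot"]

allowed = frequent_words(train, min_count=2)  # learned from the training phrases only
train_filtered = keep_words(train, allowed)
test_filtered = keep_words(test, allowed)

print(sorted(allowed))
print(train_filtered)
print(test_filtered)
```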
    

### Using CountVectorizer

In [15]:
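from sklearn.feature_extraction.text import CountVectorizer

# Word n-grams of length 1 to 12; min_df=1 keeps every n-gram that appears in at least one phrase.
# The vocabulary is learned from the training phrases only and reused for the test phrases.
count_vectorizer = CountVectorizer(ngram_range=(1, 12), min_df=1)
count_vectorizer.fit(final_train)
X = count_vectorizer.transform(final_train)
X_test = count_vectorizer.transform(final_test)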
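
To see what an n-gram range of 1 to 12 means in practice, the following pure-Python sketch enumerates the word n-grams of one short phrase. The function `word_ngrams` is a hypothetical helper written for this example and only approximates how scikit-learn builds n-gram features.

```python
def word_ngrams(text, min_n, max_n):
    """List every run of min_n..max_n consecutive words in text."""
    tokens = text.split()
    grams = []
    # n cannot exceed the number of tokens, so a short phrase yields few n-grams.
    for n in range(min_n, min(max_n, len(tokens)) + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


grams = word_ngrams("mostly routine effort", 1, 12)
print(grams)
```

A phrase of k words produces k(k+1)/2 n-grams when the upper limit is at least k, so long phrases add many features. That is worth keeping in mind when choosing the upper bound.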

In [16]:
from sklearn.linear_model  import LogisticRegression
from sklearn.naive_bayes  import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

In [17]:
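# Hold out 34% of the training phrases for validation.
# No random_state is set, so the split (and the scores below) can vary between runs.
X_train, X_val, y_train, y_val = train_test_split(
    X, yTrain, train_size=0.66)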

### Building the model

We compare logistic regression and Naive Bayes to find out which model performs better.

Why have we considered these two models?

 - The dependent variable has multiple levels (classes).
 - We need to derive the probability of each level relative to the other levels.
 - Both models treat the levels as nominal, that is, without any order.
 - Both models are widely used for text classification tasks such as sentiment analysis.
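
As a sketch of how a multinomial model turns scores into "the probability of each level", the snippet below applies the softmax function to one score per class. The scores are hypothetical numbers chosen for illustration; in multinomial logistic regression they would come from a linear function of the features.

```python
import math


def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for five sentiment classes.
scores = [0.2, 1.5, 3.0, 0.7, -1.0]
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs])
print(predicted)
```

The predicted class is the one with the highest probability, which is also the one with the highest raw score.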

### Logistic Regression (Multinomial)

In [18]:
# C=1 is used here; per the author's note, C=4 gives the best
lr = LogisticRegression(C=1, multi_class='multinomial', solver='saga', max_iter=200)
lr.fit(X_train, y_train)
val_pred_lr = lr.predict(X_val)  # predict once and reuse for both metrics
print("Accuracy for C=%s: %s" % (lr.C, accuracy_score(y_val, val_pred_lr)))
print("The f1 score is %s" % f1_score(y_val, val_pred_lr, average='weighted'))

Accuracy for C=1: 0.6508923691600234
The f1 score is 0.6373559845240885


### Multinomial Naive Bayes

In [19]:
NB = MultinomialNB(alpha=2)  # alpha is the additive smoothing parameter
NB.fit(X_train, y_train)
val_pred_nb = NB.predict(X_val)  # predict once and reuse for both metrics
print("Accuracy for alpha =%s: %s" % (NB.alpha, accuracy_score(y_val, val_pred_nb)))
print("The f1 score is %s" % f1_score(y_val, val_pred_nb, average='weighted'))

Accuracy for alpha =2: 0.6113529711087239
The f1 score is 0.601486804458539
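
The role of `alpha` can be illustrated with a small hand-written multinomial Naive Bayes. The sketch below estimates smoothed word probabilities as (count + alpha) / (total + alpha × vocabulary size) and picks the class with the highest log score. It is a simplified stand-in for `MultinomialNB`, not its actual implementation, and the function names, documents and labels are invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha):
    """Estimate log priors and additively smoothed log word likelihoods per class."""
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        # alpha is added to every word count, so unseen words keep a non-zero probability.
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab}
    return log_prior, log_like


def predict_nb(doc, log_prior, log_like):
    """Return the class with the highest log posterior; unknown words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in doc.split() if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Toy data, invented for illustration only (1 = positive, 0 = negative).
docs = ["good fun", "good story", "bad plot", "bad dull"]
labels = [1, 1, 0, 0]
log_prior, log_like = train_nb(docs, labels, alpha=2)

pred_a = predict_nb("good plot", log_prior, log_like)
pred_b = predict_nb("bad fun", log_prior, log_like)
print(pred_a, pred_b)
```

A larger `alpha` pulls the word probabilities of all classes toward a uniform distribution, which makes the model less sensitive to rare words.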


In [20]:
# Accuracy and weighted F1 on the validation split, rounded to two decimals.
d = {
    'Logistic Regression': [
        round(accuracy_score(y_val, lr.predict(X_val)), 2),
        round(f1_score(y_val, lr.predict(X_val), average='weighted'), 2),
    ],
    'Multinomial Naive Bayes': [
        round(accuracy_score(y_val, NB.predict(X_val)), 2),
        round(f1_score(y_val, NB.predict(X_val), average='weighted'), 2),
    ],
}

In [21]:
Tally=pd.DataFrame(data=d,index=['Accuracy score','F1 score'])

In [22]:
Tally

| | Logistic Regression | Multinomial Naive Bayes |
|---|---|---|
| Accuracy score | 0.65 | 0.61 |
| F1 score | 0.64 | 0.6 |


As we can see, logistic regression performs better than Naive Bayes on both accuracy and F1 score. Hence, we use logistic regression to predict on the test data set.

In [23]:
lr.predict(X_test)

array([2, 2, 2, ..., 2, 2, 2], dtype=int64)

In [24]:
# Use the dataset's own column name, PhraseId, in the submission header.
Final_soln=pd.DataFrame({'PhraseId':Test['PhraseId'],'Sentiment':lr.predict(X_test)})

In [25]:
Final_soln.set_index('PhraseId',inplace=True)

In [26]:
Final_soln.to_csv('Submission-Vishaal Rao.csv',index_label='PhraseId')