<h1><center> Sentiment Analysis of Movie Reviews using Bag-of-Words (BoW) model </center></h1>

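The bag-of-words (BoW) model is one of the simplest NLP models: it represents a document by the words it contains and how often they occur, ignoring grammar and word order. It is simple to understand and implement and has seen great success in problems such as language modeling and document classification.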
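
As a minimal sketch of the idea, the two toy sentences below (invented for illustration, not taken from the data set) are turned into "bags" using only the standard library. Shuffling the words of a sentence gives exactly the same bag, which shows that word order is discarded.

```python
from collections import Counter

# Toy documents, made up for this illustration
docs = ["the movie was great great fun", "the movie was not great"]

# A "bag" maps each word to the number of times it occurs
bags = [Counter(doc.split()) for doc in docs]

# Same words in a different order produce an identical bag
shuffled_bag = Counter("great fun was the movie great".split())

print(bags[0]["great"])         # occurrences of "great" in the first document
print(bags[1]["not"])           # occurrences of "not" in the second document
print(shuffled_bag == bags[0])  # word order does not matter
```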

## Reading The Data

Let's first read the data. For this implementation, I will use the <a href="https://www.kaggle.com/marklvl/sentiment-labelled-sentences-data-set"><b>Sentiment Labelled Sentences Data Set</b></a> on Kaggle, which contains labelled sentences from Amazon, Yelp, and IMDb.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data1 = pd.read_csv("./sentiment labelled sentences/amazon_cells_labelled.txt", delimiter="\t", header=None)
data2 = pd.read_csv("./sentiment labelled sentences/yelp_labelled.txt", delimiter="\t", header=None)
data3 = pd.read_csv("./sentiment labelled sentences/imdb_labelled.txt", delimiter="\t", header=None)

reviews_df = pd.concat([data1, data2, data3])
reviews_df.columns = ["Review_text", "Review_class"]

In [3]:
reviews_df.head(5)

| | Review_text | Review_class |
|---|---|---|
| 0 | So there is no way for me to plug it in here i... | 0 |
| 1 | Good case, Excellent value. | 1 |
| 2 | Great for the jawbone. | 1 |
| 3 | Tied to charger for conversations lasting more... | 0 |
| 4 | The mic is great. | 1 |


## Cleaning Data

Cleaning the data is an important part of any NLP pipeline. Here I will use <em>NLTK</em> (Natural Language Toolkit) for cleaning. The process consists of five steps:
<ol>
  <li> Converting all characters to lower case </li>
  <li> Removing any possible link </li>
  <li> Removing punctuation</li>
  <li> Stemming, i.e., the process of reducing inflected words to their root form </li>
  <li> Removing <em>stop words</em>, a set of very common words in a language that carry little meaning on their own </li>
</ol>
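
The first three steps need nothing beyond Python's `re` module, so they can be tried in isolation. The sketch below is a simplified stand-in: the function name `basic_clean` and both patterns are assumptions made for this illustration and are deliberately cruder than the patterns used in the notebook's own `clean_string`.

```python
import re

# Simplified, hypothetical patterns for illustration only
LINK_RE = re.compile(r"https?://\S+")  # anything that starts with http:// or https://
PUNCT_RE = re.compile(r"[^\w\s]")      # any character that is not a word character or whitespace

def basic_clean(text):
    text = text.lower()             # step 1: lower case
    text = LINK_RE.sub("", text)    # step 2: remove links
    text = PUNCT_RE.sub("", text)   # step 3: remove punctuation
    return " ".join(text.split())   # collapse the leftover whitespace

print(basic_clean("Great phone!!! See https://example.com/review for MORE."))
```

Stemming and stop-word removal are left out here because they rely on NLTK resources.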

In [4]:
import re
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [5]:
def clean_string(document):
    
        # everything should be lower case
        document = document.lower()
                
        # remove any possible link
        links_pattern = r"((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*"  # raw string avoids invalid-escape warnings
        links = re.compile(links_pattern)        
        document = re.sub(links, "", document)
        
        # remove punctuation
        punct_pattern = ("[,.\"!@#$%^&*(){}?/;`~:<>+=-]")
        punct = re.compile(punct_pattern)
        document = re.sub(punct, "", document)
        
        # stemmer
        ps = PorterStemmer()
        words = word_tokenize(document)
        
        # a list of stop words
        stop_words = set(stopwords.words("english"))
        stop_words.discard("not")  # keep "not": intuitively, negation matters for sentiment
        
        # process of stemming and removing stopwords
        words = [ps.stem(word) for word in words if not (word in stop_words)]
        
        words = ' '.join(words)
        
        return words

In [6]:
def clean_data(dataframe):
    clean = list()
    reviews = dataframe["Review_text"].values.tolist()
    for document in reviews:
        
        words = clean_string(document)
        
        clean.append(words)
       
    return clean

In [7]:
clean_reviews = clean_data(reviews_df)
clean_reviews[0:5]

['way plug us unless go convert',
 'good case excel valu',
 'great jawbon',
 'tie charger convers last 45 problem',
 'mic great']

## Making the Feature Matrix

The next step is to turn each document of free text into a vector that we can use as input for a machine learning model. To do so, we first build a <b>dictionary</b> of the unique words in the corpus. Then we create a matrix in which each row corresponds to a document in the corpus and each column corresponds to a word in the dictionary. The $(i, j)$ element of the feature matrix holds the number of times the $j$-th word occurs in the $i$-th document (in the binary variant of the model it is simply 1 if the word is present and 0 otherwise).
<br>
To do this, we will use the `CountVectorizer` from <i>sklearn</i>, which stores counts by default. Setting `min_df=3` drops every word that appears in fewer than three documents.
<br>
As the output of the next cell shows, we end up with a dataset with 2748 rows and 1208 columns.
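
To make the mechanics concrete, here is a dependency-free sketch of building such a matrix by hand. It is only an illustration of the idea, not how `CountVectorizer` is implemented; the toy corpus and the threshold of 2 are assumptions chosen so that the result is easy to check by eye. Like `CountVectorizer` with its default settings, it stores word counts, and it drops words whose document frequency is below the threshold.

```python
from collections import Counter

# Toy corpus of already-cleaned documents (invented for this illustration)
docs = ["good case good valu", "great jawbon", "mic great", "good mic"]
min_df = 2  # hypothetical threshold for this tiny corpus

tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each word appear?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Dictionary of words that are frequent enough, in alphabetical order
vocab = sorted(word for word, df in doc_freq.items() if df >= min_df)
index = {word: j for j, word in enumerate(vocab)}

# One row per document, one column per dictionary word, entries are counts
matrix = []
for tokens in tokenized:
    row = [0] * len(vocab)
    for word in tokens:
        if word in index:  # words outside the dictionary are ignored
            row[index[word]] += 1
    matrix.append(row)

print(vocab)
for row in matrix:
    print(row)
```

Words such as `case` or `jawbon` occur in only one document, so they get no column at all.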

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=3)   
X = vectorizer.fit_transform(clean_reviews).toarray()
y = reviews_df.values[:,1].astype('int')
print(np.shape(X))
print(np.shape(y))

(2748, 1208)
(2748,)


## Train Test Split

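Split the feature matrix into random train and test subsets, holding out 20% of the documents for testing. Because no `random_state` is set, the exact scores reported below will vary from run to run.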

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True)
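
If a repeatable split is wanted, `train_test_split` also accepts `random_state` and `stratify`. The sketch below uses a small made-up array in place of the real feature matrix; `X_toy`, `y_toy`, the helper `split`, and the seed value are assumptions for illustration, and this is an optional variation rather than what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the feature matrix and labels (not the real data)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

def split(seed):
    # random_state fixes the shuffle; stratify keeps the class balance in both subsets
    return train_test_split(
        X_toy, y_toy, test_size=0.2, random_state=seed, stratify=y_toy
    )

X_tr, X_te, y_tr, y_te = split(seed=0)
X_tr2, X_te2, y_tr2, y_te2 = split(seed=0)

print(X_tr.shape, X_te.shape)        # 8 training rows, 2 test rows
print(sorted(y_te.tolist()))         # one example of each class in the test set
print(np.array_equal(X_te, X_te2))   # same seed, same split
```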

## Classification Model

Once we have a dataset, we can fit any machine learning algorithm to it. Here we will use <i>Random Forest</i> and <i>Naive Bayes</i> separately.

### Random Forest

Fitting a random forest model to data.

In [10]:
from sklearn.ensemble import RandomForestClassifier

RF_model = RandomForestClassifier()

RF_model.fit(X_train, y_train)

### Testing the model

In [11]:
from sklearn.metrics import accuracy_score, f1_score, precision_score


y_pred = RF_model.predict(X_test)

print(f"accuracy: {accuracy_score(y_test, y_pred) * 100:.3f}%")
print(f"f1 score: {f1_score(y_test, y_pred):.3f}")
print(f"precision score: {precision_score(y_test, y_pred):.3f}")

accuracy: 81.818%
f1 score: 0.819
precision score: 0.825
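
For readers unfamiliar with these metrics, the sketch below computes them by hand for a binary problem from the counts of true/false positives and negatives. The label lists are invented for illustration and have nothing to do with the results above.

```python
# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat = [1, 0, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy: {accuracy:.3f}")
print(f"precision: {precision:.3f}")
print(f"recall: {recall:.3f}")
print(f"f1: {f1:.3f}")
```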


### Giving Manual Inputs to the Model

In [12]:
new_review = input("Manual input: ")
clean_x = clean_string(new_review)
clean_x = [clean_x]
new_input = vectorizer.transform(clean_x).toarray()

predic = RF_model.predict(new_input)

print("the review is a ", end = "")

if predic:
    print("positive review")
    
else:
    print("negative review")

Manual input:  I hated it, it was a waste of time!!


the review is a negative review


---

### Naive Bayes Model

In [13]:
from sklearn.naive_bayes import GaussianNB

NB_model = GaussianNB()

NB_model.fit(X_train, y_train)

### Testing the model

In [14]:
y_pred = NB_model.predict(X_test)

print(f"accuracy: {accuracy_score(y_test, y_pred) * 100:.3f}%")
print(f"f1 score: {f1_score(y_test, y_pred):.3f}")
print(f"precision score: {precision_score(y_test, y_pred):.3f}")

accuracy: 68.545%
f1 score: 0.725
precision score: 0.650


In [15]:
new_review = input("Manual input: ")
clean_x = clean_string(new_review)
clean_x = [clean_x]
new_input = vectorizer.transform(clean_x).toarray()

predic = NB_model.predict(new_input)

print("the review is a ", end = "")

if predic:
    print("positive review")
    
else:
    print("negative review")

Manual input:  I really liked it. Great work.


the review is a positive review
