## Movie Sentiment Analysis

  * [Objective](#objective)
  * [About Data](#data)
  * [Creating the Corpus](#corpus)
  * [Build the model](#model)
      + [Bag of Words](#bow)
      + [TF-IDF model](#tfidf_model)
  * [Evaluating the model](#metric)
  * [Save the model](#save)


<a id='objective'></a>

## **Objective:** 

Predict the sentiment (positive / negative) of a movie review from its text.

<a id='data'></a>

## About Data

The polarity dataset is available at the [Cornell website](http://www.cs.cornell.edu/people/pabo/movie-review-data/).
This dataset has 2000 movie reviews (1000 positive and 1000 negative).

The reviews are available as text files in the following subdirectories, one per class:

  * txt_sentoken/neg
  
  * txt_sentoken/pos
  
So, you can load the files using `load_files` from the `sklearn.datasets` utilities.

### Read Data

In [1]:
import numpy as np
from sklearn.datasets import load_files

In [2]:
reviews = load_files("./txt_sentoken")

In [3]:
type(reviews)

sklearn.utils.Bunch

The **[sklearn.utils.Bunch](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files)** returned by `load_files` is a dictionary-like object.
So, you can get all its keys and values.

In [4]:
reviews.keys()

['target_names', 'data', 'target', 'DESCR', 'filenames']

In [5]:
len(reviews)

5

In [6]:
reviews.target_names

['neg', 'pos']

In [7]:
reviews.target

array([0, 1, 1, ..., 1, 0, 0])

In [8]:
len(reviews.target)

2000

In [9]:
reviews.filenames

array(['./txt_sentoken/neg/cv405_21868.txt',
       './txt_sentoken/pos/cv190_27052.txt',
       './txt_sentoken/pos/cv132_5618.txt', ...,
       './txt_sentoken/pos/cv653_19583.txt',
       './txt_sentoken/neg/cv559_0057.txt',
       './txt_sentoken/neg/cv684_12727.txt'], dtype='|S34')

In [10]:
len(reviews.filenames)

2000

In [11]:
reviews.data[1]

"good films are hard to find these days . \ngreat films are beyond rare . \nproof of life , russell crowe's one-two punch of a deft kidnap and rescue thriller , is one of those rare gems . \na taut drama laced with strong and subtle acting , an intelligent script , and masterful directing , together it delivers something virtually unheard of in the film industry these days , genuine motivation in a story that rings true . \nconsider the strange coincidence of russell crowe's character in proof of life making the moves on a distraught wife played by meg ryan's character in the film -- all while the real russell crowe was hitching up with married woman meg ryan in the outside world . \ni haven't seen this much chemistry between actors since mcqueen and mcgraw teamed up in peckinpah's masterpiece , the getaway . \nbut enough with the gossip , let's get to the review . \nthe film revolves around the kidnapping of peter bowman ( david morse ) , an american engineer working in south america 

In [12]:
len(reviews.data)

2000

In [13]:
## Get the data & target values
X, y = reviews.data, reviews.target

<a id='corpus'></a>
### Creating the corpus

Perform the following steps:
 * convert all the text to **lower case**
 * remove special characters (!@#$\%\' etc.), if any
 * remove single characters
 * collapse extra spaces

In [14]:
# use regular expression to clean 
import re

In [15]:
# create the corpus
corpus = []

for text in X:
    # replace special (non-word) characters with spaces
    review = re.sub(r'\W', ' ', text)
    # convert to lower case
    review = review.lower()
    # remove single characters surrounded by whitespace
    review = re.sub(r'\s+[a-z]\s+', ' ', review)
    # collapse runs of whitespace into a single space
    review = re.sub(r'\s+', ' ', review)
    # add it to the corpus
    corpus.append(review)


In [16]:
len(corpus)

2000

In [17]:
corpus[-1]

'any remake of an alfred hitchcock film is at best an uncertain project as perfect murder illustrates frankly dial for murder is not one of the master director greatest efforts so there is ample room for improvement unfortunately instead of updating the script ironing out some of the faults and speeding up the pace little perfect murder has inexplicably managed to eliminate almost everything that was worthwhile about dial for murder leaving behind the nearly unwatchable wreckage of would be 90s thriller almost all suspense films are loaded with plot implausibilities the best thrillers keep viewers involved enough in what going on so that these flaws in logic don become apparent until long after the final credits have rolled unfortunately in perfect murder the faults are often so overt that we become aware of them as they re happening this is very bad sign not only do such occurrences shatter any suspension of disbelief but they have the astute viewer looking for the next such blunder o
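The cleaning steps can also be wrapped in a small helper so that a single review can be inspected on its own. The sketch below is an illustration and not part of the original notebook: the name `clean_review` is hypothetical, and the `bytes` handling assumes Python 3, where `load_files` returns raw bytes unless an `encoding` is passed.

```python
import re


def clean_review(text, encoding="utf-8"):
    """Apply the same four cleaning steps to one review (hypothetical helper)."""
    # Assumption: on Python 3 the raw review may be bytes, so decode it first.
    if isinstance(text, bytes):
        text = text.decode(encoding, errors="replace")
    # replace every non-word character with a space
    text = re.sub(r"\W", " ", text)
    # convert to lower case
    text = text.lower()
    # drop single letters that have whitespace on both sides
    text = re.sub(r"\s+[a-z]\s+", " ", text)
    # collapse runs of whitespace into a single space
    return re.sub(r"\s+", " ", text)


sample = b"The film's plot is GREAT !"
print(repr(clean_review(sample)))
```

Note that the apostrophe in `film's` becomes a space and the stranded `s` is then removed, which is why possessives and contractions lose their suffix in the corpus output above.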

<a id='model'></a>

## Build Model

<a id='bow'></a>

### Bag of Words

The drawbacks of the Bag of Words model are:
  * It does not capture the semantics of the sentence.
  * It gives equal weight to all words.
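Both drawbacks can be seen on a toy example. The sketch below is not from the original notebook; it uses a plain `collections.Counter` as a stand-in for a bag-of-words vectorizer, and the helper name `bag_of_words` and the two sentences are made up for illustration.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count how often each whitespace-separated token occurs."""
    return Counter(sentence.lower().split())


a = bag_of_words("the movie was good not bad")
b = bag_of_words("the movie was bad not good")

# Word order is discarded, so the two opposite opinions get identical counts.
print(a == b)

# Every word counts the same: 'the' weighs as much as 'good'.
print(a["good"], a["the"])
```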

<a id='tfidf_model'></a>
### TF-IDF model

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.corpus import stopwords

In [19]:
?TfidfVectorizer

In [20]:
vectorizer = TfidfVectorizer(max_features=2000, stop_words=stopwords.words('english'))


In [21]:
X = vectorizer.fit_transform(corpus)

In [22]:
print(X.shape)

(2000, 2000)
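For intuition about the values in this matrix, here is a rough sketch of the inverse document frequency part of the weighting. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` that scikit-learn documents for its default `smooth_idf=True` setting; the helper name `smoothed_idf` and the three toy documents are illustrative only, and the final L2 normalisation of each row is left out.

```python
import math


def smoothed_idf(term, documents):
    """Smoothed inverse document frequency of `term` over tokenised documents."""
    n_docs = len(documents)
    # number of documents that contain the term at least once
    doc_freq = sum(1 for doc in documents if term in doc)
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0


docs = [
    ["great", "movie"],
    ["boring", "movie"],
    ["great", "acting", "movie"],
]

for term in ("movie", "great", "boring"):
    print(term, round(smoothed_idf(term, docs), 3))
```

A word that appears in every document gets the lowest weight, while rarer words are weighted up, which addresses the equal-weight drawback of plain Bag of Words.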


In [23]:
#print(vectorizer.get_feature_names())

In [24]:
X = X.toarray()

In [25]:
X[0]

array([0., 0., 0., ..., 0., 0., 0.])

### Train and Test data split

In [26]:
from sklearn.model_selection import train_test_split

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=12)

### Training the Classifier

In [28]:
from sklearn.linear_model import LogisticRegression

In [29]:
classifier = LogisticRegression()

In [30]:
classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [31]:
## predict
pred = classifier.predict(X_test)

<a id='metric'></a>
### Model Evaluation

In [32]:
from sklearn.metrics import confusion_matrix

In [33]:
cm = confusion_matrix(y_test, pred)

In [34]:
print(cm)

[[171  35]
 [ 29 165]]


In [35]:
print("Test Size:", len(y_test))

('Test Size:', 400)


In [36]:
print("Accuracy : {}".format((cm[0, 0] + cm[1, 1]) * 1.0 / len(y_test)))

Accuracy : 0.84
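The confusion matrix holds more than the accuracy. As a minimal sketch, the snippet below re-derives accuracy from the numbers printed above and adds precision and recall for the `pos` class. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels, in the order `['neg', 'pos']`; the variable names are illustrative. In practice, `sklearn.metrics.accuracy_score(y_test, pred)` and `sklearn.metrics.classification_report(y_test, pred)` compute these directly.

```python
# Confusion matrix copied from the output above:
# rows are true labels, columns are predictions, in the order ['neg', 'pos'].
cm = [[171, 35],
      [29, 165]]

tn, fp = cm[0]
fn, tp = cm[1]
total = tn + fp + fn + tp

# share of all test reviews that were classified correctly
accuracy = (tp + tn) / float(total)
# of the reviews predicted 'pos', the share that really were 'pos'
precision_pos = tp / float(tp + fp)
# of the truly 'pos' reviews, the share the classifier found
recall_pos = tp / float(tp + fn)

print("Accuracy  : {:.3f}".format(accuracy))
print("Precision : {:.3f}".format(precision_pos))
print("Recall    : {:.3f}".format(recall_pos))
```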


<a id='save'></a>
### Save Classifier and Model

In [37]:
import pickle

In [38]:
# classifier

with open('classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

In [39]:
# save the vectorizer model

with open('vectorizer.pickle', 'wb') as f:
    pickle.dump(vectorizer, f)
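To reuse the saved objects later, both pickles have to be loaded, and a new review has to go through the same cleaning and the same fitted vectorizer before it reaches the classifier. The sketch below is one possible way to do that and is not part of the original notebook; `load_pickle` and `predict_sentiment` are hypothetical helper names, and the label order `("neg", "pos")` is taken from `reviews.target_names` above.

```python
import pickle
import re


def load_pickle(path):
    """Load one pickled object; only unpickle files from a source you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_sentiment(text, vectorizer, classifier, target_names=("neg", "pos")):
    """Clean one review like the corpus, vectorize it, and return its label name."""
    # same cleaning steps that were used to build the corpus
    cleaned = re.sub(r"\W", " ", text).lower()
    cleaned = re.sub(r"\s+[a-z]\s+", " ", cleaned)
    cleaned = re.sub(r"\s+", " ", cleaned)
    # transform (not fit_transform): reuse the vocabulary learned during training
    features = vectorizer.transform([cleaned])
    label = classifier.predict(features)[0]
    return target_names[label]


# Example usage (assumes the two files written above exist in the working directory):
# classifier = load_pickle("classifier.pickle")
# vectorizer = load_pickle("vectorizer.pickle")
# print(predict_sentiment("a taut drama with masterful directing", vectorizer, classifier))
```

Saving the vectorizer alongside the classifier matters because the classifier only understands the 2000 feature columns that this particular vectorizer produces.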
          