## Text Analytics -- Movie Reviews Sentiment Analysis

I would like to walk through the sentiment analysis of movie reviews from scratch.
Initially we will use machine learning with NLTK, and then we will move on to Word2Vec using deep learning.


Steps:

Phase-1 
--------

1) Load a review data set

2) Perform the preprocessing of the text

3) Convert the text to numerical values, i.e. vectorize the text (we will use the TF-IDF model)

4) Use the multinomial Naive Bayes model to predict the sentiment of a review


In [1]:
import nltk
import pandas as pd
import numpy as np
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score
from sklearn.feature_extraction.text import TfidfVectorizer
import re


In [2]:
my_local_path='E:/P/DataScience/ML/DL/NLP/'

In [3]:
df=pd.read_csv(my_local_path+'labeledTrainData.tsv',delimiter='\t')

In [4]:
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
id           25000 non-null object
sentiment    25000 non-null int64
review       25000 non-null object
dtypes: int64(1), object(2)
memory usage: 586.0+ KB


In [15]:
df['sentiment'].value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

So, there is no class imbalance in the data. Had there been one, we would need to adjust the classification threshold using the ROC curve.
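To make the threshold remark concrete, here is a minimal sketch of picking a threshold from ROC points. The data is made up, and the selection rule (maximising TPR minus FPR, known as Youden's J) is one common choice, not something this notebook uses.

```python
def roc_points(y_true, scores):
    """Return (threshold, fpr, tpr) triples, one per distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [1 if s >= threshold else 0 for s in scores]
        tp = sum(1 for p, t in zip(predicted, y_true) if p == 1 and t == 1)
        fp = sum(1 for p, t in zip(predicted, y_true) if p == 1 and t == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points

def best_threshold(y_true, scores):
    """Pick the threshold that maximises Youden's J = TPR - FPR."""
    return max(roc_points(y_true, scores), key=lambda p: p[2] - p[1])[0]

# Toy, made-up data: only 2 positives among 8 samples (imbalanced).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.45, 0.35, 0.40]
threshold = best_threshold(y_true, scores)
print(threshold)
```

In this toy case every score is below 0.5, so the default cut-off would never predict the positive class; lowering the threshold recovers both positives at the cost of one false positive.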

# Preprocessing of the Text

1) Case Conversion - all the words to lower case

2) Removal of the stop words

3) Removal of the Punctuations

4) Spell correction (using TextBlob, a library built on top of NLTK)

5) Tokenization

6) Stemming

7) Lemmatization

8) POS Tagging

9) Named Entity Recognition (NER)
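As a rough illustration of steps 1, 2, 3 and 5, the sketch below runs them on a single sentence using only the standard library. The stop-word list is a tiny hypothetical one; NLTK ships a much longer list.

```python
import string

# Tiny illustrative stop-word list (an assumption, not NLTK's list).
STOP_WORDS = {"the", "a", "is", "and", "of", "this"}

def preprocess(text):
    text = text.lower()                                               # 1) case conversion
    text = text.translate(str.maketrans("", "", string.punctuation))  # 3) punctuation removal
    tokens = text.split()                                             # 5) whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]                 # 2) stop-word removal

tokens = preprocess("This movie is GREAT, and the ending is a surprise!")
print(tokens)
```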



In [5]:
## Clean-up function: removes URLs, replaces every non-letter character (digits, punctuation) with a space and lower-cases the text
def clean_str(string):
  """
  String cleaning before vectorization
  """
  try:
    # Drop URLs first; otherwise their pieces survive as stray words
    string = re.sub(r'https?://\S+', ' ', string)
    # Keep letters only; everything else becomes a space
    string = re.sub(r"[^A-Za-z]", " ", string)
    # Lower-case and collapse runs of whitespace (single-letter words are kept)
    words = string.strip().lower().split()
    return " ".join(words)
  except TypeError:
    # Non-string input such as NaN
    return ""

In [6]:
df['clean_review']=df['review'].apply(clean_str)

In [8]:
df.head(20)

Unnamed: 0,id,sentiment,review,clean_review
0,5814_8,1,With all this stuff going down at the moment w...,with all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",the classic war of the worlds by timothy hines...
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,the film starts with a manager nicholas bell g...
3,3630_4,0,It must be assumed that those who praised this...,it must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious s...
5,8196_8,1,I dont know why people think this is such a ba...,i dont know why people think this is such a ba...
6,7166_2,0,"This movie could have been very good, but come...",this movie could have been very good but comes...
7,10633_1,0,I watched this video at a friend's house. I'm ...,i watched this video at a friend s house i m g...
8,319_1,0,"A friend of mine bought this film for £1, and ...",a friend of mine bought this film for and even...
9,8713_10,1,<br /><br />This movie is full of references. ...,br br this movie is full of references like ma...
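Row 9 shows that the HTML line breaks in the raw reviews survive cleaning as the stray words "br br". One possible fix, sketched here as a hypothetical variant of the cleaning function (it is not the function used in the rest of this notebook), is to strip tags before removing non-letters.

```python
import re

def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def clean_str_no_html(text):
    # Hypothetical variant of clean_str: drop tags, keep letters, lower-case.
    text = strip_html(text)
    text = re.sub(r"[^A-Za-z]", " ", text)
    return " ".join(text.lower().split())

cleaned = clean_str_no_html("<br /><br />This movie is full of references.")
print(cleaned)
```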


Spell correction: I don't recommend this unless the data set has a lot of spelling mistakes. It is a very expensive piece of code and takes a long time to execute.

In [None]:
from textblob import  TextBlob
df['clean_review']=df['clean_review'].apply(lambda x: str(TextBlob(x).correct()))

# POS tagging and Lemmatization

It is advisable to perform POS (part-of-speech) tagging before lemmatization in order to get the actual lemmas.

The WordNet lemmatizer takes the POS tag into account.

For example:

WordNetLemmatizer().lemmatize("loving")

'loving'

WordNetLemmatizer().lemmatize('loving','v')

'love'

So it treats every word as a noun if we don't pass a POS tag.

The WordNet lemmatizer only knows four parts of speech (ADJ, ADV, NOUN and VERB), so we need to convert the Penn Treebank tags produced by the tagger into the format WordNet understands.
Here I map only the tags that fall into these four groups. You can extend the mapping to the other Penn Treebank tags.

In [36]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

def penn2wordnet(tag):
    if tag in ['NN', 'NNS', 'NNP', 'NNPS']:
        return wn.NOUN
    elif tag in ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']:
        return wn.VERB
    elif tag in ['RB', 'RBR', 'RBS']:
        return wn.ADV
    elif tag in ['JJ', 'JJR', 'JJS']:
        return wn.ADJ
    else:
        return None

In [66]:
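lemmatizer=WordNetLemmatizer()  # create once and reuse for every word

def lemmatize_text(inputText):
    # Tag each token with its Penn Treebank POS, then lemmatize with the mapped WordNet POS
    # (needs the NLTK tokenizer, tagger and WordNet data; see nltk.download)
    pos_tag=nltk.pos_tag(nltk.word_tokenize(inputText))
    outputText=[]
    for word,tag in pos_tag:
        wntag=penn2wordnet(tag)
        if wntag is None:
            # No WordNet equivalent: fall back to the default (noun)
            outputText.append(lemmatizer.lemmatize(word))
        else:
            outputText.append(lemmatizer.lemmatize(word,wntag))
    return ' '.join(outputText)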

We will apply the function to the clean_review text of the dataset

In [69]:
df['clean_review']=df['clean_review'].apply(lemmatize_text)

In [75]:
df.head()

Unnamed: 0,id,sentiment,review,clean_review
0,5814_8,1,With all this stuff going down at the moment w...,with all this stuff go down at the moment with...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",the classic war of the world by timothy hines ...
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,the film start with a manager nicholas bell gi...
3,3630_4,0,It must be assumed that those who praised this...,it must be assume that those who praise this f...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious s...


As we can see, the words have been converted to their lemmas. You can try stemming as an alternative approach.

The next activity could be Named Entity Recognition (NER).

NER is an information-extraction task that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations and locations, expressions of time, quantities, monetary values, percentages, etc.

For example, "United States" represents a country. If the two words are treated as separate tokens, that meaning is lost, which does not help us predict the output accurately.

It is not very relevant for our use case, so we will not perform it.

# Now we will convert the text to numeric values (a vector) that a machine understands. For this purpose we will use the TF-IDF (term frequency / inverse document frequency) vectorizer

Along with the conversion, we will remove the stop words as well.
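To show what the vectorizer computes, here is a small hand-rolled sketch of TF-IDF. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each document vector. The three documents are made up, and this is a teaching aid, not a replacement for TfidfVectorizer.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["good movie", "bad movie", "good good plot"]
vectors = tfidf(docs)
print(vectors[1])
```

In the second document, "bad" gets a higher weight than "movie" because it appears in fewer documents, which is the point of the IDF term.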

In [145]:
from nltk.corpus import stopwords

stop_words=stopwords.words('english')  # the corpus file name is lower-case; 'English' fails on case-sensitive file systems
vectorizer=TfidfVectorizer(use_idf=True,lowercase=True,strip_accents='ascii',stop_words=stop_words,max_features=20000)

In [91]:
clean_df=pd.DataFrame()
clean_df['review']=df['clean_review']
clean_df['sentiment']=df['sentiment']

In [146]:
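vectorizer.fit(clean_df.review)  # fit on the text column; iterating over the whole DataFrame would yield only its column names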

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=20000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs',... 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"],
        strip_accents='ascii', sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [147]:
X=vectorizer.fit_transform(clean_df.review)
Y=clean_df.sentiment

In [137]:
print(X[0])

  (0, 53360)	0.03382431451126539
  (0, 22611)	0.0494022613806681
  (0, 36384)	0.027846133141277738
  (0, 36184)	0.7826918259125908
  (0, 52627)	0.0490623918302241
  (0, 32370)	0.039550996218539344
  (0, 37260)	0.05515344421511069
  (0, 60699)	0.04916467595315593
  (0, 39130)	0.039800396501580765
  (0, 15423)	0.03751090556928087
  (0, 61686)	0.07278959679939737
  (0, 36593)	0.1352104958827194
  (0, 34591)	0.0864681267049137
  (0, 60572)	0.06349384858106918
  (0, 22117)	0.015053718843290174
  (0, 8823)	0.037275292754472396
  (0, 27743)	0.0461269362189857
  (0, 23781)	0.05102403500708591
  (0, 55680)	0.03505046071148595
  (0, 45149)	0.0356545877579731
  (0, 11518)	0.0712410378807852
  (0, 16927)	0.05260937228132301
  (0, 33626)	0.02707075086222816
  (0, 35797)	0.028444617564664845
  (0, 61149)	0.03632291335629353
  :	:
  (0, 28320)	0.051539941151255333
  (0, 5178)	0.02141957988014816
  (0, 7274)	0.04450992435476921
  (0, 22360)	0.026255284423782402
  (0, 57196)	0.030913789148438037
  (0, 

In [110]:
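# get_feature_names() was removed in scikit-learn 1.2; on recent versions use get_feature_names_out() instead
vectorizer.get_feature_names()[12616]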

'latter'

In [116]:
max(vectorizer.get_feature_names())

'zwick'

In [117]:
X.shape

(25000, 25000)

In [118]:
Y.shape

(25000,)

In [148]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

x_train,x_test,y_train,y_test=train_test_split(X,Y,random_state=121)


In [121]:
x_train.shape

(18750, 25000)

In [122]:
y_train.shape

(18750,)
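The 18750 / 6250 split follows from the default test fraction of 0.25. The sketch below reproduces only the sizes with a plain shuffle; the helper name is hypothetical, and the shuffled order will not match the rows scikit-learn selects for the same seed.

```python
import math
import random

def split_indices(n_samples, test_fraction=0.25, seed=121):
    """Shuffle row indices and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(25000)
print(len(train_idx), len(test_idx))
```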

# We will train the Naive Bayes model (a conditional probability model) on the training data

In [149]:
nb=naive_bayes.MultinomialNB()
model=nb.fit(x_train,y_train)

In [151]:
y_pred=model.predict(x_test)
y_pred

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)
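For intuition about what the multinomial Naive Bayes model does, here is a minimal sketch on word counts with Laplace smoothing (alpha = 1). The four toy documents and the function names are made up. The notebook itself feeds TF-IDF weights to scikit-learn's MultinomialNB rather than raw counts.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit class priors and Laplace-smoothed word likelihoods (in log space)."""
    vocab = {w for d in docs for w in d.split()}
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        log_prior = math.log(class_counts[label] / len(labels))
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(log_likelihood[w] for w in doc.split() if w in log_likelihood)
    return max(model, key=score)

docs = ["great fun film", "great acting", "boring bad film", "bad plot"]
labels = [1, 1, 0, 0]
nb_model = train_nb(docs, labels)
prediction = predict_nb(nb_model, "great film")
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for long reviews.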

In [152]:
from sklearn.metrics import confusion_matrix,f1_score,accuracy_score,classification_report
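# Note: the scikit-learn signature is confusion_matrix(y_true, y_pred). With the arguments in this order, rows are predicted labels and columns are true labels
conf=confusion_matrix(y_pred,y_test)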
conf

array([[2724,  460],
       [ 361, 2705]], dtype=int64)

In [153]:
accuracy_score(y_pred,y_test)

0.86864

In [154]:
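# Note: the signature is classification_report(y_true, y_pred). With the arguments in this order, the precision and recall columns are swapped and support counts predictions
report=classification_report(y_pred,y_test)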
print(report)

             precision    recall  f1-score   support

          0       0.88      0.86      0.87      3184
          1       0.85      0.88      0.87      3066

avg / total       0.87      0.87      0.87      6250
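The scores can be checked by hand from the confusion matrix. The sketch below uses the matrix printed earlier and the usual definitions of precision and recall. Because the matrix was built with the predictions as the first argument, its rows are predicted labels, so precision uses row sums and recall uses column sums.

```python
# Confusion matrix as printed above: rows are predicted labels,
# columns are true labels (arguments were passed as y_pred, y_test).
conf = [[2724, 460],
        [361, 2705]]

total = sum(sum(row) for row in conf)
accuracy = (conf[0][0] + conf[1][1]) / total

# Precision and recall for class 1 under the usual definitions.
true_pos = conf[1][1]
predicted_pos = sum(conf[1])            # row sum: everything predicted as 1
actual_pos = conf[0][1] + conf[1][1]    # column sum: everything truly 1
precision_1 = true_pos / predicted_pos
recall_1 = true_pos / actual_pos

print(round(accuracy, 5), round(precision_1, 2), round(recall_1, 2))
```

The accuracy matches the 0.86864 reported above. For class 1 this gives precision 0.88 and recall 0.85, the reverse of the report, which lists them the other way round because of the argument order.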



We got an accuracy of 86.86%, which is neither great nor bad. Let's try another model.

In [160]:
from sklearn.ensemble import RandomForestClassifier
rfmodel=RandomForestClassifier(criterion='entropy')
rfmodel.fit(x_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [161]:
y_pred_rf=rfmodel.predict(x_test)

In [170]:
accuracy_score(y_pred_rf,y_test)

0.77376

# So far Naive Bayes is better

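# Now we will perform the vector conversion using deep learning (Word2Vec-style word embeddings)

We will use Keras to work with the embedding layer. Note that this layer learns its word vectors while the classifier trains. It shares the idea behind Word2Vec (a dense vector per word), but it is not the Word2Vec algorithm itself and no pre-trained Word2Vec vectors are loaded here.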

In [241]:
X_train,X_test,Y_train,Y_test=train_test_split(df['clean_review'],df['sentiment'],random_state=111,test_size=0.2)

In [242]:
X_train.shape

(20000,)

# Build the tokenizer

In [243]:
from tensorflow.keras.preprocessing.text import Tokenizer  # public import path; tensorflow.python.* is private

In [245]:
top_words=10000

In [246]:
t=Tokenizer(num_words=top_words)

In [247]:
t.fit_on_texts(X_train)

In [248]:
X_train=t.texts_to_sequences(X_train)

In [249]:
len(X_train[2])

76
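Roughly, the tokenizer ranks words by frequency (rank 1 is the most frequent) and then encodes each text as a list of ranks, silently dropping words outside the top `num_words`. The sketch below mimics that with the standard library. The helper names and toy texts are assumptions, and the real Keras Tokenizer also lower-cases and filters punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Map each word to a rank: 1 for the most frequent word, 2 for the next, ..."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index, num_words):
    """Encode texts as index lists, dropping words ranked num_words or higher."""
    return [
        [word_index[w] for w in text.split() if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

texts = ["the film be good", "the film be bad", "the plot"]
word_index = fit_word_index(texts)
sequences = texts_to_sequences(texts, word_index, num_words=4)
print(word_index)
print(sequences)
```

Index 0 is never assigned to a word, which is why it can serve as the padding value later.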

In [250]:
X_test=t.texts_to_sequences(X_test)

# We need to pad sequences to make all the reviews uniform length

In [251]:
from tensorflow.keras.preprocessing import sequence  # public import path

In [252]:
max_review_length=400

In [253]:
X_train=sequence.pad_sequences(X_train,maxlen=max_review_length,padding='post')

In [254]:
X_test=sequence.pad_sequences(X_test,maxlen=max_review_length,padding='post')

In [255]:
X_train[212]

array([   1,   81,    5,    1,   16,   72,   84,  104,   46,   48,   14,
          2,    1,  599,    5,   28,    1, 7087,    4, 2962,   12,   14,
         96,  231,  189,   91,   63,   26,  102,   14,   27,    7,   39,
         78, 2092, 3872,   24,   14,    1,    4, 1509,    6,    2,  820,
          9,    1,  194,  422,   19,   86,    1, 1640,    6,  653,  124,
          1,  246,    5, 1489,    4,  914,    4, 2069,   18,  150,  441,
         91, 1636, 1304, 5165,   41,  162,   18,    1,  176,  959,    5,
       6801,    6,   41,  274,   19,   86,    3,  208,   53,   17,   48,
          2,   57,   22,  189,   41,  108,   90, 1445, 2844,   13,   52,
         37,    1,   88,  112,   16,   96,  475,   87,   25,  196,  322,
        207,    3,  162,    9,   12, 7230,   72,    2,   95,    3,  334,
        180,   37,  587, 8840,   33, 2844,   24,    2,  110,   29,    1,
        344,    5,    1, 1532, 2348,    4,  910,    6,   63,   26,   55,
         31,    3, 1479,   18,  121,   79,  123,   

In [256]:
X_train[100]

array([  11,   15,    2,  132,   10,  104,   11, 3320,   10,  438,   20,
        104,   11,   15, 1635,  331,   10,    2,   38, 1199,   72,  104,
          7,  127,   72,   14,   36,    7,   22,  251,  157,    4,  116,
       2413,    6,  104,    7,    4,  461,    4,   14,  469,  223,  903,
          9,  461,    7,   26,   85, 2155,  425,   12,    7,    2,    2,
        485,  151,   10,  213,   80,   12,    7,    2,   23,  267,   47,
         21,    2,    3, 1266,  392,   21,   84,   99,   11,   15,    1,
       1266,    2, 1117,   62, 8585,    4,    1,   15,    2,   62,   27,
          7,    2,  446,   26,   72,   84,    2, 2196,   17,  273,  350,
        275, 2906,   72,   84,    2,  319,    7,    6,  273,  223,    1,
       1272,   31,  211,  126,    4, 1266,    9,   11,   16,    2,   38,
       5952,    4,   31,   21,  116,    6,  506,   53,   49,   18,  129,
       1266,   11,    2,  446,    3,   15,   12,   72,   84,   58,  469,
         49,    0,    0,    0,    0,    0,    0,   

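As we can see above, reviews shorter than 400 tokens have been padded with zeros at the end. Reviews longer than 400 tokens are truncated (by default pad_sequences drops tokens from the beginning).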
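The padding behaviour can be sketched in a few lines of plain Python. This hypothetical helper mirrors `padding='post'` together with the default front truncation; it is an illustration, not the Keras implementation.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate long sequences from the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                      # keep the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

padded = pad_post([[5, 8, 2], [7, 1, 9, 4, 3, 6]], maxlen=4)
print(padded)
```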

# Embedding Layer

In [257]:
Embeding_Vector_Length=100 # Number of elements in the vector representation of each of the top_words (10000) words

# Building the Graph

In [258]:
from tensorflow.keras.models import Sequential  # public import path

In [259]:
from tensorflow.keras.layers import Embedding,Dense,Dropout,Flatten  # public import path

We need to define the word embedding layer (Word2Vec-style: one dense vector per word index)

In [260]:
model=Sequential()

In [261]:
model.add(Embedding(top_words+1, # size of the vocabulary
                    Embeding_Vector_Length, # size of the vector
                    input_length=max_review_length
                   )
         )

In [262]:
model.add(Flatten())
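Conceptually, the embedding layer is a lookup table with one row per word index, and Flatten concatenates the looked-up rows into one long vector. A minimal sketch with toy sizes (not the notebook's 10001 x 100 table) and random values standing in for learned weights:

```python
import random

vocab_size, vector_length, review_length = 6, 3, 4   # toy sizes, not the notebook's

rng = random.Random(0)
# One row of vector_length numbers per word index; training would adjust these.
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(vector_length)]
                   for _ in range(vocab_size)]

review = [2, 5, 1, 0]                                 # a padded sequence of word indices
embedded = [embedding_table[index] for index in review]     # shape: 4 x 3
flattened = [x for vector in embedded for x in vector]      # shape: 12

print(len(embedded), len(embedded[0]), len(flattened))
```

With the notebook's settings the same steps turn each review into 400 x 100 values, i.e. a flattened vector of length 40000.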

In [263]:
model.add(Dense(200,activation='relu'))

In [264]:
model.add(Dense(100,activation='relu'))

In [265]:
model.add(Dense(60,activation='relu'))

In [266]:
model.add(Dense(40,activation='relu'))

In [267]:
model.add(Dense(1,activation='sigmoid'))

In [268]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
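It can help to count the trainable parameters of the network defined above. The arithmetic below follows directly from the layer sizes in this notebook: an embedding table of (top_words + 1) x 100, then dense layers whose weight count is inputs x units plus one bias per unit. The lower-case variable names are my own.

```python
top_words = 10000
embedding_vector_length = 100
max_review_length = 400
dense_units = [200, 100, 60, 40, 1]

embedding_params = (top_words + 1) * embedding_vector_length
inputs = max_review_length * embedding_vector_length   # size after Flatten
dense_params = []
for units in dense_units:
    dense_params.append(inputs * units + units)        # weights + biases
    inputs = units

total_params = embedding_params + sum(dense_params)
print(embedding_params, dense_params, total_params)
```

The first dense layer after Flatten accounts for about 8 million of the roughly 9 million parameters, which is large relative to 20000 training reviews and worth keeping in mind when judging overfitting.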

# We need to train the model now

In [269]:
model.fit(X_train,Y_train,epochs=10,batch_size=128,validation_data=(X_test,Y_test))

Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x3a5b0d30>

So, after using the Word2Vec-style deep learning approach as well, our accuracy score is more or less the same (86.9%). I hope this gives you an idea of how to analyse text using both approaches.