<img style="float: left;" src="pic2.png">


### Sridhar Palle, Ph.D, spalle@emory.edu (Applied ML & DS with Python Program)

# Text Mining & Sentiment Analysis with NLTK and Sklearn

**Import the libraries and dependencies**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from bs4 import BeautifulSoup
import nltk
import contractions
%matplotlib inline

In [2]:
#nltk.download('all', halt_on_error=False) # do this only once if never done before

## 1. Load the Dataset 

**Let's load it and store it in imdb_big**

In [3]:
imdb_big = pd.read_csv('movie_reviews.csv')

In [4]:
imdb_big.head(3)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive


In [5]:
imdb_big.shape

(50000, 2)

In [6]:
imdb_big['review'].describe()

count                                                 50000
unique                                                49582
top       Loved today's show!!! It was a variety and not...
freq                                                      5
Name: review, dtype: object

In [7]:
imdb_big['review'][0:5]

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object

In [8]:
imdb_big['sentiment'].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

**Q. How can we find the longest review?**
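One possible answer, sketched on a small made-up DataFrame (the `toy` frame and its reviews are hypothetical stand-ins for imdb_big): measure each review with `.str.len()` and take the index of the maximum.

```python
import pandas as pd

# Hypothetical stand-in for imdb_big; only the 'review' column matters here.
toy = pd.DataFrame({'review': ['Great film.',
                               'A long, rambling and ultimately pointless movie.',
                               'Meh.']})

lengths = toy['review'].str.len()      # number of characters in each review
longest_idx = lengths.idxmax()         # index label of the longest review
print(longest_idx, lengths[longest_idx])
print(toy.loc[longest_idx, 'review'])
```

On the real data the same two lines would be applied to imdb_big['review']. To measure length in words rather than characters, `.str.split().str.len()` could be used instead.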

## 2. Sentiment Analysis just with Sklearn (with basic preprocessing)

### 2.1 Bag of Words Approach

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

In [10]:
cv = CountVectorizer()
bow = cv.fit_transform(imdb_big['review'])
bow.shape

(50000, 101895)

In [11]:
cv = CountVectorizer(stop_words='english')
bow = cv.fit_transform(imdb_big['review'])
bow.shape

(50000, 101583)

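**We rely on CountVectorizer's built-in preprocessing. Its arguments cover the common text preprocessing options**
* lowercase = True (this is the default)
* stop_words = 'english' (must be passed explicitly; the default is None)
* token_pattern - the default pattern keeps only tokens of two or more alphanumeric characters, which drops punctuation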
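The effect of these options can be seen on a single string without fitting anything. This is a minimal sketch that uses `build_analyzer()`, which returns the function CountVectorizer applies to each document. The example sentence is one of the short reviews in the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = 'What a script, what a story, what a mess!'

# build_analyzer() returns the callable CountVectorizer runs on each document:
# lowercasing, tokenizing with token_pattern, then stop word filtering.
default_tokens = CountVectorizer().build_analyzer()(text)
filtered_tokens = CountVectorizer(stop_words='english').build_analyzer()(text)

print(default_tokens)   # punctuation and the one-letter token 'a' are gone
print(filtered_tokens)  # 'what' is also removed as an English stop word
```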

**Let's check a short review to see what CountVectorizer has done**

In [12]:
imdb_big['review'].str.len().sort_values().head() # the five shortest reviews, by character count

27521    32
31072    41
40817    49
28920    51
19874    52
Name: review, dtype: int64

In [13]:
imdb_big['review'][31072]

'What a script, what a story, what a mess!'

In [14]:
print(bow[31072]) # printing the sparse row shows which columns hold non-zero counts

  (0, 79154)	1
  (0, 85982)	1
  (0, 57605)	1


In [15]:
rows, col = bow[31072].nonzero()
# nonzero() directly returns the row and column indices of the non-zero entries
col

array([79154, 85982, 57605])

In [16]:
cv.vocabulary_ # let's look at the vocabulary to see the mapping from words to column indices

{'reviewers': 75307,
 'mentioned': 57407,
 'watching': 97949,
 'just': 48334,
 'oz': 65272,
 'episode': 30033,
 'll': 53078,
 'hooked': 42725,
 'right': 75711,
 'exactly': 30888,
 'happened': 40335,
 'br': 11989,
 'thing': 90149,
 'struck': 86368,
 'brutality': 12922,
 'unflinching': 94350,
 'scenes': 78541,
 'violence': 96866,
 'set': 80101,
 'word': 99790,
 'trust': 92697,
 'faint': 31830,
 'hearted': 41057,
 'timid': 90720,
 'pulls': 71391,
 'punches': 71438,
 'regards': 73904,
 'drugs': 27374,
 'sex': 80196,
 'hardcore': 40397,
 'classic': 17478,
 'use': 95543,
 'called': 14048,
 'nickname': 62053,
 'given': 37368,
 'oswald': 64572,
 'maximum': 56385,
 'security': 79484,
 'state': 85330,
 'penitentary': 66873,
 'focuses': 34150,
 'mainly': 54835,
 'emerald': 29180,
 'city': 17324,
 'experimental': 31282,
 'section': 79463,
 'prison': 70404,
 'cells': 15522,
 'glass': 37444,
 'fronts': 35325,
 'face': 31709,
 'inwards': 46429,
 'privacy': 70420,
 'high': 41839,
 'agenda': 3017,
 'em

In [17]:
imdb_big['review'][31072]

'What a script, what a story, what a mess!'

In [18]:
[k for k,v in cv.vocabulary_.items() if v in col]

['mess', 'story', 'script']
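The list comprehension above scans the whole vocabulary for every lookup. If many rows need decoding, one alternative is to invert `vocabulary_` once. The sketch below uses a two-document toy corpus (hypothetical data, as are the names `cv_toy`, `bow_toy` and `index_to_word`) rather than the full dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for imdb_big['review'] (hypothetical data).
docs = ['What a script, what a story, what a mess!',
        'A wonderful story with a clever script.']
cv_toy = CountVectorizer(stop_words='english')
bow_toy = cv_toy.fit_transform(docs)

# Invert vocabulary_ once so that column index -> word is a dictionary lookup.
index_to_word = {idx: word for word, idx in cv_toy.vocabulary_.items()}

_, cols = bow_toy[0].nonzero()           # column indices present in the first document
words = sorted(index_to_word[c] for c in cols)
print(words)
```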

### 2.2 Tfidf

In [19]:
from sklearn.feature_extraction.text import TfidfTransformer

tf = TfidfTransformer()
tfidf = tf.fit_transform(bow)
tfidf.shape

(50000, 101583)

In [20]:
tfidf # it is also a sparse matrix

<50000x101583 sparse matrix of type '<class 'numpy.float64'>'
	with 4434500 stored elements in Compressed Sparse Row format>
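For reference, with its default settings (`smooth_idf=True`, `norm='l2'`) TfidfTransformer weights each raw count by idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t, and then scales every row to unit L2 length. The sketch below reproduces this by hand on a made-up three-document corpus and compares it with sklearn's result.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpus, used only to check the formula.
docs = ['good movie', 'bad movie', 'good good plot']
counts = CountVectorizer().fit_transform(docs)
tfidf_sklearn = TfidfTransformer().fit_transform(counts).toarray()

tf = counts.toarray().astype(float)           # raw term counts
n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)                     # documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed inverse document frequency
weighted = tf * idf
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize rows

print(np.allclose(tfidf_manual, tfidf_sklearn))
```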

**For analyzing the sentiment, we can fit the ML algorithm on either bow or tfidf**

### 2.3 Test/Train Split

In [29]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf,imdb_big['sentiment'], test_size = 0.2, random_state=30)

In [30]:
X_train.shape

(40000, 101583)

In [31]:
X_test.shape

(10000, 101583)

In [32]:
type(X_train)

scipy.sparse.csr.csr_matrix

### 2.4 Fitting a Logistic Regression Model directly on this sparse matrix

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

In [34]:
logr = LogisticRegression(random_state=0)
logr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

####  Predictions on the training and test data sets

In [35]:
ypred_train = logr.predict(X_train)
ypred_test = logr.predict(X_test)
print ('Predictions on training data for the first 3 reviews :', ypred_train[0:3])
print ('Predictions on the test data for the first 3 reviews :',ypred_test[0:3])

Predictions on training data for the first 3 reviews : ['negative' 'positive' 'negative']
Predictions on the test data for the first 3 reviews : ['positive' 'positive' 'positive']


#### Confusion matrix and accuracies on training and test data

In [36]:
print ('Confusion Matrix and Accuracies for Training Data')
print (confusion_matrix(y_train, ypred_train))
print (logr.score(X_train, y_train), '\n')

print ('Confusion Matrix and Accuracies for Test Data')
print (confusion_matrix(y_test, ypred_test))
print (logr.score(X_test, y_test), '\n')

Confusion Matrix and Accuracies for Training Data
[[18568  1455]
 [ 1135 18842]]
0.93525 

Confusion Matrix and Accuracies for Test Data
[[4349  628]
 [ 439 4584]]
0.8933 
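The accuracy can be read straight off the confusion matrix, along with precision and recall for the positive class. This is a small sketch using the test-set matrix printed above. It assumes sklearn's convention that rows are true labels and columns are predictions, with labels in sorted order (negative, positive).

```python
import numpy as np

# Test-set confusion matrix copied from the output above.
# Rows are true labels, columns are predictions, ordered (negative, positive).
cm = np.array([[4349, 628],
               [439, 4584]])

tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)   # share of predicted positives that are truly positive
recall_pos = tp / (tp + fn)      # share of truly positive reviews that were found

print(accuracy)  # 0.8933, matching logr.score(X_test, y_test)
print(round(precision_pos, 4), round(recall_pos, 4))
```

The already imported `classification_report(y_test, ypred_test)` reports the same per-class quantities in one call.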



**Let's look at the cross-validation score**

In [37]:
from sklearn.model_selection import cross_val_score

cross_val_score(logr,tfidf,imdb_big['sentiment'],cv=5).mean()



0.8947800000000001
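One caveat: tfidf was fitted on all 50,000 reviews before any splitting, so the vocabulary and IDF weights were learned partly from reviews that later serve as test or validation data. The effect is probably small here, but a common way to avoid it is to put the vectorizer and the classifier in a Pipeline and cross-validate on the raw text. The sketch below does that on a tiny made-up corpus (the `reviews` and `labels` lists are hypothetical).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus standing in for imdb_big (hypothetical data).
reviews = ['wonderful acting and a great story', 'terrible plot and awful acting',
           'great fun, wonderful cast', 'awful, boring and terrible',
           'a great and wonderful film', 'boring story, terrible film',
           'wonderful, great, fun', 'awful plot, boring cast']
labels = ['positive', 'negative'] * 4

# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step.
pipe = make_pipeline(TfidfVectorizer(stop_words='english'),
                     LogisticRegression(random_state=0))

# The vectorizer is refit on the training part of every fold,
# so the held-out reviews never influence the vocabulary or IDF weights.
scores = cross_val_score(pipe, reviews, labels, cv=2)
print(scores.mean())
```

On the real data, `reviews` and `labels` would be imdb_big['review'] and imdb_big['sentiment'].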

## 3. Sentiment Analysis with both NLTK and Sklearn

**Text normalization or preprocessing steps**

* Converting to lowercase
* Removing HTML tags
* Expanding contractions
* Removing punctuation
* Removing stop words
* Stemming or lemmatization

We have already defined functions that perform these steps. They all live in the
Text_Preprocessing.py file, so we can import them directly from there.

In [38]:
from Text_Preprocessing import lower_case,html_parser,replace_contractions
from Text_Preprocessing import remove_special, remove_stopwords, word_stem

# remember, we are importing from a .py file, not from a .ipynb notebook

**Preprocessing with the above imported functions**

In [39]:
def text_preprocess(text):
    text = lower_case(text) # convert to lower case
    text = html_parser(text) # remove HTML tags
    text = replace_contractions(text) # expand contractions, e.g. haven't -> have not
    text = remove_special(text) # remove special characters such as @, #, %, $
    text = remove_stopwords(text) # remove stop words, e.g. and, the
    text = word_stem(text, 'lemmatize') # stem or lemmatize
    return text
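The contents of Text_Preprocessing.py are not shown in this notebook. To give a rough idea of what such steps involve, here is a simplified stand-in that uses only the standard library. The function `simple_preprocess`, the tiny `CONTRACTIONS` dictionary and the short `STOP_WORDS` set are all hypothetical, and lemmatization is left out. The real helpers presumably rely on BeautifulSoup, contractions and NLTK, and may behave differently.

```python
import re

# Hypothetical stand-ins; the real Text_Preprocessing.py is not shown and may differ.
CONTRACTIONS = {"haven't": "have not", "don't": "do not", "it's": "it is"}
STOP_WORDS = {"a", "an", "and", "the", "is", "it", "i", "this", "of", "to"}

def simple_preprocess(text):
    text = text.lower()                           # convert to lower case
    text = re.sub(r"<[^>]+>", " ", text)          # drop HTML tags such as <br />
    for short, full in CONTRACTIONS.items():      # expand a few known contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # remove punctuation and special characters
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)

print(simple_preprocess("I haven't seen this.<br /><br />It's GREAT!"))
# prints: have not seen great
```

Note that this sketch keeps the word "not", which can matter for sentiment. Whether the imported remove_stopwords does the same depends on its stop word list.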

In [41]:
imdb_big.head(3)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive


In [42]:
prep_review = []
for rev in imdb_big['review']:
    prep_review.append(text_preprocess(rev))
    
imdb_big['prep_review'] = prep_review
imdb_big.head()

Unnamed: 0,review,sentiment,prep_review
0,One of the other reviewers has mentioned that ...,positive,one reviewer mentioned watching 1 oz episode h...
1,A wonderful little production. <br /><br />The...,positive,wonderful little production filming technique ...
2,I thought this was a wonderful way to spend ti...,positive,thought wonderful way spend time hot summer we...
3,Basically there's a family where a little boy ...,negative,basically family little boy jake think zombie ...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter mattei love time money visually stunnin...
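The explicit loop can also be written with `Series.apply`, which calls the function on every review and returns a new Series. This is a minimal sketch on made-up data, with `toy_preprocess` as a hypothetical placeholder for text_preprocess.

```python
import pandas as pd

def toy_preprocess(text):
    # Hypothetical placeholder for text_preprocess: lowercase and strip whitespace.
    return text.lower().strip()

toy = pd.DataFrame({'review': ['  A Wonderful Film ', 'BORING plot']})
toy['prep_review'] = toy['review'].apply(toy_preprocess)  # one call per review
print(toy)
```

With the real function this would read imdb_big['prep_review'] = imdb_big['review'].apply(text_preprocess). The result is the same as the loop; it is mainly shorter to write.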


### 3.1 Bag of Words

In [43]:
# We are still going to leverage sklearn for creating a bag of words sparse matrix.
# The difference from the previous scenario (section 2) is that we have preprocessed the reviews using our custom functions
# we will pass the newly created prep_review column to the CountVectorizer transformation.

In [44]:
from sklearn.feature_extraction.text import CountVectorizer
cv_prep = CountVectorizer()
bow_prep = cv_prep.fit_transform(imdb_big['prep_review'])
bow_prep.shape

(50000, 94065)

### 3.1 TFidf

In [45]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_prep = TfidfTransformer()
tfidf_prep = tf_prep.fit_transform(bow_prep)
tfidf_prep.shape

(50000, 94065)

**For analyzing the sentiment, we can again fit the ML algorithm on either bow_prep or tfidf_prep**

### 3.2 Test/Train Split

In [46]:
from sklearn.model_selection import train_test_split
X_train_prep, X_test_prep, y_train_prep, y_test_prep = train_test_split(tfidf_prep,imdb_big['sentiment'], test_size = 0.2, random_state=30)

In [47]:
X_train_prep.shape

(40000, 94065)

In [48]:
X_test_prep.shape

(10000, 94065)

In [49]:
type(X_train_prep)

scipy.sparse.csr.csr_matrix

### 3.3 Fitting a Logistic Regression Model directly on this sparse matrix

In [50]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

In [51]:
logr_prep = LogisticRegression(random_state=0)
logr_prep.fit(X_train_prep, y_train_prep)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

**Predictions on the training and test data sets**

In [52]:
ypred_train_prep = logr_prep.predict(X_train_prep)
ypred_test_prep = logr_prep.predict(X_test_prep)
print ('Predictions on training data for the first 3 reviews :', ypred_train_prep[0:3])
print ('Predictions on the test data for the first 3 reviews :',ypred_test_prep[0:3])

Predictions on training data for the first 3 reviews : ['negative' 'positive' 'negative']
Predictions on the test data for the first 3 reviews : ['positive' 'positive' 'positive']


**Confusion matrix and accuracies on training and test data**

In [53]:
print ('Confusion Matrix and Accuracies for Training Data')
print (confusion_matrix(y_train_prep, ypred_train_prep))
print (accuracy_score(y_train_prep, ypred_train_prep), '\n')

print ('Confusion Matrix and Accuracies for Test Data')
print (confusion_matrix(y_test_prep, ypred_test_prep))
print (accuracy_score(y_test_prep, ypred_test_prep), '\n')

Confusion Matrix and Accuracies for Training Data
[[18511  1512]
 [ 1168 18809]]
0.933 

Confusion Matrix and Accuracies for Test Data
[[4357  620]
 [ 435 4588]]
0.8945 



In [54]:
from sklearn.model_selection import cross_val_score

cross_val_score(logr_prep,tfidf_prep,imdb_big['sentiment'],cv=5).mean()



0.8948599999999999

### 3.4 Improving the accuracy of the models (Optional)

Options
* More text preprocessing (n-grams, spelling correction)
* More data (more examples or reviews; we have only 50,000)
* Investigate different ML models
* Regularization
* Hyperparameter Tuning
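As an illustration of the n-gram option, CountVectorizer accepts an `ngram_range` argument. With `ngram_range=(1, 2)` it emits word pairs in addition to single words, so a phrase such as "not good" becomes a feature of its own. This is a small sketch on a made-up sentence; whether it improves accuracy on this dataset is not tested here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) keeps single words (unigrams) and adds adjacent word pairs (bigrams).
analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
tokens = analyze('Not good at all')
print(tokens)
```

Bigrams increase the number of columns considerably, so arguments such as `min_df` or `max_features` are often used alongside them to keep the matrix manageable.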

**RandomForest**

In [55]:
from sklearn.ensemble import RandomForestClassifier

In [56]:
rf = RandomForestClassifier()
rf.fit(X_train_prep, y_train_prep)



RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [57]:
ypred_train_rf = rf.predict(X_train_prep)
ypred_test_rf = rf.predict(X_test_prep)
print ('Predictions on training data for the first 3 reviews :', ypred_train_rf[0:3])
print ('Predictions on the test data for the first 3 reviews :',ypred_test_rf[0:3])

Predictions on training data for the first 3 reviews : ['negative' 'positive' 'negative']
Predictions on the test data for the first 3 reviews : ['positive' 'positive' 'positive']


In [58]:
print ('Confusion Matrix and Accuracies for Training Data')
print (confusion_matrix(y_train_prep, ypred_train_rf))
print (accuracy_score(y_train_prep, ypred_train_rf), '\n')

print ('Confusion Matrix and Accuracies for Test Data')
print (confusion_matrix(y_test_prep, ypred_test_rf))
print (accuracy_score(y_test_prep, ypred_test_rf), '\n')

Confusion Matrix and Accuracies for Training Data
[[19978    45]
 [  202 19775]]
0.993825 

Confusion Matrix and Accuracies for Test Data
[[4135  842]
 [1443 3580]]
0.7715 



**Evaluating cross-val score with RandomForest**

In [None]:
cross_val_score(rf,tfidf_prep,imdb_big['sentiment'],cv=5).mean()

**GridSearchCV for tuning hyperparameters**

In [None]:
from sklearn.model_selection import GridSearchCV

**GridSearchCV for Logistic Regression**

In [None]:
params = {'C': [0.001, 0.01, 0.1, 0.5, 1, 5, 10, 100]} # duplicate 0.001 removed, values sorted
logr_gscv = GridSearchCV(logr_prep, params, cv = 5)
logr_gscv.fit(tfidf_prep,imdb_big['sentiment'])

In [None]:
logr_gscv.best_score_ # Best score

In [None]:
logr_gscv.best_params_ #best params
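Beyond the best score and parameters, the fitted search object keeps the per-candidate results in `cv_results_`, which is convenient to inspect as a DataFrame. The sketch below runs a small grid on synthetic data from `make_classification` (a hypothetical stand-in for tfidf_prep and the sentiment labels), so it finishes in seconds.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for tfidf_prep and the sentiment labels (hypothetical).
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

grid = GridSearchCV(LogisticRegression(max_iter=1000), {'C': [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_toy, y_toy)

# One row per candidate value of C, with mean and spread of the fold scores.
results = pd.DataFrame(grid.cv_results_)
print(results[['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']])
print(grid.best_params_, grid.best_score_)
```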

**GridSearchCV for Random Forest**

In [None]:
params_rf = {'n_estimators':[100, 500, 1000], 'max_depth': [3, 5, 8, 10]}
rf_gscv = GridSearchCV(rf, params_rf, cv = 5)
rf_gscv.fit(tfidf_prep,imdb_big['sentiment'])

In [None]:
rf_gscv.best_score_ # Best score

In [None]:
rf_gscv.best_params_ #best params