## Tutorial 20. Sentiment analysis

Created by Emanuel Flores-Bautista 2019  All content contained in this notebook is licensed under a [Creative Commons License 4.0 BY NC](https://creativecommons.org/licenses/by-nc/4.0/). The code is licensed under a [MIT license](https://opensource.org/licenses/MIT).


In [1]:
import numpy as np
import pandas as pd
import scipy.stats as st
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from keras.datasets import imdb

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

Using TensorFlow backend.


In [2]:
import matplotlib.pyplot as plt

We will train a classifier movie for reviews in the IMDB data set.

In [3]:
vocabulary_size = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
print('Loaded dataset with {} training samples,{} test samples'.format(len(X_train), len(X_test)))

Loaded dataset with 25000 training samples,25000 test samples


In [4]:
len(X_train[0])

218

In [5]:
print('---review---')
print(X_train[6])
print('---label---')
print(y_train[6])

---review---
[1, 2, 365, 1234, 5, 1156, 354, 11, 14, 2, 2, 7, 1016, 2, 2, 356, 44, 4, 1349, 500, 746, 5, 200, 4, 4132, 11, 2, 2, 1117, 1831, 2, 5, 4831, 26, 6, 2, 4183, 17, 369, 37, 215, 1345, 143, 2, 5, 1838, 8, 1974, 15, 36, 119, 257, 85, 52, 486, 9, 6, 2, 2, 63, 271, 6, 196, 96, 949, 4121, 4, 2, 7, 4, 2212, 2436, 819, 63, 47, 77, 2, 180, 6, 227, 11, 94, 2494, 2, 13, 423, 4, 168, 7, 4, 22, 5, 89, 665, 71, 270, 56, 5, 13, 197, 12, 161, 2, 99, 76, 23, 2, 7, 419, 665, 40, 91, 85, 108, 7, 4, 2084, 5, 4773, 81, 55, 52, 1901]
---label---
1


Note that the review is stored as a sequence of integers. From the [Keras documentation](https://keras.io/datasets/) we can see that these are words IDs that have been pre-assigned to individual words, and the label is an integer (0 for negative, 1 for positive). We can go ahead and access the words from each review with the `get_word_index()` method from the `imdb` object.

In [6]:
word2id = imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}
print('---review with words---')
print([id2word.get(i, ' ') for i in X_train[6]])
print('---label---')
print(y_train[6])

---review with words---
['the', 'and', 'full', 'involving', 'to', 'impressive', 'boring', 'this', 'as', 'and', 'and', 'br', 'villain', 'and', 'and', 'need', 'has', 'of', 'costumes', 'b', 'message', 'to', 'may', 'of', 'props', 'this', 'and', 'and', 'concept', 'issue', 'and', 'to', "god's", 'he', 'is', 'and', 'unfolds', 'movie', 'women', 'like', "isn't", 'surely', "i'm", 'and', 'to', 'toward', 'in', "here's", 'for', 'from', 'did', 'having', 'because', 'very', 'quality', 'it', 'is', 'and', 'and', 'really', 'book', 'is', 'both', 'too', 'worked', 'carl', 'of', 'and', 'br', 'of', 'reviewer', 'closer', 'figure', 'really', 'there', 'will', 'and', 'things', 'is', 'far', 'this', 'make', 'mistakes', 'and', 'was', "couldn't", 'of', 'few', 'br', 'of', 'you', 'to', "don't", 'female', 'than', 'place', 'she', 'to', 'was', 'between', 'that', 'nothing', 'and', 'movies', 'get', 'are', 'and', 'br', 'yes', 'female', 'just', 'its', 'because', 'many', 'br', 'of', 'overly', 'to', 'descent', 'people', 'time', 

Because we cannot feed the index matrix directly to the classifier, we need to perform some data wrangling and feature extraction abilities. We're going to write a couple of functions, in order to 

1. Get a list of reviews, consisting of full length strings. 
2. Perform TF-IDF feature extraction on the reviews documents. 

In [7]:
def get_joined_rvw(X):
    
    """
    
    Given an X_train or X_test dataset from the IMDB reviews
    of Keras, return a list of the reviews in string format. 
    
    """
    
    #Get word to index dictionary
    word2id = imdb.get_word_index()
    #Get index to word mapping dictionary
    id2word = {i: word for word, i in word2id.items()}
    
    #Initialize reviews list
    doc_list = []
    
    for review in X:
        #Extract review
        initial_rvw = [id2word.get(i) for i in review]
        
        #Join strings followed by spaces
        joined_rvw = " ".join(initial_rvw)
        
        #Append review to the doc_list
        doc_list.append(joined_rvw)
        
    return doc_list

In [8]:
def get_data_from_keras_imdb():
    
    """
    
    Extract TF-IDF matrices for the Keras IMDB dataset. 
    
    """
    vocabulary_size = 1000
    (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
    
    #X = np.vstack([X_train[:, None], X_test[:, None]])
    
    X_train_docs = get_joined_rvw(X_train)
    X_test_docs = get_joined_rvw(X_test)
    
    
    tf_idf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=vocabulary_size,
                                   stop_words='english')
    
    tf_idf_train = tf_idf_vectorizer.fit_transform(X_train_docs)
    
    tf_idf_test = tf_idf_vectorizer.fit_transform(X_test_docs)
    
    #tf_idf_feature_names = tf_idf_vectorizer.get_feature_names() 
    
    #tf_idf = np.vstack([tf_idf_train.toarray(), tf_idf_test.toarray()])
    
    #X_new = pd.DataFrame(tf_idf, columns=tf_idf_feature_names)
    
    X_train_new = tf_idf_train.toarray()
    
    X_test_new = tf_idf_test.toarray()

    
    return X_train_new, y_train, X_test_new, y_test 

In [9]:
X_train, y_train, X_test, y_test  = get_data_from_keras_imdb()

In [10]:
print('train dataset shape', X_train.shape)
print('test dataset shape', X_test.shape)

train dataset shape (25000, 745)
test dataset shape (25000, 745)


We can readily see that we are ready to train our classification algorithm with the TF-IDF matrices. 

In [11]:
model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [12]:
print(classification_report(y_test, y_pred))
print('Accuracy score : ', accuracy_score(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.84      0.67      0.75     12500
          1       0.73      0.88      0.80     12500

avg / total       0.79      0.78      0.77     25000

Accuracy score :  0.77504


In [None]:
model = MLPClassifier()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print('Accuracy score : ', accuracy_score(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.84      0.81      0.83     12500
          1       0.82      0.84      0.83     12500

avg / total       0.83      0.83      0.83     25000

Accuracy score :  0.82752


In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X_train, y_train, cv=5)

In [None]:
import manu_utils as TCD
palette = TCD.palette(cmap = True)

In [None]:
C = confusion_matrix(y_test, y_pred)
c_normed = C / C.astype(np.float).sum(axis=1) [:, np.newaxis]

sns.heatmap(c_normed, cmap = palette, xticklabels=['negative', 'positive'], 
           yticklabels=['negative', 'positive'], annot= True, vmin = 0, vmax = 1, 
           cbar_kws = {'label': 'recall'})

#

plt.ylabel('True label')
plt.xlabel('Predicted label');