## Tutorial 20. Sentiment analysis

Created by Emanuel Flores-Bautista, 2019. All content contained in this notebook is licensed under a [Creative Commons License 4.0 BY NC](https://creativecommons.org/licenses/by-nc/4.0/). The code is licensed under a [MIT license](https://opensource.org/licenses/MIT).


In [None]:
import numpy as np
import pandas as pd
import scipy.stats as st
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from keras.datasets import imdb
import TCD19_utils as TCD

TCD.set_plotting_style_2()

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

In [None]:
import matplotlib.pyplot as plt

We will train a classifier for movie reviews in the IMDB data set.

In [None]:
vocabulary_size = 5000
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
print('Loaded dataset with {} training samples, {} test samples'.format(len(X_train), len(X_test)))

In [None]:
len(X_train[0])

In [None]:
print('---review---')
print(X_train[6])
print('---label---')
print(y_train[6])

Note that the review is stored as a sequence of integers. From the [Keras documentation](https://keras.io/datasets/) we can see that these are word IDs that have been pre-assigned to individual words, and that the label is an integer (0 for negative, 1 for positive). We can recover the words of each review with the `get_word_index()` function of the `imdb` module.

In [None]:
word2id = imdb.get_word_index()
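# load_data() shifts every word index by 3 by default (index_from=3):
# 0 is padding, 1 marks the start of a review, 2 marks out-of-vocabulary words
id2word = {i + 3: word for word, i in word2id.items()}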
print('---review with words---')
print([id2word.get(i, ' ') for i in X_train[6]])
print('---label---')
print(y_train[6])
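
One detail worth keeping in mind when decoding: according to the Keras documentation, `imdb.load_data()` shifts every word index by `index_from=3` by default, reserving 0 for padding, 1 for the start-of-review marker and 2 for out-of-vocabulary words, while `get_word_index()` returns the unshifted indices. The sketch below illustrates the shift with a tiny made-up word index, so it runs without downloading the dataset. The names `toy_word2id` and `decode_review` are hypothetical and not part of Keras.

```python
# Toy stand-in for imdb.get_word_index(): word -> rank, starting at 1.
toy_word2id = {"the": 1, "movie": 2, "was": 3, "great": 4}

# Offset applied by imdb.load_data() by default (index_from=3).
INDEX_FROM = 3
# Reserved indices in the encoded reviews.
SPECIAL_TOKENS = {0: "<pad>", 1: "<start>", 2: "<oov>"}


def decode_review(encoded, word2id, index_from=INDEX_FROM):
    """Map a list of shifted word indices back to a list of tokens."""
    id2word = {i + index_from: word for word, i in word2id.items()}
    id2word.update(SPECIAL_TOKENS)
    # Any index missing from the mapping is shown as "<unk>".
    return [id2word.get(i, "<unk>") for i in encoded]


# 1 = start marker, 2 = out-of-vocabulary word, the rest are rank + 3.
encoded_review = [1, 4, 5, 6, 2, 7]
decoded = decode_review(encoded_review, toy_word2id)
print(" ".join(decoded))
```

Keeping the reserved tokens visible, rather than silently dropping them, makes it easier to spot how much of a review fell outside the chosen vocabulary size.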

Because we cannot feed the sequences of word indices directly to the classifier, we need to do some data wrangling and feature extraction. We're going to write a couple of functions in order to:

1. Get a list of reviews, each as a full-length string.
2. Perform TF-IDF feature extraction on the review documents.
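
As a rough illustration of step 2, the sketch below computes TF-IDF weights by hand for three made-up reviews. It assumes the smoothed formula that scikit-learn documents as the default for `TfidfVectorizer`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The helper name `tf_idf` and the toy documents are hypothetical.

```python
import math
from collections import Counter

docs = [
    "great movie great cast",
    "terrible movie",
    "great fun",
]


def tf_idf(documents):
    """Return (vocabulary, matrix) with one L2-normalized row per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)

    # Document frequency: number of documents containing each term.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times IDF, then scale the row to unit length.
        row = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in row))
        matrix.append([w / norm for w in row])
    return vocabulary, matrix


vocabulary, weights = tf_idf(docs)
for doc, row in zip(docs, weights):
    print(doc, [round(w, 2) for w in row])
```

In the second toy review, "terrible" appears in fewer documents than "movie", so it receives the larger weight; this is the property that lets rarer, more informative words stand out.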

In [None]:
def get_joined_rvw(X):
    
    """
    
    Given an X_train or X_test dataset from the IMDB reviews
    of Keras, return a list of the reviews in string format. 
    
    """
    
    #Get word to index dictionary
    word2id = imdb.get_word_index()
    #Get index to word mapping dictionary
    id2word = {i: word for word, i in word2id.items()}
    
    #Initialize reviews list
    doc_list = []
    
    for review in X:
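        #Extract review, undoing the default offset of 3 applied by load_data
        #(indices 0-2 are padding, start and out-of-vocabulary markers)
        initial_rvw = [id2word[i - 3] for i in review if (i - 3) in id2word]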
        
        #Join strings followed by spaces
        joined_rvw = " ".join(initial_rvw)
        
        #Append review to the doc_list
        doc_list.append(joined_rvw)
        
    return doc_list

In [None]:
def get_data_from_keras_imdb():
    
    """
    
    Extract TF-IDF matrices for the Keras IMDB dataset. 
    
    """
    vocabulary_size = 1000
    (X_train, y_train), (X_test, y_test) = imdb.load_data(num_words = vocabulary_size)
    
    #X = np.vstack([X_train[:, None], X_test[:, None]])
    
    X_train_docs = get_joined_rvw(X_train)
    X_test_docs = get_joined_rvw(X_test)
    
    
    tf_idf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2,
                                   max_features=vocabulary_size,
                                   stop_words='english')
    
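    #Learn the vocabulary and IDF weights from the training reviews only
    tf_idf_train = tf_idf_vectorizer.fit_transform(X_train_docs)
    
    #Reuse the fitted vocabulary and IDF weights on the test reviews
    tf_idf_test = tf_idf_vectorizer.transform(X_test_docs)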
    
    #tf_idf_feature_names = tf_idf_vectorizer.get_feature_names_out()
    
    #tf_idf = np.vstack([tf_idf_train.toarray(), tf_idf_test.toarray()])
    
    #X_new = pd.DataFrame(tf_idf, columns=tf_idf_feature_names)
    
    X_train_new = tf_idf_train.toarray()
    
    X_test_new = tf_idf_test.toarray()

    
    return X_train_new, y_train, X_test_new, y_test 

In [None]:
X_train, y_train, X_test, y_test  = get_data_from_keras_imdb()

In [None]:
print('train dataset shape', X_train.shape)
print('test dataset shape', X_test.shape)

The shapes show that each TF-IDF matrix has one row per review and one column per vocabulary term, so we can now train our classification algorithms on them.

In [None]:
model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))
print('Accuracy score : ', accuracy_score(y_test, y_pred))
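
To make the numbers in the report concrete, here is a small sketch that computes accuracy, precision, recall and F1 for the positive class from scratch. The labels are made up for illustration, and `binary_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 = positive review, 0 = negative review.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 1]
metrics = binary_metrics(toy_true, toy_pred)
print(metrics)
```

Precision asks how many of the reviews predicted positive really are positive, while recall asks how many of the truly positive reviews were found; the two can differ noticeably even when accuracy looks reasonable.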

In [None]:
model = MLPClassifier()

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred))
print('Accuracy score : ', accuracy_score(y_test, y_pred))

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(model, X_train, y_train, cv=5)
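
`cross_val_score` with `cv=5` splits the training set into five folds, trains on four of them and scores on the remaining one, rotating the held-out fold each time. The sketch below shows only the index bookkeeping of a plain, unshuffled k-fold split. `k_fold_indices` is a hypothetical helper; the scikit-learn documentation states that an integer `cv` uses stratified folds for classifiers, which this sketch does not attempt to reproduce.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(n_samples=12, n_splits=5))
for train, val in folds:
    print("train:", train, "validation:", val)
```

Every sample lands in exactly one validation fold, so the five scores together use the whole training set for evaluation without ever scoring a model on data it was fitted on.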

In [None]:
import manu_utils as TCD
palette = TCD.palette(cmap = True)

In [None]:
C = confusion_matrix(y_test, y_pred)
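# Normalize each row (true label) by its total so the diagonal shows recall
# (np.float was removed from recent NumPy releases; the builtin float works)
c_normed = C / C.astype(float).sum(axis=1)[:, np.newaxis]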

sns.heatmap(c_normed, cmap = palette, xticklabels=['negative', 'positive'], 
           yticklabels=['negative', 'positive'], annot= True, vmin = 0, vmax = 1, 
           cbar_kws = {'label': 'recall'})

plt.ylabel('True label')
plt.xlabel('Predicted label');
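
Dividing each row of the confusion matrix by its sum turns counts into per-class rates, so the diagonal entries are the recall of each class. A minimal sketch with made-up counts, using plain lists instead of NumPy arrays and a hypothetical `normalize_rows` helper:

```python
def normalize_rows(matrix):
    """Divide every row by its sum so that each row adds up to 1."""
    normalized = []
    for row in matrix:
        total = sum(row)
        # Leave all-zero rows as zeros to avoid dividing by zero.
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


# Rows are true labels, columns are predicted labels (negative, positive).
toy_counts = [[80, 20],   # 100 truly negative reviews
              [10, 90]]   # 100 truly positive reviews
toy_normed = normalize_rows(toy_counts)
print(toy_normed)
```

With these toy counts, 80% of the negative reviews and 90% of the positive reviews are recovered, which is what the diagonal of the normalized matrix reports.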