Another way to predict the sentiment of text-based reviews is supervised machine learning, where a model learns from reviews that are already labeled with their sentiment. More specifically, we use classification models to solve this problem.

We build an automated sentiment text classification system in subsequent sections. The major steps to achieve this are as follows:
1. Prepare train and test datasets (optionally a validation dataset).
2. Preprocess and normalize text documents.
3. Feature engineering.
4. Model training.
5. Model prediction and evaluation.
A final, optional step is to deploy the model on your own server or in the cloud.
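
The steps above can be sketched end to end on a toy corpus. This is only a minimal illustration, assuming scikit-learn is installed; the reviews and labels are invented for the example and are not taken from the movie review dataset, and the `Pipeline` wrapper is a convenience that the rest of this notebook does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline

# Step 1: train and test data (toy, made-up reviews; both classes are present)
train_docs = ["great movie loved it", "wonderful acting great story",
              "terrible movie hated it", "awful plot terrible acting"]
train_labels = ["positive", "positive", "negative", "negative"]
test_docs = ["loved the great story", "hated the awful plot"]
test_labels = ["positive", "negative"]

# Steps 2-4: the vectorizer lowercases and tokenizes the text (a very light
# form of normalization), builds TF-IDF features, and the classifier is trained
pipeline = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression(C=1.0, max_iter=100)),
])
pipeline.fit(train_docs, train_labels)

# Step 5: predict on unseen reviews and evaluate
predicted = pipeline.predict(test_docs)
accuracy = accuracy_score(test_labels, predicted)
print(list(predicted), accuracy)
```

The sections below perform the same steps one at a time, which makes it easier to inspect the intermediate features.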

In our scenario, the documents are movie reviews and the classes are the review sentiments, which can be either positive or negative, making this a binary classification problem.

We will build models using traditional machine learning methods and newer deep learning methods in the subsequent sections.

# Import necessary dependencies

In [1]:
import pandas as pd
import numpy as np
import text_normalizer as tn
import model_evaluation_utils as meu
import nltk

np.set_printoptions(precision=2, linewidth=80)

# Load and normalize data
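We use 35,000 reviews for training models and save the remaining 15,000 reviews as the test dataset to evaluate model performance. Note that the code and output below come from a run on a five-review sample, which is why the split index is 3 rather than 35,000.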

In [2]:
%%time
# dataset = pd.read_csv('movie_reviews.csv.bz2', compression='bz2')
dataset = pd.read_csv('movie_reviews.csv')


# take a peek at the data
print(dataset.head())
reviews = np.array(dataset['review'])
sentiments = np.array(dataset['sentiment'])
 
# build train and test datasets
train_reviews = reviews[:3]
train_sentiments = sentiments[:3]
test_reviews = reviews[3:]
test_sentiments = sentiments[3:]

# normalize datasets
stop_words = nltk.corpus.stopwords.words('english')
stop_words.remove('no')
stop_words.remove('but')
stop_words.remove('not')

norm_train_reviews = tn.normalize_corpus(train_reviews, stopwords=stop_words)
norm_test_reviews = tn.normalize_corpus(test_reviews, stopwords=stop_words)

                                              review sentiment
0  It is the first time for me that a teacher has...  positive
1  It is the first time for me that a teacher has...  positive
2  I would like access to an LA android-based app...  positive
3  I would like access to an LA android-based app...  positive
4  I would like access to an LA android-based app...  positive
Wall time: 447 ms


Our datasets are now prepared and normalized so we can proceed from Step 3 in our text classification workflow to build our classification system.
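
Splitting by position only works if both sentiment classes end up in the training slice. As a hedged sketch of one alternative, the snippet below uses scikit-learn's `train_test_split` with `stratify` so that each split keeps the class proportions, and adds a small guard that fails early when only one class is present. The helper name `require_two_classes` and the toy labels are assumptions made for this example.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

def require_two_classes(labels):
    """Return the label counts, or raise if fewer than two classes are present."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError(f"need at least 2 classes, found: {sorted(counts)}")
    return counts

# Toy labels and placeholder review texts (invented for illustration)
toy_labels = ["positive"] * 6 + ["negative"] * 4
toy_reviews = [f"review {i}" for i in range(len(toy_labels))]

# A stratified split keeps roughly the same class ratio in train and test
toy_train_x, toy_test_x, toy_train_y, toy_test_y = train_test_split(
    toy_reviews, toy_labels, test_size=0.3, stratify=toy_labels, random_state=42)

train_counts = require_two_classes(toy_train_y)
print("train class counts:", dict(train_counts))
```

Calling `require_two_classes` on a list that holds only `'positive'` labels raises a `ValueError` before any model is trained.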

# Traditional Supervised Machine Learning Models

## Feature Engineering

In [3]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# build BOW features on train reviews
# we use single words (unigrams) as well as bigrams in our feature sets
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1,2))
cv_train_features = cv.fit_transform(norm_train_reviews)
# build TFIDF features on train reviews
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0, ngram_range=(1,2),
                     sublinear_tf=True)
tv_train_features = tv.fit_transform(norm_train_reviews)

In [4]:
# transform test reviews into features
cv_test_features = cv.transform(norm_test_reviews)
tv_test_features = tv.transform(norm_test_reviews)

In [5]:
print('BOW model:> Train features shape:', cv_train_features.shape, ' Test features shape:', cv_test_features.shape)
print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

BOW model:> Train features shape: (3, 38)  Test features shape: (2, 38)
TFIDF model:> Train features shape: (3, 38)  Test features shape: (2, 38)
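
To make the feature columns concrete, here is a small sketch of what `CountVectorizer` produces with `ngram_range=(1, 2)`. The two short documents are invented for the example; the vectorizer settings mirror the ones used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up, already normalized documents
toy_docs = ["not good movie", "good movie"]

toy_cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0, ngram_range=(1, 2))
toy_features = toy_cv.fit_transform(toy_docs)

# Each column is one unigram or bigram; columns are ordered alphabetically
toy_vocab = sorted(toy_cv.vocabulary_)
print(toy_vocab)
print(toy_features.toarray())
```

The bigram `not good` gets its own column, which lets a linear model treat negated phrases differently from the single word `good`.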


## Model Training, Prediction and Performance Evaluation
We recommend using logistic regression, support vector machines, and multinomial Naïve Bayes models when you work on your own datasets in the future. In this chapter, we build models using logistic regression as well as SVM (trained here as a linear model with hinge loss through `SGDClassifier`).

In [6]:
from sklearn.linear_model import SGDClassifier, LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=100, C=1)
svm = SGDClassifier(loss='hinge', max_iter=100)

With logistic regression, we try to predict the probability that a given movie review belongs to one of the discrete classes (binary classes in our scenario).
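
As a worked illustration of that probability, the sketch below applies the logistic (sigmoid) function to a weighted sum of word counts. The two features, the weights, and the bias are hypothetical values chosen for the example, not coefficients learned by the model in this notebook.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for two count features: "great" and "boring"
toy_weights = np.array([1.5, -2.0])
toy_bias = 0.0

# Three toy reviews as counts of [great, boring]
toy_counts = np.array([[2, 0],
                       [0, 1],
                       [1, 1]])

toy_scores = toy_counts @ toy_weights + toy_bias   # linear scores
toy_probs = sigmoid(toy_scores)                    # P(review is positive)
toy_preds = np.where(toy_probs >= 0.5, "positive", "negative")
print(np.round(toy_probs, 3), toy_preds)
```

A score of 0 corresponds to a probability of 0.5, so the sign of the linear score decides the predicted class.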

In [7]:
# Logistic Regression model on BOW features
# logistic regression model is a supervised linear machine learning model used for classification
lr_bow_predictions = meu.train_predict_model(classifier=lr, 
                                             train_features=cv_train_features, train_labels=train_sentiments,
                                             test_features=cv_test_features, test_labels=test_sentiments)
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_bow_predictions,
                                      classes=['positive', 'negative'])

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 'positive'

In [8]:
# Logistic Regression model on TF-IDF features
lr_tfidf_predictions = meu.train_predict_model(classifier=lr, 
                                               train_features=tv_train_features, train_labels=train_sentiments,
                                               test_features=tv_test_features, test_labels=test_sentiments)
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=lr_tfidf_predictions,
                                      classes=['positive', 'negative'])

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 'positive'

In [9]:
# SVM model on BOW features
svm_bow_predictions = meu.train_predict_model(classifier=svm, 
                                             train_features=cv_train_features, train_labels=train_sentiments,
                                             test_features=cv_test_features, test_labels=test_sentiments)
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=svm_bow_predictions,
                                      classes=['positive', 'negative'])

ValueError: The number of classes has to be greater than one; got 1 class

In [10]:
# SVM model on TF-IDF features
svm_tfidf_predictions = meu.train_predict_model(classifier=svm, 
                                                train_features=tv_train_features, train_labels=train_sentiments,
                                                test_features=tv_test_features, test_labels=test_sentiments)
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=svm_tfidf_predictions,
                                      classes=['positive', 'negative'])

ValueError: The number of classes has to be greater than one; got 1 class

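Every run above fails with a ValueError because the three-review training sample contains only the 'positive' class, so no performance metrics are reported here. Both logistic regression and the SVM need training data with at least two classes. Once the training split contains both positive and negative reviews, the same cells report how effective and accurate each of these supervised classification algorithms is as a text sentiment classifier.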

# Newer Supervised Deep Learning Models
We now use text features based on word embeddings to build a text sentiment classification system.

So far, our Scikit-Learn models accepted the sentiment class labels positive and negative directly and encoded them internally. For our deep learning models, however, we need to encode them explicitly. The following snippet tokenizes our movie reviews and converts the text-based sentiment class labels into one-hot encoded vectors.

In [11]:
import gensim
import keras
from keras.models import Sequential
from keras.layers import Dropout, Activation, Dense
from sklearn.preprocessing import LabelEncoder
from keras.layers.normalization import BatchNormalization

Using TensorFlow backend.


## Prediction class label encoding

In [12]:
le = LabelEncoder()
num_classes=2 
# tokenize train reviews & encode train labels
tokenized_train = [tn.tokenizer.tokenize(text)
                   for text in norm_train_reviews]
y_tr = le.fit_transform(train_sentiments)
y_train = keras.utils.to_categorical(y_tr, num_classes)
# tokenize test reviews & encode test labels
tokenized_test = [tn.tokenizer.tokenize(text)
                   for text in norm_test_reviews]
y_ts = le.transform(test_sentiments)  # reuse the mapping fitted on the train labels
y_test = keras.utils.to_categorical(y_ts, num_classes)

In [13]:
# print class label encoding map and encoded labels
print('Sentiment class label map:', dict(zip(le.classes_, le.transform(le.classes_))))
print('Sample test label transformation:\n'+'-'*35,
      '\nActual Labels:', test_sentiments[:3], '\nEncoded Labels:', y_ts[:3], 
      '\nOne hot encoded Labels:\n', y_test[:3])

Sentiment class label map: {'positive': 0}
Sample test label transformation:
----------------------------------- 
Actual Labels: ['positive' 'positive'] 
Encoded Labels: [0 0] 
One hot encoded Labels:
 [[1. 0.]
 [1. 0.]]
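
The two encoding steps can also be written out by hand, which shows what the integer ids and one-hot rows mean. This is a minimal NumPy sketch under the assumption that both classes are present and are ordered alphabetically, as `LabelEncoder` does; the helper name `encode_labels` and the toy labels are made up for the example.

```python
import numpy as np

# Alphabetical class order: negative -> 0, positive -> 1
toy_classes = ["negative", "positive"]
class_to_id = {label: idx for idx, label in enumerate(toy_classes)}

def encode_labels(labels):
    """Return (integer ids, one-hot matrix) using one fixed class mapping."""
    ids = np.array([class_to_id[label] for label in labels])
    one_hot = np.eye(len(toy_classes))[ids]   # row i has a 1 in column ids[i]
    return ids, one_hot

# The same mapping is applied to train and test labels
ids_train, onehot_train = encode_labels(["positive", "negative", "positive"])
ids_test, onehot_test = encode_labels(["negative", "positive"])
print(class_to_id)
print(ids_test)
print(onehot_test)
```

Because the mapping is built once and reused, a label always receives the same id in both splits. If only one class is present when the encoder is fitted, that class is assigned id 0 regardless of which class it is.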


## Feature Engineering with word embeddings
Use the Word2Vec and GloVe models to generate embeddings.

In [None]:
%%time
# build word2vec model
w2v_num_features = 512
w2v_model = gensim.models.Word2Vec(tokenized_train, size=w2v_num_features, window=150,
                                   min_count=10, sample=1e-3, workers=16)    

In [None]:
def averaged_word2vec_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index2word)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)
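
The averaging idea behind the function above can be checked on a toy example. The sketch below replaces the trained Word2Vec model with a hand-written dictionary of two-dimensional vectors, which is an assumption made purely for illustration; the helper name `average_vector` is also hypothetical.

```python
import numpy as np

# Hand-written stand-in for a trained embedding model (illustrative values)
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
}

def average_vector(tokens, vectors, num_features):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        return np.zeros(num_features)
    return np.mean(known, axis=0)

doc_vec = average_vector(["good", "movie", "unseen"], toy_vectors, 2)
empty_vec = average_vector(["unseen"], toy_vectors, 2)
print(doc_vec, empty_vec)
```

Out-of-vocabulary tokens are skipped rather than counted, so they do not pull the average toward zero. A document with no known tokens maps to the all-zero vector.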

In [None]:
# generate averaged word vector features from word2vec model
avg_wv_train_features = averaged_word2vec_vectorizer(corpus=tokenized_train, model=w2v_model,
                                                     num_features=w2v_num_features)
avg_wv_test_features = averaged_word2vec_vectorizer(corpus=tokenized_test, model=w2v_model,
                                                    num_features=w2v_num_features)

In [None]:
# feature engineering with GloVe model
train_nlp = [tn.nlp_vec(item) for item in norm_train_reviews]
train_glove_features = np.array([item.vector for item in train_nlp])

test_nlp = [tn.nlp_vec(item) for item in norm_test_reviews]
test_glove_features = np.array([item.vector for item in test_nlp])

In [None]:
print('Word2Vec model:> Train features shape:', avg_wv_train_features.shape, ' Test features shape:', avg_wv_test_features.shape)
print('GloVe model:> Train features shape:', train_glove_features.shape, ' Test features shape:', test_glove_features.shape)

## Modeling with deep neural networks 

### Building Deep neural network architecture

In [None]:
# function leverages Keras on top of TensorFlow to build the desired DNN model
def construct_deepnn_architecture(num_input_features):
    dnn_model = Sequential()
    dnn_model.add(Dense(512, input_shape=(num_input_features,), kernel_initializer='glorot_uniform'))
    dnn_model.add(BatchNormalization())
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(512, kernel_initializer='glorot_uniform'))
    dnn_model.add(BatchNormalization())
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(512, kernel_initializer='glorot_uniform'))
    dnn_model.add(BatchNormalization())
    dnn_model.add(Activation('relu'))
    dnn_model.add(Dropout(0.2))
    
    dnn_model.add(Dense(2))
    dnn_model.add(Activation('softmax'))

    dnn_model.compile(loss='categorical_crossentropy', optimizer='adam',                 
                      metrics=['accuracy'])
    return dnn_model

We accept a num_input_features parameter, which decides the number of units needed in the input layer (512 for Word2Vec and 300 for GloVe features). We build a Sequential model, which helps us in linearly stacking our hidden and output layers.
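
To see how the input size affects the model, the following sketch counts the parameters of the architecture above by hand. It assumes the usual Keras conventions: a Dense layer has one weight per input-output pair plus one bias per unit, and BatchNormalization keeps four values per unit (gamma, beta, moving mean, moving variance). Activation and Dropout layers add no parameters. The function names are made up for this example.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def batchnorm_params(n_units):
    """Gamma, beta, moving mean and moving variance for each unit."""
    return 4 * n_units

def dnn_param_count(num_input_features, hidden_units=512, hidden_layers=3,
                    num_classes=2):
    """Total parameter count for the stacked Dense + BatchNormalization blocks."""
    total = 0
    n_in = num_input_features
    for _ in range(hidden_layers):
        total += dense_params(n_in, hidden_units) + batchnorm_params(hidden_units)
        n_in = hidden_units
    # final Dense layer that feeds the softmax output
    return total + dense_params(n_in, num_classes)

w2v_params = dnn_param_count(512)    # Word2Vec input size
glove_params = dnn_param_count(300)  # GloVe input size
print(w2v_params, glove_params)
```

Only the first Dense layer depends on the input size, so the two models differ by (512 - 300) * 512 parameters.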

In [None]:
w2v_dnn = construct_deepnn_architecture(num_input_features=w2v_num_features)

### Visualize sample deep architecture
We can visualize the architecture of the model we just built with the help of Keras.

In [None]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

SVG(model_to_dot(w2v_dnn, show_shapes=True, show_layer_names=False, 
                 rankdir='TB').create(prog='dot', format='svg'))

### Model Training, Prediction and Performance Evaluation

In [None]:
batch_size = 100
w2v_dnn.fit(avg_wv_train_features, y_train, epochs=10, batch_size=batch_size, 
            shuffle=True, validation_split=0.1, verbose=1)

In [None]:
y_pred = w2v_dnn.predict_classes(avg_wv_test_features)
predictions = le.inverse_transform(y_pred) 

In [None]:
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predictions, 
                                      classes=['positive', 'negative'])  

In [None]:
glove_dnn = construct_deepnn_architecture(num_input_features=300)

In [None]:
batch_size = 100
glove_dnn.fit(train_glove_features, y_train, epochs=10, batch_size=batch_size, 
              shuffle=True, validation_split=0.1, verbose=1)

In [None]:
y_pred = glove_dnn.predict_classes(test_glove_features)
predictions = le.inverse_transform(y_pred) 

In [None]:
meu.display_model_performance_metrics(true_labels=test_sentiments, predicted_labels=predictions, 
                                      classes=['positive', 'negative'])  

We obtained an overall model accuracy and F1-score of 86% with the GloVe features, which is still good but not better than what we obtained using our Word2Vec features.