### IMDB sentiment classification

In [1]:
import nltk
import pandas as pd
from nltk import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

Load the CSV file with the pandas `read_csv` function.

In [2]:
imdb_data = pd.read_csv('resource/imdb.csv')

In [3]:
imdb_data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


The `sentiment` column has to be converted to numbers before training. Since it holds two classes, positive and negative, they are mapped to 1 and 0 respectively. The pandas Series offers a `replace` method that substitutes every entry matching a key in the supplied dictionary with the corresponding value.

#### Encoding the labels

In [4]:
sent_dict = {'positive': 1, 'negative': 0}
y_sentiments = imdb_data.sentiment.replace(sent_dict)

In [5]:
y_sentiments[:10]

0    1
1    1
2    1
3    0
4    1
5    1
6    1
7    0
8    0
9    1
Name: sentiment, dtype: int64
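
As a side note, the same encoding can also be written with `Series.map`, which newer pandas releases may prefer over `replace` for a full relabeling (recent versions can emit a downcasting warning for `replace` here). The sketch below uses a small hypothetical Series, `toy_sentiment`, rather than the IMDB data.

```python
import pandas as pd

# Hypothetical toy labels standing in for imdb_data.sentiment
toy_sentiment = pd.Series(["positive", "negative", "positive"], name="sentiment")
sent_dict = {"positive": 1, "negative": 0}

# map() looks up every entry in the dictionary; a label missing from the
# dictionary would become NaN instead of passing through unchanged
toy_encoded = toy_sentiment.map(sent_dict)
print(toy_encoded.tolist())
```

Unlike `replace`, `map` turns unknown labels into `NaN`, which makes unexpected values in the label column easier to spot.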

`train_test_split()` from sklearn shuffles the data and splits it into train and test sets. For simplicity, we deliberately skip a separate validation split.

In [6]:
x_train, x_test, y_train, y_test = train_test_split(imdb_data.review, y_sentiments, test_size=0.2)

In [7]:
train_vc = y_train.value_counts()
test_vc = y_test.value_counts()
print(train_vc)
print(test_vc)

1    20024
0    19976
Name: sentiment, dtype: int64
0    5024
1    4976
Name: sentiment, dtype: int64


The classes are not split evenly between train and test, so it is best to pass the `stratify` parameter to `train_test_split`. It still shuffles the dataset, but keeps the class proportions of the given Series the same in both splits.

In [8]:
x_train, x_test, y_train, y_test = train_test_split(imdb_data.review, y_sentiments, test_size=0.2, stratify=y_sentiments)

In [9]:
train_vc = y_train.value_counts()
test_vc = y_test.value_counts()
print(train_vc)
print(test_vc)

1    20000
0    20000
Name: sentiment, dtype: int64
1    5000
0    5000
Name: sentiment, dtype: int64


Now the class distribution is the same in the train and test data. Remember, `stratify` only preserves the class proportions across the train and test splits. It cannot balance the positive and negative classes themselves; that depends on how balanced the dataset is. `stratify` also works with multi-class classification.
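
To illustrate the point about imbalance, here is a minimal sketch on hypothetical toy data (`toy_x`, `toy_y`, 80 negatives and 20 positives, all made up for this example). With `stratify`, both splits keep the 80/20 ratio, but the imbalance itself is not corrected.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels: 80 negatives (0) and 20 positives (1)
toy_y = pd.Series([0] * 80 + [1] * 20)
toy_x = pd.Series([f"review {i}" for i in range(100)])

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

# Both splits keep the original 80/20 class ratio
train_counts = toy_y_train.value_counts().to_dict()
test_counts = toy_y_test.value_counts().to_dict()
print(train_counts, test_counts)
```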

#### Computing bag of words

In [10]:
CV = CountVectorizer(min_df=3, stop_words='english', tokenizer=word_tokenize)

In [11]:
X_train_counts = CV.fit_transform(x_train)

In [12]:
X_test_counts = CV.transform(x_test)

In [13]:
len(CV.get_feature_names())

51116
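
To make the bag-of-words step and the `min_df` setting concrete, the sketch below runs `CountVectorizer` on three hypothetical two-word documents. It uses `min_df=2` and the default tokenizer (not `word_tokenize`) so that it stays small, and reads the vocabulary from the `vocabulary_` attribute, which is available across scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; each string plays the role of one review
toy_docs = ["good movie", "bad movie", "good plot"]

# min_df=2 drops words that appear in fewer than two documents
toy_cv = CountVectorizer(min_df=2)
toy_counts = toy_cv.fit_transform(toy_docs)

# Surviving words and their column indices ("bad" and "plot" are dropped)
toy_vocab = {word: int(idx) for word, idx in toy_cv.vocabulary_.items()}
print(toy_vocab)

# One row per document, one column per surviving word
toy_matrix = toy_counts.toarray().tolist()
print(toy_matrix)
```

Each row is the count vector that the classifier sees for one document; on the IMDB data the same matrix has 51116 columns.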

#### Model training

For binary classification, logistic regression is a good model to start with. Sklearn provides it as the `LogisticRegression` classifier.

In [19]:
clf = LogisticRegression(max_iter=1000, random_state=0, solver='liblinear')
# clf = RandomForestClassifier(max_depth=25, random_state=0)
# clf = GaussianNB()

In [20]:
clf.fit(X_train_counts, y_train)

LogisticRegression(max_iter=1000, random_state=0, solver='liblinear')

In [21]:
predicted = clf.predict(X_test_counts)

In [22]:
accuracy_score(y_test, predicted)

0.8871

An accuracy of about 88.7% is a strong result, given that we haven't done any pre-processing: the raw text goes straight into the vectorizer and the model. Next, we try a neural network model to see how it compares with the linear model.

#### Training a NN model using keras

In [23]:
from keras import models, layers
import tensorflow as tf

Using TensorFlow backend.


In [24]:
model = models.Sequential()
model.add(layers.Dense(128, activation = "relu", input_shape=(X_train_counts.shape[1], )))
# Hidden - Layers
model.add(layers.Dropout(0.3, noise_shape=None, seed=None))
model.add(layers.Dense(64, activation = "relu"))
model.add(layers.Dropout(0.2, noise_shape=None, seed=None))
model.add(layers.Dense(32, activation = "relu"))
# Output- Layer
model.add(layers.Dense(1, activation = "sigmoid"))
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 128)               6542976   
_________________________________________________________________
dropout_1 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 64)                8256      
_________________________________________________________________
dropout_2 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_3 (Dense)              (None, 32)                2080      
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33        
Total params: 6,553,345
Trainable param
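
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit, and `Dropout` layers add no parameters. A short sketch of that arithmetic, using the vocabulary size of 51116 as the input width:

```python
# Layer widths taken from the model summary; 51116 is the vocabulary size
layer_sizes = [51116, 128, 64, 32, 1]

# A Dense layer has (inputs * units) weights plus one bias per unit
dense_params = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(dense_params)
print(dense_params, total_params)
```

Almost all of the parameters sit in the first layer, because its input is the full bag-of-words vector.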

In [25]:
model.compile(optimizer='adam', loss=tf.keras.losses.binary_crossentropy)



In [26]:
model.fit(X_train_counts, y_train, epochs=2)


Epoch 1/2
Epoch 2/2


<keras.callbacks.callbacks.History at 0x1af14aef3c8>

In [27]:
predicted = model.predict_classes(X_test_counts)

In [28]:
accuracy_score(y_test, predicted)

0.8916
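
Note that `predict_classes` is not available on `Sequential` models in newer TensorFlow releases. For a single sigmoid output, the usual equivalent is to threshold the result of `model.predict` at 0.5. The sketch below shows only the thresholding step on hypothetical probabilities (`toy_probs`), not on the trained model.

```python
import numpy as np

# Hypothetical sigmoid outputs, shaped (n_samples, 1) like model.predict returns
toy_probs = np.array([[0.91], [0.12], [0.50], [0.67]])

# Threshold at 0.5: probabilities above it become class 1, the rest class 0
toy_classes = (toy_probs > 0.5).astype("int32")
print(toy_classes.ravel().tolist())
```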

### Model improvement

In [32]:
import string
from nltk.stem import SnowballStemmer, PorterStemmer
from nltk.stem import WordNetLemmatizer

In [33]:
stemmer = PorterStemmer()

In [34]:
punctuation_map = str.maketrans('', '', string.punctuation)
def get_stems(sent):
    return ' '.join([stemmer.stem(tok) for tok in sent.translate(punctuation_map).split()])

In [35]:
stemmed = imdb_data.review.apply(get_stems)

In [36]:
x_train, x_test, y_train, y_test = train_test_split(stemmed, y_sentiments, test_size=0.2, stratify=y_sentiments)
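
To see what the punctuation-stripping step inside `get_stems` does before stemming, here is a plain-Python sketch on a hypothetical review fragment (`toy_review`). The stemmer is left out so that the example runs without NLTK.

```python
import string

# Same translation table as in get_stems: delete every punctuation character
punctuation_map = str.maketrans("", "", string.punctuation)

# Hypothetical fragment with the HTML line breaks seen in the raw reviews
toy_review = "A wonderful little production. <br /><br />The end!"

cleaned_tokens = toy_review.translate(punctuation_map).split()
print(cleaned_tokens)
```

Because only the punctuation characters are deleted, each `<br />` tag survives as a `br` token, so stripping HTML before this step may be worth considering.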

### Inference

Let us try the model on a few example sentences and see how well it classifies positive and negative sentiment.

In [37]:
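def predict_sentiment(sent):
    """Classify a list containing one review string as "Positive" or "Negative"."""
    # Vectorize with the CountVectorizer that was fitted on the training reviews
    sent_tokens = CV.transform(sent)
    predicted = model.predict_classes(sent_tokens)
    # predicted has shape (1, 1); class 0 is negative, class 1 is positive
    if predicted[0][0] == 0:
        return "Negative"
    else:
        return "Positive"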

In [38]:
sent1 = ['This movie doesnt make sense at all. The portrayal of the lead character is awful, \
atleast they did a good job with the music. I would never suggest this movie to anyone']

In [39]:
sent2 = ['This movie is pretty good. The portrayal of the lead character is awesome, \
they did mess up with the cast though.']

In [41]:
predict_sentiment(sent1)

'Negative'