### IMDB sentiment classification

In [None]:
import nltk
import pandas as pd
from nltk import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt

Load the CSV file with the pandas `read_csv` function.

In [None]:
imdb_data = pd.read_csv('resource/imdb.csv')

In [None]:
imdb_data.head()

The `sentiment` Series has to be converted to numbers. Since there are only two classes, positive and negative, they are mapped to 1 and 0 respectively. The pandas Series offers a `replace` method, which substitutes every entry in the Series according to the key-value pairs in the supplied dictionary.

#### Encoding the labels

In [None]:
sent_dict = {'positive': 1, 'negative': 0}
y_sentiments = imdb_data.sentiment.replace(sent_dict)

In [None]:
y_sentiments[:10]
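As a minimal sketch of the label encoding on toy data: the names `toy_labels` and `label_map` are made up for illustration and are not part of the notebook. It uses `Series.map`, a closely related pandas method that also applies a dictionary entry by entry; unlike `replace`, it turns any value missing from the dictionary into `NaN`, which makes unexpected labels easy to spot.

```python
import pandas as pd

# Hypothetical toy labels standing in for imdb_data.sentiment
toy_labels = pd.Series(["positive", "negative", "positive"])
label_map = {"positive": 1, "negative": 0}

# map() looks every entry up in the dictionary
encoded = toy_labels.map(label_map)
print(encoded.tolist())
```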

`train_test_split()` from sklearn shuffles the data and splits it into train and test sets. For simplicity, we deliberately skip carving out a separate validation set.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(imdb_data.review, y_sentiments, test_size=0.2)

In [None]:
train_vc = y_train.value_counts()
test_vc = y_test.value_counts()
print(train_vc)
print(test_vc)

The class proportions in the train and test sets do not match exactly, so it's best to pass the `stratify` parameter to `train_test_split`. It preserves the class proportions of the given Series in both splits while still shuffling the data.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(imdb_data.review, y_sentiments, test_size=0.2, stratify=y_sentiments)

In [None]:
train_vc = y_train.value_counts()
test_vc = y_test.value_counts()
print(train_vc)
print(test_vc)

Now the class proportions are the same in the train and test data. Remember that `stratify` only keeps the class proportions consistent between train and test. It cannot balance the positive and negative classes against each other; that depends on how balanced the dataset itself is. `stratify` also works with multi-class classification.
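The point about `stratify` can be illustrated with a deliberately imbalanced toy dataset. This is only a sketch: `toy_x` and `toy_y` are hypothetical, and `random_state=0` is added just to make the shuffle repeatable.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 positive and 20 negative labels
toy_x = list(range(100))
toy_y = [1] * 80 + [0] * 20

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

# Both splits keep the original 80/20 ratio; neither becomes 50/50
print(Counter(y_tr))
print(Counter(y_te))
```

The imbalance of the source data carries over into both splits, which is exactly what stratification promises and nothing more.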

#### Computing bag of words

!Exercise: 

Initialise a `CountVectorizer` and transform `x_train` and `x_test` into the term-document matrices `x_train_counts` and `x_test_counts` respectively. Name the vectorizer `CV`, since the inference section refers to it by that name.
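Before attempting the exercise, it may help to see what a bag-of-words matrix looks like on a tiny corpus. The sketch below uses made-up sentences and the hypothetical names `toy_vectorizer`, `toy_train` and `toy_test`. The vocabulary is learned from the training sentences only, so a word that appears only in the test sentence is simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not the IMDB data
toy_train = ["good movie", "bad movie", "good good plot"]
toy_test = ["good unknown plot"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and counts the words in one step
train_counts = toy_vectorizer.fit_transform(toy_train)
# transform reuses the learned vocabulary; "unknown" is not in it
test_counts = toy_vectorizer.transform(toy_test)

print(sorted(toy_vectorizer.vocabulary_))  # column order of the matrices
print(train_counts.toarray())
print(test_counts.toarray())
```

Each row is a document and each column a vocabulary word, so the matrix width equals the vocabulary size.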

#### Model training

For binary classification, logistic regression is a good model to start with. Sklearn provides it as the `LogisticRegression` classifier.

In [None]:
clf = LogisticRegression(max_iter=1000, random_state=0, solver='liblinear')
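# Alternatives to try (each needs its own import from sklearn):
# clf = RandomForestClassifier(max_depth=25, random_state=0)  # from sklearn.ensemble
# clf = GaussianNB()  # from sklearn.naive_bayes; expects dense input, not a sparse matrix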

In [None]:
clf.fit(x_train_counts, y_train)

In [None]:
predicted = clf.predict(x_test_counts)

In [None]:
accuracy_score(y_test, predicted)

An accuracy of 88% is actually great, given that we haven't done any pre-processing: the raw data went straight into the model. Next, we can try a neural network model to see how it compares with a linear model.

#### Training a NN model using keras

In [None]:
from keras import models, layers
import tensorflow as tf

In [None]:
model = models.Sequential()
# Input layer
model.add(layers.Dense(128, activation="relu", input_shape=(x_train_counts.shape[1],)))
# Hidden layers
model.add(layers.Dropout(0.3))
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(32, activation="relu"))
# Output layer
model.add(layers.Dense(1, activation="sigmoid"))
model.summary()

In [None]:
model.compile(optimizer='adam', loss=tf.keras.losses.binary_crossentropy)

In [None]:
model.fit(x_train_counts, y_train, epochs=2)

In [None]:
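# predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output at 0.5 instead
predicted = (model.predict(x_test_counts) > 0.5).astype("int32").ravel()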

In [None]:
accuracy_score(y_test, predicted)
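The network ends in a single sigmoid unit, so its raw output is a probability per review rather than a class label. A minimal sketch of turning such probabilities into 0/1 labels follows; the array `toy_probs` is invented for illustration, and the 0.5 cut-off is an assumption that suits a balanced binary problem.

```python
import numpy as np

# Hypothetical sigmoid outputs with shape (n_samples, 1), as a Keras model would return
toy_probs = np.array([[0.1], [0.9], [0.51], [0.5]])

# Anything strictly above the threshold counts as the positive class
toy_classes = (toy_probs > 0.5).astype("int32").ravel()
print(toy_classes)
```

Flattening with `ravel()` gives a 1-D array that lines up with the shape of the true labels when computing accuracy.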

### Model improvement

In [None]:
import string
from nltk.stem import SnowballStemmer, PorterStemmer
from nltk import word_tokenize

In [None]:
stemmer = PorterStemmer()

!Exercise:

Define the `get_stems` function to return the stemmed version of a sentence. Strip the punctuation before stemming the words.
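As a hint for the exercise, here is one possible shape for the two steps on a toy sentence. It is a sketch, not the intended solution: the function name `stem_sentence_sketch` is hypothetical, and it splits on whitespace with `str.split` instead of `word_tokenize` so that it runs without downloading any NLTK tokenizer data.

```python
import string
from nltk.stem import PorterStemmer

sketch_stemmer = PorterStemmer()
# Translation table that deletes every ASCII punctuation character
strip_punct = str.maketrans('', '', string.punctuation)

def stem_sentence_sketch(sent):
    # 1. remove punctuation, 2. split into words, 3. stem each word, 4. rejoin
    cleaned = sent.translate(strip_punct)
    return " ".join(sketch_stemmer.stem(word) for word in cleaned.split())

print(stem_sentence_sketch("Running, runs!"))
```

Stemming after punctuation removal avoids treating tokens such as `runs!` and `runs` as different words.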

In [None]:
punctuation_map = str.maketrans('', '', string.punctuation)
def get_stems(sent):
    # return a stemmed sentence along with punctuation removal
    raise NotImplementedError("Exercise: implement get_stems")

In [None]:
stemmed = imdb_data.review.apply(get_stems)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(stemmed, y_sentiments, test_size=0.2, stratify=y_sentiments)

### Inference

Let us try the model on a few sentences of our own and see how well it separates positive from negative sentiment.

In [None]:
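def predict_sentiment(sent):
    # `CV` is the CountVectorizer fitted on the training data in the bag-of-words exercise
    sent_counts = CV.transform(sent)
    # predict_classes was removed in TensorFlow 2.6; threshold the sigmoid output instead
    prob = model.predict(sent_counts)[0][0]
    return "Positive" if prob > 0.5 else "Negative"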

In [None]:
sent1 = ['This movie doesnt make sense at all. The portrayal of the lead character is awful, \
atleast they did a good job with the music. I would never suggest this movie to anyone']

In [None]:
sent2 = ['This movie is pretty good. The portrayal of the lead character is awesome, \
they did mess up with the cast though.']

In [None]:
predict_sentiment(sent1)