# **Assignment-2 for CS60075: Natural Language Processing**

#### Instructor : Prof. Sudeshna Sarkar

#### Teaching Assistants : Alapan Kuila, Aniruddha Roy, Prithwish Jana, Udit Dharmin Desai

#### Date of Announcement: 15th Sept, 2021
#### Deadline for Submission: 11.59pm on Wednesday, 22nd Sept, 2021 
#### Submit this .ipynb file, named as `<Your_Roll_Number>_Assn2_NLP_A21.ipynb`

The central idea of this assignment is to use Naive Bayes classifier and LSTM based classifier and compare the models by accuracy on IMDB dataset.  This dataset consists of 50k movie reviews (25k positive, 25k negative). You can download the dataset from https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews



Please submit with outputs. 

In [1]:
import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import keras
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
#Load the IMDB dataset. You can load it using pandas as dataframe
df = pd.read_csv('IMDB Dataset.csv')

# Preprocessing
PrePrecessing that needs to be done on lower cased corpus

1. Remove html tags
2. Remove URLS
3. Remove non alphanumeric character
4. Remove Stopwords
5. Perform stemming and lemmatization

You can use regex from re. 

In [3]:
def preprocess(text):
    # Lowercase
    text = text.lower()
    # Remove html tags
    text = re.sub('<[^<]+?>', '', text)
    # Remove URLs
    text = re.sub('http\S+|www.\S+', '', text)
    return text

df['review'] = df['review'].apply(lambda text : preprocess(text))

# Count the number of sentences, because afterwards '.' will be removed
num_sentences = 0
for idx, row in df.iterrows():
    num_sentences += len(sent_tokenize(row['review']))

# Remove non-alphanumeric characters
df['review'] = df['review'].str.replace('\W', ' ')
df['review'] = df['review'].str.replace('\s+', ' ')

# Remove Stopwords
stopwords_list = stopwords.words('english')
df['review'] = df['review'].apply(lambda text : ' '.join([word for word in text.split() if word not in (stopwords_list)]))

In [4]:
# Stemming
porter = PorterStemmer()

def stemSentence(sentence):
    tokens = word_tokenize(sentence)
    stem_sentence = []
    for word in tokens:
        stem_sentence.append(porter.stem(word))
        stem_sentence.append(" ")
    return "".join(stem_sentence)

# Lemmatization
Lemmatizer = WordNetLemmatizer()

def lemmatizeSentence(sentence):
    tokens = word_tokenize(sentence)
    lemmatize_sentence = []
    for word in tokens:
        lemmatize_sentence.append(Lemmatizer.lemmatize(word))
        lemmatize_sentence.append(" ")
    return "".join(lemmatize_sentence)

In [5]:
for index, row in df.iterrows():
    row['review'] = stemSentence(row['review'])
    row['review'] = lemmatizeSentence(row['review'])

In [6]:
# Print Statistics of Data like avg length of sentence , proposition of data w.r.t class labels
num_words = 0
num_positive = 0
num_negative = 0

for index, row in df.iterrows():
    num_words += len(row['review'].split())
    if row['sentiment'] == 'positive':
        num_positive += 1
    else:
        num_negative += 1

print("Number of words in the reviews column =", num_words)
print("Number of sentences in the reviews column =", num_sentences)
print("Average Length of a sentence = {0:.4f}".format(num_words / num_sentences))

print("Number of positive sentiment examples =", num_positive)
print("Number of negative sentiment examples =", num_negative)
print("Proportion of data for positive sentiment = {0:.2f} %".format((num_positive * 100) / (num_positive + num_negative)))
print("Proportion of data for negative sentiment = {0:.2f} %".format((num_negative * 100) / (num_positive + num_negative)))

Number of words in the reviews column = 5980737
Number of sentences in the reviews column = 532492
Average Length of a sentence = 11.2316
Number of positive sentiment examples = 25000
Number of negative sentiment examples = 25000
Proportion of data for positive sentiment = 50.00 %
Proportion of data for negative sentiment = 50.00 %


# Naive Bayes classifier

In [7]:
# get reviews column from df
reviews = df['review']

# get labels column from df
labels = df['sentiment']

In [8]:
# Use label encoder to encode labels. Convert to 0/1
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

# print(enc.classes_)
print(encoder.classes_)

['negative' 'positive']


In [9]:
# Split the data into train and test (80% - 20%). 
# Use stratify in train_test_split so that both train and test have similar ratio of positive and negative samples.

train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, encoded_labels,
                                                                              stratify = encoded_labels, test_size = 0.2,
                                                                              shuffle = True, random_state = 42)

train_sentences = train_sentences.to_numpy()
test_sentences = test_sentences.to_numpy()

Here there are two approaches possible for building vocabulary for the naive Bayes.
1. Take the whole data (train + test) to build the vocab. In this way while testing there is no word which will be out of vocabulary.
2. Take the train data to build vocab. In this case, some words from the test set may not be in vocab and hence one needs to perform smoothing so that one the probability term is not zero.
 
You are supposed to go by the 2nd approach.
 
Also building vocab by taking all words in the train set is memory intensive, hence you are required to build vocab by choosing the top 2000 - 3000 frequent words in the training corpus.

> $ P(x_i | w_j) = \frac{ N_{x_i,w_j}\, +\, \alpha }{ N_{w_j}\, +\, \alpha*d} $


$N_{x_i,w_j}$ : Number of times feature $x_i$ appears in samples of class $w_j$

$N_{w_j}$ : Total count of features in class $w_j$

$\alpha$ : Parameter for additive smoothing. Here consider $\alpha$ = 1

$d$ : Dimentionality of the feature vector  $x = [x_1,x_2,...,x_d]$. In our case its the vocab size.






In [10]:
from sklearn.feature_extraction.text import CountVectorizer
# Use Count vectorizer to get frequency of the words

vec = CountVectorizer(max_features = 3000)
doc_term_freq = vec.fit_transform(train_sentences).toarray()

In [11]:
# Use laplace smoothing for words in test set not present in vocab of train set
vocab = vec.vocabulary_
V = len(vocab)

class_wise_term_freq = np.zeros((2, V), dtype = int)

for i in range(len(doc_term_freq)):
    classification_of_doc = train_labels[i]
    for j in range(V):
        class_wise_term_freq[classification_of_doc][j] += doc_term_freq[i][j]

In [12]:
# Build the model. Don't use the model from sklearn
import math

total_count_of_features = np.zeros((2, 1), dtype = int)
for i in range(2):
    for j in range(V):
        total_count_of_features[i] += class_wise_term_freq[i][j]

freq_of_class = np.zeros((2, 1), dtype = int)
for i in range(len(doc_term_freq)):
    freq_of_class[train_labels[i]] += 1

prob_of_class = np.zeros((2, 1))
for i in range(2):
    prob_of_class[i] = freq_of_class[i] / len(doc_term_freq)

def predict(sentence):
    list_of_tokens = word_tokenize(sentence)
    prob_class = np.ones((2, 1))
    for i in range(2):
        prob_class[i] = math.log10(prob_of_class[i])
    
    D0 = total_count_of_features[0] + V
    D1 = total_count_of_features[1] + V
    
    for token in list_of_tokens:
        if token in vocab.keys():
            term_idx = vocab[token]
            prob_class[0] += math.log10(1 + class_wise_term_freq[0][term_idx]) - math.log10(D0)
            prob_class[1] += math.log10(1 + class_wise_term_freq[1][term_idx]) - math.log10(D1)
        else:
            continue
    
    if prob_class[1] > prob_class[0]:
        return 1
    return 0


In [13]:
# Test the model on test set and report Accuracy

correct_classsifications = 0
num_test_sentences = len(test_sentences)

for i in range(num_test_sentences):
    predicted_classification = predict(test_sentences[i])
    if predicted_classification == test_labels[i]:
        correct_classsifications += 1

print("Accuracy for Naive Bayes Classifier = {:.4f} %".format(correct_classsifications * 100 / num_test_sentences))

Accuracy for Naive Bayes Classifier = 84.4000 %


# *LSTM* based Classifier

Use the above train and test splits.

In [None]:
# Hyperparameters of the model
vocab_size = V
oov_tok = '<OOK>'
embedding_dim = 100
max_length = 150
padding_type='post'
trunc_type='post'

In [None]:
# tokenize sentences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index

# convert train dataset to sequence and pad sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding='post', maxlen=max_length)

# convert Test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding='post', maxlen=max_length)

In [None]:
# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# model summary
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 150, 100)          300000    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               84480     
_________________________________________________________________
dense (Dense)                (None, 24)                3096      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 25        
Total params: 387,601
Trainable params: 387,601
Non-trainable params: 0
_________________________________________________________________


In [None]:
num_epochs = 5
history = model.fit(train_padded, train_labels, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
# Calculate accuracy on Test data
# Get probabilities
prediction = model.predict(test_padded)

# Get labels based on probability 1 if p>= 0.5 else 0
lstm_predicted_labels = np.zeros((len(prediction), 1), dtype = int)

for i in range(len(prediction)):
    if prediction[i][0] >= 0.5:
        lstm_predicted_labels[i][0] = 1
    else:
        lstm_predicted_labels[i][0] = 0

# Accuracy : one can use classification_report from sklearn
print("Accuracy = {:.4f} %".format(100 * accuracy_score(test_labels, lstm_predicted_labels)))
print("\nClassification Report:")
print(classification_report(test_labels, lstm_predicted_labels))

Accuracy = 86.5700 %

Classification Report:
              precision    recall  f1-score   support

           0       0.86      0.87      0.87      5000
           1       0.87      0.86      0.86      5000

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000



## Get predictions for random examples

In [None]:
# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the movie plot is terrible but it had good acting"]

def preprocess_random_sentences(text):
    # Lowercase
    text = text.lower()
    # Remove html tags
    text = re.sub('<[^<]+?>', '', text)
    # Remove apostrophes
    text = re.sub('\'.*?\s', '', text)
    # Remove URLs
    text = re.sub('http\S+|www.\S+', '', text)
    # Remove Stopwords
    text = ' '.join([word for word in text.split() if word not in (stopwords_list)])
    # Remove non-alphanumeric characters
    text = re.sub('\W', ' ', text)
    text = re.sub('\s+', ' ', text)
    text = stemSentence(text)
    text = lemmatizeSentence(text)
    return text

for i in range(3):
    sentence[i] = preprocess_random_sentences(sentence[i])

# convert to a sequence
sequences = tokenizer.texts_to_sequences(sentence)

# pad the sequence
padded = pad_sequences(sequences, padding='post', maxlen=max_length)

# Get probabilities
prediction_unseen = model.predict(padded)
print("The probabilities are the following:")
print(prediction_unseen)

# Get labels based on probability 1 if p>= 0.5 else 0
lstm_predicted_labels_unseen = np.zeros((len(prediction_unseen), 1), dtype = int)

for i in range(len(prediction_unseen)):
    if prediction_unseen[i][0] >= 0.5:
        lstm_predicted_labels_unseen[i][0] = 1
    else:
        lstm_predicted_labels_unseen[i][0] = 0

print("The labels are the following:")
print(lstm_predicted_labels_unseen)

The probabilities are the following:
[[0.95695686]
 [0.18327245]
 [0.07588911]]
The labels are the following:
[[1]
 [0]
 [0]]


**Conclusion**

The accuracy of Naive Bayes classifier is 84.4 % whereas the accuracy of LSTM classifier is 86.6 %. LSTM is a bit better in terms of accuracy but depending upon the utility, Naive Bayes may also be useful because of it's simplicity and competitive accuracy.