# Sentiment Analysis
#### **Submitted by : Aakruti Ambasana, Akshay Sharma, Rahul Dingria**
#### **Group : 27**

In [1]:
# Importing libraries that are needed
import os
import string
import shutil
import re
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import metrics
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

In [3]:
# Build the path to the 'aclImdb' training folder relative to the current working directory.
current_path = os.getcwd()
path1 = os.path.join(current_path,"aclImdb")
train_folder=os.path.join(path1,"train")
os.listdir(train_folder)

['labeledBow.feat',
 'neg',
 'pos',
 'unsup',
 'unsupBow.feat',
 'urls_neg.txt',
 'urls_pos.txt',
 'urls_unsup.txt']

In [5]:
# Remove the 'unsup' folder from the aclImdb train folder: this is a supervised problem,
# so the unlabeled reviews are not needed.
remove_dir = os.path.join(train_folder, 'unsup')
try:
    shutil.rmtree(remove_dir)
except FileNotFoundError:
    # Only a missing folder is expected here; other OS errors should still surface.
    print("'unsup' folder does not exist :)")

## **Preprocessing**

- Each review is read from its text file with text_dataset_from_directory(), using the default batch size of 32. The training and validation subsets are created with validation_split=0.2 and the same seed value (9). The seed must be identical in both calls so that the training set and the validation set do not overlap. The test set is loaded from its own folder, so it needs neither a split nor a seed.
- text_dataset_from_directory() returns a BatchDataset containing the movie review and label tensors.
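
The reason both calls need the same seed can be illustrated with a small stand-alone sketch. It does not reproduce the Keras implementation; it only assumes that the file list is shuffled with the seed and then cut at the 80% mark. The helper `split_subset` and the file names are hypothetical.

```python
import random

def split_subset(files, seed, validation_split, subset):
    """Shuffle deterministically with `seed`, then return one side of the split."""
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_split)
    cut = len(shuffled) - n_val
    return shuffled[:cut] if subset == "training" else shuffled[cut:]

# Hypothetical file names standing in for the 25000 labeled reviews.
files = [f"review_{i}.txt" for i in range(25000)]

train_files = split_subset(files, seed=9, validation_split=0.2, subset="training")
val_files = split_subset(files, seed=9, validation_split=0.2, subset="validation")

# Same seed: both calls see the same shuffle, so the two subsets are disjoint.
print(len(train_files), len(val_files))
print(set(train_files).isdisjoint(val_files))

# Different seed: the shuffles differ, so validation files leak into training.
leaky_val = split_subset(files, seed=10, validation_split=0.2, subset="validation")
print(len(set(train_files) & set(leaky_val)) > 0)
```

With a mismatched seed, part of the validation data would already have been seen during training, and the validation metrics would look better than they should.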

In [9]:
data=tf.keras.preprocessing.text_dataset_from_directory('aclImdb/train', seed=9,
    validation_split=0.2,subset="training")
val_data=tf.keras.preprocessing.text_dataset_from_directory('aclImdb/train', seed=9,
    validation_split=0.2,subset="validation")
test_data=tf.keras.preprocessing.text_dataset_from_directory('aclImdb/test')
print(data)
# Considering label 0 as negative and label 1 as positive movie review
print('Label 0 is',data.class_names[0])
print('Label 1 is',data.class_names[1])

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.
<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>
Label 0 is neg
Label 1 is pos


#### Cleaning the dataset

In [10]:
def clean_text(input_data):
    # Convert to lower case because the vocabulary in 'imdb.vocab' is also lower case
    lower = tf.strings.lower(input_data)
    # Replace HTML line-break tags with a space
    space = tf.strings.regex_replace(lower, "<br />", " ")
    # Remove punctuation
    clean_punch = tf.strings.regex_replace(space, '[%s]' % re.escape(string.punctuation), '')
    return clean_punch
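
To see what these cleaning steps do to a single review, the same three steps can be mirrored in plain Python. This is only an illustrative sketch: `clean_text_py` and the sample sentence are hypothetical, and the notebook keeps using the TensorFlow string ops above so that the function can run inside the TextVectorization layer.

```python
import re
import string

def clean_text_py(text):
    # 1. lower-case, 2. replace <br /> tags with a space, 3. strip punctuation
    lowered = text.lower()
    no_breaks = lowered.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", no_breaks)

sample = "Great movie!<br />Loved it."
print(clean_text_py(sample))  # prints: great movie loved it
```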

#### Tokenizing the dataset

* TextVectorization converts text into integers: each word is replaced by its index in the vocabulary. For example, the word "home" is at index 339, so every occurrence of "home" is mapped to 339.
* output_sequence_length is set to 500 because the movie reviews differ in length: reviews longer than 500 words are truncated, and reviews shorter than 500 words are padded.
* clean_text() is passed as the standardize argument, so each review is cleaned first. Each word is then tokenized and mapped to an integer using the vocabulary in 'aclImdb/imdb.vocab'.
* output_mode is set to 'int' so that an integer is assigned to each token.
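
The mapping described in the list above can be sketched without TensorFlow. The sketch assumes the convention that TextVectorization uses in 'int' mode, where index 0 is reserved for padding and index 1 for out-of-vocabulary words. The `vectorize` helper and the four-word vocabulary are hypothetical.

```python
def vectorize(text, vocabulary, sequence_length):
    """Map whitespace-separated tokens to integer ids, then truncate or pad."""
    # 0 = padding, 1 = unknown word, so the first vocabulary word gets id 2.
    index = {word: i + 2 for i, word in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.split()]
    ids = ids[:sequence_length]                       # truncate long reviews
    return ids + [0] * (sequence_length - len(ids))   # pad short reviews

toy_vocab = ["the", "movie", "was", "great"]  # hypothetical vocabulary

print(vectorize("the movie was great", toy_vocab, 6))
print(vectorize("the plot was great and the cast was great", toy_vocab, 6))
```

The first review is shorter than the target length and is padded with zeros; the second is cut after six tokens, and the words missing from the toy vocabulary map to 1.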

In [14]:
numerical_text_layer = TextVectorization(standardize=clean_text, vocabulary='aclImdb/imdb.vocab',
    output_mode='int', output_sequence_length=500)
# print(numerical_text_layer)
print("Word at number 339 :",numerical_text_layer.get_vocabulary()[339])
print("Word at number 56863 :",numerical_text_layer.get_vocabulary()[56863])
vocab_size = len(numerical_text_layer.get_vocabulary())
print('Vocabulary size of file imdb.vocab : ',vocab_size)

Word at number 339 : home
Word at number 56863 : eco
Vocabulary size of file imdb.vocab :  89529


##### Checking one movie review to see how its text is mapped to numbers

In [24]:
print(data)

def print_numerical_text(text, label):
    # Add a trailing dimension, e.g. () -> (1,) for a single review or
    # (batch,) -> (batch, 1) for a batch, as expected by numerical_text_layer.
    text = tf.expand_dims(text, -1)
    # print(text.shape)
    # Map the text to integer token ids and return them together with the label.
    return numerical_text_layer(text), label

text_batch, label_batch = next(iter(data))
# It cleans the dataset
# clean_batch = clean_text(text_batch)
# print('Cleaned:',clean_batch)

print("Batch size of data:",len(text_batch))
review, lbl = text_batch[2], label_batch[2]
print("Movie Review:", review)
print("Movie Review Label:", data.class_names[lbl])
print("Vectorized review", print_numerical_text(review, lbl))



<BatchDataset shapes: ((None,), (None,)), types: (tf.string, tf.int32)>
Batch size of data: 32
Movie Review: tf.Tensor(b"OK maybe a 13 year old like me was a little to old for this movie. Its about this pampered rat, who lives in a palace. Then a sewer rat flushes him down a toilet! He ends up in this rat city and meets this girl rat who has a gem a greedy frog wants. He will do anything for this gem he sends a whole army after these two rats.He plans to take the gem and to flood rat city! THe cool part about this movie is the slugs. They do all the sound effects. They sing, make noises, its awesome, its also pretty funny. OK bottom line, it is aimed at 7 year olds. Other wise, a great movie to take a younger family member to see. I didn't think the animation was real dreamworks art though, more like WAllace and Gromit. i thinkthey slacked a little on that. The movie was just decent, not worth spending $9.50 for though, sorry.", shape=(), dtype=string)
Movie Review Label: neg
Vectorize

##### Converting the reviews of the whole training, validation and test datasets to integers

In [25]:
# Map every word of every review in the training, validation and test data to an integer.
training_data = data.map(print_numerical_text)
validation_data = val_data.map(print_numerical_text)
test_data = test_data.map(print_numerical_text)

### Creating and training the model (First Artificial Neural Network)
I tried a simple neural network with 3 layers.
The first layer is an embedding layer, the second layer is GlobalAveragePooling1D(), and the third layer is the output layer. The output activation is the sigmoid function, which is a good choice for binary classification, and binary cross-entropy is used as the loss function.  
EarlyStopping is used to limit overfitting. It monitors the validation loss: training continues while the validation loss keeps improving and stops once it has failed to improve for 3 consecutive epochs (patience=3).
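
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration rather than the Keras callback itself (it ignores options such as min_delta and restore_best_weights), and the validation losses are made-up numbers, not results from this notebook.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: all epochs were run

# Hypothetical validation losses: the best value is reached in epoch 3.
losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.38, 0.30]
print(early_stopping_epoch(losses, patience=3))
```

Here training stops after epoch 6, three epochs after the last improvement, so the lower loss that would have appeared in epoch 7 is never reached. A larger patience trades extra training time for a smaller chance of stopping too early.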

In [None]:
embedding_dim = 16
ann1 = tf.keras.Sequential([
  layers.Embedding(vocab_size, embedding_dim, input_length=500),
  layers.GlobalAveragePooling1D(),
  layers.Dense(1, activation = 'sigmoid')])
ann1.summary()
ann1.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
epochs = 50
ann1.fit(training_data, validation_data=validation_data, epochs=epochs, 
                  callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)])
# ann1loss, ann1accuracy = ann1.evaluate(test_data)
# print("Loss of Test dataset: ", ann1loss)
# print("Accuracy of Test dataset: ", ann1accuracy)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 500, 16)           1432464   
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 17        
Total params: 1,432,481
Trainable params: 1,432,481
Non-trainable params: 0
_________________________________________________________________
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
 68/625 [==>...........................] - ETA: 3:30 - loss: 0.2295 - accuracy: 0.9254
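
The parameter counts in the summary can be checked by hand: the embedding layer stores one 16-dimensional vector per vocabulary entry, the pooling layer has no weights, and the output layer has one weight per pooled feature plus a bias. A short check using the numbers printed above:

```python
vocab_size = 89529     # vocabulary size reported by numerical_text_layer
embedding_dim = 16

embedding_params = vocab_size * embedding_dim  # one vector per token id
pooling_params = 0                              # averaging has no weights
dense_params = embedding_dim * 1 + 1            # 16 weights + 1 bias

total_params = embedding_params + pooling_params + dense_params
print(embedding_params, dense_params, total_params)
```

Almost all of the trainable parameters sit in the embedding table, so the vocabulary size dominates the size of this model.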

### Second Artificial Neural Network
* For better accuracy, Dropout layers are added to improve generalization, and another dense layer is added for a better result. Each Dropout layer uses a rate of 0.2: during training, 20% of its inputs are randomly set to zero and the remaining 80% are passed to the next layer.
* The Dense layer uses 16 neurons to match the embedding size of 16. I also tried multiple layers and 8 neurons in the hidden layer, but those models performed worse than the model specified below in terms of loss and accuracy.
* To limit overfitting, EarlyStopping is used with a patience of 3 epochs.
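
What a dropout rate of 0.2 does to a layer's output can be sketched as follows. The sketch uses the usual "inverted dropout" formulation, in which the surviving values are scaled by 1 / (1 - rate) during training and nothing is changed at inference time. The `dropout` helper, the seed and the input values are hypothetical.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout on a list of activations."""
    if not training:
        return list(values)          # dropout is inactive at inference time
    keep_scale = 1.0 / (1.0 - rate)  # rescale survivors to keep the expected sum
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)               # arbitrary seed, for repeatability only
activations = [1.0] * 10000          # hypothetical layer output

dropped = dropout(activations, rate=0.2, training=True, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(round(zero_fraction, 2))       # close to the rate of 0.2

print(dropout([0.5, 1.5], rate=0.2, training=False, rng=rng))
```

Because a different random subset is dropped on every training step, the network cannot rely on any single feature, which is what helps generalization.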

In [None]:
embedding_dim = 16
ann = tf.keras.Sequential([
  layers.Embedding(vocab_size, embedding_dim, input_length=500),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(16, activation = 'relu'),
  layers.Dense(1, activation = 'sigmoid')])
ann.summary()
ann.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
epochs = 50
history = ann.fit(training_data, validation_data=validation_data, epochs=epochs, 
                  callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)])
# loss, accuracy = ann.evaluate(test_data)
# print("Loss of Test dataset: ", loss)
# print("Accuracy of Test dataset: ", accuracy)

### Third Model (CNN)
* For better results, a Convolutional Neural Network for Natural Language Processing is tried. It uses one convolution layer and one max pooling layer; the rest of the configuration is the same as in the ANN above. I also tried multiple convolution layers with multiple max pooling layers, but those approaches did not perform as well as the one specified below.
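
The effect of a width-2 convolution followed by max pooling of size 2 can be sketched on a single channel. This is a simplification under stated assumptions: the real Conv1D layer sums over all 16 embedding channels, learns 16 filters and adds a bias, whereas the hypothetical helpers below use one channel, one fixed filter and no bias.

```python
def conv1d_valid(sequence, kernel):
    """Slide the kernel over the sequence without padding, then apply ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(x * w for x, w in zip(window, kernel))
        outputs.append(max(0.0, total))
    return outputs

def max_pool1d(sequence, pool_size):
    """Keep the maximum of each non-overlapping window; drop any remainder."""
    return [max(sequence[i:i + pool_size])
            for i in range(0, len(sequence) - pool_size + 1, pool_size)]

signal = [1.0, 3.0, 2.0, 5.0, 4.0]   # hypothetical single-channel input
kernel = [1.0, 1.0]                  # hypothetical filter of width 2

features = conv1d_valid(signal, kernel)
print(features)
print(max_pool1d(features, 2))

# Sequence lengths for a 500-token review with kernel width 2 and pool size 2.
conv_len = len(conv1d_valid([0.0] * 500, kernel))
pool_len = len(max_pool1d([0.0] * conv_len, 2))
print(conv_len, pool_len)
```

Each convolution output looks at two neighbouring tokens, so the layer can react to word pairs rather than single words, and pooling halves the sequence length before the global average is taken.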

In [None]:
cnn = tf.keras.Sequential([
  layers.Embedding(vocab_size, embedding_dim, input_length=500),
  layers.Conv1D(16, 2, activation='relu'),
  layers.MaxPool1D(2),
  layers.GlobalAveragePooling1D(),
  layers.Dense(16, activation = 'relu'),
  layers.Dense(1, activation = 'sigmoid')])
cnn.summary()
cnn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
epochs = 50
chistory = cnn.fit(training_data, validation_data=validation_data, epochs=epochs, 
                  callbacks = [keras.callbacks.EarlyStopping(monitor='val_loss', patience=3)])
closs, caccuracy = cnn.evaluate(test_data)
print("Loss of Test dataset: ", closs)
print("Accuracy of Test dataset: ", caccuracy)

Of the three models, the second gives the best result because I generalized the model by adding Dropout layers, and it needs fewer epochs than the first model: the second model trains for 7 epochs, whereas the first model trains for 19. As the number of epochs increases, the execution time also increases. The CNN model takes more time to execute because of the convolution layer. The second model therefore has the shortest execution time of the three models tried.  
Comparison of the performance of the 3 models:

| Model | Number of Epochs | Train Accuracy | Validation Accuracy | Train Loss | Validation Loss |
| :-: | :-: | :-: | :-: | :-: | :-: |
| ANN with 3 layers (First ANN) | 19 | 97% | 89% | 0.10 | 0.27 |
| ANN with 6 layers (Second ANN) | 7 | 97% | 89% | 0.09 | 0.30 |
| CNN with 6 layers | 5 | 98% | 89% | 0.05 | 0.36 |

**Checking the word embedding, which is learned after the training of the model**

In [None]:
# Embedding learned 
first_layer=ann.layers[0]
word_embeddings=first_layer.get_weights()[0]
print(word_embeddings.shape)
print("Word embedding of word 'home'")
print(word_embeddings[339][:])

In [None]:
!pip install h5py pyyaml

#### Saving the model

In [None]:
ann.save('models/20912881_NLP_model.h5')

#### References:
- https://www.tensorflow.org/tutorials/keras/text_classification
- https://nptel.ac.in/courses/106/106/106106213/
- https://www.tensorflow.org/tutorials/keras/save_and_load