<a href="https://colab.research.google.com/github/sarthakvinayaka/NLP-Project/blob/master/IMDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **IMDB-Movie-reviews-sentiment-classification**
The dataset contains 50,000 movie reviews from the Internet Movie Database (IMDB), each labeled as positive or negative.

**The task is to build a prediction model that accurately classifies each review as positive or negative.**

## **Steps:**
- Importing Libraries
- Downloading dataset
- Preprocessing data
 - Tokenize 
 - Text to sequence
 - Padding or truncating 
- Building model
- Training and validating model
- Predicting the sentiment of a review from the user
- Conclusion


## **Importing Libraries**

In [0]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow_datasets as tfds
import numpy as np

## **Downloading Dataset**

The IMDB dataset contains 50,000 movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset with substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing.

In [0]:
#Downloading imdb data set from tensorflow_datasets

imdb,info=tfds.load("imdb_reviews",with_info=True,as_supervised=True)

## **Preprocessing data**

**Splitting the dataset into train and test**

In [0]:
# Split the dataset into train and test

train_data,test_data=imdb['train'],imdb['test']



In [0]:

training_sentences=[]
training_labels=[]
testing_sentences=[]
testing_labels=[]


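**Collecting sentences and labels from the train and test data**

In [0]:
# Append the sentences and labels from the train and test sets to our lists.
# Note: str() on a bytes object keeps the b'...' wrapper, which is why 'b'
# later shows up as a token; s.numpy().decode('utf-8') would avoid that.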
for s,l in train_data:
  training_sentences.append(str(s.numpy()))
  training_labels.append(l.numpy())
  
for s,l in test_data:
  testing_sentences.append(str(s.numpy()))
  testing_labels.append(l.numpy())
  
    

  

**Converting labels to NumPy arrays**

In [0]:
training_labels_final=np.array(training_labels)
testing_labels_final=np.array(testing_labels)

**Checking the shape of the training sentences and training labels using NumPy**

In [0]:
print(np.shape(training_labels_final))
print(np.shape(training_sentences))

(25000,)
(25000,)


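**Defining variables**

vocab_size : Upper limit on the number of different words the tokenizer keeps. Only the most frequent words get their own integer index; every other word is mapped to the out-of-vocabulary token.

embedding_dim : Size of the vector that the embedding layer learns for each word index.

max_length : Maximum length of a sentence, in tokens, after padding or truncating.

trunc_type : Where to truncate a sentence that exceeds max_length: 'pre' cuts from the beginning, 'post' cuts from the end.

oov_tok : Placeholder token that replaces words outside the vocabulary.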


In [0]:
vocab_size=10000
embedding_dim=16
max_length=120
trunc_type='post'
oov_tok='<OOV>'


**Tokenize sequence**

 Tokenization is the task of chopping a piece of text up into smaller pieces, called tokens.

 Eg.
 Input - How are you

 Output - 'How', 'are', 'you' 

In [0]:

tokenizer=Tokenizer(num_words=vocab_size,oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)

**Text to sequence**


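texts_to_sequences transforms each text in texts into a sequence of integers. It takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary. Because the tokenizer was created with num_words=vocab_size, only the most frequent words keep their own index; all other words are replaced by the index of the OOV token (1). Nothing more, nothing less, certainly no magic involved.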
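
To make the mapping concrete, here is a minimal pure-Python sketch of the same idea. It is not the Keras implementation: the helper names `build_word_index` and `toy_texts_to_sequences` are hypothetical, and splitting on whitespace is a simplifying assumption (the real Tokenizer also filters punctuation).

```python
from collections import Counter

def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; index 1 is reserved for the OOV token."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def toy_texts_to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Replace each word by its index; unknown or too-rare words become OOV."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word, oov)
            # Only indices below num_words are kept, mirroring a vocabulary cap.
            seq.append(idx if idx < num_words else oov)
        sequences.append(seq)
    return sequences

toy_texts = ["the movie was good", "the movie was bad", "the plot was thin"]
toy_index = build_word_index(toy_texts)
toy_sequences = toy_texts_to_sequences(["the movie was brilliant"], toy_index, num_words=4)
print(toy_index)
print(toy_sequences)
```

With the cap set to 4, only the two most frequent words keep their own index; "movie" (too rare for the cap) and "brilliant" (never seen) both fall back to the OOV index.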

In [0]:
word_index=tokenizer.word_index
sequences=tokenizer.texts_to_sequences(training_sentences)

**Example of text to sequence**

Here we print the first training sentence both as text and as the sequence of integers produced by the texts_to_sequences method in the cell above, followed by its label.

In [12]:
print(sequences[0])
print(training_sentences[0])
print(training_labels_final[0])

[59, 12, 14, 35, 439, 400, 18, 174, 29, 1, 9, 33, 1378, 3401, 42, 496, 1, 197, 25, 88, 156, 19, 12, 211, 340, 29, 70, 248, 213, 9, 486, 62, 70, 88, 116, 99, 24, 5740, 12, 3317, 657, 777, 12, 18, 7, 35, 406, 8228, 178, 2477, 426, 2, 92, 1253, 140, 72, 149, 55, 2, 1, 7525, 72, 229, 70, 2962, 16, 1, 2880, 1, 1, 1506, 4998, 3, 40, 3947, 119, 1608, 17, 3401, 14, 163, 19, 4, 1253, 927, 7986, 9, 4, 18, 13, 14, 4200, 5, 102, 148, 1237, 11, 240, 692, 13, 44, 25, 101, 39, 12, 7232, 1, 39, 1378, 1, 52, 409, 11, 99, 1214, 874, 145, 10]
b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love af

**Padding sequence**

The pad_sequences() function in the Keras deep learning library pads variable-length sequences to a common length.

It can pad sequences to a preferred length that may be longer than any observed sequence, and it truncates sequences that are longer than that length.

The length is set with the “maxlen” argument. By default, zeros are added at the start of each shorter sequence ('pre' padding) and longer sequences are cut at the start; the padding and truncating arguments switch either behaviour to 'post'.
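
The difference between 'pre' and 'post' is easy to see on a toy list. The function below is a simplified stand-in written for illustration (the name `pad_or_truncate` is hypothetical); it handles a single sequence and assumes the same defaults as pad_sequences, namely 'pre' for both padding and truncating.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Bring one sequence to exactly maxlen items."""
    if len(seq) > maxlen:
        # 'pre' drops items from the start, 'post' drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    # 'pre' puts the padding in front, 'post' puts it behind.
    return pad + list(seq) if padding == "pre" else list(seq) + pad

short_seq = [5, 6, 7]
long_seq = [1, 2, 3, 4, 5, 6]

padded_short = pad_or_truncate(short_seq, maxlen=5)
cut_post = pad_or_truncate(long_seq, maxlen=5, truncating="post")
cut_pre = pad_or_truncate(long_seq, maxlen=5)

print(padded_short)
print(cut_post)
print(cut_pre)
```

Which end to cut is a modelling choice: 'post' truncation keeps the opening of a review, 'pre' truncation keeps its ending.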

In [0]:

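# Training sequences longer than max_length are cut at the end ('post').
padded=pad_sequences(sequences,maxlen=max_length,truncating=trunc_type)
testing_sequences=tokenizer.texts_to_sequences(testing_sentences)
# No truncating argument here, so test sequences use the default ('pre') and are cut at the start.
testing_padded=pad_sequences(testing_sequences,maxlen=max_length)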

**Comparing padded sequence with normal sequence**

In the cell below, the padded output has two zeros added at the start so that the sequence reaches the uniform length defined by the max_length variable.

In [14]:
print(sequences[0])
print(padded[0])


[59, 12, 14, 35, 439, 400, 18, 174, 29, 1, 9, 33, 1378, 3401, 42, 496, 1, 197, 25, 88, 156, 19, 12, 211, 340, 29, 70, 248, 213, 9, 486, 62, 70, 88, 116, 99, 24, 5740, 12, 3317, 657, 777, 12, 18, 7, 35, 406, 8228, 178, 2477, 426, 2, 92, 1253, 140, 72, 149, 55, 2, 1, 7525, 72, 229, 70, 2962, 16, 1, 2880, 1, 1, 1506, 4998, 3, 40, 3947, 119, 1608, 17, 3401, 14, 163, 19, 4, 1253, 927, 7986, 9, 4, 18, 13, 14, 4200, 5, 102, 148, 1237, 11, 240, 692, 13, 44, 25, 101, 39, 12, 7232, 1, 39, 1378, 1, 52, 409, 11, 99, 1214, 874, 145, 10]
[   0    0   59   12   14   35  439  400   18  174   29    1    9   33
 1378 3401   42  496    1  197   25   88  156   19   12  211  340   29
   70  248  213    9  486   62   70   88  116   99   24 5740   12 3317
  657  777   12   18    7   35  406 8228  178 2477  426    2   92 1253
  140   72  149   55    2    1 7525   72  229   70 2962   16    1 2880
    1    1 1506 4998    3   40 3947  119 1608   17 3401   14  163   19
    4 1253  927 7986    9    4   18   13   1

**Integer value corresponding to words**

The cell below lists the different words with their corresponding unique index values, which the tokenizer built when we called fit_on_texts in one of the cells above.

In [15]:
tokenizer.word_index

{'<OOV>': 1,
 'the': 2,
 'and': 3,
 'a': 4,
 'of': 5,
 'to': 6,
 'is': 7,
 'br': 8,
 'in': 9,
 'it': 10,
 'i': 11,
 'this': 12,
 'that': 13,
 'was': 14,
 'as': 15,
 'for': 16,
 'with': 17,
 'movie': 18,
 'but': 19,
 'film': 20,
 "'s": 21,
 'on': 22,
 'you': 23,
 'not': 24,
 'are': 25,
 'his': 26,
 'he': 27,
 'have': 28,
 'be': 29,
 'one': 30,
 'all': 31,
 'at': 32,
 'by': 33,
 'they': 34,
 'an': 35,
 'who': 36,
 'so': 37,
 'from': 38,
 'like': 39,
 'her': 40,
 "'t": 41,
 'or': 42,
 'just': 43,
 'there': 44,
 'about': 45,
 'out': 46,
 "'": 47,
 'has': 48,
 'if': 49,
 'some': 50,
 'what': 51,
 'good': 52,
 'more': 53,
 'very': 54,
 'when': 55,
 'she': 56,
 'up': 57,
 'can': 58,
 'b': 59,
 'time': 60,
 'no': 61,
 'even': 62,
 'my': 63,
 'would': 64,
 'which': 65,
 'story': 66,
 'only': 67,
 'really': 68,
 'see': 69,
 'their': 70,
 'had': 71,
 'were': 72,
 'me': 73,
 'well': 74,
 'we': 75,
 'than': 76,
 'much': 77,
 'been': 78,
 'get': 79,
 'bad': 80,
 'will': 81,
 'people': 82,
 'do': 83,

## **Model Building**

Here we use an LSTM layer wrapped in a Bidirectional layer, so the sequence is read in both directions.

 ***What is LSTM?***

LSTM stands for long short-term memory. It is an architecture that extends the memory of recurrent neural networks. A plain recurrent neural network has only ‘short-term memory’: at each step it combines the current input with a hidden state carried over from the previous step. The network therefore never sees a list of all earlier inputs, only that compressed summary, and information from far back in the sequence tends to fade.

 ***How does LSTM work?***

LSTM introduces long-term memory into recurrent neural networks. It mitigates the vanishing gradient problem, in which the network stops learning because the updates to its weights become smaller and smaller. It does this by using a series of ‘gates’. These are contained in memory blocks, which are connected through layers.

There are three types of gates within a unit:


1.   Input Gate: Scales input to cell (write)
2.   Output Gate: Scales output from cell (read)
3.   Forget Gate: Scales old cell value (reset)

Each gate is like a switch that controls reading from and writing to the cell, thus incorporating the long-term memory function into the model.
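
The three gates can be sketched as one time step of a toy LSTM cell. This is an illustration under simplifying assumptions, not the Keras layer: the input, hidden state and cell state are single numbers instead of vectors, and the function name `lstm_step` and the weight values are made up for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell.

    `w` maps a gate name to a tuple (input weight, recurrent weight, bias).
    """
    def pre_activation(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))        # input gate: how much to write
    f = sigmoid(pre_activation("forget"))       # forget gate: how much old state to keep
    o = sigmoid(pre_activation("output"))       # output gate: how much to read out
    g = math.tanh(pre_activation("candidate"))  # candidate value to be written

    c = f * c_prev + i * g   # new cell state (long-term memory)
    h = o * math.tanh(c)     # new hidden state (what the next layer sees)
    return h, c

weights = {name: (0.5, 0.5, 0.0) for name in ("input", "forget", "output", "candidate")}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, weights)
    print(round(h, 4), round(c, 4))
```

When the forget gate is close to 1 and the input gate close to 0, the cell state is carried forward almost unchanged, which is how the cell holds on to information over many steps.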

**No. of layers**

The model has an embedding layer, a bidirectional LSTM layer with 32 units per direction, one hidden dense layer with 6 units and relu activation, and an output layer with a single unit and sigmoid activation.




In [0]:

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
    
 
    

In [17]:
##Summary of model
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 16)           160000    
_________________________________________________________________
bidirectional (Bidirectional (None, 64)                12544     
_________________________________________________________________
dense (Dense)                (None, 6)                 390       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 7         
Total params: 172,941
Trainable params: 172,941
Non-trainable params: 0
_________________________________________________________________
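
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM parameterisation: four weight sets per direction, each with input weights, recurrent weights and a bias. The variable names are chosen for this example.

```python
vocab_size, embedding_dim, lstm_units, dense_units = 10000, 16, 32, 6

# Embedding: one vector of size embedding_dim per vocabulary index.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 weight sets (3 gates + candidate), each with
# input weights, recurrent weights and one bias per unit.
lstm_one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
bidirectional_params = 2 * lstm_one_direction

# Dense layers: weights plus one bias per output unit.
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + bidirectional_params + dense_params + output_params
print(embedding_params, bidirectional_params, dense_params, output_params, total_params)
```

The dense layer receives 64 inputs because the Bidirectional wrapper concatenates the outputs of the forward and backward LSTMs.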


## **Training and validating model**

1.   We train for 5 epochs
2.   Number of training samples = number of validation samples = 25,000 (the test split is used as validation data)



In [18]:
num_epochs = 5
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f4081903b38>

## **Predicting a sentence category**

Here we take a sentence from the user, convert it into a sequence of integers, pad it, and predict its sentiment using the model trained above.

In [0]:
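sentence = "boring"
sequence = tokenizer.texts_to_sequences([sentence])[0]

sequence=pad_sequences([sequence],maxlen=max_length,padding='pre')

# predict_classes is not available in newer TensorFlow releases; thresholding
# the sigmoid output at 0.5 gives the same class labels.
k=int(model.predict(sequence,verbose=0)[0][0] > 0.5)
if k==0:
   print("Negative Review")
else:
   print("Positive Review")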

Negative Review


## **Conclusion**

After preprocessing the dataset, we trained a bidirectional LSTM model on it and obtained a validation accuracy of **83 %**. This could be increased further by trying different algorithms and by using k-fold cross-validation on the dataset.
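
As a rough illustration of the k-fold idea, the sketch below only produces the index splits; training a model on each split is left out. The helper name `k_fold_indices` is hypothetical, and the folds are taken in order without shuffling, which is a simplification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Each fold would be used once for validation while the model is trained on the remaining folds, and the k validation scores are then averaged.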