<a href="https://colab.research.google.com/github/dasari-mohana/IMDB_Sentiment_analysis/blob/main/IMDB_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IMDB dataset

## Description:
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. Provided a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag of words formats are provided. See the README file contained in the release for more details.

## Data Source
Keras provides access to the IMDB dataset built-in. The imdb.load_data() function allows you to load the dataset in a format that is ready for use in neural network and deep learning models.

The words have been replaced by integers that indicate the ordered frequency of each word in the dataset. The sentences in each review are therefore comprised of a sequence of integers.

## Objective:

To predict determine whether a given movie review has a positive or negative sentiment.

In [1]:
# Importing required libraries
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM,Dense,Embedding,Flatten
from tensorflow.keras.preprocessing import sequence

np.random.seed(7)

In [2]:
# Importing dataset
# Keras provides access to the IMDB dataset built-in. The imdb.load_data() function allows you to load the dataset

from tensorflow.keras.datasets import imdb

We will also limit the total number of words that we are interested in modeling to the 5000 most frequent words, and zero out the rest. 

In [3]:
# load the dataset but only keep the top n words, zero the rest

top_words = 5000
(X_train,y_train),(X_test,y_test) = imdb.load_data(num_words=top_words)

The sequence length (no.of words) in each review varies, so we will constrain each review to be 500 words, truncating long reviews and pad the shorter reviews with zero values.

In [4]:
# Truncate and pad input sequences

max_review_length = 500

X_train = sequence.pad_sequences(X_train,maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test,maxlen=max_review_length)

In [5]:
# Create the model

embedding_vector_length = 32

# Buliding a sequential model

model = Sequential()
model.add(Embedding(input_dim=top_words, output_dim=embedding_vector_length, input_length=max_review_length))
model.add(LSTM(units=50))
model.add(Dense(units=1,activation='sigmoid'))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 500, 32)           160000    
                                                                 
 lstm (LSTM)                 (None, 50)                16600     
                                                                 
 dense (Dense)               (None, 1)                 51        
                                                                 
Total params: 176,651
Trainable params: 176,651
Non-trainable params: 0
_________________________________________________________________


In [6]:
# Compiling the model with metric as accuracy and Adam optimizer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X_train,y_train, batch_size=64, epochs=3,validation_data=(X_test,y_test))

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f112fb75a50>

In [7]:
# Performance of the model

scores = model.evaluate(X_test,y_test, verbose=0)
print('Model accuracy : %.2f%%' %(scores[1]*100))

Model accuracy : 87.80%


In [8]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.datasets import imdb
np.random.seed(7)

top_words = 5000
(X_train,y_train),(X_test,y_test) = imdb.load_data(num_words=top_words)

# Truncate and padding sequence
max_review_length = 500
X_train = sequence.pad_sequences(X_train,maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test,maxlen=max_review_length)

In [9]:
# Create the model
embedding_vector_length = 32

model = Sequential()
model.add(Embedding(input_dim=top_words, output_dim=embedding_vector_length, input_length=max_review_length))
model.add(Dropout(0.1))
model.add(LSTM(units=80))
model.add(Dropout(0.1))
model.add(Dense(units=1,activation='sigmoid'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, 500, 32)           160000    
                                                                 
 dropout (Dropout)           (None, 500, 32)           0         
                                                                 
 lstm_1 (LSTM)               (None, 100)               53200     
                                                                 
 dropout_1 (Dropout)         (None, 100)               0         
                                                                 
 dense_1 (Dense)             (None, 1)                 101       
                                                                 
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________


In [10]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics =['accuracy'])
model.fit(X_train,y_train,batch_size=64,epochs=4,validation_data=(X_test,y_test),verbose=0)

# model performance
scores = model.evaluate(X_test,y_test, verbose=0)
print('Model accuaracy %.2f%%' %(scores[1]*100))

Model accuaracy 86.98%


## LSTM Model with Dropout on Gates

In [17]:
np.random.seed(7)

top_words = 5000
(X_train,y_train),(X_test,y_test) = imdb.load_data(num_words=top_words)

# Truncate and padding sequence
max_review_length = 500
X_train = sequence.pad_sequences(X_train,maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test,maxlen=max_review_length)

# Create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(input_dim=top_words,output_dim=embedding_vector_length,input_length=max_review_length))
model.add(Dropout(0.2))
model.add(LSTM(units=90,dropout=0.2,recurrent_dropout=0.2))
model.add(Dense(units=1,activation='relu'))
print(model.summary())

model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model.fit(X_train,y_train,batch_size=64,epochs=2,validation_data=(X_test,y_test),verbose=1)

# model performace
scores = model.evaluate(X_test,y_test,verbose=0)
print('Model accuracy: %.2f%%' %(scores[1]*100))

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_5 (Embedding)     (None, 500, 32)           160000    
                                                                 
 dropout_5 (Dropout)         (None, 500, 32)           0         
                                                                 
 lstm_5 (LSTM)               (None, 90)                44280     
                                                                 
 dense_5 (Dense)             (None, 1)                 91        
                                                                 
Total params: 204,371
Trainable params: 204,371
Non-trainable params: 0
_________________________________________________________________
None
Epoch 1/2
Epoch 2/2
Model accuracy: 84.79%


## LSTM and CNN for sequence classification

In [14]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Convolution1D, MaxPooling1D
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.datasets import imdb
np.random.seed(7)

top_words = 5000
(X_train,y_train),(X_test,y_test) = imdb.load_data(num_words=top_words)

# Truncate and padding sequence
max_review_length = 500
X_train = sequence.pad_sequences(X_train,maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test,maxlen=max_review_length)

# Create the model
embedding_vector_length = 32

model = Sequential()
model.add(Embedding(input_dim=top_words,output_dim=embedding_vector_length,input_length=max_review_length))

model.add(Convolution1D(filters=32,kernel_size=3,padding='same',activation='relu'))
model.add(MaxPooling1D(pool_size=2,padding='same'))

model.add(Dropout(0.2))
model.add(LSTM(units=100,dropout=0.2,recurrent_dropout=0.2))
model.add(Dense(units=1,activation='relu'))
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 500, 32)           160000    
                                                                 
 conv1d (Conv1D)             (None, 500, 32)           3104      
                                                                 
 max_pooling1d (MaxPooling1D  (None, 250, 32)          0         
 )                                                               
                                                                 
 dropout_3 (Dropout)         (None, 250, 32)           0         
                                                                 
 lstm_3 (LSTM)               (None, 100)               53200     
                                                                 
 dense_3 (Dense)             (None, 1)                 101       
                                                      

In [15]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model.fit(X_train,y_train,batch_size=64,epochs=2,validation_data=(X_test,y_test),verbose=0)

# model performace
scores = model.evaluate(X_test,y_test,verbose=0)
print('Model accuracy: %.2f%%' %(scores[1]*100))

Model accuracy: 81.08%


# Conclusion

1.  Developed a simple single layer LSTM model for the IMDB movie review sentiment classification problem.
2.  Extended my LSTM model with layer-wise and LSTM-specific dropout to reduce overfitting.
3.  Finally, combined the spatial structure learning properties of a Convolutional Neural Network with the sequence learning of an LSTM. We can increase the model performance by increasing number of epochs.