## Problems of Natural Language Processing


*   Sentiment analysis of movie reviews is a Natural Language Processing (NLP) problem, so we will use NLP techniques to solve it.
*   The steps we will follow are:
    * Downloading the data
    * Creating train and test sets from the dataset
    * Converting text into numbers using tokenization
    * Sequencing and Padding of input data
    * Building the model
    * Fitting and Evaluating the model



In [15]:
# Checking for GPU
!nvidia-smi -L

GPU 0: Tesla T4 (UUID: GPU-b0845839-174e-628e-1c89-f31d7fb32d9e)


In [4]:
# Importing required libraries

import tensorflow as tf
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [5]:
# Preparation for downloading data from kaggle directly into colab

! pip install -q kaggle
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [6]:
# Downloading and Unzipping the zipped file from Kaggle

!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
!unzip imdb-dataset-of-50k-movie-reviews.zip

Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
100% 25.7M/25.7M [00:02<00:00, 20.2MB/s]
100% 25.7M/25.7M [00:02<00:00, 11.4MB/s]
Archive:  imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        


In [7]:
# Reading the downloaded csv file

data_csv = pd.read_csv("IMDB Dataset.csv")

In [8]:
# Viewing the data 

data_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [9]:
# Creating train and test sets from the data

train_data, train_label = data_csv["review"][:40000], data_csv["sentiment"][:40000]       # 80% for training
test_data, test_label = data_csv["review"][40000:], data_csv["sentiment"][40000:]         # 20% for testing
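
The slice above takes the first 40,000 rows for training, which works well only if the CSV rows are already in a mixed order. If that is not known, one option is to shuffle row positions before splitting. The sketch below is a minimal illustration in plain Python; `shuffled_split` and its `seed` argument are hypothetical names, not part of the original notebook.

```python
import random

def shuffled_split(n_rows, train_fraction=0.8, seed=42):
    """Return shuffled row positions for a train/test split (hypothetical helper)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # fixed seed keeps the split reproducible
    cut = int(n_rows * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, test_idx = shuffled_split(50000)
print(len(train_idx), len(test_idx))
```

With pandas, these positions could then be applied as `data_csv["review"].iloc[train_idx]` and `data_csv["sentiment"].iloc[train_idx]`, and likewise for `test_idx`.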

In [11]:
# Encoding of labels into 0 and 1 integers

def encode_sentiments(sentiment):                # function for labels
  if sentiment == "positive":
    return 1
  else:
    return 0

train_label_encoded = train_label.apply(encode_sentiments)
test_label_encoded = test_label.apply(encode_sentiments)

In [13]:
# Using Tokenizer class for tokenising the input reviews

vocab_size = 10000
oov_token = "<OOV>"

tokenizer = Tokenizer(num_words = vocab_size,               # Utilising Tokenizer class
                      oov_token=oov_token)

# Note: the vocabulary is fitted on all 50,000 reviews, including the test split
tokenizer.fit_on_texts(data_csv["review"])

In [14]:
# Transforming text into sequence of integers

train_sequences = tokenizer.texts_to_sequences(train_data)
test_sequences = tokenizer.texts_to_sequences(test_data)
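
To make the text-to-integer step more concrete, here is a simplified imitation of the idea in plain Python. It is not the Keras implementation: it only lowercases and splits on whitespace, and the helper names `fit_word_index` and `to_sequences` are made up for this sketch. It does follow the same convention of reserving index 1 for the out-of-vocabulary token and mapping any word whose index is not below `num_words` to that token.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    """Build a word -> index map, most frequent words first (hypothetical helper)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}               # index 1 is reserved for unknown words
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index

def to_sequences(texts, word_index, num_words, oov_token="<OOV>"):
    """Replace each word by its index, or by the OOV index if rare or unseen."""
    oov = word_index[oov_token]
    sequences = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            seq.append(idx if idx is not None and idx < num_words else oov)
        sequences.append(seq)
    return sequences

toy_texts = ["good movie", "good plot bad acting", "good good fun"]
toy_index = fit_word_index(toy_texts)
toy_sequences = to_sequences(["good movie unknown"], toy_index, num_words=3)
print(toy_sequences)   # [[2, 1, 1]]
```

Here "good" is the most frequent word and gets index 2, while "movie" falls outside the three-index vocabulary and "unknown" was never seen, so both become the OOV index 1.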

In [16]:
# Padding of sequences to equalize the length of input tensors

max_length = 1000
padding_type='post'
truncation_type='post'

train_padded = pad_sequences(train_sequences,          # Utilising pad_sequences function
                             maxlen=max_length, 
                             padding=padding_type,
                             truncating=truncation_type
                             )

test_padded = pad_sequences(test_sequences,
                            maxlen=max_length, 
                            padding=padding_type,
                            truncating=truncation_type
                            )
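
The effect of `padding='post'` and `truncating='post'` can be illustrated on a toy input. The function below is a plain-Python sketch of that behaviour, not the Keras code, and `pad_post` is a hypothetical name: short sequences get zeros appended, long ones lose their trailing tokens.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad or truncate every sequence at the end so all have length maxlen."""
    padded = []
    for seq in sequences:
        kept = list(seq[:maxlen])                  # 'post' truncation drops the tail
        kept += [value] * (maxlen - len(kept))     # 'post' padding appends zeros
        padded.append(kept)
    return padded

toy_padded = pad_post([[5, 3], [7, 8, 9, 4, 2]], maxlen=4)
print(toy_padded)   # [[5, 3, 0, 0], [7, 8, 9, 4]]
```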

In [17]:
# Creating the model 

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 16, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),                           
    tf.keras.layers.Dense(32, activation="relu"),    
    tf.keras.layers.Dense(1, activation="sigmoid")                      # sigmoid because of binary classification
])
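
As a sanity check on the architecture, the number of trainable parameters can be worked out by hand from the layer sizes used above (a vocabulary of 10,000, 16-dimensional embeddings, a 32-unit hidden layer, one output unit). The arithmetic below is a small sketch of that calculation; the result should match what `model.summary()` reports.

```python
vocab_size, embedding_dim = 10000, 16              # values from the Embedding layer

embedding_params = vocab_size * embedding_dim      # one 16-d vector per token id
pooling_params = 0                                 # averaging has no weights
hidden_params = embedding_dim * 32 + 32            # weights + biases of Dense(32)
output_params = 32 * 1 + 1                         # weights + bias of Dense(1)

total_params = embedding_params + pooling_params + hidden_params + output_params
print(total_params)   # 160577
```

Almost all of the parameters sit in the embedding table, so the vocabulary size and embedding dimension dominate the model size.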

In [18]:
# Compiling the model

model.compile(optimizer="Adam",
              loss="binary_crossentropy",
              metrics=["accuracy"]
              )
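
For a single example with label y and predicted probability p, binary cross-entropy is -(y·log(p) + (1-y)·log(1-p)). The sketch below evaluates that formula in plain Python to show how the loss rewards confident correct predictions and punishes confident wrong ones. It is a simplification: Keras also clips probabilities away from 0 and 1 and averages over the batch.

```python
import math

def binary_crossentropy(y_true, p_pred):
    """Per-example binary cross-entropy for a label in {0, 1}."""
    return -(y_true * math.log(p_pred) + (1 - y_true) * math.log(1 - p_pred))

confident_right = binary_crossentropy(1, 0.9)   # positive review, predicted 0.9
confident_wrong = binary_crossentropy(1, 0.1)   # positive review, predicted 0.1
print(round(confident_right, 4), round(confident_wrong, 4))   # 0.1054 2.3026
```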

In [19]:
# Fitting the model

model.fit(train_padded,
          train_label_encoded,
          epochs=30,
          batch_size = 512,
          validation_data = (test_padded, test_label_encoded)
          )

Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x7f7c1cb48040>

In [21]:
# Evaluating the model performance

model.evaluate(test_padded, test_label_encoded)



[0.26453840732574463, 0.9003000259399414]
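
The list returned by `model.evaluate` follows the order given at compile time: the loss first, then each metric. The output above therefore reads as a binary cross-entropy of about 0.265 and an accuracy of about 0.900 on the test set. For a sigmoid output, accuracy counts a prediction as positive when the probability exceeds 0.5. The sketch below reproduces that calculation on made-up probabilities; `binary_accuracy` is a hypothetical helper and the toy values are not taken from the model.

```python
def binary_accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == label) for pred, label in zip(predictions, labels))
    return correct / len(labels)

toy_probabilities = [0.91, 0.12, 0.55, 0.40]   # made-up sigmoid outputs
toy_labels = [1, 0, 0, 1]                      # made-up true sentiments
toy_accuracy = binary_accuracy(toy_probabilities, toy_labels)
print(toy_accuracy)   # 0.5
```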