Importing the Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from glob import glob
from sklearn.model_selection import train_test_split
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds

Loading Dataset

In [None]:
df = pd.read_csv("/content/IMDB Dataset.csv")

Statistics Analysis

In [None]:
df.head(5)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [None]:
df.tail(5)

Unnamed: 0,review,sentiment
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative
49999,No one expects the Star Trek movies to be high...,negative


In [None]:
df.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,negative
freq,5,25000


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


# Splitting the dataset into train and test

In [None]:
train_data, validation_data, test_data = tfds.load(
    name="imdb_reviews", 
    split=('train[:75%]', 'train[75%:]', 'test'),
    as_supervised=True)

In [None]:
print("Training Data: {}".format(len(train_data)))
print("Validation Data: {}".format(len(validation_data)))
print("Test Data: {}".format(len(test_data)))

Training Data: 18750
Validation Data: 6250
Test Data: 25000


To Check Batch of Labels 

We can see two possible labels, 0 and 1.0

0 - Negative Reviews and 1 - Positive Reviews 

In [None]:
train_example_batch, train_labels_batch = next(iter(train_data.batch(5)))
print(train_example_batch, "\n\n", train_labels_batch)

tf.Tensor(
[b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it."
 b'I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot de

We are using Pre-trained Embedding Layer from TensorFlow Hub

Downloading and Defining the Embedding Layers from TFHub

In [31]:
embedding = "https://tfhub.dev/google/nnlm-en-dim50/2"
hub_layer = hub.KerasLayer(embedding, input_shape=[], #input shape is a list
                           dtype=tf.string, trainable=True)

Prints the output when passing three examples through the layer

In [32]:
hub_layer(train_example_batch[3:])

<tf.Tensor: shape=(2, 50), dtype=float32, numpy=
array([[ 4.70498234e-01,  5.04806712e-02,  2.35127181e-01,
         4.32975799e-01, -9.32876244e-02, -1.38524815e-01,
         5.53822902e-04, -9.48968381e-02, -3.88939619e-01,
         3.01228583e-01,  1.04856193e-01,  9.48758423e-02,
        -4.81238328e-02,  3.37131359e-02,  1.76941991e-01,
        -5.16853631e-01, -1.71630889e-01,  2.69593392e-02,
        -7.99699873e-02, -4.96775538e-01,  7.68390298e-02,
        -4.07506138e-01,  1.05867445e-01,  4.93401259e-01,
        -4.78378199e-02,  4.29327220e-01, -7.04555035e-01,
        -7.12106079e-02,  1.44147530e-01, -3.55651975e-01,
        -1.49616271e-01, -4.13918197e-02, -2.51194723e-02,
        -1.88087270e-01, -2.37523809e-01,  1.32853195e-01,
         1.41783431e-01,  3.13279063e-01,  1.49204180e-01,
        -7.55418539e-01, -9.40273777e-02, -1.78432092e-01,
        -2.09861219e-01,  7.48737007e-02, -6.61607608e-02,
        -3.99357304e-02, -1.40112877e-01,  2.31311023e-02,
       

We defining the Model Structure

The first layer is the embedding layer, followed by a dense/linear layer with 16 hidden nodes, and then a final dense layer with 1 node.

The beauty of Keras is that is requires very little code to create powerful Neural Networks.

In [33]:
model = tf.keras.Sequential()
model.add(hub_layer)
model.add(tf.keras.layers.Dense(16, activation='relu')) #note relu maps all the negative values to zero
model.add(tf.keras.layers.Dense(1, activation='sigmoid')) #sigmoid maps values between 0 - 1

In [34]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
keras_layer (KerasLayer)     (None, 50)                48190600  
_________________________________________________________________
dense (Dense)                (None, 16)                816       
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 48,191,433
Trainable params: 48,191,433
Non-trainable params: 0
_________________________________________________________________


Finally we compile and train the model

We are using ADAM optimizer, binary crossentropy loss function (as we are predicting labels either 0 or 1), and evaluting the model performance based on its accuracy on the labels.

In [35]:
model.compile(optimizer='adam', loss=tf.keras.losses.BinaryCrossentropy(from_logits=True), metrics=['accuracy'])

In [36]:
history = model.fit(train_data.shuffle(10000).batch(512), epochs=10,
                    validation_data=validation_data.batch(512), verbose=1)

Epoch 1/10


  '"`binary_crossentropy` received `from_logits=True`, but the `output`'


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


I had to look into the tf.data.Dataset documentation to understand what was going on when we batched a tf dataset, as I thought that it was only evaluating 512 test_data examples and not all the test data.

In [37]:
dataset = tf.data.Dataset.range(8)
dataset = dataset.batch(3) #or with batch shuffle ---dataset = dataset.shuffle(3).batch(3)
list(dataset.as_numpy_iterator())

[array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]

Evaluating Model Performance on the Test Dataset

In [38]:
results = model.evaluate(test_data.batch(64), verbose=1)

for name, value in zip(model.metrics_names, results):
  print("%s: %3f" % (name, value))

loss: 0.375324
accuracy: 0.860400


# We can see that the model performed pretty well on the test_data and scores ~85%. Thats pretty good! Hopefully with some improvements in the model structure, and the embedding layer we can make a model that scores >90%!

# ---



Finally, we have a trained model and can pass unseen data into the model to make predictions. As the final layer of the model uses a sigmoid activation function the output will be mapped between 0-1.

The output of the model is a prediction on the likelihood that the text is a positive review.

In [39]:
examples = [
            'this is such an amaxing movie!', #this is same sentence tried earlier
            'The movie was great',
            'The movie was decent',
            'The movie was okies',
            'The movie was so awful and terrible....'
]

In [41]:
#tf.sigmoid maps values between 0 - 1
#tf.constant creates a tensor from a tensor-like object

original_results = model(tf.constant(examples))

for (x, y) in zip(original_results, examples):
  print("Input: {} --- {:.2f}".format(y, *[0]))


Input: this is such an amaxing movie! --- 0.00
Input: The movie was great --- 0.00
Input: The movie was decent --- 0.00
Input: The movie was okies --- 0.00
Input: The movie was so awful and terrible.... --- 0.00


Saving the model weights so we can use the model in the future

In [42]:
model.save_weights("keras_NLP_basic_weights.h5")

Closing Thoughts

I would like to experiment with Tensorflow using a lower level API, and not relying on the higher level API in Keras.

That being said, Keras is amazing at prototyping and getting started on a wide range of machine learning problems. I think it is beneficial to start with keras to get a sense of the problem you are working with, and then work down from there. I think it would be interesting to mess with other embedding layers, create a custom embedding layer, experiment with different model structures, and training parameters.