<a href="https://colab.research.google.com/github/Utomi-Tom/Movie-Binary-classification/blob/main/IMBD_movie_classification_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction to the IMDB review project
The IMDB dataset consists of 50,000 highly polarized reviews from the
Internet Movie Database. They’re split into 25,000 reviews for training and 25,000
reviews for testing, each set consisting of 50% negative and 50% positive reviews.


Just like the MNIST dataset, the IMDB dataset comes packaged with Keras. It has
already been preprocessed: the reviews (sequences of words) have been turned into
sequences of integers, where each integer stands for a specific word in a dictionary.

# Importing library


The following code will load the dataset (when you run it the first time, about
80 MB of data will be downloaded to your machine).

In [1]:
import tensorflow as tf
from tensorflow import keras

import numpy as np

In [2]:
# Import dataset 
(x_tr, y_tr), (x_tt, y_tt) = keras.datasets.imdb.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


The argument *num_words=10000* means you’ll only keep the 10,000 most frequently
occurring words in the training data. Rarer words are not kept as distinct entries;
with the default settings they are replaced by a single "unknown word" index. This
allows you to work with vector data of manageable size.
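
As a rough illustration of what that cap does, the sketch below mimics it in plain Python on made-up sequences. It assumes the Keras default of replacing out-of-range words with index 2 (`oov_char`). `cap_vocabulary` and `toy_reviews` are hypothetical names used only for this example, not part of Keras.

```python
def cap_vocabulary(sequences, num_words, oov_char=2):
    """Replace every word index >= num_words with oov_char (hypothetical helper)."""
    return [[w if w < num_words else oov_char for w in seq] for seq in sequences]

# Made-up encoded reviews; 16000 and 10000 fall outside the top 10,000 words
toy_reviews = [[1, 14, 22, 16000], [1, 9999, 10000, 7]]
capped = cap_vocabulary(toy_reviews, num_words=10000)
largest_index = max(max(seq) for seq in capped)

print(capped)
print(largest_index)
```

After capping, no index can exceed `num_words - 1`, which is why a 10,000-dimensional vector is enough to represent each review later on.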

In [3]:
word_index = keras.datasets.imdb.get_word_index()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb_word_index.json


The code above fetches the word index, which I use to decode one of these
reviews back to English words.


*get_word_index* returns a dictionary mapping
words to integer indices.

In [4]:
r_word_ind = dict([(value, key) for (key, value) in word_index.items()])

The code above inverts the mapping, so that
integer indices map back to words.

In [5]:
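# Join with spaces; indices are offset by 3 because 0, 1 and 2 are reserved (padding, start, unknown)
decoded_review = " ".join([r_word_ind.get(i - 3, "?") for i in x_tr[3]])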

In [6]:
decoded_review
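
The offset of 3 in the decoding step exists because, with the Keras defaults, indices 0, 1 and 2 are reserved for padding, start of sequence and unknown words. Here is a minimal sketch with a made-up three-word vocabulary; `toy_word_index` and `encoded` are assumptions for illustration and are not taken from the dataset.

```python
# Hypothetical word index in the same shape as the one returned by get_word_index()
toy_word_index = {"the": 1, "film": 2, "great": 3}
reverse_index = {index: word for word, index in toy_word_index.items()}

# Encoded review: start marker (1), "the" (1+3), "film" (2+3), unknown (2), "great" (3+3)
encoded = [1, 4, 5, 2, 6]

# Reserved indices have no entry after subtracting 3, so they fall back to "?"
decoded = " ".join(reverse_index.get(i - 3, "?") for i in encoded)
print(decoded)
```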




The variables **x_tr and x_tt** are lists of reviews; each review is a list of
word indices (encoding a sequence of words). **y_tr and y_tt** are
lists of 0s and 1s, where 0 stands for negative and 1 stands for positive.

# Preparing the data

You cannot feed lists of integers into a neural network, so the lists have to be turned into tensors. There are two ways to do that:


*  Pad your lists so that they all have the same length, turn them into an integer
tensor of shape (samples, word_indices), and then use as the first layer in
your network a layer capable of handling such integer tensors. 
*  One-hot encode your lists to turn them into vectors of 0s and 1s. This would
mean, for instance, turning the sequence [3, 5] into a 10,000-dimensional vector
that would be all 0s except for indices 3 and 5, which would be 1s. Then you
could use as the first layer in your network a Dense layer, capable of handling
floating-point vector data.
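
The first option is not used in this notebook, but a rough sketch may make it concrete. The helper below, `pad_to_length`, is a hypothetical name: it truncates long sequences and pads short ones with zeros at the end. Keras ships its own padding utility, which pads at the front by default, so treat this as an illustration of the idea rather than a drop-in replacement.

```python
import numpy as np

def pad_to_length(sequences, maxlen, pad_value=0):
    """Return an integer array of shape (samples, maxlen) (hypothetical helper)."""
    padded = np.full((len(sequences), maxlen), pad_value, dtype="int64")
    for i, seq in enumerate(sequences):
        trimmed = seq[:maxlen]              # drop anything past maxlen
        padded[i, :len(trimmed)] = trimmed  # the rest of the row stays pad_value
    return padded

batch = pad_to_length([[3, 5], [7, 2, 9, 4, 1]], maxlen=4)
print(batch)
print(batch.shape)
```

An integer tensor like this keeps word order, which is why it needs a first layer that can work with word indices rather than a plain Dense layer.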



Let’s go with the latter solution and vectorize the data manually, for
maximum clarity.


In [7]:
# Encoding the integer sequences into a binary matrix

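def vectorize_seq(sequences, dim=10000):
  # One row per review, one column per possible word index
  results = np.zeros((len(sequences), dim))
  for i, sequence in enumerate(sequences):
    # Set the columns for the word indices in this review to 1
    results[i, sequence] = 1
  return results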
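
To see what this encoding produces, here is the same idea on a small scale, using a 10-dimensional vector instead of 10,000 dimensions. `multi_hot` is a hypothetical stand-in written for this example, and the input sequences are made up.

```python
import numpy as np

def multi_hot(sequences, dim):
    """Binary matrix with a 1 wherever a word index occurs (hypothetical helper)."""
    results = np.zeros((len(sequences), dim))
    for row, indices in enumerate(sequences):
        results[row, indices] = 1.0
    return results

# The second sequence repeats index 3 on purpose
encoded = multi_hot([[3, 5], [0, 3, 3]], dim=10)
print(encoded[0])
print(encoded[1].sum())
```

The sequence `[3, 5]` becomes a vector that is zero everywhere except at positions 3 and 5. Repeated indices collapse into a single 1, so this encoding records which words occur but not how often or in what order.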

In [8]:
train_data = vectorize_seq(x_tr)
print(train_data)

[[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 ...
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]]


In [None]:
test_data = vectorize_seq(x_tt)
print(test_data)

[[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 ...
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]]


In [10]:
# Vectorize labels as well

label_train = np.asarray(y_tr).astype("float32")
label_test = np.asarray(y_tt).astype("float32")

In [19]:
len(train_data)

25000

In [23]:
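# Split the 25,000 vectorized training reviews into three non-overlapping parts
x_tr = train_data[:12000]
y_tr = label_train[:12000]

val_train = train_data[12000:20000]
val_label = label_train[12000:20000]

# Held-out portion of the training data (separate from test_data / label_test)
test_train = train_data[20000:25000]
test_label = label_train[20000:25000]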
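
For validation scores to be meaningful, the training, validation and held-out portions should not share any samples. The sketch below checks this on a stand-in array of 25 sample ids rather than the real data; the split points are scaled-down assumptions chosen only for illustration.

```python
import numpy as np

sample_ids = np.arange(25)  # stand-in for 25 samples

# Python slices exclude the end index, so consecutive slices reuse the same boundary
train_part = sample_ids[:12]
val_part = sample_ids[12:20]
held_out_part = sample_ids[20:]

overlap = (set(train_part.tolist()) & set(val_part.tolist())) | (
    set(val_part.tolist()) & set(held_out_part.tolist())
)
total = len(train_part) + len(val_part) + len(held_out_part)
print(total, overlap)
```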

Now the data is ready to be fed into a neural network.

# Building your network

In [11]:
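# Use the Keras bundled with TensorFlow, matching the tf.keras optimizer and loss used below
from tensorflow.keras import models
from tensorflow.keras import layers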

In [30]:
# Defining Model architecture
model= models.Sequential()
model.add(layers.Dense(16, activation="relu", input_shape=(10000, )))
model.add(layers.Dense(16, activation= "relu"))
model.add(layers.Dense(1, activation="sigmoid"))
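
As a quick sanity check on the size of this architecture, the number of trainable parameters can be worked out by hand: each Dense layer has one weight per input-output pair plus one bias per output unit. The short calculation below assumes only the layer sizes used above (10,000 inputs, then 16, 16 and 1 units).

```python
# Input width followed by the number of units in each Dense layer
layer_sizes = [10000, 16, 16, 1]

# weights (n_in * n_out) plus biases (n_out) for each consecutive pair of sizes
params_per_layer = [
    n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total_params = sum(params_per_layer)
print(params_per_layer, total_params)
```

Almost all of the parameters sit in the first layer, because it connects every one of the 10,000 input dimensions to each of its 16 units.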

In [31]:
# Instantiate Optimizer and Loss functions

optimizer = tf.keras.optimizers.RMSprop()
# The output layer already applies a sigmoid, so the loss receives probabilities, not logits
loss = tf.keras.losses.BinaryCrossentropy(from_logits=False)
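
Because the last layer of the model applies a sigmoid, its outputs are probabilities rather than raw logits, and the loss has to be told which of the two it receives. The NumPy sketch below is a simplified stand-in for binary cross-entropy, not the Keras implementation. It uses made-up labels and scores to show what happens when probabilities are mistaken for logits and squashed a second time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_probs(y_true, p, eps=1e-7):
    """Mean binary cross-entropy computed from probabilities (simplified sketch)."""
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

y_true = np.array([1.0, 0.0, 1.0])    # made-up labels
logits = np.array([2.0, -1.0, 0.5])   # made-up raw scores
probs = sigmoid(logits)               # what a sigmoid output layer would emit

loss_correct = bce_from_probs(y_true, probs)
# Treating the probabilities as logits applies the sigmoid a second time
loss_double_sigmoid = bce_from_probs(y_true, sigmoid(probs))
print(loss_correct, loss_double_sigmoid)
```

The second value is computed from distorted probabilities, so it no longer measures how well the predictions match the labels.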

In [32]:
# Compiling the model 

model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])

In [33]:
history = model.fit(x_tr, y_tr, epochs=10, batch_size=100, validation_data=(val_train, val_label))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [27]:
# Check learning history of the trained model

history_dict = history.history
type(history_dict)

history_dict.keys()

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
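
One common use of this dictionary is to find the epoch at which the validation loss was lowest, since later epochs may start to overfit. The sketch below uses a made-up `toy_history` with illustrative numbers; these are not results from the training run above.

```python
# Illustrative values only, in the same layout as history.history
toy_history = {
    "loss": [0.50, 0.30, 0.22, 0.17],
    "val_loss": [0.40, 0.31, 0.29, 0.33],
}

val_losses = toy_history["val_loss"]
# Epochs are reported starting at 1, list positions start at 0
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
print(best_epoch, val_losses[best_epoch - 1])
```

In this toy example the training loss keeps falling while the validation loss starts rising after the third epoch, which is the typical signature of overfitting.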