
In [25]:
# import ML packages
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import pathlib
import io
import string
import time
import re
import random
import gensim.downloader as api
from PIL import Image
from sklearn.metrics import confusion_matrix, roc_curve

# import tf DL packages
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_probability as tfp
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Layer, Conv1D, InputLayer,BatchNormalization, Bidirectional, Dense, Flatten,Dropout,Input,Embedding,TextVectorization
from tensorflow.keras.layers import SimpleRNN, LSTM, GRU
from tensorflow.keras.losses import BinaryCrossentropy, CategoricalCrossentropy, SparseCategoricalCrossentropy
from tensorflow.keras.metrics import Accuracy, TopKCategoricalAccuracy, CategoricalAccuracy,SparseCategoricalAccuracy
from tensorflow.keras.optimizers import Adam
from google.colab import drive, files
from tensorboard.plugins import projector

In [26]:
# !pip install --upgrade numpy
# !pip install --upgrade gensim

# Data Preparation

In [27]:
train_ds,val_ds,test_ds=tfds.load('imdb_reviews', split=['train', 'test[:50%]', 'test[50%:]'],as_supervised=True)

In [7]:
train_ds

<_PrefetchDataset element_spec=(TensorSpec(shape=(), dtype=tf.string, name=None), TensorSpec(shape=(), dtype=tf.int64, name=None))>

In [28]:
for review,label in val_ds.take(2):
  print(review)
  print(label)

tf.Tensor(b"There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel's absolutely amazing DING-A-LING-LESS. Flawless film-making, and as assured and as professional as any of the aforementioned movies. I haven't laughed this hard since I saw THE FULL MONTY. (And, even then, I don't think I laughed quite this hard... So to speak.) Tukel's talent is considerable: DING-A-LING-LESS is so chock full of double entendres that one would have to sit down with a copy of this script and do a line-by-line examination of it to fully appreciate the, uh, breadth and width of it. Every shot is beautifully composed (a clear sign of a sure-handed director), and the performances all around are solid (there's none of the over-the-top scenery chewing one might've expected from a film like this). DING-A-LING-LESS is a film whose time has come.", shape=(), dtype=string)
tf.Tensor(1, shape=(), dtype=int64)

First we standardize all the values.
- reference : https://www.tensorflow.org/api_docs/python/tf/strings/regex_replace
- reference : https://github.com/google/re2 (the regex syntax used by `tf.strings.regex_replace`)

Preprocessing steps to consider (the `standardization` function defined next applies the first three):
- convert the input to lowercase
- remove HTML tags with a regex (regular expression)
- strip punctuation
- remove special characters
- remove accented characters
- reduce each word to its root through one of these:
  - stemming : PorterStemmer
  - lemmatization : a lemmatizer
- Tokenization
  - character tokenization
  - word tokenization
  - subword tokenization
  - n-gram tokenization
- Vectorization
  - one-hot
  - bag of words
  - tf-idf
  - embeddings
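
To make the tokenization and vectorization options listed above concrete, here is a minimal pure-Python sketch. The helper names (`word_tokens`, `char_tokens`, `ngrams`, `bag_of_words`) are hypothetical and are not part of this notebook or of TensorFlow; they only illustrate the ideas on a toy sentence.

```python
from collections import Counter

def word_tokens(text):
    """Word tokenization: split on whitespace."""
    return text.split()

def char_tokens(text):
    """Character tokenization: every character is a token."""
    return list(text)

def ngrams(tokens, n):
    """n-gram tokenization: sliding windows of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens, vocab):
    """Bag-of-words vector: count of each vocabulary word, order ignored."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

tokens = word_tokens("a film like this film")
bigrams = ngrams(tokens, 2)
vector = bag_of_words(tokens, ["film", "like", "bad"])
print(tokens)
print(bigrams)
print(vector)
```

Subword tokenization, tf-idf and embeddings need a trained vocabulary or model, so they are left out of this sketch.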
    

- Standardization

In [29]:
def standardization(input_data):
  '''
  Input: raw reviews
  output: standardized reviews
  '''
  lowercase = tf.strings.lower(input_data)
  no_tag = tf.strings.regex_replace(lowercase,"<[^>]+>","")
  output = tf.strings.regex_replace(no_tag,"[%s]"%re.escape(string.punctuation),"")

  return output
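
For quick experiments outside the TensorFlow graph, the same steps can be mirrored with the standard library. This is only a sketch under assumptions: `normalize_text` is a hypothetical helper, it additionally strips accents (listed earlier but not done by `standardization`), and it replaces tags with a space rather than an empty string so that words on either side of a `<br />` are not glued together.

```python
import re
import string
import unicodedata

PUNCT_RE = re.compile("[%s]" % re.escape(string.punctuation))
TAG_RE = re.compile(r"<[^>]+>")

def normalize_text(text: str) -> str:
    """Lowercase, drop HTML tags, strip accents and punctuation."""
    text = text.lower()
    text = TAG_RE.sub(" ", text)  # assumption: a space instead of "" keeps words apart
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = PUNCT_RE.sub("", text)
    return " ".join(text.split())  # collapse repeated whitespace

print(normalize_text("A <br />Café, GREAT!"))
```

Inside `TextVectorization` the TensorFlow version is still the one to use, since the `standardize` callable receives tensors, not Python strings.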


In [32]:
standardization("There are films that make careers. For George Romero, it was NIGHT OF THE LIVING DEAD; for Kevin Smith, CLERKS; for Robert Rodriguez, EL MARIACHI. Add to that list Onur Tukel's absolutely amazing DING-A-LING-LESS. Flawless film-making, and as assured and as professional as any of the aforementioned movies. I haven't laughed this hard since I saw THE FULL MONTY. (And, even then, I don't think I laughed quite this hard... So to speak.) Tukel's talent is considerable: DING-A-LING-LESS is so chock full of double entendres that one would have to sit down with a copy of this script and do a line-by-line examination of it to fully appreciate the, uh, breadth and width of it. Every shot is beautifully composed (a clear sign of a sure-handed director), and the performances all around are solid (there's none of the over-the-top scenery chewing one might've expected from a film like this). DING-A-LING-LESS is a film whose time has come.")

<tf.Tensor: shape=(), dtype=string, numpy=b'there are films that make careers for george romero it was night of the living dead for kevin smith clerks for robert rodriguez el mariachi add to that list onur tukels absolutely amazing dingalingless flawless filmmaking and as assured and as professional as any of the aforementioned movies i havent laughed this hard since i saw the full monty and even then i dont think i laughed quite this hard so to speak tukels talent is considerable dingalingless is so chock full of double entendres that one would have to sit down with a copy of this script and do a linebyline examination of it to fully appreciate the uh breadth and width of it every shot is beautifully composed a clear sign of a surehanded director and the performances all around are solid theres none of the overthetop scenery chewing one mightve expected from a film like this dingalingless is a film whose time has come'>

- Define vocab size and sequence length

In [41]:
VOCAB_SIZE = 10000
SEQUENCE_LENGTH = 250

- reference  : https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization

In [42]:
vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize= standardization,
    output_mode='int',
    output_sequence_length=SEQUENCE_LENGTH
  )


- Note : VOCAB_SIZE and SEQUENCE_LENGTH in the cell above are initial guesses. We can also look at the data to see what these values should be set to.


In [46]:
# this takes a while, as it goes through every review in the selected sample
lengths = []   # number of whitespace-separated tokens in each review
words = set()  # unique tokens seen so far (set membership checks are O(1))

for review, label in train_ds.take(10):
  # with an empty sep, tf.strings.split splits on whitespace
  tokens = tf.strings.split(review, sep="").numpy()
  words.update(tokens)         # add any tokens not seen before
  lengths.append(len(tokens))  # append the review length to the list

In [47]:
print(len(words))
print(lengths)

913
[116, 112, 132, 88, 81, 289, 557, 111, 223, 127]


- Note : just 10 samples already contain 913 unique tokens, so the vocabulary of the whole dataset will be much larger. That is why we cap it at 10000.
- For the sequence length, the 10 reviews have lengths [116, 112, 132, 88, 81, 289, 557, 111, 223, 127]. The longest is 557 tokens, but 8 of the 10 are under 250, so the values we picked look reasonable. Reviews longer than SEQUENCE_LENGTH are truncated by the vectorizer.
- For better visibility, you can change 10 to 100 samples in the loop above.

In [50]:
# let's take the mean and median of the review lengths to understand them better
print(np.mean(lengths))
print(np.median(lengths))

183.6
121.5
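
The mean and median do not say how many reviews a given SEQUENCE_LENGTH would truncate. A small sketch of that check, using the 10 lengths printed above; `coverage` is a hypothetical helper, not something defined elsewhere in this notebook.

```python
import statistics

# Lengths of the 10 sampled reviews, copied from the output above.
lengths = [116, 112, 132, 88, 81, 289, 557, 111, 223, 127]

def coverage(lengths, seq_len):
    """Fraction of reviews that fit into seq_len tokens without truncation."""
    return sum(length <= seq_len for length in lengths) / len(lengths)

print(statistics.mean(lengths), statistics.median(lengths))
for seq_len in (100, 250, 500):
    print(seq_len, coverage(lengths, seq_len))
```

With only 10 samples this is a rough indication; running it on a larger sample gives a more reliable picture before fixing SEQUENCE_LENGTH.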


- reference :https://stackoverflow.com/questions/72688923/how-to-adapt-textvectorization-layer-on-tf-dataset

In [51]:
# now let's adapt the vectorizer on the dataset to build the vocabulary list
training_dataset = train_ds.map(lambda x, y: x)  # keep only the review text (x), drop the label (y)
vectorize_layer.adapt(training_dataset)  # adapt the vectorizer layer to the training dataset

- Get the vocabulary out of the vectorizer

In [53]:
len(vectorize_layer.get_vocabulary())

10000

In [57]:
for review,label in train_ds.take(1):
  print(review)

tf.Tensor(b"This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.", shape=(), dtype=string)


In [59]:
#get the numerical vectors out by passing into vectorize_layer
def vectorizer(review, label):
  return vectorize_layer(review), label

In [60]:
train_dataset = train_ds.map(vectorizer)
val_dataset = val_ds.map(vectorizer)

In [62]:
#review vectorized data
for review,label in train_dataset.take(1):
  print(review)
  print(label)

tf.Tensor(
[  10   13   33  411  384   17   89   26    1    8   32 1337 3521   40
  491    1  192   22   84  149   18   10  215  317   26   64  239  212
    8  484   54   64   84  111   95   21 5502   10   91  637  737   10
   17    7   33  393 9554  169 2443  406    2   87 1205  135   65  142
   52    2    1 7408   65  245   64 2832   16    1 2851    1    1 1415
 4969    3   39    1 1567   15 3521   13  156   18    4 1205  881 7874
    8    4   17   12   13 4037    5   98  145 1234   11  236  696   12
   48   22   91   37   10 7285  149   37 1337    1   49  396   11   95
 1148  841  140    9    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0 
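
In the ids printed above, 0 is the padding index and 1 is the out-of-vocabulary token (`[UNK]`): words outside the 10000 most frequent ones map to 1, and the review is padded with 0 up to SEQUENCE_LENGTH. The following pure-Python sketch imitates that behaviour on toy data. `build_vocab` and `encode` are hypothetical helpers, not the Keras implementation, and the order of equally frequent words may differ from what `TextVectorization` produces.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    """Index 0 is padding, index 1 is [UNK], then words by descending frequency."""
    counts = Counter(tok for text in texts for tok in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab, seq_len):
    """Map tokens to ids, truncate to seq_len, then pad with 0."""
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in text.split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

vocab = build_vocab(["good good good movie movie", "bad"], max_tokens=4)
print(vocab)
print(encode("good bad movie", vocab, seq_len=5))
print(encode("good good good", vocab, seq_len=2))
```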

- prefetch the data so that loading the next elements overlaps with training, for better performance and parallelism

In [68]:
# tf.data.AUTOTUNE lets tf.data choose the prefetch buffer size at runtime
train_dataset = train_dataset.prefetch(buffer_size = tf.data.AUTOTUNE)
val_dataset = val_dataset.prefetch(buffer_size = tf.data.AUTOTUNE)
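
To see what prefetching means conceptually, here is a minimal pure-Python sketch: a background thread fills a small buffer while the consumer works on the current element. The `prefetch` generator below is a hypothetical illustration of the idea only; it is not how `tf.data` is implemented.

```python
import queue
import threading

def prefetch(iterable, buffer_size=2):
    """Yield items from iterable while a background thread loads ahead."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def producer():
        try:
            for item in iterable:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(sentinel)

    # Daemon thread, so an abandoned generator does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item

print(list(prefetch(range(5), buffer_size=2)))
```

The order of elements is unchanged. Only the waiting time between them shrinks when producing an element is slow.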