<a href="https://colab.research.google.com/github/ayush2991/imdb-reviews/blob/main/imdb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [114]:
import tensorflow as tf
from tensorflow.keras.datasets import imdb
import numpy as np

In [115]:
# imdb.load_data?
# Using this help command, we learn that:
# 1. The function call returns (x_train, y_train), (x_test, y_test).
# 2. Words are indexed by frequency, so rank 1 = most common word.
# 3. With the default arguments, 0 is reserved for padding, 1 marks the start
#    of a review (start_char) and 2 replaces out-of-vocabulary words (oov_char).
# 4. index_from=3, so actual words start at index 4 (frequency rank + 3).
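
The numbering scheme described above can be sketched without downloading anything. The snippet below is only an illustration of my reading of the default arguments (`start_char=1`, `oov_char=2`, `index_from=3`); `toy_word_index` and `encode` are hypothetical names, not part of Keras.

```python
# Toy stand-in for imdb.get_word_index(): word -> frequency rank (1 = most common).
toy_word_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

START, OOV = 1, 2      # reserved ids under the default arguments (0 is padding)
INDEX_FROM = 3         # offset added to every frequency rank


def encode(words, word_index, num_words):
    """Hypothetical helper that mimics how load_data numbers tokens."""
    tokens = [START]
    for word in words:
        rank = word_index.get(word)
        token = rank + INDEX_FROM if rank is not None else OOV
        # Ids at or above num_words are replaced by the OOV id.
        tokens.append(token if token < num_words else OOV)
    return tokens


encoded = encode(["the", "film", "was", "brilliant"], toy_word_index, num_words=6)
print(encoded)  # [1, 4, 5, 2, 2]
```

With `num_words=6` only ids 4 and 5 survive, so just the two most frequent words are kept; by the same logic `num_words=10000` keeps slightly fewer than 10,000 distinct words.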

In [116]:
VOCAB_SIZE = 10000

In [117]:
word_index = imdb.get_word_index()
print("# unique tokens = ", len(word_index))

# unique tokens =  88584


In [None]:
(x_train_text, y_train), (x_test_text, y_test) = imdb.load_data(num_words=VOCAB_SIZE)
# num_words = 10,000 is a somewhat arbitrary choice, but it seems like
# a good starting point: people are unlikely to need more than 10,000
# distinct English words to express their views about a movie.
# This choice is also informed by my prior work on large NLP systems,
# which required no more than 50,000 tokens for complete expression.
# We can always tune this hyperparameter if need be.
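
One way to check whether a vocabulary size is reasonable is to measure how many tokens it would turn into the out-of-vocabulary id. The sketch below uses made-up token ids; `oov_fraction` and `toy_reviews` are hypothetical names. To try it on the real data, I would expect to load the reviews with `num_words=None` first so that no ids have been replaced yet.

```python
def oov_fraction(reviews, num_words):
    """Share of tokens whose id is >= num_words, i.e. would become OOV."""
    total = sum(len(review) for review in reviews)
    dropped = sum(1 for review in reviews for token in review if token >= num_words)
    return dropped / total


# Made-up token ids standing in for reviews loaded with the full vocabulary.
toy_reviews = [[1, 4, 5, 120, 9], [1, 4, 3000, 7, 15000]]

for size in (100, 1000, 10000):
    print(size, oov_fraction(toy_reviews, size))
```

If the fraction is already small at 10,000, growing the vocabulary mostly adds rare words and makes the input vectors wider.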

In [None]:
# Let's see what the data looks like
print(x_train_text[0])

In [118]:
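# and, in human-readable form.
# Token ids are offset by 3 (index_from), so subtract 3 before the lookup.
# Reserved ids (start marker, out-of-vocabulary) have no entry and print as '?'.
reverse_index = {v: k for (k, v) in word_index.items()}
words = [reverse_index.get(token - 3, '?') for token in x_train_text[0]]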
review = ' '.join(words)
print(review)

? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you thi
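
The '?' at the very start of the decoded review is the start marker, while the later ones are out-of-vocabulary words. A variant of the decoding step that spells out the reserved ids makes this visible. This is a sketch on a toy index, assuming the default reserved ids (0 padding, 1 start, 2 out-of-vocabulary); `decode_review`, `SPECIAL` and `toy_word_index` are hypothetical names.

```python
INDEX_FROM = 3
SPECIAL = {0: "<PAD>", 1: "<START>", 2: "<OOV>"}  # assumed default reserved ids


def decode_review(tokens, word_index):
    """Map token ids back to words, labelling reserved ids explicitly."""
    reverse = {rank: word for word, rank in word_index.items()}
    words = [
        SPECIAL[t] if t in SPECIAL else reverse.get(t - INDEX_FROM, "?")
        for t in tokens
    ]
    return " ".join(words)


toy_word_index = {"the": 1, "film": 2, "was": 3}
decoded = decode_review([1, 4, 5, 6, 2], toy_word_index)
print(decoded)  # <START> the film was <OOV>
```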

In [None]:
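# texts: A list of N reviews, where each review is a variable-length
# sequence of token ids. For example, texts = [[3, 1, 8, ..], [9, 3, ..], ...]
# vocab_size: The length of the vector to which we'll map each review.
# Returns X, a matrix of shape (N, vocab_size) where X[i, j] = 1 if token j
# occurs in review i. Strictly speaking this is a multi-hot (bag-of-words)
# encoding: word order and repeat counts are discarded.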
def to_one_hot(texts, vocab_size):
  X = np.zeros((len(texts), vocab_size))
  for i, indices in enumerate(texts):
    X[i, indices] = 1
  return X
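
A quick check on toy input shows what the encoding keeps and what it drops. The function is repeated here only so the snippet runs on its own; `toy_texts` is a made-up example.

```python
import numpy as np


def to_one_hot(texts, vocab_size):
    # Same logic as the cell above: set a 1 wherever a token id occurs.
    X = np.zeros((len(texts), vocab_size))
    for i, indices in enumerate(texts):
        X[i, indices] = 1
    return X


toy_texts = [[1, 3, 3], [0, 2]]   # token 3 appears twice in the first review
X = to_one_hot(toy_texts, vocab_size=5)
print(X)
# [[0. 1. 0. 1. 0.]
#  [1. 0. 1. 0. 0.]]
```

The repeated token 3 still produces a single 1, and `[1, 3, 3]` would encode the same as `[3, 1]`: only the presence of a token survives.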

In [120]:
# Encode all reviews into fixed-length 0/1 vectors which we can feed into DNNs.
x_train = to_one_hot(x_train_text, VOCAB_SIZE)
x_test = to_one_hot(x_test_text, VOCAB_SIZE)

num_train_examples = len(y_train)
num_test_examples = len(y_test)

assert x_train.shape == (num_train_examples, VOCAB_SIZE)
assert x_test.shape == (num_test_examples, VOCAB_SIZE)

print(x_train.shape)
print(x_test.shape)

(25000, 10000)
(25000, 10000)
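
These shapes have a memory cost worth keeping in mind. `np.zeros` produces `float64` by default, so a back-of-the-envelope estimate for one matrix looks like this (the alternative dtypes are only options to consider, not something the notebook currently uses):

```python
rows, cols = 25000, 10000      # shape printed above
bytes_per_float64 = 8          # np.zeros defaults to float64

size_gb = rows * cols * bytes_per_float64 / 1e9
print(f"one float64 matrix: {size_gb:.1f} GB")

# The same matrix stored with smaller element types.
for name, nbytes in (("float32", 4), ("uint8", 1)):
    print(f"{name}: {rows * cols * nbytes / 1e9:.2f} GB")
```

Since the entries are only ever 0 or 1, passing a smaller `dtype` to `np.zeros` would be one way to cut the footprint if memory becomes tight.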
