In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.3.1'

In [2]:
Download the IMDB dataset directly from the keras package. The reviews are already preprocessed: every word has been replaced by an integer index.

from keras.datasets import imdb
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# The argument num_words=10000 means that we only keep the 10,000 most frequently occurring words in the
# training data. Rarer words are replaced by a placeholder index, which keeps the data at a manageable size.

In [3]:
Since we have restricted ourselves to the 10,000 most frequent words, no word index will exceed 9,999:

max([max(sequence) for sequence in train_data])


9999
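
Why the maximum is 9999 can be illustrated with a small sketch. It assumes, as with the default settings of imdb.load_data, that every index at or above num_words is replaced by an out-of-vocabulary placeholder (index 2). The helper name cap_vocabulary and the toy review are made up for illustration.

```python
def cap_vocabulary(sequence, num_words=10000, oov_index=2):
    """Replace every index >= num_words with the out-of-vocabulary index."""
    return [w if w < num_words else oov_index for w in sequence]

toy_review = [1, 14, 9999, 10000, 25000]  # hypothetical word indices
capped = cap_vocabulary(toy_review)
print(capped)       # [1, 14, 9999, 2, 2]
print(max(capped))  # 9999
```

Valid indices therefore run from 0 to num_words - 1, which is why a vector of dimension 10000 is enough to hold them.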

In [4]:
The text in the IMDB training dataset has already been converted into integers. Let us take a look at the very first review.

train_data[0]

[1,
 14,
 22,
 16,
 43,
 530,
 973,
 1622,
 1385,
 65,
 458,
 4468,
 66,
 3941,
 4,
 173,
 36,
 256,
 5,
 25,
 100,
 43,
 838,
 112,
 50,
 670,
 2,
 9,
 35,
 480,
 284,
 5,
 150,
 4,
 172,
 112,
 167,
 2,
 336,
 385,
 39,
 4,
 172,
 4536,
 1111,
 17,
 546,
 38,
 13,
 447,
 4,
 192,
 50,
 16,
 6,
 147,
 2025,
 19,
 14,
 22,
 4,
 1920,
 4613,
 469,
 4,
 22,
 71,
 87,
 12,
 16,
 43,
 530,
 38,
 76,
 15,
 13,
 1247,
 4,
 22,
 17,
 515,
 17,
 12,
 16,
 626,
 18,
 2,
 5,
 62,
 386,
 12,
 8,
 316,
 8,
 106,
 5,
 4,
 2223,
 5244,
 16,
 480,
 66,
 3785,
 33,
 4,
 130,
 12,
 16,
 38,
 619,
 5,
 25,
 124,
 51,
 36,
 135,
 48,
 25,
 1415,
 33,
 6,
 22,
 12,
 215,
 28,
 77,
 52,
 5,
 14,
 407,
 16,
 82,
 2,
 8,
 4,
 107,
 117,
 5952,
 15,
 256,
 4,
 2,
 7,
 3766,
 5,
 723,
 36,
 71,
 43,
 530,
 476,
 26,
 400,
 317,
 46,
 7,
 4,
 2,
 1029,
 13,
 104,
 88,
 4,
 381,
 15,
 297,
 98,
 32,
 2071,
 56,
 26,
 141,
 6,
 194,
 7486,
 18,
 4,
 226,
 22,
 21,
 134,
 476,
 26,
 480,
 5,
 144,
 30,
 5535,
 18,

In [5]:
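We cannot feed lists of integers of varying length directly into a neural network. We first have to turn them into tensors of a fixed shape. Here each review becomes a 10,000-dimensional vector that holds a 1 at every word index occurring in the review and a 0 everywhere else.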

import numpy as np  #import the numpy library

def vectorize_text(sequences, dimension=10000):  #here 10000 refers to the 10000 most frequent words
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results

# Our vectorized training data
x_train = vectorize_text(train_data)
# Our vectorized test data
x_test = vectorize_text(test_data)

In [6]:
The same train_data[0] shown above has now been converted into a vector of floats:

x_train[0]

array([0., 1., 1., ..., 0., 0., 0.])
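
To see what this encoding keeps and what it throws away, here is a small sketch on toy data. The function multi_hot repeats the logic of vectorize_text under a different, hypothetical name so that the snippet runs on its own; the toy sequences and dimension=8 are made up.

```python
import numpy as np

def multi_hot(sequences, dimension):
    # One row per sequence, one column per possible word index
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # integer-list indexing sets all listed columns at once
    return results

toy = [[1, 3, 3, 5], [5, 3, 1]]  # same words, different order and counts
encoded = multi_hot(toy, dimension=8)
print(encoded[0])                              # [0. 1. 0. 1. 0. 1. 0. 0.]
print(np.array_equal(encoded[0], encoded[1]))  # True
```

Both rows come out identical: the encoding records only which words occur in a review, not how often or in what order.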

In [7]:
We also have to vectorize our labels, which is easily done with numpy:

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')


In [8]:

y_train[0]


1.0

In [None]:
Let us consider a sample Sequential model whose first layer is a Dense layer, i.e., the layer that receives the input.


from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))


We define a Dense layer by its number of hidden units (here 16) and its activation function (here 'relu'). If the
Dense layer is the first layer of a neural network, we also pass the input_shape parameter.

Notice how the input_shape parameter is defined. It tells the model to expect each input sample to be a vector
of dimension 10000. That is because we kept only the 10000 most frequent words in the training data. This number
can also be referred to as vocab_size.
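
What such a layer computes can be sketched in plain numpy: output = relu(dot(input, W) + b), which maps a 10000-dimensional input to 16 values. The random weights below are only stand-ins for the ones Keras would initialize and learn, and the names W, b and dense_relu are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, units = 10000, 16

W = rng.normal(scale=0.01, size=(vocab_size, units))  # stand-in weight matrix
b = np.zeros(units)                                   # stand-in bias vector

def dense_relu(x, W, b):
    # Affine transform followed by relu, which zeroes out negative values
    return np.maximum(0.0, x @ W + b)

x = np.zeros((1, vocab_size))  # a batch holding one multi-hot review
x[0, [1, 14, 22]] = 1.0        # hypothetical word indices present in the review
out = dense_relu(x, W, b)
print(out.shape)  # (1, 16)
```

The weight matrix has one row per vocabulary entry, which is why the layer needs to know the input dimension up front.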


In [9]:


samples = ['The cat sat on the mat.', 'The dog ate my homework.']



In [10]:
Let us now create such a dictionary ourselves. It assigns a unique index to each word in the data; with real
training data we would keep only the words within the accepted vocabulary size.

We use the list of sentences defined above: samples = ['The cat sat on the mat.', 'The dog ate my homework.']

token_index = {}  # First, build an index of all tokens in the data.
for s in samples:
    # We simply tokenize the samples via the 'split' method. In reality we would also strip
    # punctuation and special characters.
    for word in s.split():
        if word not in token_index:
            # Assign a unique index to each unique word.
            # Note that we don't assign the number 0 to anything.
            token_index[word] = len(token_index) + 1


In [11]:
Let us take a look at the dictionary that we formed in the cell above

token_index

{'The': 1,
 'cat': 2,
 'sat': 3,
 'on': 4,
 'the': 5,
 'mat.': 6,
 'dog': 7,
 'ate': 8,
 'my': 9,
 'homework.': 10}
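
Note that the dictionary treats 'The' and 'the' as different words and keeps the trailing period in 'mat.'. Below is a minimal sketch of the stripping mentioned in the code comment, assuming that lowercasing and removing ASCII punctuation is enough for this toy data. The helper names tokenize and build_index, and the variable clean_index, are hypothetical.

```python
import string

def tokenize(sentence):
    # Lowercase, drop ASCII punctuation, then split on whitespace
    cleaned = sentence.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()

def build_index(sentences):
    index = {}
    for s in sentences:
        for word in tokenize(s):
            if word not in index:
                index[word] = len(index) + 1  # index 0 stays unassigned
    return index

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
clean_index = build_index(samples)
print(clean_index)
```

With this normalization the vocabulary shrinks from 10 entries to 9, because 'The' and 'the' now share one index.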

In [12]:
We can then convert the words in the sentences to sequences of numbers as follows:

result = []
for s in samples:
    seq = []
    for word in s.split():
        seq.append(token_index[word])  # look up the index of each word
    result.append(seq)

In [13]:
result

[[1, 2, 3, 4, 5, 6], [1, 7, 8, 9, 10]]

In [None]:
So the first sentence in samples is transformed into a sequence of numbers as follows:

'The cat sat on the mat.'  ->  [1, 2, 3, 4, 5, 6]
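
Going the other way is a matter of inverting the dictionary. The sketch below restates the token_index shown earlier so that it runs on its own. The names reverse_index and decode are hypothetical, and showing unknown indices as '?' is an assumption made for illustration.

```python
token_index = {'The': 1, 'cat': 2, 'sat': 3, 'on': 4, 'the': 5, 'mat.': 6,
               'dog': 7, 'ate': 8, 'my': 9, 'homework.': 10}

# Swap keys and values: index -> word
reverse_index = {index: word for word, index in token_index.items()}

def decode(sequence):
    # Indices without an entry (such as 0) are shown as '?'
    return ' '.join(reverse_index.get(i, '?') for i in sequence)

print(decode([1, 2, 3, 4, 5, 6]))  # The cat sat on the mat.
print(decode([1, 7, 0]))           # The dog ?
```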