In [1]:
import numpy as np
from keras.datasets import imdb
import matplotlib.pyplot as plt
from jupyterthemes import jtplot
jtplot.style()
%matplotlib notebook

Using TensorFlow backend.


# IMDB Dataset

This dataset consists of 50,000 movie reviews, split evenly into 25,000 training and 25,000 test reviews. Both the input features and the targets are 1D numpy arrays. Each element of the input array is a document vector, represented as a Python list of integers. Each element of the target array is a single number, either 0 or 1.

When Keras creates the vocabulary of the dataset, it assigns indexes to tokens according to their frequencies: the token with the highest frequency has an index of 1, the next most frequent token has an index of 2, and so on. When we load the data we specify the `num_words` argument, which says that only the most frequent words should be part of the vocabulary. A simple way to implement this is to throw away all tokens whose indexes are `num_words` or greater. Note that indexes start at 1; no token is assigned index 0.

Each position in a document vector holds the index of the word at that position in the review. Words that are not in the vocabulary are replaced with a special *unknown* token. Since reviews differ in length, the lists in the input array differ in length too.
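
As a rough illustration of frequency-ranked indexing, here is a toy sketch. It is not Keras's implementation: `build_word_index`, `limit_vocabulary`, and the three toy documents are hypothetical names and data made up for this example, and keeping only indexes strictly below `num_words` is an assumption.

```python
from collections import Counter

def build_word_index(docs):
    """Rank tokens by frequency; the most frequent token gets index 1."""
    counts = Counter(token for doc in docs for token in doc.split())
    # Sort by descending count, breaking ties alphabetically so the result is deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {token: rank for rank, (token, _) in enumerate(ranked, start=1)}

def limit_vocabulary(word_index, num_words):
    """Keep only the tokens whose index is below num_words (assumed cap rule)."""
    return {token: index for token, index in word_index.items() if index < num_words}

# Toy corpus, invented for illustration only
toy_docs = [
    "the film was good",
    "the film was bad",
    "the plot was thin the end",
]
toy_index = build_word_index(toy_docs)
toy_vocab = limit_vocabulary(toy_index, num_words=4)
print(toy_index)
print(toy_vocab)
```

In this toy corpus *the* is the most frequent token, so it gets index 1, and no token ever receives index 0.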

In [2]:
v = 10000
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=v)

In [3]:
# There are 25,000 samples each in the train and test datasets
print(x_train.shape, x_test.shape)

# Input is an array of lists
print(type(x_train), type(x_train[0]))

# Target is an array of ints with values 0 or 1
print(type(y_train), type(y_train[0]))
print(np.min(y_train), np.max(y_train))

# Each review has a different length
print(len(x_train[0]), len(x_train[1]), len(x_train[2]))
print(len(x_test[0]), len(x_test[1]), len(x_test[2]))

# Only indexes below 10,000 appear: the maximum index is 9,999, not 10,000
min_index = min([min(vec) for vec in x_train])
max_index = max([max(vec) for vec in x_train])
print(min_index, max_index)

# Indexes start with 1
sorted(imdb.get_word_index().items(), key=lambda kv: kv[1])[:15]

(25000,) (25000,)
<class 'numpy.ndarray'> <class 'list'>
<class 'numpy.ndarray'> <class 'numpy.int64'>
0 1
218 189 141
72 646 466
1 9999


[('the', 1),
 ('and', 2),
 ('a', 3),
 ('of', 4),
 ('to', 5),
 ('is', 6),
 ('br', 7),
 ('in', 8),
 ('it', 9),
 ('i', 10),
 ('this', 11),
 ('that', 12),
 ('was', 13),
 ('as', 14),
 ('for', 15)]

## Weirdnesses

### Weirdness 1
The indexes given by `imdb.get_word_index()` do not actually correspond to the indexes that appear in the document vector. They are all offset by 3. To distinguish these two types of indexes, let's call the indexes given by `get_word_index()` *raw indexes* and the ones that appear in the data *actual indexes*, or simply indexes. So a word with raw index *i* will actually appear in the document with index *i+3*. Say the document has the word *this*. It will not appear as number *11* in the document vector. Instead it will appear as number *14*. Conversely, if I see the number *11* in the document, that is its actual index, so its raw index is *8*, which means it represents the word *in*. In other words, `index = raw_index + 3`.

Given the offset described above, the actual indexes should start from *4* because the raw indexes start from *1*. However, we saw above that the actual indexes start from *1*. This is because indexes *1*, *2*, and *3* have special meanings.
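
The arithmetic can be checked with a small sketch. The dictionary below holds only the first 15 raw indexes from the listing printed earlier; `INDEX_FROM`, `to_actual`, and `to_word` are hypothetical helper names introduced for this example.

```python
INDEX_FROM = 3  # offset between raw indexes and the indexes stored in the data

# First 15 entries of the raw word index, copied from the listing printed earlier
toy_raw_index = {
    'the': 1, 'and': 2, 'a': 3, 'of': 4, 'to': 5,
    'is': 6, 'br': 7, 'in': 8, 'it': 9, 'i': 10,
    'this': 11, 'that': 12, 'was': 13, 'as': 14, 'for': 15,
}

def to_actual(word, raw_index):
    """Return the index under which `word` appears in a document vector."""
    return raw_index[word] + INDEX_FROM

def to_word(actual_index, raw_index):
    """Return the word that an actual (offset) index stands for."""
    raw_to_word = {raw: word for word, raw in raw_index.items()}
    return raw_to_word[actual_index - INDEX_FROM]

print(to_actual('this', toy_raw_index))  # 14
print(to_word(11, toy_raw_index))        # in
```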

In the cells below I convert a document vector back to a document. First I use the raw indexes for the conversion, but the resulting document does not make a lot of sense. Then I use the actual indexes for the conversion and the resulting document is reasonable.

In [4]:
# A document with the raw indexes does not make much sense
raw_indexes = imdb.get_word_index()
raw_tokens = {raw_index: word for word, raw_index in raw_indexes.items()}
vec = x_train[0]
doc = ' '.join([raw_tokens[index] for index in vec])
print(doc)

the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room and it so heart shows to years of every never going and help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but and to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other and in of seen over landed for anyone of and br show's to whether from than out themselves history he name half some br of and odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but when from one bit then have tw

In [5]:
# A document with actual indexes is reasonable
tokens = {raw_index + 3: word for word, raw_index in raw_indexes.items()}
# Adding the special tokens
tokens[1] = '<special>'
tokens[2] = '<special>'
tokens[3] = '<special>'
vec = x_train[0]
tokens_in_doc = [tokens[index] for index in vec] 
doc = ' '.join(tokens_in_doc)
print(doc)

<special> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <special> is an amazing actor and now the same being director <special> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <special> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <special> to the two little boy's that played the <special> of norman and paul they were just brilliant children are often left out of the <special> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and

### Weirdness 2
Indexes 1, 2, and 3 should have had special meanings. However, only indexes 1 and 2 do: 1 means *start of document* and 2 means *unknown token*. Index 3 has been completely forgotten! It has no token assigned to it and it does not appear in the dataset at all. When the proper args are given to `load_data`, I might also see index 0, which has the special meaning of *padding*.

In the cells below, I first re-print the first review with the right special tokens. Then I check that index 3 never appears in the training data.

In [6]:
# Printing the review with the right special tokens
tokens[0] = '<padding>'
tokens[1] = '<start>'
tokens[2] = '<unknown>'
tokens[3] = '<?>'
tokens_in_doc = [tokens[index] for index in vec] 
doc = ' '.join(tokens_in_doc)
print(doc)

<start> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <unknown> is an amazing actor and now the same being director <unknown> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <unknown> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <unknown> to the two little boy's that played the <unknown> of norman and paul they were just brilliant children are often left out of the <unknown> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and s

In [7]:
# Check that index 3 does not appear anywhere in the training data
# (the loop prints nothing when it is absent)
for i, seq in enumerate(x_train):
    if 3 in seq:
        print(f'found it in {i}!')
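
Because the loop above prints nothing when index 3 is absent, an explicit count may be easier to read. The sketch below uses a hypothetical helper, `count_index`, and made-up toy sequences rather than the real data.

```python
def count_index(sequences, target):
    """Return how many sequences contain `target` at least once."""
    return sum(1 for seq in sequences if target in seq)

# Toy sequences, invented for illustration: index 2 occurs, index 3 does not
toy_sequences = [[1, 14, 22, 2], [1, 4, 9], [1, 2, 2, 35]]
print(count_index(toy_sequences, 3))  # 0
print(count_index(toy_sequences, 2))  # 2
```

If the observation about index 3 holds, calling `count_index(x_train, 3)` on the loaded data should return 0, and the same call on `x_test` would extend the check to the test set.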

### Weirdness 3
Instead of simply dropping tokens that are not part of the vocabulary, Keras replaces them with the special *unknown* token. In the code below, notice that *redford's* has an index of 22,665 (raw index of 22,662), so in `x_train` it fell outside the 10,000-word vocabulary and its occurrence was replaced with the special *unknown* token.

In [8]:
# Load the data without dropping any tokens
(x_train_full, _), (_, _) = imdb.load_data()
vec_full = x_train_full[0]
tokens_in_doc_full = [tokens[index] for index in vec_full] 
print('\n')
for index, token, index_full, token_full in zip(vec[10:30], tokens_in_doc[10:30], vec_full[10:30], tokens_in_doc_full[10:30]):
    print(f'[{index}] = {token}\t[{index_full}] = {token_full}')
    
print(raw_indexes["redford's"], tokens[raw_indexes["redford's"] + 3])



[458] = direction	[458] = direction
[4468] = everyone's	[4468] = everyone's
[66] = really	[66] = really
[3941] = suited	[3941] = suited
[4] = the	[4] = the
[173] = part	[173] = part
[36] = they	[36] = they
[256] = played	[256] = played
[5] = and	[5] = and
[25] = you	[25] = you
[100] = could	[100] = could
[43] = just	[43] = just
[838] = imagine	[838] = imagine
[112] = being	[112] = being
[50] = there	[50] = there
[670] = robert	[670] = robert
[2] = <unknown>	[22665] = redford's
[9] = is	[9] = is
[35] = an	[35] = an
[480] = amazing	[480] = amazing
22662 redford's
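
The replacement seen in the output above can be mimicked with a one-line list comprehension. This is only a sketch of the observed behaviour, not Keras's own code: `cap_vocabulary` and its `unknown_index` parameter are hypothetical names, and the sample vector reuses a few indexes from the output above.

```python
def cap_vocabulary(vec, num_words, unknown_index=2):
    """Replace every index at or above num_words with the unknown index."""
    return [index if index < num_words else unknown_index for index in vec]

# robert, redford's, is, an, amazing (indexes taken from the output above)
full_vec = [670, 22665, 9, 35, 480]
print(cap_vocabulary(full_vec, num_words=10000))  # [670, 2, 9, 35, 480]
```

Keeping a placeholder instead of dropping the token preserves the position of every other word in the review, so the length of the document vector is unchanged.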
