In [1]:
from keras.datasets import imdb
import keras
import numpy as np
keras.__version__

Using TensorFlow backend.


'2.2.4'

In [2]:
type(imdb)

module

**The dir() function**

Calling dir() on the module object "imdb" returns a list of the names of its attributes and methods. dir() works on any object (functions, modules, strings, lists, dictionaries, etc.).

* For module or library objects, it tries to return the names of all the attributes contained in that module.
* If no argument is passed, it returns the list of names in the current local scope.
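
As a small illustration of both cases, the sketch below calls dir() on the standard-library json module (chosen only because it needs no download; it is not part of this notebook's workflow) and then calls dir() with no argument inside a function.

```python
import json

# dir() on a module returns a sorted list of the names defined in it
module_names = dir(json)
print('dumps' in module_names, 'loads' in module_names)

# Names starting with an underscore are usually internals; filter them out
public_names = [name for name in module_names if not name.startswith('_')]
print(public_names)


def local_names():
    # With no argument, dir() lists the names in the current local scope
    first = 1
    second = 2
    return dir()


print(local_names())  # ['first', 'second']
```

The same filtering on dir(imdb) would leave only the public helpers such as get_word_index and load_data.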



In [3]:
dir(imdb)

['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_remove_long_seq',
 'absolute_import',
 'division',
 'get_file',
 'get_word_index',
 'json',
 'load_data',
 'np',
 'print_function',

**load_data to get training and test data**

Older versions of NumPy used **allow_pickle=True** as the default for **np.load**, and this version of Keras relies on that default when it reads the dataset file. Newer NumPy releases (1.16.3 onwards) default to allow_pickle=False, so the import fails. We can either edit the np.load call in imdb.py, or temporarily change the default just for importing the data and restore the original function afterwards.

In [4]:
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)
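
The save, patch and restore steps can also be wrapped in a context manager so that the original function comes back even if loading fails. The following is only a sketch: temporary_default is a hypothetical helper, and fake_np is a stand-in for the numpy module so that the example runs without NumPy or Keras installed.

```python
import functools
import types
from contextlib import contextmanager


@contextmanager
def temporary_default(owner, name, **defaults):
    """Temporarily replace owner.name with a version that has extra keyword defaults."""
    original = getattr(owner, name)
    setattr(owner, name, functools.partial(original, **defaults))
    try:
        yield
    finally:
        # Always restore the original, even if the body raises
        setattr(owner, name, original)


# Stand-in for np.load (an assumption made purely for illustration)
def fake_load(path, allow_pickle=False):
    return (path, allow_pickle)


fake_np = types.SimpleNamespace(load=fake_load)

with temporary_default(fake_np, 'load', allow_pickle=True):
    inside = fake_np.load('imdb.npz')   # patched default is active here

outside = fake_np.load('imdb.npz')      # original default is back
print(inside, outside)
```

With the real libraries, the equivalent call would be `with temporary_default(np, 'load', allow_pickle=True):` around the imdb.load_data call.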

Now we load the data, keeping only the 10,000 most frequent words.

In [5]:
# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
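
To make the effect of num_words concrete, here is a minimal sketch of the capping step on toy data. It assumes the documented load_data default in which index 2 marks an out-of-vocabulary word (oov_char=2); the helper name cap_vocabulary and the toy review are hypothetical, and this is not the library's own implementation.

```python
def cap_vocabulary(sequence, num_words, oov_char=2):
    """Replace every index >= num_words with the out-of-vocabulary marker."""
    return [index if index < num_words else oov_char for index in sequence]


# Toy review containing two indices outside a 10,000-word vocabulary
toy_review = [1, 14, 22, 12000, 43, 530, 25000]
capped = cap_vocabulary(toy_review, num_words=10000)
print(capped)       # [1, 14, 22, 2, 43, 530, 2]
print(max(capped))  # 530
```

Under this assumption, every index in the loaded reviews is below 10,000, and rarer words show up as the value 2.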


**Let us see what our data looks like.**

In [6]:
print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


The numbers above are the indices of the words used in the review. The words they stand for can be looked up with the get_word_index() function of the **imdb** module.

In [12]:
word_to_ind = imdb.get_word_index()

In [13]:
# invert the word -> index mapping to get index -> word
ind_to_word = dict([(value, key) for key, value in word_to_ind.items()])

In [24]:
print("The words at indices {} and {} in the vocabulary are '{}' and '{}' respectively.".format(16, 22, ind_to_word[16], ind_to_word[22]))
print("The words 'yes' and 'meet' have indices {} and {} in the vocabulary respectively.".format(word_to_ind['yes'], word_to_ind['meet']))

The words at indices 16 and 22 in the vocabulary are 'with' and 'you' respectively.
The words 'yes' and 'meet' have indices 419 and 906 in the vocabulary respectively.
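
One caveat when decoding a whole review: with the default arguments of load_data, the indices stored in train_data are shifted by 3 relative to get_word_index(), because 0, 1 and 2 are reserved for padding, start-of-sequence and out-of-vocabulary markers. A minimal sketch of the decoding step follows, using a tiny hand-made vocabulary; the toy_word_index dictionary and the decode_review helper are illustrative assumptions, not part of Keras.

```python
def decode_review(encoded, index_to_word, index_from=3):
    """Map shifted indices back to words; reserved or unknown indices become '?'."""
    return ' '.join(index_to_word.get(index - index_from, '?') for index in encoded)


# Toy vocabulary in the same word -> rank format as get_word_index()
toy_word_index = {'the': 1, 'film': 2, 'was': 3, 'great': 4}
toy_index_to_word = {rank: word for word, rank in toy_word_index.items()}

# 1 marks the start of the review and 2 an out-of-vocabulary word
toy_review = [1, 4, 5, 6, 2, 7]
print(decode_review(toy_review, toy_index_to_word))  # ? the film was ? great
```

With the real data, the corresponding call would be decode_review(train_data[0], ind_to_word).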
