In [1]:
from keras.datasets import imdb
import keras
import numpy as np
keras.__version__

Using TensorFlow backend.


'2.2.4'

In [2]:
type(imdb)

module

**DIR Function**

Calling dir() on an object returns a list of the names of its attributes and methods. It works on any object (functions, modules, strings, lists, dictionaries, etc.); here we apply it to the module object "imdb".

* For module/library objects, it tries to return a list of the names of all the attributes contained in that module.
* If no argument is passed, it returns a list of the names in the current local scope.
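
Both cases can be seen in a small sketch. It uses the standard-library math module as a stand-in for imdb, an assumption made only so that the example runs without Keras; local_scope_names is a hypothetical helper.

```python
import math

# dir() with an argument: names of the attributes defined on that object.
public_names = [name for name in dir(math) if not name.startswith("_")]
print(public_names[:5])


def local_scope_names():
    first, second = 1, 2
    # dir() without arguments: sorted names in the current local scope.
    return dir()


print(local_scope_names())  # ['first', 'second']
```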



In [3]:
print(dir(imdb))



**load_data to get training and test data**

Older versions of numpy used **allow_pickle=True** as the default value for the **np.load** command, and this version of keras relies on that default while importing the data. Newer numpy versions default to allow_pickle=False, so the unmodified call fails. We can either change the np.load call in imdb.py, or change the default value just for importing the data and restore the original afterwards. We take the second route.

In [4]:
# save np.load
np_load_old = np.load

# modify the default parameters of np.load
np.load = lambda *a,**k: np_load_old(*a, allow_pickle=True, **k)

Now we will load the data, keeping only the 10,000 most frequent words.

In [5]:
# call load_data with allow_pickle implicitly set to true
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# restore np.load for future normal usage
np.load = np_load_old

Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
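
The save, patch and restore steps in the cells above can also be wrapped in a context manager, so that np.load is restored even if loading raises an exception. This is only a sketch of that idea: the name np_load_allowing_pickle is made up for illustration, and functools.partial is used in place of the lambda.

```python
import contextlib
import functools

import numpy as np


@contextlib.contextmanager
def np_load_allowing_pickle():
    """Temporarily make np.load default to allow_pickle=True."""
    original_load = np.load
    np.load = functools.partial(original_load, allow_pickle=True)
    try:
        yield
    finally:
        # Runs even if the body raises, so np.load is always restored.
        np.load = original_load


# Intended use with the loader from this notebook (not executed here):
# with np_load_allowing_pickle():
#     (train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
```

Unlike the lambda, functools.partial still lets a caller pass allow_pickle explicitly without a duplicate-keyword error.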


**Let us see what our data looks like.**

In [6]:
print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


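The above are the indices of the words used in the first review. The mapping between words and indices can be obtained with the get_word_index() function of our **module imdb**. Note that, with its default arguments, load_data shifts every word index by 3, because 0, 1 and 2 are reserved for padding, start of sequence and unknown word.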

In [7]:
word_to_ind = imdb.get_word_index()

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


In [8]:
# invert the mapping: index -> word
ind_to_word = {index: word for word, index in word_to_ind.items()}

In [9]:
print("Indices {} and {} in the vocabulary map to the words '{}' and '{}' respectively.".format(16, 22, ind_to_word[16], ind_to_word[22]))
print("The words 'happy' and 'sad' have indices {} and {} in the vocabulary respectively.".format(word_to_ind['happy'], word_to_ind['sad']))

Indices 16 and 22 in the vocabulary map to the words 'with' and 'you' respectively.
The words 'happy' and 'sad' have indices 651 and 616 in the vocabulary respectively.
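
To turn a whole review back into text, the index offset has to be undone first. The sketch below assumes the default arguments of imdb.load_data (indices shifted by 3, with 0, 1 and 2 reserved); decode_review and the toy vocabulary are hypothetical and only stand in for the real ind_to_word.

```python
def decode_review(encoded, index_to_word, index_from=3, unknown="?"):
    """Map offset word indices back to words; reserved indices become `unknown`."""
    return " ".join(index_to_word.get(i - index_from, unknown) for i in encoded)


# Toy vocabulary standing in for ind_to_word (hypothetical values).
toy_ind_to_word = {1: "the", 2: "and", 3: "film", 4: "great"}

# 1 is the start marker and 2 marks an out-of-vocabulary word.
print(decode_review([1, 4, 6, 2, 7], toy_ind_to_word))  # ? the film ? great
```

With the real data this would be called as decode_review(train_data[0], ind_to_word).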


We have loaded the data and understood its structure. Now we need to get the data ready for modelling. The first thing to notice is that the reviews have different lengths, whereas the network needs every sample to have the same shape. There are two options:

* We can use embeddings of each word and equalize the length of each review by using padding.
* We can one-hot (more precisely, multi-hot) encode the lists to turn them into vectors of 0s and 1s. This means that if a word occurs at least once in a review, the position at that word's index holds 1, and otherwise 0.

For now, we will be using the latter approach.
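
For reference, the padding step of the first approach can be sketched in plain NumPy. This is an illustration only: pad_to_length is a hypothetical helper that pads and truncates at the front, which is also the default behaviour of the pad_sequences utility in keras.preprocessing.sequence.

```python
import numpy as np


def pad_to_length(sequences, maxlen, value=0):
    """Return an int array of shape (len(sequences), maxlen).

    Shorter sequences are left-padded with `value`; longer ones keep
    only their last `maxlen` entries.
    """
    padded = np.full((len(sequences), maxlen), value, dtype="int64")
    for row, sequence in enumerate(sequences):
        kept = list(sequence)[-maxlen:]
        if kept:
            padded[row, maxlen - len(kept):] = kept
    return padded


print(pad_to_length([[5, 6], [1, 2, 3, 4, 5]], maxlen=4))
# [[0 0 5 6]
#  [2 3 4 5]]
```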

In [10]:
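def vectorize_sequences(sequences, dimension=10000):
    # one row per review, one column per word index, initialised to 0
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # set the columns of every word index present in review i to 1
        results[i, sequence] = 1.
    return results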

In [11]:
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
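
A toy input makes the effect of this encoding visible. The function is repeated here only so that the example runs on its own; the two short "reviews" and dimension=5 are made up for illustration.

```python
import numpy as np


def vectorize_sequences(sequences, dimension=10000):
    # Same function as in the cell above, repeated so the example is self-contained.
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results


toy = vectorize_sequences([[0, 3, 3], [1]], dimension=5)
print(toy)
# [[1. 0. 0. 1. 0.]
#  [0. 1. 0. 0. 0.]]
```

The repeated index 3 in the first review still produces a single 1, so both word order and word counts are discarded by this representation.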

We will first try a simple fully connected neural network in Keras. Recurrent neural networks (RNNs) usually work well for language data, but we start with the simpler model. There are two ways to implement a model in Keras:

1. Using keras.models.Sequential
2. Using keras.models.Model

We will use both of them, in the above order.

In [12]:
from keras.layers import Dense
from keras.models import Sequential

In [13]:
model = Sequential()
model.add(Dense(108, activation = 'relu', input_shape = [10000,]))
model.add(Dense(10, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))

In [14]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 108)               1080108   
_________________________________________________________________
dense_2 (Dense)              (None, 10)                1090      
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 11        
Total params: 1,081,209
Trainable params: 1,081,209
Non-trainable params: 0
_________________________________________________________________
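
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per (input, unit) pair plus one bias per unit. A small sketch of that arithmetic, where dense_params is a helper written for this check and not a Keras function:

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units


layer_sizes = [10000, 108, 10, 1]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(per_layer, sum(per_layer))  # [1080108, 1090, 11] 1081209
```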


In [15]:
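class my_callback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # 'acc' is the training-accuracy key in this Keras version (2.2.4)
        acc = (logs or {}).get('acc')
        if acc is not None and acc > 0.99:
            print("Stopping to prevent overfitting")
            self.model.stop_training = True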

In [16]:
callback = my_callback()
model.compile(optimizer = 'Adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
model.fit(x_train, y_train, epochs = 20, callbacks = [callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Stopping to prevent overfitting


<keras.callbacks.History at 0x7f8e2290b240>

In [None]:
model.evaluate(x_test, y_test)

**We got about 86% accuracy on the test set.** But our in-sample (training) accuracy is more than 99%, so the model is clearly overfitting.
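
One common way to notice overfitting during training, rather than only at evaluation time, is to hold out part of the training data as a validation set. The following is a minimal sketch of such a split in NumPy; train_val_split, the 20% fraction and the toy arrays are assumptions for illustration and not part of the original notebook.

```python
import numpy as np


def train_val_split(x, y, val_fraction=0.2, seed=0):
    """Shuffle the samples and hold out `val_fraction` of them for validation."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(x))
    n_val = int(len(x) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (x[train_idx], y[train_idx]), (x[val_idx], y[val_idx])


# Toy arrays standing in for x_train and y_train.
toy_x = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)
(x_tr, y_tr), (x_val, y_val) = train_val_split(toy_x, toy_y)
print(x_tr.shape, x_val.shape)  # (8, 2) (2, 2)

# With the real data, the held-out part could be passed to fit (not executed here):
# model.fit(x_tr, y_tr, epochs=20, validation_data=(x_val, y_val))
```

The validation metrics reported after each epoch could then drive the stopping decision instead of the training accuracy.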

In [17]:
from keras.layers import Input
from keras.models import Model

In [18]:
X = Input(shape = (10000,))
Y = Dense(108, activation = 'relu')(X)
Y = Dense(10, activation = 'relu')(Y)
Y = Dense(1, activation = 'sigmoid')(Y)

In [19]:
model = Model(inputs = [X], outputs = [Y])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 10000)             0         
_________________________________________________________________
dense_4 (Dense)              (None, 108)               1080108   
_________________________________________________________________
dense_5 (Dense)              (None, 10)                1090      
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 11        
Total params: 1,081,209
Trainable params: 1,081,209
Non-trainable params: 0
_________________________________________________________________


In [21]:
model.compile(optimizer = 'Adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
model.fit(x_train, y_train, epochs = 15, callbacks = [callback])

Epoch 1/15
Epoch 2/15
Epoch 3/15
Stopping to prevent overfitting


<keras.callbacks.History at 0x7f8e2a44d080>

In [22]:
model.evaluate(x_test, y_test)



[0.5419849581956864, 0.86428]

This gave 86.4% accuracy. The results differ slightly because the weights are randomly initialized on each run.