# Integer Encoding

Import the necessary libraries: the IMDB dataset, the SimpleRNN and Dense layers, the Sequential model and the padding utility.

In [17]:
from keras.datasets import imdb
from keras.layers import SimpleRNN, Dense
from keras import Sequential
from keras.preprocessing.sequence import pad_sequences

Load the dataset, which comes already split into train and test sets, and look at the vector representation of the data.

In [3]:
(x_train,y_train),(x_test,y_test)=imdb.load_data()

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz


In [8]:
x_train

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1

In [10]:
len(x_train)

25000

In [14]:
len(x_train[0])

218

In [15]:
len(x_train[278])

140

In [13]:
# Find the length of the longest review in the training set
max_len = max(len(review) for review in x_train)
print(max_len)

2494


Now we have the maximum review length, which we will use to pad the shorter sequences before feeding them into the RNN layer.

In [19]:
x_train=pad_sequences(x_train, padding='post',maxlen=max_len)
x_test=pad_sequences(x_test,padding='post',maxlen=max_len)

Now our integer-encoded sequences are ready: all sequences are of the same length and can be fed into our model.

In [20]:
len(x_train[0])
len(x_train[278])

2494
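To make the effect of post-padding concrete, here is a minimal pure-Python sketch on toy data. The helper `pad_post` and the `toy_*` names are hypothetical and only mimic what `pad_sequences(..., padding='post')` does when no sequence exceeds `maxlen`; the real function also truncates longer sequences, which this sketch does not attempt.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad every sequence with `value` up to `maxlen` (hypothetical helper)."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # Truncation is deliberately left out of this sketch
            raise ValueError("sequence longer than maxlen")
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded


# Toy integer-encoded reviews of different lengths (made-up data)
toy_reviews = [[1, 14, 22], [1, 194], [1, 14, 22, 16, 43]]
toy_max_len = max(len(review) for review in toy_reviews)

toy_padded = pad_post(toy_reviews, toy_max_len)
for row in toy_padded:
    print(row)
# [1, 14, 22, 0, 0]
# [1, 194, 0, 0, 0]
# [1, 14, 22, 16, 43]
```

Every row now has the same length, and the zeroes appear after the real tokens because the padding is applied at the end.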

In [25]:
model = Sequential()
model.add(SimpleRNN(32, input_shape=(max_len,1), return_sequences=False))
model.add(Dense(1, activation='sigmoid'))
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 simple_rnn_3 (SimpleRNN)    (None, 32)                1088      
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1121 (4.38 KB)
Trainable params: 1121 (4.38 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


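The SimpleRNN layer has 1088 trainable parameters: 32 input weights for the current time step (1 input feature × 32 units), 32 × 32 = 1024 recurrent weights applied to the hidden state from the previous time step, and 32 biases. Together with the 33 parameters of the Dense layer, this gives the total of 1121 shown in the summary.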
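The arithmetic behind these counts can be checked with a short sketch. The function names below are hypothetical helpers written for illustration, not part of Keras; they assume one bias per unit, as in the summary above.

```python
def simple_rnn_param_count(input_dim, units):
    """Parameters of a simple RNN layer with one bias per unit (illustrative helper)."""
    input_weights = input_dim * units    # weights on the current time step's input
    recurrent_weights = units * units    # weights on the previous hidden state
    biases = units
    return input_weights + recurrent_weights + biases


def dense_param_count(input_dim, units):
    """Parameters of a fully connected layer: weights plus biases (illustrative helper)."""
    return input_dim * units + units


rnn_params = simple_rnn_param_count(input_dim=1, units=32)
dense_params = dense_param_count(input_dim=32, units=1)
print(rnn_params, dense_params, rnn_params + dense_params)
# 1088 33 1121
```

The same formula with `input_dim=10` gives 1376, which is the count a 32-unit SimpleRNN has when each time step is a 10-dimensional vector rather than a single integer.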

In [26]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

model.fit(x_train,y_train,epochs=5,validation_data=(x_test,y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x791f932c73a0>

# Embeddings Encoding

In the above code, we created the numerical vector from text using integer encoding. Now, we will create the numerical vector using embeddings:

The difference is that with integer encoding, the length of each numerical vector equals the length of the review, that is, the number of words in the review. Because the RNN is trained here on fixed-length input, we had to pad the shorter reviews. This resulted in a **Sparse vector**, i.e. a vector made up largely of zeroes alongside the actual integer encodings.
If we use embeddings, we choose the size of the embedded vector ourselves, and the model learns values for it that capture the contextual meaning of each word.\
Steps:\
1) Pad the text sequences to a fixed length (n), so that all the rows are of the same length.\
2) Set the input dimension of the embedding layer in the neural network, i.e. the vocabulary size.\
3) Specify the output dimension (d) of the embedding vector.\
4) Pass the sequences through the embedding layer: each sequence becomes a list of size n, where each word is an embedded vector of size d.\
This results in a **Dense vector**.
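The steps above can be sketched without Keras by treating the embedding layer as a lookup table with one row of size d per vocabulary entry. Everything here is a toy assumption: the vocabulary size, the values of n and d, and the random numbers standing in for weights that the real layer would learn during training.

```python
import random

random.seed(0)

vocab_size = 6   # toy vocabulary size (assumption)
d = 3            # output dimension of each embedding vector
n = 5            # fixed sequence length after padding

# One row of d values per token id; random stand-ins for learned weights
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(d)]
    for _ in range(vocab_size)
]

# A toy review, post-padded with 0 up to length n
padded_review = [1, 4, 2, 0, 0]

# The lookup: each integer token is replaced by its row in the table
embedded_review = [embedding_table[token] for token in padded_review]

print(len(embedded_review), len(embedded_review[0]))
# 5 3
```

Each review goes in as n integers and comes out as an n × d matrix of real values. The table itself holds `vocab_size * d` trainable values, which is why an embedding over 10000 words with d = 10 accounts for 100000 parameters.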

In [23]:
from keras.datasets import imdb
from keras import Sequential
from keras.layers import Flatten, Embedding, SimpleRNN, Dense
from keras.utils import pad_sequences

In [33]:
(X_train,Y_train),(X_test,Y_test)=imdb.load_data(num_words=10000, oov_char=0)

In [34]:
len(X_train)

25000

In [35]:
X_train.shape

(25000,)

In [36]:
# Find the length of the longest review in the training set
max_len = max(len(review) for review in X_train)
print(max_len)

2494


In [37]:
X_train=pad_sequences(X_train, padding='post', maxlen=2494)
X_test=pad_sequences(X_test, padding='post', maxlen=2494)

In [38]:
X_train.shape

(25000, 2494)

In [39]:
model2 = Sequential()
model2.add(Embedding(10000, 10,input_length=2494))
model2.add(SimpleRNN(32,return_sequences=False))
model2.add(Dense(1, activation='sigmoid'))

model2.summary()

Model: "sequential_6"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 2494, 10)          100000    
                                                                 
 simple_rnn_5 (SimpleRNN)    (None, 32)                1376      
                                                                 
 dense_5 (Dense)             (None, 1)                 33        
                                                                 
Total params: 101409 (396.13 KB)
Trainable params: 101409 (396.13 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________


In [40]:
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model2.fit(X_train, Y_train,epochs=5,validation_data=(X_test,Y_test))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Reasons for lower accuracy:\
1) The IMDB dataset is integer encoded: each word is represented by its frequency rank in the corpus, so a smaller integer means the word appeared more often across all the reviews.
When loading the data and defining the input dimension of the embedding layer, we capped the vocabulary at 10000 words, while the dataset contains many more distinct words. Every word outside the top 10000 was replaced by `oov_char=0`, the same value used for padding, so the model was trained on highly sparse input and did not train properly.\
2) We trained the model for only 5 epochs; training for longer could have let the weights capture more through backpropagation, resulting in better accuracy.\
3) The number of units in the RNN layer (32) was not tuned.\
4) Increasing the number of time steps in the context window of the RNN model could also help.
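The effect of the vocabulary cap in point 1 can be illustrated with a small sketch. The helper `cap_vocabulary` is hypothetical and only imitates the effect of `num_words=10000, oov_char=0`; it is not the Keras implementation. The sample tokens are a handful of values taken from the first review printed earlier.

```python
def cap_vocabulary(sequence, num_words, oov_char):
    """Replace every token id at or above `num_words` with `oov_char` (illustrative)."""
    return [token if token < num_words else oov_char for token in sequence]


# A few token ids from the first training review shown above
toy_review = [1, 14, 22, 22665, 9, 21631, 336]

capped_review = cap_vocabulary(toy_review, num_words=10000, oov_char=0)
print(capped_review)
# [1, 14, 22, 0, 9, 0, 336]
```

After capping, the rare words are indistinguishable from padding zeroes, which is one way the input becomes sparser than the raw reviews suggest.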