UNDERSTANDING HOW THE EMBEDDING LAYER WORKS

In [18]:
docs = ['go india',
		'india india',
		'hip hip hurray',
		'jeetega bhai jeetega india jeetega',
		'bharat mata ki jai',
		'kohli kohli',
		'sachin sachin',
		'dhoni dhoni',
		'modi ji ki jai',
		'inquilab zindabad']

In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()

In [20]:
tokenizer.fit_on_texts(docs)

In [21]:
tokenizer.word_index  # maps each unique word to an integer index (1 = most frequent word)

{'india': 1,
 'jeetega': 2,
 'hip': 3,
 'ki': 4,
 'jai': 5,
 'kohli': 6,
 'sachin': 7,
 'dhoni': 8,
 'go': 9,
 'hurray': 10,
 'bhai': 11,
 'bharat': 12,
 'mata': 13,
 'modi': 14,
 'ji': 15,
 'inquilab': 16,
 'zindabad': 17}

In [22]:
len(tokenizer.word_index)  # number of unique words in the vocabulary

17
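
To make the numbering above less opaque, here is a minimal pure-Python sketch of how such an index can be built: count the words, then rank them by frequency, with ties keeping first-seen order. The helper names `build_word_index` and `texts_to_sequences` are hypothetical, and this is a simplification rather than the Keras implementation (the real Tokenizer also strips punctuation by default).

```python
from collections import Counter

docs = ['go india', 'india india', 'hip hip hurray',
        'jeetega bhai jeetega india jeetega', 'bharat mata ki jai',
        'kohli kohli', 'sachin sachin', 'dhoni dhoni',
        'modi ji ki jai', 'inquilab zindabad']

def build_word_index(texts):
    """Rank words by frequency; index 0 is left free for padding."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    # sorted() is stable, so equally frequent words keep first-seen order
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its integer index."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

word_index = build_word_index(docs)
sequences = texts_to_sequences(docs, word_index)
print(word_index['india'], word_index['jeetega'], len(word_index))  # 1 2 17
print(sequences[3])                                                 # [2, 11, 2, 1, 2]
```

On this toy corpus the sketch reproduces the same indices as the tokenizer output shown above.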

In [23]:
sequences = tokenizer.texts_to_sequences(docs)  # replace every word with its integer index
sequences

[[9, 1],
 [1, 1],
 [3, 3, 10],
 [2, 11, 2, 1, 2],
 [12, 13, 4, 5],
 [6, 6],
 [7, 7],
 [8, 8],
 [14, 15, 4, 5],
 [16, 17]]

In [24]:
from keras.utils import pad_sequences
sequences = pad_sequences(sequences, padding='post')  # append 0s so every sequence has the same length
sequences

array([[ 9,  1,  0,  0,  0],
       [ 1,  1,  0,  0,  0],
       [ 3,  3, 10,  0,  0],
       [ 2, 11,  2,  1,  2],
       [12, 13,  4,  5,  0],
       [ 6,  6,  0,  0,  0],
       [ 7,  7,  0,  0,  0],
       [ 8,  8,  0,  0,  0],
       [14, 15,  4,  5,  0],
       [16, 17,  0,  0,  0]], dtype=int32)
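
What post-padding does can be sketched in a few lines of plain Python. The function name `pad_post` is hypothetical and the sketch covers only the case used above (no `maxlen`, pad value 0); it is not the Keras implementation.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the length of the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]

sequences = [[9, 1], [3, 3, 10], [2, 11, 2, 1, 2], [12, 13, 4, 5]]
padded = pad_post(sequences)
for row in padded:
    print(row)
```

Because index 0 is never assigned to a real word, it is safe to use as the padding value.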

There are many pretrained word embeddings you can use instead of training from scratch. These embeddings have already learned word meanings from large corpora such as Wikipedia or Common Crawl and can improve performance, especially with smaller datasets.

✅ Popular Pretrained Embeddings:

| Embedding | Dimensions | Trained On | Format | Link |
|---|---|---|---|---|
| GloVe (Global Vectors) | 50, 100, 200, 300 | Wikipedia, Common Crawl | .txt | GloVe Website |
| Word2Vec | 300 | Google News (100B words) | .bin | Google News Word2Vec |
| FastText | 300 | Wikipedia, Common Crawl | .vec | FastText Website |
| BERT-like models | 768+ | BooksCorpus + Wikipedia | Transformer | Use via transformers library (not traditional Embedding layer) |
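
As a rough illustration of how text-format vectors (GloVe-style, one word per line followed by its numbers) could be turned into a weight matrix, here is a minimal sketch. The three 2-dimensional vectors are invented toy data held in memory, not real GloVe values, and `load_vectors` and `build_embedding_matrix` are hypothetical helper names.

```python
import io
import numpy as np

# Toy stand-in for a GloVe-style .txt file. The values are made up for illustration.
toy_file = io.StringIO(
    "india 0.10 0.20\n"
    "jai 0.30 -0.40\n"
    "cricket 0.50 0.60\n"
)

# A small subset of the tokenizer's word_index, to keep the example short
word_index = {'india': 1, 'jeetega': 2, 'jai': 5}

def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a dict of word -> vector."""
    vectors = {}
    for line in handle:
        word, *values = line.split()
        vectors[word] = np.asarray(values, dtype='float32')
    return vectors

def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; row 0 is for padding."""
    matrix = np.zeros((max(word_index.values()) + 1, dim), dtype='float32')
    for word, idx in word_index.items():
        if word in vectors:          # words without a pretrained vector stay all-zero
            matrix[idx] = vectors[word]
    return matrix

vectors = load_vectors(toy_file)
embedding_matrix = build_embedding_matrix(word_index, vectors, dim=2)
print(embedding_matrix.shape)   # (6, 2)
print(embedding_matrix[1])      # the toy vector for 'india'
```

A matrix built this way is typically used as the initial weights of an Embedding layer, often with training of that layer switched off; the exact argument for passing the weights differs between Keras versions.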

In [26]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
model = Sequential()
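model.add(Embedding(input_dim=18, output_dim=2, input_length=5))  # 17 words + 1 for the padding index 0

model.summary()



Embedding layer: Maps integer-encoded words (word indices) to dense vectors of a fixed size. It is often the first layer in NLP models.

input_dim=18:
The number of rows in the embedding table, which must be larger than the largest index in the input. The tokenizer numbers the 17 words from 1 to 17 and padding uses 0, so the inputs range from 0 to 17 and input_dim is 17 + 1 = 18. With input_dim=17, index 17 ('zindabad') would be out of range.

output_dim=2:
Each word (integer index) is mapped to a 2-dimensional dense vector.

input_length=5:
Each input sequence has exactly 5 tokens (after padding). In Keras 3 this argument is deprecated and the length is inferred from the input, so it can be left out there.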
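
Conceptually, the layer is a lookup table: each integer selects one row of a weight matrix. The NumPy sketch below illustrates this with random stand-in weights (Keras initialises the table with small uniform values by default); it assumes a table of 18 rows, that is, the 17 words plus the padding index 0.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, output_dim = 18, 2                    # 17 words + padding index 0
table = rng.uniform(-0.05, 0.05, size=(vocab_size, output_dim))  # stand-in weights

batch = np.array([[9, 1, 0, 0, 0],                # 'go india' + padding
                  [1, 1, 0, 0, 0]])               # 'india india' + padding

vectors = table[batch]                            # one row lookup per token
print(vectors.shape)                              # (2, 5, 2)

# The same index always yields the same vector, wherever it occurs
print(np.array_equal(vectors[0, 1], vectors[1, 0]))   # True ('india' in both rows)
print(np.array_equal(vectors[0, 2], table[0]))        # True (padding index 0)
```

This is why repeated words and the padding positions show identical vectors in the output of model.predict.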

In [27]:
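# compile()'s second positional argument is the loss, so the metric is passed by keyword.
# predict() below does not actually need a compiled model.
model.compile(optimizer='adam', metrics=['accuracy'])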

In [28]:
pred = model.predict(sequences)
print(pred)

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 891ms/step
[[[-0.04860233  0.01279378]
  [ 0.01847763 -0.00152392]
  [ 0.03929747  0.04431275]
  [ 0.03929747  0.04431275]
  [ 0.03929747  0.04431275]]

 [[ 0.01847763 -0.00152392]
  [ 0.01847763 -0.00152392]
  [ 0.03929747  0.04431275]
  [ 0.03929747  0.04431275]
  [ 0.03929747  0.04431275]]

 [[-0.01186358  0.04681218]
  [-0.01186358  0.04681218]
  [ 0.03784121 -0.01065962]
  [ 0.03929747  0.04431275]
  [ 0.03929747  0.04431275]]

 [[ 0.01236308 -0.01556264]
  [ 0.00440408 -0.00773573]
  [ 0.01236308 -0.01556264]
  [ 0.01847763 -0.00152392]
  [ 0.01236308 -0.01556264]]

 [[-0.01977192 -0.02267319]
  [ 0.03183892  0.0247363 ]
  [-0.0191428   0.0201512 ]
  [ 0.02957476  0.03012795]
  [ 0.03929747  0.04431275]]

 [[ 0.00922541 -0.0087984 ]
  [ 0.00922541 -0.0087984 ]
  [ 0.03929747  0.04431275]
  [ 0.03929747  0.04431275]
  [ 0.03929747  0.04431275]]

 [[-0.00711991 -0.04011116]
  [-0.00711991 -0.04011116]
  [ 0.03929747  0.0

NOW FOR IMDB DATA

In [30]:
from keras.datasets import imdb
from tensorflow.keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras import Sequential
from keras.layers import Dense,SimpleRNN,Embedding,Flatten

In [31]:
(X_train,y_train),(X_test,y_test) = imdb.load_data(num_words=10000)  # keep only the 10,000 most frequent words so every index fits Embedding(input_dim=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 ━━━━━━━━━━━━━━━━━━━━ 0s 0us/step


In [33]:
X_train #already integer encoded; 2 marks words outside the top 10,000

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1

In [38]:
X_train = pad_sequences(X_train,padding='post',maxlen=50)
X_test = pad_sequences(X_test,padding='post',maxlen=50)

This uses Keras' pad_sequences to make all sequences in X_train and X_test the same length, which is required to batch them into a single array for the network.

X_train / X_test: Lists of sequences, where each sequence is a list of integers (word indices).

padding='post': Pads with 0s after the sequence (at the end).

maxlen=50: Makes every sequence exactly 50 tokens long. Shorter reviews are padded; longer reviews are truncated, by default from the front (truncating='pre'), so the last 50 tokens are kept.
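
The effect of maxlen on long and short sequences can be sketched for a single sequence in plain Python. `pad_or_truncate` is a hypothetical helper that mirrors the `padding` and `truncating` options of pad_sequences (whose default is truncating='pre'); it is an illustration, not the library code.

```python
def pad_or_truncate(seq, maxlen, padding='post', truncating='pre', value=0):
    """Force one sequence to exactly `maxlen` items."""
    if len(seq) > maxlen:
        # 'pre' drops tokens from the front and keeps the last `maxlen` tokens
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return list(seq) + fill if padding == 'post' else fill + list(seq)

long_review = list(range(1, 9))     # 8 tokens: 1..8
short_review = [4, 5]

print(pad_or_truncate(long_review, maxlen=5))    # [4, 5, 6, 7, 8]
print(pad_or_truncate(short_review, maxlen=5))   # [4, 5, 0, 0, 0]
```

With the call above, a review longer than 50 tokens therefore contributes only its last 50 tokens to the model.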

In [39]:
X_train.shape

(25000, 50)

In [41]:
model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=2, input_length=50))
model.add(SimpleRNN(32,return_sequences=False))
model.add(Dense(1, activation='sigmoid'))

model.summary()
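
The sketch below reproduces the parameter counts for this architecture and walks one review through the same computation in NumPy. All weights are random stand-ins and the review is made up, so only the shapes and counts are meaningful.

```python
import numpy as np

vocab_size, embed_dim, units, timesteps = 10000, 2, 32, 50

# Trainable parameters per layer
embedding_params = vocab_size * embed_dim          # one 2-d vector per index
rnn_params = units * (embed_dim + units + 1)       # input weights + recurrent weights + bias
dense_params = units * 1 + 1                       # 32 weights + 1 bias
print(embedding_params, rnn_params, dense_params)  # 20000 1120 33

# Forward pass for one review with random stand-in weights
rng = np.random.default_rng(0)
E = rng.uniform(-0.05, 0.05, (vocab_size, embed_dim))   # embedding table
W = rng.normal(0, 0.1, (embed_dim, units))              # input -> hidden
U = rng.normal(0, 0.1, (units, units))                  # hidden -> hidden
b = np.zeros(units)
w_out = rng.normal(0, 0.1, units)                       # hidden -> output
b_out = 0.0

review = rng.integers(0, vocab_size, size=timesteps)    # 50 made-up word indices
h = np.zeros(units)
for x_t in E[review]:                                   # one 2-d vector per time step
    h = np.tanh(x_t @ W + h @ U + b)                    # SimpleRNN state update
prob = 1.0 / (1.0 + np.exp(-(h @ w_out + b_out)))       # sigmoid on the final state
print(h.shape, float(prob))
```

Because return_sequences=False, only the final hidden state reaches the Dense layer, which is why the loop keeps overwriting h instead of collecting every step.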



In [42]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(X_train, y_train,epochs=5,validation_data=(X_test,y_test))

Epoch 1/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 12s 10ms/step - acc: 0.5456 - loss: 0.6701 - val_acc: 0.7918 - val_loss: 0.4500
Epoch 2/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 7s 8ms/step - acc: 0.8187 - loss: 0.4116 - val_acc: 0.8050 - val_loss: 0.4343
Epoch 3/5
782/782 ━━━━━━━━━━━━━━━━━━━━ 6s 8ms/step - acc: 0.8559 - loss: 0.3399 - val_acc: 0.7997 - val_loss: 0.4438
Epoch 4/5
780/782 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - acc: 0.8772 - loss: 0.2995

KeyboardInterrupt: 