
Word Embedding using Keras

In [8]:
from tensorflow.keras.preprocessing.text import one_hot

In [9]:
sentences = ['glass of milk','the glass of juice', 'the cup of tea', 'I am a good boy',
             'I am a good developer', 'understanding the meaning of words']

In [10]:
# vocabulary size
voc_size = 10000

In [13]:
onehot_repr = [one_hot(words,voc_size) for words in sentences]

In [14]:
onehot_repr

[[1874, 1943, 996],
 [2707, 1874, 1943, 1384],
 [2707, 2314, 1943, 5940],
 [1803, 6564, 8195, 7522, 1522],
 [1803, 6564, 8195, 7522, 3686],
 [7769, 2707, 4927, 1943, 9507]]
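Despite its name, `one_hot` does not build one-hot vectors. It hashes each word to an integer in the range `[1, voc_size)`, which is why 'glass' is 1874 in both of the first two sentences. The sketch below illustrates the idea in plain Python. `hash_encode` is a hypothetical helper, and using MD5 is an assumption made only so the indices are repeatable; it will not reproduce the exact numbers above.

```python
import hashlib

def hash_encode(sentence, vocab_size):
    """Hypothetical helper: map each word to an integer in [1, vocab_size)."""
    indices = []
    for word in sentence.lower().split():
        # MD5 is an assumption here, chosen only for a stable hash value
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        indices.append(int(digest, 16) % (vocab_size - 1) + 1)
    return indices

encoded = [hash_encode(s, 10000) for s in ["glass of milk", "the glass of juice"]]
print(encoded)
```

The same word always gets the same index, but two different words can collide on one index, so a larger `voc_size` lowers the chance of collisions.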

Word embedding representation: converting the integer-encoded words into embedding vectors, which together form the embedding matrix

Feature   man    woman   king    queen   mango   apple
gender    -1     1       -0.95   0.93    0       0
age       0      0       -1      1       0.53    0.21
royal
fruit
bad

Number of features: f = 5


One important choice here is the number of dimensions (that is, features) each word vector should have.
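To see why such feature values are useful, here is a small sketch that compares words by cosine similarity. It uses only the gender and age values filled in the table, and `cosine_similarity` is a helper written for this illustration, not part of Keras.

```python
import math

# (gender, age) values copied from the filled rows of the table
features = {
    "man":   [-1.0, 0.0],
    "woman": [1.0, 0.0],
    "king":  [-0.95, -1.0],
    "queen": [0.93, 1.0],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

man_king = cosine_similarity(features["man"], features["king"])
man_queen = cosine_similarity(features["man"], features["queen"])
print(man_king, man_queen)
```

With these two features, 'man' comes out closer to 'king' than to 'queen'. More dimensions give the vectors room to capture more such distinctions.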

In [16]:
from tensorflow.keras.layers import Embedding
# The Embedding layer expects every sentence to have the same number of tokens; pad_sequences enforces that
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
import numpy as np

In [17]:
sent_len = 8  # pad every sentence to length 8, since the Embedding layer needs inputs of equal length
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen = sent_len)
embedded_docs

array([[   0,    0,    0,    0,    0, 1874, 1943,  996],
       [   0,    0,    0,    0, 2707, 1874, 1943, 1384],
       [   0,    0,    0,    0, 2707, 2314, 1943, 5940],
       [   0,    0,    0, 1803, 6564, 8195, 7522, 1522],
       [   0,    0,    0, 1803, 6564, 8195, 7522, 3686],
       [   0,    0,    0, 7769, 2707, 4927, 1943, 9507]], dtype=int32)
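What `padding='pre'` does can be sketched in a few lines of plain Python. `pad_pre` is a hypothetical helper written for illustration. It also trims sequences longer than `maxlen` from the front, which matches the default `truncating='pre'` behaviour of `pad_sequences`.

```python
def pad_pre(sequence, maxlen, value=0):
    """Hypothetical helper: left-pad with `value`, trimming from the front if too long."""
    trimmed = sequence[-maxlen:]  # keep only the last `maxlen` items
    return [value] * (maxlen - len(trimmed)) + trimmed

padded = pad_pre([1874, 1943, 996], maxlen=8)
print(padded)
```

For the first sentence this gives five leading zeros followed by the three word indices, the same layout as the first row of the array above.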

In [18]:
dim=15

In [19]:
model = Sequential()
model.add(Embedding(voc_size, dim, input_length=sent_len))
model.compile('adam', 'mse')  # Adam optimizer with mean squared error as the loss function

In [20]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 8, 15)             150000    
                                                                 
Total params: 150000 (585.94 KB)
Trainable params: 150000 (585.94 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
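The parameter count in the summary can be checked by hand: the embedding matrix has one row of `dim` values per vocabulary index. The short calculation below assumes the weights are stored as 32-bit floats (4 bytes each), which is the Keras default.

```python
voc_size = 10000
dim = 15
bytes_per_float32 = 4  # assumption: default float32 weights

n_params = voc_size * dim                      # one dim-length row per index
size_kb = n_params * bytes_per_float32 / 1024  # size in KB
print(n_params, round(size_kb, 2))
```

This gives 150000 parameters and about 585.94 KB, matching the summary.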


In [21]:
# Pass the padded sequences through the Embedding layer to see the vector
# that each word index has been converted to
model.predict(embedded_docs)



array([[[-3.61329205e-02, -2.02817079e-02,  2.77867205e-02,
          4.44273017e-02,  3.64359058e-02, -1.18954107e-03,
          1.28715672e-02, -8.68754461e-03,  3.24158929e-02,
          4.10338491e-03,  2.75407545e-02, -3.02889701e-02,
         -3.97624858e-02,  4.62256372e-04, -1.78452246e-02],
        [-3.61329205e-02, -2.02817079e-02,  2.77867205e-02,
          4.44273017e-02,  3.64359058e-02, -1.18954107e-03,
          1.28715672e-02, -8.68754461e-03,  3.24158929e-02,
          4.10338491e-03,  2.75407545e-02, -3.02889701e-02,
         -3.97624858e-02,  4.62256372e-04, -1.78452246e-02],
        [-3.61329205e-02, -2.02817079e-02,  2.77867205e-02,
          4.44273017e-02,  3.64359058e-02, -1.18954107e-03,
          1.28715672e-02, -8.68754461e-03,  3.24158929e-02,
          4.10338491e-03,  2.75407545e-02, -3.02889701e-02,
         -3.97624858e-02,  4.62256372e-04, -1.78452246e-02],
        [-3.61329205e-02, -2.02817079e-02,  2.77867205e-02,
          4.44273017e-02,  3.64359058

In [22]:
embedded_docs[0]

array([   0,    0,    0,    0,    0, 1874, 1943,  996], dtype=int32)
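The first five entries of this sentence are the padding index 0, which is why the first rows of the prediction output are identical: an embedding layer is a lookup table, and equal indices return the same row. Below is a minimal sketch of that lookup in plain Python. `embedding_matrix` and `embed` are stand-ins written for illustration. The random values only imitate an untrained layer (Keras initializes embeddings uniformly in a small range by default) and will not match the numbers printed above.

```python
import random

random.seed(0)
vocab_size, dim = 10000, 15

# Stand-in for the layer's weights: one row of `dim` small random values per index
embedding_matrix = [
    [random.uniform(-0.05, 0.05) for _ in range(dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Look up one embedding row per token id."""
    return [embedding_matrix[i] for i in token_ids]

vectors = embed([0, 0, 0, 0, 0, 1874, 1943, 996])
print(len(vectors), len(vectors[0]))
```

Each of the 8 positions maps to a vector of length 15, and the five padded positions all share row 0 of the matrix.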

In [23]:
# Each word is represented by a vector of length `dim`
model.predict(embedded_docs[0])



array([[-0.03613292, -0.02028171,  0.02778672,  0.0444273 ,  0.03643591,
        -0.00118954,  0.01287157, -0.00868754,  0.03241589,  0.00410338,
         0.02754075, -0.03028897, -0.03976249,  0.00046226, -0.01784522],
       [-0.03613292, -0.02028171,  0.02778672,  0.0444273 ,  0.03643591,
        -0.00118954,  0.01287157, -0.00868754,  0.03241589,  0.00410338,
         0.02754075, -0.03028897, -0.03976249,  0.00046226, -0.01784522],
       [-0.03613292, -0.02028171,  0.02778672,  0.0444273 ,  0.03643591,
        -0.00118954,  0.01287157, -0.00868754,  0.03241589,  0.00410338,
         0.02754075, -0.03028897, -0.03976249,  0.00046226, -0.01784522],
       [-0.03613292, -0.02028171,  0.02778672,  0.0444273 ,  0.03643591,
        -0.00118954,  0.01287157, -0.00868754,  0.03241589,  0.00410338,
         0.02754075, -0.03028897, -0.03976249,  0.00046226, -0.01784522],
       [-0.03613292, -0.02028171,  0.02778672,  0.0444273 ,  0.03643591,
        -0.00118954,  0.01287157, -0.00868754, 