Word embeddings are a class of approaches for representing words and documents as dense vectors.

In [104]:
# data visualisation and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
#configure
# render matplotlib graphs inline, directly below the corresponding cell
%matplotlib inline
style.use('fivethirtyeight')
sns.set(style='whitegrid',color_codes=True)

#nltk
import nltk

#stop-words
from nltk.corpus import stopwords
nltk.download('stopwords')

# tokenizing
from nltk import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
##tensorflow >2.0
from tensorflow.keras.preprocessing.text import one_hot

In [107]:
### sentences
sent=['the glass of milk',
      'the glass of juice',
      'the cup of tea',
      'I am a good boy',
      'I am a good developer',
      'understand the meaning of words',
      'your videos are good']
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

**INTEGER ENCODING ALL THE DOCUMENTS**

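After this step every word is represented by an integer. We use the one_hot function from Keras for this. Despite its name, it does not build one-hot vectors: it hashes each word to an integer between 1 and voc_size - 1. Because this is a hash, two different words can receive the same integer. A larger voc_size makes such collisions less likely but does not guarantee a unique integer for every word. In the output below, for example, 'cup' and 'of' both map to 48.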
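To make the effect of hashing concrete, here is a minimal sketch that compares a hash-based encoding with an explicit vocabulary lookup on the same sentences. The helper names `hashed_index` and `build_vocab` are hypothetical, and md5 is only a stand-in for whichever hash function the library applies, so the integers will not match the Keras output. The point is that a hash may map two words to the same integer, while a dictionary built from the data never does.

```python
import hashlib

sentences = ['the glass of milk',
             'the glass of juice',
             'the cup of tea',
             'I am a good boy',
             'I am a good developer',
             'understand the meaning of words',
             'your videos are good']

def hashed_index(word, n):
    """Hypothetical helper: hash a word to an integer in [1, n - 1] (md5 is an assumption)."""
    digest = hashlib.md5(word.encode('utf-8')).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def build_vocab(docs):
    """Hypothetical helper: give each distinct lower-cased word its own index, starting at 1.
    Index 0 is left free so it can be used for padding later."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

vocab = build_vocab(sentences)
hashed = {word: hashed_index(word, 50) for word in vocab}

print(len(vocab), 'distinct words')
print(len(set(hashed.values())), 'distinct hashed integers (fewer means collisions)')
print(len(set(vocab.values())), 'distinct vocabulary integers')

# Collision-free integer encoding of every sentence
encoded = [[vocab[word] for word in doc.lower().split()] for doc in sentences]
print(encoded)
```

If collisions matter for the task, a vocabulary lookup like this (or a tokenizer that builds one) is the safer choice; hashing is convenient when the vocabulary is not known in advance.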

In [0]:
#vocab size
voc_size=50

In [109]:
onehot_repr=[one_hot(words,voc_size) for words in sent]
print(onehot_repr)

[[10, 33, 48, 17], [10, 33, 48, 46], [10, 48, 48, 43], [21, 39, 16, 5, 46], [21, 39, 16, 5, 37], [37, 10, 41, 48, 31], [13, 11, 29, 5]]


**PADDING THE DOCS (to make every doc the same length)**

The Keras Embedding layer requires all documents in a batch to have the same length, so we pad the shorter documents with 0. With padding='pre' the zeros are added in front of the words. The 'input_length' of the Embedding layer must then equal the padded length. It has to be at least the number of words in the longest document (5 here); we use sent_length=8.
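What pre-padding does can be sketched in a few lines of plain Python. The function `pad_pre` below is a hypothetical helper written for illustration, not the Keras implementation. It assumes that a document longer than the target length is cut from the front, which is our understanding of the pad_sequences default.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper: left-pad seq with value up to maxlen.
    A longer sequence keeps only its last maxlen items (assumed default behaviour)."""
    seq = list(seq)
    if len(seq) > maxlen:
        seq = seq[len(seq) - maxlen:]
    return [value] * (maxlen - len(seq)) + seq

docs = [[10, 33, 48, 17], [21, 39, 16, 5, 46]]
padded = [pad_pre(doc, 8) for doc in docs]
for row in padded:
    print(row)
```

Every row now has exactly 8 entries, with the original integers at the right-hand end.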

In [0]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential

In [111]:
sent_length=8
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[ 0  0  0  0 10 33 48 17]
 [ 0  0  0  0 10 33 48 46]
 [ 0  0  0  0 10 48 48 43]
 [ 0  0  0 21 39 16  5 46]
 [ 0  0  0 21 39 16  5 37]
 [ 0  0  0 37 10 41 48 31]
 [ 0  0  0  0 13 11 29  5]]


In [0]:
dim=10

**PARAMETERS OF THE EMBEDDING LAYER**

'input_dim' = the vocabulary size, i.e. the number of distinct integer indices the layer can receive (the largest index + 1). In our case this is voc_size.

'output_dim' = the number of dimensions we wish to embed into. Each word will be represented by a vector of this many dimensions.

'input_length' = the length of each (padded) input document, which is stored in the sent_length variable in our case.
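How the three parameters fit together can be sketched with NumPy alone: an embedding layer behaves like a lookup table with input_dim rows and output_dim columns, indexed by the integers in each document. The names `table`, `batch` and `vectors` are ours, and the small uniform random values are only an assumption standing in for the layer's initial weights.

```python
import numpy as np

voc_size, dim, sent_length = 50, 10, 8

# Lookup table: one row of `dim` values per possible integer index.
rng = np.random.default_rng(0)
table = rng.uniform(-0.05, 0.05, size=(voc_size, dim))

# Two padded documents, each of length sent_length.
batch = np.array([[0, 0, 0, 0, 10, 33, 48, 17],
                  [0, 0, 0, 21, 39, 16, 5, 46]])

# Fancy indexing replaces every integer with its row of the table.
vectors = table[batch]

print(table.size)     # number of weights: voc_size * dim
print(vectors.shape)  # (number of documents, sent_length, dim)
```

The table holds voc_size * dim = 500 values, and each document becomes a sent_length x dim matrix.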

In [0]:
model=Sequential()
model.add(Embedding(voc_size,dim,input_length=sent_length))
model.compile('adam','mse')

In [114]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 8, 10)             500       
Total params: 500
Trainable params: 500
Non-trainable params: 0
_________________________________________________________________


In [115]:
print(model.predict(embedded_docs)[0]) # sentence 1 representation: one 10-dim vector per position

[[-0.02733698 -0.02462319 -0.03820319 -0.03506496 -0.0313495  -0.04966043
   0.0421304   0.00937941  0.02608326 -0.04202484]
 [-0.02733698 -0.02462319 -0.03820319 -0.03506496 -0.0313495  -0.04966043
   0.0421304   0.00937941  0.02608326 -0.04202484]
 [-0.02733698 -0.02462319 -0.03820319 -0.03506496 -0.0313495  -0.04966043
   0.0421304   0.00937941  0.02608326 -0.04202484]
 [-0.02733698 -0.02462319 -0.03820319 -0.03506496 -0.0313495  -0.04966043
   0.0421304   0.00937941  0.02608326 -0.04202484]
 [ 0.00321686  0.02367604 -0.03264217 -0.04508371 -0.03884301 -0.02859296
  -0.01672667 -0.00439918 -0.03695983 -0.04182911]
 [ 0.04063671  0.00095975  0.02020803 -0.00051867 -0.02977251 -0.01185254
   0.01610792 -0.01607718 -0.04178732 -0.02081467]
 [ 0.03244043 -0.01470071 -0.03146     0.04871193 -0.01202101  0.03124345
   0.00466467  0.0421304   0.02695843  0.0304435 ]
 [ 0.03362418 -0.04256827 -0.02428982  0.02579491 -0.0479595  -0.01110294
  -0.01982902  0.00185269  0.0046014  -0.0439818 ]]

In [116]:
embedded_docs[0]

array([ 0,  0,  0,  0, 10, 33, 48, 17], dtype=int32)
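Comparing this array with the prediction above explains why the first four rows of the output are identical: positions 0 to 3 all hold the padding index 0, so they all receive the same vector. A minimal NumPy sketch of that observation follows; the random `table` is an assumed stand-in for the trained weights, so the values differ from the model output.

```python
import numpy as np

doc = np.array([0, 0, 0, 0, 10, 33, 48, 17])

# Stand-in lookup table with 50 rows of 10 values each (assumption, not the model's weights).
rng = np.random.default_rng(1)
table = rng.uniform(-0.05, 0.05, size=(50, 10))

vectors = table[doc]

# True where the position holds the padding index 0.
is_padding = doc == 0
print(is_padding)

# Every padding position receives the same row, table[0].
print(np.all(vectors[is_padding] == table[0]))
```

The padding index therefore gets an ordinary vector of its own unless the model is told to ignore it.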