## A Detailed Guide to Understanding Word Embeddings and the Embedding Layer in Keras

In this notebook I explain the Keras Embedding layer. To do so I have created a sample corpus of just 3 documents, which is sufficient to show how the layer works.

Embeddings are useful in a variety of machine learning applications. For that reason I have attached many data sources to the kernel where I feel that embeddings and the Keras Embedding layer may prove useful.

Before diving in, let us skim through some applications of embeddings:

1.   The first application that strikes me is Collaborative Filtering based Recommender Systems, where we create the user embeddings and the movie embeddings by decomposing the utility matrix that contains the user-item ratings.

2.   The second use is in Natural Language Processing and related applications, where we create word embeddings for all the words present in the documents of our corpus.



Thus the Embedding layer in Keras can be used whenever we want to embed high-dimensional data into a lower-dimensional vector space.
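To make "embedding" concrete, it may help to think of an embedding layer as a lookup table with one row per item (word, user or movie). The sketch below uses plain NumPy rather than Keras, and the names `table`, `one_hot_row` and the sizes are illustrative assumptions, not something taken from this notebook. It shows that selecting row `i` of the table gives the same result as multiplying a one-hot vector by the table, which is why a large sparse input can be replaced by a small dense vector.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 5   # number of distinct items (illustrative)
embed_dim = 3    # size of each dense vector (illustrative)

# One row per item; in a real model these values are learned during training.
table = rng.normal(size=(vocab_size, embed_dim))

item_id = 2

# High-dimensional, sparse representation of the item.
one_hot_row = np.zeros(vocab_size)
one_hot_row[item_id] = 1.0

# Multiplying by the table and indexing into it select the same row.
via_matmul = one_hot_row @ table
via_lookup = table[item_id]

print(via_matmul)
print(via_lookup)
```

The lookup form is the one that matters in practice, since it avoids building the one-hot vector at all.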


## 1. Importing Modules

In [3]:
# Ignore the warnings
import warnings
warnings.filterwarnings('ignore')

# data visualisation and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns
# configure
# sets matplotlib to inline and displays graphs below the corresponding cell.
%matplotlib inline  
style.use('fivethirtyeight')
sns.set(style='whitegrid',color_codes=True)

#nltk
import nltk

#stop-words
from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))

# tokenizing
from nltk import word_tokenize,sent_tokenize

#keras
import keras
from keras.preprocessing.text import one_hot,Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense , Flatten ,Embedding,Input
from keras.models import Model

Using TensorFlow backend.


### CREATING SAMPLE CORPUS OF DOCUMENTS i.e. TEXTS

In [0]:
sample_text_1="bitty bought a bit of butter"
sample_text_2="but the bit of butter was a bit bitter"
sample_text_3="so she bought some better butter to make the bitter butter better"

corp=[sample_text_1,sample_text_2,sample_text_3]
no_docs=len(corp)


### INTEGER ENCODING ALL THE DOCUMENTS

After this step every word is represented by an integer. For this we use the one_hot function from Keras. Note that one_hot hashes each word into the range 1 to vocab_size-1, so a larger vocab_size makes collisions less likely but does not guarantee a unique integer for every word. In the output below, for example, 'the' and 'butter' both map to 36.

#### Note one important thing: the integer encoding of a word stays the same across documents, e.g. 'butter' is denoted by 36 in each and every document.


In [5]:
vocab_size=50 
encod_corp=[]
for i,doc in enumerate(corp):
    encoded_doc=one_hot(doc,vocab_size)  # hash each word to an integer below vocab_size
    encod_corp.append(encoded_doc)
    print("The encoding for document",i+1," is : ",encoded_doc)

The encoding for document 1  is :  [45, 47, 30, 4, 42, 36]
The encoding for document 2  is :  [30, 36, 4, 42, 36, 15, 30, 4, 35]
The encoding for document 3  is :  [19, 24, 47, 15, 10, 36, 37, 11, 36, 35, 36, 10]


### PADDING THE DOCS (to make every doc of same length)

The Keras Embedding layer requires all documents to be of the same length. Hence we will pad the shorter documents with 0 for now. The 'input_length' of the Keras Embedding layer will therefore be equal to the length (i.e. the number of words) of the longest document.

#### To pad the shorter documents I am using the pad_sequences function from the Keras library.


In [6]:
# length of the longest document. needed whenever we create embeddings for the words
maxlen=-1
for doc in corp:
    tokens=nltk.word_tokenize(doc)
    if(maxlen<len(tokens)):
        maxlen=len(tokens)
print("The maximum number of words in any document is : ",maxlen)

The maximum number of words in any document is :  12
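The cell above counts NLTK tokens, while the integer sequences were produced by Keras's own text splitting. For this corpus the two agree, but they can differ on text with punctuation, since NLTK keeps punctuation marks as separate tokens and `one_hot` filters them out by default. A possible alternative, sketched here with the encoded sequences copied from the output above, is to measure the sequences that will actually be padded.

```python
# Integer-encoded documents, copied from the output of the encoding cell above.
encod_corp = [
    [45, 47, 30, 4, 42, 36],
    [30, 36, 4, 42, 36, 15, 30, 4, 35],
    [19, 24, 47, 15, 10, 36, 37, 11, 36, 35, 36, 10],
]

# Length of the longest encoded document.
maxlen = max(len(seq) for seq in encod_corp)
print("The maximum number of words in any document is : ", maxlen)
```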


In [7]:
# now to create embeddings all of our docs need to be of same length. hence we can pad the docs with zeros.
pad_corp=pad_sequences(encod_corp,maxlen=maxlen,padding='post',value=0.0)
print("No of padded documents: ",len(pad_corp))

No of padded documents:  3


In [8]:
for i,doc in enumerate(pad_corp):
    print("The padded encoding for document",i+1," is : ",doc)

The padded encoding for document 1  is :  [45 47 30  4 42 36  0  0  0  0  0  0]
The padded encoding for document 2  is :  [30 36  4 42 36 15 30  4 35  0  0  0]
The padded encoding for document 3  is :  [19 24 47 15 10 36 37 11 36 35 36 10]
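What `padding='post'` does can be reproduced in a few lines of plain Python. The helper `pad_post` below is hypothetical and only covers the case used in this notebook: it appends the padding value on the right and does not truncate, whereas `pad_sequences` also handles sequences longer than `maxlen`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to `maxlen` (hypothetical helper)."""
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            raise ValueError("sequence longer than maxlen; truncation is not handled here")
        padded.append(list(seq) + [value] * (maxlen - len(seq)))
    return padded

# Integer-encoded documents, copied from the output of the encoding cell.
encod_corp = [
    [45, 47, 30, 4, 42, 36],
    [30, 36, 4, 42, 36, 15, 30, 4, 35],
    [19, 24, 47, 15, 10, 36, 37, 11, 36, 35, 36, 10],
]

pad_corp = pad_post(encod_corp, maxlen=12)
for row in pad_corp:
    print(row)
```

Since `one_hot` never returns 0, using 0 as the padding value does not clash with any real word.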



### ACTUAL CREATION OF THE EMBEDDINGS using KERAS EMBEDDING LAYER

Now all the documents are of the same length (after padding), so we are ready to create and use the embeddings.

#### I will embed the words into vectors of 8 dimensions.


In [9]:
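# specifying the input shape. This tensor is not used by the model built in the next cell.
# Keras adds the batch (document) dimension itself, so `shape` lists only the per-sample
# dimensions; with (no_docs,maxlen) each sample would be a 3 x 12 matrix.
# Renamed from `input` so that the Python builtin is not shadowed.
doc_input=Input(shape=(no_docs,maxlen),dtype='float64')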

W0715 21:38:34.949850 140632073156480 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0715 21:38:34.978384 140632073156480 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.



In [10]:
'''
shape of the input. 
each document has 12 elements (words), which is the value of our maxlen variable.
the batch dimension (number of documents) is left out of `shape`.
'''
word_input=Input(shape=(maxlen,),dtype='float64')  

# creating the embedding
word_embedding=Embedding(input_dim=vocab_size,output_dim=8,input_length=maxlen)(word_input)

word_vec=Flatten()(word_embedding) # flatten
embed_model =Model([word_input],word_vec) # combining all into a Keras model

W0715 21:40:41.086334 140632073156480 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.





PARAMETERS OF THE EMBEDDING LAYER ---

'input_dim' = the vocab size that we choose. It must be larger than the largest integer that appears in the encoded documents, i.e. maximum index + 1.

'output_dim' = the number of dimensions we wish to embed into. Each word will be represented by a vector of this many dimensions.

'input_length' = length of the longest document, which is stored in the maxlen variable in our case.
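The way these three parameters determine the output shape can be checked without Keras. In the NumPy sketch below, `weights` is a stand-in for the layer's weight matrix; the uniform range is an assumption chosen for illustration, and `pad_corp` is copied from the padded output above.

```python
import numpy as np

vocab_size, output_dim, maxlen = 50, 8, 12

rng = np.random.default_rng(42)
# Stand-in for the layer's weights: one 8-dimensional row per integer id.
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, output_dim))

pad_corp = np.array([
    [45, 47, 30, 4, 42, 36, 0, 0, 0, 0, 0, 0],
    [30, 36, 4, 42, 36, 15, 30, 4, 35, 0, 0, 0],
    [19, 24, 47, 15, 10, 36, 37, 11, 36, 35, 36, 10],
])

# Each integer id is replaced by its row of `weights`.
embedded = weights[pad_corp]
# Flatten everything except the document axis, as the Flatten layer does.
flattened = embedded.reshape(len(pad_corp), -1)

print(embedded.shape)   # (3, 12, 8)
print(flattened.shape)  # (3, 96)
```

Any id equal to or greater than `vocab_size` would raise an `IndexError` in this sketch, which is the reason 'input_dim' has to exceed the largest encoded integer.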


In [11]:
embed_model.compile(optimizer=keras.optimizers.Adam(lr=1e-3),loss='binary_crossentropy',metrics=['acc']) 
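# compiling the model. parameters can be tuned as always.
# note: the model is never fit in this notebook, so the embeddings printed
# below are the layer's initial random weights, not learned ones.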

W0715 21:42:51.356948 140632073156480 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/optimizers.py:790: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0715 21:42:51.365508 140632073156480 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:3376: The name tf.log is deprecated. Please use tf.math.log instead.

W0715 21:42:51.371954 140632073156480 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/nn_impl.py:180: add_dispatch_support.<locals>.wrapper (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where


In [12]:
print(type(word_embedding))
print(word_embedding)

<class 'tensorflow.python.framework.ops.Tensor'>
Tensor("embedding_1/embedding_lookup/Identity:0", shape=(?, 12, 8), dtype=float32)


In [13]:
print(embed_model.summary()) # summary of the model

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 12)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 12, 8)             400       
_________________________________________________________________
flatten_1 (Flatten)          (None, 96)                0         
Total params: 400
Trainable params: 400
Non-trainable params: 0
_________________________________________________________________
None
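The numbers in the summary follow directly from the layer parameters. As a quick arithmetic check (the variable names here are only for illustration):

```python
vocab_size = 50   # input_dim
output_dim = 8    # size of each word vector
maxlen = 12       # input_length

# The Embedding layer stores one vector of size output_dim per integer id.
embedding_params = vocab_size * output_dim
# Flatten concatenates the maxlen word vectors of a document into one row.
flatten_width = maxlen * output_dim

print("Embedding parameters:", embedding_params)  # 400
print("Flatten output width:", flatten_width)     # 96
```

The Input and Flatten layers have no weights, so the 400 embedding parameters are the only trainable parameters of the model.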


In [14]:
embeddings=embed_model.predict(pad_corp) # finally getting the embeddings.

W0715 21:44:27.131181 140632073156480 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:2741: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.



In [15]:
print("Shape of embeddings : ",embeddings.shape)
print(embeddings)

Shape of embeddings :  (3, 96)
[[-0.02699658 -0.03079704  0.03379023 -0.04982239  0.00083368  0.00569384
   0.04717347  0.04376452 -0.04690733  0.03076388 -0.00690231 -0.03882656
  -0.02259206 -0.03037192  0.00665653 -0.01084276  0.01365968  0.03528862
  -0.01280572  0.04850254 -0.04555798  0.01974492  0.02806634 -0.00859888
  -0.04621972  0.02860166 -0.00810762 -0.00442749 -0.00109739 -0.00604201
  -0.01278692 -0.04901706 -0.04353719  0.02420988 -0.03367801  0.04763379
   0.04046999 -0.04230182  0.03247703  0.02494914 -0.04569366  0.03029075
  -0.04342816 -0.04255356 -0.00922401  0.0213286  -0.03393956  0.03720582
  -0.04463922  0.04612506 -0.01916934 -0.01015208  0.00585507 -0.02827555
   0.0436017  -0.02507055 -0.04463922  0.04612506 -0.01916934 -0.01015208
   0.00585507 -0.02827555  0.0436017  -0.02507055 -0.04463922  0.04612506
  -0.01916934 -0.01015208  0.00585507 -0.02827555  0.0436017  -0.02507055
  -0.04463922  0.04612506 -0.01916934 -0.01015208  0.00585507 -0.02827555
   0.04

In [16]:
embeddings=embeddings.reshape(-1,maxlen,8)
print("Shape of embeddings : ",embeddings.shape) 
print(embeddings)

Shape of embeddings :  (3, 12, 8)
[[[-0.02699658 -0.03079704  0.03379023 -0.04982239  0.00083368
    0.00569384  0.04717347  0.04376452]
  [-0.04690733  0.03076388 -0.00690231 -0.03882656 -0.02259206
   -0.03037192  0.00665653 -0.01084276]
  [ 0.01365968  0.03528862 -0.01280572  0.04850254 -0.04555798
    0.01974492  0.02806634 -0.00859888]
  [-0.04621972  0.02860166 -0.00810762 -0.00442749 -0.00109739
   -0.00604201 -0.01278692 -0.04901706]
  [-0.04353719  0.02420988 -0.03367801  0.04763379  0.04046999
   -0.04230182  0.03247703  0.02494914]
  [-0.04569366  0.03029075 -0.04342816 -0.04255356 -0.00922401
    0.0213286  -0.03393956  0.03720582]
  [-0.04463922  0.04612506 -0.01916934 -0.01015208  0.00585507
   -0.02827555  0.0436017  -0.02507055]
  [-0.04463922  0.04612506 -0.01916934 -0.01015208  0.00585507
   -0.02827555  0.0436017  -0.02507055]
  [-0.04463922  0.04612506 -0.01916934 -0.01015208  0.00585507
   -0.02827555  0.0436017  -0.02507055]
  [-0.04463922  0.04612506 -0.01916934 



The resulting shape is (3, 12, 8).

3 ---> number of documents

12 ---> each document is made of 12 words (after padding), which was the maximum length of any document.

8 ---> each word is represented by an 8-dimensional vector.


#### GETTING ENCODING FOR A PARTICULAR WORD IN A SPECIFIC DOCUMENT

In [17]:
for i,doc in enumerate(embeddings):
    for j,word in enumerate(doc):
        print("The encoding for ",j+1,"th word","in",i+1,"th document is : \n\n",word)

The encoding for  1 th word in 1 th document is : 

 [-0.02699658 -0.03079704  0.03379023 -0.04982239  0.00083368  0.00569384
  0.04717347  0.04376452]
The encoding for  2 th word in 1 th document is : 

 [-0.04690733  0.03076388 -0.00690231 -0.03882656 -0.02259206 -0.03037192
  0.00665653 -0.01084276]
The encoding for  3 th word in 1 th document is : 

 [ 0.01365968  0.03528862 -0.01280572  0.04850254 -0.04555798  0.01974492
  0.02806634 -0.00859888]
The encoding for  4 th word in 1 th document is : 

 [-0.04621972  0.02860166 -0.00810762 -0.00442749 -0.00109739 -0.00604201
 -0.01278692 -0.04901706]
The encoding for  5 th word in 1 th document is : 

 [-0.04353719  0.02420988 -0.03367801  0.04763379  0.04046999 -0.04230182
  0.03247703  0.02494914]
The encoding for  6 th word in 1 th document is : 

 [-0.04569366  0.03029075 -0.04342816 -0.04255356 -0.00922401  0.0213286
 -0.03393956  0.03720582]
The encoding for  7 th word in 1 th document is : 

 [-0.04463922  0.04612506 -0.01916934

This makes it easier to see that we have 3 (size of corp) documents, each consisting of 12 (maxlen) words, with each word mapped to an 8-dimensional vector.
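The printed output also shows the same vector repeated from the 7th word of document 1 onwards. Those positions are all padding (id 0), and every occurrence of an id is mapped to the same row of the embedding matrix. The NumPy sketch below illustrates this with a randomly initialised stand-in matrix `weights` (an assumption, not the weights of the Keras model) and the padded corpus from above.

```python
import numpy as np

rng = np.random.default_rng(7)
# Stand-in embedding matrix: 50 ids, 8 dimensions each.
weights = rng.uniform(-0.05, 0.05, size=(50, 8))

pad_corp = np.array([
    [45, 47, 30, 4, 42, 36, 0, 0, 0, 0, 0, 0],
    [30, 36, 4, 42, 36, 15, 30, 4, 35, 0, 0, 0],
    [19, 24, 47, 15, 10, 36, 37, 11, 36, 35, 36, 10],
])
embeddings = weights[pad_corp]

# Words 7 to 12 of document 1 are all padding (id 0), so they share one vector.
padding_rows_equal = np.array_equal(embeddings[0, 6], embeddings[0, 11])

# 'butter' (id 36) is word 6 of document 1 and word 5 of document 2.
butter_rows_equal = np.array_equal(embeddings[0, 5], embeddings[1, 4])

# Different ids (45 and 47) point at different rows.
distinct_rows_differ = not np.array_equal(embeddings[0, 0], embeddings[0, 1])

print(padding_rows_equal, butter_rows_equal, distinct_rows_differ)
```

In other words, the vector depends only on the integer id, not on the document or the position in which the word occurs.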



Just as above, we can now process any other document. We can split the document into sentences with sent_tokenize.

Each sentence is a list of words, which we integer encode using the 'one_hot' function as above.

Since the sentences will have different numbers of words, we need to pad the sequences to the length of the longest sentence.

At this point we are ready to feed the input to the Keras Embedding layer as shown above.

'input_dim' = the vocab size that we will choose

'output_dim' = the number of dimensions we wish to embed into

'input_length' = length of the longest document
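The steps for a new document can be sketched end to end in plain Python. This is only an outline under stated assumptions: a regular expression stands in for `sent_tokenize`, an explicit dictionary stands in for `one_hot`, and the sample `text` and all variable names are made up for illustration.

```python
import re

text = "Betty bought butter. The butter was bitter! So she bought better butter."

# Stand-in for nltk.sent_tokenize: split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Stand-in for one_hot: a collision-free vocabulary; id 0 is reserved for padding.
word_to_id = {}
encoded = []
for sentence in sentences:
    words = re.findall(r"[a-z']+", sentence.lower())
    encoded.append([word_to_id.setdefault(w, len(word_to_id) + 1) for w in words])

# Pad every sentence on the right up to the length of the longest one.
maxlen = max(len(seq) for seq in encoded)
padded = [seq + [0] * (maxlen - len(seq)) for seq in encoded]

# Values that would be passed to the Embedding layer.
input_dim = len(word_to_id) + 1
input_length = maxlen

for row in padded:
    print(row)
print("input_dim =", input_dim, ", input_length =", input_length)
```

The `padded` rows play the role of `pad_corp`, and `input_dim` and `input_length` are the values the Embedding layer would need for this document.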
