
**Word Embeddings**
-  Word embeddings (WE) are another way, besides one-hot encoding (OHE), to associate a vector with a word
-  Generally these are dense word vectors
-  WE vectors are low-dimensional floating-point vectors, as opposed to OHE vectors, which are binary, sparse and very high-dimensional
-  WE vectors are learned from data
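
To make the contrast concrete, here is a minimal sketch with a made-up five-word vocabulary. The vocabulary, the embedding size of 3 and the random values are assumptions for illustration only; real embedding values are learned from data rather than drawn at random.

```python
import random

# Hypothetical toy vocabulary (not taken from the IMDB data)
vocab = ["the", "movie", "was", "good", "bad"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Binary, sparse vector with one dimension per vocabulary word."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

# Dense alternative: one short float vector per word.
# The values are random placeholders standing in for learned weights.
random.seed(0)
embedding_dim = 3
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab
]

def dense(word):
    """Look up the dense vector stored for this word."""
    return embedding_table[word_to_index[word]]

print(one_hot("good"))  # [0, 0, 0, 1, 0]; length grows with the vocabulary
print(dense("good"))    # three floats; length stays at embedding_dim
```

With a 10,000-word vocabulary the one-hot vector would have 10,000 entries, while the dense vector would keep whatever small size is chosen for `embedding_dim`.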

WE methods

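-  Type 1: Learn WE jointly with the main task, starting from random vectors that are updated like any other network weights
-  Type 2: Use pretrained word embeddings that were computed on a different task and load them into the model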

WE with Embedding Layers

-  The simplest way to associate a dense vector with a word is to choose the vector at random

-  This leads to an unstructured embedding space: words with similar meanings can end up with unrelated vectors

-  In a good WE space, the geometric relationships between word vectors reflect the semantic relationships between the words

-  WE are meant to map human language into a geometric space
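
One common way to read "geometric relationship" is the angle between two vectors, measured by cosine similarity. The sketch below uses hand-picked 2D vectors that are purely illustrative (they are not learned embeddings) to show what a structured space would look like: related words point in similar directions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-picked toy vectors, assumed for illustration only
toy_vectors = {
    "good": [0.9, 0.1],
    "great": [0.8, 0.2],
    "terrible": [-0.7, 0.3],
}

sim_close = cosine_similarity(toy_vectors["good"], toy_vectors["great"])
sim_far = cosine_similarity(toy_vectors["good"], toy_vectors["terrible"])

# In a well-structured space, similar words score higher than dissimilar ones
print(sim_close, sim_far)
```

With randomly chosen vectors there is no reason for `sim_close` to exceed `sim_far`, which is the problem the bullets above describe.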



In [1]:
from keras.layers import Embedding
# Number of possible tokens = 1000; dimensionality of the embeddings = 64
embedding_layer = Embedding(1000, 64)

# The Embedding layer works like a dictionary that maps integer indices to dense vectors:
# it takes integers as input, looks them up in an internal table, and returns the corresponding vectors.
# Word index -> Embedding layer -> Corresponding word vector

-  The Embedding layer takes as input a 2D tensor of integers of shape (samples, sequence_length), where each entry is a sequence of integers

-  It can embed sequences of variable lengths: for instance, you could feed the Embedding layer from the previous example batches of shape (32, 10) [batch of 32 sequences of length 10] or (64, 15) [batch of 64 sequences of length 15]

-  All sequences within one batch must have the same length: shorter sequences are padded with zeros and longer ones are truncated

-  The Embedding layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality)

-  When an Embedding layer is instantiated, its weights are initially random, just as with any other layer; they are then adjusted during training
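
The shapes described above can be checked with a small NumPy stand-in for the lookup. This is a sketch of the idea only, not Keras' actual implementation; the uniform initialisation range and the random batch are assumptions chosen for illustration.

```python
import numpy as np

vocab_size, embedding_dim = 1000, 64  # same sizes as Embedding(1000, 64)
rng = np.random.default_rng(0)

# Randomly initialised weight matrix: one row per token index
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A batch of 32 integer sequences, each of length 10
batch = rng.integers(0, vocab_size, size=(32, 10))

# The lookup: every integer selects its row of the weight matrix
embedded = weights[batch]

print(batch.shape)     # (32, 10)
print(embedded.shape)  # (32, 10, 64)
```

A batch of shape (64, 15) would pass through the same table and come out with shape (64, 15, 64), since the table depends only on the vocabulary size and the embedding dimensionality.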

In [2]:
# IMDB movie review sentiment prediction task

from keras.datasets import imdb
from keras import preprocessing

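max_features = 10000  # keep only the 10,000 most frequent words
maxlen = 20           # keep only 20 words per review

# Load the reviews as lists of integer word indices
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
#print(x_train,y_train)
x_train.shape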

(25000,)
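
Each review loaded above is a list of integers. With the default settings of `imdb.load_data`, indices 0, 1 and 2 are reserved for padding, start of sequence and unknown words, so every real word is stored as its rank in `imdb.get_word_index()` plus 3. That is why decoding a review has to subtract 3 from each index. A toy sketch of the scheme, using a made-up `toy_word_index` (hypothetical, not the real IMDB mapping):

```python
# Hypothetical miniature word index: word -> frequency rank (1-based)
toy_word_index = {"the": 1, "movie": 2, "bad": 3}
toy_reverse = {rank: word for word, rank in toy_word_index.items()}

# Encoded with the same convention as the default imdb.load_data:
# 1 = start marker, 2 = out-of-vocabulary word, real words = rank + 3
encoded = [1, 4, 5, 2, 6]

# Reserved indices have no entry in the reverse index, so they decode to '?'
decoded = " ".join(toy_reverse.get(i - 3, "?") for i in encoded)
print(decoded)  # ? the movie ? bad
```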

In [3]:
# Decode a review before padding

word_index = imdb.get_word_index()  # word -> integer index
reverse_word_index = {value: key for (key, value) in word_index.items()}  # integer index -> word
# Indices are offset by 3 because 0, 1 and 2 are reserved for padding, start of sequence and unknown words
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_train[1]])
decoded_review

"? big hair big boobs bad music and a giant safety pin these are the words to best describe this terrible movie i love cheesy horror movies and i've seen hundreds but this had got to be on of the worst ever made the plot is paper thin and ridiculous the acting is an abomination the script is completely laughable the best is the end showdown with the cop and how he worked out who the killer is it's just so damn terribly written the clothes are sickening and funny in equal ? the hair is big lots of boobs ? men wear those cut ? shirts that show off their ? sickening that men actually wore them and the music is just ? trash that plays over and over again in almost every scene there is trashy music boobs and ? taking away bodies and the gym still doesn't close for ? all joking aside this is a truly bad film whose only charm is to look back on the disaster that was the 80's and have a good old laugh at how bad everything was back then"

In [4]:
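# Turn the lists of integers into a 2D integer tensor of shape (samples, maxlen).
# By default pad_sequences pads and truncates at the front, so the last maxlen tokens are kept.
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)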

print(x_train,y_train)
x_train[1]

[[  65   16   38 ...   19  178   32]
 [  23    4 1690 ...   16  145   95]
 [1352   13  191 ...    7  129  113]
 ...
 [  11 1818 7561 ...    4 3586    2]
 [  92  401  728 ...   12    9   23]
 [ 764   40    4 ...  204  131    9]] [1 0 0 ... 0 1 0]


array([  23,    4, 1690,   15,   16,    4, 1355,    5,   28,    6,   52,
        154,  462,   33,   89,   78,  285,   16,  145,   95], dtype=int32)
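
Every review now has exactly 20 entries. `pad_sequences` pads and truncates at the front by default (`padding='pre'`, `truncating='pre'`), so a long review keeps only its last `maxlen` tokens and a short one gets leading zeros. The helper below, `pad_or_truncate_pre`, is a hypothetical pure-Python sketch of that default behaviour for a single sequence, not the Keras implementation.

```python
def pad_or_truncate_pre(sequence, maxlen, value=0):
    """Mimic the default 'pre' padding and truncation for one sequence."""
    if len(sequence) >= maxlen:
        # Too long: drop tokens from the front, keep the last maxlen
        return sequence[-maxlen:]
    # Too short: add padding values at the front
    return [value] * (maxlen - len(sequence)) + sequence

print(pad_or_truncate_pre([5, 6, 7], maxlen=5))              # [0, 0, 5, 6, 7]
print(pad_or_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```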

In [5]:
# Decode the same review after padding: only the last maxlen = 20 words remain
word_index = imdb.get_word_index()
reverse_word_index = {value: key for (key, value) in word_index.items()}
decoded_review = ' '.join([reverse_word_index.get(i - 3, '?') for i in x_train[1]])
decoded_review

"on the disaster that was the 80's and have a good old laugh at how bad everything was back then"

In [7]:
print(x_train.shape,y_train.shape)


(25000, 20) (25000,)


In [12]:
from keras.models import Sequential
from keras.layers import Flatten,Dense

model = Sequential()
model.add(Embedding(10000, 2, input_length=maxlen))  # After the Embedding layer, the activations have shape (samples, maxlen, 2)
model.add(Flatten())  # Flattens the 3D tensor of embeddings into a 2D tensor of shape (samples, maxlen * 2)

model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='rmsprop',loss='binary_crossentropy',metrics=['acc'])
model.summary()

history = model.fit(x_train,y_train,epochs=10,batch_size=32,validation_split=0.2)



Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_6 (Embedding)     (None, 20, 2)             20000     
                                                                 
 flatten_5 (Flatten)         (None, 40)                0         
                                                                 
 dense_5 (Dense)             (None, 1)                 41        
                                                                 
Total params: 20,041
Trainable params: 20,041
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
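
The parameter counts in the `model.summary()` output above can be reproduced by hand. The short check below is plain arithmetic using only the sizes passed to the layers in this notebook:

```python
vocab_size, embedding_dim, maxlen = 10000, 2, 20

embedding_params = vocab_size * embedding_dim  # one 2D vector per token index
flatten_units = maxlen * embedding_dim         # 20 positions * 2 values each
dense_params = flatten_units * 1 + 1           # one weight per input plus a bias
total_params = embedding_params + dense_params

print(embedding_params, flatten_units, dense_params, total_params)
# 20000 40 41 20041
```

Almost all of the trainable parameters sit in the embedding table, and the single Dense unit on top sees each of the 20 word positions separately.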
