<a href="https://colab.research.google.com/github/fabnancyuhp/DEEP-LEARNING/blob/main/NOTEBOOKS/Word_Embedding_Examples.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Example: Word vectorization with the tf.keras Embedding layer
We load the dataset used to learn the word embeddings. The preprocessing lowercases the text and removes digits and punctuation.


In [27]:
import pandas as pd
url = "https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/DATA/text_for_embedding.parquet.brotli"
text_for_embedding = pd.read_parquet(url)

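import re

def preprocess_text(x):
    # Strip punctuation: anything that is neither a word character nor whitespace
    new_text = re.sub(r'[^\w\s]', '', x)
    # URL and e-mail patterns. Punctuation is already gone at this point, so '://' and '@'
    # no longer occur and these two patterns cannot match; run them first to drop such tokens.
    new_text = re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '' , new_text)
    new_text = re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)',"", new_text)
    # Remove digits, then lowercase
    new_text = re.sub(r'[0-9]', '', new_text)
    return new_text.lower()

text_for_embedding['text'] = text_for_embedding['text'].apply(preprocess_text)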
text_for_embedding = text_for_embedding.reset_index()
text_for_embedding.head(3)

|   | index | class | text |
|---|-------|-------|------|
| 0 | 12775 | 1 | common sense is prevailing in brexit negotiati... |
| 1 | 930 | 1 | paul manafort the indicted former campaign man... |
| 2 | 4467 | 1 | us representative mark walker said after a mee... |
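
Because preprocess_text strips punctuation first, a URL such as https://example.com has already lost its `://` by the time the URL pattern runs. As a minimal sketch of an alternative ordering, URLs and e-mail addresses can be removed before punctuation. The helper name `clean` and the simplified patterns below are hypothetical and are not the code used to produce the outputs in this notebook.

```python
import re

# Simplified, hypothetical patterns (looser than the ones in preprocess_text)
URL_RE = re.compile(r'(?:http|https|ftp|ssh)://\S+')
EMAIL_RE = re.compile(r'\S+@\S+\.\S+')

def clean(text):
    text = text.lower()                   # lowercase first so later patterns see one case
    text = URL_RE.sub(' ', text)          # drop URLs while '://' is still present
    text = EMAIL_RE.sub(' ', text)        # drop e-mail addresses while '@' is still present
    text = re.sub(r'[^\w\s]', '', text)   # strip punctuation
    text = re.sub(r'[0-9]', '', text)     # strip digits
    return ' '.join(text.split())         # collapse repeated whitespace

print(clean("Visit https://example.com or mail Bob@Example.org in 2017!"))
```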


We use the Tokenizer object from tensorflow.keras.preprocessing.text to transform each text in text_for_embedding['text'] into a sequence of integers.

In [18]:
from tensorflow.keras.preprocessing.text import Tokenizer

#X = [text.split() for text in  list(text_for_embedding['text'])]
X = text_for_embedding['text'].to_numpy()
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
X_seq = tokenizer.texts_to_sequences(X)

In [None]:
vocab_size = len(tokenizer.word_index)+1

The Tokenizer object builds a word index: a dictionary, stored in tokenizer.word_index, that maps each word of the corpus (text_for_embedding['text']) to an integer. We look up the integers associated with the automakers gm and peugeot and with the word handsome. We also display the vocabulary size, which is the number of distinct words plus one because index 0 is reserved and never assigned to a word.

In [19]:
print(tokenizer.word_index['gm'])
print(tokenizer.word_index['peugeot'])
print(tokenizer.word_index['handsome'])
print("vocab_size:"+str(len(tokenizer.word_index)+1))

8813
43337
21466
vocab_size:178373
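
To make the mapping concrete, here is a minimal pure-Python sketch of the idea behind the word index: words are ranked by frequency and numbered from 1, most frequent first. The helper names `build_word_index` and `texts_to_sequences` and the toy corpus are assumptions for illustration; the real Tokenizer also handles filters, lowercasing and out-of-vocabulary tokens.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by descending frequency and number them starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # most_common() sorts by count; equal counts keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its integer index."""
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

toy_corpus = ["the car is red", "the automobile is fast", "the car"]
word_index = build_word_index(toy_corpus)
sequences = texts_to_sequences(toy_corpus, word_index)
vocab_size = len(word_index) + 1  # +1 because index 0 is never assigned to a word

print(word_index)
print(sequences)
print(vocab_size)
```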


Here we map the words of the first text, text_for_embedding['text'][0], to their integers by hand and check that the result matches X_seq[0].

In [29]:
print(text_for_embedding['text'][0])   
print([tokenizer.word_index[word] for word in text_for_embedding['text'][0].split()][0:4])
print(X_seq[0][0:4])

common sense is prevailing in brexit negotiations between britain and the european union france s foreign minister said on friday as he welcomed signs that talks were moving into a new phase after an initial breakthrough  the european commission said on friday enough progress had been made in brexit negotiations with britain and that a second phase of discussions should begin ending an impasse over the status of the irish border  the work that has been done on negotiations  is gradually leading us to common sense  jeanyves le drian told france inter radio  we wanted the conditions for britain s withdrawal to be clearly defined to be able to move into another phase that s what s going to happen now i hope
[1162, 1285, 11, 12314]
[1162, 1285, 11, 12314]


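We pad or truncate every sequence to exactly 1000 tokens with the pad_sequences function. Shorter sequences are filled with zeros at the end (padding='post'); longer ones are cut, by default from the front (truncating='pre').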

In [30]:
vocab_size = len(tokenizer.word_index)+1
from tensorflow.keras.preprocessing.sequence import pad_sequences
X_pad = pad_sequences(X_seq,maxlen=1000,padding='post')
X_pad[0][0:7]

array([ 1162,  1285,    11, 12314,     6,  1082,  1192], dtype=int32)
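
The effect of the call above on a single sequence can be sketched in plain Python. The helper `pad_post` is hypothetical; it mirrors padding='post' together with the Keras default truncating='pre', which keeps the last maxlen tokens of a sequence that is too long.

```python
def pad_post(sequence, maxlen, value=0):
    """Keep the last `maxlen` tokens, then fill the end with `value`."""
    kept = sequence[-maxlen:]                        # truncating='pre': drop tokens from the front
    return kept + [value] * (maxlen - len(kept))     # padding='post': zeros go at the end

print(pad_post([1162, 1285, 11], maxlen=5))
print(pad_post([1, 2, 3, 4, 5, 6], maxlen=4))
```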

We build a model that predicts class from the text. Once the network is trained, we get word embeddings as a side effect, so predicting the class is only a pretext task: what we actually care about is the embeddings. The network contains an Embedding layer named "embedding" that maps each word of word_index to a dense vector of size 15.

In [31]:
vocab_size = len(tokenizer.word_index)+1
embeded_vector_size = 15
max_length = 1000

from tensorflow.keras.layers import Dense, Embedding, Flatten
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(vocab_size,embeded_vector_size,input_length=max_length,name="embedding"))
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))

model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['binary_crossentropy'])
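
As a quick sanity check on the size of this network, the number of trainable parameters can be worked out by hand from the values above (a vocabulary of 178373 entries, 15-dimensional vectors, sequences of length 1000). This is plain arithmetic under those assumptions, not output from Keras.

```python
vocab_size = 178373      # len(tokenizer.word_index) + 1, as printed earlier
embedding_dim = 15
max_length = 1000

embedding_params = vocab_size * embedding_dim    # one 15-d vector per vocabulary index
flattened_size = max_length * embedding_dim      # Flatten turns (1000, 15) into 15000 values
dense_params = flattened_size * 1 + 1            # one weight per input plus one bias
total_params = embedding_params + dense_params

print(embedding_params, flattened_size, dense_params, total_params)
```

Almost all parameters sit in the embedding matrix, which is the part we extract after training.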

In [32]:
Y = text_for_embedding['class'].values
model.fit(X_pad,Y,epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<keras.callbacks.History at 0x7f70d6c2cdc0>

We get the Embedding matrix produced by our model:

In [33]:
len(tokenizer.word_index)+1
Embedding_matrix = model.get_layer('embedding').get_weights()[0]

Embedding_matrix.shape,len(tokenizer.word_index)+1

((178373, 15), 178373)

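Each row of the embedding matrix corresponds to an integer index: row i holds the vector of the word whose index in tokenizer.word_index is i, and row 0 corresponds to the reserved index 0 used for padding. Below, we build a dictionary that maps each word encountered in the corpus to its vector.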

In [34]:
dict_embeding = {word: Embedding_matrix[idx] for word, idx in tokenizer.word_index.items()}
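
The lookup performed by an embedding layer is nothing more than row selection. The sketch below uses a tiny hand-written matrix (all values and the helper name `embed` are made up for illustration) to show how a word index and a matrix combine into a word-to-vector dictionary.

```python
# Toy word index and 2-dimensional embedding matrix (values are arbitrary)
word_index = {"the": 1, "car": 2, "automobile": 3}
embedding_matrix = [
    [0.0, 0.0],   # row 0: reserved index, not tied to any word
    [0.1, 0.2],   # row 1: "the"
    [0.3, 0.4],   # row 2: "car"
    [0.5, 0.6],   # row 3: "automobile"
]

# Same construction as dict_embeding, on the toy data
word_to_vector = {word: embedding_matrix[idx] for word, idx in word_index.items()}

def embed(sequence):
    """Replace every integer of a sequence by its row of the matrix."""
    return [embedding_matrix[idx] for idx in sequence]

print(word_to_vector["car"])
print(embed([1, 2, 0]))
```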

In [35]:
dict_embeding['car']

array([-0.06600994,  0.02541907, -0.02483669,  0.0974196 ,  0.07336339,
       -0.02285413,  0.11127128,  0.01865002, -0.08324052,  0.00216753,
        0.15210138, -0.111652  , -0.03299259,  0.06876224, -0.06726278],
      dtype=float32)

In [37]:
dict_embeding['automobile']

array([ 1.41754709e-02, -1.14046864e-01,  3.99329215e-02, -7.66363591e-02,
       -7.85041898e-02,  1.10874735e-01, -1.17618717e-01, -4.69670035e-02,
        1.41314939e-02, -8.44414309e-02, -8.52738619e-02,  7.90638998e-02,
       -6.32217125e-05, -7.66031295e-02,  1.01314396e-01], dtype=float32)

We compute the cosine distance between the car vector and the automobile vector:

In [38]:
from scipy.spatial import distance
distance.cosine(dict_embeding['car'], dict_embeding['automobile'])

1.781082808971405
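
The cosine distance returned by scipy is 1 minus the cosine similarity, so it ranges from 0 (same direction) to 2 (opposite directions). A value of about 1.78 therefore corresponds to a similarity of about -0.78: the two vectors point in nearly opposite directions. This is plausible here, since the embedding was trained only to separate the two classes, not to place synonyms close together. A minimal pure-Python sketch of the formula (the function name `cosine_distance` is ours):

```python
import math

def cosine_distance(u, v):
    """1 - (u . v) / (||u|| * ||v||)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))    # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))    # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))   # opposite direction
```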

# Example: Word2Vec with gensim
We reuse the corpus from the previous example.

In [39]:
Corpus = text_for_embedding['text']

Word2vec is an unsupervised learning algorithm. Unlike the example above, it does not need the class to learn the word vectors: its only input is the corpus. We convert each word into a 100-dimensional vector.

In [40]:
import gensim

X = [d.split() for d in text_for_embedding['text'].tolist()]
DIM = 100

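# gensim >= 4.0 names the dimension parameter vector_size (it was called size in gensim 3.x)
w2v_model = gensim.models.Word2Vec(sentences=X, vector_size=DIM, window=10, min_count=1)
#w2v_model = gensim.models.Word2Vec(sentences = X,size=DIM,window=10,min_count=1)  # gensim 3.x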
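
The window argument sets how far to the left and right of a word the algorithm looks for context. The sketch below only illustrates that notion by listing (center, context) pairs for a small window; it is not gensim's training loop, and the helper name `context_pairs` is hypothetical.

```python
def context_pairs(tokens, window):
    """List every (center, context) pair whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "common sense is prevailing".split()
for pair in context_pairs(tokens, window=1):
    print(pair)
```

With window=10, as in the cell above, each word is related to up to ten neighbours on each side.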

With the gensim 4 API, the vocabulary size is given by len(w2v_model.wv). The vector representation of the word car is given by w2v_model.wv.get_vector("car"), or by w2v_model.wv.get_vector("car", norm=True) for the unit-normalized vector.

In [41]:
#print("Vocabulary size: "+str(len(w2v_model.wv)))
print("The vector representation of the word car: ")
w2v_model.wv.get_vector("car")[0:10] 

The vector representation of the word car: 


array([ 2.533081 , -3.367085 , -0.4285773,  2.5477087, -2.9588268,
       -0.4585598, -0.7733542,  2.5632012,  4.073978 , -4.035179 ],
      dtype=float32)

The words most similar to france are given by w2v_model.wv.most_similar("france"). We display the top three with their cosine similarities.

In [42]:
w2v_model.wv.most_similar("france")[0:3]

[('italy', 0.8047981858253479),
 ('germany', 0.7870280742645264),
 ('netherlands', 0.766010046005249)]
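
Conceptually, a nearest-neighbour query of this kind ranks every other word by cosine similarity to the query vector. The following sketch does this on a hand-made dictionary of 2-dimensional vectors; the vectors and the function names are invented for illustration and have nothing to do with the trained model.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

toy_vectors = {
    "france": [1.0, 0.0],
    "italy": [0.9, 0.1],
    "germany": [0.8, 0.3],
    "car": [0.0, 1.0],
}
print(most_similar("france", toy_vectors, topn=2))
```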

In [43]:
w2v_model.wv.get_vector("automobile")[0:10] 

array([-0.22998844,  0.5625361 ,  0.3646414 ,  0.22749312,  0.09547658,
        0.32680476, -0.10790078,  0.4935212 ,  0.17794877,  0.1389324 ],
      dtype=float32)

We compute the cosine distance between car and automobile.

In [44]:
from scipy.spatial import distance
distance.cosine(w2v_model.wv.get_vector("automobile"), w2v_model.wv.get_vector("car"))

0.9186250045895576

We compute the cosine distance between fruit and orange.

In [45]:
from scipy.spatial import distance
distance.cosine(w2v_model.wv.get_vector("fruit"), w2v_model.wv.get_vector("orange"))

0.5883492529392242

# Example: pretrained GloVe embedding
In this case, the vectors have already been trained. We only need to extract the vector representations from the published file.

In [46]:
url = "http://nlp.stanford.edu/data/glove.42B.300d.zip"

import requests, io, zipfile
filename = "glove.42B.300d.txt"
# The archive is large (more than a gigabyte) and is held entirely in memory here
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))

In [47]:
import numpy as np
embeddings_index = dict()
# Each line of the file holds a word followed by its 300 coefficients
with z.open('glove.42B.300d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0].decode("utf-8")
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
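
The GloVe text format is simple: one word per line, followed by its coefficients separated by spaces. The sketch below parses two made-up 3-dimensional lines from an in-memory buffer, so it runs without downloading anything. The sample values and the helper name `load_vectors` are assumptions for illustration.

```python
import io

# Two fake lines in the same layout as the GloVe text file
sample = io.BytesIO(b"car 0.5 -0.25 1.0\nautomobile 0.25 0.0 0.75\n")

def load_vectors(fileobj):
    """Parse 'word c1 c2 ...' lines from a binary file object into a dict."""
    index = {}
    for raw_line in fileobj:
        parts = raw_line.decode("utf-8").split()
        index[parts[0]] = [float(x) for x in parts[1:]]
    return index

toy_index = load_vectors(sample)
print(toy_index["car"])
print(len(toy_index["automobile"]))
```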

We display the first 20 components of the vector for the word car, and the vector's shape:

In [49]:
print(embeddings_index['car'][0:20])
print(str(embeddings_index['car'].shape))

[ 0.59128   -0.38927   -0.16089    0.043683  -0.43888    0.11397
 -2.9075     0.13149   -0.30903   -0.57064   -0.72339   -0.44372
 -0.12936   -0.32073    0.50047    0.47942   -0.43085    0.0043741
 -0.24877    0.35756  ]
(300,)


We compute the cosine distance between the car vector and the automobile vector:

In [50]:
#embeddings_index['automobile']
from scipy.spatial import distance
distance.cosine(embeddings_index['car'], embeddings_index['automobile'])

0.26692473888397217
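
To compare the three approaches on the same word pair, the cosine distances between car and automobile reported in this notebook can be converted to cosine similarities (1 minus the distance) and ranked. The numbers are copied from the outputs above; the dictionary labels are ours.

```python
# Cosine distances between "car" and "automobile" copied from the cells above
cosine_distances = {
    "tf.keras Embedding layer": 1.781082808971405,
    "Word2Vec (gensim)": 0.9186250045895576,
    "GloVe 42B 300d": 0.26692473888397217,
}

similarities = {name: 1.0 - d for name, d in cosine_distances.items()}
ranking = sorted(similarities.items(), key=lambda item: item[1], reverse=True)

for name, sim in ranking:
    print(f"{name}: {sim:+.3f}")
```

On this pair, the pretrained GloVe vectors place the two synonyms closest together, while the embedding learned as a by-product of classification places them furthest apart.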