
# Example : Word vectorization with tf.keras embedding layer
We load the dataset used to learn the word embeddings. We lowercase the text and remove digits and punctuation. Background reading:
https://orbifold.net/default/embedding-and-tokenizer-in-keras/<br>
https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/

In [2]:
import pandas as pd
url = "https://raw.githubusercontent.com/fabnancyuhp/DEEP-LEARNING/main/DATA/text_for_embedding.parquet.brotli"
text_for_embedding = pd.read_parquet(url)

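import re
def preprocess_text(x):
    # Lowercase first: the URL and e-mail patterns below only match lowercase text.
    new_text = x.lower()
    # Remove URLs and e-mail addresses before punctuation; once ':', '/', '.' and '@'
    # are stripped, these two patterns can no longer match anything.
    new_text = re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', new_text)
    new_text = re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', '', new_text)
    # Remove punctuation, then digits.
    new_text = re.sub(r'[^\w\s]', '', new_text)
    new_text = re.sub(r'[0-9]', '', new_text)
    return new_text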

text_for_embedding['text'] = text_for_embedding['text'].apply(preprocess_text)
text_for_embedding.head(3)

Unnamed: 0,class,text
12775,1,common sense is prevailing in brexit negotiati...
930,1,paul manafort the indicted former campaign man...
4467,1,us representative mark walker said after a mee...


We use the Tokenizer class from tensorflow.keras.preprocessing.text to transform each text in text_for_embedding['text'] into a sequence of integers.

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer

X = [text.split() for text in  list(text_for_embedding['text'])]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X)
X_seq = tokenizer.texts_to_sequences(X)
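
To make the word-to-integer mapping concrete, here is a plain-Python sketch of the idea on a toy corpus. It is not the Keras implementation: the helper names and the toy texts are hypothetical, and it only assumes that words are ranked by descending frequency, that ids start at 1, and that unknown words are dropped.

```python
from collections import Counter

def build_word_index(tokenized_texts):
    """Rank words by descending frequency; ids start at 1 (0 is left for padding)."""
    counts = Counter(word for tokens in tokenized_texts for word in tokens)
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(tokenized_texts, word_index):
    """Replace each known word by its integer id; unknown words are dropped."""
    return [[word_index[w] for w in tokens if w in word_index]
            for tokens in tokenized_texts]

# Hypothetical toy corpus, already split into words.
toy_texts = [["the", "car", "is", "red"],
             ["the", "car", "is", "fast"],
             ["the", "road"]]
toy_index = build_word_index(toy_texts)
toy_seq = texts_to_sequences(toy_texts, toy_index)
print(toy_index)
print(toy_seq)
```

The most frequent word, "the", gets id 1, and every text becomes a list of ids of the same length as the original text.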

In [None]:
vocab_size = len(tokenizer.word_index)+1

from tensorflow.keras.preprocessing.text import one_hot
#encoded_review = [one_hot(d,vocab_size) for d in list(text_for_embedding['text'])]

#from tensorflow.keras.preprocessing.sequence import pad_sequences
#X_pad = pad_sequences(encoded_review,maxlen=1000,padding='post')

The Tokenizer object builds a word index, a dictionary that maps each word to an integer. We look up the integers associated with the automakers gm and peugeot, and with the word handsome.

In [4]:
print(tokenizer.word_index['gm'])
print(tokenizer.word_index['peugeot'])
print(tokenizer.word_index['handsome'])
#one_hot('gm peugeot gm',vocab_size)

8807
43318
21452


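We bring every sequence to exactly 1000 integers with the pad_sequences function. With padding='post', shorter sequences are filled with zeros at the end; longer sequences are truncated at the start, which is the function's default.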

In [5]:
vocab_size = len(tokenizer.word_index)+1
from tensorflow.keras.preprocessing.sequence import pad_sequences
X_pad = pad_sequences(X_seq,maxlen=1000,padding='post')
X_pad[0][0:7]

array([ 1163,  1283,    11, 12303,     6,  1082,  1193], dtype=int32)
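
The padding step can be sketched in plain Python as follows. This is only an illustration of padding='post' combined with truncation at the start; the helper name pad_or_truncate and the toy sequences are hypothetical.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Return a list of exactly maxlen ids (maxlen is assumed to be positive)."""
    seq = list(seq)[-maxlen:]                        # too long: keep the last maxlen ids
    return seq + [pad_value] * (maxlen - len(seq))   # too short: append zeros

# Hypothetical toy sequences, one shorter and one longer than maxlen.
toy_padded = [pad_or_truncate(s, maxlen=4) for s in [[5, 3], [7, 1, 9, 2, 8, 4]]]
print(toy_padded)
```

The short sequence becomes [5, 3, 0, 0] and the long one is cut down to its last four ids, [9, 2, 8, 4].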

We build a model that predicts class from the text. Once the neural network is trained, we obtain the word embeddings as a by-product: the classification task is only a pretext, and what we actually care about is the embeddings. The network contains an Embedding layer named "embedding", whose weights we read back after training.

In [6]:
vocab_size = len(tokenizer.word_index)+1
embeded_vector_size = 15
max_length = 1000

from tensorflow.keras.layers import Dense, Embedding, Flatten
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Embedding(vocab_size,embeded_vector_size,input_length=max_length,name="embedding"))
model.add(Flatten())
model.add(Dense(1,activation='sigmoid'))

model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['binary_crossentropy'])
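
Conceptually, an embedding layer is a trainable lookup table with one row per integer id. The NumPy sketch below illustrates the shapes involved, using hypothetical toy sizes and a random matrix in place of trained weights; it is not how Keras implements the layer.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_vocab_size, toy_dim, toy_max_length = 6, 3, 4   # hypothetical toy sizes

# One row per integer id; row 0 belongs to the padding id.
toy_embedding_matrix = rng.normal(size=(toy_vocab_size, toy_dim))

# Two padded sequences of length toy_max_length.
toy_batch = np.array([[1, 2, 5, 0],
                      [3, 4, 0, 0]])

embedded = toy_embedding_matrix[toy_batch]          # lookup: (batch, max_length, dim)
flattened = embedded.reshape(len(toy_batch), -1)    # what a Flatten step passes on
print(embedded.shape, flattened.shape)
```

With the sizes used in the model above, each text would become a 1000 x 15 array, flattened into 15000 values before the Dense layer.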

In [None]:
Y = text_for_embedding['class'].values
model.fit(X_pad,Y,epochs=15)

We get the Embedding matrix produced by our model:

In [None]:
len(tokenizer.word_index)+1
Embedding_matrix = model.get_layer('embedding').get_weights()[0]

Embedding_matrix.shape,len(tokenizer.word_index)+1

Each row of the embedding matrix corresponds to a word in tokenizer.word_index: the row number is the word's integer index, and row 0 is left for the padding value. Below, we build a dictionary that maps each word of the corpus to its vector.

In [None]:
dict_embeding = {word: Embedding_matrix[idx] for word, idx in tokenizer.word_index.items()}

In [None]:
dict_embeding['car']

In [None]:
dict_embeding['automobile']

In [None]:
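from scipy.spatial import distance
# Cosine distance between the learned vectors of "car" and "automobile".
distance.cosine(dict_embeding['car'], dict_embeding['automobile'])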
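
For reference, the cosine distance between two vectors u and v is 1 - (u . v) / (|u| |v|). A minimal NumPy sketch of this formula, with a hypothetical helper name and toy vectors:

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine of the angle between u and v."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance([1, 0], [2, 0]))    # same direction
print(cosine_distance([1, 0], [0, 3]))    # orthogonal
print(cosine_distance([1, 0], [-1, 0]))   # opposite direction
```

The distance is 0 for vectors pointing in the same direction, 1 for orthogonal vectors and 2 for opposite vectors, so a small value suggests that two words are used in similar ways.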

# Example : Word2Vec with gensim
We use the Corpus from the previous example.

In [None]:
Corpus = text_for_embedding['text']

Word2vec is an unsupervised learning algorithm. Unlike the previous example, it does not need the class labels to learn the word vectors: its only input is the corpus. We convert each word into a 100-dimensional vector.

In [None]:
import gensim

X = [d.split() for d in text_for_embedding['text'].tolist()]
DIM = 100

# gensim >= 4.0 names this parameter vector_size; gensim 3.x called it size.
w2v_model = gensim.models.Word2Vec(sentences=X, vector_size=DIM, window=10, min_count=1)
#w2v_model = gensim.models.Word2Vec(sentences = X,size=DIM,window=10,min_count=1)
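
The window parameter controls which neighbouring words count as context for a given word. The sketch below lists the (center, context) pairs defined by a fixed window on a toy sentence. It is a simplification for illustration only: the helper name and the sentence are hypothetical, and gensim's actual training procedure differs in its details.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs for all tokens at most `window` positions apart."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

# Hypothetical toy sentence, already split into words.
toy_sentence = ["the", "car", "drives", "fast"]
toy_pairs = list(context_pairs(toy_sentence, window=1))
print(toy_pairs)
```

Words that keep appearing in each other's windows across the corpus end up with nearby vectors, which is why no class label is needed.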

The vocabulary size is given by len(w2v_model.wv). The vector representation of the word car is given by w2v_model.wv.get_vector("car"); passing norm=True returns the same vector scaled to unit length.

In [None]:
#print("Vocabulary size: "+str(len(w2v_model.wv)))
print("The vector representation of the word car: ")
w2v_model.wv.get_vector("car")[0:10] 

The words most similar to france are returned by w2v_model.wv.most_similar("france"); we display the top three.

In [None]:
w2v_model.wv.most_similar("france")[0:3]
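
A similarity query of this kind can be sketched as a ranking by cosine similarity. The vectors below are made-up toy values, not trained embeddings, and the helper is a hypothetical illustration rather than gensim's implementation.

```python
import numpy as np

# Hypothetical 3-dimensional toy vectors.
toy_vectors = {
    "car": np.array([0.9, 0.1, 0.0]),
    "automobile": np.array([0.8, 0.2, 0.1]),
    "orange": np.array([0.0, 0.9, 0.4]),
    "fruit": np.array([0.1, 0.8, 0.5]),
}

def most_similar(word, vectors, topn=3):
    """Rank the other words by cosine similarity to `word`, highest first."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = {
        other: float(np.dot(query, vec / np.linalg.norm(vec)))
        for other, vec in vectors.items() if other != word
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

toy_neighbors = most_similar("car", toy_vectors, topn=2)
print(toy_neighbors)
```

With these toy values, "automobile" ranks first for "car" because the two vectors point in almost the same direction.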

In [None]:
w2v_model.wv.get_vector("automobile")[0:10] 

We compute the cosine distance between car and automobile.

In [None]:
from scipy.spatial import distance
distance.cosine(w2v_model.wv.get_vector("automobile"), w2v_model.wv.get_vector("car"))

We compute the cosine distance between fruit and orange.

In [None]:
from scipy.spatial import distance
distance.cosine(w2v_model.wv.get_vector("fruit"), w2v_model.wv.get_vector("orange"))

# Example : pretrained GloVe embedding

In [None]:
url = "http://nlp.stanford.edu/data/glove.42B.300d.zip"

import requests, io, zipfile
filename = "glove.42B.300d.txt"
# The archive is large: it is downloaded entirely into memory before being opened.
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
z = zipfile.ZipFile(io.BytesIO(r.content))

In [None]:
import numpy as np
embeddings_index = dict()
# Each line holds a word followed by its coefficients, separated by spaces.
with z.open('glove.42B.300d.txt') as f:
    for line in f:
        values = line.split()
        word = values[0].decode("utf-8")
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
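
The parsing loop can be tried without downloading the archive. The sketch below applies the same logic to two made-up lines in the same "word followed by coefficients" layout; the sample values are hypothetical and use 3 coefficients instead of 300.

```python
import io
import numpy as np

# Hypothetical in-memory sample in the same text layout, read as bytes like a zip member.
sample = io.BytesIO(b"car 0.1 0.2 0.3\nfruit -0.4 0.5 0.6\n")

toy_glove = {}
for raw_line in sample:
    parts = raw_line.decode("utf-8").split()
    # First field is the word, the remaining fields are its coefficients.
    toy_glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

print(toy_glove["car"])
```

Each word maps to a float32 array, which is the structure that embeddings_index holds once the full file has been read.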

In [7]:
embeddings_index.keys()
