# **Word Embeddings**

This is a supervised learning project.

1) The first part contains a simple implementation of a neural network (NN) with an embedding layer.

2) The second part is a **word to vector** implementation.

## **First Part**

In [None]:
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding
from tensorflow import keras 

# Create a list of 10 sample reviews: the first 5 are positive, the last 5 are negative
reviews = ['nice food',
        'amazing restaurant',
        'too good',
        'just loved it!',
        'will go again',
        'horrible food',
        'never go there',
        'poor service',
        'poor quality',
        'needs improvement']

sentiment = np.array([1,1,1,1,1,0,0,0,0,0])

In [None]:
vocab_size = 30 
# Convert each review into a list of integers. Despite its name, one_hot hashes every word to an index below vocab_size
encoded_reviews = [one_hot(d, vocab_size) for d in reviews]
encoded_reviews
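
Because `one_hot` assigns indices by hashing, two different words can land on the same index when `vocab_size` is small. The sketch below illustrates the idea with a stand-in helper, `hash_encode`, which is hypothetical and not the Keras implementation: it uses `hashlib` and a plain whitespace split, so it does not strip punctuation the way Keras does.

```python
import hashlib

def hash_encode(text, vocab_size):
    """Map each word of `text` to an integer in 1..vocab_size-1 via a stable hash."""
    encoded = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is kept free so it can be used for padding later on
        encoded.append(int(digest, 16) % (vocab_size - 1) + 1)
    return encoded

a = hash_encode("nice food", 30)
b = hash_encode("horrible food", 30)
print(a, b)  # the index for "food" is the same in both lists
```

The same word always gets the same index, which is all the embedding layer needs; collisions only cost some accuracy.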

In [None]:
# We need to pad the reviews so that every sample has the same length
max_length = 4
padded_reviews = pad_sequences(encoded_reviews, maxlen=max_length, padding='post')
print(padded_reviews)
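
To make the effect of `padding='post'` concrete, here is a rough pure-Python sketch of the same operation. The helper name `pad_post` is an assumption for illustration; it appends zeros at the end and, like the Keras default `truncating='pre'`, drops values from the front of sequences that are too long.

```python
def pad_post(sequences, max_length, value=0):
    """Pad every sequence at the end with `value` until it has `max_length` items."""
    padded = []
    for seq in sequences:
        # Keep only the last max_length items if the sequence is too long
        seq = list(seq)[-max_length:]
        padded.append(seq + [value] * (max_length - len(seq)))
    return padded

print(pad_post([[12, 7], [3, 21, 9]], 4))  # [[12, 7, 0, 0], [3, 21, 9, 0]]
```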

In [None]:
# Set the size of each embedding vector
embedding_vector_size = 5

# Create the model
model = Sequential()
model.add(Embedding(vocab_size, embedding_vector_size, input_length=max_length, name="embedding"))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
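
Conceptually, the embedding layer is a lookup table with one trainable row per word index, and `Flatten` concatenates the rows of a review into one feature vector for the dense layer. The sketch below mimics only the shapes involved; the random table and the example indices are made up and stand in for weights that Keras would learn.

```python
import random

vocab_size, embedding_dim, max_length = 30, 5, 4

random.seed(0)
# Hypothetical embedding table: one row of embedding_dim values per word index
embedding_table = [[random.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

def embed_and_flatten(padded_review):
    """Look up one row per index, then concatenate the rows into a flat list."""
    rows = [embedding_table[index] for index in padded_review]
    return [value for row in rows for value in row]

features = embed_and_flatten([12, 7, 0, 0])  # made-up padded review
print(len(features))  # 20 = max_length * embedding_dim

# Trainable parameters: the embedding table plus the dense layer's weights and bias
n_params = vocab_size * embedding_dim + (max_length * embedding_dim + 1)
print(n_params)  # 171
```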

In [None]:
X = padded_reviews
y = sentiment

In [None]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

In [None]:
model.fit(X,y, epochs=50, verbose = 0)

In [None]:
loss, accuracy = model.evaluate(X, y)
print(f'loss = {loss} and Accuracy = {accuracy}')

In [None]:
# Training the NN also gives us the embeddings as a by-product:
# they are the weights of the embedding layer

# Weights: one row per word index
weights = model.get_layer('embedding').get_weights()[0]
len(weights)

In [None]:
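# The row indices (2 and 8) are the encoded values of "nice" and "good" in this run;
# they may differ between runs, so check encoded_reviews for yours
print(f'Weights for nice: {weights[2]}')
print(f'Weights for good: {weights[8]}')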

In general, with bigger datasets we expect similar words (like "nice" and "good") to end up with similar weight vectors.
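
One common way to quantify "similar" is the cosine similarity between two vectors. A minimal sketch, using made-up three-dimensional vectors rather than weights from the model above:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: 1 means same direction, -1 opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen by hand for illustration only
nice = [0.9, 0.1, 0.3]
good = [0.8, 0.2, 0.35]
poor = [-0.7, 0.4, -0.2]

print(cosine_similarity(nice, good))  # close to 1
print(cosine_similarity(nice, poor))  # negative
```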

Another approach is to train a model once and later load its trained embeddings, either to train a different model or to make predictions.

## **Second Part: Word to Vector**

Embeddings are not handcrafted. Instead, they are learnt during NN training.

How it works:
1) Take a fake problem: fill in a missing word in a sentence
2) Solve it using a NN
3) You get the word embeddings as a side effect

There are two ways to set up this fake problem. In both cases the real aim is to extract the word embeddings:

1) CBOW (Continuous Bag of Words): given the context words, predict the target word
2) Skip-gram: given the target word, predict the context words
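
The difference between the two setups is easiest to see in the training examples they produce. The sketch below builds both kinds of examples from one tokenized sentence; the function names and the window size of 1 are illustrative choices, not part of gensim.

```python
def context_windows(tokens, window):
    """Yield (target, context) where context holds up to `window` words on each side."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right

def cbow_examples(tokens, window):
    """CBOW: the input is the list of context words, the label is the target word."""
    return [(context, target) for target, context in context_windows(tokens, window)]

def skipgram_examples(tokens, window):
    """Skip-gram: the input is the target word, the label is one context word."""
    return [(target, word)
            for target, context in context_windows(tokens, window)
            for word in context]

tokens = ["they", "look", "good", "and", "stick", "good"]
print(cbow_examples(tokens, 1)[:2])
# [(['look'], 'they'), (['they', 'good'], 'look')]
print(skipgram_examples(tokens, 1)[:3])
# [('they', 'look'), ('look', 'they'), ('look', 'good')]
```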

With all of the above, our goal is to transform a word into a vector and to be able to do arithmetic on these vectors.
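
As a toy illustration of what "arithmetic on vectors" means, the sketch below uses hand-made two-dimensional vectors. The words and values are invented for this example (they do not come from a trained model) and are arranged so that the first dimension loosely encodes "royalty" and the second "maleness".

```python
# Hypothetical 2-D embeddings: [royalty, maleness]
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "phone": [-0.5, 0.2],
}

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_word(query, exclude):
    """Return the word whose vector is closest to `query`, skipping `exclude`."""
    candidates = [w for w in vectors if w not in exclude]
    return min(candidates, key=lambda w: squared_distance(vectors[w], query))

# king - man + woman
query = [k - m + w for k, m, w in
         zip(vectors["king"], vectors["man"], vectors["woman"])]
nearest = nearest_word(query, exclude={"king", "man", "woman"})
print(nearest)  # queen
```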

In [None]:
import gensim
import pandas as pd

## Reading and Exploring the Dataset
The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.

Link to the Dataset: http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz

In [None]:
# Before running this cell, unzip "Cell_Phones_and_Accessories_5.json.gz"
df = pd.read_json("Cell_Phones_and_Accessories_5.json", lines=True) # lines=True means the file holds one JSON object per line

In [24]:
print(df.shape)
df.reviewText[0]

(194439, 9)


"They look good and stick good! I just don't like the rounded shape because I was always bumping it and Siri kept popping up and it was irritating. I just won't buy a product like this again"

We are going to train the word-to-vector model using only the review text column (`reviewText`) of the dataset.

### Preprocessing the data

In [26]:
# The gensim library has easy-to-use utilities compared with tensorflow
# For example, the simple_preprocess function tokenizes a sentence
print(gensim.utils.simple_preprocess(df.reviewText[0]))
# Everything is now lowercase and punctuation such as "!" is removed; very short tokens such as "I" are dropped too.
# One side effect: "don't" becomes "don"
review_text = df.reviewText.apply(gensim.utils.simple_preprocess) # create a new pandas Series of token lists

In [34]:
model = gensim.models.Word2Vec(
    window = 10, # Maximum distance between the target word and a context word within a sentence
    min_count = 2, # Ignore words that appear fewer than 2 times in the whole corpus
    workers = 4, # Number of worker threads to use for training
)

In [35]:
# Build the vocabulary: the list of unique words in the corpus
model.build_vocab(review_text, progress_per = 1000)

In [36]:
# By default, epochs is set to 5
# train() returns a tuple: (words actually used for training, total raw words processed)
print(f' Total examples for training = {model.corpus_count}')
model.train(review_text, total_examples = model.corpus_count, epochs= 5)

 Total examples for training = 194439


(61507050, 83868975)

In [38]:
model.wv.most_similar("bad") # Find the words that are similar to bad

[('terrible', 0.6868460178375244),
 ('shabby', 0.6632394790649414),
 ('horrible', 0.6291980743408203),
 ('good', 0.5971243977546692),
 ('funny', 0.5388672947883606),
 ('okay', 0.538292646408081),
 ('awful', 0.533338189125061),
 ('legit', 0.5261659622192383),
 ('ok', 0.5207666754722595),
 ('cheap', 0.5106683969497681)]

In [40]:
model.wv.similarity(w1 = "cheap", w2 = "inexpensive") # Cosine similarity between two words

0.51925254

In [41]:
model.wv.similarity(w1 = "great", w2 = "great") # A word compared with itself has similarity 1.0

1.0