# **Word Embedding**
- A word embedding is a class of approaches for representing words and documents using a dense vector representation.
- It is an improvement over the more traditional bag-of-words encoding schemes, in which large sparse vectors were used to represent each word, or to score each word within a vector that represents an entire vocabulary.
- These representations were sparse because the vocabularies were vast and a given word or document would be represented by a large vector comprised mostly of zero values. Instead, in an embedding, words are represented by dense vectors where a vector represents the projection of the word into a continuous vector space.
- The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used. The position of a word in the learned vector space is referred to as its embedding. Two popular examples of methods of learning word embeddings from text include:
    - **Word2Vec.**
    - **GloVe.**
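
To make the contrast between sparse and dense representations concrete, here is a minimal sketch in plain Python. The five-word vocabulary and the 3-dimensional vectors are invented for illustration; in a real embedding the dense values are learned from text rather than written by hand.

```python
# Toy vocabulary (hypothetical); real vocabularies hold many thousands of words.
vocab = ["food", "good", "nice", "poor", "service"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_vector(word):
    """Sparse representation: one slot per vocabulary word, a single 1."""
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

# Dense representation: each word maps to a short real-valued vector.
# These numbers are invented; a trained model would learn them.
embedding_table = [
    [0.2, -0.1, 0.7],   # food
    [0.9, 0.3, -0.2],   # good
    [0.8, 0.4, -0.1],   # nice
    [-0.7, 0.1, 0.3],   # poor
    [0.1, -0.3, 0.6],   # service
]

def dense_vector(word):
    """Dense representation: look up the word's row in the table."""
    return embedding_table[word_to_index[word]]

sparse = one_hot_vector("nice")
dense = dense_vector("nice")
print(sparse)  # length grows with the vocabulary
print(dense)   # length is fixed by the chosen embedding size
```

The sparse vector needs one position per vocabulary word, while the dense vector keeps the same small size however large the vocabulary becomes.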


**In addition to these carefully designed methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but tailors the model to a specific training dataset.**

# **Keras Embedding Layer**
- Keras offers an Embedding layer that can be used for neural networks on text data. It requires that the input data be integer encoded, so that each word is represented by a unique integer.
- The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.


In [None]:
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Embedding

In [None]:
reviews = ['nice food', 
           'amazing restaurant', 
           'too good', 
           'just loved it!',
           'will go again', 
           'horrible food', 
           'never go there',
           'poor service',
           'poor quality',
           'needs improvement']

sentiment = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

- **Keras provides the one_hot() function, which hashes each word to an integer as an efficient encoding. Despite its name, it returns a list of integers, not one-hot vectors.**
- **We will set the vocabulary size to 50, which is much larger than needed, to reduce the probability of collisions from the hash function.**
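
The idea behind hashing words into a fixed number of integer buckets can be sketched as follows. This is an illustration of the general hashing trick, not Keras's exact implementation: the helper names `hash_encode` and `prob_no_collision` are hypothetical, MD5 is chosen only so the result is repeatable, and punctuation is not stripped. The collision estimate assumes the hash spreads words uniformly.

```python
import hashlib

def hash_encode(text, n):
    """Map each word to an integer in [1, n - 1] via a stable hash.

    Index 0 is left unused so it can later serve as a padding value.
    """
    words = text.lower().split()
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in words]

def prob_no_collision(num_words, num_buckets):
    """Chance that num_words distinct words all land in distinct buckets,
    assuming a uniform hash (a birthday-problem estimate)."""
    p = 1.0
    for i in range(num_words):
        p *= max(0.0, 1.0 - i / num_buckets)
    return p

print(hash_encode("nice food", 50))
print(hash_encode("horrible food", 50))  # "food" gets the same integer both times

# Compare bucket counts for 20 distinct words (an illustrative corpus size).
for buckets in (25, 49, 500):
    print(buckets, round(prob_no_collision(20, buckets), 3))
```

Raising the bucket count lowers the collision risk but does not remove it, so two different words can still end up sharing an integer.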

In [None]:
# Apply on a sample
one_hot('amazing restaurant', 50)

In [None]:
vocab_size = 50
encoded_docs = [one_hot(r, vocab_size) for r in reviews]
encoded_docs

- **The sequences have different lengths, but Keras expects inputs to be vectorized and to all have the same length.**
- **We will pad all input sequences to a length of 3 words.**
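
As a rough picture of what post-padding does, the sketch below right-pads integer sequences with zeros in plain Python. The helper `pad_post` and the sample encodings are made up for illustration; how over-long sequences are cut is a simplification of this sketch and does not model the truncation options of Keras's `pad_sequences`.

```python
def pad_post(sequence, maxlen, value=0):
    """Right-pad a list of integers with `value` up to `maxlen`.

    Longer sequences are cut to their first `maxlen` items here, which is a
    simplification chosen for this sketch.
    """
    trimmed = sequence[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))

encoded = [[12, 7], [31, 4, 18], [9]]  # made-up integer encodings
padded = [pad_post(seq, maxlen=3) for seq in encoded]
print(padded)  # [[12, 7, 0], [31, 4, 18], [9, 0, 0]]
```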

In [None]:
# pad documents to a max length of 3 words
max_length = 3
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
padded_docs

- **We are now ready to define our Embedding layer as part of our neural network model.**
- **The Embedding has a vocabulary of 50 and an input length of 3. We will choose a small embedding space of 8 dimensions (features).**
- **The model is a simple binary classification model. Importantly, the output from the Embedding layer will be 3 vectors of 8 dimensions each, one for each word. We flatten this into a single 24-element vector to pass on to the Dense output layer.**
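
The shapes described above can be traced with a small, untrained forward pass in plain Python. This is only a sketch of the arithmetic (lookup, flatten, sigmoid), not how Keras implements it; the random initial values and the sample document `[12, 7, 0]` are arbitrary assumptions.

```python
import math
import random

random.seed(0)

vocab_size, embed_dim, seq_len = 50, 8, 3

# Embedding weights: one row of embed_dim random values per vocabulary index.
embedding = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
             for _ in range(vocab_size)]

# Dense output layer: one weight per flattened input value, plus a bias.
dense_weights = [random.uniform(-0.5, 0.5) for _ in range(seq_len * embed_dim)]
dense_bias = 0.0

def forward(padded_doc):
    """Untrained forward pass: lookup -> flatten -> sigmoid(w . x + b)."""
    vectors = [embedding[idx] for idx in padded_doc]      # 3 vectors of 8 values
    flat = [value for vec in vectors for value in vec]    # 24 values
    z = sum(w * x for w, x in zip(dense_weights, flat)) + dense_bias
    return vectors, flat, 1.0 / (1.0 + math.exp(-z))

vectors, flat, prob = forward([12, 7, 0])  # made-up padded document
print(len(vectors), len(vectors[0]), len(flat))  # 3 8 24

# Trainable parameters: embedding table + dense weights + dense bias.
n_params = vocab_size * embed_dim + seq_len * embed_dim + 1
print(n_params)  # 425
```

The parameter count follows directly from the layer sizes: 50 × 8 = 400 values in the embedding table, plus 24 weights and 1 bias in the output layer.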


In [None]:
embed_vector_size = 8

# define the model
model = Sequential()
model.add(Embedding(vocab_size, embed_vector_size, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [None]:
# summarize the model
model.summary()

In [None]:
# fit the model
model.fit(padded_docs, sentiment, epochs=50, verbose=1)

In [None]:
# evaluate the model (on the training data, so this is not a held-out score)
loss, accuracy = model.evaluate(padded_docs, sentiment, verbose=0)
print('Accuracy: %0.2f%%' % (accuracy*100))

-----------------------------------------

# **Word2vec**
- **We will use the NLP library gensim.**

In [None]:
import gensim
import pandas as pd

**The dataset we are using here is a subset of Amazon reviews from the Cell Phones & Accessories category. The data is stored as a JSON file and can be read using pandas.**

In [None]:
df = pd.read_json("/kaggle/input/amazon-reviews/Cell_Phones_and_Accessories_5.json", lines=True)
df

In [None]:
df.shape

## **Simple Preprocessing & Tokenization**
- The first thing to do for any data science task is to clean the data.
- For NLP, we apply preprocessing steps such as converting all words to lower case, trimming spaces, and removing punctuation. We will do the same here.

- Additionally, we can also remove stop words like 'and', 'or', 'is', 'the', 'a', 'an' and convert words to their root forms like 'running' to 'run'.
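
The cleaning steps listed above can be sketched in a few lines of plain Python. This is not gensim's implementation: the function `clean_text` and the tiny stop-word list are hypothetical, the regular expression keeps ASCII letters only, and reducing words to their root forms is left out.

```python
import re

# A tiny, hand-picked stop-word list for illustration only.
STOP_WORDS = {"and", "or", "is", "the", "a", "an"}

def clean_text(text, remove_stop_words=True):
    """Lower-case, drop punctuation and digits, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)   # keep only letters and whitespace
    tokens = text.split()                    # split() also trims extra spaces
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(clean_text("  The battery is GREAT, and the case fits!  "))
# ['battery', 'great', 'case', 'fits']
```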

In [None]:
# Use gensim's simple_preprocess utility to lowercase and tokenize each review
review_text = df['reviewText'].apply(gensim.utils.simple_preprocess)

In [None]:
review_text

In [None]:
# Preprocessed text
review_text.loc[0]

In [None]:
# Original Text
df.reviewText.loc[0]

## **Training the Word2Vec Model**
- **Train the model on the reviews. Use a window of size 10, i.e. up to 10 words before the current word and up to 10 words after it. Words that occur fewer than 2 times in the whole corpus are ignored; this is configured with the *min_count* parameter.**
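
To illustrate what the window and the minimum count control, here is a conceptual sketch in plain Python. The helpers `build_vocab` and `context_pairs` and the three toy sentences are invented for this example; gensim's actual training differs in many details, so treat this only as a picture of which words are kept and which neighbours count as context.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only words whose total corpus frequency is at least min_count."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}

def context_pairs(sentence, window):
    """Yield (center, context) pairs for every word within `window` positions."""
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                yield center, sentence[j]

sentences = [["great", "phone", "case"], ["great", "battery", "life"],
             ["phone", "battery", "died"]]  # made-up tokenized reviews

vocab = build_vocab(sentences, min_count=2)
print(sorted(vocab))  # ['battery', 'great', 'phone']

filtered = [[w for w in s if w in vocab] for s in sentences]
print(list(context_pairs(filtered[0], window=1)))
# [('great', 'phone'), ('phone', 'great')]
```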

### **Create Word2Vec model**

In [None]:
model = gensim.models.Word2Vec(window=10, min_count=2, workers=4)

### **Build Vocabulary**

In [None]:
model.build_vocab(review_text, progress_per=1000)

### **Train Model** 

In [None]:
model.train(review_text, total_examples=model.corpus_count, epochs=model.epochs)

### **Find similar words**

In [None]:
model.wv.most_similar("bad")

In [None]:
# Get similarity between two words
model.wv.similarity(w1="great", w2="good")
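
The similarity score returned here is the cosine similarity between the two word vectors. A minimal sketch of that formula in plain Python follows; the 3-dimensional vectors are invented stand-ins for learned word vectors, not values from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented 3-dimensional vectors standing in for learned word vectors.
great = [0.9, 0.3, -0.2]
good = [0.8, 0.4, -0.1]
bad = [-0.7, 0.1, 0.3]

print(cosine_similarity(great, good))  # close to 1: similar direction
print(cosine_similarity(great, bad))   # negative: roughly opposite direction
```

Because the score depends only on the angle between vectors, it ranges from -1 to 1 regardless of vector length.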

-------------------------------------------------

### Further Reading

You can read more about gensim's Word2Vec implementation at https://radimrehurek.com/gensim/models/word2vec.html