# **Embeddings**

An embedding is a relatively low-dimensional space into which we can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.
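To make "close together in the embedding space" concrete, the sketch below compares three hand-picked vectors using cosine similarity. The vectors and the `cosine_similarity` helper are hypothetical and exist only for illustration; real embeddings are learned from data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
toy_embeddings = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.8, 0.2, 0.1],
    "awful": [-0.7, 0.1, 0.3],
}

sim_close = cosine_similarity(toy_embeddings["good"], toy_embeddings["great"])
sim_far = cosine_similarity(toy_embeddings["good"], toy_embeddings["awful"])
print(f"good vs great: {sim_close:.3f}")
print(f"good vs awful: {sim_far:.3f}")
```

A similarity near 1 means the two vectors point in nearly the same direction, while a negative value means they point in roughly opposite directions.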

# Embedding Layers in Keras

[**Embedding Layers**](https://keras.io/layers/embeddings/) are a handy feature of Keras that lets the network automatically replace each integer index in its input with a longer vector. In the previous section, we saw that Word2vec could expand words to 300-dimension vectors. An embedding layer allows us to insert these 300-dimension vectors in place of word indexes automatically.

Programmers often use embedding layers with Natural Language Processing (NLP); however, they can be used in any instance where we wish to replace an index value with a longer vector. In some ways, we can think of an embedding layer as dimension expansion. The hope is that these additional dimensions give the model more information and lead to a better score.

### Simple Embedding Layer Example

* `input_dim` = How large is the vocabulary? This parameter is the number of items in our "lookup table."

* `output_dim` = How many numbers are in the vector that we wish to return?

* `input_length` = How many items are in the input feature vector that we need to transform?

Now we create a neural network with a vocabulary size of $10$, which will map each integer from $0$ to $9$ to a vector of $4$ numbers. Each feature vector coming in will hold two such integers. This neural network does nothing more than pass the embedding on to the output, but it does let us see what the embedding is doing.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
import numpy as np

model = Sequential()
embedding_layer = Embedding(input_dim=10, output_dim=4, input_length=2)
model.add(embedding_layer)
model.compile("adam", "mse")

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 2, 4)              40        
                                                                 
Total params: 40
Trainable params: 40
Non-trainable params: 0
_________________________________________________________________


For this neural network, which is just an embedding layer, the input is a vector of size 2. These two inputs are integers from 0 to 9 (corresponding to the requested `input_dim` quantity of 10 values). Looking at the summary above, we see that the embedding layer has 40 parameters. This value comes from the embedding lookup table, which contains four values (`output_dim`) for each of the 10 (`input_dim`) possible integer values. The output is 2 (`input_length`) vectors of length 4 (`output_dim`), for a total of 8 values, which matches the Output Shape of `(None, 2, 4)` given in the summary above.
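The parameter count and output shape can be checked with plain NumPy. This is a minimal sketch, assuming a randomly filled array stands in for the layer's weights; it mimics the lookup and does not call Keras.

```python
import numpy as np

input_dim, output_dim, input_length = 10, 4, 2

# Stand-in for the layer's weights: one row of output_dim numbers per index.
rng = np.random.default_rng(0)
lookup_table = rng.uniform(-0.05, 0.05, size=(input_dim, output_dim))

# One input row holding input_length integer indexes.
batch = np.array([[1, 2]])

# Integer-array indexing replaces every index with its row of the table.
output = lookup_table[batch]

print(lookup_table.size)  # number of values stored in the table
print(output.shape)       # (batch size, input_length, output_dim)
```

The table holds `input_dim * output_dim` values, and each input row of `input_length` indexes turns into `input_length` vectors of `output_dim` numbers.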

Now, let us query the neural network with a single row. The row holds two integer values, as was specified when we created the neural network.

In [None]:
input_data = np.array([[1, 2]])

pred = model.predict(input_data)

print(input_data.shape)
print(pred)

(1, 2)
[[[ 0.00410897 -0.03215675 -0.04756094  0.01548951]
  [ 0.01074568 -0.04540538 -0.02540435  0.04160034]]]


Here we see the two length-4 vectors that Keras looked up, one for each of the input integers. Recall that Python arrays are zero-based. Keras replaced the value of 1 with the second row of the $10 \times 4$ lookup matrix. Similarly, Keras replaced the value of 2 with the third row of the lookup matrix. The following code displays the lookup matrix in its entirety. The embedding layer performs no mathematical operations other than inserting the correct row from the lookup table.

In [None]:
embedding_layer.get_weights()

[array([[-0.02190864, -0.03589261,  0.04029857, -0.02416231],
        [ 0.00410897, -0.03215675, -0.04756094,  0.01548951],
        [ 0.01074568, -0.04540538, -0.02540435,  0.04160034],
        [-0.01743712, -0.0209429 ,  0.04248278, -0.0130057 ],
        [-0.02578279,  0.00065221,  0.03479834, -0.03152712],
        [ 0.01642852, -0.01104081,  0.01013689, -0.04970313],
        [-0.04276519,  0.03464342, -0.00210343, -0.01458585],
        [-0.0394882 , -0.02345122, -0.00465295, -0.01985071],
        [ 0.01859817,  0.02167213, -0.03111919,  0.04176745],
        [-0.0026991 ,  0.02542211,  0.04700674, -0.00861781]],
       dtype=float32)]

The values above are random parameters that Keras generated as starting points. Generally, we will either transfer an embedding or train these random values into something useful. The next section demonstrates how to load a hand-coded embedding into the layer.

### Transferring An Embedding

Now, we see how to hard-code an embedding lookup that performs a simple one-hot encoding. One-hot encoding would transform the input integer values of 0, 1, and 2 to the vectors $[1,0,0]$, $[0,1,0]$, and $[0,0,1]$ respectively. The following code replaces the random lookup values in the embedding layer with this one-hot coding-inspired lookup table.

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
import numpy as np

embedding_lookup = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])

model = Sequential()
embedding_layer = Embedding(input_dim=3, output_dim=3, input_length=2)
model.add(embedding_layer)
model.compile("adam", "mse")

embedding_layer.set_weights([embedding_lookup])

The Embedding layer has the following parameters:

* `input_dim` = 3: There are three different integer categorical values allowed.

* `output_dim` = 3: Per one-hot encoding, three columns represent a categorical value with three possible values.

* `input_length` = 2: The input vector has two of these categorical values.

Now we query the neural network with two categorical values to see the lookup performed.

In [None]:
input_data = np.array([[0, 1]])

pred = model.predict(input_data)

print(input_data.shape)
print(pred)

(1, 2)
[[[1. 0. 0.]
  [0. 1. 0.]]]


The output shows that the layer returned two rows of the one-hot encoding table. This encoding is a correct one-hot encoding for the values 0 and 1, where there are up to 3 unique values possible.
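The connection between a lookup table and one-hot encoding runs both ways: looking up row $i$ of a table gives the same result as multiplying the one-hot vector for $i$ by that table. The NumPy sketch below checks this on a small, hypothetical $3 \times 2$ table; the values are made up for illustration.

```python
import numpy as np

# Hypothetical 3 x 2 lookup table: one row per category.
weights = np.array([[0.1, 0.2],
                    [0.3, 0.4],
                    [0.5, 0.6]])

indexes = np.array([0, 2, 1])

# Route 1: direct row lookup, as an embedding layer does.
looked_up = weights[indexes]

# Route 2: build one-hot rows, then multiply them by the table.
one_hot_rows = np.eye(len(weights))[indexes]
multiplied = one_hot_rows @ weights

print(np.allclose(looked_up, multiplied))
```

When the table is the identity matrix, as in the example above, the lookup simply returns the one-hot vector itself.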

The next section demonstrates how to train this embedding lookup table.

### Training an Embedding

First, we make use of the following imports.

In [None]:
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Embedding, Dense

We create a neural network that classifies restaurant reviews as positive or negative. The reviews start out as strings, such as those given here, which we convert to word indexes before passing them to the network. This code also includes a positive or negative label for each review.

In [None]:
# Define 10 restaurant reviews.
reviews = [
    "Never coming back!",
    "Horrible service",
    "Rude waitress",
    "Cold food.",
    "Horrible food!",
    "Awesome",
    "Awesome service!",
    "Rocks!",
    "poor work",
    "Couldn't have done better",
]

# Define labels (1 = negative, 0 = positive)
labels = array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

We define a vocabulary size of 50 words. Though we do not have 50 words, it is okay to use a value larger than needed. The Keras `one_hot` function hashes each word to an index between 1 and `VOCAB_SIZE - 1`, so two different words can end up sharing an index, and a larger vocabulary size makes such collisions less likely. Note that we use the TensorFlow one-hot encoding method here rather than Scikit-Learn. Scikit-learn would expand these strings to the 0's and 1's as we would typically see for dummy variables. Despite its name, the TensorFlow function translates all of the words to index values and replaces each word with that index. Because the hash can differ between Python sessions, the exact index values may differ from the output shown here.

In [None]:
VOCAB_SIZE = 50
encoded_reviews = [one_hot(d, VOCAB_SIZE) for d in reviews]
print(f"Encoded reviews: {encoded_reviews}")

Encoded reviews: [[5, 30, 25], [21, 42], [21, 38], [43, 26], [21, 26], [8], [8, 42], [12], [2, 7], [3, 11, 36, 48]]
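The idea behind this word-to-index step can be sketched in a few lines of plain Python. This is a simplified stand-in, not the Keras implementation: the `hashed_indexes` helper and its tokenizing regular expression are assumptions, and MD5 is used only so that the result is reproducible, which means the indexes will not match the output above.

```python
import hashlib
import re


def hashed_indexes(text, n):
    """Map each word to an index in [1, n - 1] with a stable hash."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [
        int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
        for word in words
    ]


sample_reviews = ["Horrible service", "Horrible food!", "Awesome service!"]
encoded = [hashed_indexes(review, 50) for review in sample_reviews]
print(encoded)
```

The same word always receives the same index, index 0 is never produced (which leaves it free for padding), and two different words may collide on one index when the vocabulary size is small.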


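The program encodes these reviews as word indexes; however, their lengths are different. We pad each review to 4 indexes. A review longer than four words would be truncated to four. By default, `pad_sequences` removes the extra words from the beginning of the sequence, though none of our reviews is that long.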

In [None]:
MAX_LENGTH = 4

padded_reviews = pad_sequences(encoded_reviews, maxlen=MAX_LENGTH, padding="post")

print(padded_reviews)

[[ 5 30 25  0]
 [21 42  0  0]
 [21 38  0  0]
 [43 26  0  0]
 [21 26  0  0]
 [ 8  0  0  0]
 [ 8 42  0  0]
 [12  0  0  0]
 [ 2  7  0  0]
 [ 3 11 36 48]]


Each review is padded by appending zeros at the end, as specified by the `padding="post"` setting.
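The padding step itself is simple enough to sketch in plain Python. The hypothetical `pad_post` helper below appends zeros at the end and, for sequences that are too long, keeps only the last `maxlen` items, which mirrors what we understand to be the default `truncating="pre"` behavior of `pad_sequences`. It assumes `maxlen` is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end; drop leading items from sequences longer than maxlen."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep at most the last maxlen items
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded


print(pad_post([[5, 30, 25], [8], [3, 11, 36, 48, 9]], maxlen=4))
```

Every row comes back with exactly `maxlen` entries, so the rows can be stacked into a single rectangular array.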

Next, we create a neural network to learn to classify these reviews. 

In [None]:
model = Sequential()
embedding_layer = Embedding(VOCAB_SIZE, 8, input_length=MAX_LENGTH)
model.add(embedding_layer)
model.add(Flatten())
model.add(Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

print(model.summary())

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_2 (Embedding)     (None, 4, 8)              400       
                                                                 
 flatten (Flatten)           (None, 32)                0         
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None


This network accepts four integer inputs that specify the word indexes of a padded restaurant review. The embedding layer converts these four indexes into four vectors of length 8. These vectors come from the lookup table that contains 50 (`VOCAB_SIZE`) rows of vectors of length 8. This is evident from the 400 (8 times 50) parameters in the embedding layer. The size of the output from the embedding layer is 32 (4 words expressed as 8-number embedded vectors). A single output neuron is connected to the flattened embedding output by 33 weights (32 from the embedding output and a single bias). Because this is a binary classification network, we use the sigmoid activation function and `binary_crossentropy`.
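The numbers in the summary can be reproduced with a little arithmetic. The variable names below are chosen for this sketch only.

```python
vocab_size, embedding_dim, max_length = 50, 8, 4

embedding_params = vocab_size * embedding_dim  # one 8-number row per index
flattened_size = max_length * embedding_dim    # 4 vectors of 8 numbers each
dense_params = flattened_size * 1 + 1          # 32 weights plus one bias
total_params = embedding_params + dense_params

print(embedding_params, flattened_size, dense_params, total_params)
```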

The program now trains the neural network. Both the embedding lookup table and the 33 weights of the dense layer are updated to produce a better score.

In [None]:
# Fit the Model.
model.fit(padded_reviews, labels, epochs=100, verbose=1)

We can see the learned embeddings. Think of each word's vector as a location in 8-dimensional space, where words associated with positive reviews are close to other words associated with positive reviews. Similarly, training places words from negative reviews close to each other. While training sets these embeddings, the 33 weights between the embedding layer and the output neuron learn to transform the embeddings into an actual prediction. We can see these embeddings here.

In [None]:
print(embedding_layer.get_weights()[0].shape)
print(embedding_layer.get_weights())

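We can now evaluate this neural network's accuracy, including both the embeddings and the learned Dense Layer. Note that we evaluate on the same ten reviews that we used for training, so this is training accuracy.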

In [None]:
loss, accuracy = model.evaluate(padded_reviews, labels, verbose=1)
print(f"Accuracy: {accuracy}")

Accuracy: 1.0


In [None]:
print(f"Log-loss: {loss}")

Log-loss: 0.43393024802207947


However, the loss is not perfect, meaning that even though the predicted probabilities indicated a correct prediction in every case, the program did not achieve absolute confidence in each correct answer.  The lack of confidence was likely due to the small amount of noise (previously discussed) in the dataset.  Additionally, the fact that some words appeared in both positive and negative reviews contributed to this lack of absolute certainty.
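A small calculation shows how perfect accuracy and a non-zero loss can coexist. The probabilities below are hypothetical, chosen so that every prediction falls on the correct side of 0.5 without being certain; `binary_log_loss` is a helper written for this sketch.

```python
import math


def binary_log_loss(targets, probabilities):
    """Mean binary cross-entropy for predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(targets, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(targets)


true_labels = [1, 1, 0, 0]
# Hypothetical predictions: all on the right side of 0.5, none certain.
probabilities = [0.65, 0.65, 0.35, 0.35]

predicted = [1 if p >= 0.5 else 0 for p in probabilities]
accuracy = sum(int(a == b) for a, b in zip(predicted, true_labels)) / len(true_labels)
loss = binary_log_loss(true_labels, probabilities)
print(f"accuracy={accuracy}, loss={loss:.4f}")
```

Accuracy only checks which side of 0.5 each prediction lands on, while log-loss also penalizes every prediction for the confidence it lacks.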

# **Word Embeddings**

*Word embedding is a technique in which each individual word is transformed into a numerical representation (i.e., a vector). Each word is mapped to one vector, and the values of that vector are learned in much the same way as the weights of a neural network. The vectors try to capture various characteristics of the word with regard to the overall text. These characteristics can include the semantic relationship of the word, definitions, context, etc. With these numerical representations, we can do many things, such as identify similarities or dissimilarities between words.*

Word embedding [**[Wikipedia]**](https://en.wikipedia.org/wiki/Word_embedding) is an approach for representing words and documents. A word embedding is a numeric vector that represents a word in a lower-dimensional space. It allows words with similar meanings to have similar representations, and it can also approximate meaning. A word vector with 50 values can represent 50 unique features.

### **Goals of Word Embeddings**

*   To reduce dimensionality.
*   To use a word to predict the words around it.
*   To capture inter-word semantics.

#### **How are Word Embeddings used?**

*   Word Embeddings are used as input to machine learning models. Take the words $\rightarrow$ Give their numeric representation $\rightarrow$ Use in training or inference.

*   To represent or visualize any underlying patterns of usage in the corpus that was used to train them.

### **Implementations of Word Embeddings**

Word Embeddings are a method of extracting features out of text so that we can input those features into a machine learning model to work with text data. They try to preserve syntactical and semantic information. Methods such as Bag of Words (BOW), CountVectorizer, and TF-IDF rely on the word count in a sentence but do not save any syntactical or semantic information. In these algorithms, the size of the vector is the number of elements in the vocabulary. Because most of those elements are zero, the result is a sparse matrix. Large input vectors mean a huge number of weights, which in turn means more computation during training. Word Embeddings give a solution to these problems. Two common approaches to obtaining Word Embeddings are described below.

## 1.   **Word2Vec:**
In Word2Vec, every word is assigned a vector. We start with either a random vector or a one-hot vector.

**One-Hot Vector:** A representation where exactly one element of the vector is 1 and all the others are 0. If there are 500 words in the corpus, then the vector length will be 500. After assigning vectors to each word, we take a window size and iterate through the entire corpus. While we do this, one of two neural embedding methods is used:

#### 1.1.  **Continuous Bag of Words (CBOW):**
In this model, we use the neighboring words in the window to predict the central word.

![image.png](https://media.geeksforgeeks.org/wp-content/uploads/cbow-1.png)

#### 1.2.  **Skip Gram:**
In this model, we use the central word to predict its neighboring words, which makes it the reverse of the CBOW model. Skip-gram is often reported to produce more meaningful embeddings, particularly for infrequent words, while CBOW is faster to train.

![image.png](https://media.geeksforgeeks.org/wp-content/uploads/skip_gram.png)

After applying one of the above neural embedding methods for many iterations through the corpus, we get a trained vector for each word. These trained vectors preserve syntactical and semantic information in far fewer dimensions than the one-hot vectors. Vectors for words with similar meaning or semantic information are placed close to each other in space.
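The difference between the two methods is easiest to see in the training examples that each one draws from the same window. The sketch below only builds those examples for a toy sentence with a window of one word on each side; it does not train anything, and `context_windows` is a helper name chosen for this illustration.

```python
def context_windows(tokens, window):
    """Yield (center, neighbors) for every position in the token list."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


tokens = "the food was cold".split()

# CBOW: the neighbors are the input, and the center word is the target.
cbow_examples = [
    (neighbors, center) for center, neighbors in context_windows(tokens, 1)
]

# Skip-gram: the center word is the input, and each neighbor is a target.
skipgram_examples = [
    (center, neighbor)
    for center, neighbors in context_windows(tokens, 1)
    for neighbor in neighbors
]

print(cbow_examples)
print(skipgram_examples)
```

CBOW yields one example per position, while skip-gram yields one example per (center, neighbor) pair, so the same corpus produces more skip-gram examples.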

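## 2.   **GloVe:**
GloVe (Global Vectors for Word Representation) learns word vectors from word-to-word co-occurrence counts gathered over the whole corpus, rather than from one local context window at a time.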

### **Common Errors Made:**

*   We need to use exactly the same pipeline when deploying our model as was used to create the training data for the word embedding. If we use a different tokenizer or a different method of handling white space, punctuation, etc., we might end up with incompatible inputs.

*   Words in our input that do not have a pre-trained vector are known as out-of-vocabulary (OOV) words. We should replace those words with `"UNK"`, which means unknown, and then handle them separately.

*   **Dimension Mismatch:** Vectors can be of many lengths. If we train a model with vectors of one length (say 400) and then try to apply vectors of length 1000 at inference time, we will run into errors. So make sure to use the same dimensions throughout.
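The last two points can be guarded against at lookup time. The sketch below is a minimal illustration that assumes a tiny hand-made table of pre-trained vectors; the table contents, the `embed` helper, and `EXPECTED_DIM` are all hypothetical.

```python
UNK = "UNK"

# Hypothetical pre-trained table: every vector has the same length.
pretrained = {
    "good": [0.2, 0.7, -0.1],
    "food": [0.5, -0.3, 0.4],
    UNK: [0.0, 0.0, 0.0],
}
EXPECTED_DIM = 3


def embed(tokens, table, expected_dim):
    """Look up each token, falling back to the UNK vector for unseen words."""
    vectors = [table.get(token, table[UNK]) for token in tokens]
    for vector in vectors:
        if len(vector) != expected_dim:
            raise ValueError(f"expected {expected_dim} values, got {len(vector)}")
    return vectors


vectors = embed(["good", "sushi", "food"], pretrained, EXPECTED_DIM)
print(vectors)
```

Here "sushi" has no entry in the table, so it receives the `UNK` vector, and a vector of the wrong length raises an error immediately instead of failing later inside the model.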

### **References:**

*   [**Word Embeddings in NLP - GeeksforGeeks**](https://www.geeksforgeeks.org/word-embeddings-in-nlp/)

*   [**Word Embeddings Blog**](https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

*   [**The Illustrated Word2vec**](https://jalammar.github.io/illustrated-word2vec/)

In [None]:
!pip install texthero
!pip install textblob
!pip install spacy==3.3

In [None]:
# Import Library.
import pandas as pd
import numpy as np
from textblob import TextBlob
import texthero as hero

from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential

# Vocabulary Size.
voc_size = 10000

# Read Dataset.
data = pd.read_csv("spam.csv")

# Text Cleaning and Preprocessing.
# Remove URLs first: hero.clean strips punctuation, which would break URLs apart.
data["Message"] = data["Message"].pipe(hero.remove_urls).pipe(hero.clean)
data["Message"] = data["Message"].apply(
    lambda x: str(TextBlob(x).correct())
)  # Spelling Correction.
data["Class"] = data["Category"].apply(lambda x: 1 if x == "spam" else 0)

# Split Dataset into Dependent and Independent Features.
X = data["Message"]
y = data["Class"]

In [None]:
# One Hot Representation.
onehot_repr = [one_hot(words, voc_size) for words in X]

max_length = 15
embedded_docs = pad_sequences(onehot_repr, padding="post", maxlen=max_length)
print(embedded_docs)

[[2965 3992 4165 ... 6374 5709 3435]
 [6852 3870 2317 ...    0    0    0]
 [5065 1102  883 ... 1985 9646  652]
 ...
 [ 344 6988 2796 ...    0    0    0]
 [5913 6513 9233 ... 5487    0    0]
 [5082 9877 4832 ...    0    0    0]]


In [None]:
model = Sequential()
model.add(Embedding(voc_size, 10, input_length=max_length))
model.compile(optimizer="adam", loss="mse")
model.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_3 (Embedding)     (None, 15, 10)            100000    
                                                                 
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________


In [None]:
print(model.predict(embedded_docs))