# Word Embedding

Word embedding is a technique used in natural language processing to map words or phrases *from a vocabulary* to *vectors of real numbers*, which represent the words in a continuous vector space. These embeddings capture semantic meanings, relationships, and context of the words, enabling algorithms to process text using these numerical representations.

The primary benefits of word embeddings include:
- **Dimensionality Reduction:** Instead of using high-dimensional one-hot encoded vectors, word embeddings reduce the dimensionality, which helps in handling the curse of dimensionality in large datasets.
- **Semantic Similarity:** Words with similar meanings are often placed close together in the embedding space. For example, "king" and "queen" might be positioned nearer each other than "king" and "apple".
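- **Context Capture:** Embeddings are learned from the contexts in which words appear, so usage patterns shape each vector. Static methods such as Word2Vec and GloVe still assign a single vector per word, whereas contextual models such as ELMo and BERT produce a different vector for each occurrence, which lets them separate the meanings of "bank" in "river bank" and "savings bank".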
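
To make the idea of closeness in the embedding space concrete, the sketch below measures cosine similarity between a few hand-picked 3-dimensional vectors. The numbers are invented for illustration and do not come from any trained model; real embeddings usually have tens to hundreds of dimensions.

```python
import numpy as np

# Hypothetical toy vectors chosen by hand for illustration only.
toy_embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.8, 0.9, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_apple = cosine_similarity(toy_embeddings["king"], toy_embeddings["apple"])
print(f"king vs queen: {sim_king_queen:.3f}")
print(f"king vs apple: {sim_king_apple:.3f}")
```

With these toy values, "king" scores much closer to "queen" than to "apple", which is the behaviour the Semantic Similarity bullet describes.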

Some popular methods for generating word embeddings include:
- **Word2Vec:** Developed by a team led by Tomas Mikolov at Google, it offers two architectures: Continuous Bag-of-Words (CBOW) and Skip-Gram. Both methods use a shallow neural network model to produce embeddings.
- **GloVe (Global Vectors for Word Representation):** Developed by Stanford University researchers, GloVe constructs embeddings by analyzing word co-occurrence statistics across a corpus to learn relationships.
- **FastText:** Developed by Facebook’s AI Research lab, FastText extends Word2Vec with subword information by representing each word as a bag of character n-grams. This helps it build vectors for rare and out-of-vocabulary words and improves results on morphologically rich languages.
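
As a rough illustration of the subword idea behind FastText, the sketch below extracts character n-grams from a word after wrapping it in `<` and `>` boundary markers. The helper name `char_ngrams` is hypothetical, and the default range of 3 to 6 characters follows the range described in the FastText paper; the real library also hashes the n-grams and learns a vector for each, which is not shown here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

print(char_ngrams("where", n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares n-grams with known words, a vector can be assembled for it from those shared pieces.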

Word embeddings are a foundational technique in many NLP tasks like text classification, sentiment analysis, machine translation, and more. They can be trained from scratch or leveraged through pre-trained models available in libraries like TensorFlow, PyTorch, and Hugging Face’s Transformers.

Using Keras Embedding Layer for Word Embedding
-----
The Keras `Embedding` layer is a simple yet powerful tool in TensorFlow for handling word embeddings in deep learning models designed for natural language processing tasks. The layer essentially maps integer indices (which represent specific words) to dense vectors of fixed size. It's a way to convert text data into a form that neural networks can work with effectively.

Here’s a basic breakdown of how the `Embedding` layer works and how to use it:

### Key Features
- **Input Dimension:** This is the size of the vocabulary. More precisely, it must be the largest integer index that can appear in the input plus one, since index 0 counts as well.
- **Output Dimension:** This is the dimensionality of the embedding vector. Each word will be represented by a vector of this size.
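- **Input Length:** This is the length of the input sequences; all sequences in a batch must have the same length. In Keras 3 the `input_length` argument is deprecated, and the sequence length is taken from the model's input shape instead.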
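
Conceptually, the layer is a lookup table with one row per word index. The NumPy sketch below (toy sizes and random values, not Keras internals) shows that picking rows by index gives the same result as multiplying one-hot vectors by the table, without ever building the one-hot vectors.

```python
import numpy as np

vocab_size, embedding_dim = 6, 3          # illustrative sizes, not from the text
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))  # one row per word index

word_indices = np.array([2, 5, 2])        # a toy sequence of integer-encoded words

# Direct lookup: pick the rows named by the indices.
looked_up = embedding_table[word_indices]

# Equivalent (but wasteful) formulation: one-hot vectors times the table.
one_hot_rows = np.eye(vocab_size)[word_indices]
via_matmul = one_hot_rows @ embedding_table

print(looked_up.shape)                     # (3, 3)
print(np.allclose(looked_up, via_matmul))  # True
```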

### Basic Usage
When you instantiate an `Embedding` layer, you need to specify the `input_dim` (vocabulary size), `output_dim` (dimension of the embedding vector), and optionally, `input_length` (length of input sequences). The layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

Here is a simple example in Keras:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Input

model = Sequential()
# Declare the input shape: sequences of 10 integer word indices.
# (Older Keras versions took this as `input_length=10` on the Embedding layer;
# Keras 3 deprecates that argument.)
model.add(Input(shape=(10,)))
# input_dim = 1000 (vocabulary size)
# output_dim = 64 (embedding dimension)
model.add(Embedding(input_dim=1000, output_dim=64))
model.summary()
```

### Training
The `Embedding` layer is trained just like any other layer in a neural network: the model attempts to reduce the loss function during training, adjusting the weights (i.e., the embeddings) using backpropagation. Over time, these embeddings adjust to encapsulate useful properties and relationships among words based on the training data.
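
The sketch below illustrates one consequence of training a lookup table: in a single gradient step, only the rows whose indices appear in the batch are updated. It uses plain gradient descent and a made-up squared-error loss in NumPy, which is an assumption for illustration; Keras computes the gradients automatically and the optimizer may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(5, 2))     # 5 word indices, 2-dimensional vectors
before = embedding_table.copy()

batch_indices = np.array([1, 3, 1])           # indices seen in this (toy) batch
targets = np.zeros((3, 2))                    # hypothetical regression targets

# Forward pass: look up the vectors; loss = 0.5 * sum((vector - target)^2)
vectors = embedding_table[batch_indices]
grad_vectors = vectors - targets              # dLoss/dvector for each looked-up row

# Backward pass: scatter-add the gradients into the rows that were used.
grad_table = np.zeros_like(embedding_table)
np.add.at(grad_table, batch_indices, grad_vectors)

learning_rate = 0.1
embedding_table -= learning_rate * grad_table

changed_rows = np.where(np.any(embedding_table != before, axis=1))[0]
print(changed_rows)                           # [1 3]
```

Index 1 appears twice in the batch, so its row accumulates two gradient contributions, while rows 0, 2, and 4 are untouched.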

### Benefits and Use Cases
- **Reduced Dimensionality:** It maps high-dimensional one-hot vectors to lower-dimensional dense vectors.
- **Context Sensitivity:** In combination with other layers (like LSTM or GRU), it can capture contextual relationships between words in sequences.
- **Pre-trained Embeddings:** You can initialize the `Embedding` layer with pre-trained word embeddings such as Word2Vec or GloVe to enhance model performance, especially when you have limited data for training.

### Initializing with Pre-trained Word Embeddings
If you want to use pre-trained embeddings, you can load them and then initialize the `Embedding` layer with these weights. Here’s a brief example using GloVe embeddings:

```python
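import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

# Assume `embedding_matrix` is a NumPy array loaded from GloVe with shape
# (vocabulary_size, embedding_dimension); row i holds the vector for word index i.
# embedding_dimension is typically 50, 100, 200, or 300 for GloVe.
vocabulary_size, embedding_dimension = embedding_matrix.shape

model = Sequential()
model.add(Embedding(input_dim=vocabulary_size,
                    output_dim=embedding_dimension,
                    # A Constant initializer is used here because the `weights=[...]`
                    # argument is not accepted by every Keras version.
                    embeddings_initializer=Constant(embedding_matrix),
                    trainable=False))  # Set trainable to False to keep the embeddings fixed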
```
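
The `embedding_matrix` assumed above has to be assembled from the GloVe vectors and your own word-to-index mapping. The following is a minimal sketch of that step, assuming the usual GloVe text layout of one word per line followed by its space-separated values. The helper name `build_embedding_matrix`, the stand-in lines, and the `word_index` dictionary are all hypothetical; in practice the lines would be read from a downloaded GloVe file.

```python
import numpy as np

def build_embedding_matrix(glove_lines, word_index, embedding_dimension):
    """Fill row i with the GloVe vector of the word whose index is i.

    Words missing from the GloVe data keep an all-zero row.
    """
    vectors = {}
    for line in glove_lines:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")

    # Index 0 is reserved for padding, so the matrix needs max index + 1 rows.
    matrix = np.zeros((max(word_index.values()) + 1, embedding_dimension), dtype="float32")
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

# Stand-in for lines read from a GloVe text file; the values are invented.
fake_glove_lines = ["food 0.1 0.2 0.3", "nice 0.4 0.5 0.6"]
word_index = {"nice": 1, "food": 2, "unseen": 3}   # hypothetical word -> index map

embedding_matrix = build_embedding_matrix(fake_glove_lines, word_index, 3)
print(embedding_matrix.shape)   # (4, 3)
```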

This flexibility and simplicity make the Keras `Embedding` layer highly useful for various NLP tasks, such as sentiment analysis, text classification, and more, allowing for efficient learning and representation of text data.

Now, let's see how it really works in practice.

In [1]:
import numpy as np
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

reviews = ['nice food',
        'amazing restaurant',
        'too good',
        'just loved it!',
        'will go again',
        'horrible food',
        'never go there',
        'poor service',
        'poor quality',
        'needs improvement']

sentiment = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])


In [2]:
one_hot("amazing restaurant", 30)  # Despite its name, `one_hot` hashes each word to an integer in [1, 29] (n - 1 = 29).
                                   # Different words can collide on the same integer, so `n` is usually set
                                   # larger than the vocabulary size to make collisions less likely.

[5, 29]
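
To show what happens behind this call, here is a simplified stand-in for the hashing step. The helper name `hashed_indices` and the regex tokenizer are assumptions made for this sketch. It uses MD5 so that results are reproducible, whereas `one_hot` relies on Python's built-in `hash`, which can vary between sessions; the numbers printed here are therefore not expected to match `[5, 29]`.

```python
import hashlib
import re

def hashed_indices(text, n):
    """Map each word of `text` to an integer in [1, n - 1] via a stable hash."""
    words = re.findall(r"[a-z0-9']+", text.lower())   # simplified tokenizer
    return [
        int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
        for w in words
    ]

codes = hashed_indices("amazing restaurant", 30)
print(codes)
```

The same word always maps to the same integer, but nothing prevents two different words from sharing one, and index 0 is never produced, which leaves it free for padding.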

> Next, we encode all the reviews.

In [3]:
vocab_size = 30
encoded_reviews = [one_hot(review, vocab_size) for review in reviews]
print(encoded_reviews)

[[15, 9], [5, 29], [26, 25], [26, 17, 23], [1, 1, 4], [8, 9], [11, 1, 9], [27, 23], [27, 27], [9, 29]]
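
The encoded output already contains collisions. The sketch below pairs each word with the index printed above and lists every index shared by more than one word. The encoded values are copied from the output rather than recomputed, and the simple tokenization (dropping `!` and splitting on spaces) is an assumption that happens to fit these ten reviews.

```python
from collections import defaultdict

reviews = ['nice food', 'amazing restaurant', 'too good', 'just loved it!',
           'will go again', 'horrible food', 'never go there', 'poor service',
           'poor quality', 'needs improvement']

# Copied from the printed output of the encoding step.
encoded_reviews = [[15, 9], [5, 29], [26, 25], [26, 17, 23], [1, 1, 4],
                   [8, 9], [11, 1, 9], [27, 23], [27, 27], [9, 29]]

index_to_words = defaultdict(set)
for review, codes in zip(reviews, encoded_reviews):
    words = review.lower().replace('!', '').split()
    for word, code in zip(words, codes):
        index_to_words[code].add(word)

collisions = {i: sorted(ws) for i, ws in index_to_words.items() if len(ws) > 1}
for index, words in sorted(collisions.items()):
    print(index, words)
```

In this run, 20 distinct words share only 13 indices, so words such as "food", "needs", and "there" end up with the same embedding vector. A larger `vocab_size` would make such collisions less likely.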


> The encoded sequences have different lengths, so we standardize them with `pad_sequences`, which pads each sequence to `max_length` (here with trailing zeros, because `padding='post'`).

In [4]:
max_length = 4
padded_reviews = pad_sequences(encoded_reviews, maxlen=max_length, padding='post')
print(padded_reviews)

[[15  9  0  0]
 [ 5 29  0  0]
 [26 25  0  0]
 [26 17 23  0]
 [ 1  1  4  0]
 [ 8  9  0  0]
 [11  1  9  0]
 [27 23  0  0]
 [27 27  0  0]
 [ 9 29  0  0]]
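
For readers who want to see the padding logic spelled out, the following is a plain-Python approximation of what `pad_sequences` does with `padding='post'`. The helper name `pad_post` is hypothetical, and only the options used in this notebook are covered.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                 # mirrors the default truncating='pre'
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

print(pad_post([[15, 9], [26, 17, 23], [1, 2, 3, 4, 5]], maxlen=4))
# [[15, 9, 0, 0], [26, 17, 23, 0], [2, 3, 4, 5]]
```

Sequences longer than `maxlen` lose their leading items by default, which does not occur here because the longest review has three words.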


In [5]:
embedded_vector_size = 5

model = Sequential()
model.add(Embedding(vocab_size, embedded_vector_size, name="embedding"))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

model.build((None, max_length))  # Explicitly build the model and define the max_length
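
The data flow through this small model can be traced by hand. The NumPy sketch below mimics the three layers with randomly initialised stand-in parameters (not the actual Keras weights) to show the shape at each stage and where the parameter count comes from.

```python
import numpy as np

vocab_size, embedding_dim, max_length = 30, 5, 4
rng = np.random.default_rng(0)

# Randomly initialised parameters standing in for the untrained Keras weights.
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))
dense_kernel = rng.normal(scale=0.1, size=(max_length * embedding_dim, 1))
dense_bias = np.zeros(1)

batch = np.array([[15, 9, 0, 0], [26, 17, 23, 0]])      # two padded reviews

embedded = embedding_table[batch]                        # (2, 4, 5)
flattened = embedded.reshape(len(batch), -1)             # (2, 20)
logits = flattened @ dense_kernel + dense_bias           # (2, 1)
probabilities = 1.0 / (1.0 + np.exp(-logits))            # sigmoid

n_params = embedding_table.size + dense_kernel.size + dense_bias.size
print(embedded.shape, flattened.shape, probabilities.shape)
print(n_params)   # 30*5 + 20*1 + 1 = 171
```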

In [6]:
X = padded_reviews
y = sentiment

In [7]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

In [8]:
model.fit(X, y, epochs=80, verbose=0)

<keras.src.callbacks.history.History at 0x7a0d1091a950>

In [9]:
# evaluate the model
loss, accuracy = model.evaluate(X, y)
accuracy

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 163ms/step - accuracy: 1.0000 - loss: 0.5614


1.0
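
An accuracy of 1.0 alongside a loss above 0.5 may look contradictory. The sketch below uses invented probabilities (not the model's real predictions) to show how both can hold: every prediction falls on the correct side of 0.5, but only by a small margin. Note also that the model is evaluated here on the same ten reviews it was trained on, so the score says nothing about unseen text.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# Hypothetical predictions: every review is on the correct side of 0.5, but only just.
y_prob = np.where(y_true == 1, 0.57, 0.43)

accuracy = np.mean((y_prob > 0.5).astype(int) == y_true)
bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(f"accuracy={accuracy:.4f}, loss={bce:.4f}")
```

Accuracy only checks the thresholded decision, while binary cross-entropy also penalises low confidence, so the loss stays well above zero until the probabilities move towards 0 and 1.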

The learned word embeddings are the weights of the embedding layer, which can be retrieved as follows:

In [10]:
weights = model.get_layer('embedding').get_weights()[0]
print("The number of weights is: ", len(weights))
print(weights[:5])

The number of weights is:  30
[[ 0.0191403  -0.01421062  0.0062328  -0.05365282 -0.0060914 ]
 [ 0.04781847 -0.09340276 -0.04859197 -0.08498313 -0.03928798]
 [ 0.01348182  0.02770411  0.01283503  0.01935459 -0.0065825 ]
 [-0.03045956  0.02385424 -0.0367759  -0.03711759 -0.00880497]
 [ 0.10512549 -0.06518326 -0.05221771  0.10601646  0.07076833]]


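The matrix has 30 rows, one for each possible word index (`vocab_size`), and each row is a 5-dimensional vector, matching the embedding size chosen when the layer was created.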
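
To connect these rows back to the reviews, the sketch below looks up the vectors for 'will go again', whose padded encoding `[1, 1, 4, 0]` only uses indices among the five rows printed above. The values are copied from that output; with a live model you would index `weights` directly.

```python
import numpy as np

# The first five rows of the embedding matrix, copied from the output above.
weights_head = np.array([
    [ 0.0191403,  -0.01421062,  0.0062328,  -0.05365282, -0.0060914 ],
    [ 0.04781847, -0.09340276, -0.04859197, -0.08498313, -0.03928798],
    [ 0.01348182,  0.02770411,  0.01283503,  0.01935459, -0.0065825 ],
    [-0.03045956,  0.02385424, -0.0367759,  -0.03711759, -0.00880497],
    [ 0.10512549, -0.06518326, -0.05221771,  0.10601646,  0.07076833],
])

review_indices = np.array([1, 1, 4, 0])      # 'will go again' after encoding and padding
review_vectors = weights_head[review_indices]

print(review_vectors.shape)                                   # (4, 5)
print(np.array_equal(review_vectors[0], review_vectors[1]))   # True: 'will' and 'go' collide
```

The last row of `review_vectors` is the vector of index 0, the padding value, which the model learns like any other index unless masking is enabled.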