# Embeddings

In the context of natural language processing (NLP), "embeddings" refer to dense vector representations of words (or sometimes phrases and sentences) in a continuous vector space. These vector representations are learned through unsupervised machine learning techniques like Word2Vec, GloVe, or FastText, where words with similar meanings or appearing in similar contexts are mapped to vectors that are close together in the vector space.

In traditional NLP, words have typically been represented using one-hot encoding, where each word is a sparse binary vector with a 1 in the position corresponding to the word's index in the vocabulary and 0s everywhere else. However, one-hot encoded vectors suffer from several limitations:

- **High Dimensionality:** One-hot encoded vectors are very high-dimensional, with the dimensionality equal to the size of the vocabulary. This leads to increased computational complexity and storage requirements.

- **Lack of Semantic Information:** One-hot vectors do not capture any semantic relationships between words. Each word is treated as an isolated entity with no notion of similarity or relatedness to other words.
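
Both limitations are easy to see in a small sketch. The four-word vocabulary below is made up for illustration; the helper names `one_hot` and `dot` are not from any library.

```python
# Toy vocabulary; the words are illustrative, not from any real dataset.
vocab = ["cat", "dog", "feline", "sun"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of len(vocab) with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

cat, feline = one_hot("cat"), one_hot("feline")
print(cat)               # [1, 0, 0, 0]
print(dot(cat, feline))  # 0: no overlap, even though the words are related
print(dot(cat, cat))     # 1
```

The vector length grows with the vocabulary, and any two different words have a dot product of 0, so "cat" is no closer to "feline" than it is to "sun".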

Embeddings address these limitations and offer several advantages in NLP:

- **Low-dimensional Dense Representations:** Word embeddings are low-dimensional dense vectors, typically ranging from 50 to 300 dimensions, making them computationally efficient and memory-friendly compared to one-hot vectors.
- **Semantic Relationships:** Embeddings capture semantic relationships between words. Words with similar meanings or appearing in similar contexts will have similar vector representations, enabling models to understand the meaning and context of words.
- **Generalization:** Word embeddings allow NLP models to generalize better across different tasks and datasets. Pre-trained word embeddings can be used as features for various downstream tasks, even if the training data for the downstream task is limited.
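- **Out-of-Vocabulary (OOV) Words:** Subword-based embeddings such as FastText can build representations for words not seen during training (OOV words) by composing the vectors of their character n-grams. Purely word-level models such as Word2Vec and GloVe have no vector for an unseen word.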
- **Efficiency:** Once trained, word embeddings can be efficiently stored and reused, which is especially important for large-scale NLP applications.
- **Capturing Analogies:** Word embeddings can capture analogical relationships like "king" - "man" + "woman" ≈ "queen," allowing models to perform analogy-based reasoning.
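
The analogy property can be illustrated with hand-made vectors. The sketch below assumes a hypothetical 2-dimensional space whose axes loosely stand for "royalty" and "gender"; real embeddings are learned and their dimensions are not interpretable this way.

```python
import math

# Hypothetical 2-D embeddings: [royalty, gender]. Values are invented for illustration.
vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```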

## Word2Vec

Word2Vec is a popular technique for learning word embeddings. It was introduced by researchers at Google in 2013 and has since become one of the foundational techniques in natural language processing (NLP) and related fields.

The basic idea behind Word2Vec is to represent each word as a point in a continuous vector space, where words with similar meanings or contexts are located close to each other. The key intuition is the distributional hypothesis, which posits that words appearing in similar contexts tend to have similar meanings. For example, in the sentences "I love cats" and "I adore felines," the words "love" and "adore" are used in similar contexts and have similar meanings.

Word2Vec can be trained using two main architectures: Continuous Bag of Words (CBOW) and Skip-gram. Let's explore each of these in detail:

### 1. Continuous Bag of Words (CBOW)
CBOW aims to predict a target word based on its surrounding context words. Given a sequence of words in a sentence, CBOW tries to predict the middle word based on the surrounding context words. The context window size determines how many words before and after the target word are considered as the context.

For example, consider the sentence: "The cat sat on the mat." If we set the context window size to 2 and assume "sat" is the target word, CBOW will use the context words "The," "cat," "on," and "the" to predict the word "sat."
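
As a rough sketch of how such training examples could be generated, the hypothetical helper `cbow_pairs` below slides a window over the sentence. Real implementations add details this sketch ignores, such as subsampling of frequent words.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for every position in the sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

tokens = "The cat sat on the mat".split()
for context, target in cbow_pairs(tokens):
    print(context, "->", target)
# One of the printed lines is: ['The', 'cat', 'on', 'the'] -> sat
```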

The architecture involves the following steps:
- Convert the context words to their corresponding word embeddings.
- Average these embeddings to create a context vector.
- Use this context vector as input to a neural network to predict the target word.
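
The three steps above can be sketched as a single forward pass. This is a minimal illustration with randomly initialised, untrained weights and a full softmax; the names `input_emb`, `output_emb`, and `cbow_predict` are assumptions for this example, and the original Word2Vec replaces the full softmax with hierarchical softmax or negative sampling for speed.

```python
import math
import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat"]
dim = 4  # embedding size, kept tiny for readability

# Two randomly initialised tables: input (context) and output (target) embeddings.
input_emb = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
output_emb = {w: [random.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

def cbow_predict(context_words):
    # Steps 1 and 2: look up the context embeddings and average them.
    looked_up = [input_emb[w] for w in context_words]
    context_vector = [sum(column) / len(looked_up) for column in zip(*looked_up)]
    # Step 3: score every vocabulary word against the context vector ...
    scores = {w: sum(a * b for a, b in zip(context_vector, output_emb[w])) for w in vocab}
    # ... and turn the scores into probabilities with a numerically stable softmax.
    max_score = max(scores.values())
    exp_scores = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exp_scores.values())
    return {w: e / total for w, e in exp_scores.items()}

probs = cbow_predict(["the", "cat", "on", "the"])
print(probs)  # untrained weights, so the distribution is close to uniform
```

Training would adjust both tables so that the probability assigned to the true target word ("sat" here) increases.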

Let's implement it using Python. The third-party `gensim` package makes this easy. Let's begin by installing the package.

```
pip install gensim
```

In [1]:
from gensim.models import Word2Vec

In [2]:
# Sample corpus (list of sentences)
corpus = [
    "I love cats",
    "I adore felines",
    "Dogs are loyal",
    "Cats and dogs are pets",
    "The sun is shining"
]

In [3]:
# Tokenize the sentences into words
tokenized_corpus = [sentence.lower().split() for sentence in corpus]

# CBOW model
cbow_model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=2, sg=0, min_count=1, workers=4)

In [4]:
# Function to find similar words using the trained model
def find_similar_words(model, word, top_n=5):
    similar_words = model.wv.most_similar(positive=[word], topn=top_n)
    return similar_words

In [5]:
# Test the model
target_word = "cats"
print(f"CBOW - Similar words to '{target_word}': {find_similar_words(cbow_model, target_word)}")

CBOW - Similar words to 'cats': [('pets', 0.19913175702095032), ('felines', 0.17272792756557465), ('shining', 0.17018885910511017), ('sun', 0.14589877426624298), ('is', 0.06408977508544922)]


### 2. Skip-gram
Skip-gram works in the opposite way of CBOW. It aims to predict context words given a target word. In other words, it tries to find the context words that are most likely to appear in the given sentence with a particular target word.

For the same example sentence, "The cat sat on the mat," if "sat" is the target word, Skip-gram will try to predict the context words "The," "cat," "on," and "the."

The architecture involves the following steps:
- Convert the target word to its corresponding word embedding.
- Use this embedding as input to a neural network to predict the context words.
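
A minimal sketch of the training pairs Skip-gram would see for the example sentence follows. The helper name `skipgram_pairs` is hypothetical; each pair is (target word, one context word).

```python
def skipgram_pairs(tokens, window=2):
    """Build (target_word, context_word) pairs: one pair per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

tokens = "The cat sat on the mat".split()
sat_pairs = [pair for pair in skipgram_pairs(tokens) if pair[0] == "sat"]
print(sat_pairs)
# [('sat', 'The'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the')]
```

Where CBOW produces one training example per position, Skip-gram produces one per (target, context) pair, which is one reason it tends to be slower to train.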

In [6]:
# Skip-gram model
skipgram_model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=2, sg=1, min_count=1, workers=4)

In [7]:
# Test the model
target_word = "cats"
print(f"Skip-gram - Similar words to '{target_word}': {find_similar_words(skipgram_model, target_word)}")

Skip-gram - Similar words to 'cats': [('pets', 0.19913214445114136), ('felines', 0.17272792756557465), ('shining', 0.17018885910511017), ('sun', 0.1459500789642334), ('is', 0.06408977508544922)]


Word2Vec is trained using a large corpus of text data, and the neural network parameters (word embeddings) are adjusted during training to maximize the likelihood of correctly predicting the context words or target words. The five-sentence toy corpus used above is far too small for that, which is why the CBOW and Skip-gram similarity scores are low and nearly identical. Once trained on enough data, the word embeddings can be used in various downstream NLP tasks, such as sentiment analysis, machine translation, and named entity recognition, among others.

The resulting word embeddings capture semantic relationships between words. Words with similar meanings or appearing in similar contexts will have similar vector representations, and their Euclidean or cosine distances in the vector space will be small. This property allows the word embeddings to be used in similarity calculations, analogy tasks (e.g., "king" - "man" + "woman" ≈ "queen"), and even for visualization purposes.

## GloVe Word Embedding
GloVe stands for Global Vectors for Word Representation and was developed by researchers at Stanford University. It is an unsupervised learning algorithm that generates word embeddings from a global word-word co-occurrence matrix aggregated over a given corpus. To get started with GloVe, first download one of the pre-trained models hosted [here](https://nlp.stanford.edu/projects/glove/). Several pre-trained models are available there; pick whichever suits your needs.

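The basic idea behind the GloVe word embedding is to derive the relationships between words from global co-occurrence statistics, that is, from how often each pair of words appears near each other across the whole corpus.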
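
To make "co-occurrence statistics" concrete, here is a minimal sketch that counts how often word pairs appear within a fixed window. The helper `cooccurrence_counts` and the three-sentence corpus are illustrative assumptions; this only covers the counting stage with raw counts and does not fit GloVe vectors to them.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            stop = min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["I love cats", "I adore felines", "Cats and dogs are pets"]
counts = cooccurrence_counts(corpus)
print(counts[("i", "love")])    # 1
print(counts[("cats", "and")])  # 1
print(counts[("love", "pets")]) # 0: never seen within the same window
```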

To work with GloVe embeddings as shown here, you first need to install the Python libraries `scipy` and `numpy` (if they are not installed already). Use the commands below to do so.

```
pip3 install scipy
pip3 install numpy
```

In [None]:
import numpy as np
from scipy import spatial

glove_filepath = 'glove.6B.50d.txt'

embeddings_dict = {}
with open(glove_filepath, 'r', encoding='utf-8') as f:  # explicit encoding avoids platform-dependent decode errors
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector

# Find similar Vectors
def find_similar_vectors(inputs):
    return sorted(embeddings_dict.keys(), key=lambda word: spatial.distance.euclidean(embeddings_dict[word], embeddings_dict[inputs]))

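# Example: the query word itself comes first (distance 0), followed by its nearest neighbours
find_similar_vectors('king')[:5]    # Get the 5 closest words, including 'king'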

## FastText Word Embedding

Facebook AI Research lab developed an open-source word-embedding library called `FastText` with the purpose of achieving more accurate and scalable solutions quickly while processing large text data. `FastText` is an extension of `Word2Vec` that adds subword information.

Unlike Word2Vec, which feeds individual words to the neural network, FastText breaks a word into character n-grams and feeds those to the neural network. For instance, the character trigrams of the word `fasttext` (with `<` and `>` marking the word boundaries) are:

`<fa`, `fas`, `ast`, `stt`, `tte`, `tex`, `ext`, `xt>`

An embedding vector for each of these n-grams is obtained after training the neural network. These n-gram vectors are then added up to obtain the word embedding vector of the original word `fasttext`.
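
The decomposition and summation can be sketched in a few lines. The helper names and the random n-gram vectors below are assumptions for illustration only; the real library uses a range of n-gram lengths and a hashed lookup table rather than a plain dictionary.

```python
import random

def char_ngrams(word, n=3):
    """Split a word into character n-grams, with < and > marking the boundaries."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("fasttext"))
# ['<fa', 'fas', 'ast', 'stt', 'tte', 'tex', 'ext', 'xt>']

random.seed(0)
dim = 4
ngram_vectors = {}  # hypothetical n-gram table; a trained model would supply these values

def word_vector(word):
    """Sum the vectors of all character n-grams of the word."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        if gram not in ngram_vectors:  # stand-in for a learned vector
            ngram_vectors[gram] = [random.uniform(-1, 1) for _ in range(dim)]
        total = [t + g for t, g in zip(total, ngram_vectors[gram])]
    return total

print(word_vector("fasttext"))
```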

**How is FastText better than Word2Vec?**

- Compound words like `fasttext` can be represented reasonably well even if the training data does not contain the word `fasttext`, because other words like `fast` and `text` contain many of the same n-grams.
- Even though words like `fast`, `faster`, and `fastest` share the same root, Word2Vec handles them independently according to their context. FastText, on the other hand, shares parameters among such words and so makes efficient use of their morphological structure.
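
Both points can be checked by comparing trigram sets directly. This is a small sketch using a hypothetical `trigram_set` helper, not the library's own n-gram extraction.

```python
def trigram_set(word):
    """Return the set of character trigrams of a word, with boundary markers."""
    wrapped = f"<{word}>"
    return {wrapped[i:i + 3] for i in range(len(wrapped) - 2)}

# Compound word: most trigrams of "fasttext" already occur in "fast" or "text".
seen = trigram_set("fast") | trigram_set("text")
target = trigram_set("fasttext")
print(sorted(target & seen))  # ['<fa', 'ast', 'ext', 'fas', 'tex', 'xt>']
print(sorted(target - seen))  # ['stt', 'tte']

# Shared root: these trigrams are common to all three word forms.
shared = trigram_set("fast") & trigram_set("faster") & trigram_set("fastest")
print(sorted(shared))         # ['<fa', 'ast', 'fas']
```

Six of the eight trigrams of `fasttext` would already have trained vectors, and the three forms of `fast` share three trigrams whose vectors are updated by all of them.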

Let's try implementing it for real. The open-source Python library `gensim` makes working with FastText easy. We will also use `nltk` for preprocessing, so let's install both libraries.

```
pip3 install nltk
pip3 install gensim
```

In [3]:
from gensim.models import FastText
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/arun/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
data = ["This is a sentence.", "This is another sentence."]
data = [word_tokenize(sentence) for sentence in data]

In [5]:
model = FastText(data, vector_size=128, window=5, min_count=1, workers=4,sg=1)
# model.save('./fasttext.ft')
fmodel = model.wv

In [6]:
# fmodel['this']
fmodel.similar_by_word('this', topn=3)

[('This', 0.25602468848228455),
 ('is', 0.1581001877784729),
 ('.', 0.05240068957209587)]

This code first tokenizes two short sentences into lists of words. It then creates and trains a Skip-gram (`sg=1`) FastText model with a vector size of 128, a window size of 5, and a minimum count of 1. Finally, it looks up the three words most similar to "this". With such a tiny corpus the similarity scores are low and not meaningful.

Here is an explanation of the code:

- The `gensim.models.FastText` class is used to create a FastText model.
- The `vector_size` parameter specifies the size of the word embeddings.
- The `window` parameter specifies the size of the context window.
- The `min_count` parameter specifies the minimum number of times a word must appear in the text data in order to be included in the model.
- The `sg` parameter selects the training architecture: `1` for Skip-gram, `0` for CBOW.
- The `model.wv` property returns a `KeyedVectors` object that contains the word embeddings. It provides methods for accessing and manipulating the embeddings, such as `similar_by_word`.

## BERT
BERT (Bidirectional Encoder Representations from Transformers) is a neural network model that was pre-trained on a large corpus of unlabeled text (BooksCorpus and English Wikipedia). It can be used for a variety of natural language processing (NLP) tasks, such as question answering, text classification, and sentiment analysis.

**How does BERT work?**

BERT is a transformer-based model, which means that it uses a stack of self-attention layers to learn the relationships between words in a sentence. Because every layer attends to both the left and the right context, pre-training on a large text corpus lets the model learn the contextual meaning of words.
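
To give a feel for what a self-attention layer computes, here is a heavily simplified sketch of scaled dot-product attention. It assumes a single head and uses the input vectors directly as queries, keys, and values; BERT itself applies learned projections and multiple heads, which are omitted here. The three 2-dimensional input vectors are made up.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head attention with queries = keys = values = X (no learned weights)."""
    d = len(X[0])
    outputs, weights = [], []
    for query in X:
        # Similarity of this token to every token, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in X]
        w = softmax(scores)
        weights.append(w)
        # New representation: attention-weighted average of all token vectors.
        outputs.append([sum(wi * value[k] for wi, value in zip(w, X)) for k in range(d)])
    return outputs, weights

X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # three toy token vectors
outputs, weights = self_attention(X)
for row in weights:
    print([round(w, 3) for w in row])
```

Each output vector mixes information from every position in the sentence, to the left and to the right, which is what makes the resulting representations contextual.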

**How to use BERT for embedding?**

BERT can be used to generate word embeddings, which are vector representations of words that capture their semantic meaning. To generate word embeddings using BERT, you first need to tokenize the input text into individual words or subwords (using the BERT tokenizer). You can then pass the tokenized input through the BERT model to generate a sequence of hidden states. The hidden states can then be used to represent the words in the input text.
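
The subword step can be sketched as greedy longest-match-first splitting, the general idea behind WordPiece tokenization. The function `wordpiece` and the tiny vocabulary below are illustrative assumptions; the real BERT tokenizer also handles lowercasing, punctuation, and a vocabulary of roughly 30,000 entries.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedily split a word into the longest matching vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece that continues a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter piece
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

toy_vocab = {"em", "##bed", "##ding", "##s", "fox"}  # hypothetical vocabulary
print(wordpiece("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece("fox", toy_vocab))         # ['fox']
print(wordpiece("cat", toy_vocab))         # ['[UNK]']
```

Because one word can become several pieces, the model returns one hidden state per piece rather than per word.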

To use BERT, we will rely on HuggingFace's `transformers` library, which in turn needs PyTorch installed for the model class used here. So let's begin by installing the required libraries.

```
pip3 install torch
pip3 install transformers
```

In [8]:
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize the input text
text = 'The quick brown fox jumps over the lazy dog.'
tokens = tokenizer(text=text, return_tensors="pt")

# Get the BERT embeddings
embeddings = model(**tokens).last_hidden_state

# Print the token IDs and the embedding shape for each sequence in the batch (one sequence here)
for token, embedding in zip(tokens["input_ids"], embeddings):
    print(token, embedding.shape)

tensor([  101,  1996,  4248,  2829,  4419, 14523,  2058,  1996, 13971,  3899,
         1012,   102]) torch.Size([12, 768])


This code first loads the BERT tokenizer and the BERT model. It then tokenizes the input text and converts the tokens to IDs, adding the special `[CLS]` (ID 101) and `[SEP]` (ID 102) tokens at the start and end. The IDs are passed to the BERT model, which generates a sequence of hidden states: one 768-dimensional vector for each of the 12 tokens. These hidden states can then be used to represent the words in the input text.