<a href="https://colab.research.google.com/github/arkeodev/nlp/blob/main/Embeddings/embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unraveling the World of Embeddings

In the realm of artificial intelligence, embeddings serve as the cornerstone for translating various types of data into a language that machines can understand. From the intricacies of human language to the complexities of images and sounds, embeddings transform raw data into dense vectors, paving the way for nuanced machine comprehension. This article explores the multifaceted world of embeddings, encompassing text, images, and audio, highlighting their development, applications, and how they're fine-tuned for specific tasks.

<figure>
    <img src="https://raw.githubusercontent.com/arkeodev/nlp/main/Decoding_Algorithms/images/greedy_decoder.png" width="400" height="400" alt="Greedy Decoder">
    <figcaption>Greedy Decoder</figcaption>
</figure>

## The Evolution of Text Representation Techniques

The evolution of text representation techniques in machine learning from simple models like One-Hot Encoding, Bag of Words (BoW), and Term Frequency-Inverse Document Frequency (TF-IDF) to complex embeddings illustrates a journey towards capturing the nuances of language more effectively.

Each method addressed specific limitations of its predecessors, adding layers of sophistication and bringing us closer to a more profound understanding of text semantics.

This article does not cover each of these techniques in detail. For more information on each of them, see the blog article [Step by Step Guide to Master NLP – Word Embedding and Text Vectorization](https://www.analyticsvidhya.com/blog/2021/06/part-5-step-by-step-guide-to-master-nlp-text-vectorization-approaches/#).



### 1. One-Hot Encoding

**Description**: In one-hot encoding, each word in the vocabulary is represented by a vector where one element is set to 1, and the rest are set to 0. The vector's length equals the size of the vocabulary, and each word is assigned a unique position in this vector space.



**Advantages**:
- **Base Method**: Being the foundational method, one-hot encoding's main advantage was its straightforward approach to turning text data into a numerical form that machine learning algorithms could process.
- **Simplicity**: Easy to understand and implement.
- **Uniqueness**: Each word is uniquely represented, with no overlap between representations.

**Disadvantages**:
- **Sparsity**: One-hot vectors are extremely sparse, leading to inefficient use of memory and computational resources, especially with large vocabularies.
- **No Semantic Information**: This method does not capture any semantic relationships between words. Words are treated as independent entities, making it impossible to gauge similarity or relatedness.

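```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Example vocabulary, reshaped to a column because the encoder expects 2-D input
words = np.array(['cat', 'dog', 'bird', 'fish']).reshape(-1, 1)

# One-hot encode; columns follow the sorted categories: bird, cat, dog, fish
# (the sparse_output parameter requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)
one_hot_encoded = encoder.fit_transform(words)

print(one_hot_encoded)
```

```
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [1. 0. 0. 0.]
 [0. 0. 0. 1.]]
```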
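To make the "no semantic information" limitation concrete, the following minimal sketch compares one-hot vectors with cosine similarity. It uses only the standard library; the helper names `one_hot` and `cosine` are illustrative assumptions, not part of scikit-learn.

```python
import math

# Same four words as above, in the sorted order used for the columns
vocabulary = ["bird", "cat", "dog", "fish"]

def one_hot(word):
    """Return a one-hot list for `word` over the toy vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocabulary]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine(one_hot("cat"), one_hot("dog")))   # 0.0
print(cosine(one_hot("cat"), one_hot("fish")))  # 0.0
print(cosine(one_hot("cat"), one_hot("cat")))   # 1.0
```

Every pair of distinct words is orthogonal, so "cat" is exactly as far from "dog" as it is from "fish".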


### 2. Bag of Words (BoW)

**Description**: The Bag of Words model represents text as an unordered collection of words, disregarding grammar and word order but keeping multiplicity. A document is represented by a vector indicating the frequency of each word from the vocabulary in the document.



**Advantages**:
- **Frequency Information**: Captures the frequency of words in the document, providing more information than one-hot encoding.
- **Simplicity**: Still relatively simple to understand and implement.
- **Improvement over One-Hot Encoding**: BoW addresses the lack of information in one-hot encoding by incorporating word frequency, which offers a basic measure of a word's "importance" in a document.

**Disadvantages**:
- **Sparsity and Dimensionality**: Similar to one-hot encoding, BoW vectors can become very sparse and high-dimensional with large vocabularies, leading to inefficiencies.
- **Lack of Context and Order**: BoW does not account for the order of words, losing important syntactic and semantic information. It treats "dog bites man" and "man bites dog" identically.
- **No Semantic Relationships**: BoW cannot capture semantic relationships or the meaning of words within the context of a sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample text
documents = ["the cat sat on the mat", "the dog sat on the log"]

# Create and fit the CountVectorizer to the documents
vectorizer = CountVectorizer()

# This step generates the vocabulary and transforms the documents into a sparse matrix
bow = vectorizer.fit_transform(documents)

# Vocabulary mapping: word -> column index
vocabulary = vectorizer.vocabulary_

# Print the vocabulary to understand how words are indexed
print("Vocabulary:")
for word, index in sorted(vocabulary.items(), key=lambda item: item[1]):
    print(f"{word}: {index}")

# Convert the BoW sparse matrix to a dense array and print it
dense_bow = bow.toarray()
print("\nBag of Words representation (dense array):")
print(dense_bow)

# Explain the output based on the vocabulary
print("\nInterpreting the BoW representation:")
print("Each row in the dense array corresponds to a document.")
print("Each column represents a word from the vocabulary, in the order printed above.")
print("Values represent how many times each word appears in each document.")

# For a more detailed explanation
print("\nDetailed explanation for the first document:")
print(documents[0])
for word, index in sorted(vocabulary.items(), key=lambda item: item[1]):
    print(f"The word '{word}' appears {dense_bow[0, index]} time(s).")
```

```
Vocabulary:
cat: 0
dog: 1
log: 2
mat: 3
on: 4
sat: 5
the: 6

Bag of Words representation (dense array):
[[1 0 0 1 1 1 2]
 [0 1 1 0 1 1 2]]

Interpreting the BoW representation:
Each row in the dense array corresponds to a document.
Each column represents a word from the vocabulary, in the order printed above.
Values represent how many times each word appears in each document.

Detailed explanation for the first document:
the cat sat on the mat
The word 'cat' appears 1 time(s).
The word 'dog' appears 0 time(s).
The word 'log' appears 0 time(s).
The word 'mat' appears 1 time(s).
The word 'on' appears 1 time(s).
The word 'sat' appears 1 time(s).
The word 'the' appears 2 time(s).
```
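The loss of word order mentioned in the disadvantages can be checked directly. This is a minimal standard-library sketch; `bag_of_words` is a hypothetical helper that counts whitespace-separated tokens, which is simpler than what `CountVectorizer` does internally.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["bites", "dog", "man"]
a = bag_of_words("dog bites man", vocabulary)
b = bag_of_words("man bites dog", vocabulary)

print(a, b, a == b)  # [1, 1, 1] [1, 1, 1] True
```

Both sentences map to the same vector even though they describe opposite events.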


### 3. Term Frequency-Inverse Document Frequency (TF-IDF)

**Description**: TF-IDF builds upon BoW by weighting word frequencies according to how commonly the words appear across documents. Words that appear frequently in a document but in few documents across the corpus are given higher importance.

**Advantages**:
- **Weighted Importance**: Weights word frequencies to highlight words that are important in a document but not common across all documents.
- **Reduces Impact of Common Words**: Helps mitigate the effect of commonly used words that may not contribute much to the overall meaning of documents.
- **Improvement over BoW**: TF-IDF builds on BoW's frequency counts by adding a weighting scheme that emphasizes words according to their distribution across documents, providing a rudimentary, corpus-level measure of relevance.

**Disadvantages**:
- **Still Lacks Context and Semantics**: While TF-IDF provides a way to highlight more "important" words, it still doesn't capture word meanings, relationships, or the context within which words appear.
- **High Dimensionality**: Like BoW, TF-IDF suffers from high dimensionality issues, leading to sparse representations that can be computationally expensive to work with.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text
documents = ["the cat sat on the mat", "the dog sat on the log"]

# Create and fit the TfidfVectorizer to the documents
tfidf_vectorizer = TfidfVectorizer()

# This step generates the vocabulary and transforms the documents into a TF-IDF-weighted sparse matrix
tfidf = tfidf_vectorizer.fit_transform(documents)

# Vocabulary mapping: word -> column index
vocabulary = tfidf_vectorizer.vocabulary_

# Print the vocabulary to understand how words are indexed
print("Vocabulary:")
for word, index in sorted(vocabulary.items(), key=lambda item: item[1]):
    print(f"{word}: {index}")

# Convert the TF-IDF sparse matrix to a dense array and print it
dense_tfidf = tfidf.toarray()
print("\nTF-IDF representation (dense array):")
print(dense_tfidf)

# Explain the output based on the vocabulary
print("\nInterpreting the TF-IDF representation:")
print("Each row in the dense array corresponds to a document.")
print("Each column represents a word from the vocabulary, in the order printed above.")
print("Values are the TF-IDF weights, representing the importance of each word in each document relative to the corpus.")

# For a more detailed explanation
print("\nDetailed explanation for the first document:")
print(documents[0])
for word, index in sorted(vocabulary.items(), key=lambda item: item[1]):
    print(f"The word '{word}' has a TF-IDF weight of {dense_tfidf[0, index]:.4f} in the first document.")
```

```
Vocabulary:
cat: 0
dog: 1
log: 2
mat: 3
on: 4
sat: 5
the: 6

TF-IDF representation (dense array):
[[0.44554752 0.         0.         0.44554752 0.31701073 0.31701073
  0.63402146]
 [0.         0.44554752 0.44554752 0.         0.31701073 0.31701073
  0.63402146]]

Interpreting the TF-IDF representation:
Each row in the dense array corresponds to a document.
Each column represents a word from the vocabulary, in the order printed above.
Values are the TF-IDF weights, representing the importance of each word in each document relative to the corpus.

Detailed explanation for the first document:
the cat sat on the mat
The word 'cat' has a TF-IDF weight of 0.4455 in the first document.
The word 'dog' has a TF-IDF weight of 0.0000 in the first document.
The word 'log' has a TF-IDF weight of 0.0000 in the first document.
The word 'mat' has a TF-IDF weight of 0.4455 in the first document.
The word 'on' has a TF-IDF weight of 0.3170 in the first document.
The word 'sat' has a TF-IDF weight of 0.3170 in the first document.
The word 'the' has a TF-IDF weight of 0.6340 in the first document.
```
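To see where these weights come from, the sketch below recomputes the first document's vector by hand. It assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. Tokenization here is a plain whitespace split, which gives the same tokens for these two sentences but is not the library's tokenizer in general.

```python
import math
from collections import Counter

documents = ["the cat sat on the mat", "the dog sat on the log"]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = {word: sum(word in doc for doc in tokenized) for word in vocabulary}

# Smoothed IDF (assumed to match scikit-learn's smooth_idf=True default)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf_vector(tokens):
    """Raw count times IDF for each vocabulary word, then L2-normalized."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

first = tfidf_vector(tokenized[0])
for word, weight in zip(vocabulary, first):
    print(f"{word}: {weight:.4f}")
```

Words shared by both documents ("on", "sat", "the") get an IDF of exactly 1, while "cat" and "mat" get `ln(1.5) + 1`, which is why they outweigh "on" and "sat" despite appearing equally often.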

### 4. N-Grams

**Description**: N-grams are sequences of *n* contiguous items from a given sample of text or speech. In text processing, these items are typically words or characters. N-grams help capture local context and sequence information within text data, providing a foundation for modeling language beyond individual words.

**Advantages**:

- **Contextual Information**: N-grams incorporate context by considering sequences of words or characters, capturing more information about language structure than individual words.
- **Improved Language Modeling**: By analyzing sequences of words, n-grams allow for better prediction of the next item in a sequence, enhancing language modeling tasks.
- **Flexibility and Simplicity**: N-grams offer a simple yet flexible approach to text representation, allowing adjustments in granularity by changing the value of *n*.

**Disadvantages**:

- **Explosion of Features**: As *n* increases, the number of possible n-grams can grow exponentially, leading to high-dimensional feature spaces and computational challenges.
- **Fixed Window Size**: N-grams capture context within a fixed window size (*n*), which may not always align with the actual scope of contextual dependencies in text.
- **Lack of Deep Semantic Understanding**: While n-grams can model the presence and co-occurrence of sequences, they lack the ability to capture deeper semantic relationships in the way that embeddings do.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample text
documents = ["the cat sat on the mat", "the dog sat on the log"]

# Create and fit the CountVectorizer to the documents with bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))

# Generate the vocabulary and transform the documents into a sparse matrix of bigrams
bigrams = vectorizer.fit_transform(documents)

# Vocabulary mapping: bigram -> column index
vocabulary = vectorizer.vocabulary_

# Print the vocabulary to understand how bigrams are indexed
print("Vocabulary (Bigrams):")
for bigram, index in sorted(vocabulary.items(), key=lambda item: item[1]):
    print(f"{bigram}: {index}")

# Convert the sparse matrix to a dense array and print it
dense_bigrams = bigrams.toarray()
print("\nBigram representation (dense array):")
print(dense_bigrams)

# Explain the output based on the vocabulary
print("\nInterpreting the Bigram representation:")
print("Each row in the dense array corresponds to a document.")
print("Each column represents a bigram from the vocabulary, in the order printed above.")
print("Values represent how many times each bigram appears in each document.")
```

```
Vocabulary (Bigrams):
cat sat: 0
dog sat: 1
on the: 2
sat on: 3
the cat: 4
the dog: 5
the log: 6
the mat: 7

Bigram representation (dense array):
[[1 0 1 1 1 0 0 1]
 [0 1 1 1 0 1 1 0]]

Interpreting the Bigram representation:
Each row in the dense array corresponds to a document.
Each column represents a bigram from the vocabulary, in the order printed above.
Values represent how many times each bigram appears in each document.
```
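The sliding-window idea behind n-grams fits in a few lines. The sketch below uses a hypothetical helper, `ngrams`, that works on any sequence, so the same function produces word bigrams and character trigrams. The last loop illustrates the feature explosion: with a 7-word vocabulary, the number of possible n-grams is `7 ** n`.

```python
def ngrams(items, n):
    """Return all contiguous windows of n items from a sequence, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

sentence = "the cat sat on the mat"

word_bigrams = ngrams(sentence.split(), 2)
print(word_bigrams)
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]

char_trigrams = ["".join(gram) for gram in ngrams("cat sat", 3)]
print(char_trigrams)
# ['cat', 'at ', 't s', ' sa', 'sat']

# Upper bound on distinct n-grams for a vocabulary of 7 words
for n in (1, 2, 3):
    print(n, 7 ** n)  # 1 7, then 2 49, then 3 343
```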


### 5. Transition to Embeddings

While the aforementioned methods provided essential advancements, they still fell short in capturing the complex semantics of language, contextual nuances, and the relationships between words. **Embeddings** emerged as a solution to these limitations by offering:

- **Dense Representations**: Unlike the sparse representations of previous methods, embeddings are dense, significantly reducing dimensionality and improving computational efficiency.
- **Semantic Information**: Embeddings capture not just the presence of words but their meanings, nuances, and the relationships between them, based on how they're used in large corpora.
- **Context Awareness**: With the advent of context-based embeddings (like those from BERT), the representation can vary depending on the word's usage, capturing its meaning more accurately in different contexts.

## Types of Embeddings

Embeddings are powerful tools in machine learning and natural language processing that convert various types of data into dense vector representations, making it easier for models to process and analyze. Here's a brief overview of the types of vector embeddings:

1. **Word Embeddings**
- **Description**: Represent individual words as vectors.
- **Techniques**: Word2Vec, GloVe, FastText.
- **Applications**: Semantic analysis, natural language processing tasks.

2. **Sentence Embeddings**
- **Description**: Represent entire sentences as vectors.
- **Models**: Universal Sentence Encoder (USE), SkipThought.
- **Applications**: Sentence similarity, sentiment analysis, document classification.

3. **Document Embeddings**
- **Description**: Represent longer texts like articles, papers, or books as vectors.
- **Techniques**: Doc2Vec (an implementation of the Paragraph Vectors approach).
- **Applications**: Document classification, information retrieval, content analysis.

4. **Image Embeddings**
- **Description**: Represent images as vectors by capturing visual features.
- **Techniques**: Convolutional Neural Networks (CNNs) such as ResNet and VGG.
- **Applications**: Image classification, object detection, image similarity.

5. **Audio Embeddings**
- **Description**: Represent audio signals or sounds as vectors by capturing their acoustic features and characteristics. Audio embeddings translate the complex, time-series data of audio files into a structured, high-dimensional space where similar sounds are represented by vectors that are close to each other.
- **Techniques**: Deep learning models like WaveNet, Transformer-based models, and spectrogram-based CNNs.
- **Applications**: Speech recognition, music classification, sound event detection, speaker identification, and emotion analysis from voice.
- **Purpose**: Capture the nuances of audio content, including the tone, pitch, rhythm, and other acoustic properties, enabling machines to understand and process audio data effectively.

6. **User Embeddings**
- **Description**: Represent users in a system or platform as vectors.
- **Applications**: Recommendation systems, personalized marketing, user segmentation.
- **Purpose**: Capture user preferences, behaviors, and characteristics.

7. **Product Embeddings**
- **Description**: Represent products in e-commerce or recommendation systems as vectors.
- **Applications**: Product recommendation, comparison, and analysis.
- **Purpose**: Capture product attributes, features, and semantic information.

8. **Multi-Modal Embeddings**
- **Description**: Represent data that combines information from multiple modalities (e.g., text, images, audio) as vectors. Multi-modal embeddings aim to capture the complementary and shared information across different types of data.
- **Techniques**: Fusion techniques in deep learning that might involve early fusion, late fusion, or hybrid approaches to integrate features from multiple neural network branches, each processing a different modality. Transformer-based models are increasingly used for their ability to handle sequences of different data types.
- **Applications**: Cross-modal information retrieval (e.g., finding images based on text descriptions), automatic captioning of images and videos, visual question answering (VQA), and enhanced recommendation systems that consider visual, textual, and auditory information.
- **Purpose**: Leverage the strengths of different data types to improve the performance of tasks that require understanding complex relationships between them, providing a richer representation of the data than would be possible with a single modality.

Multi-modal embeddings are at the forefront of advancing AI's ability to understand and interact with the world in a more human-like manner. By effectively combining information from various sources, these embeddings facilitate a deeper understanding of content, context, and user intentions, paving the way for innovative applications that span across text, vision, and audio domains.


## Context-Based Embeddings vs. Word-Based Embeddings

Context-based embeddings and word-based embeddings are two approaches for representing words in natural language processing (NLP) tasks. Both aim to convert text into numerical vectors, but they differ significantly in how they capture the meaning and usage of words in language. Understanding these differences is crucial for selecting the appropriate technique for various NLP applications.

### Word-Based Embeddings

**Description**: Word-based embeddings represent words as dense vectors in a continuous vector space. Each word is mapped to a single vector, and this representation is fixed, regardless of the word's context in different sentences. Popular models include Word2Vec, GloVe, and FastText.

**Advantages**:
- **Simplicity**: Easy to use and understand. Each word has a single, pre-computed vector.
- **Efficiency**: Since each word is represented by a single vector, these embeddings can be pre-trained on large corpora and reused across different tasks.
- **Semantic Similarity**: Word-based embeddings capture semantic relationships between words. Words that appear in similar contexts tend to have vectors that are close together in the embedding space.

**Disadvantages**:
- **Polysemy and Homonymy**: They struggle with words that have multiple related meanings (polysemy) or words that share a spelling but have unrelated meanings (homonymy), since each word form has only one vector representation.
- **Lack of Context**: The meaning of a word can change based on its context, which is not captured in a static, word-based embedding.
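The fixed-vector limitation is easy to see with a lookup table. The sketch below is illustrative only: the three-dimensional vectors are hand-picked hypothetical values, not output from Word2Vec, GloVe, or FastText, and `embed` is an assumed helper name.

```python
# Hypothetical 3-dimensional vectors, hand-picked for illustration only
static_embeddings = {
    "river": [0.9, 0.1, 0.0],
    "bank":  [0.5, 0.5, 0.1],
    "loan":  [0.0, 0.9, 0.2],
}

def embed(sentence):
    """Look up one fixed vector per known word; the surrounding words are ignored."""
    return {w: static_embeddings[w] for w in sentence.split() if w in static_embeddings}

bank_near_river = embed("the river bank")["bank"]
bank_near_loan = embed("the bank loan")["bank"]

print(bank_near_river == bank_near_loan)  # True
```

Whether "bank" refers to a riverside or a financial institution, a static table returns the same vector.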

### Context-Based Embeddings





**Description**: Context-based embeddings, also known as contextual embeddings, generate representations for words that take into account the surrounding words. As a result, the same word can have different embeddings based on its context, allowing for a more nuanced understanding of language. Examples include BERT, GPT (Generative Pretrained Transformer), and ELMo.

**Advantages**:
- **Dynamic Contextual Understanding**: These embeddings capture the meaning of words in context, addressing the limitations of word-based embeddings with polysemous words.
- **Richer Representations**: By considering the context, context-based embeddings can capture more complex linguistic patterns and relationships, leading to better performance on a variety of NLP tasks.
- **Adaptability**: They can be fine-tuned for specific tasks, allowing the pre-trained models to adapt to the nuances of a particular dataset or domain.

**Disadvantages**:
- **Computational Complexity**: Generating context-based embeddings is computationally more intensive than using static word-based embeddings. Processing requires significant computational resources, especially for large documents.
- **Implementation Complexity**: Working with models like BERT and GPT can be more complex due to the need for fine-tuning, handling tokenization in a specific way, and managing larger model sizes.
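As a rough intuition for how context can change a word's vector, the sketch below applies a single round of unparameterized dot-product self-attention to hypothetical static vectors. This is a toy, not BERT, GPT, or ELMo: real models use learned projection matrices, many layers, and positional information. The vectors and the function name `contextualize` are assumptions made for illustration.

```python
import math

# Hypothetical static vectors, hand-picked for illustration only
static_embeddings = {
    "river": [0.9, 0.1, 0.0],
    "bank":  [0.5, 0.5, 0.1],
    "loan":  [0.0, 0.9, 0.2],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contextualize(tokens):
    """Replace each token vector with an attention-weighted mix of all token vectors."""
    vectors = [static_embeddings[t] for t in tokens]
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot-product scores against every token, then softmax
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        exp_scores = [math.exp(s) for s in scores]
        total = sum(exp_scores)
        weights = [e / total for e in exp_scores]
        # Weighted average of the token vectors
        mixed = [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
        outputs.append(mixed)
    return outputs

bank_in_river_context = contextualize(["river", "bank"])[1]
bank_in_loan_context = contextualize(["bank", "loan"])[0]

print(bank_in_river_context)
print(bank_in_loan_context)
print(bank_in_river_context != bank_in_loan_context)  # True
```

The vector for "bank" is pulled toward "river" in one sentence and toward "loan" in the other, so the same word ends up with two different representations.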

## What "Words" Mean Under Different Tokenization Strategies

When discussing embeddings in the context of natural language processing (NLP), the term "words" can refer to actual words, but depending on the tokenization strategy, it can also refer to subwords or characters. This distinction is important because different models and approaches to embeddings treat the input text in different ways, leading to different representations.

### Words

- **Traditional Approach**: Initially, embeddings were primarily focused on words as the basic units of language, with each unique word in the vocabulary getting its own vector representation.
- **Tokenization**: Involves splitting text into individual words based on spaces and punctuation.
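A minimal sketch of word-level tokenization, assuming a simple regular expression that separates runs of word characters from punctuation marks. The helper `word_tokenize` and the `[UNK]` placeholder are illustrative choices, not taken from a specific library.

```python
import re

def word_tokenize(text):
    """Split text into lowercase words and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = word_tokenize("The cat sat on the mat.")
print(tokens)  # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']

# A word-level vocabulary has no entry for words it never saw
vocabulary = {"the", "cat", "sat", "on", "mat", "."}
unseen = word_tokenize("The cat sat unbelievably still.")
mapped = [t if t in vocabulary else "[UNK]" for t in unseen]
print(mapped)  # ['the', 'cat', 'sat', '[UNK]', '[UNK]', '.']
```

Every unknown word collapses into the same placeholder, which is the out-of-vocabulary problem that subword tokenization is designed to reduce.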

### Subwords

- **Models like BERT and GPT**: These models often use subword tokenization strategies (e.g., Byte-Pair Encoding (BPE), SentencePiece, WordPiece) to handle the vocabulary more efficiently.
- **Advantages**: Subword tokenization helps deal with the problem of out-of-vocabulary (OOV) words by breaking down unknown words into known subword units, allowing the model to generate embeddings for words it hasn't explicitly seen during training.
- **Example**: The word "unbelievable" might be tokenized into "un", "##believ", and "##able".
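The split in the example can be reproduced with a greedy longest-match-first procedure in the style of WordPiece. This is a minimal sketch under stated assumptions: `toy_vocab` is a hypothetical vocabulary chosen so that the example works, and real tokenizers learn vocabularies with tens of thousands of entries from data.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # no known piece covers this position
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical vocabulary for illustration only
toy_vocab = {"un", "##believ", "##able", "believ", "##un"}

print(wordpiece_tokenize("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because the pieces are reusable, a word the model never saw as a whole can still be represented by combining embeddings of known subwords.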

### Characters

- **Character-Level Models**: Some models and approaches operate at the character level, treating each character as the basic unit for generating embeddings.
- **Applications**: Character-level embeddings are particularly useful in tasks like named entity recognition (NER) in languages with rich morphology, or in models focusing on spelling and phonetics.

### Implications of Different Tokenizations for Embeddings

- **Flexibility and Coverage**: Subword and character-level tokenization offer more flexibility and better coverage of the language, especially for languages with large vocabularies or agglutinative languages where words can have many forms.
- **Context Sensitivity**: Regardless of whether embeddings are generated for words, subwords, or characters, context-based embedding models can dynamically adjust the representation based on the surrounding text. This means that the same subword or character can have different embeddings depending on its context, enhancing the model's ability to capture nuanced meanings.
- **Computational Efficiency**: Subword and character-level approaches shrink the vocabulary, and with it the embedding table, that the model needs to handle directly. The trade-off is longer token sequences, which increases the cost of processing each text.

## Word Based Embeddings

### Word2Vec

#### Skipgrams

#### Continuous bag of words

## Context Based Embeddings

## Training and Fine-Tuning: Customizing Embeddings

### Fine-Tuning for Domains and Tasks

Contextualized embeddings generated by transformer models can be fine-tuned for specific domains or tasks, and this is a common practice in natural language processing (NLP) to achieve state-of-the-art results on a wide range of tasks.

### Fine-Tuning Process

Fine-tuning involves taking a pre-trained transformer model, which has learned general language representations from a large corpus of text, and continuing the training process on a smaller, task-specific dataset. This allows the model to adjust its weights, including the embeddings, to better capture the nuances and terminology of the specific domain or task. Here's a general overview of how fine-tuning works:


1. **Start with a Pre-trained Model**: You begin with a model that has been pre-trained on a large, general-purpose dataset. This model has developed a broad understanding of the language, including its syntax and semantics.

2. **Select a Task-specific Dataset**: You then choose a smaller dataset that is specific to your task or domain. This dataset could be related to medical texts, legal documents, customer reviews, etc., depending on your needs.

3. **Continue Training**: The pre-trained model is then trained (or fine-tuned) on this task-specific dataset. During this process, all parts of the model, including the initial word embeddings and the transformer layers that produce the contextualized embeddings, are updated to better align with the specifics of the task or domain.

4. **Adjust Learning Rate**: It's common practice to use a smaller learning rate during fine-tuning than was used during the initial pre-training. This helps prevent the model from "forgetting" its general understanding of the language while it learns the specifics of the new task.

5. **Evaluation and Adjustment**: After fine-tuning, the model is evaluated on a separate validation set to ensure it has effectively adapted to the task. Adjustments may be made to the training process based on this evaluation to improve performance.
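The five steps above can be sketched with a deliberately tiny model. Everything here is a toy assumption: the "pre-trained" embeddings are hand-picked two-dimensional values, the model is an average of word vectors followed by a logistic classifier rather than a transformer, and the learning rate and epoch count are arbitrary. The point is only to show that both the new task head and the existing embeddings are updated during training.

```python
import math

# Step 1: "pre-trained" embeddings (hypothetical toy values)
embeddings = {
    "good": [0.2, 0.1], "great": [0.3, 0.0],
    "bad": [0.1, 0.2], "awful": [0.0, 0.3],
}
head = [0.0, 0.0]  # newly initialized task-specific classifier weights

# Step 2: a small task-specific dataset (1 = positive, 0 = negative)
train = [("good great", 1), ("bad awful", 0), ("great", 1), ("awful", 0)]
validation = [("good", 1), ("bad", 0)]

def sentence_vector(text):
    """Average the embeddings of the words in the text."""
    vecs = [embeddings[w] for w in text.split()]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(2)]

def predict(text):
    """Probability of the positive class from a logistic head."""
    z = sum(h * x for h, x in zip(head, sentence_vector(text)))
    return 1 / (1 + math.exp(-z))

# Steps 3 and 4: continue training with a small step size (assumed value),
# updating the head AND the embeddings
learning_rate = 0.1
for epoch in range(200):
    for text, label in train:
        words = text.split()
        x = sentence_vector(text)
        error = predict(text) - label  # gradient of the log loss w.r.t. z
        old_head = list(head)
        for d in range(2):
            head[d] -= learning_rate * error * x[d]
            for w in words:
                embeddings[w][d] -= learning_rate * error * old_head[d] / len(words)

# Step 5: evaluate on held-out examples
accuracy = sum((predict(t) > 0.5) == bool(y) for t, y in validation) / len(validation)
print(f"validation accuracy: {accuracy:.2f}")
print(embeddings["good"], embeddings["bad"])  # vectors have moved from their initial values
```

In practice the same loop is run by a deep learning framework over a transformer's parameters, but the structure is the same: start from existing weights, train on task data with a conservative learning rate, then check performance on a validation set.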

### Benefits of Fine-Tuning

- **Improved Performance**: Fine-tuning allows the model to adapt its pre-learned language representations to the specific lexical and syntactical characteristics of a domain or task, often leading to improved performance compared to using a pre-trained model directly.

- **Efficiency**: Because the model has already learned a lot of general language knowledge during pre-training, fine-tuning on a specific task requires relatively less data and computational resources compared to training a model from scratch.

- **Flexibility**: This approach is flexible and can be applied across different tasks (e.g., text classification, question answering, named entity recognition) and domains (e.g., finance, healthcare, law) by simply changing the task-specific dataset used for fine-tuning.



### Summary

Fine-tuning contextualized embeddings and transformer models on specific domains or tasks is a powerful technique in NLP. It leverages the broad language understanding acquired during pre-training and specializes the model to perform well on tasks that require more specific knowledge or understanding.

## Conclusion


Embeddings have emerged as fundamental to advancing artificial intelligence, enabling machines to process and understand the vast complexities of human language, visual content, and sound. By continually refining these representations and tailoring them to specific applications, we unlock new potentials for AI to interact with the world in increasingly sophisticated and intuitive ways. The journey of embeddings, from simple vector representations to complex, context-aware models, illustrates the ongoing evolution of machine intelligence and its boundless future prospects.

## References

- For the pre-embedding techniques: [Step by Step Guide to Master NLP – Word Embedding and Text Vectorization](https://www.analyticsvidhya.com/blog/2021/06/part-5-step-by-step-guide-to-master-nlp-text-vectorization-approaches/#)