# Embeddings

![Embeddings](./images/6-embeddings-model.png)


## What are embeddings?
- numerical vectors.
- computers and models cannot interpret information in its raw format; they require numerical data as input
- vectors are numerical values that represent information in a multi-dimensional space.


- numerical representations of real-world objects that machine learning (ML) and artificial intelligence (AI) systems use to understand complex knowledge domains
- algorithms understand that the difference between 2 and 3 is 1, indicating a close relationship between 2 and 3 as compared to 2 and 100
- a bird-nest and a lion-den are analogous pairs, while day-night are opposite terms
- embeddings convert real-world objects into complex mathematical representations that capture inherent properties and relationships between real-world data
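
The idea of "closeness" can be made concrete with a small sketch. The three-dimensional vectors below are invented values chosen for illustration, not output from a real embedding model, and `euclideanDistance` is a hypothetical helper.

```javascript
// Hypothetical 3-dimensional vectors; the values are invented for illustration only.
const vectors = {
    'bird-nest': [0.9, 0.1, 0.3],
    'lion-den': [0.8, 0.2, 0.3],
    'day': [0.1, 0.9, 0.7],
};

// Straight-line (Euclidean) distance between two vectors of equal length.
function euclideanDistance(a, b) {
    return Math.sqrt(a.reduce((sum, value, i) => sum + (value - b[i]) ** 2, 0));
}

const nearPair = euclideanDistance(vectors['bird-nest'], vectors['lion-den']);
const farPair = euclideanDistance(vectors['bird-nest'], vectors['day']);

// Related concepts sit closer together than unrelated ones.
console.log(nearPair < farPair); // true
```

The same reasoning applies to plain numbers: `Math.abs(2 - 3)` is smaller than `Math.abs(2 - 100)`.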

## Use cases
1. Q&A
- embedding generation paired with a vector database allows you to find close matches between questions and content in a knowledge repository
2. Personalized recommendations
- use embeddings to find vacation destinations, colleges, vehicles, or other products based on the criteria provided by the user
- this could take the form of a simple list of matches, or you could then use an LLM to process each recommendation and explain how it satisfies the user’s criteria
- you could also use this approach to generate custom “10 best” articles for a user based on their specific needs.
3. Data management
- if you have data sources that don’t map cleanly to each other, but you do have text content that describes each data record, you can use embeddings to identify potential duplicate records
4. Content grouping
- use embeddings to help facilitate grouping similar content into categories that you might not know ahead of time

## Benefits
### 1. Reduce data dimensionality
- in data science, the term dimension typically refers to a feature or attribute of the data
- higher-dimensional data in AI refers to datasets with many features or attributes that define each data point
- an image can be considered high-dimensional data because each pixel color value is a separate dimension
- embeddings reduce the number of dimensions by identifying commonalities and patterns between various features. This consequently reduces the computing resources and time required to process raw data
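
A rough sense of scale, as a sketch: the image size and the embedding size below are assumed values picked for illustration, and `rawImageDimensions` is a hypothetical helper.

```javascript
// Each pixel colour value counts as one dimension of the raw input.
function rawImageDimensions(width, height, channels) {
    return width * height * channels;
}

const rawDimensions = rawImageDimensions(224, 224, 3); // assumed RGB image size
const embeddingDimensions = 512; // assumed embedding length, for illustration

console.log(rawDimensions); // 150528
console.log(rawDimensions / embeddingDimensions); // 294
```

Under these assumptions the embedding is roughly 294 times smaller than the raw pixel vector.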

### 2. Train large language models
- clean the training data from irregularities affecting model learning
- repurpose pre-trained models by adding new embeddings for transfer learning

## Types of embeddings
### 1. Image embeddings
- object detection, image recognition, and other visual-related tasks
Example: `Tesla Vision`
### 2. Text embeddings
- natural language processing software can more accurately understand the context and relationships of words
Example: `ChatGPT`, `Google Search`
### 3. Audio embeddings
- extract patterns identifying genre, words, beats
Example: `Shazam`
### 4. Knowledge graph embeddings
Example: recommendation engines, such as the Amazon or eMAG recommended lists


## Example

### Movie listing (human)
- The Conference (Horror, 2023, Movie)
- Upload (Comedy, 2023, TV Show, Season 3)
- Tales from the Crypt (Horror, 1989, TV Show, Season 7)
- Dream Scenario (Horror-Comedy, 2023, Movie)

- beyond the year of release, a model has no concept of genre or of movie/show/season
- embedding vectors encode non-numerical data into a series of values that ML models can understand and relate

### Movie listing (parsed)
- The Conference (1.2, 2023, 20.0)
- Upload (2.3, 2023, 35.5)
- Tales from the Crypt (1.2, 1989, 36.7)
- Dream Scenario (1.8, 2023, 20.0)

- a model could find that The Conference and Tales from the Crypt share the same genre value, `1.2`.
- the third value encodes format and season: Upload (`35.5`) and Tales from the Crypt (`36.7`) sit close together as TV shows, while The Conference and Dream Scenario share `20.0` as movies
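
The comparisons above can be sketched in code. The vectors are the parsed values from the listing; `closestByFeature` is a hypothetical helper that looks at one feature at a time.

```javascript
// Parsed listing: [genre, year, format/season]
const titles = [
    { name: 'The Conference', vector: [1.2, 2023, 20.0] },
    { name: 'Upload', vector: [2.3, 2023, 35.5] },
    { name: 'Tales from the Crypt', vector: [1.2, 1989, 36.7] },
    { name: 'Dream Scenario', vector: [1.8, 2023, 20.0] },
];

// For one feature index, return the other title whose value is closest to the target's.
function closestByFeature(target, featureIndex) {
    const gapTo = (t) => Math.abs(t.vector[featureIndex] - target.vector[featureIndex]);
    return titles
        .filter((t) => t.name !== target.name)
        .reduce((best, t) => (gapTo(t) < gapTo(best) ? t : best));
}

const conference = titles[0];
console.log(closestByFeature(conference, 0).name); // Tales from the Crypt (same genre value)
console.log(closestByFeature(conference, 2).name); // Dream Scenario (same format value)
```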

## Graphic representation
![Retrieval Flow](./images/6-embeddings-vectors.png)

## How embeddings work
- `one-hot encoding` maps categorical values to numerical vectors that a model can learn from

### Example
#### Human view
| Fruit   | Price |
| ------- | ----- |
| Apple   |  3.0  |
| Orange  |  5.0  |
| Pear    |  10.0 |

####  `one-hot encoding` view
| Apple  | Orange | Pear | Price |
| ------ | ------ | ---- | ----- |
|   1    |   0    |  0   |  3.0  |
|   0    |   1    |  0   |  5.0  |
|   0    |   0    |  1   |  10.0 |

- [1,0,0,3.00], [0,1,0,5.00], [0,0,1,10.00]

- as more categories are added, the vectors grow wider and become sparsely populated, mostly with zeros
- embeddings keep vectors at a manageable size as the number of input features grows
- dimensionality reduction allows embeddings to retain information that ML models use to find similarities and differences from input data
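
A minimal sketch of one-hot encoding, using the three categories and prices from the one-hot table. `oneHotEncode` is a hypothetical helper name.

```javascript
// Input rows: one category and one numeric feature each.
const rows = [
    { fruit: 'Apple', price: 3.0 },
    { fruit: 'Orange', price: 5.0 },
    { fruit: 'Pear', price: 10.0 },
];

// Category list in first-seen order; each category becomes one column.
const categories = [...new Set(rows.map((row) => row.fruit))];

// A 1 in the column of the row's category, 0 elsewhere, followed by the price.
function oneHotEncode(row) {
    const flags = categories.map((category) => (category === row.fruit ? 1 : 0));
    return [...flags, row.price];
}

const encoded = rows.map(oneHotEncode);
console.log(JSON.stringify(encoded)); // [[1,0,0,3],[0,1,0,5],[0,0,1,10]]
```

Every new category adds one more column, which is why the representation becomes wide and sparse as categories grow.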

## Embedding models
- embedding models are algorithms trained to encapsulate information into dense representations in a multi-dimensional space

Example models: `Principal component analysis`, `Singular value decomposition`, `Word2Vec`, `BERT`

### How are embedding models created
- neural networks are used to create embedding models
- one of the hidden layers learns how to factorize input features into vectors. This occurs before feature processing layers

1. Engineers feed the neural network with some vectorized samples prepared manually.
2. The neural network learns from the patterns discovered in the sample and uses the knowledge to make accurate predictions from unseen data.
3. Occasionally, engineers may need to fine-tune the model to ensure it distributes input features into the appropriate dimensional space. 
4. Over time, the embeddings operate independently, allowing the ML models to generate recommendations from the vectorized representations. 
5. Engineers continue to monitor the performance of the embedding and fine-tune with new data.
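
One way to picture the hidden layer described above: multiplying a one-hot input by a weight matrix selects a single row of that matrix, and that row is the dense vector for the category. The sketch below assumes three categories and a 2-dimensional embedding. The weight values are invented; a real network would learn them during training.

```javascript
// Hypothetical learned weights: one row per category, one column per embedding dimension.
const weights = [
    [0.12, -0.48], // Apple
    [0.1, -0.51],  // Orange
    [-0.73, 0.26], // Pear
];

// Multiply a one-hot row vector by the weight matrix.
function embed(oneHot) {
    const dims = weights[0].length;
    const result = new Array(dims).fill(0);
    oneHot.forEach((flag, row) => {
        for (let col = 0; col < dims; col++) {
            result[col] += flag * weights[row][col];
        }
    });
    return result;
}

// The one-hot vector for Orange picks out the second row of the matrix.
console.log(embed([0, 1, 0]));
```

With these made-up weights, Apple and Orange land near each other while Pear is far away, which is the kind of structure training is meant to discover.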

# How embeddings fit into our current journey

We can use `embedding models` to generate the `embedding vectors` that we place in a `vector store`. **This is the offline data preparation step**

![Data flow](./images/6-embeddings-data-flow.jpg)

At a high level this component will be a part of the following:

![High level flow](./images/6-embeddings-high-level.png)

## Import dependencies

In [1]:
var BedrockEmbeddings = require('@langchain/community/embeddings/bedrock').BedrockEmbeddings;
var fs = require('fs');


## Instantiate the `embeddings model` client

In [2]:
var embeddingsClient = new BedrockEmbeddings({
    model:'amazon.titan-embed-text-v2:0',
    region:'us-east-1'
});

## Load data into memory

- also remove empty lines, basic data cleanup step

In [3]:
var embeddingData = [];
var inputData = fs.readFileSync('/workspace/packages/llm/src/notebooks/data/6-letter.txt', 'utf-8')
    .split('\n')
    .filter(row => row.trim() !== ''); // remove empty rows

## Create embedding container

- will hold the original text and the equivalent embedding returned by the embedding model

In [4]:
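class EmbedItem {
    constructor(text) {
        this.text = text;
        this.embedding = null; // populated asynchronously once the model responds
        this.ready = this.init(); // promise that resolves when the embedding is set
    }
    init() {
        // embedQuery returns a promise, so the embedding is not available immediately
        return embeddingsClient.embedQuery(this.text).then((embedding) => {this.embedding = embedding});
    }
}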

## Generate embeddings

In [5]:
for (let text of inputData) {
   embeddingData.push(new EmbedItem(text));
}

33
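
Because `embedQuery` returns a promise, the embeddings are only filled in once the requests resolve. Below is a minimal sketch of collecting every result before moving on. `embedAll` is a hypothetical helper, and `fakeClient` is a stand-in with the same `embedQuery(text)` shape so the snippet runs without AWS credentials; with the real client you would pass `embeddingsClient` instead.

```javascript
// Hypothetical helper: embed every text and resolve once all vectors are available.
async function embedAll(client, texts) {
    const vectors = await Promise.all(texts.map((text) => client.embedQuery(text)));
    return texts.map((text, i) => ({ text, embedding: vectors[i] }));
}

// Stand-in client for illustration only; it returns a fake 2-dimensional vector
// made of the character count and the word count.
const fakeClient = {
    embedQuery: async (text) => [text.length, text.split(' ').length],
};

const embedAllDemo = embedAll(fakeClient, ['hello world', 'embeddings']);
embedAllDemo.then((items) => console.log(items));
```

`Promise.all` starts all requests at once, which may not suit large data sets or rate-limited services. LangChain embeddings classes also expose `embedDocuments(texts)` for embedding a list of texts in one call.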

## Embeddings output

In [None]:
console.log(embeddingData);

## Similarity

- we now have embeddings generated from text
- we can now see how they relate to each other: how similar a given input is to each item in the data set
- there are various algorithms that can compute this for a dataset of `feature vectors`
- one of them is `Cosine similarity`
- more details [here](https://en.wikipedia.org/wiki/Cosine_similarity)

### Cosine Similarity
![Cosine similarity](./images/6-embeddings-cosine.png)
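
For two vectors $A$ and $B$ of length $n$, the standard definition of cosine similarity is:

$$
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

The result ranges from $-1$ to $1$: $1$ means the vectors point in the same direction, $0$ means they are orthogonal, and $-1$ means they point in opposite directions.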

## Cosine similarity implementation

- we can leverage an existing library for advanced mathematical functions

```shell
yarn add mathjs
```

In [6]:
var math = require('mathjs');

function calculateCosineSimilarity(a, b) {
    return math.dot(a, b) / (math.norm(a) * math.norm(b));
}
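
As a sanity check, here is a dependency-free sketch of the same calculation, run on small vectors whose results can be verified by hand. `cosineSimilarityPlain` is a hypothetical name; it is meant to be equivalent in intent to `calculateCosineSimilarity`, not a replacement for it.

```javascript
// Cosine similarity without external libraries: dot product divided by the product of norms.
function cosineSimilarityPlain(a, b) {
    let dot = 0;
    let normA = 0;
    let normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const sameDirection = cosineSimilarityPlain([1, 2], [2, 4]);  // close to 1
const orthogonal = cosineSimilarityPlain([1, 0], [0, 1]);     // 0
const opposite = cosineSimilarityPlain([1, 2], [-1, -2]);     // close to -1

console.log(sameDirection.toFixed(6), orthogonal.toFixed(6), opposite.toFixed(6));
```

Results for parallel and opposite vectors can differ from 1 and -1 by a tiny floating-point error, so compare with a tolerance rather than strict equality.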

## Find most similar data point to given input using cosine similarity

In [7]:
var input = "looked hard at how we were working together as a team and asked our corporate employees";
var inputEmbedding = null;
// embedQuery is asynchronous; inputEmbedding stays null until the promise resolves
embeddingsClient.embedQuery(input).then((embedding) => {
    inputEmbedding = embedding;
    console.log(inputEmbedding);
});

Promise { <pending> }

[
  -0.035174266,    0.014952936,  -0.031765305,   -0.03610398,    0.011311548,
    0.03393464,    0.042147137, -0.0015688961,   0.022158237,    0.017122274,
  -0.075926825,  -0.0062755845,  -0.027426628, -0.0079800645,   0.0071665626,
  -0.016424987,    0.026806818,   0.010226878,  -0.025102338,    0.058262218,
  -0.031455398,    0.017122274,   0.017044798,  0.0008958203,    -0.03548417,
   -0.01402322,     0.04245704,  -0.022313189,   0.043386757, -0.00087160897,
    0.05082449,     0.04183723,  -0.011543976,    0.00573325,   0.0062368466,
    0.04679572,    0.019833947,   0.041527323,  -0.012163787,    0.005113439,
   0.022778047,     0.03331483, -0.0015882653,   0.015960129,   -0.013713314,
    0.03331483,   -0.021693379,   0.036878742,  -0.043696664,     -0.0514443,
   0.015805176,    0.042766947,  -0.031765305,    0.04121742,   -0.053923544,
      0.091732,    0.010691737,  -0.014488078,    0.06693957,   0.0085998755,
  -0.049274962,   -0.031455398,  -0.014720507,  0.0023436598, 

In [8]:
class ComparisonResult {
    constructor(text, similarity) {
        this.text = text;
        this.similarity = similarity;
    }
}

In [9]:
var cosineComparisons = [];

for (var embeddingItem of embeddingData) {
    var similarityScore = calculateCosineSimilarity(embeddingItem.embedding, inputEmbedding);
    
    cosineComparisons.push(new ComparisonResult(embeddingItem.text, similarityScore));
}

cosineComparisons.sort((a, b) => b.similarity - a.similarity); // sort in decreasing similarity order

for (let c of cosineComparisons) {
    console.log(c.similarity.toFixed(6), "\t", c.text);
}

0.284783 	 We also looked hard at how we were working together as a team and asked our corporate employees to come back to the office at least three days a week, beginning in May. During the pandemic, our employees rallied to get work done from home and did everything possible to keep up with the unexpected circumstances that presented themselves. It was impressive and I’m proud of the way our collective team came together to overcome unprecedented challenges for our customers, communities, and business. But, we don’t think it’s the best long-term approach. We’ve become convinced that collaborating and inventing is easier and more effective when we’re working together and learning from one another in person. The energy and riffing on one another’s ideas happen more freely, and many of the best Amazon inventions have had their breakthrough moments from people staying behind after a meeting and working through ideas on a whiteboard, or continuing the conversation on the walk back from a 