# Exercise 1: Introduction to Embeddings
This notebook will explore how we can represent text as vectors. 

We will use the [MiniLM](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model from [HuggingFace](https://huggingface.co/) to embed the text. This model takes text (can be a single sentence or a paragraph) as input and outputs a vector of numerical features. The model is trained on a large corpus of text, and the features are optimized to capture the meaning of the text. 

As shown in the figures below, the model embeds the sentences into similar vectors if the sentences have a similar meaning. It does not matter that the exact words in the sentences are different. This also makes the embeddings robust to typos. 

![](../../assets/embedding-vectors-1.png)
![](../../assets/embedding-vectors-2.png)

In this notebook, you will explore:
- How to measure the similarity between two embeddings.
- How to make embeddings.
- Which types of sentences are similar in the embedding space, and which are not.
- How to visualize the embedding space.


In [None]:
import random
import json
import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings
from llm_in_production.visualization_utils import plot_embeddings_interactively, plot_similarity_head_map
from llm_in_production.huggingface_utils import get_device

## Exercise 1a: Cosine similarity
Before we can start embedding sentences, we need to be able to measure the similarity between two embeddings.

This is most commonly done using the cosine similarity. The cosine similarity does not depend on the magnitudes of the two vectors, only on their angle: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude.

![](../../assets/cosine-similarity.png)

The cosine similarity is calculated as follows:

$$
\text{cosine similarity}(a, b) = \frac{{\sum_{i=1}^{n} a_i \cdot b_i}}{{\sqrt{\sum_{i=1}^{n} a_i^2} \cdot \sqrt{\sum_{i=1}^{n} b_i^2}}}
$$

In vector format, the cosine similarity is calculated as follows:
$$
\text{{cosine similarity}}(\mathbf{a}, \mathbf{b}) = \frac{{\mathbf{a} \cdot \mathbf{b}}}{{|\mathbf{a}| \cdot |\mathbf{b}|}}
$$
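
To see why the magnitudes do not matter, scale $\mathbf{a}$ by any positive constant $c$ in the vector form above. The factor cancels:

$$
\text{cosine similarity}(c\,\mathbf{a}, \mathbf{b}) = \frac{(c\,\mathbf{a}) \cdot \mathbf{b}}{|c\,\mathbf{a}| \cdot |\mathbf{b}|} = \frac{c\,(\mathbf{a} \cdot \mathbf{b})}{c\,|\mathbf{a}| \cdot |\mathbf{b}|} = \text{cosine similarity}(\mathbf{a}, \mathbf{b}), \quad c > 0
$$

For a negative $c$ the sign of the result flips, which corresponds to pointing the vector in the opposite direction.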

NumPy has some useful functions that we can use to calculate the cosine similarity in a fast and efficient way for large vectors:
- [np.dot](https://numpy.org/doc/stable/reference/generated/numpy.dot.html): Compute the dot product of two arrays.
- [np.linalg.norm](https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html): Compute the norm of a vector or matrix (the Euclidean norm by default for a vector).
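
As a quick illustration of the two functions on their own, here is a minimal sketch. The vectors are made-up values chosen so the arithmetic is easy to check by hand; combining the two calls into the cosine similarity is left to the exercise.

```python
import numpy as np

# Two small example vectors (illustrative values only).
a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Dot product: 3*1 + 4*2 = 11
dot_ab = np.dot(a, b)

# Euclidean (L2) norm, the default for 1-D input: sqrt(3^2 + 4^2) = 5
norm_a = np.linalg.norm(a)

print(dot_ab)  # 11.0
print(norm_a)  # 5.0
```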


In this exercise, you will implement the cosine similarity function.
We have created a function skeleton for you to fill in and some tests to check if your implementation is correct.


In [None]:
def cosine_similarity(a: np.ndarray, b: np.ndarray):
    """
    Compute the cosine similarity between two vectors.
    :param a: The first vector of shape (n_features,).
    :param b: The second vector of shape (n_features,).
    :return: The cosine similarity between the two vectors as numpy scalar (shape ()).
    """
    # YOUR CODE HERE START: compute the cosine similarity between the two vectors
    # YOUR CODE HERE END


assert np.isclose(cosine_similarity(np.array([1, 0]), np.array([1, 0])),1.), f"Expected 1 but got {cosine_similarity(np.array([1, 0]), np.array([1, 0]))}"
assert np.isclose(cosine_similarity(np.array([1, 0]), np.array([0, 1])),0.), f"Expected 0 but got {cosine_similarity(np.array([1, 0]), np.array([0, 1]))}"
assert np.isclose(cosine_similarity(np.array([1, 0]), np.array([-1, 0])), -1.), f"Expected -1 but got {cosine_similarity(np.array([1, 0]), np.array([-1, 0]))}"
assert np.isclose(cosine_similarity(np.array([1, 0]), np.array([0, -1])), 0.), f"Expected 0 but got {cosine_similarity(np.array([1, 0]), np.array([0, -1]))}"
assert np.isclose(cosine_similarity(np.array([1, 0]), np.array([1, 1])), 1.0 / np.sqrt(2)), f"Expected 1/np.sqrt(2) but got {cosine_similarity(np.array([1, 0]), np.array([1, 1]))}"
assert np.isclose(cosine_similarity(np.array([1, 0]), np.array([1, -1])), 1/np.sqrt(2)), f"Expected 1/np.sqrt(2) but got {cosine_similarity(np.array([1, 0]), np.array([1, -1]))}"


## How to make embeddings?
Now that we can measure the similarity between two embeddings, we can start exploring the embeddings.

In this notebook, we will use the [MiniLM](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model from [HuggingFace](https://huggingface.co/) to embed text. To make it easier to use, we have wrapped the model in a class called [HuggingFaceEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html) from LangChain. This class has two functions:
- [embed_query](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html#langchain.embeddings.huggingface.HuggingFaceEmbeddings.embed_query): Embed a single string.
- [embed_documents](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html#langchain.embeddings.huggingface.HuggingFaceEmbeddings.embed_documents): Embed a list of strings.

First, let's create our embedding function by running the cell below. This will download the model from the HuggingFace model hub and load it into memory. This can take a while the first time you run it. However, the model will be cached on your computer, so it will be much faster the next time you run it.

In [None]:
# Check whether an accelerator (such as a GPU) is available and, if so, use it.
device = get_device()
# Create the embedding function that will be used to embed the sentences.
embedding_func = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs={"device": device})

Now that we have created the embedding function, we can use it to embed sentences using the [embed_query](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html#langchain.embeddings.huggingface.HuggingFaceEmbeddings.embed_query) function. Run the cell below to see how it works.

In [None]:
sentence = "Bob just got home from work."
embedding = embedding_func.embed_query(sentence)
print(f"The type of embedding: '{type(embedding)}'")
print(f"The embedding has {len(embedding)} features.")
print(f"The first 10 features of the embedding: {embedding[:10]}")


In [None]:
sentence1 = "Bob just got home from work." 
sentence2 = "Bob just arrived home from day-job." # Same meaning as sentence1
sentence3 = "Cows are grazing in the field." # Totally different sentence

embedding1 = np.array(embedding_func.embed_query(sentence1))
embedding2 = np.array(embedding_func.embed_query(sentence2))
embedding3 = np.array(embedding_func.embed_query(sentence3))
print(f"The cosine similarity between sentences 1 and 2 is: {cosine_similarity(embedding1, embedding2)}")
print(f"The cosine similarity between sentences 1 and 3 is: {cosine_similarity(embedding1, embedding3)}")
print(f"The cosine similarity between sentences 2 and 3 is: {cosine_similarity(embedding2, embedding3)}")

In [None]:
sentences = [
    "Bob just got home from work.",
    "Bob just arrived home from day-job.",
    "Cows are grazing in the field.",
]
# We can also embed multiple sentences at once.
embeddings = np.array(embedding_func.embed_documents(sentences))
print(f"The type of embeddings: '{type(embeddings)}'")
print(f"The embeddings have a shape of: {embeddings.shape} (n_samples, n_features).")


print(f"The cosine similarity between sentences 1 and 2 is: {cosine_similarity(embeddings[0], embeddings[1])}")
print(f"The cosine similarity between sentences 1 and 3 is: {cosine_similarity(embeddings[0], embeddings[2])}")
print(f"The cosine similarity between sentences 2 and 3 is: {cosine_similarity(embeddings[1], embeddings[2])}")

Rerun the cell above a couple of times, changing the sentences slightly each time. For example, what happens if you:
- Change the name from Bob to Alice?
- Make the sentences plural?
- Introduce a typo?

## Exercise 1b: Similarity matrix
We now know how to use the [embed_query](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html#langchain.embeddings.huggingface.HuggingFaceEmbeddings.embed_query) and [embed_documents](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html#langchain.embeddings.huggingface.HuggingFaceEmbeddings.embed_documents) functions to embed sentences. So, let's put them to the test by making a similarity matrix. 

A similarity matrix is a matrix that shows the similarity between all the sentences in a dataset. The similarity matrix is a square matrix of shape (n_samples, n_samples). The similarity matrix is symmetric, so the similarity between sentences i and j is the same as the similarity between sentences j and i. 
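
The properties above can be checked on a tiny example. The sketch below uses three made-up 2D vectors instead of real model embeddings, and a vectorized shortcut (normalize the rows, then multiply) rather than the loop this exercise asks for. The names `toy_embeddings` and `toy_similarity` are only for illustration.

```python
import numpy as np

# Three made-up 2D "embeddings" (illustrative values, not model output).
toy_embeddings = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 2.0],
])

# Normalize each row to unit length; the dot product of two unit vectors
# equals their cosine similarity.
unit = toy_embeddings / np.linalg.norm(toy_embeddings, axis=1, keepdims=True)
toy_similarity = unit @ unit.T

print(toy_similarity.shape)                            # (3, 3)
print(np.allclose(toy_similarity, toy_similarity.T))   # True: symmetric
print(np.allclose(np.diag(toy_similarity), 1.0))       # True: each vector matches itself
```

The diagonal is always 1 because every vector has the same orientation as itself.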

In this exercise, you have to do the following:
- Embed all the sentences.
- Calculate the similarity between each sentence pair and store it in the similarity matrix (a simple for loop will do the trick).

After you have implemented the code and run the cell, ask yourself the following questions:
- Which sentences are similar and dissimilar?
- How does the similarity matrix show this?
- What happens if you change the sentences slightly?
- Which type of changes cause the similarity to change the most?
- Which type of change causes the similarity to change the least?


In [None]:
sentences = [
    "feline friends say",
    "canine companions say",
    "Bovine buddies said",
    "The cat is walking in the bedroom",
    "The kittens are in the bedroom",
    "A dog was running across the kitchen",
    "The puppies were running around in the kitchen",
]

# Here we embed all the sentences.
embeddings = embedding_func.embed_documents(sentences)
# We convert the embeddings to a numpy array/matrix of shape (n_samples, n_features).
embeddings = np.array(embeddings)

# Here we initialize the similarity matrix.
similarity_matrix = np.zeros((len(sentences), len(sentences)))

# YOUR CODE HERE START: fill in the similarity matrix using cosine_similarity function
# YOUR CODE HERE END

plot_similarity_head_map(similarity_matrix, sentences, plot_title="Similarity matrix")

## PyData Amsterdam dataset
So far, we have only been working with toy example data. Let's try to apply what we have learned to a real-world dataset. We will use the [PyData Amsterdam](https://amsterdam.pydata.org/) dataset for this. This dataset contains the titles, abstracts, and descriptions of all the talks at the PyData Amsterdam conference in 2023. The dataset is stored in the [pydata.json](pydata.json) file.
The data contains all kinds of information about the venue, the speakers, the talks, etc. For this exercise, we will only use the talk data, which is stored in the `talks` field. The `talks` field is a list of talk objects, each with the following attributes:

- title: The title of the talk. Typically, a single sentence.
- abstract: The abstract of the talk. A short description that gets people to click on the talk page.
- description: The description of the talk. A longer description of the talk that describes the talk's content, the audience, the prerequisites, etc. At least, that is the general idea. In practice, not everybody's abstract and description adhere to these rules.
- speakers: A list of speakers that gave the talk.
- duration: The duration of the talk.
- date: The date of the talk.
- room: The room where the talk was given.
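
To make the structure concrete, here is a minimal sketch that parses a made-up record with the attributes listed above. All values are placeholders, and the exact types in the real file (for example, whether `speakers` is a list of strings and how `duration` and `date` are formatted) are assumptions.

```python
import json

# A made-up record with placeholder values; the real file has many talks
# and may format fields such as speakers, duration and date differently.
sample_json = """
{
  "talks": [
    {
      "title": "Placeholder title",
      "abstract": "Placeholder abstract.",
      "description": "Placeholder description.",
      "speakers": ["Placeholder speaker"],
      "duration": "placeholder",
      "date": "placeholder",
      "room": "placeholder"
    }
  ]
}
"""

sample_talks = json.loads(sample_json)["talks"]
first_talk = sample_talks[0]

print(sorted(first_talk.keys()))
print(first_talk["title"])
```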

We will mainly focus on the title, abstracts and descriptions of the talks. Let's load the data.

In [None]:
with open("pydata.json", "r") as f:
    talks = json.load(f)["talks"]
    titles = [talk["title"] for talk in talks]
    abstracts = [talk["abstract"] for talk in talks]
    descriptions = [talk["description"] for talk in talks]

This code will print some random talks from the dataset so we can see what it looks like. Change the seed to see different talks.

In [None]:
random.seed(42)
for _ in range(3):
    talk_idx = random.randrange(len(talks))  # randint's upper bound is inclusive and could go out of range
    
    print(f"Talk #{talk_idx}")
    print(f"Title: {titles[talk_idx]}")
    print(f"Abstract:")
    print(abstracts[talk_idx])
    print()
    print(f"Description:")
    print(descriptions[talk_idx])
    print("#" * 80)

## Exercise 1c: Embeddings space exploration
This exercise will explore the embedding space of the different talks. At PyData, talks are typically related to Python, data engineering, machine learning, LLMs, etc. So, we expect we can also find a similar grouping in the embedding space. In this exercise, you will do the following:
- We will embed some text using the `embedding_func`.
- We then cluster the embeddings using the [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) algorithm from [scikit-learn](https://scikit-learn.org/stable/).
- We then project the embeddings to a 2D space using [UMAP](https://umap-learn.readthedocs.io/en/latest/) from [umap-learn](https://umap-learn.readthedocs.io/en/latest/).
- Finally, we will plot the embeddings in an interactive plot using [plotly](https://plotly.com/) to explore the embedding space and see if our hypothesis is correct.

You don't need to know how the KMeans and UMAP algorithms work. We have already implemented them for you. If you are interested in how we did it, check out the `plot_embeddings_interactively` function in the [llm_in_production/visualization_utils.py](../../llm_in_production/visualization_utils.py) file.
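
If you would like a feel for the cluster-then-project steps without opening the helper, the sketch below runs them on random stand-in data. It is not the code in `plot_embeddings_interactively`: the data is synthetic, the parameter values are arbitrary, and PCA from scikit-learn is swapped in for UMAP to keep the example lightweight.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Random stand-in for real embeddings: 60 points in 16 dimensions,
# scattered around 3 randomly placed centers.
centers = rng.normal(scale=5.0, size=(3, 16))
fake_embeddings = np.vstack([c + rng.normal(size=(20, 16)) for c in centers])

# Step 1: assign each embedding to one of n_cluster groups.
n_cluster = 3
kmeans = KMeans(n_clusters=n_cluster, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(fake_embeddings)

# Step 2: project to 2D for plotting. PCA is used here only as a simple
# stand-in; the notebook's helper is described as using UMAP.
points_2d = PCA(n_components=2).fit_transform(fake_embeddings)

print(cluster_labels.shape)  # (60,)
print(points_2d.shape)       # (60, 2)
```

Each row of `points_2d` can then be drawn as a point and colored by its entry in `cluster_labels`.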

There are still some open questions and tasks left:
- How many clusters should we use?
- Which text should we embed? The title, abstract, description, or a combination of them?
- Embedding the text.
- Validating if our hypothesis is correct.

In the next set of exercises, you will have to answer these questions.

#### Part i: title embeddings
In this sub-exercise, you will embed the titles of the talks and explore the embedding space. 
Your tasks are:
- Embed the `titles` using the `embedding_func`.
- Play around with the number of clusters and see which number of clusters gives the best results.
- Visually inspect the embedding space and see if you recognize any clusters (e.g., Python, Data Engineering, LLMs, etc.).

In [None]:
n_cluster = 6
# YOUR CODE HERE START: embed the titles using the embedding_func
# YOUR CODE HERE END

assert isinstance(title_embeddings, np.ndarray), f"Expected numpy array but got {type(title_embeddings)}"
assert len(title_embeddings) == len(titles), f"Expected {len(titles)} embeddings but got {len(title_embeddings)}"

plot_embeddings_interactively(
    embeddings=title_embeddings, 
    titles=titles, 
    plot_title="Title embeddings",
    n_cluster=n_cluster,
)

#### Part ii: abstract embeddings
In this sub-exercise, you will embed the abstracts of the talks and explore the embedding space. 
Your tasks are:
- Embed the `abstracts` using the `embedding_func`.
- Play around with the number of clusters and see which number of clusters gives the best results.
- Visually inspect the embedding space and see if you recognize any clusters (e.g., Python, Data Engineering, LLMs, etc.).

In [None]:
n_cluster = 6
# YOUR CODE HERE START: embed the abstracts using the embedding_func
# YOUR CODE HERE END

assert isinstance(abstract_embeddings, np.ndarray), f"Expected numpy array but got {type(abstract_embeddings)}"
assert len(abstract_embeddings) == len(titles), f"Expected {len(titles)} embeddings but got {len(abstract_embeddings)}"

plot_embeddings_interactively(
    embeddings=abstract_embeddings, 
    titles=titles, 
    plot_title="Abstract embeddings",
    n_cluster=n_cluster
)

#### Part iii: description embeddings
In this sub-exercise, you will embed the descriptions of the talks and explore the embedding space. 
Your tasks are:
- Embed the `descriptions` using the `embedding_func`.
- Play around with the number of clusters and see which number of clusters gives the best results.
- Visually inspect the embedding space and see if you recognize any clusters (e.g., Python, Data Engineering, LLMs, etc.).

In [None]:
n_cluster = 6
# YOUR CODE HERE START: embed the descriptions using the embedding_func
# YOUR CODE HERE END
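assert isinstance(description_embeddings, np.ndarray), f"Expected numpy array but got {type(description_embeddings)}"
assert len(description_embeddings) == len(titles), f"Expected {len(titles)} embeddings but got {len(description_embeddings)}"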

plot_embeddings_interactively(
    embeddings=description_embeddings, 
    titles=titles, 
    plot_title="Description embeddings",
    n_cluster=n_cluster
)

#### Part iv: combined embeddings
In this sub-exercise, you will make a combined text of the title, abstract, and description and embed it.
Your tasks are:
- Combine the `title`, `abstract`, and `description` into one meaningful text. Note, it may be helpful to keep a reference to the different parts of the text, e.g. f"title: {title}...".
- Embed the `combined_text` using the `embedding_func`.
- Play around with the number of clusters and see which number of clusters gives the best results.
- Visually inspect the embedding space and see if you recognize any clusters (e.g., Python, Data Engineering, LLMs, etc.).

In [None]:
n_cluster = 6
combined_texts = []

for talk_idx in range(len(talks)):
    title = titles[talk_idx]
    abstract = abstracts[talk_idx]
    description = descriptions[talk_idx]
    combined_text = ...
    # YOUR CODE HERE START: combine the title, abstract and description into one meaningful text
    # YOUR CODE HERE END
    combined_texts.append(combined_text)
    
combined_text_embeddings = ...
# YOUR CODE HERE START: embed the combined_text using the embedding_func and convert it to a numpy array
# YOUR CODE HERE END

assert isinstance(combined_text_embeddings, np.ndarray), f"Expected numpy array but got {type(combined_text_embeddings)}"
assert len(combined_text_embeddings) == len(titles), f"Expected {len(titles)} embeddings but got {len(combined_text_embeddings)}"

plot_embeddings_interactively(
    embeddings=combined_text_embeddings, 
    titles=titles, 
    plot_title="Combined text embeddings",
    n_cluster=n_cluster
)