<a href="https://colab.research.google.com/github/badlogic/genai-workshop/blob/main/03_unsupervised_learning_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Text Embeddings
Text embedding models are a type of unsupervised machine learning model. They learn how to take a piece of text and transform it into a vector in what's called a latent space. This process is commonly referred to as (text) embedding.

By transforming texts into vectors, we can measure their similarity mathematically by calculating their distances to each other in the latent space with some distance metric or similarity measure. The beauty of this approach is that the transformation implicitly captures the meaning and semantic relations of the texts' contents. It thereby does away with many of the problems older natural language processing systems had.

This has an enormous number of applications, both directly and indirectly as part of bigger systems (including LLMs!).

To illustrate the power of embeddings, we are going to use a pre-trained text embedding model called `jina-embeddings-v2-base-de` from Hugging Face. The specific model can handle both English and German language texts. In fact, multi-linguality is a property of many popular text embedding models.

We will embed textual information about movies (title and description) and implement two popular "downstream" tasks for embeddings:

1. Finding similar movies (naive recommender)
2. Finding movies that match a user query (naive dense retrieval)

## Setup and Hugging Face login
Access to this specific model must be granted by the model creators. Visit the model page at [https://huggingface.co/jinaai/jina-embeddings-v2-base-de](https://huggingface.co/jinaai/jina-embeddings-v2-base-de) while logged into Hugging Face, and ask for access. Once approved, we can use the `notebook_login` function from the `huggingface_hub` module to log into Hugging Face from within this notebook, so we can download and use the model.

In [1]:
!pip install huggingface_hub
from huggingface_hub import notebook_login
import torch
import numpy as np
import requests
from transformers import AutoModel
from numpy.linalg import norm

notebook_login()




## Loading the embedding model

Now that we are logged in, we can use the Hugging Face `AutoModel` class to download and load the model. We refer to the model via its id, which is composed of the author name and model name, e.g. `jinaai/jina-embeddings-v2-base-de`.

Alternatively, we can also point the `AutoModel` class to a local directory that contains an already downloaded model.

In [2]:
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-de', trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


## Convenience functions

Next, we define a few convenience functions.

We need a way to calculate the similarity between two vectors. The `cos_sim` function allows us to do just that by calculating the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between two vectors. A similarity of `1` means the vectors point in the same direction, a similarity of `0` means they are orthogonal (unrelated), and `-1` means they point in opposite directions. For our text embeddings, we'll usually get a value between `0` and `1`, though slightly negative values can occur.

In [3]:
cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
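
To get a feel for the values `cos_sim` produces, here is a small sketch with hand-made toy vectors (not model output). It repeats the definition as a regular function so the snippet runs on its own; the variable names are made up for illustration.

```python
import numpy as np
from numpy.linalg import norm

# Same formula as the notebook's cos_sim, written as a regular function.
def cos_sim(a, b):
    return (a @ b.T) / (norm(a) * norm(b))

a = np.array([1.0, 2.0, 0.0])

same_direction = cos_sim(a, 3 * a)                  # scaled copy of a
orthogonal = cos_sim(a, np.array([0.0, 0.0, 5.0]))  # shares no direction with a
opposite = cos_sim(a, -a)                           # a flipped around

print(same_direction, orthogonal, opposite)
```

The three results come out at roughly `1`, `0`, and `-1`. Note that the scaled copy still scores about `1`: cosine similarity only looks at the direction of the vectors, not at their length.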

We also create a function called `embed_local`, which uses the locally loaded jina model to embed the texts passed to it. You can pass a single string or a list of strings, and will get back a NumPy array or a list of NumPy arrays that works with `cos_sim`.

We also have a global variable `embed` which points to the currently "active" embedding function we want to use. We can assign another function to this variable, to swap out the embedding function globally. This will come in handy later.

In [4]:
def embed_local(texts):
  return model.encode(texts)

embed = embed_local
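
To illustrate the swapping idea, here is a minimal sketch of a stand-in embedding function. `embed_toy` is a hypothetical helper, not part of the notebook: it simply counts the letters a-z, so it captures no meaning at all, but it accepts the same inputs as `embed_local` and needs no model download.

```python
import numpy as np

def embed_toy(texts):
    """Hypothetical stand-in embedder: one count per letter a-z, no model needed."""
    single = isinstance(texts, str)
    if single:
        texts = [texts]
    vectors = np.zeros((len(texts), 26), dtype=np.float32)
    for row, text in enumerate(texts):
        for ch in text.lower():
            if "a" <= ch <= "z":
                vectors[row, ord(ch) - ord("a")] += 1
    # Mirror the calling convention: one vector for a string, a matrix for a list.
    return vectors[0] if single else vectors

# Swap the "active" embedding function globally.
embed = embed_toy

print(embed("abc").shape)
print(embed(["abc", "hello world"]).shape)
```

Any code that calls `embed(...)` now uses the toy function. If you try this in the notebook, assign `embed = embed_local` again before continuing, otherwise the results further down will be meaningless.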

## Movie dataset

We define a simple dataset inline. It consists of 6 movies, each composed of a title and a plot. These were fetched manually from IMDb. We could of course also load such a dataset from a file or via an API, but we keep things simple here.

In [5]:
movies = [
  {"title": "Star Trek: Generations", "plot": "With the help of long presumed dead Captain Kirk, Captain Picard must stop a deranged scientist willing to murder on a planetary scale in order to enter a space matrix."},
  {"title": "Star Wars: Episode V - The Empire Strikes Back", "plot": "After the Rebels are overpowered by the Empire, Luke Skywalker begins his Jedi training with Yoda, while his friends are pursued across the galaxy by Darth Vader and bounty hunter Boba Fett."},
  {"title": "Pulp Fiction", "plot": "The lives of two mob hitmen, a boxer, a gangster and his wife, and a pair of diner bandits intertwine in four tales of violence and redemption."},
  {"title": "Fight Club", "plot": "An insomniac office worker and a devil-may-care soap maker form an underground fight club that evolves into much more."},
  {"title": "Ghostbusters", "plot": "Three parapsychologists forced out of their university funding set up shop as a unique ghost removal service in New York City, attracting frightened yet skeptical customers."},
  {"title": "Ghostbusters II", "plot": "The discovery of a massive river of ectoplasm and a resurgence of spectral activity allows the staff of Ghostbusters to revive the business."}
]

## Creating embedding vectors for each movie
We embed both the title and plot as a single string for each movie, using the function pointed to by `embed` (which is `embed_local`).

Each vector generated by the model has a fixed length of 768 dimensions. This is a property of the jina model. Other embedding models might generate more or fewer dimensions.

In [12]:
embeddings = embed([movie["title"] + " " + movie["plot"] for movie in movies])
print("Dimensions of first vector: " + str(embeddings[0].shape[0]))
print("First 10 dimensions of the first vector:\n" + str(embeddings[0][:10]))
print("Dimensions of second vector: " + str(embeddings[1].shape[0]))
print("First 10 dimensions of the second vector:\n" + str(embeddings[1][:10]))

Dimensions of first vector: 768
First 10 dimensions of the first vector:
[-0.15949857 -0.2693168   0.12828872  0.03978242 -0.00434926 -0.18323022
 -0.05444287  0.08139315 -0.20358722 -0.19820127]
Dimensions of second vector: 768
First 10 dimensions of the second vector:
[-0.18870565  0.13675256 -0.12258591  0.03999804  0.00593293 -0.11011664
 -0.34029493  0.23738301 -0.23080073 -0.03576482]


Again, we cannot say precisely what each value/dimension in these vectors means. We do know, however, that together they encode the meaning and semantic relationships of the original text.

## Similarity search
We can now use those embeddings to find movies that are similar to a query. The query could be anything, like a user query, or the title and plot of another movie.

We can wrap this idea up in a function called `similarity_search`. Its parameters are as follows:

* `query`: a string or vector to find similar movies for
* `movies`: the list of movies
* `embeddings`: the list of vectors for each movie

The function first embeds the query text if it isn't already a vector. It then calculates the cosine similarity between the query vector and each movie vector, and stores the resulting similarity together with the corresponding movie in a list. Finally, the list is sorted in descending order by similarity and returned to the caller.

In [7]:
def similarity_search(query, movies, embeddings):
  query_embedding = embed(query) if isinstance(query, str) else query
  similarities = []
  for i in range(len(movies)):
    similarity = cos_sim(query_embedding, embeddings[i])
    similarities.append((movies[i], similarity))

  similarities.sort(key=lambda x: x[1], reverse=True)
  return similarities
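
The loop above is perfectly fine for six movies. For larger collections, the same ranking can be computed with a single matrix product. The following is a sketch under stated assumptions: `similarity_search_vectorized` is a hypothetical name, it expects an already embedded query so that it runs without the model, and the toy items and vectors are made up.

```python
import numpy as np

def similarity_search_vectorized(query_embedding, items, embeddings):
    """Hypothetical variant: rank items by cosine similarity with one matrix product."""
    matrix = np.asarray(embeddings, dtype=np.float64)
    query = np.asarray(query_embedding, dtype=np.float64)
    # Scale rows and query to unit length so dot products equal cosine similarities.
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = matrix_unit @ query_unit
    order = np.argsort(-scores)  # indices sorted by descending score
    return [(items[i], float(scores[i])) for i in order]

# Toy data: three items with 2-dimensional "embeddings".
toy_items = ["a", "b", "c"]
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

results = similarity_search_vectorized([1.0, 0.1], toy_items, toy_embeddings)
print(results)
```

The query vector points almost exactly along the first axis, so `"a"` ranks first, followed by `"c"` and then `"b"`. The return value has the same shape as that of `similarity_search`: a list of `(item, similarity)` tuples in descending order.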

## Naive movie recommender
Let's implement a movie recommender function. It takes as input:

* `movie`: the movie for which we want to get recommendations
* `movies`: the full list of movies (including the `movie`)
* `embeddings`: the embeddings of each movie

The function returns a list of tuples, each consisting of a movie and its similarity with the input `movie`. The list is sorted by similarity in descending order.



In [13]:
def recommend_movies(movie, movies, embeddings):
  movie_index = movies.index(movie)
  embedding = embeddings[movie_index]
  movies_to_search = [movies[i] for i in range(len(movies)) if i != movie_index]
  embeddings_to_search = [embeddings[i] for i in range(len(embeddings)) if i != movie_index]
  return similarity_search(embedding, movies_to_search, embeddings_to_search)

We can now output the list of most similar movies for each movie.

In [14]:
for i in range(len(movies)):
  print("Recommended movies for " + movies[i]["title"])
  recommended = recommend_movies(movies[i], movies, embeddings)
  for j in range(len(recommended)):
    print("\t" + recommended[j][0]["title"] + ": " + str(recommended[j][1]))
  print()

Recommended movies for Star Trek: Generations
	Star Wars: Episode V - The Empire Strikes Back: 0.25328436
	Pulp Fiction: 0.14811614
	Fight Club: 0.114247344
	Ghostbusters II: 0.10307089
	Ghostbusters: 0.05439057

Recommended movies for Star Wars: Episode V - The Empire Strikes Back
	Star Trek: Generations: 0.25328436
	Fight Club: 0.12428885
	Ghostbusters II: 0.05385293
	Pulp Fiction: 0.04211531
	Ghostbusters: -0.02177306

Recommended movies for Pulp Fiction
	Fight Club: 0.3838614
	Ghostbusters II: 0.1724575
	Ghostbusters: 0.16229264
	Star Trek: Generations: 0.14811614
	Star Wars: Episode V - The Empire Strikes Back: 0.04211531

Recommended movies for Fight Club
	Pulp Fiction: 0.3838614
	Ghostbusters: 0.17885996
	Star Wars: Episode V - The Empire Strikes Back: 0.12428885
	Star Trek: Generations: 0.114247344
	Ghostbusters II: 0.058754705

Recommended movies for Ghostbusters
	Ghostbusters II: 0.35342678
	Fight Club: 0.17885996
	Pulp Fiction: 0.16229264
	Star Trek: Generations: 0.05439057


## Naive dense retrieval
Retrieval is generally defined as finding relevant information like text, image, or audio documents for a given query. There are multiple ways to implement such retrieval systems, which are also known as "search engines". Here we'll focus on text retrieval.

Traditionally, a search engine had to first split up a user query into words (tokens), lowercase them (normalization), and remove endings (stemming). To capture synonyms, the query terms were also expanded, in the simplest case via a synonym dictionary, so that more relevant documents could be found.

The resulting expanded query terms were then matched against an [inverted index](https://en.wikipedia.org/wiki/Inverted_index) of documents. To build the index, the documents' texts also had to go through this processing pipeline.
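
The pipeline described above can be sketched in a few lines. This is a deliberately crude illustration, not how a production search engine works: the stemmer, the synonym dictionary, the sample documents, and all function names are assumptions made up for this example.

```python
from collections import defaultdict

# Toy synonym dictionary, an assumption for illustration only.
SYNONYMS = {"film": ["movie"], "spaceship": ["starship"]}

def naive_stem(token):
    """Very crude stemmer: strips a few common English endings."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize on whitespace, strip punctuation, lowercase, and stem."""
    tokens = [t.strip(".,!?:;\"'").lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t]

def expand(tokens):
    """Append synonyms from the toy dictionary to the query terms."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(naive_stem(s) for s in SYNONYMS.get(token, []))
    return expanded

def build_inverted_index(documents):
    """Map each processed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

documents = ["Two starships meet in space.", "A boxer fights in the ring."]
index = build_inverted_index(documents)

query_terms = expand(preprocess("Spaceships"))
matches = set().union(*(index.get(term, set()) for term in query_terms))
print(query_terms, matches)
```

The query "Spaceships" is reduced to the term `spaceship` and expanded with `starship`. Only the expansion lets it match the first document, which never mentions the word "spaceship". Without the synonym dictionary, the search would come back empty.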

To find documents matching the query, similarity measures like [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) or [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) were used. These measures are actually related to the cosine similarity measure discussed above. Document representations in the inverted index are equivalent to sparse vectors, where a vector may not have an entry for each dimension of the vector space. E.g. the vector for the text "The dog is brown" only has entries for "the", "dog", "is", and "brown", but not for "house" or "car".

These sparse retrieval systems faced many challenges, the biggest being that they work on the word level instead of on the level of meaning and semantic relatedness. They were also exceptionally hard to get working in multilingual settings.

"Dense retrieval" is based on vectors generated for texts by embedding models. The "dense" part stems from the fact that the transformation does away entirely with words and instead maps the text to semantic (albeit unobservable) concepts, so every dimension of the resulting vector carries a value. We no longer have to care about things like stemming or synonyms. These issues are automagically handled by the embedding model. These models also lend themselves well to multilingual retrieval, as semantic concepts transfer well between languages, as opposed to lexical words.
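
To make the sparse vector idea concrete, here is a minimal sketch that stores such a vector as a dictionary of term counts and computes the cosine similarity on it. It uses raw counts rather than TF-IDF or BM25 weights to keep things short; the function names and the sample sentences other than "The dog is brown" are made up for illustration.

```python
import math
from collections import Counter

def sparse_vector(text):
    """Term-count vector stored as a dict; terms that don't occur have no entry."""
    return Counter(text.lower().split())

def sparse_cos_sim(a, b):
    """Cosine similarity over two sparse dict vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

dog = sparse_vector("The dog is brown")
car = sparse_vector("The car is red")
house = sparse_vector("A big house")

print(dict(dog))                    # entries only for the, dog, is, brown
print(sparse_cos_sim(dog, car))     # 0.5
print(sparse_cos_sim(dog, house))   # 0.0
```

The two sentences about the dog and the car score `0.5` purely because they share the words "the" and "is", which says nothing about their meaning. This is the word-level matching that weighting schemes like TF-IDF try to compensate for by down-weighting very common terms.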

Our `similarity_search` function does just that for movie titles and plots! We've already seen how we can use dense retrieval to find similar movies for a given movie.

We can use the exact same principle to find movies that best match a user query, like those entered in a search engine. Dense retrieval simplifies implementing search engines tremendously. Let's try it out.

In the code below, we use the `similarity_search` function to find movies matching a user query, and print out the results.

In [19]:
def retrieve(query, movies, embeddings):
  results = similarity_search(query, movies, embeddings)
  print(">>> " + query)
  for i in range(len(results)):
    result = results[i]
    print(f'{result[0]["title"]}: {result[1]}')
  print()


retrieve("I want a movie with space ships", movies, embeddings)
retrieve("Give me cerebral sci-fi movies", movies, embeddings)
retrieve("I want to be scared!", movies, embeddings)

>>> I want a movie with space ships
Star Trek: Generations: 0.2362593412399292
Star Wars: Episode V - The Empire Strikes Back: 0.13309141993522644
Ghostbusters II: 0.10731838643550873
Ghostbusters: 0.05222714692354202
Pulp Fiction: 0.052068136632442474
Fight Club: 0.0020262785255908966

>>> Give me cerebral sci-fi movies
Star Trek: Generations: 0.2368042767047882
Ghostbusters II: 0.2060040682554245
Ghostbusters: 0.17845042049884796
Star Wars: Episode V - The Empire Strikes Back: 0.1370648890733719
Pulp Fiction: 0.12603998184204102
Fight Club: -0.019208833575248718

>>> I want to be scared!
Ghostbusters: 0.21073882281780243
Fight Club: 0.12972189486026764
Ghostbusters II: 0.10533657670021057
Star Trek: Generations: 0.08501248061656952
Pulp Fiction: 0.05737336724996567
Star Wars: Episode V - The Empire Strikes Back: 0.04409937933087349



I've also deployed the jina model via [Hugging Face inference endpoints](https://huggingface.co/inference-endpoints/dedicated). This frees us from any local hardware constraints and runs the model in the cloud on some beefy, GPU-equipped hardware. We can use this remotely executed model via a simple `POST` request. The API endpoint expects a JSON object of the form:

```
{"inputs": [first_text, second_text, ..., n_th_text], "params": {}}
```

It will return a list of lists of numbers, i.e. one embedding vector per input text.

We wrap all of this up in a nice little function called `embed_remote`, which behaves like `embed_local`, so we can swap it out in our code.