
In [None]:
# Run this cell, then hide it before the presentation
my_api_key = ''
secret_word = 'embeddings'

In [None]:
# Run this cell to get everything installed in Colab.
# You will get a bunch of logs and errors. Don't worry about them: everything should be installed properly in the end.

%pip install openai
%pip install langchain
%pip install numpy
%pip install chromadb
%pip install tiktoken


### French Toast

When I was in college, my friends and I had a favorite game to play during long car trips. It was called French Toast.

<img src=https://www.donwoodlock.com/ml301-Nov2023/marta_embeddings_presentation/image1.png width="700">    

<br/><br/>

<img src=https://www.donwoodlock.com/ml301-Nov2023/marta_embeddings_presentation/image2.png width="700">

<br/><br/>

<img src=https://www.donwoodlock.com/ml301-Nov2023/marta_embeddings_presentation/image3.png width="700">

Natural language models like ChatGPT are great at playing French Toast!
* the models map every single word to a point in a multi-dimensional space;
* the closer two words are in meaning, the closer their corresponding points are;
* this mapping is called an embedding.

<img src=https://www.donwoodlock.com/ml301-Nov2023/marta_embeddings_presentation/image4.png width="700">

Source: https://openai.com/blog/introducing-text-and-code-embeddings
</br></br></br>

### Embeddings ###

Words in the plot above are represented in a 3-dimensional space.</br>
In practice, the embeddings used in modern Natural Language Models (NLMs) have hundreds or even thousands of dimensions. The original vectors for the embedding above had 2048 dimensions!</br>
</br>
  
We can try out embeddings ourselves. OpenAI, the company behind ChatGPT, provides an API to convert text into its corresponding embedding.</br>
Let's start by importing some code dependencies.

In [None]:
# you might need to run: pip install langchain
from langchain.embeddings import OpenAIEmbeddings

# Use your own API key here
embedding_module = OpenAIEmbeddings(openai_api_key = my_api_key)

We imported LangChain's OpenAIEmbeddings wrapper. It uses a newer embedding model, which produces vectors with 1536 dimensions.</br>
The function below takes a string of text and returns its embedding.

In [None]:
# you might need to run: pip install numpy
import numpy as np

def text_embedding(text: str) -> np.ndarray:
    return np.array(embedding_module.embed_documents([text])[0])

Get the embedding of "french toast":

In [None]:
my_embedding = text_embedding("french toast")
my_embedding

array([-0.00523991,  0.00127193,  0.01377016, ...,  0.0225271 ,
        0.01101095, -0.02228097])

Check the dimensionality of the embedding vector:

In [None]:
len(my_embedding)

1536

</br></br>

### Euclidean Distance ###

Words with similar meanings are close to each other in the embedding space.

<img src=https://www.donwoodlock.com/ml301-Nov2023/marta_embeddings_presentation/image5.png width="500">

Let's test it! Similar words should have a smaller Euclidean distance.

<img src=https://www.donwoodlock.com/ml301-Nov2023/marta_embeddings_presentation/image6.png width="300">

In [None]:
def distance(vector_a, vector_b):
    return np.linalg.norm(vector_a - vector_b)
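
To see what this function computes without calling the API, here is a minimal sketch on two made-up 2-dimensional points (the values are assumptions chosen so the arithmetic is easy to check by hand). Real embeddings work the same way, just with many more coordinates.

```python
import numpy as np

def distance(vector_a, vector_b):
    # Same Euclidean distance as in the cell above
    return np.linalg.norm(vector_a - vector_b)

# Made-up points, not real embeddings
origin = np.array([0.0, 0.0])
point = np.array([3.0, 4.0])

# sqrt((3 - 0)^2 + (4 - 0)^2) = sqrt(9 + 16) = 5
toy_distance = distance(origin, point)
print(toy_distance)
```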

Let's say the secret word is "cake".
Is it closer to "french toast", or is it closer to "InterSystems IRIS"? </br>
OpenAI embeddings are normalized to unit length, so the distance between any two of them falls between 0 and 2.

In [None]:
embedding_1 = text_embedding("french toast")
embedding_2 = text_embedding("cake")

distance(embedding_1, embedding_2)

0.5828727431872559

In [None]:
embedding_1 = text_embedding("InterSystems IRIS")
embedding_2 = text_embedding("cake")

distance(embedding_1, embedding_2)

0.7115007800881934

Another popular metric is the cosine similarity between two vectors:

In [None]:
def cosine_similarity(vector_a, vector_b):
    return np.dot(vector_a, vector_b) / (np.linalg.norm(vector_a) * np.linalg.norm(vector_b))
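
For vectors of length 1, the two metrics are tied together: the squared Euclidean distance equals 2 minus twice the cosine similarity. So for unit-length embeddings, "smallest distance" and "highest cosine similarity" pick the same neighbor. The sketch below checks this on two made-up 3-dimensional unit vectors (the values are assumptions, not real embeddings).

```python
import numpy as np

def cosine_similarity(vector_a, vector_b):
    # Same formula as in the cell above
    return np.dot(vector_a, vector_b) / (np.linalg.norm(vector_a) * np.linalg.norm(vector_b))

# Made-up vectors; [1, 2, 2] and [2, 1, 2] both have length 3,
# so dividing by 3 scales them to unit length
unit_a = np.array([1.0, 2.0, 2.0]) / 3.0
unit_b = np.array([2.0, 1.0, 2.0]) / 3.0

cos = cosine_similarity(unit_a, unit_b)
dist = np.linalg.norm(unit_a - unit_b)

# For unit vectors: dist^2 == 2 - 2 * cos
print(dist ** 2, 2 - 2 * cos)
```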

</br></br>

### How are embeddings generated? ###

**Step 1: clean up your data**

Start with a corpus of text. </br>

Remove:
* punctuation
* stopwords ("is", "are", "a", "the", etc.)
* numbers
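
The clean-up steps above could be sketched as follows. This is a minimal illustration, not a production pipeline: the `clean_text` helper and the tiny `STOPWORDS` set are assumptions made up for this example, and real stopword lists are much longer.

```python
import re
from typing import List

# A tiny illustrative stopword list (hypothetical; real lists are much longer)
STOPWORDS = {"is", "are", "a", "an", "the", "and", "of", "to"}

def clean_text(text: str) -> List[str]:
    # Lowercase, then replace every character that is not a letter or
    # whitespace (punctuation and digits) with a space
    letters_only = re.sub(r"[^a-z\s]", " ", text.lower())
    # Split on whitespace and drop the stopwords
    return [word for word in letters_only.split() if word not in STOPWORDS]

cleaned = clean_text("The 2 best breads for French toast are brioche and challah!")
print(cleaned)
```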

</br>

**Step 2: use a training algorithm**

Training algorithms generate an initial mapping.</br>
A famous family of embedding algorithms is called Word2Vec. These algorithms look at a sliding interval (or window) of words in a corpus of text.

<img src=https://www.donwoodlock.com/ml301-Nov2023/marta_embeddings_presentation/theory3.png width="700">

How does a given word influence the probability of other words appearing in the same interval? </br>

The stronger the correlation between the appearance of two words, the closer the model will place them in the embedding space. </br>
The algorithm uses a neural network to re-organize the words in space. </br>
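
To make the sliding interval concrete, here is a minimal sketch of how (word, neighbor) pairs can be collected from a sentence. This mirrors the pair-generation step of skip-gram-style training, one Word2Vec variant. The `context_pairs` helper and the `window` parameter are names made up for this example, and the neural network that learns from these pairs is not shown.

```python
def context_pairs(words, window=2):
    # For each word, pair it with every neighbor that falls
    # within `window` positions on either side
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

words = ["french", "toast", "with", "maple", "syrup"]
pairs = context_pairs(words, window=1)
for pair in pairs:
    print(pair)
```

Words that keep showing up in each other's pairs across a large corpus are the ones the model pulls together in the embedding space.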

Word2Vec has now largely been superseded by more complex "transformer" models. "GPT" stands for "Generative Pre-trained Transformer".

</br>

**Step 3: pick a function for sentence embedding**

We have talked about embeddings for individual words, but we can calculate an embedding for a sentence or a paragraph, too. </br>
The embedding of a sentence is still a single vector, a single point in the embedding space. </br>

Embeddings are used to measure similarity between text elements. </br>
The closer the two elements are in meaning, the closer they will be in the embedding space. We can compare a word to a whole paragraph.

<img src=https://www.donwoodlock.com/ml301-Nov2023/marta_embeddings_presentation/theory1.png width="700">

A sentence's embedding is usually some weighted average of the embeddings of the words it contains. </br>
Words that carry the most meaning should have a higher weight. </br></br>
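
Here is a minimal sketch of the weighted-average idea. Everything in it is an assumption made up for illustration: the 3-dimensional word vectors, the weights, and the `sentence_embedding` helper. It is not how the OpenAI models compute their sentence embeddings.

```python
import numpy as np

# Made-up 3-dimensional word vectors (real ones have many more dimensions)
word_vectors = {
    "cancel": np.array([1.0, 0.0, 0.0]),
    "my": np.array([0.0, 1.0, 0.0]),
    "order": np.array([0.0, 0.0, 1.0]),
}

# Hypothetical weights: words that carry more meaning get a higher weight
word_weights = {"cancel": 2.0, "my": 0.5, "order": 1.5}

def sentence_embedding(words):
    vectors = np.array([word_vectors[w] for w in words])
    weights = np.array([word_weights[w] for w in words])
    # Weighted average across the words, one value per dimension
    return np.average(vectors, axis=0, weights=weights)

sentence_vector = sentence_embedding(["cancel", "my", "order"])
print(sentence_vector)
```

The result is still a single vector with the same number of dimensions as each word vector, so a sentence can be compared to a word directly.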

Some formulas are optimized for a given style of sentence. The LangChain wrapper we imported has a dedicated method, embed_query, for embedding queries.

In [None]:
def query_embedding(text: str) -> np.ndarray:
    return np.array(embedding_module.embed_query(text))

question = query_embedding("How can I cancel my shoes order?")


</br>

### Use case examples ###

We just learned how to embed a question. Now let's see how we can use embeddings to find an answer. </br>
Look at the three paragraphs below. Which one is most relevant to the question?

In [None]:
paragraphs = ["""It's easier than you think to make restaurant-quality French toast in the comfort of your own kitchen.
You just need a skillet and a few staple ingredients. The best breads for French toast are brioche, sourdough, French bread, or challah.
French toast is traditionally made with day-old slices because they absorb the eggy mixture better than fresh ones.""",

"""If your order has not shipped yet, you can easily cancel it from our website. Log into your account, navigate to the
Orders section, select your order, and click Cancel. If your order has already shipped, you will be able to return the item after you
receive it. You can print a pre-paid mailing label on our website, and drop off the package at any post office.""",

"""InterSystems IRIS makes it easier to build high-performance, machine learning-enabled applications that connect data
and application silos. It provides high performance database management, interoperability, and analytics capabilities, all built-in
from the ground up to speed and simplify your most demanding data-intensive applications."""]


We can:
1. use the API to embed each of the paragraphs above;
2. calculate the Euclidean distance between each paragraph and the question "How can I cancel my shoes order?";
3. display the paragraph that is closest in meaning to the question.

In [None]:
distances = []

for par in paragraphs:
    embedding = text_embedding(par)
    distances.append(distance(embedding, question))

print(paragraphs[np.argmin(distances)])


If your order has not shipped yet, you can easily cancel it from our website. Log into your account, navigate to the
Orders section, select your order, and click Cancel. If your order has already shipped, you will be able to return the item after you
receive it. You can print a pre-paid mailing label on our website, and drop off the package at any post office.
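
Instead of showing only the best match, we could rank all the paragraphs from most to least relevant. The sketch below does this with `np.argsort`. To keep it runnable without an API key it uses made-up 2-dimensional vectors and short labels as stand-ins for the real embeddings and paragraphs, so the numbers are assumptions.

```python
import numpy as np

# Made-up 2-D stand-ins for the question and paragraph embeddings
toy_question = np.array([0.9, 0.1])
toy_paragraphs = {
    "recipe": np.array([0.1, 0.9]),
    "order cancellation": np.array([0.8, 0.2]),
    "product overview": np.array([0.5, 0.5]),
}

labels = list(toy_paragraphs)
toy_distances = [np.linalg.norm(toy_paragraphs[label] - toy_question) for label in labels]

# argsort returns the indices ordered from smallest to largest distance
ranking = [labels[i] for i in np.argsort(toy_distances)]
print(ranking)
```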


</br>

Confluence provides another example of how embeddings can be useful in our day-to-day life.</br>
When we type a new page title, the Confluence UI will display a list of pages with similar titles.

<img src=https://www.donwoodlock.com/ml301-Nov2023/marta_embeddings_presentation/theory2.png width="700">

Pre-computing an embedding for each page's title can be an efficient way to later tell if two pages are similar. </br>
Disclaimer: I do not know what technology Confluence is using for this feature.
</br></br></br>
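
As a rough sketch of the pre-computing idea (again, not Confluence's actual implementation), we can store one embedding per title as a row of a matrix, then compare a new title against all rows in a single NumPy operation. The titles, the 3-dimensional vectors, and the `most_similar_titles` helper are all made up for illustration.

```python
import numpy as np

# Made-up page titles and pre-computed 3-D "embeddings", one row per title
titles = ["Release notes 2023", "Release checklist", "Team lunch menu", "Onboarding guide"]
title_matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.2, 0.9, 0.1],
])

def most_similar_titles(new_title_vector, k=2):
    # Broadcasting subtracts the new vector from every row at once;
    # axis=1 gives one Euclidean distance per stored title
    all_distances = np.linalg.norm(title_matrix - new_title_vector, axis=1)
    closest = np.argsort(all_distances)[:k]
    return [titles[i] for i in closest]

# Made-up embedding for the title being typed
suggestions = most_similar_titles(np.array([0.9, 0.1, 0.05]))
print(suggestions)
```

Only the new title needs to be embedded at typing time; the stored rows are reused for every lookup.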


### Let's play! ###

The function below allows us to play the French Toast game with OpenAI embeddings. </br>
I have already set a secret_word variable.

In [None]:
def is_it_more_like(word_1, word_2):
    # A correct guess ends the game
    if secret_word in (word_1, word_2):
        return f"It's {secret_word}!"
    # Embed the secret word once and reuse it for both comparisons
    secret_embedding = text_embedding(secret_word)
    distance_1 = distance(secret_embedding, text_embedding(word_1))
    distance_2 = distance(secret_embedding, text_embedding(word_2))
    # Report whichever guess is closer to the secret word
    if distance_1 < distance_2:
        return f"It's more like {word_1}."
    return f"It's more like {word_2}."

I am thinking of something, and it's not French toast!

In [None]:
is_it_more_like('french toast', 'videogames')

"It's more like videogames."

</br></br>

### Thank you!