# Embeddings

**Query Models**
- Searching for similar documents

**Document Models**
- Embedding documents

**Similarity Models**
- Clustering, regression, anomaly detection, visualization

In [None]:
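import re

import openai
import pandas as pd
import tiktoken  # used further down to count tokens
# embeddings_utils is only available in openai versions before 1.0
from openai.embeddings_utils import get_embedding, cosine_similarity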

### Set up OpenAI API

In [None]:
OPENAI_KEY = ""
OPENAI_RESOURCE_ENDPOINT = ""

In [None]:
openai.api_type = "azure"
openai.api_key = OPENAI_KEY
openai.api_base = OPENAI_RESOURCE_ENDPOINT
openai.api_version = "2022-12-01"

## Getting some data

In [None]:
df = pd.read_csv("data/data-embedding.csv")
df.head()

Cleaning up the data

In [None]:
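def normalize_text(s, sep_token = " \n "):
    # sep_token is kept for compatibility with existing callers; it is not used here
    # Collapse every run of whitespace (including newlines) into one space
    s = re.sub(r'\s+', ' ', s).strip()
    # Remove stray ". ," sequences (the dot is escaped so it matches a literal period)
    s = re.sub(r"\. ,", "", s)
    # Collapse doubled periods
    s = s.replace("..", ".")
    s = s.replace(". .", ".")
    s = s.strip()

    return s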

df["description"] = df["description"].apply(normalize_text)
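
To see what the cleanup does, here is a minimal sketch that applies the same steps to a made-up string. The name `normalize_text_sketch` and the sample text are assumptions for illustration only, and the sketch escapes the dot in the `". ,"` pattern so that it matches a literal period.

```python
import re

def normalize_text_sketch(s):
    # Hypothetical standalone copy of the cleanup steps, for illustration only
    s = re.sub(r"\s+", " ", s).strip()   # collapse whitespace runs, newlines included
    s = re.sub(r"\. ,", "", s)           # drop stray ". ," sequences
    s = s.replace("..", ".")             # collapse doubled periods
    s = s.replace(". .", ".")
    return s.strip()

sample = "  This   is\n a  sample\tdescription.. "
cleaned = normalize_text_sketch(sample)
print(cleaned)  # This is a sample description.
```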

### Embedding

Notice that although the first sentence is very short, the returned vector has exactly the same size as the one returned for the much longer description.

In [None]:
text_to_embed = "This is a short sentence"
embeddings = get_embedding(text_to_embed, engine = "embedding-ada")
print(text_to_embed)
print('Array Size (Short Text) - Ada: ', len(embeddings))
print(embeddings)

print("=====================================")
embeddings = get_embedding(df["description"][1], engine = "embedding-ada")
print(df["description"][1])
print('Array Size (Long Text) - Ada: ', len(embeddings))
print(embeddings)

The size of the vector does depend on the model you use, however. A larger vector has more room to encode information about the text, so the amount of detail an embedding can capture varies with its size.
In this case, the Davinci embedding can capture more than the Ada embedding.

In [None]:
embeddings = get_embedding(df["description"][1], engine = "embedding-babbage")
print('Array Size (Long Text) - Babbage: ', len(embeddings))

embeddings = get_embedding(df["description"][1], engine = "embedding-curie")
print('Array Size (Long Text) - Curie: ', len(embeddings))

embeddings = get_embedding(df["description"][1], engine = "embedding-davinci")
print('Array Size (Long Text) - Davinci: ', len(embeddings))

In [None]:
tokenizer = tiktoken.get_encoding("cl100k_base")
sample_encode = tokenizer.encode(df["description"][1]) 
print("No of tokens: ", len(sample_encode))
tokenizer.decode_tokens_bytes(sample_encode)


## Calculate embeddings

Instead of recalculating all the embeddings for the data, you can use the precomputed dataset stored in the pickle file.
Save your embeddings after calculating them, especially when you are converting a large dataset; otherwise you pay for the embedding calls again every time you need them.

In [None]:

#df['embedding'] = df["description"].apply(lambda x : get_embedding(x, engine = 'text-embedding-ada-002'))
#pd.to_pickle(df, "data/data-embedding.pkl")
df = pd.read_pickle("data/data-embedding.pkl")
df.head()
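
The "compute once, then reload" idea can be sketched without calling the API. In the example below, `fake_embed`, `counting_embed`, `load_or_compute`, and the cache file name are all hypothetical stand-ins, and the standard-library `pickle` module is used in place of `pd.to_pickle` to keep the sketch self-contained.

```python
import os
import pickle
import tempfile

def fake_embed(text):
    # Hypothetical stand-in for a paid embedding call
    return [float(len(text)), float(text.count(" "))]

def load_or_compute(texts, cache_path, embed_fn):
    """Return embeddings from cache_path if it exists, otherwise compute and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    embeddings = [embed_fn(t) for t in texts]
    with open(cache_path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings

calls = []  # records every text that was actually embedded

def counting_embed(text):
    calls.append(text)
    return fake_embed(text)

cache_file = os.path.join(tempfile.mkdtemp(), "embeddings.pkl")
texts = ["first document", "second, longer document"]

first = load_or_compute(texts, cache_file, counting_embed)   # computes and saves
second = load_or_compute(texts, cache_file, counting_embed)  # reads from disk
print(len(calls))  # 2: the embedding function ran only during the first call
```

Note that this simple cache does not notice when `texts` changes; delete the file to force a recomputation.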

This function calculates the cosine similarity between our query and every document in the dataset, then returns the most similar documents.

In [None]:
def search_docs(df, user_query:str, engine:str, top_n=3, to_print=True):
    embedding = get_embedding(
        user_query,
        engine=engine
    )
    df["similarities"] = df.embedding.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .head(top_n)
    )
    if to_print:
        display(res)
    return res
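
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it measures direction rather than magnitude. The sketch below is a plain-Python illustration with toy two-dimensional vectors; `cosine_similarity_sketch` and the sample data are assumptions for this example, not the implementation used by `openai.embeddings_utils`.

```python
import math

def cosine_similarity_sketch(a, b):
    # dot(a, b) / (|a| * |b|); assumes neither vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
docs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

scores = {name: cosine_similarity_sketch(query, vec) for name, vec in docs.items()}
# Rank documents from most to least similar, as search_docs does with sort_values
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['same direction', 'orthogonal', 'opposite']
```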

Here we call the function with the query "who is edgar davids". It returns the top 4 results for this query.

In [None]:
res = search_docs(df, "who is edgar davids", top_n=4, engine="text-embedding-ada-002", to_print=False)
for i in res.index:
    print(df["description"][i])
    print("=====================================")