<a href="https://colab.research.google.com/github/filipecalegario/criacomp/blob/main/2024_2_CRIACOMP_Embeddings_and_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenAI Word Embeddings, Semantic Search

Word embeddings represent words and phrases as vectors of numbers. They can be used for a variety of tasks, including semantic search, anomaly detection, and classification. In the video on OpenAI Whisper, I mentioned that words whose vectors are numerically close also tend to be close in meaning. In this tutorial, we will learn how to implement semantic search using OpenAI embeddings. Understanding embeddings will be crucial for the next several videos in this series, since we will use them to build several practical applications.

To get started, we need to install and import the openai package and provide an API key. We learned how to do this in [Video 3 of this series](https://www.youtube.com/watch?v=LWYgjcZye1c).

In [1]:
!pip install -q openai

from openai import OpenAI
from google.colab import userdata

# Named "client" because the later cells call get_embedding(client, ...)
client = OpenAI(api_key=userdata.get('OPENAI_KEY'))

# Read Data File Containing Words

Now that we have configured OpenAI, let's start with a simple CSV file with familiar words. From here we'll build up to a more complex semantic search using sentences from the Fed speech. [Save the linked "words.csv" as a CSV](https://gist.github.com/hackingthemarkets/25240a55e463822d221539e79d91a8d0) and upload it to Google Colab. Once the file is uploaded, let's read it into a pandas dataframe using the code below:

In [2]:
import pandas as pd

df = pd.read_csv('words.csv')
print(df)

             text
0             red
1        potatoes
2            soda
3          cheese
4           water
5            blue
6          crispy
7       hamburger
8          coffee
9           green
10           milk
11       la croix
12         yellow
13      chocolate
14   french fries
15          latte
16           cake
17          brown
18   cheeseburger
19       espresso
20     cheesecake
21          black
22          mocha
23          fizzy
24         carbon
25         banana
26        saudade
27        longing
28       feelings
29  baião de dois
30        buchada
31         cuscuz
32          verde
33        amarelo
34          rouge
35   luiz gonzaga
36            aoi
37      tartaruga
38          zebra
39         girafa
40        giraffe


# Calculate Word Embeddings

To use word embeddings for semantic search, you first compute an embedding for each item in a corpus of text using an embedding model. What does this mean? We are going to create a numerical representation of each of these words. To perform this computation, we'll define a small 'get_embedding' helper that calls OpenAI's embeddings endpoint.
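
When the corpus grows beyond a handful of words, it may help to know that the embeddings endpoint also accepts a list of strings in a single request. The sketch below is only an illustration: `get_embeddings_batch` is a hypothetical helper name (not part of the OpenAI library), and it relies only on `embeddings.create` and the `index` and `embedding` fields of each returned item.

```python
def get_embeddings_batch(openai_client, texts, model):
    """Embed several strings with one request (hypothetical helper, not an OpenAI function)."""
    response = openai_client.embeddings.create(input=list(texts), model=model)
    # Each returned item records the position of its input; sort by it to be safe.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]

# Possible usage once the client and dataframe exist:
# df['embedding'] = get_embeddings_batch(client, df['text'], 'text-embedding-3-small')
```

For the 41 words used here, one request per word is perfectly fine; batching mostly matters for larger datasets.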

Since we have our words in a pandas dataframe, we can use "apply" to call get_embedding on each row. We then store the calculated word embeddings in a new CSV file called "word_embeddings.csv" so that we don't have to call OpenAI again to perform these calculations.
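
The idea of saving embeddings so that OpenAI is not called twice can be sketched as a small cache. This is a minimal illustration under stated assumptions: `load_or_compute_embeddings`, `embed_fn`, and `cache_path` are hypothetical names, and the cache is a JSON file rather than the CSV this notebook uses.

```python
import json
import os

def load_or_compute_embeddings(words, embed_fn, cache_path):
    """Return {word: vector}, calling embed_fn only for words missing from the cache."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            cache = json.load(f)
    missing = [w for w in words if w not in cache]
    for word in missing:
        cache[word] = list(embed_fn(word))  # one call per uncached word
    if missing:
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
    return {w: cache[w] for w in words}
```

In this notebook, `embed_fn` could be something like `lambda w: get_embedding(client, w, 'text-embedding-3-small')`.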

## OpenAI Text Embedding

In [3]:
def get_embedding(openai_client, input, model):
  return openai_client.embeddings.create(input=input, model=model).data[0].embedding

## Jina Text Embedding

In [None]:
!pip install transformers
from transformers import AutoModel

jina_embedding_model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True) # trust_remote_code is needed to use the encode method
#embeddings = jina_embedding_model.encode(['How is the weather today?', 'What is the current weather like today?'])

def get_embedding_jina(input):
  return jina_embedding_model.encode(input)

In [4]:
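# One API request per row; each result is a list of floats stored in a new column
df['embedding'] = df['text'].apply(lambda x: get_embedding(client, x, 'text-embedding-3-small'))
# index=False keeps the row index out of the file, avoiding an extra "Unnamed: 0" column on reload
df.to_csv('word_embeddings.csv', index=False)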

# Semantic Search

Now that we have our word embeddings stored, let's load them into a new dataframe and use it for semantic search. Since each 'embedding' in the CSV is stored as a string, we'll use apply() with eval to interpret this string as Python code, and then convert the resulting list to a numpy array so that we can perform calculations on it.
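
Because eval runs whatever code the string contains, it is best reserved for files we wrote ourselves. As a hedged alternative, the sketch below parses the same kind of string with `ast.literal_eval` from the standard library, which only accepts Python literals. The helper name `parse_embedding` is made up for illustration.

```python
import ast
import numpy as np

def parse_embedding(text):
    """Turn a string such as '[0.1, -0.2]' into a numpy array without executing code."""
    return np.array(ast.literal_eval(text), dtype=float)

vector = parse_embedding("[0.25, -0.5, 1.0]")
print(vector.shape)  # (3,)
```

With the dataframe from this notebook, the equivalent call would be `df['embedding'].apply(parse_embedding)`.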

In [None]:
import numpy as np

df = pd.read_csv('word_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df

Let's now prompt ourselves for a search term that isn't in the dataframe. We'll use word embeddings to perform a semantic search for the words that are most similar to the word we entered. I'll first try the word "hot dog". Then we'll come back and try the word "yellow".

In [None]:
search_term = input('Enter a search term: ')


Now that we have a search term, let's calculate an embedding (a vector) for it using the same get_embedding function and the same model we used for the words in the dataframe.

In [None]:
# semantic search
search_term_vector = get_embedding(client, search_term, "text-embedding-3-small")
search_term_vector

Once we have a vector representing that word, we can see how similar it is to other words in our dataframe by calculating the cosine similarity of our search term's word vector to each word embedding in our dataframe.

Reference: https://platform.openai.com/docs/guides/embeddings/use-cases
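
To get a feel for the numbers, here is a small worked example of cosine similarity, the dot product of two vectors divided by the product of their lengths. The 2-D vectors are invented for illustration and are not real embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Same direction (different length), perpendicular, and opposite direction
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_similarity(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

The three values come out at roughly 1, 0, and -1: the measure depends only on the angle between the vectors, not on their length.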

In [None]:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

In [None]:
df["similarities"] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector))

df

# Sorting By Similarity

Now that we have calculated the similarities to each term in our dataframe, we simply sort the similarity values to find the terms that are most similar to the term we searched for. Notice how the foods are most similar to "hot dog". Not only that, it puts fast food closer to hot dog. Also some colors are ranked closer to hot dog than others. Let's go back and try the word "yellow" and walk through the results.
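
The score-then-sort steps can also be written as one reusable function. This is a sketch under assumptions: `top_k_similar` is a hypothetical helper, and the tiny 2-D vectors in the usage lines are made up rather than produced by OpenAI.

```python
import numpy as np

def top_k_similar(query_vector, vectors, labels, k=3):
    """Rank labels by cosine similarity to query_vector (illustrative helper)."""
    matrix = np.asarray(vectors, dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Normalise the rows and the query so a dot product equals cosine similarity.
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = matrix @ query
    order = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(labels[i], float(scores[i])) for i in order]

# Toy data: "hamburger" points almost the same way as the query, "blue" is perpendicular
toy_labels = ["hamburger", "blue", "soda"]
toy_vectors = [[1.0, 0.1], [0.0, 1.0], [0.7, 0.7]]
results = top_k_similar([1.0, 0.0], toy_vectors, toy_labels, k=2)
print([name for name, _ in results])  # ['hamburger', 'soda']
```

With the notebook's data, a call along the lines of `top_k_similar(search_term_vector, df['embedding'].tolist(), df['text'].tolist(), k=20)` should give the same ordering as the sorted dataframe.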

In [None]:
df.sort_values("similarities", ascending=False).head(20)

# Adding Words Together

What's even more interesting is that we can add word vectors together. What happens when we add the vectors for milk and espresso, then search for the words whose vectors are most similar to milk + espresso? Let's make a copy of the original dataframe, call it food_df, and operate on this copy. We'll add milk + espresso and store the result in milk_espresso_vector.

In [None]:
food_df = df.copy()

# Rows 10 and 19 are 'milk' and 'espresso' in words.csv (see the printed dataframe above)
milk_vector = food_df['embedding'][10]
espresso_vector = food_df['embedding'][19]

milk_espresso_vector = milk_vector + espresso_vector
milk_espresso_vector

Now let's find the words most similar to milk + espresso. If you have never done this before, it's pretty surprising that you can add words together like this and find similar words using numbers.
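
Two details are worth keeping in mind here, sketched below with invented 2-D vectors (not real embeddings). First, cosine similarity ignores vector length, so the sum and the average of two vectors produce the same ranking. Second, the words that were added together will typically rank near the top themselves, so it can be clearer to leave them out of the candidates.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors, made up for illustration only
toy = {
    "milk": np.array([1.0, 0.0]),
    "espresso": np.array([0.0, 1.0]),
    "latte": np.array([0.8, 0.9]),
    "red": np.array([-1.0, 0.2]),
}
combined = toy["milk"] + toy["espresso"]
averaged = combined / 2  # same direction as the sum, so the same cosine similarities

# Exclude the two source words before ranking
candidates = [w for w in toy if w not in ("milk", "espresso")]
ranked = sorted(candidates, key=lambda w: cosine_similarity(toy[w], combined), reverse=True)
print(ranked)  # ['latte', 'red']
```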

In [None]:
food_df["similarities"] = food_df['embedding'].apply(lambda x: cosine_similarity(x, milk_espresso_vector))
food_df.sort_values("similarities", ascending=False)

# Visualizing the Vectors

## Configurations

In [None]:
%pip install umap-learn


In [None]:
from __future__ import print_function
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.display import display, Javascript, HTML
import numpy as np
from sklearn.datasets import load_iris, load_digits
from sklearn.model_selection import train_test_split
import seaborn as sns
import pandas as pd
import umap
import codecs, json

In [None]:
sns.set(style='white', context='notebook', rc={'figure.figsize':(14,10)})

In [None]:
reducer = umap.UMAP(init='random')

In [None]:
reducer

### Functions Definition

In [None]:
def run_umap(data, n_neighbors, min_dis, n_components, metric, spread):
  reducer.n_neighbors = n_neighbors
  reducer.min_dist = min_dis
  reducer.n_components = n_components
  reducer.metric = metric
  reducer.spread = spread
  embedding = reducer.fit_transform(data)
  return embedding

In [None]:
def make_viz_embed(data, colors=(), labels=()):
  embed = f"""
    <div id="observablehq-viewof-containerEl-96fe8cff"></div>
    <script type="module">
    import {{Runtime, Inspector}} from "https://cdn.jsdelivr.net/npm/@observablehq/runtime@4/dist/runtime.js";
    import define from "https://api.observablehq.com/@radames/umap-jupyter-notebook-scattergl.js?v=3";
    const inspect = new Inspector(document.querySelector("#observablehq-viewof-containerEl-96fe8cff"));
    const notebook = (new Runtime).module(define, name => {{
    if(name === "viewof containerEl") return inspect;
        return ["init"].includes(name);
    }})
    notebook.redefine('points', {json.dumps(data,separators=(',', ':'))})
    notebook.redefine('colors', {json.dumps(colors,separators=(',', ':'))})
    notebook.redefine('labels', {json.dumps(labels,separators=(',', ':'))})
    </script>

  """
  return embed

In [None]:
def render(data, colors, labels, n_neighbors=100, min_dis=0.5, n_components=3, metric='euclidean', spread = 1.0):
  embedding = run_umap(data, n_neighbors, min_dis, n_components, metric, spread)
  html_str = make_viz_embed(embedding.tolist(), colors, labels)
  display(HTML(html_str))


## Loading Data

In [None]:
casos_uso_df = pd.read_csv('word_embeddings.csv')

In [None]:
casos_uso_df.embedding = casos_uso_df.embedding.apply(eval).apply(np.array)

In [None]:
casos_uso_df

Unnamed: 0.1,Unnamed: 0,Itens,Categoria,embedding
0,0,Midjouney,"""aplicacoes""","[-0.009065642952919006, -0.021264266222715378,..."
1,1,Openjourney,"""aplicacoes""","[-0.0028128710109740496, -0.002215629443526268..."
2,2,DALL E,"""aplicacoes""","[-0.011047448962926865, -0.021905938163399696,..."
3,3,Tome.app,"""aplicacoes""","[0.003620806382969022, 0.006396056618541479, 0..."
4,4,Stable diffusion,"""aplicacoes""","[-0.01985820196568966, 0.016730502247810364, 0..."
...,...,...,...,...
203,203,Identificação de falhas de segurança,"""casos_uso""","[-0.015269014053046703, 0.009737345390021801, ..."
204,204,Fake news para manipular eleições,"""casos_uso""","[-0.030033115297555923, 0.02663939818739891, -..."
205,205,Geração de planos de crime,"""casos_uso""","[-0.008206862024962902, -0.015600252896547318,..."
206,206,Nova trending do tiktok,"""casos_uso""","[-0.03929344564676285, -0.008771595545113087, ..."


In [None]:
casos_uso_df.embedding

In [None]:
# Adapting for the expected data format

# Convert each numpy vector into a plain Python list of floats
output_list = [vector.tolist() for vector in casos_uso_df.embedding]

output_list[0:1]

In [None]:
colors = [sns.color_palette()[0] for x in output_list]
colors

## Visualization

In [None]:
render(output_list, colors, casos_uso_df['Itens'].to_list(), n_neighbors=3, min_dis=0.5, n_components=3, metric='cosine')