# OpenAI Word Embeddings, Semantic Search

Word embeddings represent words and phrases as vectors of numbers. They can be used for a variety of tasks, including semantic search, anomaly detection, and classification. This notebook illustrates that words whose vectors are numerically similar are also similar in meaning, and then shows how to implement semantic search using OpenAI embeddings.

In [1]:
# Import Python libraries
import os
import openai
from Utilities.envVars import *

# Configure the Azure OpenAI client. OpenAiVersion, OpenAiKey and
# OpenAiEndPoint are expected to come from Utilities.envVars.
openai.api_type = "azure"
openai.api_version = OpenAiVersion
assert OpenAiKey, "ERROR: Azure OpenAI Key is missing"
openai.api_key = OpenAiKey
openai.api_base = OpenAiEndPoint

# Read Data File Containing Words

Now that we have configured OpenAI, let's start with a simple CSV file of familiar words. From here we'll build up to a more complex semantic search using paragraphs from a Microsoft earnings call. [Save the linked "words.csv" as a CSV](https://gist.github.com/hackingthemarkets/25240a55e463822d221539e79d91a8d0) and upload it to Google Colab (the code below expects it at ./Data/CSV/words.csv). Once the file is in place, let's read it into a pandas dataframe using the code below:

In [2]:
import openai
import pandas as pd
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity

In [3]:
df = pd.read_csv('./Data/CSV/words.csv')
print(df)

            text
0            red
1       potatoes
2           soda
3         cheese
4          water
5           blue
6         crispy
7      hamburger
8         coffee
9          green
10          milk
11      la croix
12        yellow
13     chocolate
14  french fries
15         latte
16          cake
17         brown
18  cheeseburger
19      espresso
20    cheesecake
21         black
22         mocha
23         fizzy
24        carbon
25        banana


# Calculate Word Embeddings

To use word embeddings for semantic search, you first compute embeddings for a corpus of text using an embedding model. What does this mean? We are going to create a numerical representation of each of these words. To perform this computation, we'll use the 'get_embedding' helper from openai.embeddings_utils.

Since we have our words in a pandas dataframe, we can use "apply" to call get_embedding on each row in the dataframe. We then store the calculated word embeddings in a new CSV file called "word_embeddings.csv" so that we don't have to call OpenAI again to perform these calculations.

In [4]:
get_embedding("the fox crossed the road", engine=OpenAiEmbedding)

[-0.0005842710961587727,
 0.00034204122493974864,
 -0.020236846059560776,
 0.007038079667836428,
 -0.013861545361578465,
 0.025349711999297142,
 -0.022307241335511208,
 -0.02234511449933052,
 0.013861545361578465,
 -0.033176813274621964,
 0.0291875172406435,
 0.008811802603304386,
 0.03297482430934906,
 -0.01589406654238701,
 0.004800412338227034,
 -0.0010730704525485635,
 0.01949200965464115,
 -0.0009681304800324142,
 0.018381066620349884,
 -0.0306266937404871,
 -0.009840687736868858,
 0.029490500688552856,
 0.002311835763975978,
 -0.02372116968035698,
 -0.0038598976098001003,
 0.006419486366212368,
 0.015275473706424236,
 -0.013823672197759151,
 -0.0028278562240302563,
 -0.011595473624765873,
 0.008123775012791157,
 0.0018668270204216242,
 -0.02880878560245037,
 0.0004947170382365584,
 -0.013861545361578465,
 -0.017446864396333694,
 -0.0018857636023312807,
 -0.00871080718934536,
 0.011273551732301712,
 -0.027369609102606773,
 0.027142370119690895,
 0.005340103525668383,
 -0.016651529

In [5]:
df['embedding'] = df['text'].apply(lambda x: get_embedding(x, engine=OpenAiEmbedding))
df.to_csv('./Data/CSV/word_embeddings.csv')
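
The idea of saving embeddings so they are computed only once can be sketched as a small caching helper. This is a minimal sketch, not the notebook's own code: embed_with_cache and fake_embed are hypothetical names, and fake_embed stands in for get_embedding so the example runs without an API key. Writing with index=False is a design choice that avoids the extra "Unnamed: 0" columns that appear when a saved index is read back.

```python
import os
import tempfile

import pandas as pd


def embed_with_cache(df, cache_path, embed_fn):
    """Return df with an 'embedding' column, reusing cache_path if it exists."""
    if os.path.exists(cache_path):
        # Cache hit: no embedding calls are made.
        return pd.read_csv(cache_path)
    out = df.copy()
    out["embedding"] = out["text"].apply(embed_fn)
    # index=False keeps the row index out of the file, so reading it back
    # does not add an "Unnamed: 0" column.
    out.to_csv(cache_path, index=False)
    return out


# Hypothetical stand-in for get_embedding so the sketch runs offline.
calls = []


def fake_embed(text):
    calls.append(text)
    return [float(len(text)), 1.0]


words = pd.DataFrame({"text": ["red", "soda"]})
cache_file = os.path.join(tempfile.mkdtemp(), "word_embeddings.csv")

first = embed_with_cache(words, cache_file, fake_embed)   # computes and saves
second = embed_with_cache(words, cache_file, fake_embed)  # loads from disk
print(len(calls), list(second.columns))
```

Note that the embeddings loaded from the cache come back as strings and still need to be parsed before use.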

# Semantic Search

Now that we have our word embeddings stored, let's load them into a new dataframe and use it for semantic search. Since the 'embedding' column in the CSV is stored as a string, we'll use apply() with eval to interpret each string as Python code, then convert the resulting list to a numpy array so that we can perform calculations on it.

In [6]:
df = pd.read_csv('./Data/CSV/word_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df

Unnamed: 0.1,Unnamed: 0,text,embedding
0,0,red,"[1.8579006791696884e-05, -0.024676261469721794..."
1,1,potatoes,"[0.005025846417993307, -0.031079445034265518, ..."
2,2,soda,"[0.025859491899609566, -0.0074522839859128, -0..."
3,3,cheese,"[-0.003942061681300402, -0.009351087734103203,..."
4,4,water,"[0.019031280651688576, -0.01257743313908577, 0..."
5,5,blue,"[0.005434895399957895, -0.0072994716465473175,..."
6,6,crispy,"[-0.0010056837927550077, -0.005415474995970726..."
7,7,hamburger,"[-0.013206875883042812, -0.0018223668448626995..."
8,8,coffee,"[-0.0007566262502223253, -0.01945229433476925,..."
9,9,green,"[0.01538460049778223, -0.010931522585451603, 0..."
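
Because eval runs whatever Python code the string contains, a more defensive way to parse the stored embeddings is ast.literal_eval from the standard library, which accepts only literals such as lists and numbers. The sketch below assumes each cell holds a plain list of floats; parse_embedding and the sample dataframe are made up for illustration.

```python
import ast

import numpy as np
import pandas as pd


def parse_embedding(cell):
    """Turn a string such as '[0.1, -0.2]' into a float NumPy array."""
    return np.array(ast.literal_eval(cell), dtype=float)


# Illustrative data, not taken from word_embeddings.csv.
sample = pd.DataFrame({"text": ["red"], "embedding": ["[0.5, -0.25, 0.125]"]})
sample["embedding"] = sample["embedding"].apply(parse_embedding)
print(sample["embedding"][0])

# literal_eval refuses anything that is not a plain literal.
try:
    parse_embedding("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print("rejected:", rejected)
```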


Let's now prompt for a search term. We'll use word embeddings to perform a semantic search for the words that are most similar to the term we entered. I'll first try "hot dog", which isn't in the dataframe. Then we'll come back and try the word "yellow".

In [7]:
search_term = input('Enter a search term: ')

Enter a search term:  hot dog


Now that we have a search term, let's calculate an embedding or vector for that search term using the OpenAI get_embedding function.

In [8]:
# semantic search
search_term_vector = get_embedding(search_term, engine=OpenAiEmbedding)

Once we have a vector representing that word, we can see how similar it is to other words in our dataframe by calculating the cosine similarity of our search term's word vector to each word embedding in our dataframe.

In [9]:
df["similarities"] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector))
df

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
0,0,red,"[1.8579006791696884e-05, -0.024676261469721794...",0.81207
1,1,potatoes,"[0.005025846417993307, -0.031079445034265518, ...",0.816856
2,2,soda,"[0.025859491899609566, -0.0074522839859128, -0...",0.820797
3,3,cheese,"[-0.003942061681300402, -0.009351087734103203,...",0.824127
4,4,water,"[0.019031280651688576, -0.01257743313908577, 0...",0.798268
5,5,blue,"[0.005434895399957895, -0.0072994716465473175,...",0.786934
6,6,crispy,"[-0.0010056837927550077, -0.005415474995970726...",0.820502
7,7,hamburger,"[-0.013206875883042812, -0.0018223668448626995...",0.876765
8,8,coffee,"[-0.0007566262502223253, -0.01945229433476925,...",0.799683
9,9,green,"[0.01538460049778223, -0.010931522585451603, 0...",0.785477
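
For readers who want to see what cosine similarity computes, here is a minimal NumPy sketch: the dot product of two vectors divided by the product of their lengths. The function name cosine_sim is chosen here for illustration and is not the library's implementation. A local version like this can also be handy because openai.embeddings_utils was part of the pre-1.0 openai package and is not included in later releases.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


same_direction = cosine_sim([1, 2, 3], [2, 4, 6])  # parallel vectors
orthogonal = cosine_sim([1, 0], [0, 1])            # unrelated directions
opposite = cosine_sim([1, 0], [-1, 0])             # opposite directions
print(round(same_direction, 6), orthogonal, opposite)
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposing directions. Length does not matter, only direction.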


# Sorting By Similarity

With the similarities calculated, we sort them in descending order to find the terms closest to our search term. Notice that foods rank highest for "hot dog", with fast food items such as hamburger, cheeseburger, and french fries at the top. Some colors also rank closer to "hot dog" than others: red, for example, makes the top ten. Let's go back and try the word "yellow" and walk through the results.

In [10]:
df.sort_values("similarities", ascending=False).head(20)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
7,7,hamburger,"[-0.013206875883042812, -0.0018223668448626995...",0.876765
18,18,cheeseburger,"[-0.018216602504253387, 0.005054355598986149, ...",0.856907
14,14,french fries,"[0.0014476682990789413, -0.016491735354065895,...",0.838477
3,3,cheese,"[-0.003942061681300402, -0.009351087734103203,...",0.824127
2,2,soda,"[0.025859491899609566, -0.0074522839859128, -0...",0.820797
6,6,crispy,"[-0.0010056837927550077, -0.005415474995970726...",0.820502
1,1,potatoes,"[0.005025846417993307, -0.031079445034265518, ...",0.816856
13,13,chocolate,"[0.0015591585543006659, -0.013005273416638374,...",0.816746
0,0,red,"[1.8579006791696884e-05, -0.024676261469721794...",0.81207
16,16,cake,"[-0.013669420965015888, -0.016827935352921486,...",0.811998
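
The score-then-sort steps can be wrapped in one helper that returns the k best matches. This is a sketch under assumptions: top_matches is a hypothetical name, the dataframe is assumed to have 'text' and 'embedding' columns as in this notebook, and the two-dimensional toy vectors are invented so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd


def top_matches(df, query_vector, k=3):
    """Return the k rows of df whose embeddings are closest to query_vector."""
    # Stack the per-row embeddings into one (rows x dimensions) matrix.
    matrix = np.array(df["embedding"].tolist(), dtype=float)
    query = np.asarray(query_vector, dtype=float)
    # Cosine similarity of every row against the query in one step.
    scores = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    ranked = df.assign(similarities=scores)
    return ranked.sort_values("similarities", ascending=False).head(k)


# Toy data: directions on a compass instead of real embeddings.
toy = pd.DataFrame({
    "text": ["north", "north-east", "east", "south"],
    "embedding": [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, -1.0]],
})
result = top_matches(toy, [0.0, 2.0], k=2)
print(result["text"].tolist())  # ['north', 'north-east']
```

Computing all scores with one matrix product is usually faster than calling a similarity function row by row, which starts to matter once the dataframe holds thousands of embeddings.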


# Adding Words Together

What's even more interesting is that we can add word vectors together. What happens when we add the vectors for milk and espresso, then search for the words most similar to the sum? Let's make a copy of the original dataframe called food_df and operate on that copy. We'll add the milk and espresso vectors and store the result in milk_espresso_vector.

In [11]:
food_df = df.copy()

milk_vector = food_df['embedding'][10]
espresso_vector = food_df['embedding'][19]

milk_espresso_vector = milk_vector + espresso_vector
milk_espresso_vector

array([-0.02157659, -0.03206679, -0.01620988, ..., -0.00423221,
        0.00078145, -0.02898556])

Now let's find the words most similar to milk + espresso. If you have never done this before, it's pretty surprising that you can add words together like this and find similar words using numbers.

In [12]:
food_df["similarities"] = food_df['embedding'].apply(lambda x: cosine_similarity(x, milk_espresso_vector))
food_df.sort_values("similarities", ascending=False)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
19,19,espresso,"[-0.02250584401190281, -0.012747502885758877, ...",0.960501
10,10,milk,"[0.0009292512550018728, -0.019319288432598114,...",0.960501
15,15,latte,"[-0.015634099021553993, -0.003942839801311493,...",0.922975
22,22,mocha,"[-0.012487593106925488, -0.026140518486499786,...",0.899327
8,8,coffee,"[-0.0007566262502223253, -0.01945229433476925,...",0.895382
3,3,cheese,"[-0.003942061681300402, -0.009351087734103203,...",0.885276
13,13,chocolate,"[0.0015591585543006659, -0.013005273416638374,...",0.88344
2,2,soda,"[0.025859491899609566, -0.0074522839859128, -0...",0.874156
4,4,water,"[0.019031280651688576, -0.01257743313908577, 0...",0.866049
7,7,hamburger,"[-0.013206875883042812, -0.0018223668448626995...",0.852722
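
In the results above, espresso and milk tie at exactly the same score. That is what the arithmetic predicts when both vectors have the same length: for unit vectors a and b, the cosine similarity of a + b with either one is (1 + a·b) / |a + b|. The sketch below checks this with random unit vectors standing in for the real embeddings. Whether the embeddings in this notebook are exactly unit length is an assumption, although the tied scores are consistent with it.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def unit(v):
    """Scale v to length 1."""
    return v / np.linalg.norm(v)


rng = np.random.default_rng(0)
a = unit(rng.normal(size=8))  # stand-in for the milk vector
b = unit(rng.normal(size=8))  # stand-in for the espresso vector
total = a + b

sim_a = cosine_sim(total, a)
sim_b = cosine_sim(total, b)
closed_form = float((1 + np.dot(a, b)) / np.linalg.norm(total))
print(np.isclose(sim_a, sim_b), np.isclose(sim_a, closed_form))

# Cosine similarity ignores length, so averaging instead of summing
# gives the same scores and therefore the same ranking.
sim_avg = cosine_sim(total / 2, a)
print(np.isclose(sim_avg, sim_a))
```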


# Microsoft Earnings Call Transcript

Let's tie this back to finance. I have attached some text from a recent [Microsoft earnings call here](https://gist.github.com/hackingthemarkets/1c827a7750384fcf52c84594ef216a2d). Click on "raw" and save the file as a CSV. Upload it to Google Colab as microsoft-earnings.csv (the code below expects it under ./Data/CSV/). Let's use what we just learned to perform a semantic search on the paragraphs of the Microsoft earnings call. We'll start by reading them into a pandas dataframe.

In [13]:
earnings_df = pd.read_csv('./Data/CSV/microsoft-earnings.csv')
earnings_df

Unnamed: 0,text
0,"Thank you, Brett. To start, I want to outline ..."
1,"With that context, this quarter, the Microsoft..."
2,It helps them align their spend with demand an...
3,We are the platform of choice for customers' S...
4,Now to data and AI. With our Microsoft Intelli...
...,...
57,Other income and expense should be roughly $10...
58,"And finally, as a reminder, for Q2 cash flow, ..."
59,And FX should decrease COGS and operating expe...
60,With the high margins in our Windows OEM busin...


Once we have the dataframe, we'll once again compute the embeddings for each line in our CSV file.

In [15]:
earnings_df['embedding'] = earnings_df['text'].apply(lambda x: get_embedding(x, engine=OpenAiEmbedding))
earnings_df.to_csv('./Data/CSV/earnings-embeddings.csv')

If you download the earnings-embeddings.csv file locally and open it up, you'll see that our embeddings are for entire paragraphs, not just words. This means that we can find relevant passages even if they contain no exact match for the string we search for. We are searching on meaning.

In [16]:
earnings_search = input("Search earnings for a sentence:")

Search earnings for a sentence: Cloud Service


In [17]:
earnings_search_vector = get_embedding(earnings_search, engine=OpenAiEmbedding)

In [18]:
earnings_df["similarities"] = earnings_df['embedding'].apply(lambda x: cosine_similarity(x, earnings_search_vector))
earnings_df

Unnamed: 0,text,embedding,similarities
0,"Thank you, Brett. To start, I want to outline ...","[-0.009504559449851513, -0.003731543431058526,...",0.751764
1,"With that context, this quarter, the Microsoft...","[-0.0016425022622570395, -0.028921114280819893...",0.819800
2,It helps them align their spend with demand an...,"[0.00882813148200512, -0.031995125114917755, 0...",0.797868
3,We are the platform of choice for customers' S...,"[0.011994920670986176, -0.024179913103580475, ...",0.823995
4,Now to data and AI. With our Microsoft Intelli...,"[-0.004754434805363417, 0.0038801338523626328,...",0.805990
...,...,...,...
57,Other income and expense should be roughly $10...,"[-0.01832527294754982, -0.014160438440740108, ...",0.690947
58,"And finally, as a reminder, for Q2 cash flow, ...","[-0.012947804294526577, -0.010494815185666084,...",0.717691
59,And FX should decrease COGS and operating expe...,"[0.0009612650028429925, -0.01565629616379738, ...",0.774287
60,With the high margins in our Windows OEM busin...,"[0.010544494725763798, -0.03846913203597069, -...",0.751667


In [19]:
earnings_df.sort_values("similarities", ascending=False)

Unnamed: 0,text,embedding,similarities
5,"Cosmos DB now supports postscript SQL, making ...","[-0.004414066206663847, -0.005979579407721758,...",0.829130
11,"All up more than 400,000 organizations now use...","[-0.001035509048961103, -0.020460907369852066,...",0.824110
3,We are the platform of choice for customers' S...,"[0.011994920670986176, -0.024179913103580475, ...",0.823995
12,Our cloud for sustainability is off to a fast ...,"[0.008903194218873978, -0.01629571244120598, 0...",0.823931
1,"With that context, this quarter, the Microsoft...","[-0.0016425022622570395, -0.028921114280819893...",0.819800
...,...,...,...
29,"Thank you, Satya, and good afternoon, everyone...","[0.013616200536489487, -0.01455166470259428, -...",0.723732
54,"In More Personal Computing, we expect revenue ...","[0.007634567562490702, -0.031139109283685684, ...",0.718495
58,"And finally, as a reminder, for Q2 cash flow, ...","[-0.012947804294526577, -0.010494815185666084,...",0.717691
46,"This quarter, other income and expense was $54...","[-0.030022066086530685, -0.006907156202942133,...",0.695648
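
Since the dataframe display cuts each paragraph short, it can help to print the top matches as readable lines. The helper below is a sketch: show_top is a hypothetical name, it assumes 'text' and 'similarities' columns like those in earnings_df, and the sample rows are invented for illustration rather than taken from the transcript.

```python
import pandas as pd


def show_top(df, k=3, width=80):
    """Return the k highest-scoring rows as 'score  text' lines."""
    lines = []
    top = df.sort_values("similarities", ascending=False).head(k)
    for row in top.itertuples():
        # Shorten long paragraphs so each result fits on one line.
        snippet = row.text if len(row.text) <= width else row.text[: width - 3] + "..."
        lines.append(f"{row.similarities:.3f}  {snippet}")
    return lines


# Invented rows standing in for earnings_df.
demo = pd.DataFrame({
    "text": [
        "Our cloud platform grew this quarter.",
        "Thank you all for joining the call.",
        "Other income and expense was roughly flat.",
    ],
    "similarities": [0.9, 0.5, 0.7],
})
for line in show_top(demo, k=2):
    print(line)
```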
