<a href="https://colab.research.google.com/github/filipecalegario/ref-aulas-criacomp/blob/main/2023_1_CRIACOMP_Word_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OpenAI Word Embeddings, Semantic Search

Word embeddings are a way of representing words and phrases as vectors. They can be used for a variety of tasks, including semantic search, anomaly detection, and classification. In the video on OpenAI Whisper, I mentioned that words whose vectors are numerically close also tend to be close in meaning. In this tutorial, we will learn how to implement semantic search using OpenAI embeddings. Understanding embeddings will be crucial for the next several videos in this series, since we will use them to build several practical applications.

To get started, we will need to install and import OpenAI and input an API Key. We learned how to do this in [Video 3 of this series](https://www.youtube.com/watch?v=LWYgjcZye1c).

In [None]:
# openai.embeddings_utils (used below) only exists in pre-1.0 releases of the package
!pip install "openai<1" -q


In [None]:
import openai
import pandas as pd
import numpy as np
from getpass import getpass

openai.api_key = getpass()

··········


In [None]:
openai.organization = ''  # optional: the pre-1.0 library reads the organization ID from this attribute

# Read Data File Containing Words

Now that we have configured OpenAI, let's start with a simple CSV file with familiar words. From here we'll build up to a more complex semantic search using sentences from the Fed speech. [Save the linked "words.csv" as a CSV](https://gist.github.com/hackingthemarkets/25240a55e463822d221539e79d91a8d0) and upload it to Google Colab. Once the file is uploaded, let's read it into a pandas dataframe using the code below:

In [None]:
df = pd.read_csv('words.csv')
print(df)

            text
0            red
1       potatoes
2           soda
3         cheese
4          water
5           blue
6         crispy
7      hamburger
8         coffee
9          green
10          milk
11      la croix
12        yellow
13     chocolate
14  french fries
15         latte
16          cake
17         brown
18  cheeseburger
19      espresso
20    cheesecake
21         black
22         mocha
23         fizzy
24        carbon
25        banana


# Calculate Word Embeddings

To use word embeddings for semantic search, you first compute the embeddings for a corpus of text using an embedding model. What does this mean? We are going to create a numerical representation (a vector) of each of these words. To perform this computation, we'll use the `get_embedding` helper from `openai.embeddings_utils` with the text-embedding-ada-002 model.

Since we have our words in a pandas dataframe, we can use "apply" to apply the get_embedding function to each row in the dataframe. We then store the calculated word embeddings in a new text file called "word_embeddings.csv" so that we don't have to call OpenAI again to perform these calculations.

In [None]:
from openai.embeddings_utils import get_embedding  # must be imported before the first call
get_embedding("the fox crossed the road", engine='text-embedding-ada-002')

[-0.0005497358506545424,
 0.0003819362900685519,
 -0.020266875624656677,
 0.007079525385051966,
 -0.013881421647965908,
 0.0253272857517004,
 -0.02231123112142086,
 -0.02231123112142086,
 0.013868802227079868,
 -0.03313874080777168,
 0.029075268656015396,
 0.008802083320915699,
 0.03291159123182297,
 -0.015875298529863358,
 0.004849032964557409,
 -0.001055303611792624,
 0.019547566771507263,
 -0.0009748543961904943,
 0.01838657446205616,
 -0.03069056198000908,
 -0.00983688049018383,
 0.029504330828785896,
 0.0022746603935956955,
 -0.02371199242770672,
 -0.003902572439983487,
 0.006407538428902626,
 0.015269564464688301,
 -0.01379308570176363,
 -0.0027920587453991175,
 -0.01162253599613905,
 0.008139560930430889,
 0.0019213149789720774,
 -0.02884811908006668,
 0.0005071451305411756,
 -0.01379308570176363,
 -0.017402255907654762,
 -0.001864527352154255,
 -0.008713747374713421,
 0.011281810700893402,
 -0.027384260669350624,
 0.0271066315472126,
 0.005319108720868826,
 -0.01665770635008812

In [None]:
from openai.embeddings_utils import get_embedding

df['embedding'] = df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
df.to_csv('word_embeddings.csv')
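
The caching idea described above (compute once, then reuse the saved file) can be sketched as a small helper. This is a minimal sketch, not part of the original notebook: `load_or_compute_embeddings` and `fake_embedding` are hypothetical names, and `fake_embedding` is an offline stand-in that does not produce real embeddings. In the notebook you would pass something like `lambda x: get_embedding(x, engine='text-embedding-ada-002')` as `embed_fn`.

```python
import os
import tempfile

import pandas as pd


def load_or_compute_embeddings(texts, cache_path, embed_fn):
    """Return a dataframe with 'text' and 'embedding' columns, reusing the CSV cache if it exists."""
    if os.path.exists(cache_path):
        # Cache hit: no embedding calls are made.
        # Note: embeddings read back from CSV are strings and still need parsing.
        return pd.read_csv(cache_path)
    df = pd.DataFrame({"text": list(texts)})
    df["embedding"] = df["text"].apply(embed_fn)
    df.to_csv(cache_path, index=False)
    return df


# Hypothetical stand-in so the sketch runs offline; it is NOT a real embedding.
calls = []


def fake_embedding(text):
    calls.append(text)
    return [float(len(text)), float(text.count("e"))]


cache_path = os.path.join(tempfile.mkdtemp(), "word_embeddings.csv")
first = load_or_compute_embeddings(["red", "cheese"], cache_path, fake_embedding)
second = load_or_compute_embeddings(["red", "cheese"], cache_path, fake_embedding)
print(len(calls))  # the second call is served from the cache file
```

Passing `index=False` to `to_csv` is a design choice in this sketch: it avoids the extra `Unnamed: 0` column that appears when the index is written and then read back.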

# Semantic Search

Now that we have our word embeddings stored, let's load them into a new dataframe and use it for semantic search. Since the 'embedding' column in the CSV is stored as a string, we'll use apply() with eval to interpret this string as Python code, then convert the resulting list to a numpy array so that we can perform calculations on it.

In [None]:
df = pd.read_csv('word_embeddings.csv')
df['embedding'] = df['embedding'].apply(eval).apply(np.array)
df

Unnamed: 0.1,Unnamed: 0,text,embedding
0,0,red,"[-7.451758574461564e-05, -0.024687238037586212..."
1,1,potatoes,"[0.00496138958260417, -0.03108060173690319, 0...."
2,2,soda,"[0.025804834440350533, -0.007458584848791361, ..."
3,3,cheese,"[-0.003219075035303831, -0.008848825469613075,..."
4,4,water,"[0.019021987915039062, -0.012530029751360416, ..."
5,5,blue,"[0.005356860347092152, -0.00736568309366703, 0..."
6,6,crispy,"[-0.00097661220934242, -0.005434627644717693, ..."
7,7,hamburger,"[-0.013190791942179203, -0.0018121899338439107..."
8,8,coffee,"[-0.0007453818107023835, -0.019452422857284546..."
9,9,green,"[0.015370187349617481, -0.01084914244711399, 0..."
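
The cell above uses `eval`, which executes whatever Python code the string contains. If the CSV might come from somewhere you don't fully trust, one more defensive option is `ast.literal_eval` from the standard library, which only parses literals such as lists and numbers. Here is a minimal sketch of the same round trip, using made-up toy vectors and an in-memory buffer instead of a real file:

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy vectors standing in for real embeddings (assumed values, for illustration only).
original = pd.DataFrame({
    "text": ["red", "blue"],
    "embedding": [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],
})

# Round-trip through CSV text, as the notebook does through a file.
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# After reading, each embedding is a string such as "[0.1, -0.2, 0.3]".
# literal_eval parses it into a list without executing arbitrary code.
loaded["embedding"] = loaded["embedding"].apply(
    lambda s: np.array(ast.literal_eval(s))
)
print(loaded["embedding"][0])
```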


Let's now prompt ourselves for a search term that isn't in the dataframe. We'll use word embeddings to perform a semantic search for the words that are most similar to the word we entered. I'll first try the word "hot dog". Then we'll come back and try the word "yellow".

In [None]:
search_term = input('Enter a search term: ')


Enter a search term: cabelo


Now that we have a search term, let's calculate an embedding or vector for that search term using the OpenAI get_embedding function.

In [None]:
# semantic search
search_term_vector = get_embedding(search_term, engine="text-embedding-ada-002")
search_term_vector

[-0.01212204061448574,
 0.006593022495508194,
 0.015846053138375282,
 -0.021457405760884285,
 -0.0008352111326530576,
 0.011431705206632614,
 -0.004449181724339724,
 -0.016061387956142426,
 0.01940539851784706,
 -0.009557032026350498,
 0.019494066014885902,
 -0.0018857563845813274,
 -0.0030336768832057714,
 -0.01206504087895155,
 -0.02473808452486992,
 0.0028151762671768665,
 0.028880096971988678,
 0.009937033988535404,
 0.02971610054373741,
 -0.00440168147906661,
 -0.002094757044687867,
 -0.003242677543312311,
 -0.008220694027841091,
 -0.013705379329621792,
 -0.0060483538545668125,
 -0.0041388473473489285,
 0.020140068605542183,
 -0.03143877163529396,
 0.0018746729474514723,
 -0.038506798446178436,
 0.03186944127082825,
 -0.007378358393907547,
 -0.015466052107512951,
 -0.02064673602581024,
 -0.018569396808743477,
 -0.008651362732052803,
 0.0067703560926020145,
 -0.020064067095518112,
 -0.012932710349559784,
 -0.004873516503721476,
 -0.007587358821183443,
 -0.003325011348351836,
 -0.00

Once we have a vector representing that word, we can see how similar it is to other words in our dataframe by calculating the cosine similarity of our search term's word vector to each word embedding in our dataframe.

In [None]:
from openai.embeddings_utils import cosine_similarity

df["similarities"] = df['embedding'].apply(lambda x: cosine_similarity(x, search_term_vector))

df

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
0,0,red,"[-7.451758574461564e-05, -0.024687238037586212...",0.78444
1,1,potatoes,"[0.00496138958260417, -0.03108060173690319, 0....",0.786962
2,2,soda,"[0.025804834440350533, -0.007458584848791361, ...",0.776095
3,3,cheese,"[-0.003219075035303831, -0.008848825469613075,...",0.775896
4,4,water,"[0.019021987915039062, -0.012530029751360416, ...",0.776406
5,5,blue,"[0.005356860347092152, -0.00736568309366703, 0...",0.764256
6,6,crispy,"[-0.00097661220934242, -0.005434627644717693, ...",0.774372
7,7,hamburger,"[-0.013190791942179203, -0.0018121899338439107...",0.786444
8,8,coffee,"[-0.0007453818107023835, -0.019452422857284546...",0.773416
9,9,green,"[0.015370187349617481, -0.01084914244711399, 0...",0.767451


# Sorting By Similarity

Now that we have calculated the similarity to each term in our dataframe, we simply sort by the similarity values to find the terms closest to our search term. With "hot dog" as the search term, notice how the foods rank as most similar, with fast food closest of all. Some colors also rank closer to hot dog than others. (The output recorded below is from a run with the search term "cabelo" entered above, so its ranking differs.) Let's go back and try the word "yellow" and walk through the results.

In [None]:
df.sort_values("similarities", ascending=False).head(20)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
19,19,espresso,"[-0.022490037605166435, -0.012700098566710949,...",0.791579
16,16,cake,"[-0.013701263815164566, -0.016818759962916374,...",0.788505
13,13,chocolate,"[0.0014960544649511576, -0.013033152557909489,...",0.787033
1,1,potatoes,"[0.00496138958260417, -0.03108060173690319, 0....",0.786962
7,7,hamburger,"[-0.013190791942179203, -0.0018121899338439107...",0.786444
10,10,milk,"[0.0008786617545410991, -0.019358297809958458,...",0.785437
14,14,french fries,"[0.001444610534235835, -0.016480565071105957, ...",0.785028
0,0,red,"[-7.451758574461564e-05, -0.024687238037586212...",0.78444
23,23,fizzy,"[-0.01298743300139904, -0.010277776047587395, ...",0.780707
17,17,brown,"[-0.0034083013888448477, -0.015826432034373283...",0.778133
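
The score-then-sort steps above can be wrapped in one reusable function. The sketch below is an illustration rather than the notebook's own code: `cosine_sim` and `top_matches` are hypothetical helper names, and the three-dimensional vectors are made up so the example runs without calling OpenAI.

```python
import numpy as np
import pandas as pd


def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def top_matches(df, query_vector, k=3):
    """Return the k rows whose 'embedding' is most similar to query_vector."""
    scored = df.copy()  # leave the caller's dataframe untouched
    scored["similarities"] = scored["embedding"].apply(
        lambda v: cosine_sim(v, query_vector)
    )
    return scored.sort_values("similarities", ascending=False).head(k)


# Toy data: made-up 3-d vectors, not real embeddings.
toy_df = pd.DataFrame({
    "text": ["espresso", "latte", "blue"],
    "embedding": [
        np.array([1.0, 0.1, 0.0]),
        np.array([0.9, 0.3, 0.0]),
        np.array([0.0, 0.1, 1.0]),
    ],
})
query = np.array([1.0, 0.0, 0.0])
result = top_matches(toy_df, query, k=2)
print(result[["text", "similarities"]])
```

With the real data, the call would look like `top_matches(df, search_term_vector, k=20)`.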


# Adding Words Together

What's even more interesting is that we can add word vectors together. What happens when we add the numbers for milk and espresso, then search for the word vector most similar to milk + espresso? Let's make a copy of the original dataframe and call it food_df. We'll operate on this copy. Let's add milk + espresso and store the result in milk_espresso_vector.

In [None]:
food_df = df.copy()

milk_vector = food_df['embedding'][10]
espresso_vector = food_df['embedding'][19]

milk_espresso_vector = milk_vector + espresso_vector
milk_espresso_vector

array([-0.02161138, -0.0320584 , -0.01621401, ..., -0.00422706,
        0.00090106, -0.02885086])

Now let's find the words most similar to milk + espresso. If you have never done this before, it's pretty surprising that you can add words together like this and find similar words using numbers.

In [None]:
food_df["similarities"] = food_df['embedding'].apply(lambda x: cosine_similarity(x, milk_espresso_vector))
food_df.sort_values("similarities", ascending=False)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
10,10,milk,"[0.0008786617545410991, -0.019358297809958458,...",0.960518
19,19,espresso,"[-0.022490037605166435, -0.012700098566710949,...",0.960518
15,15,latte,"[-0.015595527365803719, -0.003943499177694321,...",0.922984
22,22,mocha,"[-0.012470506131649017, -0.026239056140184402,...",0.899248
8,8,coffee,"[-0.0007453818107023835, -0.019452422857284546...",0.895397
3,3,cheese,"[-0.003219075035303831, -0.008848825469613075,...",0.884538
13,13,chocolate,"[0.0014960544649511576, -0.013033152557909489,...",0.883478
2,2,soda,"[0.025804834440350533, -0.007458584848791361, ...",0.874156
4,4,water,"[0.019021987915039062, -0.012530029751360416, ...",0.866097
7,7,hamburger,"[-0.013190791942179203, -0.0018121899338439107...",0.85263
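
Notice that milk and espresso tie at exactly 0.960518 in the table above. A plausible explanation, assuming both embeddings have the same length (OpenAI describes its embeddings as normalized to length 1): when two vectors a and b are equally long, the cosine similarity of a with a + b equals that of b with a + b. The sketch below checks this with made-up unit vectors; `unit` and `cosine_sim` are hypothetical helpers, and the numbers are not real embeddings.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def unit(v):
    """Scale a vector to length 1."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)


# Made-up directions standing in for the milk and espresso embeddings.
milk = unit([0.2, 0.9, 0.1])
espresso = unit([0.8, 0.3, 0.4])
combined = milk + espresso

sim_milk = cosine_sim(milk, combined)
sim_espresso = cosine_sim(espresso, combined)
print(sim_milk, sim_espresso)  # the two values agree up to floating-point rounding
```

Since the two input words will always rank at the top, you may want to drop their rows (for example with `food_df.drop(index=[10, 19])`) before sorting, so that the first result is the nearest *other* word.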


# Microsoft Earnings Call Transcript

Let's tie this back to finance. I have attached some text from a recent [Microsoft earnings call here](https://gist.github.com/hackingthemarkets/1c827a7750384fcf52c84594ef216a2d). Click on "raw" and save the file as a CSV. Upload it to Google Colab as microsoft-earnings.csv. Let's use what we just learned to perform a semantic search on sentences in the Microsoft earnings call. We'll start by reading the paragraphs into a pandas dataframe.

In [None]:
earnings_df = pd.read_csv('microsoft-earnings.csv')
earnings_df

Unnamed: 0,text
0,"Thank you, Brett. To start, I want to outline ..."
1,"With that context, this quarter, the Microsoft..."
2,It helps them align their spend with demand an...
3,We are the platform of choice for customers' S...
4,Now to data and AI. With our Microsoft Intelli...
...,...
57,Other income and expense should be roughly $10...
58,"And finally, as a reminder, for Q2 cash flow, ..."
59,And FX should decrease COGS and operating expe...
60,With the high margins in our Windows OEM busin...


Once we have the dataframe, we'll once again compute the embeddings for each line in our CSV file.

In [None]:
earnings_df['embedding'] = earnings_df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
earnings_df.to_csv('earnings-embeddings.csv')

If you download the earnings-embeddings.csv file locally and open it up, you'll see that our embeddings are for entire paragraphs, not just words. This means that we'll be able to find similar passages even if there isn't an exact match for the string we search for. We are searching on meaning.

In [None]:
earnings_search = input("Search earnings for a sentence:")

Search earnings for a sentence:artificial intelligence demand cloud products


In [None]:

earnings_search_vector = get_embedding(earnings_search, engine="text-embedding-ada-002")
earnings_search_vector

[-0.010817994363605976,
 -0.02560822106897831,
 -0.0023312214761972427,
 0.0036306483671069145,
 0.020452769473195076,
 0.027453476563096046,
 -0.02305866777896881,
 -0.00854311604052782,
 0.0005885277641937137,
 -0.03546836972236633,
 0.022776948288083076,
 0.022974152117967606,
 0.010162997990846634,
 -0.004535669460892677,
 -0.01735386624932289,
 0.0059794774278998375,
 0.02036825381219387,
 0.007099308539181948,
 0.0042891656048595905,
 -0.02156555838882923,
 -0.010691220872104168,
 0.025185642763972282,
 0.015184632502496243,
 -0.004394810181111097,
 -0.009655904956161976,
 -0.01467753853648901,
 0.02812960185110569,
 -0.036510732024908066,
 0.001288862549699843,
 0.011339173652231693,
 0.024255970492959023,
 -0.01767784170806408,
 -0.0077824764885008335,
 -0.005585071165114641,
 -0.010205255821347237,
 0.002774928230792284,
 -0.0033559727016836405,
 -0.012205458246171474,
 0.018734287470579147,
 -0.006662644911557436,
 0.016015702858567238,
 0.01366335153579712,
 0.01407184358686

In [None]:

earnings_df["similarities"] = earnings_df['embedding'].apply(lambda x: cosine_similarity(x, earnings_search_vector))

earnings_df


Unnamed: 0,text,embedding,similarities
0,"Thank you, Brett. To start, I want to outline ...","[-0.009504559449851513, -0.003731543431058526,...",0.749671
1,"With that context, this quarter, the Microsoft...","[-0.0016425022622570395, -0.028921114280819893...",0.800683
2,It helps them align their spend with demand an...,"[0.008828130550682545, -0.03199512138962746, 0...",0.796945
3,We are the platform of choice for customers' S...,"[0.011994918808341026, -0.024179909378290176, ...",0.800598
4,Now to data and AI. With our Microsoft Intelli...,"[-0.004754434805363417, 0.0038801338523626328,...",0.820970
...,...,...,...
57,Other income and expense should be roughly $10...,"[-0.01832527294754982, -0.014160438440740108, ...",0.688968
58,"And finally, as a reminder, for Q2 cash flow, ...","[-0.012947804294526577, -0.010494815185666084,...",0.715620
59,And FX should decrease COGS and operating expe...,"[0.0009612650028429925, -0.01565629616379738, ...",0.771278
60,With the high margins in our Windows OEM busin...,"[0.010544494725763798, -0.03846913203597069, -...",0.757963


In [None]:
earnings_df.sort_values("similarities", ascending=False)

Unnamed: 0,text,embedding,similarities
5,"Cosmos DB now supports postscript SQL, making ...","[-0.00441406574100256, -0.005979578942060471, ...",0.846190
12,Our cloud for sustainability is off to a fast ...,"[0.008922123350203037, -0.016263997182250023, ...",0.823616
4,Now to data and AI. With our Microsoft Intelli...,"[-0.004754434805363417, 0.0038801338523626328,...",0.820970
11,"All up more than 400,000 organizations now use...","[-0.001035509048961103, -0.020460907369852066,...",0.813934
9,Power Automate has more than seven million mon...,"[-0.025241363793611526, -0.034008558839559555,...",0.806075
...,...,...,...
29,"Thank you, Satya, and good afternoon, everyone...","[0.013616200536489487, -0.01455166470259428, -...",0.718449
44,Operating expenses increased 2% and 5% in cons...,"[0.017512913793325424, 0.008054900914430618, 0...",0.716947
58,"And finally, as a reminder, for Q2 cash flow, ...","[-0.012947804294526577, -0.010494815185666084,...",0.715620
57,Other income and expense should be roughly $10...,"[-0.01832527294754982, -0.014160438440740108, ...",0.688968


# Sentences of the Fed Speech

Let's use the Fed speech example once more. We'll calculate the embeddings for each sentence in the November 2nd speech that we discussed in the OpenAI Whisper tutorial. Then we'll take a new sentence that isn't in our dataset, such as one from a future speech, and find the most similar sentences in our dataset. Here is the sentence we will use to search for similarity:

"the inflation is too damn high"

As we did previously, take [the linked CSV file](https://gist.github.com/hackingthemarkets/9b55ea8b73c7f4e04b42a9f8eddb8393) and upload it to Google Colab as fed-speech.csv. We'll once again read it into a pandas dataframe.

In [None]:
fed_df = pd.read_csv('fed-speech.csv')
fed_df

Unnamed: 0,text
0,Good afternoon
1,My colleagues and I are strongly committed to ...
2,We have both the tools that we need and the re...
3,Price stability is the responsibility of the F...
4,"Without price stability, the economy does not ..."
5,"In particular, without price stability, we wil..."
6,"Today, the FOMC raised our policy interest rat..."
7,We are moving our policy stance purposefully t...
8,"In addition, we are continuing the process of ..."
9,Restoring price stability will likely require ...


We'll once again calculate the embeddings and save them in a new CSV file.

In [None]:
fed_df['embedding'] = fed_df['text'].apply(lambda x: get_embedding(x, engine='text-embedding-ada-002'))
fed_df.to_csv('fed-embeddings.csv')

We'll then enter the new sentence that we want to find similar sentences for. The run below uses "the inflation is too damn high"; another sentence worth trying is:

"We will continue to increase interest rates and tighten monetary policy"

In [None]:
fed_sentence = input('Enter something Jerome Powell said: ')


Enter something Jerome Powell said: the inflation is too damn high


Again we'll get the vector for this sentence, find the cosine similarity, and sort by most similar.

In [None]:
fed_sentence_vector = get_embedding(fed_sentence, engine="text-embedding-ada-002")
fed_sentence_vector

[-0.00413066940382123,
 -0.011251280084252357,
 -0.005313646513968706,
 -0.02224256657063961,
 -0.012122263200581074,
 0.0024195776786655188,
 -0.03860924765467644,
 -0.005732887890189886,
 -0.016691673547029495,
 -0.0204096008092165,
 0.022372564300894737,
 0.006987363565713167,
 0.023464541882276535,
 0.006652620155364275,
 0.014026726596057415,
 0.011277279816567898,
 0.0338253416121006,
 0.007643850985914469,
 0.02031860314309597,
 -0.015677694231271744,
 0.0025706999003887177,
 0.011101783253252506,
 -0.0122522609308362,
 -0.0034319330006837845,
 -0.020214606076478958,
 -0.0012877873377874494,
 0.016340680420398712,
 -0.02594749443233013,
 -0.0051089003682136536,
 -0.002343204338103533,
 0.007513853255659342,
 -0.0077023496851325035,
 -0.03166738152503967,
 -0.0024634518194943666,
 -0.020019609481096268,
 -0.03564530611038208,
 -0.013870729133486748,
 -0.016990669071674347,
 -0.0031215641647577286,
 -0.00859933253377676,
 0.026168489828705788,
 -0.010932786390185356,
 0.0133507391

In [None]:
fed_df = pd.read_csv('fed-embeddings.csv')
fed_df['embedding'] = fed_df['embedding'].apply(eval).apply(np.array)
fed_df


Unnamed: 0.1,Unnamed: 0,text,embedding
0,0,Good afternoon,"[-0.017524775117635727, 0.02069251798093319, -..."
1,1,My colleagues and I are strongly committed to ...,"[-0.026972517371177673, -0.012394015677273273,..."
2,2,We have both the tools that we need and the re...,"[0.003941578324884176, -0.015006175264716148, ..."
3,3,Price stability is the responsibility of the F...,"[0.009378707036376, -0.016561055555939674, -0...."
4,4,"Without price stability, the economy does not ...","[-0.003026996273547411, -0.014454687014222145,..."
5,5,"In particular, without price stability, we wil...","[-0.03618694841861725, -0.008898851461708546, ..."
6,6,"Today, the FOMC raised our policy interest rat...","[-0.024621201679110527, -0.02114815264940262, ..."
7,7,We are moving our policy stance purposefully t...,"[-0.025701606646180153, -0.012234759517014027,..."
8,8,"In addition, we are continuing the process of ...","[-0.03149143233895302, 0.0019273122306913137, ..."
9,9,Restoring price stability will likely require ...,"[-0.010953230783343315, -0.020290518179535866,..."


In [None]:

fed_df["similarities"] = fed_df['embedding'].apply(lambda x: cosine_similarity(x, fed_sentence_vector))

fed_df


Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
0,0,Good afternoon,"[-0.017524775117635727, 0.02069251798093319, -...",0.750047
1,1,My colleagues and I are strongly committed to ...,"[-0.026972517371177673, -0.012394015677273273,...",0.826724
2,2,We have both the tools that we need and the re...,"[0.003941578324884176, -0.015006175264716148, ...",0.770154
3,3,Price stability is the responsibility of the F...,"[0.009378707036376, -0.016561055555939674, -0....",0.775339
4,4,"Without price stability, the economy does not ...","[-0.003026996273547411, -0.014454687014222145,...",0.80408
5,5,"In particular, without price stability, we wil...","[-0.03618694841861725, -0.008898851461708546, ...",0.775005
6,6,"Today, the FOMC raised our policy interest rat...","[-0.024621201679110527, -0.02114815264940262, ...",0.787081
7,7,We are moving our policy stance purposefully t...,"[-0.025701606646180153, -0.012234759517014027,...",0.812895
8,8,"In addition, we are continuing the process of ...","[-0.03149143233895302, 0.0019273122306913137, ...",0.745955
9,9,Restoring price stability will likely require ...,"[-0.010953230783343315, -0.020290518179535866,...",0.790525


In [None]:

fed_df.sort_values("similarities", ascending=False)

Unnamed: 0.1,Unnamed: 0,text,embedding,similarities
24,24,The recent inflation data again have come in h...,"[-0.021040253341197968, -0.009753845632076263,...",0.871317
22,22,Inflation remains well above our longer run go...,"[-0.023937253281474113, -0.0032772799022495747...",0.869225
31,31,My colleagues and I are acutely aware that hig...,"[-0.011414038017392159, -0.01515731681138277, ...",0.847498
29,29,The longer the current amount of high inflatio...,"[-0.018355058506131172, -0.012731979601085186,...",0.847374
32,32,We are highly attentive to the risks that high...,"[-0.025864068418741226, -0.015762366354465485,...",0.833747
27,27,"Despite elevated inflation, longer term inflat...","[-0.023557519540190697, -0.024205774068832397,...",0.828953
1,1,My colleagues and I are strongly committed to ...,"[-0.026972517371177673, -0.012394015677273273,...",0.826724
37,37,"It will take time, however, for the full effec...","[-0.02066067047417164, -0.018034202978014946, ...",0.826675
46,46,Reducing inflation is likely to require a sust...,"[-0.03423553332686424, -0.014666956849396229, ...",0.823727
26,26,Russia's war against Ukraine has boosted price...,"[-0.009621184319257736, -0.019101163372397423,...",0.818095
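
The same get-vector, score, and sort pattern has now appeared three times. For larger datasets, one possible variation is to stack all embeddings into a single matrix and compute every cosine similarity in one step instead of row by row. This is a minimal sketch under stated assumptions: `rank_by_similarity` is a hypothetical helper, and the short texts and two-dimensional vectors are toy values, not real embeddings of the speech.

```python
import numpy as np
import pandas as pd


def rank_by_similarity(df, query_vector):
    """Score every row of df against query_vector at once and sort by similarity."""
    matrix = np.vstack(df["embedding"].tolist())  # shape: (rows, dimensions)
    query = np.asarray(query_vector, dtype=float)
    # Dot product of every row with the query, divided by the product of lengths.
    scores = (matrix @ query) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    )
    return df.assign(similarities=scores).sort_values(
        "similarities", ascending=False
    )


# Toy data: made-up texts and 2-d vectors for illustration only.
toy_df = pd.DataFrame({
    "text": ["Good afternoon", "Inflation remains high", "Rates were raised"],
    "embedding": [
        np.array([0.0, 1.0]),
        np.array([1.0, 0.1]),
        np.array([0.6, 0.6]),
    ],
})
ranked = rank_by_similarity(toy_df, np.array([1.0, 0.0]))
print(ranked[["text", "similarities"]])
```

With the real data, this would be called as `rank_by_similarity(fed_df, fed_sentence_vector)` once the embedding column holds numpy arrays.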


# Calculating Cosine Similarity

We used the cosine_similarity function, but how does it actually work? Cosine similarity measures how closely two vectors point in the same direction: it is the cosine of the angle between them, computed as their dot product divided by the product of their magnitudes. Let's work through it step by step with two small vectors.

![](https://drive.google.com/uc?export=view&id=1cehvtx7LKuFeq_LqfnLi-gzIz1D1wSf9)
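
As a worked example, here is the standard cosine similarity formula written out for the two small vectors used in the cells below, $\mathbf{v}_1 = (1, 2, 3)$ and $\mathbf{v}_2 = (4, 5, 6)$:

$$
\cos\theta
= \frac{\mathbf{v}_1 \cdot \mathbf{v}_2}{\lVert \mathbf{v}_1 \rVert \, \lVert \mathbf{v}_2 \rVert}
= \frac{1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6}{\sqrt{1^2 + 2^2 + 3^2} \, \sqrt{4^2 + 5^2 + 6^2}}
= \frac{32}{\sqrt{14} \, \sqrt{77}}
\approx 0.9746
$$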

In [None]:
v1 = np.array([1,2,3])
v2 = np.array([4,5,6])

# (1 * 4) + (2 * 5) + (3 * 6)
dot_product = np.dot(v1, v2)
dot_product

32

In [None]:
# square root of (1^2 + 2^2 + 3^2) = square root of (1+4+9) = square root of 14
np.linalg.norm(v1)

3.7416573867739413

In [None]:
# square root of (4^2 + 5^2 + 6^2) = square root of (16+25+36) = square root of 77
np.linalg.norm(v2)

8.774964387392123

In [None]:
magnitude = np.linalg.norm(v1) * np.linalg.norm(v2)
magnitude

32.83291031876401

In [None]:
dot_product / magnitude

0.9746318461970762

In [None]:
from scipy import spatial

result = 1 - spatial.distance.cosine(v1, v2)

result

0.9746318461970761