# OpenAI Word Embeddings, Semantic Search

Word embeddings represent words and phrases as numeric vectors. They can be used for a variety of tasks, including semantic search, anomaly detection, and classification. In the video on OpenAI Whisper, I mentioned that words whose vectors are numerically similar are also similar in meaning. In this tutorial, we will learn how to implement semantic search using OpenAI embeddings. Understanding embeddings will be crucial to the next several videos in this series, since we will use them to build several practical applications.

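To get started, we need to install and import the `openai` package and enter an API key. We learned how to do this in [Video 3 of this series](https://www.youtube.com/watch?v=LWYgjcZye1c). Note that the code in this notebook uses the pre-1.0 interface of the `openai` Python package (`openai.embeddings_utils` and `openai.Completion`).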

Video explaining this Notebook:
https://www.youtube.com/watch?v=xzHhZh7F25I&t=954s

In [2]:
!pip install openai -q


In [3]:
import openai
import pandas as pd
import numpy as np
from getpass import getpass

openai.api_key = getpass()

··········


In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Microsoft Earnings Call Transcript

Let's tie this back to finance. I have attached some text from a recent [Microsoft earnings call here](https://gist.github.com/hackingthemarkets/1c827a7750384fcf52c84594ef216a2d). Click on "raw" and save the file as a CSV. Upload it to Google Colab as microsoft-earnings.csv. Let's use what we just learned to perform a semantic search on sentences in the Microsoft earnings call. We'll start by reading the paragraphs into a pandas dataframe.

In [5]:
from openai.embeddings_utils import get_embedding


In [None]:
earnings_df = pd.read_csv('sample_data/microsoft-earnings.csv')
earnings_df

In [8]:
import ast
def string_to_float_list(s):
    # Use ast.literal_eval() to safely evaluate the string representation of the list
    float_list = ast.literal_eval(s)
    # Convert each item in the list to a float
    float_list = [float(item) for item in float_list]
    return float_list

In [11]:
earnings_embeddings = pd.read_csv('sample_data/earnings-embeddings.csv', index_col=0)
# The embeddings are stored in the CSV as strings, so parse each one back into a list of floats
earnings_embeddings['embedding'] = earnings_embeddings['embedding'].apply(string_to_float_list)
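
The `embedding` column is parsed because a CSV file can only store each vector as text. As a minimal standalone sketch of that step (the `cell` value and the `parse_embedding` name are made up for illustration), the conversion looks roughly like this:

```python
import ast
import json

# A made-up serialized embedding, as it might appear in one CSV cell
cell = "[0.25, -0.5, 1]"

def parse_embedding(s):
    """Parse a string such as "[0.1, 0.2]" into a list of floats."""
    return [float(v) for v in ast.literal_eval(s)]

vec = parse_embedding(cell)
print(vec, type(vec[0]))

# For plain numeric lists, json.loads reads the same text format
assert vec == [float(v) for v in json.loads(cell)]
```

`ast.literal_eval` only evaluates Python literals, which is why it is a safer choice than `eval` for text read from a file.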

In [12]:
len(earnings_embeddings['embedding'][0])

1536

Let's make our query.

In [40]:
earnings_search = input("Please provide your question to the Chatbot: ")

Please provide your question to the Chatbot: Can you tell me about the cloud products


In [42]:

earnings_search_vector = get_embedding(earnings_search, engine="text-embedding-ada-002")


In [None]:
earnings_df

In [43]:
from openai.embeddings_utils import cosine_similarity


## ALEJO Note:
Our dataframe has the following columns:
- **text:** excerpts from the Microsoft earnings call (not the best data).
- **embedding:** the text converted to an embedding, a high-dimensional vector (1,536 values per row here).
- **similarities:** the cosine similarity between the user's question (converted to an embedding) and each row of our knowledge base (the `text` column).
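
To make the `similarities` column more concrete, here is a minimal sketch of cosine similarity in plain Python. The `cosine_sim` name and the three-dimensional toy vectors are assumptions for illustration; the notebook itself relies on `cosine_similarity` from `openai.embeddings_utils` and on real 1,536-dimensional embeddings.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, made up for illustration (not real embeddings)
query = [1.0, 0.0, 1.0]
close = [2.0, 0.0, 2.0]  # points in the same direction as the query
far = [0.0, 1.0, 0.0]    # orthogonal to the query

print(cosine_sim(query, close))  # same direction, so the value is about 1
print(cosine_sim(query, far))    # orthogonal, so the value is 0
```

Only the direction of the vectors matters here, not their length, which is why `close` scores about 1 even though it is twice as long as `query`.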

In [44]:

earnings_embeddings["similarities"] = earnings_embeddings['embedding'].apply(lambda x: cosine_similarity(x, earnings_search_vector))


earnings_embeddings.sort_values('similarities', ascending=False)


| index | text | embedding | similarities |
|---|---|---|---|
| 11 | All up more than 400,000 organizations now use... | [-0.001035509048961103, -0.020460907369852066,... | 0.825284 |
| 12 | Our cloud for sustainability is off to a fast ... | [0.008903194218873978, -0.01629571244120598, 0... | 0.822660 |
| 13 | Microsoft 365 is the cloud-first platform that... | [-0.011653101071715355, -0.021100856363773346,... | 0.815596 |
| 1 | With that context, this quarter, the Microsoft... | [-0.0016284669982269406, -0.028918568044900894... | 0.814623 |
| 3 | We are the platform of choice for customers' S... | [0.011994918808341026, -0.024179909378290176, ... | 0.809728 |
| ... | ... | ... | ... |
| 22 | We once again saw record engagement among our ... | [-0.0400715135037899, -0.012114028446376324, 0... | 0.719108 |
| 44 | Operating expenses increased 2% and 5% in cons... | [0.017512913793325424, 0.008054900914430618, 0... | 0.714871 |
| 58 | And finally, as a reminder, for Q2 cash flow, ... | [-0.01292774174362421, -0.010481415316462517, ... | 0.707471 |
| 46 | This quarter, other income and expense was $54... | [-0.030121874064207077, -0.007023234851658344,... | 0.687930 |


### Function to retrieve the K most similar texts

In [45]:
def get_top_k_similar(df, query_row, k):
    # Work on a copy so the caller's DataFrame is not modified
    df_similarities = df.copy()
    # Compute cosine similarity between the query embedding and every row's embedding
    df_similarities["similarities"] = df_similarities['embedding'].apply(lambda x: cosine_similarity(x, query_row))
    # Keep the text of the k rows with the highest similarity
    top_k = df_similarities.sort_values('similarities', ascending=False).iloc[0:k]['text'].values
    # Return the top k texts joined into one string (the separator is the literal text "/n ", not a newline)
    return "/n ".join(top_k)

We choose a number K of documents to retrieve and concatenate the K most similar ones into a single string.
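
The retrieval step can also be sketched without pandas, which may help to see the logic in isolation. Everything in this block is illustrative: the `top_k_texts` and `cosine_sim` names, the two-dimensional vectors, and the short texts are made up, and this sketch joins the results with a real newline.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_texts(records, query_vec, k):
    """Return the texts of the k (text, embedding) records most similar to query_vec."""
    ranked = sorted(records, key=lambda r: cosine_sim(r[1], query_vec), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy knowledge base with made-up 2-dimensional "embeddings"
docs = [
    ("cloud revenue grew", [0.9, 0.1]),
    ("gaming was flat", [0.1, 0.9]),
    ("Azure adoption increased", [0.8, 0.3]),
]
query_vec = [1.0, 0.0]

# Concatenate the two most similar texts into one context string
context = "\n".join(top_k_texts(docs, query_vec, 2))
print(context)
```

Sorting the whole list is fine for a few dozen rows; for a large knowledge base, a dedicated vector index would likely be a better fit.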

In [46]:
query_result_db = (get_top_k_similar(earnings_embeddings, earnings_search_vector, 3))
query_result_db

"All up more than 400,000 organizations now use our business applications. Now on to Industry Solutions. We are seeing increased adoption of our industry and cross-industry clouds. Bank of Queensland chose our cloud for financial services to deliver new digital experiences for its customers./n Our cloud for sustainability is off to a fast start as organizations like Telstra use the solution to track their environmental footprint. New updates provide insights on hard-to-measure Scope 3 carbon emissions, and we are seeing record growth in healthcare, driven, in part, by our Nuance DAX ambient intelligence solutions, which automatically documents patient encounters at the point of care. Physicians tell us DAX dramatically improves their productivity, and it's quickly becoming an on-ramp to our broader healthcare offerings. Now on to new systems of work, Microsoft 365, Teams, and Viva uniquely enable employees to thrive in today's digitally connected distributed world of work./n Microsoft 

In [47]:
import openai


# Define function to generate text using the OpenAI API
def generate_text(prompt, model="text-davinci-003", max_tokens=1024, temperature=0.1):
    """
    Generates text using the OpenAI API.

    Arguments:
    prompt -- The prompt to use for generating text.
    model -- The name of the model to use (default "text-davinci-003").
    max_tokens -- The maximum number of tokens to generate (default 1024).
    temperature -- The sampling temperature to use (default 0.1).

    Returns:
    The generated text.
    """

    # Call the OpenAI API to generate text
    response = openai.Completion.create(
        engine=model,
        prompt=prompt,
        max_tokens=max_tokens,
        temperature=temperature
    )

    # Get the generated text from the API response
    generated_text = response.choices[0].text

    return generated_text


We combine the user's question with the retrieved documents to build the prompt for GPT-3. The prompt template is defined below.
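
As a hedged sketch of this prompt-building step, the template can be wrapped in a small helper with named placeholders, so the context and the question cannot silently trade places. The `PROMPT_TEMPLATE` and `build_prompt` names are hypothetical and not part of the notebook.

```python
PROMPT_TEMPLATE = '''You are an AI assistant that answers questions to a user.
You are given the following extracted parts of a long document as "Context" and a question.
Use the Context to answer the Question.


Context: "{context}"

Question: "{question}"

Answer the question:'''

def build_prompt(context, question):
    """Fill the template; named fields make the argument order explicit."""
    return PROMPT_TEMPLATE.format(context=context, question=question)

# Placeholder values for illustration only
example = build_prompt(
    context="Retrieved passages would go here.",
    question="Can you tell me about the cloud products",
)
print(example)
```

With positional `{}` fields, the result depends on the order of the arguments passed to `.format()`; keyword fields remove that dependency.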

In [48]:
# The first placeholder receives the retrieved passages, the second the user's question
prompt = '''You are an AI assistant that answers questions to a user.
You are given the following extracted parts of a long document as "Context" and a question.
Use the Context to answer the Question.


Context: "{}"

Question: "{}"

Answer the question:'''.format(query_result_db, earnings_search)


In [49]:
from IPython.display import Markdown, display

def print_markdown(text):
    """
    Renders Markdown-formatted text in the notebook output.
    """
    display(Markdown(text))


In [50]:
print(prompt)

You are an AI assistant that answers questions to a user.
You are given the following extracted parts of a long document as "Context" and a question.
Use the Context to answer the Question.


Context: "All up more than 400,000 organizations now use our business applications. Now on to Industry Solutions. We are seeing increased adoption of our industry and cross-industry clouds. Bank of Queensland chose our cloud for financial services to deliver new digital experiences for its customers./n Our cloud for sustainability is off to a fast start as organizations like Telstra use the solution to track their environmental footprint. New updates provide insights on hard-to-measure Scope 3 carbon emissions, and we are seeing record growth in healthcare, driven, in part, by our Nuance DAX ambient intelligence solutions, which automatically documents patient encounters at the point of care. Physicians tell us DAX dramatically improves their p

## ALEJO Note:

Notice that the prompt needs to be refined: the model sometimes starts confabulating.
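
One commonly suggested refinement is to tell the model explicitly to stay within the retrieved passages and to admit when the answer is not there. The sketch below is a hypothetical variation that was not run in this notebook; the `build_grounded_prompt` name, the wording, and the placeholder passages are assumptions, and the instruction reduces confabulation at best rather than preventing it.

```python
def build_grounded_prompt(passages, question):
    """Build a prompt that asks the model to answer only from the given passages."""
    # Number each passage so the boundaries between them are explicit
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the Question using only the numbered Context passages below.\n"
        'If the Context does not contain the answer, reply "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )

# Placeholder passages for illustration only
passages = ["Passage one (placeholder).", "Passage two (placeholder)."]
print(build_grounded_prompt(passages, "What Companies Use Power Automate?"))
```

Passing the passages as a list, rather than as one pre-joined string, keeps the separator choice inside the prompt builder.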

In [52]:

generated_text = generate_text(prompt)
print_markdown(generated_text)

 Our cloud products are seeing increased adoption across various industries, such as financial services, sustainability, and healthcare. Microsoft 365, Teams, and Viva are cloud-first platforms that support all types of workers and reduce cost and complexity for IT. Microsoft Teams is the de facto standard for collaboration and is used by hundreds of millions of people to meet, call, chat, collaborate, and do business.

In [54]:
"What Companies Use Power Automate?"

'What Companies Use Power Automate?'

In [53]:
earnings_search = input("Please provide your question to the Chatbot: ")

Please provide your question to the Chatbot: "What Companies Use Power Automate?"


In [55]:
earnings_search_vector = get_embedding(earnings_search, engine="text-embedding-ada-002")


In [56]:

earnings_embeddings["similarities"] = earnings_embeddings['embedding'].apply(lambda x: cosine_similarity(x, earnings_search_vector))


earnings_embeddings.sort_values('similarities', ascending=False)

| index | text | embedding | similarities |
|---|---|---|---|
| 9 | Power Automate has more than seven million mon... | [-0.025379547849297523, -0.03403877094388008, ... | 0.882659 |
| 8 | Now on to Power Platform. We are helping custo... | [-0.02147563174366951, -0.013491101562976837, ... | 0.821460 |
| 17 | We've also built a completely new suite for ou... | [-0.03199895843863487, -0.02947564609348774, -... | 0.798388 |
| 19 | Accenture, for example, has deployed Windows 1... | [0.01380225084722042, -0.027254054322838783, 0... | 0.797166 |
| 11 | All up more than 400,000 organizations now use... | [-0.001035509048961103, -0.020460907369852066,... | 0.789121 |
| ... | ... | ... | ... |
| 45 | Our data center investments continue to be bas... | [-0.022294508293271065, -0.027116671204566956,... | 0.691357 |
| 56 | We expect Xbox content and services revenue to... | [-0.003946698736399412, -0.016969818621873856,... | 0.685625 |
| 29 | Thank you, Satya, and good afternoon, everyone... | [0.013616200536489487, -0.01455166470259428, -... | 0.682844 |
| 57 | Other income and expense should be roughly $10... | [-0.01832527294754982, -0.014160438440740108, ... | 0.669254 |


In [57]:
query_result_db = (get_top_k_similar(earnings_embeddings, earnings_search_vector, 3))
query_result_db

"Power Automate has more than seven million monthly active users and is being used by companies like Brown-Forman, Komatsu, Mass, T-Mobile to digitize manual business processes and save thousands of hours of employee time. And we're going further with new AI-powered capabilities and power automate that turn natural language into advanced workflows. Now on to Dynamics 365. From customer experience and service to finance and supply chain, we continue to take share across all categories we serve./n Now on to Power Platform. We are helping customers save time and money with our end-to-end suite spanning Low-Code/No-Code tools, robotic process automation, virtual agents, and business intelligence. Power BI is the market leader in business intelligence in the cloud and is growing faster than competition, as companies like Walmart standardize on the tool for reporting and analytics. Power Apps is the market leader in Low-Code/No-Code tools and has nearly 15 million monthly active users, up mo

In [58]:
# The first placeholder receives the retrieved passages, the second the user's question
prompt = '''You are an AI assistant that answers questions to a user.
You are given the following extracted parts of a long document as "Context" and a question.
Use the Context to answer the Question.


Context: "{}"

Question: "{}"

Answer the question:'''.format(query_result_db, earnings_search)


In [59]:
print(prompt)

You are an AI assistant that answers questions to a user.
You are given the following extracted parts of a long document as "Context" and a question.
Use the Context to answer the Question.


Context: "Power Automate has more than seven million monthly active users and is being used by companies like Brown-Forman, Komatsu, Mass, T-Mobile to digitize manual business processes and save thousands of hours of employee time. And we're going further with new AI-powered capabilities and power automate that turn natural language into advanced workflows. Now on to Dynamics 365. From customer experience and service to finance and supply chain, we continue to take share across all categories we serve./n Now on to Power Platform. We are helping customers save time and money with our end-to-end suite spanning Low-Code/No-Code tools, robotic process automation, virtual agents, and business intelligence. Power BI is the market leader in business intel

In [60]:

generated_text = generate_text(prompt)
print_markdown(generated_text)

 Power Automate is being used by companies such as Brown-Forman, Komatsu, Mass, T-Mobile, Walmart, Finastra, SES, Unilever, Adobe, Crayon, and PwC to digitize manual business processes and save thousands of hours of employee time.