# Using `text-davinci-003` to answer questions posed in natural language, using a custom dataset



---
**Credit**: Adapted from this [OpenAI notebook](https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb)

---

Many use cases require us to respond to user questions with relevant and accurate answers. For example, a customer support chatbot may need to provide answers to common support questions.

The GPT models have picked up a lot of general knowledge in training (remember, GPT-3's training corpus contained roughly 500 billion tokens!), but we often want the model to *use our own dataset or library* of more specific information when answering questions. For example, we would like our customer service chatbot to consult a library of service manuals when it answers a user question. We'd expect those tailored responses to be more helpful and accurate than generic responses uninformed by our specific data.

In this notebook we will demonstrate a method for enabling `text-davinci-003` to answer questions using a library of text as a reference. We'll be using a dataset of Wikipedia articles about the 2020 Summer Olympic Games but the same approach can be used with a library of books, articles, documentation, service manuals, or much much more. 

## Setup

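Let's get started by installing the `openai` Python package. The code below uses the pre-1.0 interface (`openai.Completion`, `openai.Embedding`), so we pin the version this notebook was run with.


In [None]:
!pip install openai==0.27.6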


Next, we import the necessary packages, including numpy and pandas.

In [None]:
import numpy as np
import pandas as pd
import openai

First, let's take a quick look at all the GPT models that are available ([Link](https://platform.openai.com/docs/models/gpt-3-5)). 


---


We will use the most recent version of the GPT family just before ChatGPT was released - the `text-davinci-003` model - in this colab.

We can use the ChatGPT API as well (and we give a code example below) but since ChatGPT's training cutoff date is later, it "knows" about the 2020 Summer Olympics and the questions may be too easy. BTW, ChatGPT is referred to as `gpt-3.5-turbo` when making API calls.



---




We will be using pre-trained contextual embeddings as well. For that, we will use the latest/greatest `text-embedding-ada-002` model ([link](https://openai.com/blog/new-and-improved-embedding-model)).

In [None]:
COMPLETIONS_MODEL = "text-davinci-003"
EMBEDDING_MODEL = "text-embedding-ada-002"

Finally, let's set the OpenAI API key. You can get yours [here](https://platform.openai.com/account/api-keys).

In [None]:
openai.api_key = "paste the key here"
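
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. Here is a minimal sketch of one alternative, assuming you have exported the key in an environment variable named `OPENAI_API_KEY`. The helper name `read_api_key` is ours for illustration and is not part of the `openai` package.

```python
import os


def read_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Raises a clear error instead of silently sending an empty key.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage in this notebook (left commented out so the cell runs without a key):
# openai.api_key = read_api_key()
```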

## Prompting without custom data

Before we try anything fancy, let's simply ask `text-davinci-003` a question on the 2020 Summer Olympics and see how it responds. 

First, we prepare the prompt.

In [None]:
prompt = "Who won the 2020 Summer Olympics men's high jump?"

Next, we make the request to the model, using the openai API. [Documentation](https://platform.openai.com/docs/api-reference/completions/create?lang=python).


In [None]:
result = openai.Completion.create(
    prompt=prompt,              # well, your prompt goes here!
    temperature=0,              # setting this to zero tells GPT to pick the most likely next token
    max_tokens=300,             # the model will generate at most 300 tokens in its response
    model=COMPLETIONS_MODEL     # which model you want to use
)

In [None]:
# to pose the same question to ChatGPT instead of `text-davinci-003`
# you can use this code
# we aren't using ChatGPT in this example
# since it "knows" about the 2020 Summer Olympics :)
# unlike `text-davinci-003`

# completion = openai.ChatCompletion.create(
#   model="gpt-3.5-turbo",
#   messages=[
#     {"role": "user", "content": prompt}
#   ]
# )

# print(completion.choices[0].message)

Let's extract just the text of the response.

In [None]:
print(result["choices"][0]["text"].strip(" \n"))

Marcelo Chierighini of Brazil won the gold medal in the men's high jump at the 2020 Summer Olympics.


Let's Google this name and see if the answer is correct.

Well, Marcelo Chierighini is a **Brazilian swimmer, not a high jumper**!!



Sounds like `text-davinci-003` could use some help. 😆


### "Engineering" the prompt to reduce hallucinations



One simple thing we can try right off the bat is to tell `text-davinci-003` to say "I don't know" if it doesn't know, rather than make stuff up (i.e., "hallucinate").


How? By asking nicely? 😀 Well, almost.



**By asking explicitly!**

Let's modify our prompt as follows.


In [None]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

Note the explicit extra instruction in the above prompt: *as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know"*

In [None]:
openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't know."

Wow, it worked. The model is being humble and honest 👀.

It is interesting to ask why `text-davinci-003` (an InstructGPT-style model) doesn't know this. Let's check the [cutoff date](https://help.openai.com/en/articles/6639781-do-the-openai-api-models-have-knowledge-of-current-events) for the training data, keeping in mind that the 2020 Summer Olympics were actually held in mid-2021.

## Using custom data

To help the model answer a question, we can provide custom data **in the prompt itself**. This extra information we provide in the prompt is referred to as **context**.



### Manually enriching the prompt with custom data

We will first show how to do this by ***manually*** finding and adding information (that's relevant to the question) to the prompt.

First, we will use the following passage on the 2020 Summer Olympics **high jump event** taken from Wikipedia as context:
>The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).


Second, we will **explicitly tell the model to make use of the provided context**. 


There's a deeper lesson here: **telling LLMs explicitly what you want them to do often helps** (kinda like parenting? 🤔)

In [None]:
prompt = """Answer the question as truthfully as possible using the 
provided text, and if the answer is not contained within the text below, 
say "I don't know"

Context:
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""



Take a moment to notice what the prompt has grown to.


OK, let's run it.

In [None]:
openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event.'

Nicely done, `text-davinci-003`!



---



But maybe it wasn't super hard since the answer is literally in the context we provided.


Let's make it a bit harder.


Notice the last line in the context we provided.
>Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg of Sweden (1984 to 1992).

This tempts me to ask: Who is the **first** man to win three medals in high jump? 

Wicked, eh? 😸 

Let's try it.

In [None]:
prompt = """Answer the question as truthfully as possible using the 
provided text, and if the answer is not contained within the text below, 
say "I don't know"

Context:
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Q: Who is the first man to win three medals in high jump?
A:"""


Notice that the question has changed. Everything else is unchanged.

In [None]:
openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

'Patrik Sjöberg of Sweden.'

WHOA!!!!

👏 👍

Not sure if a traditional search engine could have done that!



---




### Automatically enriching the prompt with custom data

**Manually** adding extra information into the prompt obviously doesn't scale. So, we will now show how to **automatically** enrich the prompt with custom relevant data.

First thing to note: we typically can't just include **all** the custom data in the prompt, for one important reason.

Every model has a limit (called the **context window**) on how many tokens you can send in and get out. For `text-davinci-003`, the context window is 4097 tokens ([link](https://help.openai.com/en/articles/6643408-how-do-davinci-and-text-davinci-003-differ)).

Note that the context window includes both the prompt and the response - **together**, they can't exceed 4097 tokens. We will get deeper into this a bit later but for now, understand this is one key reason we can't include ALL data in the prompt. Another reason is expense. OpenAI charges by the token and these charges can easily add up.

(BTW, GPT-4's context window is way bigger - it ranges from 8,192 to 32,768 tokens, depending on the particular GPT-4 model!)
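
The budget arithmetic is worth making explicit. The sketch below uses a hypothetical helper, `prompt_token_budget`, that is not part of the `openai` package. It simply subtracts the tokens reserved for the response from the context window.

```python
def prompt_token_budget(context_window: int, max_response_tokens: int) -> int:
    """Tokens left for the prompt once room for the response is reserved."""
    if max_response_tokens >= context_window:
        raise ValueError("The response alone would fill the context window.")
    return context_window - max_response_tokens


# text-davinci-003: 4097-token window, 300 tokens reserved for the answer
print(prompt_token_budget(4097, 300))  # prints 3797
```

So with `max_tokens=300`, as used throughout this notebook, the prompt (instructions, context, and question together) has to fit in 3797 tokens.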

If we can't include all the custom data, the logical thing to do is to only include data that's **relevant** to the question.

How can we measure the relevance between a question and a piece of (our custom) data?

Using pretrained contextual embeddings!



---



This is our overall process.



**One-time setup**
* Preprocess the custom dataset by splitting it into 'sections'
* We calculate an embedding vector for each section using the `text-embedding-ada-002` model and store it somewhere handy


**Each time we receive a question, we do this:**
* We calculate an embedding vector for the question (again using the same `text-embedding-ada-002` model)
* For each section in our custom dataset, we calculate the *cosine similarity* between that section's embedding vector and the question's embedding vector
* We rank the sections from most-cosine-similar to the question to least-cosine-similar
* Starting from the most-cosine-similar section, include as many sections into the prompt as can fit into the context window
* Send the prompt into `text-davinci-003`.
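
To make the ranking step concrete, here is a toy sketch in plain Python. The three-dimensional vectors and section names are invented for illustration; real `text-embedding-ada-002` vectors have 1536 dimensions and come from the API, not from hand-typed numbers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "embeddings" for three sections and one question.
toy_sections = {
    "high jump": [0.9, 0.1, 0.0],
    "pole vault": [0.7, 0.3, 0.1],
    "swimming": [0.0, 0.2, 0.9],
}
toy_question = [1.0, 0.0, 0.0]

# Rank section names from most to least similar to the question.
ranked = sorted(
    toy_sections,
    key=lambda name: cosine_similarity(toy_question, toy_sections[name]),
    reverse=True,
)
print(ranked)  # ['high jump', 'pole vault', 'swimming']
```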

#### One-time setup

We first need to break up the custom dataset into "sections".

Sections should be large enough to contain enough information to answer a question, but small enough that one or several fit into the `text-davinci-003` prompt.

Approximately a paragraph of text is usually a good length, but you should experiment for your particular use case. In this example, Wikipedia articles are already grouped into headers, so we will use these to define our sections. This preprocessing has already been done in [this notebook](https://github.com/openai/openai-cookbook/blob/main/examples/fine-tuned_qa/olympics-1-collect-data.ipynb), so we will load the results and use them. 
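
If your own corpus does not come pre-split, one simple starting point is to split on blank lines so that each paragraph becomes a section. The sketch below assumes plain-text input. The function name `split_into_sections` and the rough four-characters-per-token estimate are our own illustrative choices, not what the preprocessing notebook does, so use a real tokenizer when you need accurate counts.

```python
import re


def split_into_sections(text: str, min_chars: int = 40) -> list[dict]:
    """Split text on blank lines; drop fragments shorter than min_chars."""
    sections = []
    for paragraph in re.split(r"\n\s*\n", text):
        content = " ".join(paragraph.split())  # collapse internal whitespace
        if len(content) < min_chars:
            continue
        # Crude estimate: roughly 4 characters per token for English text.
        sections.append({"content": content, "tokens": max(1, len(content) // 4)})
    return sections


sample = """The men's high jump took place at the Olympic Stadium.

Short.

Maksim Nedasekau of Belarus took bronze in the men's high jump."""

for section in split_into_sections(sample):
    print(section["tokens"], section["content"])
```

The `min_chars` filter drops stray fragments (like the one-word paragraph above) that would waste an embedding call without carrying enough information to answer anything.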

In [None]:
# OpenAI has hosted the processed dataset, so we can download it directly without having to recreate it.
# This dataset has already been split into sections, one row for each section of the Wikipedia page.

df = pd.read_csv('https://cdn.openai.com/API/examples/data/olympics_sections_text.csv')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")
df.sample(5)

3964 rows in the data.


| title | heading | content | tokens |
|---|---|---|---|
| Russian Olympic Committee athletes at the 2020 Summer Olympics | Road | ROC has entered a squad of four riders (three ... | 57 |
| Philippines at the 2020 Summer Olympics | Boxing | The Philippines entered four boxers (two per g... | 405 |
| Federated States of Micronesia at the 2020 Summer Olympics | Swimming | Federated States of Micronesia received a univ... | 55 |
| Honduras at the 2020 Summer Olympics | Summary | Honduras competed at the 2020 Summer Olympics ... | 64 |
| Yemen at the 2020 Summer Olympics | Summary | Yemen competed at the 2020 Summer Olympics in ... | 50 |


Next, we need to calculate an embedding vector for each section. 

Recall that an embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar are their contents. See the [documentation on OpenAI embeddings](https://beta.openai.com/docs/guides/embeddings) for more information.

Since this is a small example, we can store the embeddings locally. If you have a larger dataset, consider using a vector search engine like [Pinecone](https://www.pinecone.io/) or [Weaviate](https://github.com/semi-technologies/weaviate).

This function calculates the embedding using `text-embedding-ada-002`, given a piece of text. The API call is simple (see below). [Link](https://openai.com/blog/new-and-improved-embedding-model).

In [None]:
def get_embedding(text: str, model: str=EMBEDDING_MODEL) -> list[float]:
    result = openai.Embedding.create(
      model=model,       # which embedding model we want to use
      input=text         # feed in the text for which you want to calc the embedding
    )
    return result["data"][0]["embedding"]



Let's try it on "HODL is amazing!!" 😃

In [None]:
e = get_embedding("HODL is amazing!!")

In [None]:
e

[-0.007875289767980576,
 0.0033526234328746796,
 -0.009611068293452263,
 -0.027181001380085945,
 -0.022436540573835373,
 -0.001679526176303625,
 -0.008299591019749641,
 0.006573456339538097,
 0.006730962079018354,
 -0.032092608511447906,
 0.007020258344709873,
 0.01144327875226736,
 -0.007624566555023193,
 0.0029074284248054028,
 0.014876262284815311,
 0.010343952104449272,
 0.01805209368467331,
 0.01083897054195404,
 0.019312139600515366,
 -0.0018306031124666333,
 -0.029572516679763794,
 0.0028479620814323425,
 -0.037929967045784,
 0.006258444860577583,
 -0.019775014370679855,
 -0.005548061337321997,
 0.020816480740904808,
 -0.01360335759818554,
 0.009591781534254551,
 -0.03808426111936569,
 0.009553208947181702,
 -0.008948900736868382,
 -0.02529093064367771,
 -0.006724533159285784,
 -0.019427858293056488,
 -0.009257483296096325,
 -0.01832210272550583,
 0.005342339631170034,
 0.0001882435317384079,
 0.00999679695814848,
 0.026692410930991173,
 -0.01148828025907278,
 -0.001528449123725

Let's see how long the embedding vector is.

In [None]:
len(e)

1536

OK, that matches the official version:
>The new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings, making the new embeddings more cost effective in working with vector databases.

Given a pandas DataFrame with a column of text, we can use the `get_embedding` function to calculate the embeddings for all the text in the column.

In [None]:
def compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """
    return {
        idx: get_embedding(r.content) for idx, r in df.iterrows()
    }

If you'd like to calculate the embeddings from scratch, uncomment the line below and run it. Warning: it makes one API call per row (3964 rows here), so it will take some time!

In [None]:
# document_embeddings = compute_doc_embeddings(df)

But happily for us, OpenAI has calculated the embeddings for us so we don't have to! We download them and write a small function to load them.

In [None]:
def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
    """
    Read the document embeddings and their keys from a CSV.
    
    fname is the path to a CSV with exactly these named columns: 
        "title", "heading", "0", "1", ... up to the length of the embedding vectors.
    """
    
    df = pd.read_csv(fname, header=0)
    max_dim = max([int(c) for c in df.columns if c != "title" and c != "heading"])
    return {
           (r.title, r.heading): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
    }

document_embeddings = load_embeddings(
    "https://cdn.openai.com/API/examples/data/olympics_sections_document_embeddings.csv"
    )
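
If you do compute your own embeddings, you will probably want to save them so you don't pay for them twice. Here is a minimal sketch, assuming the same column layout that `load_embeddings` expects (`title`, `heading`, then `0`, `1`, ...). `save_embeddings` is a hypothetical helper, not part of the original notebook.

```python
import csv
import io


def save_embeddings(embeddings: dict[tuple[str, str], list[float]], fileobj) -> None:
    """Write embeddings as CSV with columns: title, heading, 0, 1, ..."""
    if not embeddings:
        raise ValueError("No embeddings to save.")
    dim = len(next(iter(embeddings.values())))
    writer = csv.writer(fileobj)
    writer.writerow(["title", "heading"] + [str(i) for i in range(dim)])
    for (title, heading), vector in embeddings.items():
        if len(vector) != dim:
            raise ValueError(f"Inconsistent vector length for {(title, heading)}")
        writer.writerow([title, heading] + list(vector))


# Tiny demo with a made-up 3-dimensional vector, written to an in-memory buffer.
buffer = io.StringIO()
save_embeddings({("2020 Summer Olympics", "Summary"): [0.1, -0.2, 0.3]}, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass an open file, for example `with open("my_embeddings.csv", "w", newline="") as f: save_embeddings(document_embeddings, f)` (the filename is just an example).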

In [None]:
# An example embedding:
example_entry = list(document_embeddings.items())[0]
print(example_entry)

(('2020 Summer Olympics', 'Summary'), [0.0037565305829048, -0.0061981128528714, -0.0087078781798481, -0.0071364338509738, -0.0025227521546185, 0.0150650832802057, -0.0218573585152626, -0.0057435631752014, -0.0066429222933948, -0.0316626504063606, 0.0160261318087577, 0.0097858104854822, -0.01212998945266, -0.0207404643297195, -0.021844370290637, -0.0121949249878525, 0.0238054282963275, -0.0157793760299682, -0.0024188549723476, -0.0130715575069189, -0.0248444005846977, 0.0085845002904534, -0.005526028573513, 0.0148053402081131, -0.0083052767440676, -0.0011428684229031, 0.0157014541327953, -0.0164936687797307, 0.0329613648355007, -0.020337862893939, -0.0105260778218507, 0.0108637437224388, -0.0094286641106009, 0.0089156720787286, -0.0033539291471242, -0.0162079520523548, -0.0153637873008847, -0.0127598661929368, 0.005344208329916, -0.0134416911751031, 0.0056494064629077, 0.0196365565061569, -0.0063831796869635, 0.0098182782530784, -0.0046623833477497, 0.0232339948415756, 0.000809667049907

So we have split our custom data into sections, and calculated embedding vectors for each. Next we will use these embeddings to answer our users' questions.



#### Each time we receive a question

* We calculate an embedding vector for the question (again using the same `text-embedding-ada-002` model) with the `get_embedding` function we defined above.
* For each section in our custom dataset, we calculate the cosine similarity between that section's embedding vector and the question's embedding vector
* We rank the sections from most-cosine-similar to the question to least-cosine-similar

We first define a couple of helper functions.

In [None]:
def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    Returns the similarity between two vectors.
    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    return np.dot(np.array(x), np.array(y))
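
The docstring's claim is easy to check on a small example. The sketch below uses plain Python and two made-up vectors (not real embeddings). It scales them to length 1 and confirms that the dot product and the full cosine formula agree.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to length 1."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


a = normalize([3.0, 4.0])  # becomes [0.6, 0.8]
b = normalize([4.0, 3.0])  # becomes [0.8, 0.6]

# Full cosine formula: dot product divided by the product of the lengths.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# For unit-length vectors the denominator is 1, so the two values agree.
print(math.isclose(dot(a, b), cosine))  # True
```

If your vectors come from a source that does not normalize them, skipping the division would rank by a mix of direction and magnitude rather than by cosine similarity.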

In [None]:
def order_document_sections_by_query_similarity(
    query: str,
    contexts: dict[tuple[str, str], list[float]],
) -> list[tuple[float, tuple[str, str]]]:
    """
    Calc the embedding for the supplied query and calc the cosine
    similarity against all the pre-calculated section embeddings. 
    
    Return the list of sections, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities

Let's test these functions out.

In [None]:
order_document_sections_by_query_similarity("Who won the men's high jump?", document_embeddings)[:5]

[(0.8848838116467932,
  ("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')),
 (0.8634516122222147,
  ("Athletics at the 2020 Summer Olympics – Men's pole vault", 'Summary')),
 (0.8616689251543945,
  ("Athletics at the 2020 Summer Olympics – Men's long jump", 'Summary')),
 (0.8560916109708381,
  ("Athletics at the 2020 Summer Olympics – Men's triple jump", 'Summary')),
 (0.8469427954223732,
  ("Athletics at the 2020 Summer Olympics – Men's 110 metres hurdles",
   'Summary'))]

In [None]:
order_document_sections_by_query_similarity("Who won the women's high jump?", document_embeddings)[:5]

We can see that the most relevant sections for each question include the summaries for the Men's and Women's high jump competitions - which is exactly what we would expect.

#### Starting from the most-cosine-similar section, include as many sections into the prompt as can fit into the context window


Once we've calculated the most relevant pieces of context, we construct a prompt by simply prepending them to the supplied query. It is helpful to use a query separator to help the model distinguish between separate pieces of text.

In [None]:
# We reserve 300 tokens for the output, so we have a
# budget of ~3800 tokens for the prompt. But that's overkill
# in this example and I don't want a big bill from OpenAI
# - remember, they charge by the token! -
# so we will just use 500 tokens.
MAX_SECTION_LEN = 500

# It helps the LLM if we provide the sections with a nice
# separator.

SEPARATOR = "\n* "

In [None]:
HEADER = """
Answer the question as truthfully as possible using the provided context, 
and if the answer is not contained within the text below, 
say "I don't know."\n\nContext:\n
"""

In [None]:
def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
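    """
    Fetch the document sections most relevant to the question and
    combine them with the header and the question into a prompt.
    """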
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add sections until we run out of context window        
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section.tokens 
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
    # Useful diagnostic information
    print(f"Selected {len(chosen_sections)} document sections:")
    print("\n".join(chosen_sections_indexes))
        
    return HEADER + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"
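
The selection loop in `construct_prompt` is a greedy fill: walk the ranked sections and stop at the first one that would push the total past the budget. The standalone sketch below isolates that logic with made-up token counts so you can see how it behaves. `select_sections` and the sample data are illustrative and not part of the notebook. As in `construct_prompt`, the tokens used by the separator, header, and question are not counted, so treat the budget as approximate.

```python
def select_sections(ranked: list[tuple[str, int]], budget: int) -> list[str]:
    """Greedily keep ranked (name, tokens) sections until the budget is hit."""
    chosen = []
    used = 0
    for name, tokens in ranked:
        used += tokens
        if used > budget:
            break  # stop at the first section that does not fit
        chosen.append(name)
    return chosen


# Made-up token counts, already sorted from most to least relevant.
ranked = [("high jump", 326), ("long jump", 140), ("pole vault", 230), ("hurdles", 20)]
print(select_sections(ranked, budget=500))  # ['high jump', 'long jump']
```

Because the loop stops with `break`, the small "hurdles" section is never considered even though it would fit. A variant that skips oversized sections instead of stopping would pack more text in, at the cost of including less relevant sections.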

In [None]:
prompt = construct_prompt(
    "Who won the 2020 Summer Olympics men's high jump?",
    document_embeddings,
    df
)

print("===\n", prompt)

Selected 2 document sections:
("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's long jump", 'Summary')
===
 
Answer the question as truthfully as possible using the provided context, 
and if the answer is not contained within the text below, 
say "I don't know."

Context:


* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different natio

We have now obtained the sections that are most relevant to the question. As a final step, let's put it all together to get an answer to the question.

#### Send the prompt into `text-davinci-003`!

Now that we've retrieved the relevant sections and constructed our prompt, we can finally answer the user's query.

In [None]:
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}

In [None]:
def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    document_embeddings: dict[tuple[str, str], list[float]],
    show_prompt: bool = True
) -> str:
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )
    
    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
            )

    return response["choices"][0]["text"].strip(" \n")

In [None]:
answer_query_with_context("Who won the 2020 Summer Olympics men's high jump?", df, document_embeddings)

Selected 2 document sections:
("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's long jump", 'Summary')

Answer the question as truthfully as possible using the provided context, 
and if the answer is not contained within the text below, 
say "I don't know."

Context:


* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations ha

'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal.'

Let's ask a question about an Olympic event that never happened!

In [None]:
answer_query_with_context("Who won the 2019 Summer Olympics men's high jump?", df, document_embeddings)

Selected 2 document sections:
("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's long jump", 'Summary')

Answer the question as truthfully as possible using the provided context, 
and if the answer is not contained within the text below, 
say "I don't know."

Context:


* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations ha

"I don't know."

Good, it is trying to be humble and say "I don't know".

Let's change the header to "allow" it to lie 👀 and see if it takes the bait.

In [None]:
HEADER = """
Answer the question using the provided context."\n\nContext:\n
"""

In [None]:
answer_query_with_context("Who won the 2019 Summer Olympics men's high jump?", df, document_embeddings)

Selected 2 document sections:
("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's long jump", 'Summary')

Answer the question using the provided context."

Context:


* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021). Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance where the athletes of different nations had agreed to share the same medal in the history of Olympics. Barshim in particular was heard to ask a com

'Italian athlete Gianmarco Tamberi and Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m.'

WOW! We caught it in the act! 😁

If we don't explicitly tell it not to, it will "hallucinate"! Those extra instructions in the header - `as truthfully as possible` and the explicit `say "I don't know."` fallback - change its behavior!!!



## More Examples

Let's have some fun and try some more examples. First, let's go back to the old header.

In [None]:
HEADER = """
Answer the question as truthfully as possible using the provided context, 
and if the answer is not contained within the text below, 
say "I don't know."\n\nContext:\n
"""

In [None]:
query = "Why was the 2020 Summer Olympics originally postponed?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

In [None]:
query = "In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win? Explain step by step."
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

In [None]:
query = "What was unusual about the men’s shotput competition?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

In [None]:
query = "In the 2020 Summer Olympics, how many bronze medals did Italy win?"

answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Let's try to trick it a bit! 😉

In [None]:
query = "In the 2020 Summer Olympics, how many titanium medals did Italy win?"

answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Nice job, `text-davinci-003`!!! 👏

## Conclusion
By combining pretrained contextual embeddings and `text-davinci-003`, we have created a question-answering model which can answer questions in natural language using a custom dataset. It also **tries** not to make stuff up and says "I don't know" when it doesn't know the answer! **But this is not guaranteed.** 

For this example we have used a dataset of Wikipedia articles, but that dataset could be replaced with books, articles, documentation, service manuals, or much much more. 





---

How you can use this approach to "understand" a dense 56-page legal document:
A fun [example](https://www.youtube.com/watch?v=ih9PBGVVOO4)

---


