### ss: for debug breakdown see `ss/ss_Question_answering_using_embeddings` via jupyter in vscode

# Question Answering using Embeddings

Many use cases require GPT-3 to respond to user questions with insightful answers. For example, a customer support chatbot may need to provide answers to common questions. The GPT models have picked up a lot of general knowledge in training, but we often need to ingest and use a large library of more specific information.

In this notebook we will demonstrate a method for enabling GPT-3 to answer questions using a library of text as a reference, by using document embeddings and retrieval. We'll be using a dataset of Wikipedia articles about the 2020 Summer Olympic Games. Please see [this notebook](fine-tuned_qa/olympics-1-collect-data.ipynb) to follow the data gathering process.

In [1]:
import pandas as pd
import openai
import numpy as np
import pickle
from transformers import GPT2TokenizerFast

COMPLETIONS_MODEL = "text-davinci-002"

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


By default, GPT-3 isn't an expert on the 2020 Olympics:

In [2]:
import os

In [3]:
# os.environ["OPENAI_API_KEY"] = "sk-<redacted>"

In [4]:
openai.api_key = os.getenv("OPENAI_API_KEY")
openai.api_key

'sk-<redacted>'

In [5]:
prompt = "Who won the 2020 Summer Olympics men's high jump?"

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"The 2020 Summer Olympics men's high jump was won by Mariusz Przybylski of Poland."

Mariusz Przybylski is a professional footballer from Poland, and not much of a high jumper! Evidently GPT-3 needs some assistance here. 

The first issue to tackle is that the model is hallucinating an answer rather than telling us "I don't know". This is bad because it makes it hard to trust the answer that the model gives us! 

# 0) Preventing hallucination with prompt engineering

We can address this hallucination issue by being more explicit with our prompt:


In [6]:
prompt = """Answer the question as truthfully as possible, and if you're unsure of the answer, say "Sorry, I don't know".

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Sorry, I don't know."

To help the model answer the question, we provide extra contextual information in the prompt. When the total required context is short, we can include it in the prompt directly. For example we can use this information taken from Wikipedia. We update the initial prompt to tell the model to explicitly make use of the provided text.

In [7]:
prompt = """Answer the question as truthfully as possible using the provided text, and if the answer is not contained within the text below, say "I don't know"

Context:
The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium.
33 athletes from 24 nations competed; the total possible number depended on how many nations would use universality places 
to enter athletes in addition to the 32 qualifying through mark or ranking (no universality places were used in 2021).
Italian athlete Gianmarco Tamberi along with Qatari athlete Mutaz Essa Barshim emerged as joint winners of the event following
a tie between both of them as they cleared 2.37m. Both Tamberi and Barshim agreed to share the gold medal in a rare instance
where the athletes of different nations had agreed to share the same medal in the history of Olympics. 
Barshim in particular was heard to ask a competition official "Can we have two golds?" in response to being offered a 
'jump off'. Maksim Nedasekau of Belarus took bronze. The medals were the first ever in the men's high jump for Italy and 
Belarus, the first gold in the men's high jump for Italy and Qatar, and the third consecutive medal in the men's high jump
for Qatar (all by Barshim). Barshim became only the second man to earn three medals in high jump, joining Patrik Sjöberg
of Sweden (1984 to 1992).

Q: Who won the 2020 Summer Olympics men's high jump?
A:"""

openai.Completion.create(
    prompt=prompt,
    temperature=0,
    max_tokens=300,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    model=COMPLETIONS_MODEL
)["choices"][0]["text"].strip(" \n")

"Gianmarco Tamberi and Mutaz Essa Barshim won the 2020 Summer Olympics men's high jump."

Adding extra information into the prompt only works when the dataset of extra content that the model may need to know is small enough to fit in a single prompt. What do we do when we need the model to choose relevant contextual information from within a large body of information?

**In the remainder of this notebook, we will demonstrate a method for augmenting GPT-3 with a large body of additional contextual information by using document embeddings and retrieval.** This method answers queries in two steps: first it retrieves the information relevant to the query, then it writes an answer tailored to the question based on the retrieved information. The first step uses the [Embedding API](https://beta.openai.com/docs/guides/embeddings), the second step uses the [Completions API](https://beta.openai.com/docs/guides/completion/introduction).
 
The steps are:
* Preprocess the contextual information by splitting it into chunks and create an embedding vector for each chunk.
* On receiving a query, embed the query in the same vector space as the context chunks and find the context embeddings which are most similar to the query.
* Prepend the text of the most relevant context chunks to the query prompt.
* Submit the question along with the most relevant context to GPT, and receive an answer which makes use of the provided contextual information.
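
The steps above can be sketched end to end without any API calls. This is a minimal illustration, not the notebook's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for the Embedding API, and `build_prompt` and the two example chunks are made up for the demonstration. The final step (sending the prompt to the model) is left out.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(question: str, chunks: list[str], top_k: int = 1) -> str:
    # Step 1: embed every chunk (done once, offline, in a real system).
    chunk_vectors = [(chunk, toy_embed(chunk)) for chunk in chunks]
    # Step 2: embed the query and rank the chunks by similarity to it.
    query_vector = toy_embed(question)
    ranked = sorted(chunk_vectors, key=lambda cv: cosine(query_vector, cv[1]), reverse=True)
    # Step 3: prepend the text of the best chunks to the question.
    context = "\n".join("* " + chunk for chunk, _ in ranked[:top_k])
    return f"Context:\n{context}\n\nQ: {question}\nA:"


chunks = [
    "The men's high jump gold medal was shared by Tamberi and Barshim.",
    "The women's marathon was held in Sapporo.",
]
prompt = build_prompt("Who won the men's high jump?", chunks)
print(prompt)
```

With a real embedding model the ranking depends on meaning rather than shared words, but the control flow stays the same.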

# 1) Preprocess the document library

We plan to use document embeddings to fetch the most relevant part or parts of our document library and insert them into the prompt that we provide to GPT-3. We therefore need to break up the document library into "sections" of context, which can be searched and retrieved separately. 

Sections should be large enough to contain enough information to answer a question; but small enough to fit one or several into the GPT-3 prompt. We find that approximately a paragraph of text is usually a good length, but you should experiment for your particular use case. In this example, Wikipedia articles are already grouped into semantically related headers, so we will use these to define our sections. This preprocessing has already been done in [this notebook](fine-tuned_qa/olympics-1-collect-data.ipynb), so we will load the results and use them.
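
As a rough sketch of what such a split might look like, assume headings are marked wiki-style as `== Heading ==` and that text before the first heading is filed under "Summary". Both are assumptions for illustration, and `split_into_sections` is a hypothetical helper; the linked notebook is the authoritative version.

```python
import re


def split_into_sections(title: str, wikitext: str) -> list[tuple[str, str, str]]:
    """Split text on lines like '== Heading ==' into (title, heading, content) rows."""
    sections = []
    heading = "Summary"  # assumed label for text before the first heading
    buffer: list[str] = []

    def flush() -> None:
        # Store the accumulated lines as one section, skipping empty sections.
        content = " ".join(buffer).strip()
        if content:
            sections.append((title, heading, content))

    for line in wikitext.splitlines():
        stripped = line.strip()
        match = re.fullmatch(r"==+\s*(.+?)\s*==+", stripped)
        if match:
            flush()
            heading, buffer = match.group(1), []
        elif stripped:
            buffer.append(stripped)
    flush()
    return sections


sample = "Intro text.\n== Background ==\nSome background.\nMore.\n== Empty ==\n== Results ==\nFinal."
rows = split_into_sections("Example article", sample)
for row in rows:
    print(row)
```

Each returned row corresponds to one searchable section, keyed by `(title, heading)` in the same way as the dataset loaded below.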

In [8]:
# We have hosted the processed dataset, so you can download it directly without having to recreate it.
# This dataset has already been split into sections, one row for each section of the Wikipedia page.

df = pd.read_csv('https://cdn.openai.com/API/examples/data/olympics_sections_text.csv')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")
df.sample(5)

3964 rows in the data.


| title | heading | content | tokens |
| --- | --- | --- | --- |
| Softball at the 2020 Summer Olympics – Qualification | Americas Qualifying Event | Two quota spots were allocated to the winner a... | 120 |
| Netherlands at the 2020 Summer Olympics | Archery | Three Dutch archers qualified for the men's ev... | 85 |
| Athletics at the 2020 Summer Olympics | Schedule | Apart from the race walks and marathon, nine t... | 105 |
| Volleyball at the 2020 Summer Olympics – Women's European qualification | Pools composition | The hosts Netherlands and the top seven ranked... | 54 |
| Argentina at the 2020 Summer Olympics | Taekwondo | Argentina entered one athlete into the taekwon... | 71 |


We preprocess the document sections by creating an embedding vector for each section. An embedding is a vector of numbers that helps us understand how semantically similar or different the texts are. The closer two embeddings are to each other, the more similar are their contents. See the [documentation on OpenAI embeddings](https://beta.openai.com/docs/guides/embeddings) for more information.
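
To make "closer means more similar" concrete, here is a small worked example using cosine similarity. The three 3-dimensional vectors are invented for illustration (real embeddings have thousands of dimensions and are not hand-written), and the variable names are hypothetical.

```python
import math


def cosine_similarity(x: list[float], y: list[float]) -> float:
    """Cosine of the angle between x and y: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Made-up vectors: two related topics point in similar directions, the third does not.
high_jump = [0.9, 0.1, 0.0]
long_jump = [0.8, 0.3, 0.1]
archery = [0.0, 0.2, 0.9]

sim_jump = cosine_similarity(high_jump, long_jump)
sim_archery = cosine_similarity(high_jump, archery)
print(f"high jump vs long jump: {sim_jump:.3f}")
print(f"high jump vs archery:   {sim_archery:.3f}")
```

The first pair scores much higher than the second, which is the property retrieval relies on.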

This indexing stage can be executed offline and only runs once to precompute the indexes for the dataset so that each piece of content can be retrieved later. Since this is a small example, we will store and search the embeddings locally. If you have a larger dataset, consider using a vector search engine like [Pinecone](https://www.pinecone.io/) or [Weaviate](https://github.com/semi-technologies/weaviate) to power the search.

For the purposes of this tutorial we chose to use Curie embeddings, which are 4096-dimensional embeddings at a very good price and performance point. Since we will be using these embeddings for retrieval, we’ll use the "search" embeddings (see the [documentation](https://beta.openai.com/docs/guides/embeddings)).

## 2022-12-21 ss: see [OpenAI's new embeddings](https://openai.com/blog/new-and-improved-embedding-model/)
* So change the models below from `curie` to `text-embedding-ada-002`.
* From the announcement: "Unification of capabilities. We have significantly simplified the interface of the /embeddings endpoint by merging the five separate models shown above (text-similarity, text-search-query, text-search-doc, code-search-text and code-search-code) into a single new model. This single representation performs better than our previous embedding models across a diverse set of text search, sentence similarity, and code search benchmarks."

In [9]:
MODEL_NAME = "curie"

DOC_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-doc-001"
QUERY_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-query-001"

In [10]:
def get_embedding(text: str, model: str) -> list[float]:
    result = openai.Embedding.create(
      model=model,
      input=text
    )
    return result["data"][0]["embedding"]

def get_doc_embedding(text: str) -> list[float]:
    return get_embedding(text, DOC_EMBEDDINGS_MODEL)

def get_query_embedding(text: str) -> list[float]:
    return get_embedding(text, QUERY_EMBEDDINGS_MODEL)

def compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps between each embedding vector and the index of the row that it corresponds to.
    """
    return {
        idx: get_doc_embedding(r.content.replace("\n", " ")) for idx, r in df.iterrows()
    }

In [11]:
def load_embeddings(fname: str) -> dict[tuple[str, str], list[float]]:
    """
    Read the document embeddings and their keys from a CSV.
    
    fname is the path to a CSV with exactly these named columns: 
        "title", "heading", "0", "1", ... up to the length of the embedding vectors.
    """
    
    df = pd.read_csv(fname, header=0)
    max_dim = max([int(c) for c in df.columns if c != "title" and c != "heading"])
    return {
           (r.title, r.heading): [r[str(i)] for i in range(max_dim + 1)] for _, r in df.iterrows()
    }
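
The CSV layout that `load_embeddings` expects (one row per section, one column per vector dimension) can be produced as sketched below. This is an illustration under assumptions: `save_embeddings` and `read_embeddings` are hypothetical helpers, written with the standard-library `csv` module instead of pandas so the round trip can be checked in isolation, and the toy vector has only three dimensions.

```python
import csv
import io


def save_embeddings(embeddings: dict[tuple[str, str], list[float]], fileobj) -> None:
    """Write one row per section: title, heading, then one column per dimension."""
    dim = len(next(iter(embeddings.values())))
    writer = csv.writer(fileobj)
    writer.writerow(["title", "heading"] + [str(i) for i in range(dim)])
    for (title, heading), vector in embeddings.items():
        writer.writerow([title, heading] + list(vector))


def read_embeddings(fileobj) -> dict[tuple[str, str], list[float]]:
    """Standard-library counterpart of load_embeddings, used to check the round trip."""
    reader = csv.DictReader(fileobj)
    dims = [c for c in reader.fieldnames if c not in ("title", "heading")]
    return {
        (row["title"], row["heading"]): [float(row[c]) for c in dims]
        for row in reader
    }


toy = {("2020 Summer Olympics", "Summary"): [0.1, -0.2, 0.3]}
buffer = io.StringIO()
save_embeddings(toy, buffer)
buffer.seek(0)
restored = read_embeddings(buffer)
print(restored)
```

Keeping sections as rows matters: converting the dictionary directly with `pd.DataFrame.from_dict` and its default orientation puts sections in columns instead, which `load_embeddings` cannot read back.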

Again, we have hosted the embeddings for you so you don't have to re-calculate them from scratch.

In [35]:
document_embeddings = load_embeddings("https://cdn.openai.com/API/examples/data/olympics_sections_document_embeddings.csv")

# ===== OR, uncomment the below line to recalculate the embeddings from scratch. ========

## ss: this is weird?? They embed the docs using the old curie model, but then use the new ada model to complete??
# document_embeddings = compute_doc_embeddings(df)

## begin ss

Look at the `load_embeddings()` code above: the df created in pandas is converted to a dict...

In [31]:
ss_df = pd.read_csv("https://cdn.openai.com/API/examples/data/olympics_sections_document_embeddings.csv", header=0)

In [32]:
ss_df.head()

Unnamed: 0,title,heading,0,1,2,3,4,5,6,7,...,4086,4087,4088,4089,4090,4091,4092,4093,4094,4095
0,2020 Summer Olympics,Summary,-0.000897,0.002714,-0.00031,0.006602,-0.00986,0.019903,-0.001742,-0.033593,...,-0.005145,-0.011642,-0.001595,0.012264,0.00605,-0.001135,-0.022682,0.016751,0.020379,0.003457
1,2020 Summer Olympics,Host city selection,-0.005577,0.002411,0.013611,0.008135,-0.00967,0.013684,-0.025077,-0.015912,...,-0.002125,-0.001044,-0.008501,0.007273,0.013565,0.003856,-0.012933,-0.001128,0.014986,0.002653
2,2020 Summer Olympics,Impact of the COVID-19 pandemic,-0.007205,-0.022554,0.008785,-0.008046,-0.021775,0.01549,-0.01017,-0.049533,...,0.00243,-0.001514,-0.013224,0.015384,0.003454,-0.000574,-0.023864,0.009197,0.023209,-0.0079
3,2020 Summer Olympics,Qualifying event cancellation and postponement,0.00939,-0.00873,-0.00704,-0.006851,-0.009161,0.018375,-0.01181,-0.037736,...,-0.004033,0.000491,-0.000358,0.006301,0.002385,-0.010261,-0.014301,0.013509,0.022564,0.006614
4,2020 Summer Olympics,Effect on doping tests,-0.003449,-0.003978,0.010705,-0.010677,-0.002489,0.012567,-0.006442,-0.061913,...,0.001444,0.015333,-0.004995,0.011464,0.003408,0.003587,-0.014023,0.022349,0.008187,0.008282


In [14]:
type(document_embeddings)
ssdocembdf = pd.DataFrame.from_dict(document_embeddings)

In [26]:
ssdocembdf.to_csv('ss_olympics_sections_document_embeddings.csv', index=False)

In [27]:
xdf = pd.read_csv('ss_olympics_sections_document_embeddings.csv')



In [28]:
xdf.head()

Unnamed: 0,2020 Summer Olympics,2020 Summer Olympics.1,2020 Summer Olympics.2,2020 Summer Olympics.3,2020 Summer Olympics.4,2020 Summer Olympics.5,2020 Summer Olympics.6,2020 Summer Olympics.7,2020 Summer Olympics.8,2020 Summer Olympics.9,...,Cuba at the 2020 Summer Olympics.9,Cuba at the 2020 Summer Olympics.10,Cuba at the 2020 Summer Olympics.11,Cuba at the 2020 Summer Olympics.12,Cuba at the 2020 Summer Olympics.13,Haiti at the 2020 Summer Olympics,Haiti at the 2020 Summer Olympics.1,Haiti at the 2020 Summer Olympics.2,Haiti at the 2020 Summer Olympics.3,Haiti at the 2020 Summer Olympics.4
0,Summary,Host city selection,Impact of the COVID-19 pandemic,Qualifying event cancellation and postponement,Effect on doping tests,Postponement to 2021,Calls for cancellation,Costs and insurance,Public opinion and COVID-19 effect during and ...,Development and preparation,...,Swimming,Table tennis,Taekwondo,Weightlifting,Wrestling,Summary,Athletics,Judo,Swimming,Taekwondo
1,-0.00089670566,-0.0055773463,-0.0072051142,0.009390047,-0.0034491313,-0.00065042876,-0.018073034,-0.017382856,-0.010526956,-0.007739272,...,-0.008434482,-0.012393502,-0.0021798091,-0.009051862,-0.0027981652,-0.006525255,-0.013115763,0.0034225625,-0.007726513,-0.0075833844
2,0.0027141054,0.002410587,-0.0225536,-0.008730016,-0.003978028,-0.0070385886,-0.00710012,-0.002339317,0.007109458,-0.013500629,...,0.00693819,0.0066829505,0.0120239975,0.0023049156,0.0034500759,-0.00047803757,-0.009591711,-0.00059340993,0.0025724529,-0.004795242
3,-0.00030984893,0.013611108,0.008785105,-0.0070403353,0.010704512,0.0064756847,0.015986323,0.013760688,0.013014824,0.016122727,...,0.0015780124,-0.0038093277,0.01649163,0.010040016,0.008989019,0.0012408277,-0.0009492279,0.011610489,0.0060237506,0.01191545
4,0.0066024954,0.008134585,-0.008046006,-0.006851126,-0.010677389,-0.0039037166,-0.00816558,-0.011789802,-0.01320075,-0.0040737786,...,0.008632453,0.012705403,0.014928877,0.014458742,0.022164958,0.0075322385,0.009272604,0.020381654,0.012102429,0.017809602


In [18]:
first_key = list(document_embeddings)[0]
first_val = list(document_embeddings.values())[0]
first_key, first_val
len(first_val)

4096

In [15]:
ssdocembdf.shape

(4096, 3964)

In [16]:
ssdocembdf.head()

Unnamed: 0_level_0,2020 Summer Olympics,2020 Summer Olympics,2020 Summer Olympics,2020 Summer Olympics,2020 Summer Olympics,2020 Summer Olympics,2020 Summer Olympics,2020 Summer Olympics,2020 Summer Olympics,2020 Summer Olympics,...,Cuba at the 2020 Summer Olympics,Cuba at the 2020 Summer Olympics,Cuba at the 2020 Summer Olympics,Cuba at the 2020 Summer Olympics,Cuba at the 2020 Summer Olympics,Haiti at the 2020 Summer Olympics,Haiti at the 2020 Summer Olympics,Haiti at the 2020 Summer Olympics,Haiti at the 2020 Summer Olympics,Haiti at the 2020 Summer Olympics
Unnamed: 0_level_1,Summary,Host city selection,Impact of the COVID-19 pandemic,Qualifying event cancellation and postponement,Effect on doping tests,Postponement to 2021,Calls for cancellation,Costs and insurance,Public opinion and COVID-19 effect during and after the Games,Development and preparation,...,Swimming,Table tennis,Taekwondo,Weightlifting,Wrestling,Summary,Athletics,Judo,Swimming,Taekwondo
0,-0.000897,-0.005577,-0.007205,0.00939,-0.003449,-0.00065,-0.018073,-0.017383,-0.010527,-0.007739,...,-0.008434,-0.012394,-0.00218,-0.009052,-0.002798,-0.006525,-0.013116,0.003423,-0.007727,-0.007583
1,0.002714,0.002411,-0.022554,-0.00873,-0.003978,-0.007039,-0.0071,-0.002339,0.007109,-0.013501,...,0.006938,0.006683,0.012024,0.002305,0.00345,-0.000478,-0.009592,-0.000593,0.002572,-0.004795
2,-0.00031,0.013611,0.008785,-0.00704,0.010705,0.006476,0.015986,0.013761,0.013015,0.016123,...,0.001578,-0.003809,0.016492,0.01004,0.008989,0.001241,-0.000949,0.01161,0.006024,0.011915
3,0.006602,0.008135,-0.008046,-0.006851,-0.010677,-0.003904,-0.008166,-0.01179,-0.013201,-0.004074,...,0.008632,0.012705,0.014929,0.014459,0.022165,0.007532,0.009273,0.020382,0.012102,0.01781
4,-0.00986,-0.00967,-0.021775,-0.009161,-0.002489,0.003901,-0.009222,-0.003265,-0.012891,-0.017384,...,0.001442,0.001203,-0.028148,-0.006269,-0.021449,0.003057,-0.005753,-0.014332,0.004552,-0.018927


In [33]:
list(document_embeddings.items())[0]

(('2020 Summer Olympics', 'Summary'),
 [-0.00089670566,
  0.0027141054,
  -0.00030984893,
  0.0066024954,
  -0.009860336,
  ...])

## end ss

In [34]:
# An example embedding:
example_entry = list(document_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

('2020 Summer Olympics', 'Summary') : [-0.00089670566, 0.0027141054, -0.00030984893, 0.0066024954, -0.009860336]... (4096 entries)


So we have split our document library into sections, and encoded them by creating embedding vectors that represent each chunk. Next we will use these embeddings to answer our users' questions.

# 2) Find the most similar document embeddings to the question embedding

At the time of question-answering, to answer the user's query we compute the query embedding of the question and use it to find the most similar document sections. Since this is a small example, we store and search the embeddings locally. If you have a larger dataset, consider using a vector search engine like [Pinecone](https://www.pinecone.io/) or [Weaviate](https://github.com/semi-technologies/weaviate) to power the search.
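
As a sketch of the local search step, the snippet below scores every stored vector against the query and keeps only the best `k`. It assumes made-up 2-dimensional vectors and a hypothetical helper name, `top_k_sections`; `heapq.nlargest` is used so the full list does not have to be sorted when only a few results are needed.

```python
import heapq


def top_k_sections(
    query_embedding: list[float],
    contexts: dict[tuple[str, str], list[float]],
    k: int = 5,
) -> list[tuple[float, tuple[str, str]]]:
    """Return the k (similarity, key) pairs with the highest dot-product similarity."""
    scored = (
        (sum(q * d for q, d in zip(query_embedding, doc_embedding)), key)
        for key, doc_embedding in contexts.items()
    )
    return heapq.nlargest(k, scored)


# Toy index: keys mimic the (title, heading) pairs used in this notebook.
toy_index = {
    ("A", "Summary"): [1.0, 0.0],
    ("B", "Summary"): [0.0, 1.0],
    ("C", "Summary"): [0.7, 0.7],
}
best = top_k_sections([1.0, 0.2], toy_index, k=2)
print(best)
```

This brute-force scan is linear in the number of sections, which is fine for a few thousand rows; a dedicated vector search engine becomes worthwhile when the scan itself is too slow.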

In [38]:
def vector_similarity(x: list[float], y: list[float]) -> float:
    """
    We could use cosine similarity or dot product to calculate the similarity between vectors.
    In practice, we have found it makes little difference. 
    """
    return np.dot(np.array(x), np.array(y))

def order_document_sections_by_query_similarity(query: str, contexts: dict[tuple[str, str], list[float]]) -> list[tuple[float, tuple[str, str]]]:
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections. 
    
    Return the list of document sections, sorted by relevance in descending order.
    """
    query_embedding = get_query_embedding(query)
    
    document_similarities = sorted([
        (vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()
    ], reverse=True)
    
    return document_similarities
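
One likely reason the choice between dot product and cosine similarity makes little difference: when both vectors have unit length, the two are the same number. OpenAI's documentation states that its embeddings are normalized to length 1; treat that as an assumption to verify for whichever model you use. The helper names and example vectors below are illustrative only.

```python
import math


def dot(x: list[float], y: list[float]) -> float:
    return sum(a * b for a, b in zip(x, y))


def norm(x: list[float]) -> float:
    return math.sqrt(dot(x, x))


def cosine(x: list[float], y: list[float]) -> float:
    return dot(x, y) / (norm(x) * norm(y))


def normalize(x: list[float]) -> list[float]:
    """Scale x to unit length."""
    length = norm(x)
    return [a / length for a in x]


raw_a, raw_b = [3.0, 4.0], [1.0, 2.0]
unit_a, unit_b = normalize(raw_a), normalize(raw_b)

# For unit-length vectors the denominator of the cosine is 1, so both measures agree.
print(dot(unit_a, unit_b), cosine(raw_a, raw_b))
```

If the vectors were not normalized, the dot product would also reward longer vectors, and the two rankings could differ.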

In [39]:
order_document_sections_by_query_similarity("Who won the men's high jump?", document_embeddings)[:5]

[(0.42962625596241344,
  ("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')),
 (0.40670511466655446,
  ("Athletics at the 2020 Summer Olympics – Women's high jump", 'Summary')),
 (0.4046927661451428,
  ("Athletics at the 2020 Summer Olympics – Men's high jump", 'Background')),
 (0.4042442976710603,
  ("Athletics at the 2020 Summer Olympics – Men's triple jump", 'Summary')),
 (0.40219236319882934,
  ("Athletics at the 2020 Summer Olympics – Women's long jump", 'Summary'))]

## start ss

In [None]:
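# Does not work with the hosted document embeddings: they are 4096-dimensional (curie),
# while text-embedding-ada-002 returns 1536-dimensional vectors, so the dot product fails.
ADA_EMBEDDINGS = "text-embedding-ada-002"
DOC_EMBEDDINGS_MODEL = ADA_EMBEDDINGS
QUERY_EMBEDDINGS_MODEL = ADA_EMBEDDINGS
# DOC_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-doc-001"
# QUERY_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-query-001"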

In [None]:
[QUERY_EMBEDDINGS_MODEL,DOC_EMBEDDINGS_MODEL]

In [None]:
order_document_sections_by_query_similarity("Who won the men's high jump?", document_embeddings)[:5]

In [None]:
query_embedding = get_query_embedding("Who won the men's high jump?")

In [None]:
[type(document_embeddings),type( query_embedding)]
print(document_embeddings.items())

In [None]:
len(query_embedding)

## end ss

In [40]:
order_document_sections_by_query_similarity("Who won the women's high jump?", document_embeddings)[:5]

[(0.4287929146349248,
  ("Athletics at the 2020 Summer Olympics – Women's high jump", 'Summary')),
 (0.4194122846175017,
  ("Athletics at the 2020 Summer Olympics – Women's long jump", 'Summary')),
 (0.41152657076657995,
  ("Athletics at the 2020 Summer Olympics – Women's triple jump", 'Summary')),
 (0.4096367709206329,
  ("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')),
 (0.4059521236876147,
  ("Athletics at the 2020 Summer Olympics – Women's pole vault", 'Summary'))]

We can see that the most relevant document sections for each question are the summaries for the Men's and Women's high jump competitions - which is exactly what we would expect.

# 3) Add the most relevant document sections to the query prompt

Once we've calculated the most relevant pieces of context, we construct a prompt by simply prepending them to the supplied query. It is helpful to use a query separator to help the model distinguish between separate pieces of text.

In [41]:
MAX_SECTION_LEN = 500
SEPARATOR = "\n* "

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
separator_len = len(tokenizer.tokenize(SEPARATOR))

f"Context separator contains {separator_len} tokens"

'Context separator contains 3 tokens'

In [45]:
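def construct_prompt(question: str, context_embeddings: dict, df: pd.DataFrame) -> str:
    """
    Fetch the document sections most relevant to the question and prepend them,
    up to MAX_SECTION_LEN tokens in total, to the question to form the prompt.
    """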
    most_relevant_document_sections = order_document_sections_by_query_similarity(question, context_embeddings)
    
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_indexes = []
     
    for _, section_index in most_relevant_document_sections:
        # Add contexts until we run out of space.        
        document_section = df.loc[section_index]
        
        chosen_sections_len += document_section.tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break
            
        chosen_sections.append(SEPARATOR + document_section.content.replace("\n", " "))
        chosen_sections_indexes.append(str(section_index))
            
    # Useful diagnostic information
    print(f"Selected {len(chosen_sections)} document sections:")
    print("\n".join(chosen_sections_indexes))
    
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    
    return header + "".join(chosen_sections) + "\n\n Q: " + question + "\n A:"
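
The budget logic inside `construct_prompt` can be looked at in isolation. The sketch below is an assumption-laden simplification: `count_tokens` is a crude whitespace count standing in for the GPT-2 tokenizer, and `pack_sections` is a hypothetical helper that mirrors the loop above, including its choice to stop at the first section that does not fit.

```python
def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    # The notebook uses precomputed GPT-2 token counts instead.
    return len(text.split())


def pack_sections(ranked_sections: list[str], max_tokens: int, separator_tokens: int = 1) -> list[str]:
    """Keep sections in relevance order until adding the next one would exceed the budget."""
    chosen: list[str] = []
    used = 0
    for section in ranked_sections:
        used += count_tokens(section) + separator_tokens
        if used > max_tokens:
            break  # stop at the first section that does not fit
        chosen.append(section)
    return chosen


ranked = ["one two three", "four five", "six seven eight nine"]
chosen = pack_sections(ranked, max_tokens=8)
print(chosen)
```

Stopping at the first overflow keeps the most relevant sections contiguous; an alternative design would skip the oversized section and keep trying smaller ones further down the ranking.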

In [44]:
df.shape

(3964, 2)

In [46]:
prompt = construct_prompt(
    "Who won the 2020 Summer Olympics men's high jump?",
    document_embeddings,
    df
)

print("===\n", prompt)

Selected 3 document sections:
("Athletics at the 2020 Summer Olympics – Women's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's triple jump", 'Summary')
===
 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:

* The women's high jump event at the 2020 Summer Olympics took place on 5 and 7 August 2021 at the Japan National Stadium. Even though 32 athletes qualified through the qualification system for the Games, only 31 took part in the competition. This was the 22nd appearance of the event, having appeared at every Olympics since women's athletics was introduced in 1928.
* The men's high jump event at the 2020 Summer Olympics took place between 30 July and 1 August 2021 at the Olympic Stadium. 33 athletes from 24 nations competed; the total possible number depended on how many nations

We have now obtained the document sections that are most relevant to the question. As a final step, let's put it all together to get an answer to the question.

# 4) Answer the user's question based on the context.

Now that we've retrieved the relevant context and constructed our prompt, we can finally use the Completions API to answer the user's query.

In [47]:
COMPLETIONS_API_PARAMS = {
    # We use temperature of 0.0 because it gives the most predictable, factual answer.
    "temperature": 0.0,
    "max_tokens": 300,
    "model": COMPLETIONS_MODEL,
}

In [48]:
def answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    document_embeddings: dict[tuple[str, str], list[float]],
    show_prompt: bool = False
) -> str:
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )
    
    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
                prompt=prompt,
                **COMPLETIONS_API_PARAMS
            )

    return response["choices"][0]["text"].strip(" \n")

In [53]:
answer_query_with_context("Who won the 2020 Summer Olympics men's high jump?", df, document_embeddings,show_prompt=False)

Selected 3 document sections:
("Athletics at the 2020 Summer Olympics – Women's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's high jump", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's triple jump", 'Summary')


'Gianmarco Tamberi and Mutaz Essa Barshim emerged as joint winners of the event following a tie between both of them as they cleared 2.37m.'

Wow! By combining the Embeddings and Completions APIs, we have created a question-answering model which can answer questions using a large base of additional knowledge. It also understands when it doesn't know the answer! 

For this example we have used a dataset of Wikipedia articles, but that dataset could be replaced with books, articles, documentation, service manuals, or much much more. **We can't wait to see what you create with GPT-3!**

# More Examples

Let's have some fun and try some more examples.

In [90]:
query = "Why was the 2020 Summer Olympics originally postponed?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

ValueError: shapes (1536,) and (4096,) not aligned: 1536 (dim 0) != 4096 (dim 0)
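
The `ValueError` above is what happens when the query is embedded with a different model than the documents: here the query vector has 1536 entries (the size returned by `text-embedding-ada-002`) while the hosted document vectors have 4096. A small guard, sketched below with a hypothetical helper name and dummy zero vectors, makes the cause explicit before any similarity is computed.

```python
def check_same_dimension(
    query_embedding: list[float],
    document_embeddings: dict[tuple[str, str], list[float]],
) -> None:
    """Raise a descriptive error if the query and document vectors differ in length."""
    expected = len(next(iter(document_embeddings.values())))
    actual = len(query_embedding)
    if actual != expected:
        raise ValueError(
            f"Query embedding has {actual} dimensions but document embeddings have "
            f"{expected}; embed both with the same model."
        )


# Dummy vectors that only reproduce the sizes involved.
docs = {("2020 Summer Olympics", "Summary"): [0.0] * 4096}
check_same_dimension([0.0] * 4096, docs)  # same size: passes silently

try:
    check_same_dimension([0.0] * 1536, docs)
except ValueError as error:
    print(error)
```

The fix itself is to recompute the document embeddings with the same model that is used for queries, rather than mixing the hosted vectors with a new query model.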

In [55]:
query = "In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 1 document sections:
('2020 Summer Olympics medal table', 'Summary')

Q: In the 2020 Summer Olympics, how many gold medals did the country which won the most medals win?
A: The United States won the most medals overall, with 113, and the most gold medals, with 39.


In [56]:
query = "What was unusual about the men’s shotput competition?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 3 document sections:
("Athletics at the 2020 Summer Olympics – Men's shot put", 'Summary')
("Athletics at the 2020 Summer Olympics – Men's shot put", 'Background')
("Athletics at the 2020 Summer Olympics – Men's hammer throw", 'Competition format')

Q: What was unusual about the men’s shotput competition?
A: The same three competitors received the same medals in back-to-back editions of an the same individual event.


In [57]:
query = "In the 2020 Summer Olympics, how many silver medals did Italy win?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 1 document sections:
('Italy at the 2020 Summer Olympics', 'Summary')

Q: In the 2020 Summer Olympics, how many silver medals did Italy win?
A: 10


Our Q&A model is less prone to hallucinating answers, and has a better sense of what it does or doesn't know. This works when the information isn't contained in the context; when the question is nonsensical; or when the question is theoretically answerable but beyond GPT-3's powers!

In [58]:
query = "What is the total number of medals won by France, multiplied by the number of Taekwondo medals given out to all countries?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 3 document sections:
('France at the 2020 Summer Olympics', 'Taekwondo')
('2020 Summer Olympics medal table', 'Medal count')
('Taekwondo at the 2020 Summer Olympics – Qualification', 'Qualification summary')

Q: What is the total number of medals won by France, multiplied by the number of Taekwondo medals given out to all countries?
A: I don't know.


In [59]:
query = "What is the tallest mountain in the world?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 7 document sections:
('Chile at the 2020 Summer Olympics', 'Mountain biking')
('South Korea at the 2020 Summer Olympics', 'Sport climbing')
("Cycling at the 2020 Summer Olympics – Men's cross-country", 'Competition format')
("Ski mountaineering at the 2020 Winter Youth Olympics – Boys' individual", 'Summary')
("Cycling at the 2020 Summer Olympics – Women's cross-country", 'Competition format')
('Portugal at the 2020 Summer Olympics', 'Mountain biking')
('Slovenia at the 2020 Summer Olympics', 'Mountain biking')

Q: What is the tallest mountain in the world?
A: I don't know.


In [60]:
query = "Who won the grimblesplatch competition at the 2020 Summer Olympic games?"
answer = answer_query_with_context(query, df, document_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 8 document sections:
("Gymnastics at the 2020 Summer Olympics – Women's trampoline", 'Summary')
("Rowing at the 2020 Summer Olympics – Women's quadruple sculls", 'Summary')
("Cycling at the 2020 Summer Olympics – Women's sprint", 'Summary')
("Cycling at the 2020 Summer Olympics – Women's team sprint", 'Summary')
("Wrestling at the 2020 Summer Olympics – Women's freestyle 62 kg", 'Summary')
("Cycling at the 2020 Summer Olympics – Women's BMX freestyle", 'Summary')
("Rowing at the 2020 Summer Olympics – Women's lightweight double sculls", 'Summary')
("Wrestling at the 2020 Summer Olympics – Women's freestyle 68 kg", 'Summary')

Q: Who won the grimblesplatch competition at the 2020 Summer Olympic games?
A: I don't know.


# begin ss

In [None]:
import tenacity
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)  # for exponential backoff

In [61]:
df.head()

| title | heading | content | tokens |
| --- | --- | --- | --- |
| 2020 Summer Olympics | Summary | The 2020 Summer Olympics (Japanese: 2020年夏季オリン... | 726 |
| 2020 Summer Olympics | Host city selection | The International Olympic Committee (IOC) vote... | 126 |
| 2020 Summer Olympics | Impact of the COVID-19 pandemic | In January 2020, concerns were raised about th... | 374 |
| 2020 Summer Olympics | Qualifying event cancellation and postponement | Concerns about the pandemic began to affect qu... | 298 |
| 2020 Summer Olympics | Effect on doping tests | Mandatory doping tests were being severely res... | 163 |


In [62]:
df.columns

Index(['content', 'tokens'], dtype='object')

In [63]:
type(document_embeddings)

dict

In [64]:
x=list(document_embeddings.items())[0]

In [68]:
# An example embedding:
example_entry = list(document_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

('2020 Summer Olympics', 'Summary') : [-0.00089670566, 0.0027141054, -0.00030984893, 0.0066024954, -0.009860336]... (4096 entries)


In [100]:
ADA_EMBEDDINGS = "text-embedding-ada-002"
DOC_EMBEDDINGS_MODEL = ADA_EMBEDDINGS
QUERY_EMBEDDINGS_MODEL = ADA_EMBEDDINGS
# DOC_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-doc-001"
# QUERY_EMBEDDINGS_MODEL = f"text-search-{MODEL_NAME}-query-001"


In [108]:
@retry(wait=wait_random_exponential(multiplier=30, min=1, max=60), stop=stop_after_attempt(6))
def ss_get_embedding(text: str, model: str) -> list[float]:
    result = openai.Embedding.create(
        model=model,
        input=text
    )
    return result["data"][0]["embedding"]
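
To see what the `multiplier` and `max` arguments imply, here is a small stdlib-only sketch. It does not call tenacity; it assumes the upper bound of each random wait is `multiplier * 2**n`, capped at `max`, which is how exponential backoff with jitter is usually defined. The helper names `backoff_upper_bounds` and `sample_wait` are hypothetical.

```python
import random


def backoff_upper_bounds(multiplier: float, max_wait: float, attempts: int) -> list[float]:
    """Upper bound of the random wait before each retry: multiplier * 2**n, capped at max_wait."""
    return [min(max_wait, multiplier * 2 ** n) for n in range(attempts)]


def sample_wait(upper: float, min_wait: float = 1.0) -> float:
    """Draw one jittered wait, uniformly between min_wait and the upper bound."""
    return random.uniform(min_wait, max(min_wait, upper))


# stop_after_attempt(6) allows at most 5 waits between the 6 attempts.
bounds_30 = backoff_upper_bounds(multiplier=30, max_wait=60, attempts=5)
bounds_1 = backoff_upper_bounds(multiplier=1, max_wait=60, attempts=5)
print(bounds_30)  # [30, 60, 60, 60, 60]
print(bounds_1)   # [1, 2, 4, 8, 16]
print(sample_wait(bounds_30[0]))
```

Under that assumption, `multiplier=30` reaches the 60-second cap on the second wait, so the backoff is close to a flat random wait rather than a gradual ramp.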

In [109]:
# No @retry here: ss_get_embedding already retries, and stacking decorators multiplies the attempts.
def ss_get_doc_embedding(text: str) -> list[float]:
    return ss_get_embedding(text=text, model=DOC_EMBEDDINGS_MODEL)

In [110]:
# No @retry here: ss_get_embedding already retries each request, and retrying at this
# level would restart the whole pass over the dataframe after a single failure.
def ss_compute_doc_embeddings(df: pd.DataFrame) -> dict[tuple[str, str], list[float]]:
    """
    Create an embedding for each row in the dataframe using the OpenAI Embeddings API.
    
    Return a dictionary that maps each row's index to its embedding vector.
    """
    return {
        idx: ss_get_doc_embedding(r.content.replace("\n", " ")) for idx, r in df.iterrows()
    }

In [112]:
ss_context_embeddings = ss_compute_doc_embeddings(df)

KeyboardInterrupt: 
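
Embedding all rows in one dictionary comprehension loses everything if the run is interrupted. A minimal sketch of a resumable variant follows. It appends each result to a JSON-lines checkpoint file and skips sections that are already there. The function name `compute_embeddings_resumable`, the checkpoint format, and `fake_embed` (a stand-in for the API call) are all assumptions for illustration, not part of the notebook.

```python
import json
import os
import tempfile
from typing import Callable

Key = tuple[str, str]


def compute_embeddings_resumable(
    sections: dict[Key, str],
    embed: Callable[[str], list[float]],
    checkpoint_path: str,
) -> dict[Key, list[float]]:
    """Embed each section once, appending every result to a JSON-lines checkpoint."""
    done: dict[Key, list[float]] = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                done[(record["title"], record["heading"])] = record["embedding"]
    with open(checkpoint_path, "a", encoding="utf-8") as f:
        for (title, heading), text in sections.items():
            if (title, heading) in done:
                continue  # already embedded in an earlier run
            vector = embed(text.replace("\n", " "))
            record = {"title": title, "heading": heading, "embedding": vector}
            f.write(json.dumps(record) + "\n")
            f.flush()
            done[(title, heading)] = vector
    return done


# Hypothetical stand-in for the API call; it records how often it is invoked.
calls: list[str] = []


def fake_embed(text: str) -> list[float]:
    calls.append(text)
    return [float(len(text)), 0.0]


toy_sections = {
    ("2020 Summer Olympics", "Summary"): "first\nsection",
    ("2020 Summer Olympics", "Medals"): "second",
}
path = os.path.join(tempfile.mkdtemp(), "embeddings.jsonl")
first_run = compute_embeddings_resumable(toy_sections, fake_embed, path)
second_run = compute_embeddings_resumable(toy_sections, fake_embed, path)
print(len(calls))  # 2: the second run reuses the checkpoint
```

With the notebook's data, `sections` could be built as `{idx: r.content for idx, r in df.iterrows()}` and `embed` could be `ss_get_doc_embedding`.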

### embeddings different
* `ss_context_embeddings` differs from `document_embeddings`
* this furthers my belief that `document_embeddings` was **not** created via `ada` but via `curie`
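
A quick way to check which model produced a set of embeddings is to count vector lengths. The sketch below uses toy dictionaries with the two lengths printed in this notebook (4096 and 1536). `describe_embeddings` is a hypothetical helper. `text-embedding-ada-002` returns 1536-dimensional vectors, so a length of 4096 is consistent with a larger model such as curie.

```python
from collections import Counter


def describe_embeddings(embeddings: dict) -> Counter:
    """Count how many vectors there are of each length."""
    return Counter(len(vector) for vector in embeddings.values())


# Toy stand-ins with the vector lengths seen in the notebook outputs.
toy_document_embeddings = {("2020 Summer Olympics", "Summary"): [0.0] * 4096}
toy_ss_context_embeddings = {("2020 Summer Olympics", "Summary"): [0.0] * 1536}

print(describe_embeddings(toy_document_embeddings))    # Counter({4096: 1})
print(describe_embeddings(toy_ss_context_embeddings))  # Counter({1536: 1})
```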

In [117]:
[len(document_embeddings.items()),len(ss_context_embeddings.items())]

[3964, 3964]

In [120]:
list(ss_context_embeddings.items())[0]

(('2020 Summer Olympics', 'Summary'),
 [0.004835863132029772,
  -0.007446258794516325,
  -0.007789136376231909,
  -0.008487829007208347,
  -0.001518687466159463,
  0.01516423188149929,
  -0.024738917127251625,
  -0.0057642194442451,
  -0.006928708404302597,
  -0.029992055147886276,
  0.0144008444622159,
  0.010467460379004478,
  -0.013288111425936222,
  -0.020456185564398766,
  -0.02061145193874836,
  -0.014193824492394924,
  0.023872019723057747,
  -0.01767435297369957,
  -0.002443809062242508,
  -0.01455610990524292,
  -0.025178834795951843,
  0.006608474068343639,
  -0.006113565992563963,
  0.016044067218899727,
  -0.007957340218126774,
  -0.0020831411238759756,
  0.015215987339615822,
  -0.0153841907158494,
  0.032502174377441406,
  -0.02054675854742527,
  -0.011638418771326542,
  0.011703112162649632,
  -0.008707788772881031,
  0.007989686913788319,
  -0.0026815589517354965,
  -0.017389699816703796,
  -0.017506148666143417,
  -0.014038559049367905,
  0.002869171090424061,
  -0.013

In [124]:
list(document_embeddings.items())[0]

(('2020 Summer Olympics', 'Summary'),
 [-0.00089670566,
  0.0027141054,
  -0.00030984893,
  0.0066024954,
  -0.009860336,
  0.019903438,
  -0.0017420078,
  -0.033592764,
  -0.017947821,
  -0.005254581,
  -0.009243493,
  0.0028511814,
  0.010783314,
  0.038673718,
  -0.008951064,
  -0.009933443,
  0.0055150255,
  0.023412585,
  -0.0036896297,
  0.008325084,
  -0.013625357,
  0.00834336,
  0.0051769046,
  0.020506574,
  0.006031345,
  -0.01423763,
  -0.012272874,
  -0.034671098,
  0.0012976531,
  -0.0020150177,
  -0.0067898324,
  -0.015690636,
  0.014731104,
  0.07438659,
  0.024728514,
  0.009449108,
  0.013762433,
  0.01743607,
  -0.0048113684,
  -0.0891908,
  -0.015571836,
  0.022919111,
  0.013131883,
  -0.014667135,
  0.018578371,
  -0.015571836,
  -0.00065967836,
  -0.0152702695,
  -0.0063420506,
  0.0013056492,
  0.027067946,
  -0.025879953,
  0.0046971384,
  0.0056749475,
  -0.010554854,
  -0.0020412905,
  -0.002305162,
  -0.0007704815,
  0.007671688,
  0.0097506745,
  -0.0154256

In [123]:
ss_context_embeddings.keys()

dict_keys([('2020 Summer Olympics', 'Summary'), ('2020 Summer Olympics', 'Host city selection'), ('2020 Summer Olympics', 'Impact of the COVID-19 pandemic'), ('2020 Summer Olympics', 'Qualifying event cancellation and postponement'), ('2020 Summer Olympics', 'Effect on doping tests'), ('2020 Summer Olympics', 'Postponement to 2021'), ('2020 Summer Olympics', 'Calls for cancellation'), ('2020 Summer Olympics', 'Costs and insurance'), ('2020 Summer Olympics', 'Public opinion and COVID-19 effect during and after the Games'), ('2020 Summer Olympics', 'Development and preparation'), ('2020 Summer Olympics', 'Venues and infrastructure'), ('2020 Summer Olympics', 'Security'), ('2020 Summer Olympics', 'Volunteers'), ('2020 Summer Olympics', 'Medals'), ('2020 Summer Olympics', 'Torch relay'), ('2020 Summer Olympics', 'Biosecurity protocols'), ('2020 Summer Olympics', 'Ticketing'), ('2020 Summer Olympics', 'Cultural festival'), ('2020 Summer Olympics', 'Opening ceremony'), ('2020 Summer Olympics

### interesting: ChatGPT and Copilot
produce the same code

In [127]:
# ChatGPT
import csv
def write_dict_to_csv(dict_data, csv_file_path):
    with open(csv_file_path, "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, dict_data.keys())
        writer.writeheader()
        writer.writerow(dict_data)

In [126]:
write_dict_to_csv(ss_context_embeddings, "ss_context_embeddings.csv")
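
Note that `csv.DictWriter` used this way writes a single data row with one column per section, which is awkward to read back. As an alternative sketch (the function names and the column layout are assumptions, not the notebook's format), each section can be written as its own row, with the vector stored as a JSON string:

```python
import csv
import json
import os
import tempfile

Key = tuple[str, str]


def write_embeddings_csv(embeddings: dict[Key, list[float]], csv_file_path: str) -> None:
    """Write one row per section: title, heading, and the vector as a JSON string."""
    with open(csv_file_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["title", "heading", "embedding"])
        for (title, heading), vector in embeddings.items():
            writer.writerow([title, heading, json.dumps(vector)])


def read_embeddings_csv(csv_file_path: str) -> dict[Key, list[float]]:
    """Rebuild the {(title, heading): vector} dictionary from the CSV file."""
    with open(csv_file_path, newline="", encoding="utf-8") as csv_file:
        reader = csv.DictReader(csv_file)
        return {
            (row["title"], row["heading"]): json.loads(row["embedding"])
            for row in reader
        }


# Toy data shaped like ss_context_embeddings (values shortened for illustration).
toy_embeddings = {
    ("2020 Summer Olympics", "Summary"): [0.0048, -0.0074, -0.0078],
    ("2020 Summer Olympics", "Host city selection"): [0.0015, 0.0152, -0.0247],
}
csv_path = os.path.join(tempfile.mkdtemp(), "toy_embeddings.csv")
write_embeddings_csv(toy_embeddings, csv_path)
restored = read_embeddings_csv(csv_path)
print(restored == toy_embeddings)  # True
```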

In [None]:
# Copilot
import csv
def write_dict_to_csv(dict_data, csv_file_path):
    with open(csv_file_path, "w", newline="") as csv_file:
        writer = csv.DictWriter(csv_file, dict_data.keys())
        writer.writeheader()
        writer.writerow(dict_data)


In [106]:
df.head()

| title | heading | content | tokens |
| --- | --- | --- | --- |
| 2020 Summer Olympics | Summary | The 2020 Summer Olympics (Japanese: 2020年夏季オリン... | 726 |
| 2020 Summer Olympics | Host city selection | The International Olympic Committee (IOC) vote... | 126 |
| 2020 Summer Olympics | Impact of the COVID-19 pandemic | In January 2020, concerns were raised about th... | 374 |
| 2020 Summer Olympics | Qualifying event cancellation and postponement | Concerns about the pandemic began to affect qu... | 298 |
| 2020 Summer Olympics | Effect on doping tests | Mandatory doping tests were being severely res... | 163 |


In [72]:
ssdf = df.copy()

## swap embedding
* test creating document_embeddings with ada vs curie

In [81]:
ss_context_embeddings
example_entry = list(ss_context_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

('2020 Summer Olympics', 'Summary') : [-0.0018701485823839903, 0.0037796199321746826, 0.0010941005311906338, 0.008623270317912102, -0.009770572185516357]... (4096 entries)


In [83]:
ss_context_embeddings
example_entry = list(ss_context_embeddings.items())[0]
print(f"{example_entry[0]} : {example_entry[1][:5]}... ({len(example_entry[1])} entries)")

('2020 Summer Olympics', 'Summary') : [0.004835863132029772, -0.007446258794516325, -0.007789136376231909, -0.008487829007208347, -0.001518687466159463]... (1536 entries)


In [84]:
order_document_sections_by_query_similarity("Who won the men's high jump?", ss_context_embeddings)[:5]

[(0.7596628650945244, ('2020 Summer Olympics', 'Summary')),
 (0.7582199655181385, ('2020 Summer Olympics', 'Host city selection')),
 (0.7230897831012942,
  ('2020 Summer Olympics', 'Qualifying event cancellation and postponement')),
 (0.7166928139607489, ('2020 Summer Olympics', 'Effect on doping tests')),
 (0.707955461326335,
  ('2020 Summer Olympics', 'Impact of the COVID-19 pandemic'))]


Using `document_embeddings` fails here because the query is now embedded with ada (1536 dimensions), while `document_embeddings` was created with curie (4096 dimensions).

In [87]:
order_document_sections_by_query_similarity("Who won the men's high jump?", document_embeddings)[:5]

ValueError: shapes (1536,) and (4096,) not aligned: 1536 (dim 0) != 4096 (dim 0)
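
The `ValueError` above is NumPy's generic shape-mismatch message. A similarity helper can check the dimensions first and say what is probably wrong. This is a pure-Python sketch; `dot_similarity` is a hypothetical name, not the notebook's similarity function.

```python
def dot_similarity(x: list[float], y: list[float]) -> float:
    """Dot product of two embeddings, with an explicit dimension check."""
    if len(x) != len(y):
        raise ValueError(
            f"embedding dimensions differ: {len(x)} vs {len(y)}; "
            "were the query and the documents embedded with the same model?"
        )
    return sum(a * b for a, b in zip(x, y))


query_vec = [0.0] * 1536  # length of an ada query embedding
doc_vec = [0.0] * 4096    # length of the stored document embeddings

try:
    dot_similarity(query_vec, doc_vec)
except ValueError as err:
    message = str(err)
    print(message)
```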

In [91]:
prompt = construct_prompt(
    "Who won the 2020 Summer Olympics men's high jump?",
    ss_context_embeddings,
    df
)

print("===\n", prompt)

Selected 0 document sections:

===
 Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."

Context:


 Q: Who won the 2020 Summer Olympics men's high jump?
 A:
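
One possible explanation for `Selected 0 document sections`: assuming `construct_prompt` adds sections in similarity order and stops at the first one that would exceed a token budget, a top-ranked section that is too large on its own leaves the context empty. The sketch below reproduces that behaviour with the ranking and token counts shown earlier in this notebook. The budget of 500 tokens, the separator cost, and the name `select_sections` are assumptions.

```python
Key = tuple[str, str]


def select_sections(
    ranked: list[tuple[float, Key]],
    token_counts: dict[Key, int],
    max_tokens: int,
    separator_tokens: int = 3,
    skip_oversized: bool = False,
) -> list[Key]:
    """Pick sections in ranked order until the token budget would be exceeded."""
    chosen: list[Key] = []
    used = 0
    for _, key in ranked:
        cost = token_counts[key] + separator_tokens
        if used + cost > max_tokens:
            if skip_oversized:
                continue  # try the next, possibly smaller, section
            break  # stop at the first section that does not fit
        used += cost
        chosen.append(key)
    return chosen


title = "2020 Summer Olympics"
ranked = [
    (0.7597, (title, "Summary")),
    (0.7582, (title, "Host city selection")),
    (0.7231, (title, "Qualifying event cancellation and postponement")),
    (0.7167, (title, "Effect on doping tests")),
    (0.7080, (title, "Impact of the COVID-19 pandemic")),
]
token_counts = {
    (title, "Summary"): 726,
    (title, "Host city selection"): 126,
    (title, "Qualifying event cancellation and postponement"): 298,
    (title, "Effect on doping tests"): 163,
    (title, "Impact of the COVID-19 pandemic"): 374,
}

stop_at_first = select_sections(ranked, token_counts, max_tokens=500)
skip_large = select_sections(ranked, token_counts, max_tokens=500, skip_oversized=True)
print(stop_at_first)  # []
print(skip_large)
```

If this is indeed the cause, the ada ranking puts the 726-token Summary first, so no section fits. Skipping oversized sections instead of stopping would be one way to keep some context.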


In [92]:
query = "Why was the 2020 Summer Olympics originally postponed?"
answer = answer_query_with_context(query, df, ss_context_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 1 document sections:
('2020 Summer Olympics', 'Impact of the COVID-19 pandemic')

Q: Why was the 2020 Summer Olympics originally postponed?
A: The 2020 Summer Olympics were originally postponed because of the potential impact of the COVID-19 pandemic on athletes and visitors to the Olympic Games.


In [94]:
query = "What is the total number of medals won by France, multiplied by the number of Taekwondo medals given out to all countries?"
answer = answer_query_with_context(query, df, ss_context_embeddings)

print(f"\nQ: {query}\nA: {answer}")

Selected 0 document sections:


Q: What is the total number of medals won by France, multiplied by the number of Taekwondo medals given out to all countries?
A: I don't know.


In [95]:



@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def ss_answer_query_with_context(
    query: str,
    df: pd.DataFrame,
    document_embeddings: dict[tuple[str, str], list[float]],
    show_prompt: bool = False
) -> str:
    prompt = construct_prompt(
        query,
        document_embeddings,
        df
    )
    
    if show_prompt:
        print(prompt)

    response = openai.Completion.create(
        prompt=prompt,
        **COMPLETIONS_API_PARAMS
    )

    return response["choices"][0]["text"].strip(" \n")





In [97]:
ss_answer_query_with_context(query="Why was the 2020 Summer Olympics originally postponed?", df=df, document_embeddings=ss_context_embeddings, show_prompt=False)

Selected 1 document sections:
('2020 Summer Olympics', 'Impact of the COVID-19 pandemic')


'The 2020 Summer Olympics were originally postponed because of the potential impact of the COVID-19 pandemic on athletes and visitors.'