In [51]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

import openai
from openai.embeddings_utils import get_embedding, distances_from_embeddings
import tiktoken

## Step 0: Inspecting Non-Customized Results

Before we perform any prompt engineering, **let's ask the OpenAI model some questions and see how it answers**.

(If you encounter an `AuthenticationError` when running this code, make sure that you have added a valid API key to the cell above and executed it.)

In [None]:
openai.api_key = "YOUR_API_KEY"

In [45]:
# Yeti-airlines crash prompt
crash_prompt = """
Question: "When did yeti airlines recently crashed?"
Answer:
"""
initial_yetiairline_crash_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=crash_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_yetiairline_crash_answer)

In [77]:
# Most populous country
populous_prompt = """
Question: "Which is the most populous country in the world?"
Answer:
"""
initial_populous_country_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=populous_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(populous_country_answer)

As of 2021, the most populous country in the world is China, with a population of approximately 1.4 billion people.


# 1. Create Dataset with embeddings 

## Create Dataset

In [24]:
# Get the Wikipedia page for "2022" since OpenAI's models stop in 2021
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2023",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)

In [25]:
# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " - ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except:
            # If the row's text isn't a date, add the prefix
            row["text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")]
df

Unnamed: 0,text
0,– 2023 (MMXXIII) was a common year starting o...
1,The year 2023 saw the decline in severity of t...
2,– The Russian invasion of Ukraine and Myanmar...
3,– A banking crisis resulted in the collapse o...
4,"– In the realm of technology, 2023 saw the co..."
...,...
293,"Economics – Claudia Goldin, for her empirical ..."
294,"Literature – Jon Fosse, for his innovative pla..."
295,"Peace – Narges Mohammadi, for her works on the..."
296,"Physics – Pierre Agostini, Ferenc Krausz & Ann..."


##  Why the chosen dataset is appropriate for this application?
As the model "gpt-3.5-turbo-instruct" was trained on the data up to Sep 2021 and we will like to know about the important events ocurred in 2023, the dataset from the 2023 was choosen and created as above. 

## Create Embeddings  

In [28]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

In [35]:
# Save the embeddings and data to csv file 
df.to_csv("embeddings_project.csv")

# 2. Find the relevant data with Unsupervised Machine Learning
Now that we have embeddings for our dataset, we can use them to perform a semantic text search. A semantic text search means that instead of simply looking for the exact keywords from the question, we are looking for text that is most similar to the question across a number of dimensions. If semantic search means finding the most similar text, how do we measure that similarity? We need to define a measurement of "distance", and whichever piece of text has the shortest distance is the most similar. There are many ways to measure distance when we go beyond 1-dimensional data. Which one should we use for our chatbot? Since, we are using OpenAI API, it recommend to use cosine similarity. 

We need to take one more step before we are ready to implement a semantic text search in Python. We want to find the text that is the shortest distance from our query, and cosine similarity is not a true distance metric.

Instead we’ll calculate cosine distance.

$cosine\,distance=1−cosine\,similarity$

If you want details about the math any why cosine similarity is not a true distance metric, read this article on [cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity#Cosine_Distance).

Sorting by cosine distance works for any kind of data that can be vectorized and works especially well for multimedia data (text, images, videos) that produce vectors with many dimensions. 

Comparing cosine distances to find the best matches is a popular unsupervised machine learning technique that is used in applications like search engines and recommendation engines. It is called an "unsupervised" technique because it works for data that doesn't have labels or dependent variables associated with it.

In [48]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from least to most relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [49]:
get_rows_sorted_by_relevance(crash_prompt, df)

Unnamed: 0,text,embeddings,distances
18,January 15 – Yeti Airlines Flight 691 crashes ...,"[0.002945022424682975, -0.005277556367218494, ...",0.128615
21,January 18 – A helicopter crash in Brovary nea...,"[-0.013045558705925941, 0.0028561376966536045,...",0.221227
197,"– Wagner Group leader Yevgeny Prigozhin, foun...","[-0.003862723708152771, -0.02204275317490101, ...",0.226244
55,"February 28 – A train crash in Thessaly, Greec...","[-0.006406525615602732, -0.0027828142046928406...",0.236767
96,April 11 – Myanmar civil war: In the village o...,"[-0.021450893953442574, -0.006634502671658993,...",0.240881
...,...,...,...
166,– New Zealand signs a free trade agreement wi...,"[0.007262112107127905, -0.021624956279993057, ...",0.316796
61,– UN member states agree on a legal framework...,"[0.016124222427606583, -0.011866559274494648, ...",0.318107
99,– Nuclear power in Germany ends after 50 year...,"[0.0031694083008915186, -0.03617486357688904, ...",0.320473
295,"Peace – Narges Mohammadi, for her works on the...","[-0.013266590423882008, -0.012964030727744102,...",0.322383


In [50]:
get_rows_sorted_by_relevance(populous_prompt, df)

Unnamed: 0,text,embeddings,distances
278,"– The world population on January 1, 2023 was...","[0.010348082520067692, 0.0012515030102804303, ...",0.207840
162,July 4 – Iran joins the Shanghai Cooperation O...,"[-0.0055496529676020145, -0.011195877566933632...",0.241719
208,September 9 – At the 18th G20 summit in New De...,"[-0.0016951067373156548, -0.016396773979067802...",0.243887
223,October 5 – November 19 – The 2023 Cricket Wor...,"[-0.00882712472230196, -0.021790755912661552, ...",0.246750
203,September 1 – 2023 Singaporean presidential el...,"[-0.004922438412904739, -0.024788692593574524,...",0.249285
...,...,...,...
253,– Israel and Hamas agree to a four-day ceasef...,"[-0.03233906999230385, -0.02131197787821293, 0...",0.319540
188,"August 10 – Tapestry, the holding company of C...","[-0.03186238929629326, -0.01694248989224434, -...",0.319769
205,September 6 – Bassist Richard Davis dies at th...,"[-0.005590077023953199, 0.008676853030920029, ...",0.321001
41,– A Norfolk Southern train carrying hazardous...,"[0.0024183171335607767, 0.008508063852787018, ...",0.324381


# 3. Compose a Custom Text Prompt

- We will use the results of the previous step to compose a custom text prompt that incorporates both the user's question and the most relevant context from our dataset. 
- We will use a tokenizer to measure the length of our prompt and ensure that it doesn't exceed the limits of the OpenAI `Completion` model.

Our data is sorted from most to least relevant -- but how many of those rows (i.e. context) can we include ?

While we could choose arbitrary number, e.g. the top 5 or top 50 most relevant rows, a better approach is to count the number of tokens we use as we compose our text prompt and use all of the available tokens for each prompt. 

**How many tokens are included in Context or What is the maximum amount of context we can include?**

 The model limit minus the number of tokens in the prompt.

For example, if the limit is 4,097 tokens and the prompt contains 24 tokens, the maximum token count for the context will be 4,097 - 24 = 4,073.




In [81]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break
    return prompt_template.format("\n\n###\n\n".join(context), question)

In [79]:
print(create_prompt("When did yeti airlines recently crashed?", df, 200))

Token Count:78
Token Count:110
Token Count:152
Token Count:181
Token Count:211
Token Count:211

Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

January 15 – Yeti Airlines Flight 691 crashes during final approach into Pokhara, Nepal, killing all 72 people on board.

###

January 18 – A helicopter crash in Brovary near Kyiv, Ukraine kills 14 people including Ukrainian Minister of Internal Affairs Denys Monastyrsky.

###

February 28 – A train crash in Thessaly, Greece, kills 57 people and injures dozens. The crash leads to nationwide protests and strikes against the condition of Greek railways and their mismanagement.

###

June 2 – A train collision in Odisha, India results in at least 296 deaths and more than 1,200 others injured.

---

Question: When did yeti airlines recently crashed?
Answer:


In [75]:
print(create_prompt("Which is the most populous country in the world?", df, 300))

Token Count:208
Token Count:226
Token Count:259
Token Count:311

Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

 – The world population on January 1, 2023 was estimated at 7.943 billion people, and was expected to increase to 8.119 billion on January 1, 2024. An estimated 134.3 million births and 60.8 million deaths were expected to take place in 2023. The average global life expectancy was 73.16 years, an increase of 0.18 years from 2022. The rate of child mortality was by the end of the year, expected to have decreased from 2022. Less than 23% of people were living in extreme poverty (on or below the international poverty line), a decrease from 2022. In April, India surpassed China as the most populated country in the world.

###

July 4 – Iran joins the Shanghai Cooperation Organisation, becoming the organization's ninth member.

###

September 9 – At the 18th G20 summit in New Delhi, the Afr

# 4. Query a Completion Model

- Once we have the custom text prompt, the last step is to send that prompt to an OpenAI text completion model and parse the response. This will provide the user with a tailored answer to their question!

In [76]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

In [89]:
custom_airline_crash_answer = answer_question("When did yeti airlines crashed and Where? Include year too.", df)
print(custom_airline_crash_answer)

January 15, 2023 in Pokhara, Nepal.


In [91]:
custom_populous_country_answer = answer_question("Which is the most populous country in the world?", df)
print(custom_populous_country_answer)

Based on the context, as of April 2023, India surpassed China as the most populated country in the world.


Below we compare answers with and without our custom prompt:

In [93]:
print(f"""
When did yeti airlines crashed and Where? Include year too.

Original Answer: {initial_yetiairline_crash_answer}
Custom Answer:   {custom_airline_crash_answer}

Which is the most populous country in the world?
Original Answer: {initial_populous_country_answer}
Custom Answer:   {custom_populous_country_answer}
""")


When did yeti airlines crashed and Where? Include year too.

Original Answer: There was no recent crash involving Yeti Airlines.
Custom Answer:   January 15, 2023 in Pokhara, Nepal.

Which is the most populous country in the world?
Original Answer: The most populous country in the world is China with a population of over 1.4 billion people.
Custom Answer:   Based on the context, as of April 2023, India surpassed China as the most populated country in the world.

