# Custom Chatbot Project

For this custom chatbot project, I use the Wikipedia page on the 2024 Summer Olympics to answer questions about the Games.

We use GPT-3.5 as the base model for answering questions. Its training data extends only to September 2021, so it has no knowledge of events or developments after that date, and its responses cannot reflect anything that happened later.

To let GPT-3.5 answer questions about recent events such as the 2024 Summer Olympics, we use a retrieval-augmented generation (RAG) approach: it retrieves the relevant information from the dataset, inserts both the question and that information into the GPT-3.5 prompt, and answers from the supplied context rather than relying only on the data GPT-3.5 was trained on.

The 2024 Summer Olympics took place after September 2021, so GPT-3.5 has no information about them, which makes this dataset a suitable source of context for RAG.
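The retrieve-then-prompt flow described above can be sketched without any API calls. This is only an illustration: `embed` is a toy word-count stand-in for a real embedding model, and `build_rag_prompt` and the sample passages are hypothetical names and data, not part of the notebook.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_rag_prompt(question, passages, top_k=1):
    """Retrieve the top_k most similar passages and place them in the prompt."""
    q_vec = embed(question)
    ranked = sorted(
        passages,
        key=lambda p: cosine_similarity(q_vec, embed(p)),
        reverse=True,
    )
    context = "\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


passages = [
    "Paris was the host city of the 2024 Summer Olympics.",
    "The marathon is a long-distance running event.",
]
prompt = build_rag_prompt("Which city hosted the 2024 Summer Olympics?", passages)
print(prompt)
```

The model then only has to read the answer out of the context, which is why the approach works for events outside its training data.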

We could ask the following questions:

- Which city hosted the 2024 Summer Olympics?
- When did the opening ceremony start?
- Which country won the most medals at the 2024 Summer Olympics?
- Where was the closing ceremony held?
- How many gold medals did Japan win?



In [1]:
import pandas as pd

import re

import openai


## Data Wrangling

In the cells below, load the chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all text data, separated into at least 20 rows.

In [2]:
import requests

# API endpoint
base_url = "https://en.wikipedia.org/w/api.php"

# Parameters for the request
parameters = {
    "action": "query",
    "format": "json",
    "titles": "2024 Summer Olympics",
    "prop": "extracts",
    "explaintext": True  # Get plain text without HTML
}

# Make the request
response = requests.get(base_url, params=parameters)

# Parse the response JSON
data = response.json()

# Extract the page content
page = next(iter(data['query']['pages'].values()))
page_content = page.get('extract', 'No content found')

# print(page_content)
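The nested lookup used to pull the page text out of the response can be exercised offline. The sketch below uses a hand-written dictionary that imitates the response shape the cell above relies on; the page id, the sample text, and the helper name `extract_page_text` are made up for illustration.

```python
# Hypothetical response imitating the shape used above:
# {"query": {"pages": {"<page id>": {"title": ..., "extract": ...}}}}
sample_response = {
    "query": {
        "pages": {
            "12345": {  # made-up page id
                "title": "2024 Summer Olympics",
                "extract": "Paris was the host city.",
            }
        }
    }
}


def extract_page_text(data):
    """Return the plain-text extract of the first page, or None if absent."""
    pages = data.get("query", {}).get("pages", {})
    if not pages:
        return None
    first_page = next(iter(pages.values()))
    return first_page.get("extract")


print(extract_page_text(sample_response))
# A page entry without an "extract" key yields None instead of raising
print(extract_page_text({"query": {"pages": {"-1": {"missing": ""}}}}))
```

Returning `None` rather than a placeholder string is a design choice: a placeholder such as `'No content found'` would otherwise be split into sentences and embedded as if it were page content.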

In [3]:
# Create a dataframe with a text column containing sentences of the wiki page
df = pd.DataFrame()
sentences = re.split(r'(?<=[.!?]) +', page_content)
df["text"] = sentences
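To see what the split pattern does and does not do, here is the same regular expression applied to a made-up sample string. It splits on spaces that follow `.`, `!`, or `?`, so it leaves newlines alone and also splits after abbreviations.

```python
import re

# Made-up sample text, not taken from the Wikipedia page
sample = "Paris hosted the Games. They ran in 2024! Was it hot?\nYes. The U.S. team won."

# Same pattern as in the cell above: split on spaces preceded by ., ! or ?
sentences = re.split(r"(?<=[.!?]) +", sample)

for sentence in sentences:
    print(repr(sentence))
```

Two consequences are visible here: text separated only by a newline stays in one row, and an abbreviation such as "U.S." is cut in two. For a retrieval dataset this is usually tolerable, but it explains rows that contain `\n` or look truncated.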


In [4]:
df.shape

(133, 1)

In [5]:
df

| | text |
|---|---|
| 0 | The 2024 Summer Olympics (French: Jeux olympiq... |
| 1 | Paris was the host city, with events (mainly f... |
| 2 | After multiple withdrawals that left only Pari... |
| 3 | Having previously hosted in 1900 and 1924, Par... |
| 4 | Paris 2024 marked the centenary of Paris 1924 ... |
| ... | ... |
| 128 | The artistic director of the ceremony, Thomas ... |
| 129 | Among those who expressed appreciation for the... |
| 130 | According to Georgian fact checking website, M... |
| 131 | Olympics.com. |


In [6]:
# keep only the rows that have more than 20 characters
df = df[df["text"].str.len() > 20]

In [7]:
df.shape

(131, 1)

In [8]:
# check whether any row still starts with a section heading marker
df[df["text"].str.startswith("==")]

| | text |
|---|---|
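An empty result only shows that no row begins with `==`. Because the split happens on spaces after sentence punctuation, a heading could still sit in the middle of a row after a newline. The sketch below assumes headings appear in the plain-text extract as lines of the form `== Title ==`; the sample string and the pattern are illustrative, not taken from the notebook.

```python
import re

# Made-up sample imitating a row that contains a heading after a newline
sample = "End of the lead section.\n\n\n== Bidding process ==\nThe IOC approved the process."

# A heading line: one or more "=", a space, the title, a space, one or more "="
heading_pattern = re.compile(r"^=+ .*? =+$", flags=re.MULTILINE)

print(sample.startswith("=="))          # the startswith check misses it
print(heading_pattern.findall(sample))  # the line-anchored pattern finds it

# Remove heading lines before splitting into sentences
cleaned = heading_pattern.sub("", sample)
```

Applying such a substitution to `page_content` before the sentence split would keep heading text out of the embedded rows.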


In [9]:
df.head()

| | text |
|---|---|
| 0 | The 2024 Summer Olympics (French: Jeux olympiq... |
| 1 | Paris was the host city, with events (mainly f... |
| 2 | After multiple withdrawals that left only Pari... |
| 3 | Having previously hosted in 1900 and 1924, Par... |
| 4 | Paris 2024 marked the centenary of Paris 1924 ... |


In [10]:
df.tail()

| | text |
|---|---|
| 127 | While there is nominally an Olympic Truce in p... |
| 128 | The artistic director of the ceremony, Thomas ... |
| 129 | Among those who expressed appreciation for the... |
| 130 | According to Georgian fact checking website, M... |
| 132 | International Olympic Committee.\nEuropean Oly... |


In [11]:
# reset index for df
df.reset_index(inplace=True, drop=True)

## Custom Query Completion

In the cells below, compose a custom query using the chosen dataset and retrieve results from an OpenAI `Completion` model.

### Obtain embeddings for the 2024 Summer Olympic dataset

In [12]:
# load openai and key
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

In [13]:
# get embeddings for the text column of df using the model named below

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

response = openai.Embedding.create(
    input=df["text"].tolist(),
    model=EMBEDDING_MODEL_NAME
)

In [14]:
response.keys()

dict_keys(['object', 'data', 'model', 'usage'])

In [15]:
len(response['data'][2]['embedding'])

1536

In [16]:
embeddings = [data['embedding'] for data in response["data"]]

In [23]:
# Work on an explicit copy so pandas does not warn about setting a value
# on a slice of the original dataframe (df was filtered in an earlier cell)
df = df.copy()
df["embeddings"] = embeddings


In [24]:
df

| | text | embeddings |
|---|---|---|
| 0 | The 2024 Summer Olympics (French: Jeux olympiq... | [-0.008732384070754051, -0.0021369855385273695... |
| 1 | Paris was the host city, with events (mainly f... | [0.009677358902990818, 0.007114940322935581, 0... |
| 2 | After multiple withdrawals that left only Pari... | [0.010585588403046131, -0.0005463769193738699,... |
| 3 | Having previously hosted in 1900 and 1924, Par... | [0.015574367716908455, -0.012037031352519989, ... |
| 4 | Paris 2024 marked the centenary of Paris 1924 ... | [0.005231217481195927, -0.0089668994769454, -0... |
| ... | ... | ... |
| 126 | While there is nominally an Olympic Truce in p... | [-0.0018343077972531319, -0.014580567367374897... |
| 127 | The artistic director of the ceremony, Thomas ... | [-0.007993845269083977, -0.019081221893429756,... |
| 128 | Among those who expressed appreciation for the... | [-0.0131845036521554, -0.008254049345850945, -... |
| 129 | According to Georgian fact checking website, M... | [0.012646778486669064, 0.01000780425965786, -0... |


In [19]:
df.to_csv('embeddings.csv')
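One thing to keep in mind when saving embeddings to CSV: each vector is written as text, so it comes back as a string and has to be parsed before it can be used for distance calculations. The sketch below shows the round trip with the standard library `csv` module and an in-memory buffer to stay dependency-free; `pandas.read_csv` would likewise load the column as strings. The row and its 3-dimensional vector are toy data.

```python
import ast
import csv
import io

# Toy row: a sentence and a made-up 3-dimensional embedding
rows = [("Paris was the host city.", [0.1, -0.2, 0.3])]

buffer = io.StringIO()  # in-memory stand-in for embeddings.csv
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerows(rows)

# Read the file back: the embedding is now a string such as "[0.1, -0.2, 0.3]"
buffer.seek(0)
record = next(csv.DictReader(buffer))
raw = record["embeddings"]

# Parse the string back into a list of floats
vector = ast.literal_eval(raw)
print(type(raw).__name__, type(vector).__name__)
```

With a dataframe loaded from the saved file, the equivalent step would be applying `ast.literal_eval` to the `embeddings` column.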

### Embedding the question and finding the relevant data in the dataset using cosine similarity

In [20]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
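For intuition about the `distances` column, the sketch below computes a cosine distance by hand, assuming the usual definition of one minus cosine similarity. The 2-dimensional vectors and row names are toy values chosen so the ranking is easy to follow; real embeddings here have 1536 dimensions.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],  # distance 0: most relevant
    "orthogonal": [0.0, 3.0],      # distance 1: unrelated
    "opposite": [-1.0, 0.0],       # distance 2: least relevant
}

# Ascending sort puts the smallest distance, i.e. the most relevant row, first
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)
```

Vector length does not matter, only direction, which is why "same direction" scores 0 even though it is twice as long as the question vector.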
    


In [25]:
question = "Which country hosted 2024 Summer Olympics"
get_rows_sorted_by_relevance(question, df).head()

| | text | embeddings | distances |
|---|---|---|---|
| 0 | The 2024 Summer Olympics (French: Jeux olympiq... | [-0.008732384070754051, -0.0021369855385273695... | 0.130693 |
| 28 | On 31 July 2017, the IOC announced Los Angeles... | [0.01342105958610773, -0.0021386248990893364, ... | 0.133882 |
| 26 | The International Olympic Committee formally p... | [0.017240140587091446, -0.018078293651342392, ... | 0.133952 |
| 2 | After multiple withdrawals that left only Pari... | [0.010585588403046131, -0.0005463769193738699,... | 0.137379 |
| 3 | Having previously hosted in 1900 and 1924, Par... | [0.015574367716908455, -0.012037031352519989, ... | 0.138159 |


### Compose a Custom Prompt

In [26]:
# count tokens
import tiktoken


def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
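The context-selection loop in `create_prompt` is a greedy packing under a token budget. The sketch below isolates that idea; `count_tokens` is a hypothetical whitespace-based stand-in for `len(tokenizer.encode(text))`, and `pack_context` is an illustrative name, so the numbers differ from real token counts.

```python
def count_tokens(text):
    """Hypothetical stand-in for len(tokenizer.encode(text)): count words."""
    return len(text.split())


def pack_context(ranked_texts, base_token_count, max_token_count):
    """Add rows, most relevant first, until the next one would exceed the budget."""
    used = base_token_count
    chosen = []
    for text in ranked_texts:
        cost = count_tokens(text)
        if used + cost > max_token_count:
            break  # stop at the first row that does not fit
        used += cost
        chosen.append(text)
    return chosen, used


# Toy rows, already sorted from most to least relevant
ranked = ["one two three", "four five", "six seven eight nine", "ten"]
chosen, used = pack_context(ranked, base_token_count=2, max_token_count=8)
print(chosen, used)
```

Note the design choice this makes visible: the loop stops at the first row that overflows, so a shorter, less relevant row further down (here `"ten"`) is not used even though it would fit. The separator strings between rows are also not counted, so the real prompt can run slightly over the nominal budget.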

In [27]:
print(create_prompt("Which city hosted 2024 Summer Olympics?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

After multiple withdrawals that left only Paris and Los Angeles in contention, the International Olympic Committee (IOC) approved a process to concurrently award the 2024 and 2028 Summer Olympics to the two remaining candidate cities; both bids were praised for their high technical plans and innovative ways to use a record-breaking number of existing and temporary facilities.

###

The IOC set up a process whereby the LA 2024 and Paris 2024 bid committees met with the IOC to discuss which city would host the Games in 2024 and 2028 and whether it was possible to select the host cities for both at the same time.
Following the decision to award the two Games simultaneously, Paris was understood to be the preferred host for 2024.

---

Question: Which city hosted 2024 Summer Olympics?
Answer:


### Create a function that answers a question

In [28]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

In [29]:
custom_2024olympics_answer = answer_question("Which city hosted 2024 Summer Olympics?", df)
print(custom_2024olympics_answer)

Paris


## Custom Performance Demonstration

In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [30]:
# Create a function that answers the question without context

def answer_question_initial(
    question, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
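    Given a question, return the answer to the question according to an
    OpenAI Completion model, without any retrieved context

    max_prompt_tokens is unused here; it is kept only so the signature
    mirrors answer_question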
    
    If the model produces an error, return an empty string
    """
    
    prompt_template = """

Question: {}
Answer:"""
    
    prompt = prompt_template.format(question)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

### Question 1: When did the opening ceremony start for 2024 Summer Olympics?

In [31]:
question1 = "When did the opening ceremony start for 2024 Summer Olympics?"

answer_initial = answer_question_initial(
    question1, max_prompt_tokens=1800, max_answer_tokens=150
)
answer_custom = answer_question(
    question1, df, max_prompt_tokens=1800, max_answer_tokens=150
)

print(f"Answer from the initial model: \n{answer_initial }\n")
print(f"Answer from the custom model: \n{answer_custom }")

Answer from the initial model: 
The opening ceremony for the 2024 Summer Olympics is scheduled to start on July 26, 2024.

Answer from the custom model: 
The opening ceremony began at 19:30 CEST (17:30 GMT) on 26 July 2024.


### Question 2: Which city hosted 2024 Summer Olympics?

In [32]:
question2 = "Which city hosted 2024 Summer Olympics?"

answer_initial2 = answer_question_initial(
    question2, max_prompt_tokens=1800, max_answer_tokens=150
)

answer_custom2 = answer_question(
    question2, df, max_prompt_tokens=1800, max_answer_tokens=150
)

print(f"Answer from the initial model: \n{answer_initial2 }\n")
print(f"Answer from the custom model: \n{answer_custom2 }")

Answer from the initial model: 
Paris, France.

Answer from the custom model: 
Paris
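The two comparison cells above repeat the same pattern, which could be wrapped in a small helper. The sketch below is one possible shape; `compare_answers` is a hypothetical name, and the two stub functions merely stand in for `answer_question_initial` and `answer_question` so the example runs without an API key.

```python
def compare_answers(questions, baseline_fn, custom_fn):
    """Collect the baseline and context-augmented answers for each question."""
    results = []
    for question in questions:
        results.append(
            {
                "question": question,
                "baseline": baseline_fn(question),
                "custom": custom_fn(question),
            }
        )
    return results


# Stubs standing in for the real model calls (assumptions for illustration)
def fake_baseline(question):
    return "baseline answer"


def fake_custom(question):
    return "custom answer"


results = compare_answers(
    ["Which city hosted 2024 Summer Olympics?"], fake_baseline, fake_custom
)
for row in results:
    print(f"Question: {row['question']}")
    print(f"Answer from the initial model: {row['baseline']}")
    print(f"Answer from the custom model: {row['custom']}")
```

In the notebook, the real functions could be passed instead, for example `compare_answers(questions, answer_question_initial, lambda q: answer_question(q, df))`.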


## Allow user input to ask questions

In [33]:
# to end the conversation, type stop
while True:
    question = input("Type question (press Enter to submit the question, type 'stop' to quit):")
    # accept "stop" regardless of case or surrounding whitespace
    if question.strip().lower() == "stop":
        break
    initial_answer = answer_question_initial(question)
    custom_answer = answer_question(question, df)
    print(f"Question: {question}\n")
    print(f"Initial answer without context: {initial_answer}\n")
    print(f"Custom answer with context: {custom_answer}")
    print('-------------------------------------------------------\n')

Type question (press Enter to submit the question, type 'stop' to quit):Which country had the most metals in 2024 Summer Olympics?
Question: Which country had the most metals in 2024 Summer Olympics?

Initial answer without context: As of now, it is impossible to determine which country will have the most medals in the 2024 Summer Olympics as the event has yet to take place. However, based on statistics and past performances, countries like the United States, China, and Russia are often top contenders for the most medals.

Custom answer with context: The United States had the most total medals with 126, and the most gold medals with 40.
-------------------------------------------------------

Type question (press Enter to submit the question, type 'stop' to quit):Where was the closing ceremony held?
Question: Where was the closing ceremony held?

Initial answer without context: The closing ceremony was held in Tokyo, Japan.

Custom answer with context: The closing ceremony was held at Stad