# Custom Chatbot Project

For this custom chatbot project, I will use the Wikipedia page about the 2024 Summer Olympics as the dataset and answer questions about the Games.

We will use GPT-3.5 as the base model for answering questions. However, GPT-3.5's training data extends only to September 2021, so it has no knowledge of events or developments after that date, and nothing that emerged later will be reflected in its responses.

To enable GPT-3.5 to answer questions about recent events such as the 2024 Summer Olympics, we use a retrieval-augmented generation (RAG) system. It retrieves the relevant information from the dataset, puts both the question and that information into the GPT-3.5 prompt, and answers from the supplied information rather than relying only on the data GPT-3.5 was trained on.

The 2024 Summer Olympics took place after September 2021, so GPT-3.5 has no information about them. This makes the dataset a suitable source of input information for RAG.

We could ask the following questions:

- Which city hosted the 2024 Summer Olympics?
- When did the opening ceremony start?
- Which country won the most medals at the 2024 Summer Olympics?
- Where was the closing ceremony held?
- How many gold medals did Japan win?
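
The retrieve-then-prompt flow described above can be sketched as two prompt builders: one that sends only the question, and one that places retrieved rows ahead of it. This is a minimal sketch, not the final implementation. The function names, the prompt wording, and the separator strings are assumptions made for illustration, and the example row is a shortened version of a sentence from the page extract.

```python
from typing import List


def build_basic_prompt(question: str) -> str:
    """Prompt with no extra context: the model can only use its training data."""
    return f"Question: {question}\nAnswer:"


def build_custom_prompt(question: str, context_rows: List[str]) -> str:
    """Prompt that places retrieved rows ahead of the question (hypothetical wording)."""
    context = "\n\n###\n\n".join(context_rows)
    return (
        "Answer the question based on the context below, and if the question "
        "can't be answered based on the context, say \"I don't know\".\n\n"
        f"Context:\n\n{context}\n\n---\n\n"
        f"Question: {question}\nAnswer:"
    )


# Shortened row from the page extract, standing in for a retrieved result
rows = ["Paris was the host city, with events (mainly football) held in 16 additional cities."]
question = "Which city hosted the 2024 Summer Olympics?"

basic_prompt = build_basic_prompt(question)
custom_prompt = build_custom_prompt(question, rows)
print(custom_prompt)
```

Keeping the two builders separate makes it straightforward to send the same question through both paths when comparing answers later.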



## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [64]:
import requests

# API endpoint
base_url = "https://en.wikipedia.org/w/api.php"

# Parameters for the request
parameters = {
    "action": "query",
    "format": "json",
    "titles": "2024 Summer Olympics",
    "prop": "extracts",
    "explaintext": True  # Get plain text without HTML
}

# Make the request
response = requests.get(base_url, params=parameters)

# Parse the response JSON
data = response.json()

# Extract the page content
page = next(iter(data['query']['pages'].values()))
page_content = page.get('extract', 'No content found')

print(page_content)

The 2024 Summer Olympics (French: Jeux olympiques d'été de 2024), officially the Games of the XXXIII Olympiad (French: Jeux de la XXXIIIe olympiade de l'ère moderne) and branded as Paris 2024, were an international multi-sport event held from 26 July to 11 August 2024 in France, with several events started from 24 July. Paris was the host city, with events (mainly football) held in 16 additional cities spread across metropolitan France, including the sailing centre in the second-largest city of France, Marseille, on the Mediterranean Sea, as well as one subsite for surfing in Tahiti, French Polynesia.
Paris was awarded the Games at the 131st IOC Session in Lima, Peru, on 13 September 2017. After multiple withdrawals that left only Paris and Los Angeles in contention, the International Olympic Committee (IOC) approved a process to concurrently award the 2024 and 2028 Summer Olympics to the two remaining candidate cities; both bids were praised for their high technical plans and innovati

In [59]:
import pandas as pd

import re

In [65]:
df = pd.DataFrame()

# whole_text = response.json()["query"]["pages"]['8351239']['extract']


sentences = re.split(r'(?<=[.!?]) +', page_content)

print(sentences)

["The 2024 Summer Olympics (French: Jeux olympiques d'été de 2024), officially the Games of the XXXIII Olympiad (French: Jeux de la XXXIIIe olympiade de l'ère moderne) and branded as Paris 2024, were an international multi-sport event held from 26 July to 11 August 2024 in France, with several events started from 24 July.", 'Paris was the host city, with events (mainly football) held in 16 additional cities spread across metropolitan France, including the sailing centre in the second-largest city of France, Marseille, on the Mediterranean Sea, as well as one subsite for surfing in Tahiti, French Polynesia.\nParis was awarded the Games at the 131st IOC Session in Lima, Peru, on 13 September 2017.', 'After multiple withdrawals that left only Paris and Los Angeles in contention, the International Olympic Committee (IOC) approved a process to concurrently award the 2024 and 2028 Summer Olympics to the two remaining candidate cities; both bids were praised for their high technical plans and
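
The output above shows one row that still contains `\n`, because the pattern only splits on spaces that follow sentence-ending punctuation. The sketch below reproduces that behaviour on a short sample built from fragments of the page text and compares it with a variant that splits on any whitespace. The `\s+` variant is an assumption offered for comparison, not what the notebook ran, and it would change the row counts reported in the cells that follow.

```python
import re

# Short sample built from fragments of the page text; note the newline
sample = (
    "The Games were branded as Paris 2024. Paris was the host city.\n"
    "Paris was awarded the Games on 13 September 2017."
)

# Pattern used in the notebook: split on runs of spaces after ., ! or ?
by_spaces = re.split(r'(?<=[.!?]) +', sample)

# Hypothetical variant: split on any whitespace run, newlines included
by_whitespace = re.split(r'(?<=[.!?])\s+', sample)

print(len(by_spaces))      # 2 (the newline-joined sentences stay together)
print(len(by_whitespace))  # 3
```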

In [66]:
df["text"] = sentences

In [67]:
df.shape

(133, 1)

In [68]:
df

Unnamed: 0,text
0,The 2024 Summer Olympics (French: Jeux olympiq...
1,"Paris was the host city, with events (mainly f..."
2,After multiple withdrawals that left only Pari...
3,"Having previously hosted in 1900 and 1924, Par..."
4,Paris 2024 marked the centenary of Paris 1924 ...
...,...
128,"The artistic director of the ceremony, Thomas ..."
129,Among those who expressed appreciation for the...
130,"According to Georgian fact checking website, M..."
131,Olympics.com.


In [71]:
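# Drop very short fragments (such as the stray "Olympics.com." row).
# .copy() avoids a SettingWithCopyWarning when a column is added later.
df = df[df["text"].str.len() > 20].copy()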

In [72]:
df.shape

(131, 1)

In [74]:
df[df["text"].str.startswith("==")]

Unnamed: 0,text


In [76]:
df.head()

Unnamed: 0,text
0,The 2024 Summer Olympics (French: Jeux olympiq...
1,"Paris was the host city, with events (mainly f..."
2,After multiple withdrawals that left only Pari...
3,"Having previously hosted in 1900 and 1924, Par..."
4,Paris 2024 marked the centenary of Paris 1924 ...


In [77]:
df.tail()

Unnamed: 0,text
127,While there is nominally an Olympic Truce in p...
128,"The artistic director of the ceremony, Thomas ..."
129,Among those who expressed appreciation for the...
130,"According to Georgian fact checking website, M..."
132,International Olympic Committee.\nEuropean Oly...


In [78]:
df.reset_index(inplace=True, drop=True)

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [85]:
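import os

import openai

openai.api_base = "https://openai.vocareum.com/v1"

# Read the key from the environment rather than hard-coding it in the notebook.
# The variable name OPENAI_API_KEY is an assumption; use whatever name you export.
openai.api_key = os.environ["OPENAI_API_KEY"]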

In [88]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
response = openai.Embedding.create(
    input=df["text"].tolist(),
    model=EMBEDDING_MODEL_NAME
)

In [89]:
response.keys()

dict_keys(['object', 'data', 'model', 'usage'])

In [91]:
response['object']

'list'

In [94]:
response['data'][0]['embedding']

[-0.00870082899928093,
 -0.002138635842129588,
 -0.0045221415348351,
 -0.007759511470794678,
 -0.018190329894423485,
 0.010278172791004181,
 -0.015569904819130898,
 -0.00816020742058754,
 -0.008878916501998901,
 -0.006824553944170475,
 0.04291900247335434,
 0.012135366909205914,
 -0.012516982853412628,
 -0.022489862516522408,
 -0.01771967113018036,
 -0.015493582002818584,
 0.02278243564069271,
 -0.020098406821489334,
 0.012548783794045448,
 -0.019182531163096428,
 0.002051182324066758,
 -0.008777152746915817,
 -0.0050786640495061874,
 -0.0007381871691904962,
 0.0048242537304759026,
 -0.006627385970205069,
 0.007505101151764393,
 -0.022120967507362366,
 0.031139809638261795,
 -0.007740430999547243,
 -0.006264851428568363,
 0.00477973185479641,
 -0.009222370572388172,
 -0.005434838123619556,
 0.0022912819404155016,
 -0.021090606227517128,
 -0.006277571897953749,
 -0.009718470275402069,
 0.0014922749251127243,
 -0.0045221415348351,
 0.0002647456421982497,
 0.0187245924025774,
 0.009216009

In [96]:
len(response['data'][2]['embedding'])

1536

In [97]:
embeddings = [data['embedding'] for data in response["data"]]

In [98]:
df["embeddings"] = embeddings

In [99]:
df

Unnamed: 0,text,embeddings
0,The 2024 Summer Olympics (French: Jeux olympiq...,"[-0.00870082899928093, -0.002138635842129588, ..."
1,"Paris was the host city, with events (mainly f...","[0.009677358902990818, 0.007114940322935581, 0..."
2,After multiple withdrawals that left only Pari...,"[0.010585588403046131, -0.0005463769193738699,..."
3,"Having previously hosted in 1900 and 1924, Par...","[0.015574367716908455, -0.012037031352519989, ..."
4,Paris 2024 marked the centenary of Paris 1924 ...,"[0.005231217481195927, -0.0089668994769454, -0..."
...,...,...
126,While there is nominally an Olympic Truce in p...,"[-0.0018343077972531319, -0.014580567367374897..."
127,"The artistic director of the ceremony, Thomas ...","[-0.00798330083489418, -0.01916792429983616, -..."
128,Among those who expressed appreciation for the...,"[-0.013269093818962574, -0.008221069350838661,..."
129,"According to Georgian fact checking website, M...","[0.012646778486669064, 0.01000780425965786, -0..."
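
With one embedding per row, a retrieval step can rank rows by how close they are to a question embedding. The sketch below shows one way to do that with cosine distance in NumPy. The helper names and the `distances` column are assumptions, and the 3-dimensional vectors are toy stand-ins for the 1536-dimensional embeddings. In practice the question would presumably be embedded with the same `text-embedding-ada-002` model as the rows.

```python
import numpy as np
import pandas as pd


def cosine_distance(a, b) -> float:
    """Return 1 - cosine similarity; smaller means more similar."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_rows(question_embedding, frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` sorted by distance to the question embedding."""
    ranked = frame.copy()
    ranked["distances"] = [
        cosine_distance(question_embedding, row) for row in ranked["embeddings"]
    ]
    return ranked.sort_values("distances", ascending=True)


# Toy vectors standing in for real embeddings
toy = pd.DataFrame({
    "text": ["row about the host city", "row about the medal table", "row about security"],
    "embeddings": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]],
})
question_vec = [0.9, 0.1, 0.0]  # made-up question embedding

ranked = rank_rows(question_vec, toy)
print(ranked[["text", "distances"]])
```

The top rows of the ranked frame are the candidates to include as context in the prompt.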


In [100]:
df.to_csv('embeddings.csv')
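
One thing to keep in mind when reusing the saved file: CSV has no list type, so the `embeddings` column comes back as strings when read with `pd.read_csv`. The sketch below shows a possible round trip that parses the strings back into lists with `ast.literal_eval`. It uses toy vectors and an in-memory buffer in place of `embeddings.csv` so it has no side effects.

```python
import ast
import io

import pandas as pd

frame = pd.DataFrame({
    "text": ["first row", "second row"],
    "embeddings": [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],  # toy vectors
})

# Stand-in for df.to_csv('embeddings.csv')
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an "Unnamed: 0" column
loaded = pd.read_csv(buffer, index_col=0)
raw_type = type(loaded["embeddings"][0])  # each cell is a str at this point

# Parse each string such as "[0.1, -0.2, 0.3]" back into a list of floats
loaded["embeddings"] = loaded["embeddings"].apply(ast.literal_eval)
print(raw_type, type(loaded["embeddings"][0]))
```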

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

### Question 2