# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import requests

# Fetch the Wikipedia page for "2023", since the training data of OpenAI's
# models stops in 2021 and so does not cover these events
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2023",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
# Fail early on an HTTP error instead of parsing an error page as JSON
resp.raise_for_status()
response_dict = resp.json()

In [2]:
import pandas as pd
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")
# Clean up text to remove empty lines and headings
# (.copy() makes the filtered frame independent, so later edits are safe)
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))].copy()
print(df[:20])

                                                 text
0   2023 (MMXXIII) was a common year starting on S...
1   The year 2023 saw the decline in severity of t...
2   The Russian invasion of Ukraine and Myanmar ci...
3   A banking crisis resulted in the collapse of n...
4   In the realm of technology, 2023 saw the conti...
11  January 1 – Croatia adopts the euro and joins ...
12  January 5 – The funeral of Pope Benedict XVI i...
13                                          January 8
14  The 2023 Beninese parliamentary election is he...
15  Following the 2022 Brazilian general election ...
16  January 9 – Juliaca massacre: At least 18 peop...
17  January 10–17 – A cold snap in Afghanistan kil...
18  January 15 – Yeti Airlines Flight 691 crashes ...
19  January 16 – Tigray War: Amharan Special Force...
20  January 17 – Nguyễn Xuân Phúc resigns as Presi...
21  January 18 – A helicopter crash in Brovary nea...
22  January 20 – The Parliament of Trinidad and To...
23                          

In [3]:
from dateutil.parser import parse
# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
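for i, row in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix; write through
            # df.at because the row yielded by iterrows() may be a copy
            df.at[i, "text"] = prefix + " – " + row["text"]
# Keep only rows containing the date separator (drops bare date headings)
df = df[df["text"].str.contains(" – ")].copy()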
print(df[:20])

                                                 text
0    – 2023 (MMXXIII) was a common year starting o...
1   The year 2023 saw the decline in severity of t...
2    – The Russian invasion of Ukraine and Myanmar...
3    – A banking crisis resulted in the collapse o...
4    – In the realm of technology, 2023 saw the co...
11  January 1 – Croatia adopts the euro and joins ...
12  January 5 – The funeral of Pope Benedict XVI i...
14  January 8 – The 2023 Beninese parliamentary el...
15  January 8 – Following the 2022 Brazilian gener...
16  January 9 – Juliaca massacre: At least 18 peop...
17  January 10–17 – A cold snap in Afghanistan kil...
18  January 15 – Yeti Airlines Flight 691 crashes ...
19  January 16 – Tigray War: Amharan Special Force...
20  January 17 – Nguyễn Xuân Phúc resigns as Presi...
21  January 18 – A helicopter crash in Brovary nea...
22  January 20 – The Parliament of Trinidad and To...
24  January 21 – Burkina Faso requests French forc...
25  January 21 – Tigray War:

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [None]:
import openai
from config import OpenAI_key
openai.api_key = OpenAI_key
# Improved performance over the second-generation ada embedding model
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send a batch of text rows to the OpenAI embedding model
    # (legacy pre-1.0 SDK call; the model is selected with `model=`)
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        model=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df
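
Once every row has an embedding, one common next step is to rank the rows by how close they are to the embedding of a question. The sketch below shows that ranking with cosine distance in plain Python. It is an illustration under assumptions, not the course's own code: the function names `cosine_distance` and `rank_texts` are hypothetical, and the three-dimensional toy vectors stand in for real embeddings.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_texts(question_embedding, texts, embeddings):
    """Return (distance, text) pairs sorted from most to least relevant."""
    scored = [
        (cosine_distance(question_embedding, emb), text)
        for text, emb in zip(texts, embeddings)
    ]
    return sorted(scored, key=lambda pair: pair[0])


# Toy 3-dimensional vectors stand in for real embeddings (illustrative only)
toy_texts = ["row about banking", "row about elections", "row about weather"]
toy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
ranked = rank_texts([0.9, 0.1, 0.0], toy_texts, toy_embeddings)
for distance, text in ranked:
    print(f"{distance:.3f}  {text}")
```

With the dataframe above, the inputs would presumably be `df["text"].tolist()` and `df["embeddings"].tolist()`, together with a question embedding obtained from the same embedding model.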

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
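
To compare the two kinds of query, it can help to build both prompts side by side before sending them to the model. The following is a minimal sketch, assuming the most relevant rows have already been ranked: `build_prompt` is a hypothetical helper, the prompt template is an assumed wording, and the whitespace word count is only a crude stand-in for a real token count. No API call is made here.

```python
def build_prompt(question, ranked_texts, max_context_words=100):
    """Fill a prompt template with context rows until a word budget is hit."""
    header = (
        "Answer the question based on the context below, and if the question "
        "can't be answered based on the context, say \"I don't know\"\n\n"
        "Context:\n\n"
    )
    footer = "\n\n---\n\nQuestion: " + question + "\nAnswer:"
    context = []
    used = 0
    for text in ranked_texts:
        # Whitespace word count approximates the cost of adding this row
        cost = len(text.split())
        if used + cost > max_context_words:
            break
        context.append(text)
        used += cost
    return header + "\n\n###\n\n".join(context) + footer


# Placeholder rows and question, for illustration only
toy_rows = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
toy_question = "Which row mentions delta?"

# Basic query: the question alone, with no added context
basic_prompt = "Question: " + toy_question + "\nAnswer:"
# Custom query: the same question preceded by the rows that fit the budget
custom_prompt = build_prompt(toy_question, toy_rows, max_context_words=5)

print(basic_prompt)
print(custom_prompt)
```

Each question below can then be shown twice: once answered from `basic_prompt` and once from `custom_prompt`, so the effect of the added context is visible.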

### Question 1

### Question 2