# Custom Chatbot Project

The dataset I chose is the text of the Wikipedia page for the year 2024 (https://en.wikipedia.org/wiki/2024). This dataset is appropriate because it covers events that happened after the training cutoff of the model used below (`gpt-3.5-turbo-instruct`).

## Data Wrangling

In [1]:
# Import all the required libraries
import requests
import pandas as pd
import tiktoken
import openai
from dateutil.parser import parse
from openai.embeddings_utils import get_embedding, distances_from_embeddings

openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

In [2]:
# Get the Wikipedia page for "2024"
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2024",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()

In [12]:
# Load the data into a pandas DataFrame
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for i, row in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix. Write through
            # df.at because iterrows() may yield copies of the rows
            df.at[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []

# Request embeddings in batches of batch_size rows
for start in range(0, len(df), batch_size):
    response = openai.Embedding.create(
        input=df["text"].iloc[start:start + batch_size].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    embeddings.extend([data["embedding"] for data in response["data"]])

df["embeddings"] = embeddings
df

| | text | embeddings |
|---|---|---|
| 0 | – 2024 (MMXXIV) is the current year, and is a... | [0.0012268582358956337, -0.017947543412446976,... |
| 1 | – So far, this year has seen the continuation... | [-0.02837427519261837, -0.021935394033789635, ... |
| 2 | – Approximately 79 countries, representing ar... | [0.0007004368817433715, -0.019197337329387665,... |
| 3 | January 1 – Egypt, Ethiopia, Iran and the Unit... | [-0.006096014752984047, -0.02365712821483612, ... |
| 4 | January 1 – The Republic of Artsakh is formall... | [0.008020608685910702, 0.008110800758004189, -... |
| ... | ... | ... |
| 127 | December 24 – The 2025 Jubilee will begin on t... | [-0.012208213098347187, -0.023865872994065285,... |
| 128 | Autumn – 2024 Kazakh nuclear power referendum. | [-0.0012817641254514456, -0.0090246656909585, ... |
| 129 | September or October – 2024 Sri Lankan preside... | [0.0030008205212652683, -0.0033367571886628866... |
| 130 | October – 2024 Botswana general election. | [-0.020399896427989006, -0.023590216413140297,... |
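
The date-prefix step in the cell above can be checked in isolation on a few made-up lines. The sketch below is only an illustration under stated assumptions: `add_date_prefixes` and `is_bare_date` are hypothetical helper names, the sample lines are invented, and `datetime.strptime` with a "month day" format stands in for `dateutil.parser.parse` so the example needs only the standard library.

```python
from datetime import datetime


def is_bare_date(line):
    """Return True if the line looks like 'January 2' (assumed heading format)."""
    try:
        # Append a leap year so that "February 29" also parses
        datetime.strptime(line + " 2024", "%B %d %Y")
        return True
    except ValueError:
        return False


def add_date_prefixes(lines, sep=" – "):
    """Prefix each undated line with the most recent date-only line."""
    prefix = ""
    result = []
    for line in lines:
        if sep in line:
            # The line already starts with its own date
            result.append(line)
        elif is_bare_date(line):
            # A bare date acts as a heading for the lines that follow
            prefix = line
        else:
            result.append(prefix + sep + line)
    return result


# Invented sample lines, not taken from the Wikipedia page
sample = [
    "January 1 – First event.",
    "January 2",
    "Second event.",
    "Third event.",
]
dated = add_date_prefixes(sample)
for line in dated:
    print(line)
```

Unlike the notebook cell, this version builds a new list instead of editing rows in place, which avoids depending on how `iterrows()` hands back each row.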


## Custom Query Completion

In [17]:
def get_rows_sorted_by_relevance(question, df):
    """
    Take a question string and a dataframe containing rows of text and
    associated embeddings, and return a copy of that dataframe sorted
    from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
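
To make the ranking concrete, here is a minimal sketch of the same idea on two-dimensional toy vectors. It assumes cosine distance is taken as 1 minus cosine similarity; `cosine_distance` and `rank_by_distance` are hypothetical helpers written for this illustration and are not part of `openai.embeddings_utils`.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, taken here as 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, rows):
    """Given (text, embedding) pairs, return the texts nearest-first."""
    scored = [(cosine_distance(query, emb), text) for text, emb in rows]
    scored.sort(key=lambda pair: pair[0])
    return [text for _, text in scored]


# Toy embeddings chosen so the expected order is easy to verify by hand
toy_rows = [
    ("orthogonal", [0.0, 1.0]),
    ("same direction", [2.0, 0.0]),
    ("opposite", [-1.0, 0.0]),
]
ranking = rank_by_distance([1.0, 0.0], toy_rows)
print(ranking)
```

Vector length does not affect the result: `[2.0, 0.0]` is at distance 0 from the query `[1.0, 0.0]` because only the direction matters.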

In [29]:
def custom_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
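
The context-selection loop in `custom_prompt` is a greedy fill of a token budget. The sketch below isolates that logic under simplifying assumptions: `select_context` is a hypothetical helper, and a whitespace word count stands in for the `tiktoken` encoder so the example runs without extra packages.

```python
def select_context(texts, budget, base_tokens=0, count_tokens=None):
    """Greedily keep texts, in order, until the token budget is exceeded."""
    if count_tokens is None:
        # Stand-in tokenizer (an assumption): one token per whitespace-separated word
        count_tokens = lambda s: len(s.split())

    used = base_tokens  # tokens already spent on the template and question
    selected = []
    for text in texts:
        used += count_tokens(text)
        if used > budget:
            # Stop at the first text that does not fit, as the notebook does
            break
        selected.append(text)
    return selected


# Texts are assumed to arrive sorted from most to least relevant
ranked = ["a b c", "d e", "f g h i", "j"]
picked = select_context(ranked, budget=6, base_tokens=1)
print(picked)
```

Because the loop stops at the first row that does not fit, a shorter row further down the ranking (here `"j"`) is left out even though it would fit in the remaining budget. That keeps the context to a contiguous run of the most relevant rows.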

In [30]:
def openai_query(prompt):
    """
    Given a prompt, query the openai.Completion model.
    Return the first response choice from the model.
    """
    answer = openai.Completion.create(
        model="gpt-3.5-turbo-instruct",
        prompt=prompt,
        max_tokens=150
    )
    return answer["choices"][0]["text"].strip()

## Custom Performance Demonstration

### Question 1

In [39]:
question = "Who won the UEFA Euro?"
custom_question = custom_prompt(question, df, 1000)

In [40]:
print(openai_query(question))
print("\n###\n") # print delimiter
print(openai_query(custom_question))

The UEFA Euro 2020 was postponed to 2021 due to the COVID-19 pandemic. The tournament was won by the Italian national football team in 2021.

###

Spain won the UEFA Euro in 2024.


### Question 2

In [33]:
question = "Who is the president of Iran?"
custom_question = custom_prompt(question, df, 1000)

In [38]:
print(openai_query(question))
print("\n###\n") # print delimiter
print(openai_query(custom_question))

The president of Iran is Hassan Rouhani.

###

Masoud Pezeshkian.
