# Custom Chatbot Project

In this project, we use an LLM to answer questions about the major events of 2023. The selected model is OpenAI's [`gpt-3.5-turbo-instruct`](https://platform.openai.com/docs/models/gpt-3-5-turbo), whose training data only extends to September 2021, so it is not aware of events in 2023. To supply that context, we use the [2023 Wikipedia page](https://en.wikipedia.org/wiki/2023), which we access through the [Wikipedia API](https://www.mediawiki.org/wiki/API:Main_page). The page contains brief summaries of most major events of 2023, which makes it well suited to this purpose.

## Data Wrangling

In the cells below, we load our chosen dataset into a `pandas` dataframe with a column named `"text"` that holds one text sample per row. The corresponding embeddings are computed later and stored in a separate `"embeddings"` column.

In [1]:
import requests

# Get the Wikipedia page for "2023" since OpenAI's models stop in 2021
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "2023",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
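resp = requests.get(
    "https://en.wikipedia.org/w/api.php", params=params, timeout=30
)
# Fail early on an HTTP error instead of parsing an error page as JSON
resp.raise_for_status()
response_dict = resp.json()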

In [2]:
import pandas as pd
df = pd.DataFrame()
df["text"] = response_dict["query"]["pages"][0]["extract"].split("\n")
# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
print(df[:20])

                                                 text
0   2023 (MMXXIII) was a common year starting on S...
1   The year 2023 saw the decline in severity of t...
2   The Russian invasion of Ukraine and Myanmar ci...
3   A banking crisis resulted in the collapse of n...
4   In the realm of technology, 2023 saw the conti...
11  January 1 – Croatia adopts the euro and joins ...
12  January 5 – The funeral of Pope Benedict XVI i...
13                                          January 8
14  The 2023 Beninese parliamentary election is he...
15  Following the 2022 Brazilian general election ...
16  January 9 – Juliaca massacre: At least 18 peop...
17  January 10–17 – A cold snap in Afghanistan kil...
18  January 15 – Yeti Airlines Flight 691 crashes ...
19  January 16 – Tigray War: Amharan Special Force...
20  January 17 – Nguyễn Xuân Phúc resigns as Presi...
21  January 18 – A helicopter crash in Brovary nea...
22  January 20 – The Parliament of Trinidad and To...
23                          

In [3]:
from dateutil.parser import parse
# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for i, row in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix; write through
            # df.at because the row yielded by iterrows() may be a copy
            df.at[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")]
print(df[:20])

                                                 text
0    – 2023 (MMXXIII) was a common year starting o...
1   The year 2023 saw the decline in severity of t...
2    – The Russian invasion of Ukraine and Myanmar...
3    – A banking crisis resulted in the collapse o...
4    – In the realm of technology, 2023 saw the co...
11  January 1 – Croatia adopts the euro and joins ...
12  January 5 – The funeral of Pope Benedict XVI i...
14  January 8 – The 2023 Beninese parliamentary el...
15  January 8 – Following the 2022 Brazilian gener...
16  January 9 – Juliaca massacre: At least 18 peop...
17  January 10–17 – A cold snap in Afghanistan kil...
18  January 15 – Yeti Airlines Flight 691 crashes ...
19  January 16 – Tigray War: Amharan Special Force...
20  January 17 – Nguyễn Xuân Phúc resigns as Presi...
21  January 18 – A helicopter crash in Brovary nea...
22  January 20 – The Parliament of Trinidad and To...
24  January 21 – Burkina Faso requests French forc...
25  January 21 – Tigray War:
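
The date-prefix step above can be sketched on a toy list without `pandas`. This is a minimal illustration, not the notebook's code: it swaps `dateutil.parser.parse` for the standard library's `datetime.strptime`, and the names `is_date_heading` and `attach_date_prefixes` as well as the sample lines are made up for the example.

```python
from datetime import datetime


def is_date_heading(text):
    """Return True if text looks like a bare 'Month day' heading, e.g. 'January 8'."""
    try:
        # Append a year so that the parse is unambiguous (assumption: all dates are in 2023)
        datetime.strptime(text + " 2023", "%B %d %Y")
        return True
    except ValueError:
        return False


def attach_date_prefixes(lines, sep=" – "):
    """Drop bare date headings and prepend the latest one to undated lines."""
    prefix = ""
    result = []
    for line in lines:
        if sep in line:
            # Already dated: keep unchanged
            result.append(line)
        elif is_date_heading(line):
            # Remember the heading for the lines that follow
            prefix = line
        else:
            result.append(prefix + sep + line)
    return result


# Hypothetical sample lines, shaped like the Wikipedia extract
sample = [
    "January 5 – First dated event.",
    "January 8",
    "Second event, listed under a heading.",
    "Third event, listed under the same heading.",
]
cleaned = attach_date_prefixes(sample)
for line in cleaned:
    print(line)
```

Lines that appear before any date heading receive an empty prefix, which matches the ` – ` at the start of the introductory rows in the output above.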

## Custom Query Completion

In the cells below, we compose a custom query using our chosen dataset and retrieve results from an OpenAI `Completion` model.

In [4]:
import openai
from config import OpenAI_key
openai.api_key = OpenAI_key
EMBEDDING_MODEL_NAME = "text-embedding-3-small"  # Increased performance over the 2nd-generation ada embedding model
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
print(df[:20])

                                                 text  \
0    – 2023 (MMXXIII) was a common year starting o...   
1   The year 2023 saw the decline in severity of t...   
2    – The Russian invasion of Ukraine and Myanmar...   
3    – A banking crisis resulted in the collapse o...   
4    – In the realm of technology, 2023 saw the co...   
11  January 1 – Croatia adopts the euro and joins ...   
12  January 5 – The funeral of Pope Benedict XVI i...   
14  January 8 – The 2023 Beninese parliamentary el...   
15  January 8 – Following the 2022 Brazilian gener...   
16  January 9 – Juliaca massacre: At least 18 peop...   
17  January 10–17 – A cold snap in Afghanistan kil...   
18  January 15 – Yeti Airlines Flight 691 crashes ...   
19  January 16 – Tigray War: Amharan Special Force...   
20  January 17 – Nguyễn Xuân Phúc resigns as Presi...   
21  January 18 – A helicopter crash in Brovary nea...   
22  January 20 – The Parliament of Trinidad and To...   
24  January 21 – Burkina Faso r

In [11]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
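
To make the ranking step concrete, here is a minimal pure-Python sketch of sorting by cosine distance, using the conventional definition of one minus the cosine similarity. The helper names `cosine_distance` and `rank_by_distance` and the two-dimensional toy vectors are assumptions for illustration; real embeddings have far more dimensions and come from the embedding model.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors: 1 - cos(angle)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, rows):
    """Return (distance, text) pairs sorted from most to least relevant.

    `rows` is a list of (text, vector) pairs.
    """
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in rows]
    # Smaller distance means closer in direction, so sort ascending
    return sorted(scored, key=lambda pair: pair[0])


# Toy vectors standing in for embeddings
rows = [
    ("opposite direction", [-1.0, 0.0]),
    ("same direction", [2.0, 0.0]),
    ("orthogonal", [0.0, 3.0]),
]
ranked = rank_by_distance([1.0, 0.0], rows)
for distance, text in ranked:
    print(f"{distance:.1f}  {text}")
```

Because cosine distance depends only on direction, the vector `[2.0, 0.0]` is at distance 0 from the query `[1.0, 0.0]` despite its different length.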

In [13]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
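
The token-budget loop in `create_prompt` can be illustrated in isolation. The sketch below is a simplified stand-in: `pack_context` is a hypothetical helper, and `count_words` replaces the `tiktoken` tokenizer with a whitespace word count so that the example runs without extra dependencies.

```python
def pack_context(texts, count_tokens, max_tokens, reserved_tokens=0):
    """Take texts in ranked order until the token budget would be exceeded.

    `reserved_tokens` accounts for the prompt template and the question.
    """
    used = reserved_tokens
    selected = []
    for text in texts:
        used += count_tokens(text)
        if used > max_tokens:
            # Stop at the first text that does not fit, as create_prompt does
            break
        selected.append(text)
    return selected


def count_words(text):
    """Toy token counter (assumption): one token per whitespace-separated word."""
    return len(text.split())


# Texts are assumed to be sorted from most to least relevant
ranked_texts = ["one two three", "four five", "six seven eight nine", "ten"]
context = pack_context(ranked_texts, count_words, max_tokens=8, reserved_tokens=2)
print(context)
```

The loop stops at the first row that overflows the budget, so a shorter row further down the ranking (here `"ten"`) is not used even though it would fit. That keeps the context limited to the most relevant rows.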
    

In [14]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
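
The `openai.Completion` interface used above belongs to pre-1.0 releases of the `openai` package. As a hedged sketch, the same request under the 1.x client interface could look like the function below. It assumes `client` is an `openai.OpenAI()` instance; the function name `complete` is hypothetical, and a stub client stands in for the real one so that the example runs without an API key.

```python
from types import SimpleNamespace


def complete(client, prompt, model="gpt-3.5-turbo-instruct", max_tokens=150):
    """Send a prompt to a completion model and return the stripped answer.

    `client` is assumed to expose `completions.create(...)` as in openai>=1.0.
    If the request fails, print the error and return an empty string.
    """
    try:
        response = client.completions.create(
            model=model, prompt=prompt, max_tokens=max_tokens
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""


def _fake_create(model, prompt, max_tokens):
    """Stub that mimics the shape of a completion response (not a real API call)."""
    return SimpleNamespace(choices=[SimpleNamespace(text="  stub answer  ")])


# Stub client for demonstration only; in practice: client = openai.OpenAI()
stub_client = SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
answer = complete(stub_client, "Question: ...\nAnswer:")
print(answer)
```

Passing the client as an argument keeps the function easy to exercise with a stub, as done here.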
        

## Custom Performance Demonstration

In the cells below, we demonstrate the performance of our custom query with two questions. For each question, we show the answer from a basic `Completion` model query alongside the answer from our custom query. In both cases the basic query fails to answer, while the custom query, which supplies relevant context through retrieval-augmented generation (RAG), returns an answer drawn from the Wikipedia data.

### Question 1

In [17]:
Q1_prompt = """
Question: "How many people were killed in 2023 Hawaii wildfires?"
Answer:
"""
initial_Q1_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=Q1_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_Q1_answer)

I'm sorry, I cannot answer this question as it is currently the year 2021 and it has not yet reached 2023. Additionally, I do not have access to information about potential future events.


In [18]:
custom_Q1_answer = answer_question("How many people were killed in 2023 Hawaii wildfires?", df)
print(custom_Q1_answer)

At least 101 people were killed in the 2023 Hawaii wildfires.


### Question 2

In [19]:
Q2_prompt = """
Question: "When did Tharman become president of Singapore?"
Answer:
"""
initial_Q2_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=Q2_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_Q2_answer)

Tharman Shanmugaratnam has never been the president of Singapore. He has held various ministerial positions, including Deputy Prime Minister and Minister for Finance, but not the presidency. The current president of Singapore is Halimah Yacob, who has been in office since September 2017.


In [20]:
custom_Q2_answer = answer_question("When did Tharman become president of Singapore?", df)
print(custom_Q2_answer)

September 1, 2023


In [22]:
print(f"""
Q1: How many people were killed in 2023 Hawaii wildfires?

Original Answer: {initial_Q1_answer}
Custom Answer:   {custom_Q1_answer}

Q2: When did Tharman become president of Singapore?
Original Answer: {initial_Q2_answer}
Custom Answer:   {custom_Q2_answer}
""")


Q1: How many people were killed in 2023 Hawaii wildfires?

Original Answer: I'm sorry, I cannot answer this question as it is currently the year 2021 and it has not yet reached 2023. Additionally, I do not have access to information about potential future events.
Custom Answer:   At least 101 people were killed in the 2023 Hawaii wildfires.

Q2: When did Tharman become president of Singapore?
Original Answer: Tharman Shanmugaratnam has never been the president of Singapore. He has held various ministerial positions, including Deputy Prime Minister and Minister for Finance, but not the presidency. The current president of Singapore is Halimah Yacob, who has been in office since September 2017.
Custom Answer:   September 1, 2023

