# Custom Chatbot Project

I have chosen the Wikipedia page for the year 2023 as the dataset. It covers events after the training cut-off of the OpenAI completion model, so the chatbot can answer questions about 2023, including the latter part of the year.

## Data Wrangling

In the cells below, I do all necessary imports. I then run an example query about 2023 (after the OpenAI cut-off date) to show that GPT-3.5 cannot answer the question without additional data (RAG).

I will use the 2023 page from Wikipedia as additional data, then clean the text and split it into one text sample per row.


In [60]:
import os

import numpy as np
import openai
import pandas as pd
import requests
from pathlib import Path
# Note: openai.embeddings_utils is only available in the legacy (pre-1.0) openai package
from openai.embeddings_utils import get_embedding, distances_from_embeddings

# OpenAI API key, read from the environment rather than hard-coded in the notebook
openai.api_key = os.environ["OPENAI_API_KEY"]

In [39]:
# Baseline answer without additional context, to show the knowledge cut-off

december_prompt = """
Question: "What happened in December 2023?"
Answer:
"""
initial_december_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=december_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_december_answer)

It is impossible to answer this question definitively as it is currently December 2020 and we cannot predict events in the future. December 2023 could potentially see a variety of events depending on global, cultural, and personal circumstances. Some possible events that may occur in December 2023 include holidays, natural disasters, political events, technological advancements, sporting events, scientific discoveries, personal or professional milestones, among others.


In [43]:
from dateutil.parser import parse

# Get the Wikipedia page for "2023", which postdates the model's training data
resp = requests.get("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2023&explaintext=1&formatversion=2&format=json")

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix; write through
            # df.at because the row yielded by iterrows() may be a copy
            df.at[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")]
df.to_csv("embeddings.csv")

In [42]:
# Check dataframe for correctness
print(df.head(10))

                                                 text
0    – 2023 (MMXXIII) was a common year starting o...
1   The year 2023 saw the decline in severity of t...
2    – The Russian invasion of Ukraine and Myanmar...
3    – A banking crisis resulted in the collapse o...
4    – In the realm of technology, 2023 saw the co...
11  January 1 – Croatia adopts the euro and joins ...
12  January 5 – The funeral of Pope Benedict XVI i...
14  January 8 – The 2023 Beninese parliamentary el...
15  January 8 – Following the 2022 Brazilian gener...
16  January 9 – Juliaca massacre: At least 18 peop...


In [44]:
# Generate embeddings
EMBEDDING_MODEL_NAME = "text-embedding-3-small"
batch_size = 100  # number of rows sent per API request
embeddings = []  # list that collects one embedding per row
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

| | text | embeddings |
|---|---|---|
| 0 | – 2023 (MMXXIII) was a common year starting o... | [0.011595670133829117, -0.03160304203629494, 0... |
| 1 | The year 2023 saw the decline in severity of t... | [0.01554702315479517, 0.038631998002529144, 0.... |
| 2 | – The Russian invasion of Ukraine and Myanmar... | [-0.06633021682500839, 0.005727418232709169, 0... |
| 3 | – A banking crisis resulted in the collapse o... | [-0.013344909995794296, -0.03087935782968998, ... |
| 4 | – In the realm of technology, 2023 saw the co... | [0.026942692697048187, 0.0007336187991313636, ... |
| ... | ... | ... |
| 283 | Economics – Claudia Goldin, for her empirical ... | [0.012369903735816479, 0.0376892015337944, 0.0... |
| 284 | Literature – Jon Fosse, for his innovative pla... | [-0.03743856027722359, 0.016004662960767746, 0... |
| 285 | Peace – Narges Mohammadi, for her works on the... | [0.03361355513334274, -0.00215215515345335, 0.... |
| 286 | Physics – Pierre Agostini, Ferenc Krausz & Ann... | [-0.03730954974889755, 0.007431724574416876, -... |


In [45]:
# save to CSV
df.to_csv("embeddings.csv")
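
If the embeddings are loaded back from `embeddings.csv` in a later session, pandas reads each embedding as a string rather than a list of floats. Below is a minimal sketch of restoring them, assuming the file was written by `df.to_csv` as above. The helper name `load_embeddings_csv` is hypothetical, and the demo uses an in-memory buffer with made-up two-dimensional vectors instead of real embeddings.

```python
import ast
import io

import pandas as pd


def load_embeddings_csv(path_or_buffer):
    """Read a CSV written by df.to_csv and parse the 'embeddings' strings into lists."""
    loaded = pd.read_csv(path_or_buffer, index_col=0)
    # Each cell looks like "[0.1, 0.2, ...]"; literal_eval parses it safely
    loaded["embeddings"] = loaded["embeddings"].apply(ast.literal_eval)
    return loaded


# Round-trip demo with toy data (not real embeddings)
toy = pd.DataFrame({
    "text": ["January 1 – example event", "January 2 – another event"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
})
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

restored = load_embeddings_csv(buffer)
print(type(restored.loc[0, "embeddings"]))
```

With the real file, the call would be `load_embeddings_csv("embeddings.csv")`.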

## Custom Query Completion


In [46]:
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
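
For intuition about the `distances` column: cosine distance is one minus the cosine similarity of two vectors, so smaller values mean the two texts point in a more similar direction. The following is a plain-Python sketch with made-up three-dimensional vectors (real embeddings have many more dimensions). `cosine_distance` is an illustrative helper, not the library's implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "question" vector and toy "row" vectors (assumed values for illustration)
question_vec = [1.0, 0.0, 0.0]
rows = {
    "same direction": [2.0, 0.0, 0.0],
    "unrelated": [0.0, 1.0, 0.0],
    "partly related": [1.0, 1.0, 0.0],
}

# Sort ascending by distance, so the most relevant row comes first
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
print(ranked)
```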


In [48]:
get_rows_sorted_by_relevance("What happened at the COP 2023?", df)

| | text | embeddings | distances |
|---|---|---|---|
| 261 | December 12 – At the COP28 climate summit in D... | [0.008021525107324123, -0.00411166250705719, 0... | 0.558898 |
| 176 | July 23 – The 2023 Cambodian general election ... | [0.023684794083237648, 0.008149438537657261, 0... | 0.611542 |
| 204 | September 9 – At the 18th G20 summit in New De... | [-0.02522454410791397, -0.02098342776298523, 0... | 0.652751 |
| 244 | November 14–November 17 – President Biden host... | [0.004024468827992678, -0.036472756415605545, ... | 0.653991 |
| 195 | August 30 – Following the announcement of incu... | [0.035682860761880875, -0.023306088522076607, ... | 0.653996 |
| ... | ... | ... | ... |
| 69 | March 10 – Silicon Valley Bank, the 16th large... | [-0.0022857997100800276, 0.010135394521057606,... | 0.932322 |
| 144 | June 14 – Scientists report the creation of th... | [0.06816432625055313, 0.05519308149814606, 0.0... | 0.935130 |
| 284 | Literature – Jon Fosse, for his innovative pla... | [-0.03743856027722359, 0.016004662960767746, 0... | 0.950807 |
| 259 | December 6 – Google DeepMind releases the Gemi... | [-0.0009106243378482759, 0.03048708103597164, ... | 0.963028 |


In [50]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
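
The context-building loop in `create_prompt` is a greedy budget fill: rows are taken in relevance order until the next one would push the prompt past `max_token_count`. Here is a simplified sketch of just that logic, assuming a whitespace word count as a stand-in for the `tiktoken` tokenizer (real token counts will differ). `count_tokens` and `fill_context` are hypothetical names.

```python
def count_tokens(text):
    # Stand-in tokenizer: one token per whitespace-separated word
    return len(text.split())


def fill_context(ranked_texts, base_token_count, max_token_count):
    """Collect texts in order until adding the next one would exceed the budget."""
    context = []
    current = base_token_count
    for text in ranked_texts:
        current += count_tokens(text)
        if current > max_token_count:
            break
        context.append(text)
    return context


# Toy rows, already sorted from most to least relevant
ranked_texts = ["one two three", "four five", "six seven eight nine", "ten"]
selected = fill_context(ranked_texts, base_token_count=2, max_token_count=8)
print(selected)
```

Because the loop stops at the first row that does not fit, a shorter row further down the ranking (here `"ten"`) is not considered.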
    

In [51]:
print(create_prompt("What happened at COP28?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

December 12 – At the COP28 climate summit in Dubai, a consensus is reached for countries to "transition away" from fossil fuels, the first such agreement in the conference's 30-year history. The transition is specifically for energy systems, excluding plastics, transport or agriculture.

###

March 20 – The Intergovernmental Panel on Climate Change (IPCC) releases the synthesis report of its Sixth Assessment Report on climate change.

###

November 14–November 17 – President Biden hosts the APEC summit in San Francisco which Chinese president Xi Jinping attends. Both countries at the conclusion of the summit agree to re-open suspended channels of military communications and to cooperate in their fight against climate change.

---

Question: What happened at COP28?
Answer:


In [52]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
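
The calls in this notebook use the legacy (pre-1.0) `openai` Python SDK. As a hedged sketch, the same completion request might look as follows with the 1.x client interface. The function takes the client as a parameter so it can be exercised without network access; `answer_with_client` and the stub client are hypothetical names used only for illustration. With the real SDK, the client would be created with `openai.OpenAI()`.

```python
from types import SimpleNamespace


def answer_with_client(client, prompt, model="gpt-3.5-turbo-instruct", max_answer_tokens=150):
    """Send a prompt through a 1.x-style client; return an empty string on any error."""
    try:
        response = client.completions.create(
            model=model,
            prompt=prompt,
            max_tokens=max_answer_tokens,
        )
        return response.choices[0].text.strip()
    except Exception as e:
        print(e)
        return ""


# Stub that mimics the shape of the client and its response (offline demo only)
def _fake_create(model, prompt, max_tokens):
    return SimpleNamespace(choices=[SimpleNamespace(text="  Dubai \n")])


stub_client = SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
print(answer_with_client(stub_client, "Question: Where was COP28 held?\nAnswer:"))
```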

## Custom Performance Demonstration

The cells below demonstrate the performance of the custom query using two questions. For each question, the answer from the custom query is shown first, followed by the answer from a basic `Completion` model query.

### Question 1

In [53]:
custom_2023_answer_1 = answer_question("Tell me an example of what happened in December 2023?", df)
print(custom_2023_answer_1)

The 2023 Israeli judicial reform protests erupted across Israel in December.


In [54]:
december_prompt_1 = """
Question: "Tell me an example of what happened in December 2023?"
Answer:
"""
initial_december_answer_1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=december_prompt_1,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_december_answer_1)

I am an AI language model and do not have access to real-time information or the ability to predict the future, so I am unable to give a specific example of an event that may happen in December 2023.


### Question 2

In [59]:
custom_2023_answer_2 = answer_question("Where was COP28 held this year?", df)
print(custom_2023_answer_2)

Dubai


In [57]:
december_prompt_2 = """
Question: "Where was COP28 held this year?"
Answer:
"""
initial_december_answer_2 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=december_prompt_2,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_december_answer_2)

COP28 (Conference of Parties 28), also known as the United Nations Climate Change Conference, was originally scheduled to be held in Glasgow, Scotland in November 2020. However, due to the COVID-19 pandemic, it had to be postponed and eventually took place virtually from 1-12 November 2021.
