# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

In [None]:
# Ajay Joseph, Sept 13th, 2024
# The dataset is the text of the 2024 Wikipedia page. It is relevant
# because the questions are about events that happened after the 2021
# training cutoff of the OpenAI model.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

In [2]:
from dateutil.parser import parse
import pandas as pd
import requests

# Get the Wikipedia page for "2024" since the OpenAI model's training data stops in 2021
resp = requests.get("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2024&explaintext=1&formatversion=2&format=json")

# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for i, row in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix; write through
            # df.at because iterrows() may yield copies of the rows
            df.at[i, "text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")].reset_index(drop=True)
df

Unnamed: 0,text
0,"– 2024 (MMXXIV) is the current year, and is a..."
1,"– So far, this year has seen the continuation..."
2,"– Approximately 79 countries, representing ar..."
3,"January 1 – Egypt, Ethiopia, Iran and the Unit..."
4,January 1 – The Republic of Artsakh is formall...
...,...
147,November 27 – 2024 Namibian general election.
148,December 1 – 2024 Romanian parliamentary elect...
149,December 7 – 2024 Ghanaian general election.
150,December 24 – The 2025 Jubilee will begin on t...
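
The date-prefix step above can be tried on a few hand-written lines without calling Wikipedia. The following is only a minimal sketch: `is_date_heading` and `prefix_with_dates` are hypothetical helper names, the sample lines are invented, and `datetime.strptime` with a fixed year stands in for `dateutil.parser.parse`, so it recognizes only headings of the form "January 2".

```python
from datetime import datetime

def is_date_heading(line):
    """Return True if the line is a bare date such as 'January 2'."""
    try:
        # Append a fixed year so strptime receives a complete date
        datetime.strptime(line + " 2024", "%B %d %Y")
        return True
    except ValueError:
        return False

def prefix_with_dates(lines):
    """Attach the most recent date heading to each undated line."""
    prefix = ""
    result = []
    for line in lines:
        if " – " in line:
            # Already dated: keep as-is
            result.append(line)
        elif is_date_heading(line):
            # Remember the heading and drop it as a row of its own
            prefix = line
        else:
            result.append(prefix + " – " + line)
    return result

# Invented sample lines, not taken from the real page
sample = [
    "Introductory sentence.",
    "January 1 – First event.",
    "January 2",
    "Second event.",
    "Third event.",
]
dated = prefix_with_dates(sample)
for line in dated:
    print(line)
```

Lines that appear before any date heading get an empty prefix, which is why the first rows of the dataframe above start with a dash.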


In [3]:
len(df)

152

In [4]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,"– 2024 (MMXXIV) is the current year, and is a...","[0.001271102693863213, -0.017934562638401985, ..."
1,"– So far, this year has seen the continuation...","[-0.02561047114431858, -0.021191144362092018, ..."
2,"– Approximately 79 countries, representing ar...","[0.0007636345108039677, -0.019218267872929573,..."
3,"January 1 – Egypt, Ethiopia, Iran and the Unit...","[-0.006099235266447067, -0.02372102066874504, ..."
4,January 1 – The Republic of Artsakh is formall...,"[0.008024189621210098, 0.008043508976697922, -..."
...,...,...
147,November 27 – 2024 Namibian general election.,"[-0.0200584027916193, -0.031681861728429794, 0..."
148,December 1 – 2024 Romanian parliamentary elect...,"[-0.016480034217238426, -0.015771353617310524,..."
149,December 7 – 2024 Ghanaian general election.,"[-0.010910444892942905, -0.019313324242830276,..."
150,December 24 – The 2025 Jubilee will begin on t...,"[-0.012184198945760727, -0.023869255557656288,..."
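
To see how the batching loop in cell 4 divides the rows, the slice boundaries can be listed on their own. This is a small sketch; `batch_ranges` is a hypothetical helper, and 152 is the row count reported by `len(df)` above.

```python
def batch_ranges(n_rows, batch_size):
    """Return the (start, stop) row ranges covered by each batch."""
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]

# 152 rows with batch_size=100 gives one full batch and one partial batch
ranges = batch_ranges(152, 100)
print(ranges)
```

The notebook's `df.iloc[i:i+batch_size]` needs no explicit `min` because slicing past the end of a dataframe is clipped automatically.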


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [5]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
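
The ranking idea can be illustrated without any API calls. The sketch below assumes the usual definition of cosine distance as one minus cosine similarity; `cosine_distance` and `rank_by_distance` are hypothetical helpers, and the two-dimensional vectors are toy values rather than real embeddings.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank_by_distance(query, rows):
    """Sort (text, embedding) pairs so the closest text comes first."""
    scored = [(cosine_distance(query, emb), text) for text, emb in rows]
    return [text for _, text in sorted(scored)]

# Toy vectors: distances to the query are 1.0, 0.0 and 2.0 respectively
toy_rows = [
    ("orthogonal", [0.0, 1.0]),
    ("same direction", [2.0, 0.0]),
    ("opposite", [-1.0, 0.0]),
]
ranked = rank_by_distance([1.0, 0.0], toy_rows)
print(ranked)
```

Cosine distance ignores vector length, so "same direction" ranks first even though its vector is twice as long as the query.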

In [6]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
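
The token-budget loop in `create_prompt` can be exercised in isolation. This is a sketch under stated assumptions: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words instead of calling `tiktoken`, `select_context` is a hypothetical helper, and `fixed_cost` plays the role of the tokens already used by the template and the question.

```python
def count_tokens(text):
    # Hypothetical stand-in for len(tokenizer.encode(text)): counts words
    return len(text.split())

def select_context(ranked_texts, fixed_cost, max_token_count):
    """Take texts in ranked order until the token budget would be exceeded."""
    used = fixed_cost
    context = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_token_count:
            # Stop at the first text that does not fit
            break
        context.append(text)
    return context

# Invented inputs: costs are 3, 2 and 4 "tokens" on top of a fixed cost of 2
ranked_texts = ["one two three", "four five", "six seven eight nine"]
chosen = select_context(ranked_texts, fixed_cost=2, max_token_count=8)
print(chosen)
```

As in the notebook's function, the separator placed between context rows is not counted against the budget, so the final prompt can be slightly longer than `max_token_count`.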

In [7]:
#create_prompt("Who is the prime minister of UK", df, 150)

In [8]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [9]:
#Basic Query #1
france_prompt = """
Question: "Which party has the most seats in France ?"
Answer:
"""
initial_france_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=france_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_france_answer)

As of 2021, the party with the most seats in France is La République En Marche (LREM) with 288 out of 577 seats in the National Assembly. LREM is followed by the conservative party Les Républicains (LR) with 112 seats.


In [10]:
#Custom Query #1
custom_france_answer = answer_question("Which party has the most seats in France ?", df)
print(custom_france_answer)

The left-wing New Popular Front.


### Question 2

In [11]:
#Basic Query #2
uk_prompt = """
Question: "Who is the prime minister of UK?"
Answer:
"""
initial_uk_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=uk_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_uk_answer)

as of 2021, the prime minister of UK is Boris Johnson.


In [12]:
#Custom Query #2
custom_uk_answer = answer_question("Who is the prime minister of UK?", df)
print(custom_uk_answer)

Sir Keir Starmer
