# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

##### The Wikipedia page for the year 2025 was chosen because ChatGPT's training data ends in 2021, so the model cannot answer questions about later events. Supplying the 2025 page as context lets the chatbot answer most questions about current events.

In [1]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "voc-1591411661266774041819680bf6ddc78900.17113500"

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [2]:
from dateutil.parser import parse
import pandas as pd
import requests


resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "extracts",
        "exlimit": 1,
        "titles": "2025",
        "explaintext": 1,
        "formatversion": 2,
        "format": "json"
    }
)

data = resp.json()
print(data["query"]["pages"][0]["extract"])

2025 (MMXXV) is the current year, and is a common year starting on Wednesday of the Gregorian calendar, the 2025th year of the Common Era (CE) and Anno Domini (AD) designations, the 25th year of the 3rd millennium and the 21st century, and the 6th year of the 2020s decade.
So far, the year has seen the continuation of major armed conflicts, including the Russian invasion of Ukraine, the Sudanese civil war, and the Gaza war. Internal crises in Bangladesh, Ecuador, Georgia, Germany, Haiti, Somalia, and South Korea continued into this year, with the latter leading to Yoon Suk Yeol's arrest and removal from office.


== Events ==


=== January ===

January 1
Poland takes over the Presidency of the Council of the European Union, after the Hungarian presidency.
Bulgaria and Romania complete the process of joining the Schengen Area, lifting land border controls.
Liechtenstein becomes the 37th country to legalize same-sex marriage.
Ukraine halts the transportation of many Russian gas sup
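
The cell above indexes straight into `data["query"]["pages"][0]["extract"]`, which raises an error if the page is missing or the response has a different shape. A minimal sketch of a more defensive lookup is shown below. The helper name `extract_page_text` and the sample payloads are hypothetical, and the sketch assumes the `formatversion=2` layout used in the request, where `"pages"` is a list.

```python
def extract_page_text(payload):
    """Return the plain-text extract of the first page, or "" if absent."""
    pages = payload.get("query", {}).get("pages", [])
    if not pages:
        return ""
    # Assumption: with formatversion=2, "pages" is a list and a page
    # that does not exist simply has no "extract" key
    return pages[0].get("extract", "")


# Hypothetical payloads that mimic the shape of the response used above
found = {"query": {"pages": [{"title": "2025", "extract": "2025 (MMXXV) is ..."}]}}
missing = {"query": {"pages": [{"title": "No such page", "missing": True}]}}

print(extract_page_text(found))
print(repr(extract_page_text(missing)))
```

With the real response, `extract_page_text(resp.json())` could replace the chained indexing, and an empty result would signal that there is nothing to load into the dataframe.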

In [3]:
# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = data["query"]["pages"][0]["extract"].split("\n")

# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]

# In some cases dates are used as headings instead of being part of the
# text sample; adjust so dated text samples start with dates
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " – ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except (ValueError, OverflowError):
            # If the row's text isn't a date, add the prefix
            row["text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")]
df

Unnamed: 0,text
0,"– 2025 (MMXXV) is the current year, and is a ..."
1,"– So far, the year has seen the continuation ..."
10,January 1 – Poland takes over the Presidency o...
11,January 1 – Bulgaria and Romania complete the ...
12,January 1 – Liechtenstein becomes the 37th cou...
...,...
187,December 21–January 18 – The 2025 Africa Cup o...
191,May 25 – An ecumenical meeting of the Eastern ...
192,May 25 – Norway aims to ban the sale of all ne...
202,May 25 – Laureates for Nobel Prizes in the fie...
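
The date-prefix step can be easier to follow on a plain list of strings. The sketch below is an illustration under stated assumptions, not the notebook's code: it swaps `dateutil.parser.parse` for `datetime.strptime` and only recognizes headings of the form "January 1". The helper names and the sample lines are hypothetical.

```python
from datetime import datetime


def is_date_heading(line):
    """Return True if the line is only a month and day, e.g. "January 1"."""
    try:
        # A leap year is appended so that "February 29" also parses
        datetime.strptime(line.strip() + " 2024", "%B %d %Y")
        return True
    except ValueError:
        return False


def prefix_with_dates(lines, sep=" – "):
    """Attach the most recent date heading to each undated line."""
    prefix = ""
    result = []
    for line in lines:
        if sep in line:
            result.append(line)        # already carries its own date
        elif is_date_heading(line):
            prefix = line              # remember the heading, emit nothing
        else:
            result.append(prefix + sep + line)
    return result


# Illustrative lines; the last entry is made up to show the "already dated" case
sample_lines = [
    "January 1",
    "Poland takes over the Presidency of the Council of the European Union.",
    "Liechtenstein becomes the 37th country to legalize same-sex marriage.",
    "January 4 – An example entry that already has a date.",
]
for line in prefix_with_dates(sample_lines):
    print(line)
```

As in the notebook cell, the prefix is only ever overwritten, never cleared, so undated lines that follow the last date heading inherit that date. Rows 191 to 202 in the output above, all prefixed with "May 25", may be an instance of this.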


## Generate Embeddings

In [4]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
df

Unnamed: 0,text,embeddings
0,"– 2025 (MMXXV) is the current year, and is a ...","[-0.015760470181703568, -0.027240939438343048,..."
1,"– So far, the year has seen the continuation ...","[-0.005608963314443827, -0.014697499573230743,..."
10,January 1 – Poland takes over the Presidency o...,"[0.016945933923125267, -0.01613958552479744, -..."
11,January 1 – Bulgaria and Romania complete the ...,"[0.014901851303875446, -0.013389883562922478, ..."
12,January 1 – Liechtenstein becomes the 37th cou...,"[0.014468923211097717, 0.007909078150987625, -..."
...,...,...
187,December 21–January 18 – The 2025 Africa Cup o...,"[-0.020572111010551453, -0.018156303092837334,..."
191,May 25 – An ecumenical meeting of the Eastern ...,"[0.0005661920295096934, -0.012484234757721424,..."
192,May 25 – Norway aims to ban the sale of all ne...,"[-0.03225237503647804, -0.023562222719192505, ..."
202,May 25 – Laureates for Nobel Prizes in the fie...,"[-0.010125355795025826, -0.01201776321977377, ..."
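
The embedding loop sends the rows in slices of `batch_size` so that each request stays a manageable size. A minimal sketch of that slicing on its own, with no API calls, is given below. The generator name `iter_batches` and the placeholder texts are hypothetical.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 250 placeholder rows split into batches of 100
texts = [f"row {n}" for n in range(250)]
sizes = [len(batch) for batch in iter_batches(texts, 100)]
print(sizes)  # [100, 100, 50]
```

The final batch is simply shorter than the others, which is why the loop needs no special handling for a row count that is not a multiple of the batch size.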


In [5]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
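
To make the ranking step concrete, here is a small pure-Python sketch of cosine distance and the ascending sort. It assumes the usual definition, one minus the cosine similarity, and uses tiny hypothetical 2-dimensional vectors in place of real embeddings. The function names are illustrative and are not part of the `openai` library.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, rows):
    """Sort (text, vector) pairs so the smallest distance comes first."""
    return sorted(rows, key=lambda row: cosine_distance(query_vec, row[1]))


# Hypothetical toy vectors standing in for embeddings
rows = [
    ("unrelated", [0.0, 1.0]),
    ("close", [0.9, 0.1]),
    ("same direction", [2.0, 0.0]),
]
query = [1.0, 0.0]
for text, vec in rank_by_distance(query, rows):
    print(text, round(cosine_distance(query, vec), 3))
```

Cosine distance depends only on direction, so the "same direction" vector ranks first even though its length differs from the query's.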

## Create a function to compose the prompt

In [6]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
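
The context-packing loop in `create_prompt` is a greedy budget fill: rows are taken in relevance order until the next one would push the total over the limit. The sketch below isolates that logic. It is an illustration only: `count_tokens` is a hypothetical stand-in that counts whitespace-separated words rather than real `tiktoken` tokens, and `select_context` is not a function from the notebook.

```python
def count_tokens(text):
    # Stand-in for len(tokenizer.encode(text)); counts words, not real tokens
    return len(text.split())


def select_context(ranked_texts, question, template_overhead, max_token_count):
    """Greedily keep the most relevant texts that fit in the token budget."""
    used = template_overhead + count_tokens(question)
    context = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_token_count:
            break  # stop at the first row that does not fit
        context.append(text)
    return context


# Toy rows, already sorted from most to least relevant
ranked = ["one two three", "four five", "six seven eight nine", "ten"]
selected = select_context(ranked, "who won", template_overhead=3, max_token_count=10)
print(selected)  # ['one two three', 'four five']
```

Because the loop breaks at the first row that overflows, a shorter and less relevant row further down (here "ten") is not used even though it would fit. This keeps the context strictly in relevance order at the cost of leaving some of the budget unused.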
    

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [7]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [8]:
USAPresident_prompt = """
Question: "Who is the president of USA?"
Answer:
"""
initial_USAPresident_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=USAPresident_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_USAPresident_answer)

The President of USA is Joe Biden.


In [10]:
custom_USAPresident_answer = answer_question("Who is the president of USA?", df)
print(custom_USAPresident_answer)

Donald Trump


### Question 2

In [11]:
Pope_prompt = """
Question: "Who is the current Pope?"
Answer:
"""
initial_Pope_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=Pope_prompt,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_Pope_answer)

Pope Francis is the current Pope.


In [12]:
custom_Pope_answer = answer_question("Who is the current Pope?", df)
print(custom_Pope_answer)

The current Pope is Pope Leo XIV.
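
The two questions above follow the same pattern: ask the basic model, then ask the custom pipeline, and compare. A minimal sketch of a helper that collects both answers side by side follows. The name `compare_answers` and the two stand-in answer functions are hypothetical, so the sketch runs without any API calls.

```python
def compare_answers(questions, baseline_fn, custom_fn):
    """Return (question, baseline answer, custom answer) for each question."""
    return [(q, baseline_fn(q), custom_fn(q)) for q in questions]


# Hypothetical stand-ins; real use would call the OpenAI-backed functions
def fake_baseline(question):
    return "baseline answer to: " + question


def fake_custom(question):
    return "custom answer to: " + question


results = compare_answers(
    ["Who is the president of USA?", "Who is the current Pope?"],
    fake_baseline,
    fake_custom,
)
for question, baseline, custom in results:
    print(question)
    print("  basic :", baseline)
    print("  custom:", custom)
```

In the notebook, `custom_fn` could be `lambda q: answer_question(q, df)`, and `baseline_fn` a small wrapper around the plain `openai.Completion.create` call used in the cells above.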


In [13]:
# Interactive loop: answer user queries until the user types 'exit'
while True:
    # Take input from the user
    user_input = input("Enter your query (type 'exit' to stop): ")
    
    # Check if the user wants to exit the loop
    if user_input.lower() == 'exit':
        break  # Exit the loop
    
    # Answer the query using the custom dataset and print the result
    custom_response = answer_question(user_input, df)
    print(f'Answer: {custom_response} \n')


Enter your query (type 'exit' to stop): Who is the Prime Minister of Great Britain?
Answer: I don't know 

Enter your query (type 'exit' to stop): Who is JD Vance?
Answer: JD Vance is the Vice President of the United States. 

Enter your query (type 'exit' to stop): exit
