# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

I have chosen the text of the Wikipedia page for The Game Awards 2024, retrieved through the Wikipedia API. The GPT model I am using (gpt-3.5-turbo-instruct) has a training cutoff of September 2021, so it cannot have been trained on this article. That makes the dataset appropriate for the task.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
import pandas as pd
import requests

In [3]:
df = pd.DataFrame()
# get text from the Wikipedia API
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "The_Game_Awards_2024",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
response = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = response.json()

# clean data
wiki_text = response_dict["query"]["pages"][0]["extract"].split("\n")
filtered_lines = filter(lambda x: x != '' and not x.startswith('=='), wiki_text)
cleaned_text = list(filtered_lines)

# put in dataframe
df['text'] = cleaned_text

print(df.head())

                                                text
0  The Game Awards 2024 was an award show to hono...
1  Astro Bot and Final Fantasy VII Rebirth led th...
2  As with previous iterations of the Game Awards...
3  The 2024 stage featured more LED displays than...
4  Preceding the Game Awards, several announcemen...
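
The cleaning step above can be checked offline before calling the API. The following is a minimal sketch, assuming a hypothetical helper named `clean_extract` and a made-up sample string in place of the real response. Unlike the cell above, it also strips surrounding whitespace, so whitespace-only lines are dropped too.

```python
def clean_extract(extract):
    """Split a plain-text Wikipedia extract into non-empty body lines."""
    rows = []
    for line in extract.split("\n"):
        line = line.strip()
        # Skip blank lines and section headings such as "== Winners =="
        if not line or line.startswith("=="):
            continue
        rows.append(line)
    return rows


# Hypothetical sample standing in for the real API response
sample = (
    "Intro paragraph.\n\n== Background ==\nSecond paragraph.\n"
    "=== Detail ===\n  \nThird paragraph."
)
rows = clean_extract(sample)
print(len(rows), rows)
```

With the real extract, `len(rows)` is the number to compare against the 20-row minimum stated in the task.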


The next cell adds embeddings to the dataframe for completeness. This is not part of the project requirements.

In [4]:
from openai import OpenAI
from api_key import openai_key, api_base

embed_model = "text-embedding-ada-002"

client = OpenAI(
    base_url = api_base,
    api_key = openai_key
)

def embedding_generator(client, text):
    response = client.embeddings.create(
        model=embed_model,
        input=text
    )
    return response


embeddings = embedding_generator(client, df['text'].tolist())

print(embeddings.data[0].embedding)

full_embeddings_list = [i.embedding for i in embeddings.data]

df["embeddings"] = full_embeddings_list

print(df.head())

df.to_csv("The_Game_Awards_2024.csv")


[-0.007118420209735632, -0.03615951165556908, 0.002961172489449382, -0.019072722643613815, -0.012889240868389606, ...]
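
The stored embeddings are not used later in this notebook, but one common use is ranking rows by similarity to a question. The sketch below is illustrative only: the helper names and the toy 3-dimensional vectors are assumptions (real `text-embedding-ada-002` vectors have 1536 dimensions). It also parses string cells, because a list column written with `to_csv` comes back from `read_csv` as text.

```python
import ast
import math


def parse_embedding(cell):
    """Turn a CSV cell such as "[0.1, 0.2]" back into a list of floats."""
    return [float(x) for x in ast.literal_eval(cell)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_rows(question_embedding, row_embeddings):
    """Return row indices ordered from most to least similar."""
    scores = [cosine_similarity(question_embedding, e) for e in row_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy vectors stored as strings, the way they would appear after a CSV round trip
stored_cells = ["[1.0, 0.0, 0.0]", "[0.0, 1.0, 0.0]", "[0.7, 0.7, 0.0]"]
row_embeddings = [parse_embedding(c) for c in stored_cells]
ranking = rank_rows([0.9, 0.1, 0.0], row_embeddings)
print(ranking)
```

In the notebook, the question vector would come from the same embedding model as the rows, and the top-ranked rows could serve as the context instead of the whole article.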

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [5]:
df = pd.read_csv("The_Game_Awards_2024.csv", index_col=0)

prompt_template = '''
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:'''

chat_model = "gpt-3.5-turbo-instruct"

def ai_generator(user_prompt):
    result = client.completions.create(
        model=chat_model,
        prompt=user_prompt,
        max_tokens=32,
        temperature=0
    )
    # return the text so the caller can print or reuse it
    return result.choices[0].text.strip()

def format_prompt(prompt, context):
    # join the context rows with blank lines and fill the template
    return prompt_template.format('\n\n'.join(context), prompt)

def chat(user_prompt, context):
    try:
        complete_prompt = format_prompt(user_prompt, context)
        return ai_generator(complete_prompt)
    except Exception as ex:
        return str(ex)

In [6]:
print(chat("", df["text"]))

I don't know
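
The call above passes every row of the dataframe as context. A completion model has a finite context window, so a long article may not fit in one prompt. The sketch below shows one way to cap the context, assuming a hypothetical helper `build_context` and an arbitrary budget of 1500 words. Word count is only a rough stand-in for token count; a tokenizer would give exact numbers.

```python
def build_context(rows, max_words=1500):
    """Collect rows in order until an approximate word budget is reached."""
    selected = []
    used = 0
    for row in rows:
        n_words = len(row.split())
        # Stop before the row that would push the context over budget
        if used + n_words > max_words:
            break
        selected.append(row)
        used += n_words
    return selected


# Toy rows; in the notebook this would be df["text"]
toy_rows = ["a b c", "d e", "f g h i"]
print(build_context(toy_rows, max_words=5))
```

The result can be passed to `chat` in place of `df["text"]`. Rows are taken in document order here; ordering them by relevance to the question first would be a natural refinement.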


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
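
One possible shape for the comparison is a helper that sends the same question twice, once as a bare prompt and once with the article as context. This is a sketch under stated assumptions: `compare_answers`, `PROMPT_TEMPLATE`, and the `complete` callable are hypothetical names, the example question is illustrative, and `fake_complete` is a stand-in so the code runs without an API key. It returns placeholder strings, not model output.

```python
PROMPT_TEMPLATE = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

{}

---

Question: {}
Answer:"""


def compare_answers(question, context_rows, complete):
    """Return the basic and the context-augmented answer for one question.

    `complete` is any callable that takes a prompt string and returns text;
    in the notebook it would wrap the completions call.
    """
    basic_prompt = f"Question: {question}\nAnswer:"
    custom_prompt = PROMPT_TEMPLATE.format("\n\n".join(context_rows), question)
    return {"basic": complete(basic_prompt), "custom": complete(custom_prompt)}


def fake_complete(prompt):
    # Stand-in for a real model call so the sketch runs offline
    return "with context" if "Context:" in prompt else "without context"


answers = compare_answers(
    "Which game won Game of the Year?", ["row one", "row two"], fake_complete
)
print(answers)
```

Replacing `fake_complete` with a function that calls the model would produce the two answers to show under each question heading.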

### Question 1

### Question 2