# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

I chose the text of the Wikipedia article on The Game Awards 2024, retrieved through the Wikipedia API. The model I am using (`gpt-3.5-turbo-instruct`) has training data only up to September 2021, so the article covers events it cannot have seen, which makes it appropriate for this task.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [22]:
import pandas as pd
import requests

In [32]:
df = pd.DataFrame()
# Get the article's plain text from the Wikipedia API
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "The_Game_Awards_2024",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
response = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = response.json()

# Clean the data: split into lines, then drop blank lines and section headings ("== ... ==")
wiki_text = response_dict["query"]["pages"][0]["extract"].split("\n")
cleaned_text = [line for line in wiki_text if line != "" and not line.startswith("==")]

# Put the cleaned lines in the dataframe (the CSV is written in a later cell, after embeddings are added)
df["text"] = cleaned_text

print(df)

                                                 text
0   The Game Awards 2024 was an award show to hono...
1   Astro Bot and Final Fantasy VII Rebirth led th...
2   As with previous iterations of the Game Awards...
3   The 2024 stage featured more LED displays than...
4   Preceding the Game Awards, several announcemen...
5                       New games announced included:
6   To promote its own game without the budget nec...
7   Nominees for the show's 29 categories were ann...
8   As with the preceding year, the Game Awards pa...
9   The winners were announced during the awards c...
10  Winners are listed first, highlighted in boldf...
11  Astro Bot and Final Fantasy VII Rebirth led th...
12  Astro Bot led the show with four wins (and its...
13  The following individuals, listed in order of ...
14  The Game Awards Orchestra, conducted by Lorne ...
15  The following individuals or groups performed ...
16  Several games saw increased weekly sales follo...
17  The omission of Future C
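
The cleaning step above can be factored into a small helper, which also makes it easy to check the row count against the 20-row requirement. This is a minimal sketch: `clean_extract` and the sample string are hypothetical names introduced here for illustration and are not part of the notebook.

```python
def clean_extract(extract: str) -> list[str]:
    """Split a plain-text extract into body lines.

    Drops empty lines and section headings, which start with '=='
    in the plain-text extract (e.g. '== Winners ==').
    """
    return [
        line
        for line in extract.split("\n")
        if line != "" and not line.startswith("==")
    ]


# Hypothetical sample input standing in for the API response text.
sample = "Intro paragraph.\n\n== Heading ==\nBody paragraph.\n=== Subheading ===\n"
rows = clean_extract(sample)
print(len(rows), rows)
```

With the real extract, comparing `len(clean_extract(text))` against 20 is a quick way to confirm the dataset meets the row requirement before building the dataframe.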

I also added embeddings to the dataframe for completeness. This is not part of the project requirements.

In [None]:
from openai import OpenAI
from api_key import api_base

embed_model = "text-embedding-ada-002"

client = OpenAI(
    base_url = api_base,
    api_key = "Your openai key"
)

def embedding_generator(client, text):
    response = client.embeddings.create(
        model=embed_model,
        input=text
    )
    return response


embeddings = embedding_generator(client, df['text'].tolist())

print(embeddings.data[0].embedding)

full_embeddings_list = [i.embedding for i in embeddings.data]

df["embeddings"] = full_embeddings_list

print(df.head())

df.to_csv("The_Game_Awards_2024.csv")


[-0.006950756534934044, -0.036262575536966324, 0.002872506622225046, -0.019072670489549637, -0.012921443209052086, -0.005980358924716711, -0.019407955929636955, -0.020633043721318245, -0.015087912790477276, 0.0025404435582458973, 0.05029304325580597, -0.006177017465233803, -0.013308312743902206, 0.004262012895196676, -0.017731521278619766, -0.0035785434301942587, 0.02316059172153473, -0.005680534988641739, 0.013940200209617615, -0.03659785911440849, 0.011870447546243668, 0.0012492663227021694, -0.015642426908016205, 0.003907382488250732, 0.022631868720054626, 0.012831173837184906, 0.009471856988966465, -0.037577930837869644, 0.0310785211622715, 0.0017876598285511136, -0.018479470163583755, -0.012553917244076729, 0.006125434767454863, -0.01356622576713562, -0.011644774116575718, -0.023237966001033783, -0.0058417306281626225, -0.032522834837436676, 0.009259078651666641, 0.030382156372070312, 0.004152399953454733, 0.01667407900094986, 0.005519339349120855, -0.013708078302443027, -0.008775
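
The stored embeddings are not used by the queries later in this notebook, which pass every row as context. One common use for them is to rank rows by cosine similarity to the question's embedding and keep only the closest few. The sketch below shows that ranking in plain Python. `cosine_similarity`, `top_k_rows`, and the toy 3-dimensional vectors are illustrative assumptions; in practice the question vector would come from the same embedding model as the rows. It also assumes the embeddings were read back from the CSV, where each cell is a string and needs parsing before use.

```python
import ast
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_rows(query_vec, texts, embeddings, k=3):
    """Return the k texts whose embeddings are most similar to query_vec."""
    scored = sorted(
        zip(texts, embeddings),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,
    )
    return [text for text, _ in scored[:k]]


# Toy data: embedding cells as they would look after a CSV round trip (strings).
texts = ["winners", "presenters", "controversy"]
raw_cells = ["[1.0, 0.0, 0.0]", "[0.0, 1.0, 0.0]", "[0.9, 0.1, 0.0]"]
embeddings = [ast.literal_eval(cell) for cell in raw_cells]

print(top_k_rows([1.0, 0.0, 0.0], texts, embeddings, k=2))
```

The selected rows could then be passed as the `context` argument instead of the full `text` column.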

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [25]:
df = pd.read_csv("The_Game_Awards_2024.csv", index_col=0)

prompt_template = '''
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:'''

chat_model = "gpt-3.5-turbo-instruct"

def ai_generator(user_prompt):
    # Send the prompt to the completion model and return the answer text
    result = client.completions.create(
        model=chat_model,
        prompt=user_prompt,
        max_tokens=32,
        temperature=0
    )
    return result.choices[0].text.strip()

def format_prompt(prompt, context):
    # Fill the template with the context rows (joined by blank lines) and the question
    return prompt_template.format('\n\n'.join(context), prompt)

def chat(user_prompt, context):
    # Print the model's answer, or the error message if the request fails
    try:
        complete_prompt = format_prompt(user_prompt, context)
        print(ai_generator(complete_prompt))
    except Exception as ex:
        print(str(ex))

In [26]:
chat("Say hello", df["text"].values)

Hello!
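
The call above passes every row of the article as context. That works here, but a longer article could exceed the model's context window. One simple guard is to cap the context size before formatting the prompt. The sketch below uses a character budget as a rough stand-in for a token count; `fit_context` and the budget value are hypothetical and would need tuning (or a real tokenizer) for actual use.

```python
def fit_context(rows, max_chars):
    """Keep rows in order until adding another would exceed max_chars.

    Each row is charged its length plus 2 characters for the blank-line
    separator used when the rows are joined, which slightly overcounts
    the final row and so errs on the safe side.
    """
    kept, used = [], 0
    for row in rows:
        cost = len(row) + 2
        if used + cost > max_chars:
            break
        kept.append(row)
        used += cost
    return kept


# Toy rows of 10 characters each; a budget of 25 leaves room for two.
rows = ["a" * 10, "b" * 10, "c" * 10]
print(fit_context(rows, 25))
```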


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [27]:
q1 = "Who won game of the year in 2024?"

chat(q1, [])

I don't know


In [28]:
chat(q1, df["text"].values)

Astro Bot.


### Question 2

In [29]:
q2 = "Was there any controversy at the Game Awards 2024?"
chat(q2, [])

I don't know


In [30]:
chat(q2, df["text"].values)

Yes, there was controversy surrounding the eligibility of downloadable content for awards and the removal of the Future Class initiative.
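
Both "basic" answers above come from the same template with an empty context, so the instruction to say "I don't know" applies by design. To show what the model answers unaided, the baseline could send the bare question instead. The sketch below only builds the two prompts and makes no API call; `build_prompts` is a hypothetical helper, and `TEMPLATE` is a copy of the notebook's `prompt_template`.

```python
TEMPLATE = '''
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:'''


def build_prompts(question, context_rows):
    """Return (baseline, custom) prompts for the same question.

    baseline: the bare question, with no instructions or context.
    custom:   the template filled with the context rows and the question.
    """
    baseline = question
    custom = TEMPLATE.format("\n\n".join(context_rows), question)
    return baseline, custom


baseline, custom = build_prompts(
    "Who won game of the year in 2024?",
    ["First context row.", "Second context row."],
)
print(baseline)
print(custom)
```

Sending `baseline` and `custom` to the same completion model would then compare the model's unaided answer with the context-grounded one.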
