# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

## Step 1

### Loading the given Data with `pandas`

In [1]:
import pandas as pd
import os
import numpy as np

In [2]:
# df = pd.read_csv("./data/character_descriptions_external.csv")
# print(df.shape)

character_descriptions_folder_path = "./data/character_descriptions"

#character_descriptions_text_raw_path = "./data/character_descriptions/character_descriptions.csv"
character_descriptions_text_clean_path = "./data/internal_and_external/character_descriptions_clean.csv"
character_descriptions_embeddings_path = "./data/internal_and_external/character_descriptions_embeddings.csv"
character_descriptions_distances_sorted_path = "./data/internal_and_external/character_descriptions_distances_sorted_path.csv"

question = "Who was up coming musical perfomer from USA, full of energy and enthusiasm?"
# Answer Johnny, row number 22

In [3]:
# List all CSV files
files = [f for f in os.listdir(character_descriptions_folder_path) if f.endswith(".csv")]

df_list = [pd.read_csv(os.path.join(character_descriptions_folder_path, f)) for f in files]
df = pd.concat(df_list, ignore_index=True)

print("Total files combined:", len(files))
print("Combined shape:", df.shape)


Total files combined: 2
Combined shape: (214, 4)


In [4]:
# Clean up: treat empty strings as missing, then drop incomplete rows
required_columns = ['Name', 'Description', 'Medium', 'Setting']

# Assign the result back instead of calling replace(..., inplace=True) on a
# single column, which is chained assignment and deprecated in recent pandas
df[required_columns] = df[required_columns].replace('', np.nan)
df = df.dropna(subset=required_columns)

df

| | Name | Description | Medium | Setting |
| --- | --- | --- | --- | --- |
| 0 | Emily | A young woman in her early 20s, Emily is an as... | Play | England |
| 1 | Jack | A middle-aged man in his 40s, Jack is a succes... | Play | England |
| 2 | Alice | A woman in her late 30s, Alice is a warm and n... | Play | England |
| 3 | Tom | A man in his 50s, Tom is a retired soldier and... | Play | England |
| 4 | Sarah | A woman in her mid-20s, Sarah is a free-spirit... | Play | England |
| ... | ... | ... | ... | ... |
| 209 | Mitsuo Sagawa | Japanese singer, actor, talent (1939-2024) | rockabilly | Japan |
| 210 | Mitsuo Sagawa | Japanese singer, actor, talent (1939-2024) | rockabilly | Empire of Japan |
| 211 | Hiroshi Satō | Japanese folk singer and actor | traditional folk music | Japan |
| 212 | Thaíde | Brazilian rapper | hip-hop | Brazil |


In [5]:
# Combine the fields for embeddings
df["text"] = (
    df["Name"].fillna("") + ", " +
    df["Description"].fillna("") + ", " +
    df["Medium"].fillna("") + ", " +
    df["Setting"].fillna("")
)

df_text = df.drop(['Name', 'Description', 'Medium', 'Setting'], axis=1)
df_text.to_csv(character_descriptions_text_clean_path)
df_text

| | text |
| --- | --- |
| 0 | Emily, A young woman in her early 20s, Emily i... |
| 1 | Jack, A middle-aged man in his 40s, Jack is a ... |
| 2 | Alice, A woman in her late 30s, Alice is a war... |
| 3 | Tom, A man in his 50s, Tom is a retired soldie... |
| 4 | Sarah, A woman in her mid-20s, Sarah is a free... |
| ... | ... |
| 209 | Mitsuo Sagawa, Japanese singer, actor, talent ... |
| 210 | Mitsuo Sagawa, Japanese singer, actor, talent ... |
| 211 | Hiroshi Satō, Japanese folk singer and actor, ... |
| 212 | Thaíde, Brazilian rapper, hip-hop, Brazil |
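
The project brief asks for a `"text"` column with at least 20 rows. A quick sanity check before paying for embeddings could look like the sketch below. The helper name `check_text_frame` and the stand-in data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

MIN_ROWS = 20  # minimum row count stated in the project brief


def check_text_frame(frame: pd.DataFrame, min_rows: int = MIN_ROWS) -> list[str]:
    """Return a list of problems; an empty list means the frame looks usable."""
    problems = []
    if "text" not in frame.columns:
        problems.append('missing "text" column')
        return problems
    if len(frame) < min_rows:
        problems.append(f"only {len(frame)} rows, need at least {min_rows}")
    # Rows that are NaN or whitespace-only would produce useless embeddings.
    blank = frame["text"].isna() | (frame["text"].astype(str).str.strip() == "")
    if blank.any():
        problems.append(f"{int(blank.sum())} blank text rows")
    return problems


# Stand-in data, not the real character descriptions.
demo = pd.DataFrame({"text": [f"Character {i}, a short description" for i in range(25)]})
print(check_text_frame(demo))
```

In the notebook itself, the same call would be `check_text_frame(df_text)`.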


### Creating an Embeddings Index with `openai.Embedding`

In [14]:
df = pd.read_csv(character_descriptions_text_clean_path, index_col=0)
df

| | text |
| --- | --- |
| 0 | Emily, A young woman in her early 20s, Emily i... |
| 1 | Jack, A middle-aged man in his 40s, Jack is a ... |
| 2 | Alice, A woman in her late 30s, Alice is a war... |
| 3 | Tom, A man in his 50s, Tom is a retired soldie... |
| 4 | Sarah, A woman in her mid-20s, Sarah is a free... |
| ... | ... |
| 209 | Mitsuo Sagawa, Japanese singer, actor, talent ... |
| 210 | Mitsuo Sagawa, Japanese singer, actor, talent ... |
| 211 | Hiroshi Satō, Japanese folk singer and actor, ... |
| 212 | Thaíde, Brazilian rapper, hip-hop, Brazil |


In [15]:
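# The cells below use the legacy interface of the openai package (versions before 1.0):
# openai.Embedding, openai.Completion and openai.embeddings_utils
import openai

# Read the key from the environment rather than hard-coding it;
# the placeholder is only used if OPENAI_API_KEY is not set
openai.api_key = os.environ.get("OPENAI_API_KEY", "YOUR API KEY")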



In [16]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

# Charged for token usage
response = openai.Embedding.create(
    input=df["text"].tolist(),
    model=EMBEDDING_MODEL_NAME
)

In [17]:
embeddings = [data["embedding"] for data in response["data"]]
df["embeddings"] = embeddings
df.to_csv(character_descriptions_embeddings_path)
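
The single request above sends every row at once, which works for a couple of hundred short texts. For a larger dataset, one option is to send the texts in batches. The sketch below assumes a caller-supplied `embed_batch` function (hypothetical) that takes a list of strings and returns one vector per string. The default `batch_size` of 100 is an arbitrary choice, not a documented limit.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedding call returned the wrong number of vectors")
        vectors.extend(result)
    return vectors


# Hypothetical stand-in for the API call so the sketch runs offline:
# each "embedding" is just the length of the text.
def fake_embed_batch(batch: Sequence[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(demo_vectors)
```

In the notebook, `embed_batch` could wrap the `openai.Embedding.create` call from In [16] and extract `data["embedding"]` from each item as In [17] does.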

## Step 2

### Finding Relevant Data with Cosine Similarity

In [18]:
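import ast

import numpy as np
import pandas as pd

df_distance = pd.read_csv(character_descriptions_embeddings_path, index_col=0)
# The CSV stores each embedding as a string; parse it with ast.literal_eval,
# which only accepts Python literals, instead of the unsafe built-in eval
df_distance["embeddings"] = df_distance["embeddings"].apply(ast.literal_eval).apply(np.array)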

In [19]:
import openai
from openai.embeddings_utils import get_embedding

question_embeedings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
question_embeedings

[-0.005105266813188791,
 -0.02451050840318203,
 0.004843960981816053,
 -0.008760285563766956,
 0.005409035366028547,
 0.023556740954518318,
 -0.008198477327823639,
 0.002020222367718816,
 -0.029762759804725647,
 0.008845210075378418,
 0.02550347149372101,
 0.00806129164993763,
 0.010315056890249252,
 -0.017102481797337532,
 0.012229123152792454,
 -0.007963301613926888,
 0.020303480327129364,
 -0.0009162043570540845,
 0.0030442154966294765,
 -0.04371650144457817,
 -0.019454235211014748,
 0.003720344975590706,
 -0.0017229868099093437,
 0.005010543391108513,
 -0.0008288301178254187,
 0.027855224907398224,
 0.034387875348329544,
 -0.0160964522510767,
 0.00027906670584343374,
 -0.015861278399825096,
 0.024170810356736183,
 0.0031356727704405785,
 -0.004416072275489569,
 -0.014384898357093334,
 -0.01737685315310955,
 -0.03248034417629242,
 -0.01201354619115591,
 -0.01098138652741909,
 0.003305521560832858,
 -0.0068135554902255535,
 5.0015612941933796e-05,
 0.022158753126859665,
 -0.003658284

In [20]:
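from openai.embeddings_utils import distances_from_embeddings
# Use the embeddings reloaded into df_distance, the frame the distances are attached to in the next cell
distances = distances_from_embeddings(question_embeedings, df_distance["embeddings"].tolist(), distance_metric="cosine")
distances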

[0.24005850744822677,
 0.25916372708896873,
 0.2717795455240961,
 0.25816798745924363,
 0.23358155357674526,
 0.2363042540586805,
 0.2747780860421397,
 0.250804413524397,
 0.25525934605942135,
 0.21218896331343595,
 0.25779687016683006,
 0.2380949412086083,
 0.2516342771097936,
 0.23909000304911765,
 0.2539103820425357,
 0.2621274641874768,
 0.2672995184996516,
 0.26131426604565,
 0.2786733381531774,
 0.18415232544465499,
 0.14775187888491148,
 0.19977879312783786,
 0.19260562504420076,
 0.22416848595350858,
 0.211950281222949,
 0.1803059750900693,
 0.216270114389002,
 0.2284518206923738,
 0.2431060731983843,
 0.20903712104420713,
 0.22618827829876187,
 0.2361366194372485,
 0.23163595612776589,
 0.24382364907356047,
 0.2324610186213555,
 0.24608200169585492,
 0.24737837813024532,
 0.22323463976977997,
 0.23287925074696914,
 0.26973874299554257,
 0.28310686878590663,
 0.27896876342078614,
 0.26809068849791395,
 0.2652890568091649,
 0.27489141856727106,
 0.28953818461305936,
 0.275451243

In [21]:
df_distance["distances"] = distances
df_distance

| | text | embeddings | distances |
| --- | --- | --- | --- |
| 0 | Emily, A young woman in her early 20s, Emily i... | [-0.01387409120798111, -0.015974298119544983, ... | 0.240059 |
| 1 | Jack, A middle-aged man in his 40s, Jack is a ... | [0.008523228578269482, -0.021200304850935936, ... | 0.259164 |
| 2 | Alice, A woman in her late 30s, Alice is a war... | [0.006618042476475239, -0.012352379970252514, ... | 0.271780 |
| 3 | Tom, A man in his 50s, Tom is a retired soldie... | [0.0199261587113142, -0.017992831766605377, 0.... | 0.258168 |
| 4 | Sarah, A woman in her mid-20s, Sarah is a free... | [-0.016199376434087753, -0.022538263350725174,... | 0.233582 |
| ... | ... | ... | ... |
| 209 | Mitsuo Sagawa, Japanese singer, actor, talent ... | [-0.023149091750383377, -0.026412900537252426,... | 0.205248 |
| 210 | Mitsuo Sagawa, Japanese singer, actor, talent ... | [-0.020797261968255043, -0.02908508852124214, ... | 0.205472 |
| 211 | Hiroshi Satō, Japanese folk singer and actor, ... | [-0.015211681835353374, -0.005565015133470297,... | 0.212016 |
| 212 | Thaíde, Brazilian rapper, hip-hop, Brazil | [-0.02801332250237465, -0.015631381422281265, ... | 0.240037 |


In [22]:
df_distance_sorted = df_distance.sort_values(by="distances", ascending=True).reset_index(drop=True)
df_distance_sorted.to_csv(character_descriptions_distances_sorted_path)

## Step 3

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Tokenizing with `tiktoken`

In [23]:
import tiktoken

In [24]:
tokenizer = tiktoken.get_encoding("cl100k_base")

In [25]:
prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

In [26]:
print(prompt_template.format("context", question))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

context

---

Question: Who was up coming musical perfomer from USA, full of energy and enthusiasm?
Answer:


In [27]:
max_token_count = 1000
current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
current_token_count

57

In [28]:
context = []
df_distance_sorted = pd.read_csv(character_descriptions_distances_sorted_path, index_col=0)

In [29]:
for text in df_distance_sorted["text"].values:
    text_token_count = len(tokenizer.encode(text))
    current_token_count += text_token_count
    
    if current_token_count <= max_token_count:
        context.append(text)
    else:
        break
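
The loop above can be wrapped in a reusable function, which makes it easier to rebuild the context for each new question. This is a minimal sketch: `build_context` and `count_tokens` are hypothetical names, and the demo uses a whitespace word count in place of a real tokenizer.

```python
from typing import Callable, Iterable, List


def build_context(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    budget: int,
    used: int = 0,
) -> List[str]:
    """Collect texts in order until adding the next one would exceed `budget`."""
    selected: List[str] = []
    for text in texts:
        cost = count_tokens(text)
        if used + cost > budget:
            break  # stop at the first text that does not fit
        selected.append(text)
        used += cost
    return selected


# Whitespace word count as a crude stand-in for a real tokenizer.
def count_words(text: str) -> int:
    return len(text.split())


demo_context = build_context(
    ["alpha beta", "gamma delta epsilon", "zeta eta theta iota"],
    count_words,
    budget=6,
    used=1,  # tokens already spent on the prompt template and question
)
print(demo_context)
```

In the notebook, `count_tokens` could be `lambda t: len(tokenizer.encode(t))`, with `budget=max_token_count` and `used` set to the template-plus-question count from In [27].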

## Step 4

### Getting a Custom Q&A Response with `openai.Completion`

In [30]:
openai.Completion.create( 
    model="gpt-3.5-turbo-instruct",
    prompt=prompt_template.format("\n\n###\n\n".join(context), question)
)["choices"][0]["text"]

'\nJohnny.'

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
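
One way to structure the comparison is a small helper that sends the same question twice, once with no context and once with the retrieved context, and returns both answers side by side. The sketch below is an assumption about how this could be organized. `compare_answers`, `BASIC_TEMPLATE` and `fake_complete` are hypothetical, and the fake completion function exists only so the example runs without an API key.

```python
from typing import Callable, Dict, Sequence

# Hypothetical no-context prompt; the wording is an assumption.
BASIC_TEMPLATE = "Question: {}\nAnswer:"

# Mirrors the prompt_template defined earlier in the notebook.
CUSTOM_TEMPLATE = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context:

{}

---

Question: {}
Answer:"""


def compare_answers(
    question: str,
    context: Sequence[str],
    complete: Callable[[str], str],
) -> Dict[str, str]:
    """Ask the same question with and without context and return both answers."""
    basic = complete(BASIC_TEMPLATE.format(question))
    custom = complete(CUSTOM_TEMPLATE.format("\n\n###\n\n".join(context), question))
    return {"question": question, "basic": basic.strip(), "custom": custom.strip()}


# Fake completion function (placeholder output, not a model response).
def fake_complete(prompt: str) -> str:
    return "\nanswered with context" if "Context:" in prompt else "\nanswered without context"


demo_result = compare_answers(
    "Placeholder question?",
    ["placeholder row one", "placeholder row two"],
    fake_complete,
)
print(demo_result)
```

In the notebook, `complete` could wrap the `openai.Completion.create(...)["choices"][0]["text"]` call from Step 4, and `context` would be rebuilt per question from that question's own sorted distances.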

### Question 1

### Question 2