# Custom Chatbot Project

I selected the `character_descriptions.csv` dataset to build a custom chatbot that helps users explore fictional characters from movies, theatre, and TV. The chatbot could be useful for writers, actors, or fans who want to learn about a character's background, traits, or setting. Supplying character-specific data in the model's context lets the chatbot answer from this fictional universe instead of giving vague or hallucinated responses.

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [1]:
#1. Dataset Selection & Scenario Explanation
import pandas as pd

In [2]:
# Load dataset
df = pd.read_csv("data/character_descriptions.csv")
df

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England
5,George,"A man in his early 30s, George is a charming a...",Play,England
6,Rachel,"A woman in her late 20s, Rachel is a shy and i...",Play,England
7,John,"A man in his 60s, John is a retired professor ...",Play,England
8,Maria,"A middle-aged Latina woman in her 40s, Maria i...",Movie,Texas
9,Caleb,"A young African American man in his early 20s,...",Movie,Texas


The data has four columns: `Name`, `Description`, `Medium`, and `Setting`.

In [3]:
#2. Data Wrangling
# Clean column names
df.columns = df.columns.str.strip()

# Combine fields into one "text" column
df['text'] = df.apply(
    lambda row: f"{row['Name']} is a character from a {row['Medium']} set in {row['Setting']}. Description: {row['Description']}",
    axis=1
)
df

Unnamed: 0,Name,Description,Medium,Setting,text
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,Emily is a character from a Play set in Englan...
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,Jack is a character from a Play set in England...
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,Alice is a character from a Play set in Englan...
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,Tom is a character from a Play set in England....
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,Sarah is a character from a Play set in Englan...
5,George,"A man in his early 30s, George is a charming a...",Play,England,George is a character from a Play set in Engla...
6,Rachel,"A woman in her late 20s, Rachel is a shy and i...",Play,England,Rachel is a character from a Play set in Engla...
7,John,"A man in his 60s, John is a retired professor ...",Play,England,John is a character from a Play set in England...
8,Maria,"A middle-aged Latina woman in her 40s, Maria i...",Movie,Texas,Maria is a character from a Movie set in Texas...
9,Caleb,"A young African American man in his early 20s,...",Movie,Texas,Caleb is a character from a Movie set in Texas...


In [4]:
# Keep only the "text" column
df = df[['text']]

# Preview
print(f"Total rows: {len(df)}")
df.head()

Total rows: 55


Unnamed: 0,text
0,Emily is a character from a Play set in Englan...
1,Jack is a character from a Play set in England...
2,Alice is a character from a Play set in Englan...
3,Tom is a character from a Play set in England....
4,Sarah is a character from a Play set in Englan...


In [5]:
# Preview result
print(f"Total rows: {len(df)}")

Total rows: 55


In [6]:
import re

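def clean_text(text):
    # Drop backslashes left over from escaped characters
    text = text.replace("\\", "")
    # Keep only ASCII letters, digits, whitespace, and basic punctuation;
    # this also strips accented letters and curly quotes
    return re.sub(r"[^a-zA-Z0-9\s.,'\"!?;:()-]", '', text)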

df["text"] = df["text"].apply(clean_text)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["text"] = df["text"].apply(clean_text)
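
The `SettingWithCopyWarning` shown here appears because `df = df[['text']]` produces a selection of the original dataframe, and pandas cannot tell whether the later column assignment is meant for that selection or for the original. One common way to avoid it is an explicit `.copy()`. The sketch below assumes that approach is acceptable for this notebook; the two-row frame and the names `raw` and `text_df` are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the character dataset
raw = pd.DataFrame({
    "Name": ["Emily", "Jack"],
    "text": ["Emily is a character...", "Jack is a character..."],
})

# .copy() makes the selection an independent dataframe,
# so later column assignments are unambiguous
text_df = raw[["text"]].copy()
text_df["text"] = text_df["text"].str.upper()

print(text_df["text"].tolist())
print(raw["text"].tolist())  # the original frame is left unchanged
```

The same pattern would apply to the later `df["embeddings"] = embeddings` assignment, since it writes to the same dataframe.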


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [8]:
#3. Embeddings and Custom Query Setup
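import os
import openai

# Read the API key from the environment instead of hardcoding it in the notebook
# (the variable name OPENAI_API_KEY is an assumption; use whatever your setup defines)
openai.api_key = os.environ["OPENAI_API_KEY"]
openai.api_base = "https://openai.vocareum.com/v1"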
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

# Create embeddings in batches
batch_size = 50
embeddings = []
for i in range(0, len(df), batch_size):
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    embeddings.extend([entry["embedding"] for entry in response["data"]])

df["embeddings"] = embeddings
df.to_csv("data/character_descriptions_embeddings.csv", index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["embeddings"] = embeddings
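
The embedding cell sends rows to the API in slices of `batch_size`. With 55 rows and a batch size of 50, that means two requests: one with 50 rows and one with the remaining 5. The sketch below only reproduces the slice bounds the loop visits; `batch_ranges` is a hypothetical helper and makes no API calls.

```python
def batch_ranges(total_rows, batch_size):
    """Return the (start, stop) slice bounds visited by a batching loop."""
    return [
        (start, min(start + batch_size, total_rows))
        for start in range(0, total_rows, batch_size)
    ]

# 55 rows (the dataset size printed earlier) with batch_size = 50
ranges = batch_ranges(55, 50)
print(ranges)  # [(0, 50), (50, 55)]
```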


In [9]:
#4. Similarity-Based Retrieval
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    question_embedding = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embedding,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy.iloc[:10]["text"].tolist()
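
The retrieval step ranks rows by cosine distance (one minus cosine similarity) between the question embedding and each row embedding, so smaller values mean closer matches. Below is a dependency-free sketch of that ranking, using tiny made-up 3-dimensional vectors in place of real embeddings. `cosine_distance` and `rank_by_distance` are illustrative names, not part of the `openai` package.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank_by_distance(query, rows):
    """Sort (text, vector) pairs from closest to farthest from the query."""
    return sorted(rows, key=lambda row: cosine_distance(query, row[1]))

# Made-up vectors; real embeddings have far more dimensions
rows = [
    ("Emily", [1.0, 0.0, 0.0]),
    ("Jack", [0.0, 1.0, 0.0]),
    ("Maria", [0.7, 0.7, 0.0]),
]
query = [0.9, 0.1, 0.0]

ranked = [text for text, _ in rank_by_distance(query, rows)]
print(ranked)  # ['Emily', 'Maria', 'Jack']
```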


In [10]:
#5. Prompt Building and Answering
import tiktoken

def build_custom_prompt(question, df, max_token_count=1800):
    tokenizer = tiktoken.get_encoding("cl100k_base")
    prompt_template = """
Answer the question based on the context below. If the question
can't be answered from the context, say "I don't know".

Context:

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + len(tokenizer.encode(question))
    context = []
    for text in get_rows_sorted_by_relevance(question, df):
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break
    return prompt_template.format("\n\n###\n\n".join(context), question)

def build_simple_prompt(question: str):
    return f"if the question can't be answered, say \"I don't know\". question: {question}"

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(prompt_text, max_tokens=150):
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt_text,
            max_tokens=max_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        return str(e)
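
`build_custom_prompt` fills the context greedily: it walks the rows from most to least relevant and stops at the first row that would push the prompt over `max_token_count`. The simplified sketch below reproduces that loop, assuming a whitespace word count as a stand-in for the `tiktoken` tokenizer (real token counts will differ). `count_tokens` and `pack_context` are hypothetical helpers, and the sample strings are made up.

```python
def count_tokens(text):
    # Stand-in tokenizer: counts whitespace-separated words, not real tokens
    return len(text.split())

def pack_context(ranked_texts, question, max_token_count):
    """Add texts in ranked order until the next one would exceed the budget."""
    used = count_tokens(question)
    context = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_token_count:
            break  # stop at the first text that does not fit
        context.append(text)
    return context

ranked_texts = [
    "Jack is a businessman",      # 4 words
    "Alice is married to Jack",   # 5 words
    "Tom is a retired soldier",   # 5 words
]
selected = pack_context(ranked_texts, "Who is Jack?", max_token_count=12)
print(selected)  # the first two texts fit; the third would exceed the budget
```

Because the loop breaks at the first overflow, a shorter but less relevant row further down the ranking is never considered. That keeps the context ordered strictly by relevance.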


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [11]:
#6. Custom Performance Demonstration
# Question 1
question1 = "Who is Marla Wexler and what show is she from?"
print("Without Context:\n", answer_question(build_simple_prompt(question1)))
print("\nWith Context:\n", answer_question(build_custom_prompt(question1, df)))

Without Context:
 I don't know.

With Context:
 I don't know.


### Question 2

In [12]:
# Question 2
question2 = "Tell me about the character who is a time-traveling detective."
print("\nWithout Context:\n", answer_question(build_simple_prompt(question2)))
print("\nWith Context:\n", answer_question(build_custom_prompt(question2, df)))



Without Context:
 I'm sorry, but I don't have enough information to answer that question. I am an AI and do not have knowledge of fictional characters.

With Context:
 I don't know.


In [13]:
while True:
    question = input("Ask about a character (or type 'exit' to quit): ")
    if question.lower() in ["exit", "quit"]:
        print("Goodbye!")
        break
    print("\nWithout Context:")
    print(answer_question(build_simple_prompt(question)))
    print("\nWith Context:")
    print(answer_question(build_custom_prompt(question, df)))
    print("\n" + "-" * 40)


Ask about a character (or type 'exit' to quit): Who is Jack?

Without Context:
I don't know.

With Context:
Jack is a successful businessman and Sarah's boss, married to Alice.

----------------------------------------
Ask about a character (or type 'exit' to quit): What is his Setting?

Without Context:
I don't know.

With Context:
It is not specified in the context which character the question is referring to, so it cannot be answered accurately. Each character has their own specific setting in the descriptions provided.

----------------------------------------
Ask about a character (or type 'exit' to quit): What is Rachel Setting?

Without Context:
I don't know.

With Context:
In the given context, it is not mentioned where Rachel is specifically setting, but the setting of the play is in England.

----------------------------------------
Ask about a character (or type 'exit' to quit): What is the medium of Baron Gustavo play?

Without Context:
I don't know.

With Context:
Opera

-