# Custom Chatbot Project

### Dataset character_descriptions.csv

#### Description of the Database

The dataset is a collection of fictional character profiles. Each row contains the following fields:

Name: The character's name.
Description: A short profile covering the character's approximate age, personality, occupation, and relationships to other characters.
Medium: The kind of work the character appears in (e.g., Play or Movie).
Setting: The place where the character's story is set (e.g., England or Texas).

Because each profile is short and self-contained, the dataset is easy to search by trait, which makes it a useful resource for writers, directors, and storytellers.

#### Reasoning for Using This Dataset

Structured and Informative: The dataset provides detailed character descriptions, including their roles, relationships, and settings, making it ideal for a chatbot to simulate meaningful interactions and scenarios.

Context-Rich Content: Attributes like "Description," "Medium," and "Setting" allow the chatbot to generate contextually accurate and immersive responses.

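Scenario-Specific Relevance: The characters come from narrative works such as plays and movies, each tied to a named setting (the rows shown below include plays set in England and movies set in Texas), so the dataset aligns well with creative and narrative-driven chatbot tasks.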

Ease of Use: Its clean, structured format ensures straightforward integration into chatbot systems, enabling seamless queries and responses.

This dataset is a solid foundation for a chatbot aimed at creative storytelling or role-playing scenarios.

#### Use Case Scenario

When writing or producing a story for a play, movie, or novel, creators often seek characters that align with specific narratives or themes. This chatbot helps by allowing users to search for characters based on traits like personality, age, and setting. For instance, a playwright looking for a warm and nurturing mother figure in a modern English setting can quickly identify a suitable character without sifting through multiple sources. The chatbot streamlines the process of matching characters to creative needs, saving time and enhancing storytelling coherence.



#### Importing Libraries

In [1]:
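# Note: openai.embeddings_utils and the openai.Embedding / openai.Completion
# calls used below belong to the legacy (pre-1.0) openai Python package.
import openai
import pandas as pd
import numpy as np
from openai.embeddings_utils import distances_from_embeddings
import tiktoken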

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [2]:
df = pd.read_csv("character_descriptions.csv")
df

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England
5,George,"A man in his early 30s, George is a charming a...",Play,England
6,Rachel,"A woman in her late 20s, Rachel is a shy and i...",Play,England
7,John,"A man in his 60s, John is a retired professor ...",Play,England
8,Maria,"A middle-aged Latina woman in her 40s, Maria i...",Movie,Texas
9,Caleb,"A young African American man in his early 20s,...",Movie,Texas


In [3]:
df["text"] = 'The name of the character is ' + df["Name"] + '.' + 'The character description is: ' + df["Description"] + 'The character loves to act in ' + df["Medium"] + '.' + 'The character lives in ' + df["Setting"]
df.head(10)

Unnamed: 0,Name,Description,Medium,Setting,text
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England,The name of the character is Emily.The charact...
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England,The name of the character is Jack.The characte...
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England,The name of the character is Alice.The charact...
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England,The name of the character is Tom.The character...
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England,The name of the character is Sarah.The charact...
5,George,"A man in his early 30s, George is a charming a...",Play,England,The name of the character is George.The charac...
6,Rachel,"A woman in her late 20s, Rachel is a shy and i...",Play,England,The name of the character is Rachel.The charac...
7,John,"A man in his 60s, John is a retired professor ...",Play,England,The name of the character is John.The characte...
8,Maria,"A middle-aged Latina woman in her 40s, Maria i...",Movie,Texas,The name of the character is Maria.The charact...
9,Caleb,"A young African American man in his early 20s,...",Movie,Texas,The name of the character is Caleb.The charact...
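The cell above concatenates the pieces without separators, so sentences run together in the `text` column (for example `Emily.The charact...`), and the phrase "loves to act in" treats `Medium` as a preference rather than the kind of work the character belongs to. The following is a minimal sketch of an alternative, not the notebook's method: `build_text` is a hypothetical helper, and the wording of the sentences is an assumption. Changing the text would also change the embeddings, so the outputs recorded in this notebook correspond to the original cell.

```python
def build_text(row):
    """Join one character record into a passage with clear sentence breaks.

    `row` is any mapping with "Name", "Description", "Medium", and "Setting"
    keys (a plain dict or a pandas row). Hypothetical helper, not part of
    the original notebook.
    """
    sentences = [
        f"The name of the character is {row['Name']}.",
        f"The character description is: {row['Description'].strip()}",
        f"The character's medium is {row['Medium']}.",
        f"The character lives in {row['Setting']}.",
    ]
    # A single space between sentences avoids "Emily.The character ..."
    return " ".join(sentences)


example = {
    "Name": "Emily",
    "Description": "A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter.",
    "Medium": "Play",
    "Setting": "England",
}
print(build_text(example))
```

With pandas, the same helper could be applied row by row with `df.apply(build_text, axis=1)`.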


In [4]:
df["text"][0]

"The name of the character is Emily.The character description is: A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.The character loves to act in Play.The character lives in England"

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [5]:
# Set OpenAI API Key
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"
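Hard-coding a key in a notebook makes it easy to leak when the file is shared. One common alternative, sketched here under the assumption that the key is stored in an environment variable named `OPENAI_API_KEY`, is to read it at runtime. `load_api_key` is a hypothetical helper, not something the notebook or the `openai` package provides.

```python
import os


def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running the notebook.")
    return key


# In the notebook this would replace the hard-coded string:
# openai.api_key = load_api_key()

# Demonstration with an explicit mapping instead of the real environment
print(load_api_key({"OPENAI_API_KEY": "example-key"}))
```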

In [6]:
# Generate Embeddings using text-embedding-ada-002
def get_my_custom_embeddings(text, model="text-embedding-ada-002"):
    try:
        response = openai.Embedding.create(input=[text], model=model)
        return response.data[0].embedding
    except Exception as e:
        print(f"Error generating embedding: {e}")
        return None

In [7]:
# Compute an embedding for each row of the "text" column
df["embedding"] = df["text"].apply(get_my_custom_embeddings).apply(np.array)

# Create a new dataframe containing only the text and its embedding
embedding_df = df[["text","embedding"]]

# Save new dataframe
embedding_df.to_csv("embeddings.csv")

embedding_df.head(5)

Unnamed: 0,text,embedding
0,The name of the character is Emily.The charact...,"[-0.016069073230028152, -0.010196696035563946,..."
1,The name of the character is Jack.The characte...,"[0.001332677435129881, -0.021623345091938972, ..."
2,The name of the character is Alice.The charact...,"[0.00483351107686758, -0.007390132639557123, -..."
3,The name of the character is Tom.The character...,"[0.014000754803419113, -0.016183529049158096, ..."
4,The name of the character is Sarah.The charact...,"[-0.01699378900229931, -0.022452866658568382, ..."
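One caveat about saving to CSV: `to_csv` stores each embedding as a string, and NumPy's default print settings abbreviate arrays with more than 1000 elements using an ellipsis, so a long vector stored this way may not be recoverable from the file. The sketch below shows one possible workaround, assuming the embeddings are serialized to JSON strings before saving. The helper names are hypothetical, and the third value in the sample vector is made up for illustration.

```python
import json


def embedding_to_json(embedding):
    """Serialize a sequence of floats (list or NumPy array) to a JSON string."""
    return json.dumps([float(x) for x in embedding])


def embedding_from_json(text):
    """Parse a JSON string produced by embedding_to_json into a list of floats."""
    return json.loads(text)


# Short illustrative vector; a real embedding is much longer
vector = [-0.016069073230028152, -0.010196696035563946, 0.25]
stored = embedding_to_json(vector)
restored = embedding_from_json(stored)
print(restored == vector)
```

In the notebook, this would mean applying `embedding_to_json` to the `embedding` column before `to_csv`, and `embedding_from_json` followed by `np.array` after `pd.read_csv`.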


In [8]:
# A Function that Finds Related Pieces of Text for a Given Question
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns a copy of that
    dataframe sorted from most to least relevant for that question
    (smallest cosine distance first)
    """
    
    # Get embeddings for the question text
    question_embeddings = get_my_custom_embeddings(question)
    
    # Compute the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embedding"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
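To make the ranking step concrete, here is a dependency-free sketch of what sorting by cosine distance does. It is an illustration under the assumption that cosine distance means one minus cosine similarity; it is not the implementation of `distances_from_embeddings`, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, vectors):
    """Return indices of `vectors` ordered from closest to farthest from `query`."""
    distances = [cosine_distance(query, v) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: distances[i])


query = [1.0, 0.0]
vectors = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_by_distance(query, vectors))  # → [1, 2, 0]
```

The identical vector comes first (distance 0), the diagonal vector second, and the orthogonal vector last (distance 1), which mirrors the ascending sort used in `get_rows_sorted_by_relevance`.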

In [9]:
# A Function that Composes a Text Prompt
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
    Answer the question based on the context below, and if the question
    can't be answered based on the context, say "I don't know"

    Context: 

    {}

    ---

    Question: {}
    Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
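The token-budget loop in `create_prompt` can be isolated as a small greedy packing routine. The sketch below is a simplified illustration, assuming a stand-in token counter that counts whitespace-separated words instead of `tiktoken`; `pack_context` and its parameters are hypothetical names.

```python
def pack_context(texts, max_tokens, base_tokens=0, count_tokens=None):
    """Collect texts in order until the next one would exceed max_tokens.

    `base_tokens` accounts for tokens already used by the prompt template
    and the question.
    """
    if count_tokens is None:
        # Stand-in tokenizer (assumption): one token per whitespace-separated word
        count_tokens = lambda text: len(text.split())

    used = base_tokens
    selected = []
    for text in texts:
        used += count_tokens(text)
        if used > max_tokens:
            # Stop at the first text that does not fit, as create_prompt does
            break
        selected.append(text)
    return selected


ranked = ["alpha beta gamma delta", "one two three four five", "red green blue cyan pink"]
print(pack_context(ranked, max_tokens=10, base_tokens=1))
```

Because the loop stops at the first row that does not fit, later rows are dropped even if they are short enough to fit in the remaining budget. To count real tokens, a counter such as `lambda t: len(tokenizer.encode(t))` could be passed as `count_tokens`.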

In [17]:
def answer_question_by_custom_prompt(question, df, max_prompt_tokens=2000, max_answer_tokens=300):
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model="gpt-3.5-turbo-instruct",
            prompt=prompt,
            max_tokens=max_answer_tokens,
            temperature=0.2,
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(f"Error during OpenAI request: {e}")
        return ""

In [16]:
# Baseline: send the question directly to the model, without retrieved context
def basic_answer_question(question, max_answer_tokens=300):
    try:
        response = openai.Completion.create(
            model="gpt-3.5-turbo-instruct",
            prompt=question,  # Basic prompt without custom context
            max_tokens=max_answer_tokens,
            temperature=0.2,
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(f"Error during OpenAI request: {e}")
        return ""


## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [18]:
# Question 1
question1 = "Who is the middle-aged Latina woman who owns a small family-run diner in a small Texas town?"

# Get answer from custom prompt
Question1_custom_answer = answer_question_by_custom_prompt(question1, embedding_df)

# Get answer from basic prompt
Question1_basic_answer = basic_answer_question(question1)

# Print both answers for comparison
print("Question 1 (Custom Prompt):")
print(Question1_custom_answer)
print("\nQuestion 1 (Basic Prompt):")
print(Question1_basic_answer)

Question 1 (Custom Prompt):
Maria

Question 1 (Basic Prompt):
Her name is Maria Rodriguez. She is a hardworking and friendly woman in her late 40s, with dark hair and warm brown eyes. She inherited the diner from her parents and has been running it for over 20 years. She is known for her delicious homemade Tex-Mex dishes and her welcoming personality. Maria is a pillar of the community and is always willing to lend a helping hand to those in need. She takes great pride in her diner and treats her customers like family. Despite the challenges of running a small business, Maria is determined to keep the diner thriving and pass it down to her children one day.


### Question 2

In [20]:
# Question 2
question2 = "Who is known for her beauty and magical remedies, believed by some to have mystical powers?"

# Get answer from custom prompt
Question2_custom_answer = answer_question_by_custom_prompt(question2, embedding_df)

# Get answer from basic prompt
Question2_basic_answer = basic_answer_question(question2)

# Print both answers for comparison
print("\nQuestion 2 (Custom Prompt):")
print(Question2_custom_answer)
print("\nQuestion 2 (Basic Prompt):")
print(Question2_basic_answer)


Question 2 (Custom Prompt):
Signora Rosa

Question 2 (Basic Prompt):
Cleopatra is known for her beauty and was believed by some to have mystical powers due to her use of cosmetics and perfumes, as well as her intelligence and charm. She was also known for her use of herbal remedies and potions, which were believed to have magical properties.


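In both questions, the custom prompt returned the character from the dataset (Maria and Signora Rosa). The basic prompt, which has no access to the dataset, either elaborated with details the model generated on its own (a surname and backstory for Maria) or answered about an unrelated figure (Cleopatra). This illustrates how supplying retrieved context in a custom prompt guides the model toward answers grounded in the source data.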
