# Custom Chatbot Project: Character Dataset

In this project I'm going to use a dataset of fictional characters from TV, movies, and plays. I think this dataset is appropriate for this application because it will demonstrate the LLM's ability to connect character names to specific traits within the data, given the appropriate context. We will wrangle the dataset, set up embeddings, use cosine similarity to query the dataset, and then use the query results to generate a RAG response from the OpenAI endpoint.

In [33]:
#Imports
import openai
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import tiktoken

## Data Wrangling

Here we load our test data and concatenate all the information for each character into a single text column in our dataframe so it is easier to process.

In [34]:
df = pd.read_csv('data/character_descriptions.csv')

In [35]:
df.head()

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England


In [36]:
df['text'] = df.apply(lambda row: f"Name: {row['Name']}, Medium: {row['Medium']}, Setting: {row['Setting']}, Description: {row['Description']}", axis=1)

In [37]:
print(df[['text']].head())

                                                text
0  Name: Emily, Medium: Play, Setting: England, D...
1  Name: Jack, Medium: Play, Setting: England, De...
2  Name: Alice, Medium: Play, Setting: England, D...
3  Name: Tom, Medium: Play, Setting: England, Des...
4  Name: Sarah, Medium: Play, Setting: England, D...


## Custom Query Completion

Here we will generate an embedding for each row of our dataset and for our query, compute the cosine similarity between them, and then add the most relevant rows to our query as context. Lastly, we will call the OpenAI completion method to get our response.
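
For reference, this is the standard definition of the cosine similarity that `cosine_similarity` computes between two vectors. The symbols $q$ (query embedding), $d$ (row embedding), and $n$ (embedding dimension) are notation introduced here for illustration, not names from the notebook.

```latex
\[
\operatorname{sim}(q, d)
  = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}
  = \frac{\sum_{i=1}^{n} q_i d_i}
         {\sqrt{\sum_{i=1}^{n} q_i^2} \, \sqrt{\sum_{i=1}^{n} d_i^2}}
\]
```

The value lies in $[-1, 1]$, and a higher value means the two texts point in a more similar direction in embedding space. Cosine distance is $1 - \operatorname{sim}(q, d)$, so ranking by highest similarity is the same as ranking by lowest distance.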

In [38]:
print(openai.__version__)

0.28.0


In [39]:
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR API KEY"

In [40]:
def generate_openai_embedding(text, model="text-embedding-ada-002"):
    response = openai.Embedding.create(
        input=text,
        model=model
    )
    return response['data'][0]['embedding']

In [41]:
df['embedding'] = df['text'].apply(generate_openai_embedding)
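
The cell above makes one API request per row. As a rough sketch, the rows could instead be sent in batches, since the embeddings endpoint accepts a list of strings as `input`. The names `chunked`, `embed_in_batches`, `batch_size`, and `fake_embed_batch` are hypothetical, and the `embed_batch` callable is injected so the sketch runs without an API key.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 100,
) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving the input order."""
    embeddings: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch returned the wrong number of vectors")
        embeddings.extend(vectors)
    return embeddings


# Stand-in embedder for demonstration only: maps each text to a 1-d vector.
def fake_embed_batch(batch: Sequence[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
```

In the notebook, `embed_batch` would wrap `openai.Embedding.create(input=list(batch), model=...)` and collect the `'embedding'` field of each item in `response['data']`, assuming the items come back in input order. For a dataset this small the per-row approach works fine; batching mainly reduces the number of round trips.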

In [42]:
def query_text_openai(query, df, top_n=5):
    query_embedding = np.array(generate_openai_embedding(query)).reshape(1, -1)
    embeddings = np.vstack(df['embedding'].values)
    similarities = cosine_similarity(query_embedding, embeddings).flatten()
    top_indices = similarities.argsort()[-top_n:][::-1]
    results = df.iloc[top_indices]
    return results[['text', 'Description']] 
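
To make the ranking step concrete, here is a minimal pure-Python sketch of what `query_text_openai` does after the embeddings are available: score every row against the query, then keep the indices of the highest scores. The helper names `cosine` and `top_n_indices` and the 2-d toy vectors are made up for illustration; real `text-embedding-ada-002` vectors are much longer.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_indices(
    query: Sequence[float], rows: Sequence[Sequence[float]], top_n: int
) -> List[int]:
    """Indices of the `top_n` rows most similar to `query`, best first."""
    scores = [cosine(query, row) for row in rows]
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]


# Toy 2-d "embeddings" (not real model output).
toy_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.2]
best = top_n_indices(toy_query, toy_rows, top_n=2)
```

Here the query points mostly along the first axis, so row 0 ranks first and row 2 second. The notebook version gets the same ordering with `similarities.argsort()[-top_n:][::-1]`, although the order of exactly tied scores may differ between the two approaches.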

In [43]:
query = "Looking for a sad character from a English setting"
top_results = query_text_openai(query, df)
print(top_results.to_string())

                                                                                                                                                                                                                                                                                                                       text                                                                                                                                                                                                                                                    Description
3                                                   Name: Tom, Medium: Play, Setting: England, Description: A man in his 50s, Tom is a retired soldier and John's son. He has a no-nonsense approach to life, but is haunted by his experiences in combat and struggles with PTSD. He's also in a relationship with Rachel.                                                A man in his 50s, Tom is a retired soldier and John's son. He has 

### Prompt Creation
Now we want to create a function that inserts the relevant context into our prompt.

In [44]:
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know". Please provide 
a description of why the answer was provided.

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    relevant_rows = query_text_openai(question, df, 100)

    for text in relevant_rows["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
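
The context-selection loop in `create_prompt` can be looked at in isolation. The sketch below reproduces the same budget logic with a stand-in tokenizer that counts whitespace-separated words, so it runs without `tiktoken`. The names `select_context`, `count_words`, and `base_tokens`, and the shortened row strings, are hypothetical.

```python
from typing import Callable, List, Sequence


def select_context(
    rows: Sequence[str],
    count_tokens: Callable[[str], int],
    base_tokens: int,
    max_tokens: int,
) -> List[str]:
    """Keep rows, in ranked order, until the next one would exceed the budget."""
    used = base_tokens
    selected: List[str] = []
    for row in rows:
        used += count_tokens(row)
        if used > max_tokens:
            break
        selected.append(row)
    return selected


# Stand-in tokenizer for illustration: one token per whitespace-separated word.
def count_words(text: str) -> int:
    return len(text.split())


ranked_rows = [
    "Name: Tom, Setting: England",
    "Name: Emily, Setting: England",
    "Name: Ava, Setting: Australia",
]
# Each row counts as 4 "tokens"; 5 + 4 + 4 = 13 fits in 14, a third row would not.
kept = select_context(ranked_rows, count_words, base_tokens=5, max_tokens=14)
```

One detail worth noting about the original loop: the `"\n\n###\n\n"` separators joined between rows are not included in the count, so the final prompt can come out slightly above `max_token_count`. With a limit of 1800 tokens and short character descriptions this is unlikely to matter in practice.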

In [45]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """

    prompt = create_prompt(question, df, max_prompt_tokens)

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""


In [46]:
test_question = "Which characters do you have that were are from England and are sad?"
test_answer = answer_question(test_question, df)
print(test_answer)

Tom and Rachel are from England and can be considered sad because Tom struggles with PTSD from being a retired soldier and Rachel struggles with social anxiety and often feels like an outsider.


## Custom Performance Demonstration

In this section we're going to test the bot and see how it handles certain questions about characters. For each question, we first make a text completion call without any context from our dataset, then make a call with the context inserted through our custom prompt.

In [47]:
def create_prompt_without_context(question, max_token_count):
    """
    Given a question, return a text prompt to send to a Completion model.
    This version does not insert any context, but uses the same structure.
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Define the prompt template without context
    prompt_template = """
Answer the question below. If the question can't be answered on its own, say "I don't know."

Question: {}
Answer:"""

    # Count the number of tokens in the prompt template and question
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                          len(tokenizer.encode(question))

    # Ensure the total tokens do not exceed the limit
    if current_token_count > max_token_count:
        raise ValueError("The question exceeds the maximum token count allowed.")

    # Format the prompt with the question
    return prompt_template.format(question)

In [51]:
def answer_question_without_context(
    question, max_prompt_tokens=1800, max_answer_tokens=150
):

    prompt = create_prompt_without_context(question, max_prompt_tokens)

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

In [81]:
def show_character_details(df, name):
    """
    Filters the DataFrame to show details of a character by name,
    displaying Name, Setting, Medium, and the full Description.
    """
    # Set display option to ensure full Description content is shown
    pd.set_option('display.max_colwidth', None)

    # Filter the DataFrame based on the name and select specific columns
    character_details = df[df['Name'] == name][['Name', 'Setting', 'Medium', 'Description']]
    
    # Print the result
    print(character_details)

    # Reset display option to default
    pd.reset_option('display.max_colwidth')
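
Since each question below is asked twice, once with context and once without, a small wrapper could collect both answers in one step. This is only a sketch: `compare_answers` is a hypothetical helper, and the lambdas are stand-ins so that it runs without an API key.

```python
from typing import Callable, Dict


def compare_answers(
    question: str,
    with_context: Callable[[str], str],
    without_context: Callable[[str], str],
) -> Dict[str, str]:
    """Ask the same question both ways and return the answers side by side."""
    return {
        "question": question,
        "without_context": without_context(question),
        "with_context": with_context(question),
    }


# Stand-in answer functions for demonstration only.
demo = compare_answers(
    "Who is Tom?",
    with_context=lambda q: "A retired soldier.",
    without_context=lambda q: "I don't know.",
)
```

In the notebook the two callables would be `lambda q: answer_question(q, df)` and `answer_question_without_context`.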

### Question 1

In [69]:
test_question_1 = "Can you tell me 3 movie characters who are heroic characters from the USA?"

In [70]:
without_context_answer_1 = answer_question_without_context(test_question_1)

In [74]:
context_answer_1 = answer_question(test_question_1, df)

In [72]:
print(f"Without Context Answer: {without_context_answer_1}")

Without Context Answer: I don't know.


In [75]:
print(f"With Context Answer: {context_answer_1}")

With Context Answer: 1. Manuel
2. Tyler
3. Will

These characters were described as brave, determined, and willing to fight for what they believe in. This is typically seen as a heroic trait in American cinema.


In [82]:
show_character_details(df, 'Manuel')

      Name Setting Medium  \
12  Manuel   Texas  Movie   

                                                                                                                                                                                                                                Description  
12  A middle-aged Hispanic man in his 50s, Manuel is a proud and hard-working farmer who's struggling to keep his family's farm afloat. He's fiercely loyal to his family and his community, and will do whatever it takes to protect them.  


In [83]:
show_character_details(df, 'Tyler')

     Name Setting Medium  \
10  Tyler   Texas  Movie   

                                                                                                                                                                                                                                               Description  
10  A white man in his mid-30s, Tyler is a tough-as-nails sheriff who takes his job very seriously. He's stoic, no-nonsense, and has a strong sense of justice. However, he's also struggling to come to terms with a recent tragedy in his personal life.  


In [84]:
show_character_details(df, 'Will')

    Name Setting Medium  \
13  Will   Texas  Movie   

                                                                                                                                                                                                                                                                                        Description  
13  A white man in his early 40s, Will is a successful businessman who's come back to his hometown after many years away. He's confident, charming, and knows how to get what he wants. However, he's also hiding a dark secret from his past that threatens to destroy everything he's worked for.  


#### Results
For this question, the model without context simply responds that it doesn't know, because it has nothing to draw on. The model with context returned three characters, all from Texas (in the USA), with varying degrees of heroism.

### Question 2

In [93]:
test_question_2 = "Can you provide me with 3 female characters from england who are happy?"

In [94]:
without_context_answer_2 = answer_question_without_context(test_question_2)

In [95]:
context_answer_2 = answer_question(test_question_2, df)

In [96]:
print(f"Without Context Answer: {without_context_answer_2}")

Without Context Answer: I don't know.


In [97]:
print(f"With Context Answer: {context_answer_2}")

With Context Answer: Emily, Ava, and Alice are all female characters from England who are described as happy in their descriptions. Emily has a bubbly personality and a quick wit, Ava is an elegant and sophisticated fashion designer, and Alice is a warm and nurturing mother. These descriptions make it seem like they are content with their lives and generally happy characters.


In [98]:
show_character_details(df, 'Emily')

    Name  Setting Medium  \
0  Emily  England   Play   

                                                                                                                                                                                                                Description  
0  A young woman in her early 20s, Emily is an aspiring actress and Alice's daughter. She has a bubbly personality and a quick wit, but struggles with self-doubt and insecurity. She's also in a relationship with George.  


In [99]:
show_character_details(df, 'Ava')

   Name    Setting          Medium  \
18  Ava  Australia  Limited Series   

                                                                                                                                                                                                                                                                                                                                                                            Description  
18  A middle-aged Australian woman in her 50s, Ava is a successful fashion designer who's built an empire on her impeccable taste and attention to detail. She's elegant, sophisticated, and always knows what's in style. She's married to Lucas, but their marriage is strained due to his infidelity. She's also been a mentor to Tahlia, and has helped her navigate the art world.  


In [100]:
show_character_details(df, 'Alice')

    Name  Setting Medium  \
2  Alice  England   Play   

                                                                                                                                                                                                            Description  
2  A woman in her late 30s, Alice is a warm and nurturing mother of two, including Emily. She's kind-hearted and empathetic, but can be overly protective of her children and prone to worrying. She's married to Jack.  


#### Results
For this question, the model without context again responds that it doesn't know because it has no context. The model with context did better, returning characters from the dataset, all of whom appear to be female. It did make one "error" by returning Ava, who is neither from England nor "happy"; at least, judging from her description, she does not appear to be an especially happy character.
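
The manual check done above with `show_character_details` can also be expressed as a small lookup. The sketch below is one possible way to flag names whose recorded setting does not match what the question asked for. `names_outside_setting` is a hypothetical helper, and the `settings` mapping is copied by hand from the three outputs shown for this question.

```python
from typing import Dict, List

# Name -> Setting for the three characters in the answer, taken from the
# show_character_details output for Question 2.
settings: Dict[str, str] = {
    "Emily": "England",
    "Ava": "Australia",
    "Alice": "England",
}


def names_outside_setting(
    names: List[str], settings: Dict[str, str], expected: str
) -> List[str]:
    """Return the names whose recorded setting differs from `expected`.

    Names missing from `settings` are also returned, since their setting
    cannot be confirmed.
    """
    return [name for name in names if settings.get(name) != expected]


mismatches = names_outside_setting(["Emily", "Ava", "Alice"], settings, "England")
```

This only verifies the structured `Setting` field. Whether a character is "happy" is a judgment about the free-text description and still needs a human read.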