# Custom Chatbot Project

In [10]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"

openai.__version__

'0.28.0'

In [11]:
from dotenv import load_dotenv
import os

# Load environment variables from .env file
load_dotenv()

# Access environment variables like this:
openai.api_key = os.getenv('VOC_API_KEY')
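
If `VOC_API_KEY` is missing from the `.env` file, `os.getenv` silently returns `None` and the first API call fails later with a less obvious error. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper of mine, not part of `python-dotenv` or `openai`.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file"
        )
    return value


# Hypothetical usage in this notebook:
# openai.api_key = require_env("VOC_API_KEY")
```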

## Dataset Selection

I've chosen the **character_descriptions.csv** dataset, which contains 50 diverse fictional characters from various media (plays, movies, operas, TV shows, musicals). This dataset is ideal for a chatbot because:

1. **Rich, descriptive text**: Each character has detailed personality traits, backgrounds, and relationships that provide substantial context for conversation
2. **Diverse contexts**: Characters span different settings (England, Texas, Australia, Italy, Ancient Greece, USA) and mediums, offering varied conversational scenarios
3. **Well-structured data**: The CSV format is easy to load and manipulate with pandas, with clear fields (Name, Description, Medium, Setting)
4. **Sufficient volume**: With 50 characters, each with a detailed description, the dataset comfortably exceeds the 20-row minimum and provides enough text to create meaningful embeddings and context for a chatbot

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [12]:
import pandas as pd


# Load character descriptions dataset
df_in = pd.read_csv('data/character_descriptions.csv')
df_in.head()

| | Name | Description | Medium | Setting |
|---|---|---|---|---|
| 0 | Emily | A young woman in her early 20s, Emily is an as... | Play | England |
| 1 | Jack | A middle-aged man in his 40s, Jack is a succes... | Play | England |
| 2 | Alice | A woman in her late 30s, Alice is a warm and n... | Play | England |
| 3 | Tom | A man in his 50s, Tom is a retired soldier and... | Play | England |
| 4 | Sarah | A woman in her mid-20s, Sarah is a free-spirit... | Play | England |


In [13]:
# Generate new df with only text column
df = pd.DataFrame({'text': df_in['Name'] + ' ' + df_in['Description'] + ' ' + df_in['Medium'] + ' ' + df_in['Setting']})
df.head()


| | text |
|---|---|
| 0 | Emily A young woman in her early 20s, Emily is... |
| 1 | Jack A middle-aged man in his 40s, Jack is a s... |
| 2 | Alice A woman in her late 30s, Alice is a warm... |
| 3 | Tom A man in his 50s, Tom is a retired soldier... |
| 4 | Sarah A woman in her mid-20s, Sarah is a free-... |
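
One thing to watch with the `+` concatenation above: if any of the four fields is missing in a row, the whole `text` value for that row becomes `NaN`. A minimal sketch of a more defensive variant is below. `build_text_column` is a hypothetical helper and the two rows are made-up toy data, not rows from the real CSV.

```python
import pandas as pd


def build_text_column(frame, columns):
    """Join the given columns into one 'text' column, skipping missing values."""
    # Replace NaN/None with empty strings so one gap does not blank the whole row
    parts = frame[columns].fillna("").astype(str)
    # Join only the non-empty pieces with single spaces
    text = parts.apply(lambda row: " ".join(v for v in row if v), axis=1)
    return pd.DataFrame({"text": text})


# Toy data (hypothetical), with a missing Medium in the second row
toy = pd.DataFrame({
    "Name": ["Ann", "Bob"],
    "Description": ["A toy description.", "Another toy description."],
    "Medium": ["Play", None],
    "Setting": ["England", "England"],
})
toy_text = build_text_column(toy, ["Name", "Description", "Medium", "Setting"])
print(toy_text["text"].tolist())
```

For the real dataset this would be called as `build_text_column(df_in, ["Name", "Description", "Medium", "Setting"])`.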


## Generate Embeddings

In [None]:
# This section was taken and adapted from Udacity "Generative AI" Course - Course 3 Lesson 4.24

import os
import numpy as np
import ast
import pandas as pd

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
embeddings_path = "data/embeddings.csv"

if os.path.exists(embeddings_path):
    # Reuse cached embeddings. The CSV stores each vector as a string,
    # so parse it back into a numpy array
    df = pd.read_csv(embeddings_path, index_col=0)
    if "embeddings" in df.columns:
        df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
    print("Used embeddings cache")
else:
    batch_size = 100
    embeddings = []
    for i in range(0, len(df), batch_size):
        # Send text data to OpenAI model to get embeddings
        response = openai.Embedding.create(
            input=df.iloc[i:i+batch_size]["text"].tolist(),
            engine=EMBEDDING_MODEL_NAME
        )

        # Add embeddings to list
        embeddings.extend([data["embedding"] for data in response["data"]])

    # Add embeddings list to dataframe
    df["embeddings"] = embeddings

    # Save the result so later runs can use the cache branch above
    df.to_csv(embeddings_path)
df.head()


Used embeddings cache


| | text | embeddings |
|---|---|---|
| 0 | Emily A young woman in her early 20s, Emily is... | [-0.018092580139636993, -0.010666043497622013,... |
| 1 | Jack A middle-aged man in his 40s, Jack is a s... | [0.004722620826214552, -0.0182148776948452, 0.... |
| 2 | Alice A woman in her late 30s, Alice is a warm... | [0.003674759529531002, -0.006788234692066908, ... |
| 3 | Tom A man in his 50s, Tom is a retired soldier... | [0.015497241169214249, -0.019064482301473618, ... |
| 4 | Sarah A woman in her mid-20s, Sarah is a free-... | [-0.016880445182323456, -0.020853864029049873,... |
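
The cache branch needs `ast.literal_eval` because a CSV file has no list type: each embedding is written out as its string representation and comes back as a plain string. A minimal sketch of that round trip follows, using an in-memory buffer and made-up three-dimensional vectors in place of the real file and the real embeddings.

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with list-valued embeddings
toy = pd.DataFrame({
    "text": ["first row", "second row"],
    "embeddings": [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]],
})

# Write to an in-memory CSV instead of data/embeddings.csv
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# After reading back, each embedding is a string such as "[0.1, -0.2, 0.3]"
restored = pd.read_csv(buffer, index_col=0)
was_string = isinstance(restored["embeddings"].iloc[0], str)

# Parse each string into a Python list, then convert it to a numpy array
restored["embeddings"] = restored["embeddings"].apply(
    lambda s: np.array(ast.literal_eval(s))
)
print(was_string, type(restored["embeddings"].iloc[0]))
```

This only works when the column held Python lists at save time; numpy arrays are written without commas, which `ast.literal_eval` cannot parse.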


## Calculate Embeddings Distance

In [None]:
# This section was taken from Udacity "Generative AI" Course - Course 3 Lesson 4.24

from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [19]:
get_rows_sorted_by_relevance("Who is Alice?", df)

| | text | embeddings | distances |
|---|---|---|---|
| 2 | Alice A woman in her late 30s, Alice is a warm... | [0.003674759529531002, -0.006788234692066908, ... | 0.126881 |
| 0 | Emily A young woman in her early 20s, Emily is... | [-0.018092580139636993, -0.010666043497622013,... | 0.156984 |
| 1 | Jack A middle-aged man in his 40s, Jack is a s... | [0.004722620826214552, -0.0182148776948452, 0.... | 0.185353 |
| 6 | Rachel A woman in her late 20s, Rachel is a sh... | [-0.006025387439876795, -0.012205769307911396,... | 0.205502 |
| 4 | Sarah A woman in her mid-20s, Sarah is a free-... | [-0.016880445182323456, -0.020853864029049873,... | 0.210428 |
| 45 | Bianca Lady Olivia's cunning and quick-witted ... | [-0.015185898169875145, -0.0216203760355711, -... | 0.214398 |
| 40 | Lady Olivia A wealthy and beautiful noblewoman... | [-0.018803969025611877, -0.019696272909641266,... | 0.218828 |
| 22 | Dolly A bubbly and vivacious performer, Dolly ... | [-0.03174891322851181, -0.020277494564652443, ... | 0.219215 |
| 11 | Sonya A white woman in her late 20s, Sonya is ... | [0.0012132300762459636, -0.030315209180116653,... | 0.220935 |
| 25 | Crystal A quirky and imaginative performer, Cr... | [-0.009569738060235977, -0.01798507198691368, ... | 0.22347 |
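
To make the `distances` column less of a black box, here is a minimal NumPy sketch of the ranking idea. It assumes cosine distance is defined as 1 minus cosine similarity and uses made-up two-dimensional vectors. `cosine_distance` and `rank_by_distance` are hypothetical helpers for illustration, not the `openai.embeddings_utils` implementation.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(similarity)


def rank_by_distance(query, candidates):
    """Return candidate indices ordered from smallest to largest distance."""
    distances = [cosine_distance(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: distances[i])


# Toy vectors (hypothetical): index 1 points almost the same way as the query
query = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
order = rank_by_distance(query, candidates)
print(order)
```

The smallest distance comes first, which is the same ordering `get_rows_sorted_by_relevance` produces with `ascending=True`.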


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [20]:
get_rows_sorted_by_relevance("What roles are in operas?", df)

| | text | embeddings | distances |
|---|---|---|---|
| 38 | Don Carlo A charming and charismatic young man... | [-0.013541321270167828, -0.011926728300750256,... | 0.197258 |
| 47 | Feste A jester and musician who works in Lady ... | [-0.008204508572816849, -0.024826284497976303,... | 0.203808 |
| 36 | Baron Gustavo A wealthy and arrogant nobleman ... | [-0.02530987374484539, -0.021352330222725868, ... | 0.204334 |
| 43 | Viola A plucky and resourceful young woman who... | [-0.011574805714190006, -0.04077444225549698, ... | 0.204785 |
| 35 | Signora Rosa A mysterious and alluring woman w... | [0.0044090780429542065, -0.006427340675145388,... | 0.204793 |
| 37 | Francesca A fiery and passionate young woman w... | [-0.0164291150867939, -0.02431611716747284, -0... | 0.207965 |
| 45 | Bianca Lady Olivia's cunning and quick-witted ... | [-0.015185898169875145, -0.0216203760355711, -... | 0.215698 |
| 46 | Sebastian Viola's twin brother, who is also sh... | [-0.010952216573059559, -0.03624306246638298, ... | 0.216578 |
| 34 | Prince Lorenzo A charming and handsome prince ... | [-0.007474577520042658, -0.013348854146897793,... | 0.220924 |
| 39 | Duke Orsino A pompous and self-important noble... | [-0.00788917113095522, -0.04058762267231941, -... | 0.221117 |


In [None]:
# This section was taken and adapted from Udacity "Generative AI" Course - Course 3 Lesson 4.24

import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
    

In [None]:
print(create_prompt("Is there a series or tv show with a role of max who has a sister Mia?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Mia A young Australian woman in her mid-20s, Mia is a driven and ambitious lawyer who's just landed her dream job at a top law firm in Sydney. She's the younger sister of Max, a former soldier who's struggling with PTSD, and is trying to help him navigate his challenges while also balancing her demanding career. Limited Series Australia

---

Question: Is there a series or tv show with a role of max who has a sister Mia?
Answer:
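
With a budget of 200 tokens only the single most relevant row made it into the prompt above. The selection logic in `create_prompt` is a greedy fill: walk the rows in relevance order and stop at the first one that would push the total over the limit. A minimal sketch of just that step is below. It uses a whitespace word count as a stand-in for `tiktoken` (an assumption made so the example runs offline), and `pack_context` and `count_words` are hypothetical names.

```python
def pack_context(texts, max_tokens, count_tokens, reserved_tokens=0):
    """Greedily keep texts, in order, until the token budget would be exceeded."""
    used = reserved_tokens  # tokens already spent on the template and question
    selected = []
    for text in texts:
        used += count_tokens(text)
        if used > max_tokens:
            break  # stop at the first row that does not fit
        selected.append(text)
    return selected


def count_words(text):
    """Stand-in token counter: counts whitespace-separated words, not real tokens."""
    return len(text.split())


# Toy rows, assumed to be already sorted from most to least relevant
rows = ["one two three", "four five", "six seven eight nine"]
picked = pack_context(rows, max_tokens=6, count_tokens=count_words, reserved_tokens=1)
print(picked)
```

Because the loop breaks instead of skipping, a long row near the top can block shorter rows behind it. That matches `create_prompt` and keeps the context strictly in relevance order.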


In [None]:
# This section was taken and adapted from Udacity "Generative AI" Course - Course 3 Lesson 4.24

COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question_with_custom_prompt(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
    
def answer_question_basic(
    question, max_answer_tokens=150
):
    """
    Given a question and a maximum number of desired tokens in the
    response, return the answer to the question according to an OpenAI
    Completion model, without any added context
    
    If the model produces an error, return an empty string
    """

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=question,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [25]:
# Basic Model

print(answer_question_basic("Is there a series or tv show with a role of max who has a sister Mia?"))

Not that matches this specific description; however, there are some shows with similar elements:

1. "Stranger Things": The character Max has a stepbrother named Billy.

2. "Better Things": The character Max has two sisters, and one of them is named "Mae".

3. "Andi Mack": The main character Andi has an older sister named Bex, who also goes by Max.

4. "The Goldbergs": The character Adam has a sister named Erica, whose full name is Erica "Eric" Goldberg.

5. "The Haunting of Hill House": One of the main characters, Luke, has a twin sister named Nell, who is often referred to as "Nellie" or


In [None]:
# Custom prompt

print(answer_question_with_custom_prompt("Is there a series or tv show with a role of max who has a sister Mia?",df,200))

Limited Series Australia


### Question 2

In [28]:
# Basic Model

print(answer_question_basic("There is a US based reality show with a fireman Jack. What was his ambition in the show?"))

It is not clear which specific reality show you are referring to, as there are multiple reality shows in which a fireman named Jack appears. Therefore, it is not possible to accurately answer this question.


In [31]:
# Custom prompt

print(answer_question_with_custom_prompt("There is a US based reality show with a firefighter Jack. What was his ambition in the show?", df, 200))

To find a partner who values honesty and integrity and is looking for a stable and committed relationship.
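
To compare more than a couple of questions, the basic and custom calls can be wrapped in a small loop. A minimal sketch is below. `compare_answers` is a hypothetical helper, and the two `fake_*` functions are stand-ins so the example runs without an API key.

```python
def compare_answers(questions, basic_fn, custom_fn):
    """Run each question through both answer functions and collect the results."""
    results = []
    for question in questions:
        results.append({
            "question": question,
            "basic": basic_fn(question),
            "custom": custom_fn(question),
        })
    return results


# Stand-in answer functions (hypothetical) so no API call is made here
def fake_basic(question):
    return "basic: " + question


def fake_custom(question):
    return "custom: " + question


report = compare_answers(["Who is Alice?"], fake_basic, fake_custom)
for row in report:
    print(row["question"], "|", row["basic"], "|", row["custom"])
```

In this notebook the real functions could be passed instead, for example `basic_fn=answer_question_basic` and `custom_fn=lambda q: answer_question_with_custom_prompt(q, df, 200)`.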
