# Custom Chatbot Project

# Dataset Selection

In this project, I am building a custom chatbot by leveraging OpenAI's language models and a custom dataset. The dataset chosen is *character_descriptions.csv*, which contains structured information about various fictional characters including their name, description, medium, and setting.

## Why this dataset?

This dataset is ideal for building a chatbot that can answer questions about different characters. The structured fields allow us to generate meaningful narratives for embeddings and support a variety of character-specific queries. Additionally, this dataset includes more than 20 rows, satisfying the requirement for a rich context base.

# Sample Prompt Creation

I randomly select two characters from the dataset and use their names to build basic natural-language questions. This simulates user input and tests the chatbot's capabilities early in the workflow.

# Data Wrangling

## Setting up OpenAI API
Sets the API base to the custom endpoint provided by Vocareum (used in Udacity's workspace) and initializes the API key to access OpenAI's services.

In [1]:
import openai
openai.api_base="https://openai.vocareum.com/v1"
openai.api_key="YOUR API KEY"

## Loading and Processing Dataset
Loads the dataset and combines relevant character details into a single text column (required for embedding and RAG logic). Keeps only the text column to simplify later processing.

In [2]:
import pandas as pd

# Load the CSV file
file_path='data/character_descriptions.csv'
character_data=pd.read_csv(file_path)

def combine_columns(row):
    text = f"{row['Name']} is a {row['Description']} This character appears in a {row['Medium']} set in {row['Setting']}."
    return text

character_data['text']=character_data.apply(combine_columns, axis=1)
combined_df=character_data[['text']]
combined_df.head(30)

Unnamed: 0,text
0,"Emily is a A young woman in her early 20s, Emi..."
1,"Jack is a A middle-aged man in his 40s, Jack i..."
2,"Alice is a A woman in her late 30s, Alice is a..."
3,"Tom is a A man in his 50s, Tom is a retired so..."
4,"Sarah is a A woman in her mid-20s, Sarah is a ..."
5,"George is a A man in his early 30s, George is ..."
6,"Rachel is a A woman in her late 20s, Rachel is..."
7,"John is a A man in his 60s, John is a retired ..."
8,Maria is a A middle-aged Latina woman in her 4...
9,Caleb is a A young African American man in his...
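
The preview above shows every row starting with "<Name> is a A ..." because each `Description` value already begins with its own article. Below is a minimal sketch of one way to avoid the doubled article, assuming the same four column names. `describe_character` and the example row are hypothetical and not part of the original notebook. Switching to this template would change the stored text, and therefore the embeddings and outputs shown later.

```python
def describe_character(row):
    """Build one text entry from a row (a dict or a pandas Series)."""
    description = str(row["Description"]).strip()
    # Make sure the description ends with sentence punctuation before appending more text
    if not description.endswith((".", "!", "?")):
        description += "."
    return (
        f"{row['Name']}: {description} "
        f"This character appears in a {row['Medium']} set in {row['Setting']}."
    )


# Hypothetical row with made-up values, shaped like the dataset's columns
example = {
    "Name": "Example Name",
    "Description": "A made-up character used only for illustration",
    "Medium": "Play",
    "Setting": "England",
}
print(describe_character(example))
```

With pandas, the same function could be applied row by row via `character_data.apply(describe_character, axis=1)`.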


## Generating Sample Prompts and Running Basic Q&A
Randomly samples two characters from the dataset, creates prompt questions using their names, and retrieves answers using OpenAI’s gpt-3.5-turbo-instruct. This represents a basic query, without custom context.

In [3]:
import random

# Randomly select two rows from the dataframe
sampled_rows=combined_df.sample(2).reset_index(drop=True)

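# Take everything before " is a " so multi-word names (e.g. "Sir Toby Belch") stay intact
character_1=sampled_rows.iloc[0]['text'].split(" is a ")[0]
character_2=sampled_rows.iloc[1]['text'].split(" is a ")[0]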

# Generate two questions based on the selected characters
prompt1=f"What is {character_1}'s profession?"
prompt2=f"In what setting does {character_2}'s story take place?"

print(f'Prompt1: {prompt1}')

answer1 = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt1,
    max_tokens=150
)["choices"][0]["text"].strip()
print(answer1)

print(f'Prompt2: {prompt2}')

answer2=openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt2,
    max_tokens=150
)["choices"][0]["text"].strip()
print(answer2)

Prompt1: What is Will's profession?
It is not specified what Will's profession is.
Prompt2: In what setting does Alice's story take place?
Alice's story takes place in a fantastical and surreal world known as Wonderland.


## Embedding Generation

Here, we use OpenAI’s `text-embedding-ada-002` model to create dense vector representations (embeddings) of each text entry. These embeddings capture semantic meaning and are later used to determine relevance during querying.

In [4]:
EMBEDDING_MODEL_NAME="text-embedding-ada-002"
batch_size=100
embeddings=[]
for i in range(0, len(combined_df), batch_size):
    response = openai.Embedding.create(
        input=combined_df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    embeddings.extend([data["embedding"] for data in response["data"]])

# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
combined_df=combined_df.assign(embeddings=embeddings)
combined_df

combined_df.to_csv('character_descriptions_with_embeddings.csv')
len(combined_df['embeddings'][0])
!ls


Custom_Chatbot_Documented_FINAL.ipynb	    data
character_descriptions_with_embeddings.csv  project.ipynb
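
The embedding cell sends the texts to the API in slices of `batch_size` rows. Below is a minimal offline sketch of that slicing logic alone, with no API call. `iter_batches` is a hypothetical helper and the 250 placeholder strings stand in for `combined_df["text"].tolist()`.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder texts; in the notebook this would be the list of row texts
texts = [f"row {i}" for i in range(250)]
batch_lengths = [len(batch) for batch in iter_batches(texts, 100)]
print(batch_lengths)  # [100, 100, 50]
```

The last batch is simply shorter when the row count is not a multiple of `batch_size`, so no row is dropped.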


# Custom Query Completion

## Load CSV with Embeddings
Reloads the CSV with saved embeddings and converts the text-form embeddings back into NumPy arrays for distance calculations.

In [5]:
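import ast
import numpy as np
import pandas as pd

file_path='character_descriptions_with_embeddings.csv'
df=pd.read_csv(file_path, index_col=0)
# Parse the stringified lists with ast.literal_eval, which only accepts literals (safer than eval)
df["embeddings"]=df["embeddings"].apply(ast.literal_eval).apply(np.array)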
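
The parsing step is needed because a CSV file stores each embedding list as its string form. Below is a small stdlib-only sketch of that round trip, using `ast.literal_eval` as the parser. The three-element vector is a made-up stand-in for a real 1536-dimension embedding.

```python
import ast
import csv
import io

# Write one row whose second field is a list rendered as a string,
# which is how a list-valued column ends up in a CSV file
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["text", "embeddings"])
writer.writerow(["example row", str([0.1, -0.2, 0.3])])

# Read it back: the embedding field is now a plain string
buffer.seek(0)
rows = list(csv.DictReader(buffer))
raw_value = rows[0]["embeddings"]
print(type(raw_value).__name__, raw_value)

# literal_eval parses Python literals only, so it cannot run arbitrary code
parsed = ast.literal_eval(raw_value)
print(type(parsed).__name__, parsed)
```

After parsing, the list can be wrapped in `np.array` as the notebook does before computing distances.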

## Define get_rows_sorted_by_relevance()
Calculates the cosine distance between the embedding of a user’s question and each row’s text embedding. Returns the dataframe sorted by ascending distance, that is, from most relevant to least relevant.

In [6]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    question_embeddings=get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    df_copy=df.copy()
    df_copy["distances"]=distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
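
To make the ranking concrete, here is a minimal pure-Python sketch of cosine distance and the ascending sort. It assumes the "cosine" metric means 1 minus cosine similarity (SciPy's definition). `cosine_distance` and the toy 2-D vectors are illustrative and not taken from the notebook.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy 2-D "embeddings"; real text-embedding-ada-002 vectors have 1536 dimensions
question_vec = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "45 degrees away": [1.0, 1.0],
    "orthogonal": [0.0, 1.0],
}

# Smallest distance first, i.e. most relevant first
ranked = sorted(rows, key=lambda name: cosine_distance(question_vec, rows[name]))
for name in ranked:
    print(name, round(cosine_distance(question_vec, rows[name]), 3))
```

Vector length does not matter here, only direction: `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.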

## Check shape of first embedding
Verifies that each embedding is a fixed-size vector (1536 dimensions for text-embedding-ada-002).

In [7]:
df['embeddings'][0].shape

(1536,)

## Test Relevance-Based Retrieval

Demonstrates relevance-based retrieval. Retrieves the top few character descriptions most relevant to the input question.

In [8]:
question1="What is Malvolio's profession?"
sorted_df1=get_rows_sorted_by_relevance(question1, df)
sorted_df1.head(5)

Unnamed: 0,text,embeddings,distances
42,Malvolio is a A pompous and self-righteous ste...,"[-0.015081856399774551, -0.03434662148356438, ...",0.112745
45,Bianca is a Lady Olivia's cunning and quick-wi...,"[-0.015193937346339226, -0.02585979364812374, ...",0.162383
43,Viola is a A plucky and resourceful young woma...,"[-0.011117728427052498, -0.04259764403104782, ...",0.168999
40,Lady Olivia is a A wealthy and beautiful noble...,"[-0.019477209076285362, -0.0233752503991127, -...",0.185655
41,Sir Toby Belch is a A drunken and lecherous kn...,"[-0.0015814456855878234, -0.03271515667438507,...",0.192094


In [9]:
question2="In what setting does Karma's story take place?"
sorted_df2=get_rows_sorted_by_relevance(question2, df)
sorted_df2.head(5)

Unnamed: 0,text,embeddings,distances
24,"Karma is a A chameleon-like performer, Karma i...","[-0.00669951643794775, -0.020639097318053246, ...",0.185831
11,"Sonya is a A white woman in her late 20s, Sony...","[0.0022304225713014603, -0.026824770495295525,...",0.23197
4,"Sarah is a A woman in her mid-20s, Sarah is a ...","[-0.01749402843415737, -0.02170724980533123, -...",0.235102
3,"Tom is a A man in his 50s, Tom is a retired so...","[0.014993906952440739, -0.010453866794705391, ...",0.239333
8,Maria is a A middle-aged Latina woman in her 4...,"[-0.0096737090498209, -0.011428854428231716, -...",0.240673


## Prompt Engineering Logic

Builds a prompt that combines multiple relevant text chunks into a single context, ensuring that the total token count stays within model limits. This prepares input for a context-aware custom query.

In [10]:
import tiktoken

prompt_template="""
Answer the question based on the context given below, and if the question
is unanswerable or not relevant to the provided data, just say "Sorry. I don't know. Please provide some more data."

Context:

{}

***************************

Question: {}
Answer:"""


def create_prompt(question, df, max_token_count, prompt_template=prompt_template):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer=tiktoken.get_encoding("cl100k_base")


    current_token_count=len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context=[]
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count=len(tokenizer.encode(text))
        current_token_count+=text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count<=max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n--------------------------------------------------\n\n".join(context), question)
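
The budget logic in `create_prompt()` can be seen in isolation with a small offline sketch. It assumes a crude word count in place of `len(tokenizer.encode(...))`, so the numbers are illustrative only. `select_context` and `count_words` are hypothetical names.

```python
def select_context(texts, question, template, max_tokens, count_tokens):
    """Greedily keep texts (already sorted by relevance) until the budget is spent."""
    used = count_tokens(template) + count_tokens(question)
    selected = []
    for text in texts:
        cost = count_tokens(text)
        if used + cost > max_tokens:
            break  # stop at the first row that does not fit
        selected.append(text)
        used += cost
    return selected, used


def count_words(text):
    # Crude stand-in for a real tokenizer; actual token counts will differ
    return len(text.split())


ranked_texts = ["one two three", "four five", "six seven eight nine", "ten"]
selected, used = select_context(
    ranked_texts,
    "who is it",
    "Context: Question: Answer:",
    max_tokens=12,
    count_tokens=count_words,
)
print(selected, used)
```

Because the loop stops at the first row that exceeds the budget, a shorter row further down the ranking (here `"ten"`) is not considered. This keeps the context strictly in relevance order.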

## Show Prompts for 2 Questions

Displays the final custom prompt generated for the questions, showing how the model will be guided to use context.

In [11]:
# create prompt for question 1
max_token_count=300
print(create_prompt(question1, df, max_token_count))


# create prompt for question 2
max_token_count=300
print(create_prompt(question2, df, max_token_count))

COMPLETION_MODEL_NAME="gpt-3.5-turbo-instruct"


Answer the question based on the context given below, and if the question
is unanswerable or not relevant to the provided data, just say "Sorry. I don't know. Please provide some more data."

Context:

Malvolio is a A pompous and self-righteous steward in Lady Olivia's household. Malvolio is humorless and uptight, and is often the target of Sir Toby Belch's pranks. He is secretly in love with Lady Olivia and harbors dreams of marrying her. This character appears in a Play set in Ancient Greece.

--------------------------------------------------

Bianca is a Lady Olivia's cunning and quick-witted maid. Bianca is a master of mischief and pranks, and often collaborates with Sir Toby Belch to torment Malvolio. She is also secretly in love with Sir Toby. This character appears in a Play set in Ancient Greece.

--------------------------------------------------

Viola is a A plucky and resourceful young woman who is shipwrecked on the coast of Illyria. Viola disguises herself as a man, tak

## Define Final Answering Function

Runs a Completion query using a prompt created by create_prompt(). This is the core function for producing answers with the custom RAG-like system.

In [12]:
def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response,
    Return the answer to the question according to an OpenAI Completion model.
    If the model produces an error, return an empty string.
    """

    prompt=create_prompt(question, df, max_prompt_tokens)

    try:
        response=openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

answer1=answer_question(question1, df)
print(answer1)

answer2=answer_question(question2, df)
print(answer2)

Steward
A Musical set in USA.


# Custom Performance Demonstration


## Basic completion for question 1

Asks a basic question with no custom context. This shows how the model performs without dataset guidance.

In [13]:
question1="What is Tom's profession?"

answer1=openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question1,
    max_tokens=150
)["choices"][0]["text"].strip()
print(answer1)

It is not specified what Tom's profession is.


## Custom Completion for question 1

Asks the same question using context-aware prompt generation. Allows comparison with the basic model output.

In [14]:
custom_chatbot_answer1=answer_question(question1, df)
print(custom_chatbot_answer1)

Tom is a retired soldier.


## Basic completion for question 2

Asks a basic question with no custom context. This shows how the model performs without dataset guidance.

In [15]:
question2="In what setting does Thomas's story take place?"

answer2=openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=question2,
    max_tokens=150
)["choices"][0]["text"].strip()
print(answer2)

The story takes place in a futuristic, dystopian society called the Glade. The Glade is a large, enclosed area surrounded by giant walls and populated by a community of teenage boys known as the Gladers.


## Custom Completion for question 2

Asks the same question using context-aware prompt generation. Allows comparison with the basic model output.

In [16]:
custom_chatbot_answer2=answer_question(question2, df)
print(custom_chatbot_answer2)

Thomas's story takes place in a Sitcom set in USA.
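
The four cells in this section repeat one pattern: ask the same question with and without context. Below is a minimal sketch of a helper that collects both answers side by side. `compare_answers`, `fake_basic`, and `fake_custom` are hypothetical; the stubs only exist so the sketch runs offline.

```python
def compare_answers(questions, basic_fn, custom_fn):
    """Return one dict per question holding both answers side by side."""
    results = []
    for question in questions:
        results.append({
            "question": question,
            "basic": basic_fn(question),
            "custom": custom_fn(question),
        })
    return results


# Stub callables so the sketch runs without an API key. In the notebook these
# would wrap the plain Completion call and answer_question(question, df).
def fake_basic(question):
    return "basic answer to: " + question


def fake_custom(question):
    return "custom answer to: " + question


for row in compare_answers(["What is Tom's profession?"], fake_basic, fake_custom):
    print(row["question"])
    print("  basic :", row["basic"])
    print("  custom:", row["custom"])
```

Passing the two answer functions as arguments keeps the comparison logic independent of any particular model or API version.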


# Conclusion

## Scenario Demonstration

I tested the chatbot with two sample questions, comparing the output for each:
* Using a basic prompt (no custom context)
* Using our custom prompt with dataset context

This helps evaluate the value of tailored context in improving answer accuracy.

## Results and Discussion

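Answers generated from custom prompts that included dataset-specific context were more accurate and relevant than those from basic prompts. Without context, the model could not state Tom's profession and placed Thomas in an unrelated story set in "the Glade". With context, it answered "Tom is a retired soldier" and "a Sitcom set in USA", both drawn from the dataset.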

This highlights the advantage of embedding-based similarity search combined with prompt engineering for building specialized chatbots.