# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

## I have chosen the local CSV files as my dataset for these reasons: 

1. The information seems clean and tidy, so I can retrieve the data I need without much cleanup work.
2. The dataset files are very small, which makes them easy to share (e.g. for reuse in another project).
3. Owning these dataset files means we can amend them as needed (e.g. add new records or correct information within the files).

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [26]:
from dotenv import load_dotenv
import os

# Read variables from a local .env file (if present) into the environment
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")


In [27]:
import pandas as pd 

# load the original csv file 
CSV_FILE_NAME = "character_descriptions"
CSV_FILE_EXE = ".csv"

origin_df = pd.read_csv(CSV_FILE_NAME + CSV_FILE_EXE) 
origin_df.head()


Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England
2,Alice,"A woman in her late 30s, Alice is a warm and n...",Play,England
3,Tom,"A man in his 50s, Tom is a retired soldier and...",Play,England
4,Sarah,"A woman in her mid-20s, Sarah is a free-spirit...",Play,England


In [28]:
from pandas import DataFrame 

# copy all the descriptions from the original csv file into the 'text' column
my_df = DataFrame()
my_df["text"] = origin_df["Description"]
my_df.head()


Unnamed: 0,text
0,"A young woman in her early 20s, Emily is an as..."
1,"A middle-aged man in his 40s, Jack is a succes..."
2,"A woman in her late 30s, Alice is a warm and n..."
3,"A man in his 50s, Tom is a retired soldier and..."
4,"A woman in her mid-20s, Sarah is a free-spirit..."
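
The task above asks for a `"text"` column with at least 20 rows. A minimal sketch of a sanity check for that requirement follows; `has_enough_text_rows` is a hypothetical helper, not part of the course materials, and the demo frames are made-up stand-ins for `my_df`.

```python
import pandas as pd


def has_enough_text_rows(df: pd.DataFrame, min_rows: int = 20) -> bool:
    """Return True if df has a 'text' column with at least min_rows non-empty rows."""
    if "text" not in df.columns:
        return False
    # Drop missing values and whitespace-only strings before counting
    texts = df["text"].dropna().astype(str).str.strip()
    return int((texts != "").sum()) >= min_rows


# Made-up stand-in data for illustration only
demo_ok = pd.DataFrame({"text": [f"Description {i}" for i in range(20)]})
demo_short = pd.DataFrame({"text": ["Only one row"]})

print(has_enough_text_rows(demo_ok))
print(has_enough_text_rows(demo_short))
```

In the notebook itself, the same check could be run as `has_enough_text_rows(my_df)` right after the `"text"` column is created.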


## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [29]:
# OpenAI setup

from openai import OpenAI

# Pass the key explicitly so the client does not depend on module-level state
ai_client = OpenAI(api_key=OPENAI_API_KEY)


In [31]:

# Create embeddings record for each text row 

EMBEDDING_MODEL_NAME = "text-embedding-ada-002"

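def get_embedding(text: str, model: str = EMBEDDING_MODEL_NAME):
    # Replace newlines with spaces before sending the text to the embeddings endpoint
    text = text.replace("\n", " ")
    return ai_client.embeddings.create(input=[text], model=model).data[0].embedding


# One API call per row; each result is a list of floats
my_df["embeddings"] = my_df["text"].apply(get_embedding)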

print("Embeddings are ready.")

Embeddings are ready.


In [33]:
# Save the embedding records to a separate CSV file

EMBEDDINGS_FILE_NAME = CSV_FILE_NAME + "_embeddings" + CSV_FILE_EXE
my_df.to_csv(EMBEDDINGS_FILE_NAME)
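
One thing to keep in mind when reloading this file: `to_csv` writes each embedding list as its string representation, so `read_csv` returns strings rather than lists of floats. Below is a minimal sketch of the round trip, assuming the embeddings were stored as plain Python lists. The short vectors and the in-memory buffer are stand-ins for the real embeddings and the real file.

```python
import ast
import io

import pandas as pd

# Stand-in frame with short made-up vectors instead of real embeddings
demo_df = pd.DataFrame({
    "text": ["first row", "second row"],
    "embeddings": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Write to an in-memory buffer here; a file path works the same way
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)

loaded_df = pd.read_csv(buffer)
# Each cell is now a string such as "[0.1, 0.2, 0.3]"; parse it back into a list
loaded_df["embeddings"] = loaded_df["embeddings"].apply(ast.literal_eval)

print(type(loaded_df["embeddings"][0]))
```

Passing `index=False` in this sketch avoids an extra `Unnamed: 0` column on reload; the cell above keeps the pandas default, which writes the index as well.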


In [41]:
import tiktoken
from scipy.spatial.distance import cosine

# define a tokenizer 
tokenizer = tiktoken.get_encoding("cl100k_base")

MAX_TOKEN_COUNT = 1024


# Return a new DataFrame with a new 'distances' column based on the question
def make_distances_df_with(question: str, 
                           df: DataFrame):
    
    question_embeddings = get_embedding(question)

    new_df = df.copy()
    new_df["distances"] = new_df["embeddings"].apply(lambda x: cosine(question_embeddings, x))
    new_df.sort_values(by="distances", ascending=True, inplace=True)
    return new_df
    

# Create a custom prompt based on the provided question and a dataframe: df, 
# the df should have a 'distances' column, which records the cosine distance for each row's text against the question
def create_custom_prompt(question: str, 
                         df: DataFrame, 
                         max_token_count: int = MAX_TOKEN_COUNT):
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in df["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

    
# Query openAI with a prompt and get the response
def query_openAI_with(prompt: str): 
    response = ai_client.chat.completions.create(
      model="gpt-3.5-turbo",
      messages=[
        {"role": "system", "content": "You are a helpful assistant who is familiar with all characters from movies, TV shows, Play, Limited Series, Musical, Reality Show, Opera & Sitcom."},
        {"role": "user", "content": prompt}
      ]
    )
    return response.choices[0].message.content
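
To make the retrieval and prompt-building steps above concrete without calling the API, here is a minimal sketch that uses made-up 2-dimensional vectors in place of real embeddings and a whitespace word count in place of the `tiktoken` tokenizer. Both substitutions are assumptions for illustration only, and `cosine_distance`, `rank_by_distance`, `pack_context`, and `count_tokens` are hypothetical names. The distance is computed with NumPy as 1 minus the cosine similarity, which is the same quantity `scipy.spatial.distance.cosine` returns.

```python
import numpy as np
import pandas as pd


def count_tokens(text: str) -> int:
    # Stand-in for len(tokenizer.encode(text)): counts whitespace-separated words
    return len(text.split())


def cosine_distance(a, b) -> float:
    # 1 - cosine similarity; 0 means same direction, 1 means orthogonal
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def rank_by_distance(question_vec, df: pd.DataFrame) -> pd.DataFrame:
    # Rows closest to the question vector come first
    ranked = df.copy()
    ranked["distances"] = ranked["embeddings"].apply(
        lambda v: cosine_distance(question_vec, v)
    )
    return ranked.sort_values(by="distances", ascending=True)


def pack_context(df: pd.DataFrame, budget: int) -> list:
    # Take rows in ranked order until the next row would exceed the budget
    context, used = [], 0
    for text in df["text"].values:
        used += count_tokens(text)
        if used > budget:
            break
        context.append(text)
    return context


# Made-up rows and vectors for illustration only
demo_df = pd.DataFrame({
    "text": ["a retired soldier", "a single mother who runs a diner", "a young artist"],
    "embeddings": [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
})

ranked = rank_by_distance([0.1, 1.0], demo_df)
print(ranked["text"].tolist())
print(pack_context(ranked, budget=10))
```

With a budget of 10 words, the two closest rows fit and the third is dropped, which mirrors how `create_custom_prompt` stops adding rows once `max_token_count` would be exceeded.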
    

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

### Question 1

In [54]:

# Answer before providing more context

origin_question_1 = "Which character was a former soldier served in Afghanistan before and is Ava's Godson?"

print("Answering before providing more context:")
print(query_openAI_with(prompt=origin_question_1))


Answering before providing more context:
That character is Buck Vu from the TV show "Station 19." He is a former soldier who served in Afghanistan and is Ava's godson in the series.


In [55]:

# Answer after providing more context 

distance_df = make_distances_df_with(question=origin_question_1, df=my_df)

prompt_question = create_custom_prompt(question=origin_question_1, 
                                       df=distance_df)

print("Answering after providing more context:")
print(query_openAI_with(prompt=prompt_question))


Answering after providing more context:
Max is the character who was a former soldier served in Afghanistan before and is Ava's godson.


### Question 2

In [64]:

# Answer before providing more context

origin_question_2 = "Which character is a single mother running business in Texas town?"
print("Answering before providing more context: ")
print(query_openAI_with(prompt=origin_question_2))

Answering before providing more context: 
The character you're looking for is Lorelai Gilmore from the TV show "Gilmore Girls." She is a single mother and the manager of the Independence Inn in the fictional town of Stars Hollow, Connecticut. Lorelai is known for her wit, charm, and close relationship with her daughter, Rory.


In [66]:
# Answer after providing more context 

distance_df_2 = make_distances_df_with(question=origin_question_2, df=my_df)

prompt_question_2 = create_custom_prompt(question=origin_question_2, 
                                         df=distance_df_2)

print("Answering after providing more context:")
print(query_openAI_with(prompt=prompt_question_2))

Answering after providing more context:
Maria, the middle-aged Latina woman in her 40s, is the single mother running a small family-run diner in a small Texas town.
