# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

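I chose the `character_descriptions.csv` dataset because each row is a short, self-contained piece of text (a character's name, description, medium, and setting). That makes the rows easy to embed and to retrieve as context for a question. No model is trained here; the text is only embedded and inserted into the prompt.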

In [1]:
import openai
openai.api_base = "https://openai.vocareum.com/v1"
openai.api_key = "YOUR_API_KEY"  # placeholder: supply your own Vocareum key and keep real keys out of shared notebooks

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv("character_descriptions.csv")
df.head(2)

Unnamed: 0,Name,Description,Medium,Setting
0,Emily,"A young woman in her early 20s, Emily is an as...",Play,England
1,Jack,"A middle-aged man in his 40s, Jack is a succes...",Play,England


In [4]:
df["text"] = df["Name"] + " - " + df["Description"] + " - " + df["Medium"] + " - " + df["Setting"]
df = df.drop(columns=['Name', 'Description', 'Medium', 'Setting'])
df.head(2)

Unnamed: 0,text
0,"Emily - A young woman in her early 20s, Emily ..."
1,"Jack - A middle-aged man in his 40s, Jack is a..."


In [5]:
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )
    
    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings

In [6]:
df.head(2)

Unnamed: 0,text,embeddings
0,"Emily - A young woman in her early 20s, Emily ...","[-0.01616995967924595, -0.014342271722853184, ..."
1,"Jack - A middle-aged man in his 40s, Jack is a...","[0.007528048474341631, -0.020845899358391762, ..."
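
The batching loop above can be illustrated without calling the API. The following is a minimal sketch, not the notebook's actual code: `iter_batches`, `embed_in_batches`, and the stand-in `fake_embed` are hypothetical names introduced here, and `fake_embed` simply returns a made-up 2-dimensional vector per text so that the slicing logic can be checked offline.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_batch, batch_size=100):
    """Collect one embedding per text, calling `embed_batch` once per slice.

    `embed_batch` is a hypothetical callable: list[str] -> list[list[float]].
    """
    embeddings = []
    for batch in iter_batches(texts, batch_size):
        batch_embeddings = embed_batch(batch)
        # Guard against a response that does not line up with the input rows
        if len(batch_embeddings) != len(batch):
            raise ValueError("embedding count does not match batch size")
        embeddings.extend(batch_embeddings)
    return embeddings


def fake_embed(batch):
    """Stand-in for the API call: a fake 2-dimensional vector per text."""
    return [[float(len(text)), 0.0] for text in batch]


texts = [f"row {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=100)
batch_sizes = [len(batch) for batch in iter_batches(texts, 100)]
print(batch_sizes)
```

With 250 rows and a batch size of 100, the last slice is shorter than the others, which is why the loop relies on slicing rather than on a fixed number of full batches.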


In [7]:
df.to_csv("embeddings.csv")

In [8]:
import ast
import numpy as np
import pandas as pd
import openai
df = pd.read_csv("embeddings.csv", index_col=0)
# Embeddings were saved as strings; parse them safely back into arrays
df["embeddings"] = df["embeddings"].apply(ast.literal_eval).apply(np.array)
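
When a list of floats is written to CSV it is stored as its string form, so it has to be parsed again after loading. The sketch below shows that round trip with the standard library only, using toy two-element vectors rather than real embeddings; `ast.literal_eval` is used because it only accepts Python literals, unlike `eval`, which would execute arbitrary code found in the file.

```python
import ast
import csv
import io

# Toy rows standing in for the dataframe (values are made up)
rows = [
    {"text": "first row", "embeddings": [0.1, -0.2]},
    {"text": "second row", "embeddings": [0.3, 0.4]},
]

# Write to an in-memory CSV; the lists are stringified, e.g. "[0.1, -0.2]"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["text", "embeddings"])
writer.writeheader()
for row in rows:
    writer.writerow(row)

# Read the CSV back: every field is now a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]["embeddings"]

# Parse each string back into a list of floats
parsed = [ast.literal_eval(record["embeddings"]) for record in loaded]
print(type(raw_value).__name__, parsed)
```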

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

### Function that Finds Related Pieces of Text for a Given Question

In [9]:
from openai.embeddings_utils import get_embedding, distances_from_embeddings

def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """
    
    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)
    
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )
    
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy

In [10]:
get_rows_sorted_by_relevance("Who is fierce, known for high-energy performances and is difficult to work with", df)[:2]

Unnamed: 0,text,embeddings,distances
23,"Vixen - A fierce and competitive performer, Vi...","[-0.035732924938201904, -0.022343041375279427,...",0.15535
21,"Sable - A sultry and dramatic performer, Sable...","[-0.032568853348493576, -0.017705660313367844,...",0.185489
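
To make the ranking step concrete, here is a small pure-Python sketch of sorting rows by cosine distance. It assumes that the `"cosine"` metric means one minus the cosine similarity, and it uses toy 3-dimensional vectors instead of real embeddings; `cosine_distance` and `rank_by_distance` are hypothetical helper names, not part of the `openai` package.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query_vec, rows):
    """Return (distance, text) pairs sorted from smallest to largest distance.

    `rows` is a list of (text, vector) pairs.
    """
    scored = [(cosine_distance(query_vec, vec), text) for text, vec in rows]
    return sorted(scored)


rows = [
    ("points the same way as the query", [2.0, 0.0, 0.0]),
    ("orthogonal to the query", [0.0, 1.0, 0.0]),
    ("partly aligned with the query", [1.0, 1.0, 0.0]),
]
ranked = rank_by_distance([1.0, 0.0, 0.0], rows)
for distance, text in ranked:
    print(f"{distance:.4f}  {text}")
```

The vector `[2.0, 0.0, 0.0]` gets distance zero even though it is longer than the query, because cosine distance depends only on direction and not on magnitude.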


### Function that Composes a Text Prompt

In [11]:
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)

In [12]:
print(create_prompt("Who is fierce, known for high-energy performances and is difficult to work with?", df, 200))


Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

Vixen - A fierce and competitive performer, Vixen is always out to win. She's known for her aggressive lip-syncing style and high-energy performances, but can be confrontational and difficult to work with. She's also a rival of Sable, and the two often clash both on and off stage. - Musical - USA

###

Sable - A sultry and dramatic performer, Sable exudes confidence on stage. She's known for her fierce lip-syncing abilities and dramatic performances, but can sometimes be a bit too self-absorbed. She's also a rival of Donna, and the two often compete for the spotlight. - Musical - USA

---

Question: Who is fierce, known for high-energy performances and is difficult to work with?
Answer:
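
The reason only two rows fit into the 200-token prompt above is the greedy token budget. The sketch below reproduces that selection logic in isolation. It is an illustration under a loud assumption: `count_tokens` counts whitespace-separated words as a crude stand-in for the `tiktoken` tokenizer, so the numbers will not match real token counts. `select_context` is a hypothetical name.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def select_context(ranked_texts, fixed_token_count, max_token_count):
    """Greedily keep texts, in ranked order, until the budget would be exceeded.

    `fixed_token_count` covers the tokens already spent on the template and question.
    """
    used = fixed_token_count
    selected = []
    for text in ranked_texts:
        used += count_tokens(text)
        if used > max_token_count:
            # Stop at the first text that does not fit, even if a later,
            # shorter text would still fit in the remaining budget
            break
        selected.append(text)
    return selected


ranked_texts = ["a b c d", "e f g", "h i j k l", "m"]
selected = select_context(ranked_texts, fixed_token_count=2, max_token_count=10)
print(selected)
```

Stopping at the first overflow, rather than skipping ahead to shorter rows, keeps the context strictly in relevance order at the cost of sometimes leaving a few tokens unused.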


### Function that Answers a Question

In [13]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""
        

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.
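
One way to keep the two answers side by side is a small comparison helper. This is only a sketch: `compare_answers` and `format_comparison` are hypothetical names, and the two answer functions are passed in as callables so the example runs with stubs instead of real API calls. In the notebook, the custom callable could be something like `lambda q: answer_question(q, df)`.

```python
def compare_answers(question, basic_answer_fn, custom_answer_fn):
    """Ask the same question through both paths and collect the answers."""
    return {
        "question": question,
        "basic": basic_answer_fn(question),
        "custom": custom_answer_fn(question),
    }


def format_comparison(result):
    """Render a comparison as a short plain-text report."""
    return (
        f"Question: {result['question']}\n"
        f"Answer without custom query: {result['basic']}\n"
        f"Answer using custom query: {result['custom']}"
    )


# Stub callables standing in for the real model calls
def stub_basic(question):
    return "stub basic answer"


def stub_custom(question):
    return "stub custom answer"


result = compare_answers("Example question?", stub_basic, stub_custom)
report = format_comparison(result)
print(report)
```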

### Question 1

In [14]:
q1 = "Question: Which character from a movie from Texas was a successful businessman in his early 40s?"

Answer without custom query

In [15]:
initial_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=q1,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_answer)

One possible answer could be J.R. Ewing from the TV show and movie franchise "Dallas." He was a Texas oil tycoon and businessman in his early 40s.


Answer using custom query

In [16]:
answer_question(q1, df)

'Will.'

### Question 2

In [17]:
q2 = "Question: List all the characters from an ancient Greek play"

Answer without custom query

In [18]:
initial_answer = openai.Completion.create(
    model="gpt-3.5-turbo-instruct",
    prompt=q2,
    max_tokens=150
)["choices"][0]["text"].strip()
print(initial_answer)

1. Oedipus
2. Antigone
3. Creon
4. Ismene
5. Haemon
6. Tiresias
7. Jocasta
8. Eurydice
9. Messenger
10. Polyneices
11. Eteocles
12. Chorus
13. Chrysothemis
14. Aegisthus
15. Clytemnestra
16. Electra
17. Agamemnon
18. Cassandra
19. Menelaus
20. Helen


Answer using custom query

In [19]:
answer_question(q2, df)

'Feste, Viola, Duke Orsino, Lady Olivia, Sir Andrew Aguecheek, Sir Toby Belch, Malvolio.'