# Custom Chatbot Project

TODO: In this cell, write an explanation of which dataset you have chosen and why it is appropriate for this task

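I have chosen the list of Star Wars characters from Wikipedia for this project. After cleaning, it yields 191 character entries, well above the 20-row minimum, and each entry is a short description with concrete facts (species, allegiances, relatives, fates) that questions can target.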

## Data Wrangling

TODO: In the cells below, load your chosen dataset into a `pandas` dataframe with a column named `"text"`. This column should contain all of your text data, separated into at least 20 rows.

In [57]:
import csv
import numpy as np
import pandas as pd
import re
import requests

In [2]:
# https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bextracts
params = {
    "action": "query", 
    "prop": "extracts",
    "exlimit": 1,
    "titles": "List_of_Star_Wars_characters",
    "explaintext": 1,
    "formatversion": 2,
    "format": "json"
}
resp = requests.get("https://en.wikipedia.org/w/api.php", params=params)
response_dict = resp.json()

In [9]:
extract = response_dict["query"]["pages"][0]["extract"].split('\n\n\n')
len(extract)

294

In [14]:
extract[:5]

['This incomplete list of characters from the Star Wars franchise contains only those which are considered part of the official Star Wars canon, as of the changes made by Lucasfilm in April 2014. Following its acquisition by The Walt Disney Company in 2012, Lucasfilm rebranded most of the novels, comics, video games and other works produced since the originating 1977 film Star Wars as Star Wars Legends and declared them non-canon to the rest of the franchise. As such, the list contains only information from the Skywalker Saga films, the 2008 animated TV series Star Wars: The Clone Wars, and other films, shows, or video games published or produced after April 2014.\nThe list includes humans and various alien species. No droid characters are included; for those, see the list of Star Wars droid characters. Some of the characters featured in this list have additional or alternate plotlines in the non-canonical Legends continuity. To see those or characters who do not exist at all in the cu

In [11]:
# Only want entries that have "===\n" in them
# The character name is wrapped in ===, followed by a newline and the description.
characters = [c for c in extract if '===\n' in c]
len(characters)

191
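
To make the split-and-filter step concrete, here is a minimal sketch on a made-up miniature of the extract. The `sample_extract` string is hypothetical (it is not real API output); it only mimics the layout assumed above, where sections are separated by three newlines and a character heading is immediately followed by its description.

```python
# Hypothetical miniature of the plain-text extract (not real API output)
sample_extract = (
    "Intro paragraph about the list.\n\n\n"
    "== A ==\n\n\n"
    "=== Stass Allie ===\nStass Allie is a Tholothian Jedi Master.\n\n\n"
    "=== Almec ===\nAlmec is a Mandalorian politician."
)

# Sections are separated by two blank lines, i.e. three newlines in a row
sections = sample_extract.split("\n\n\n")

# Keep only sections where a "=== Name ===" heading is directly followed
# by a newline and a description; the intro and "== A ==" are dropped
sample_characters = [s for s in sections if "===\n" in s]

print(len(sections), len(sample_characters))
```

The filter relies on the closing `===` being followed by a single newline, so a heading with no description in the same section would be skipped.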

In [16]:
characters[:2]

['=== Stass Allie ===\nStass Allie is a Tholothian Jedi Master and the cousin of Adi Gallia. Allie is one of the many victims of Order 66. She was initially planned to die with Kit Fisto.The character has been portrayed by Lily Nyamwasa in Episode III.',
 "=== Almec ===\nAlmec is a Mandalorian politician who serves as Prime Minister of Mandalore during the Clone Wars. A prominent supporter of Satine Kryze and her New Mandalorian government, he is imprisoned for his involvement in an illegal smuggling ring but is later freed and reinstated as a puppet leader after Darth Maul takes over the New Mandalorian capital city of Sundari. When Maul is later captured by Darth Sidious, Almec sends Mandalorian super commandos Gar Saxon and Rook Kast to rescue him. During the Siege of Mandalore, he is captured by Bo-Katan Kryze's force and is killed by Saxon when he attempts to relay information to Ahsoka, Rex, and Bo-Katan.The character has been voiced by Julian Holloway in The Clone Wars."]

In [49]:
# Capture the name between the === markers and the description after them
charRE = re.compile(r'=== ([^=]+) ===(.+)')

charList = []

for char in characters:
    # Flatten newlines first: '.' in charRE does not match '\n'
    char = char.replace('\n', ' ')
    m = charRE.match(char)
    if not m:
        print(f'No regex match for {char}')
        continue
    
    # Remove leading and trailing whitespace
    name = m.group(1).strip()
    desc = m.group(2).strip()
    
    charList.append(f'{name} => {desc}')

df = pd.DataFrame({'text': charList})
len(df)

191
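
As a quick check of the parsing step, the sketch below runs the same pattern on one shortened entry. The names `char_re`, `entry`, and `row` are illustrative only. It also shows why the newline has to be replaced first: without that, `.+` has nothing to match right after the closing `===`.

```python
import re

# Same pattern as in the cell above; the entry is shortened for illustration
char_re = re.compile(r"=== ([^=]+) ===(.+)")

entry = "=== Stass Allie ===\nStass Allie is a Tholothian Jedi Master."

# Without flattening, '.' cannot cross the newline, so there is no match
unflattened_match = char_re.match(entry)

# After flattening, the name and description are captured separately
m = char_re.match(entry.replace("\n", " "))
name = m.group(1).strip()
desc = m.group(2).strip()
row = f"{name} => {desc}"

print(unflattened_match)
print(row)
```

Passing `re.DOTALL` to `re.compile` would be an alternative to replacing newlines, at the cost of keeping line breaks inside the description.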

In [50]:
df

| | text |
|---|---|
| 0 | Stass Allie => Stass Allie is a Tholothian Jed... |
| 1 | Almec => Almec is a Mandalorian politician who... |
| 2 | Mas Amedda => Mas Amedda is the Chagrian Vice ... |
| 3 | Maarva Andor => Maarva Andor is a human female... |
| 4 | Raymus Antilles => Raymus Antilles is captain ... |
| ... | ... |
| 186 | Hamato Xiono => Hamato Xiono is a human male s... |
| 187 | Yaddle => Yaddle is a female member of Yoda's ... |
| 188 | Wullf Yularen => Wullf Yularen is an Imperial ... |
| 189 | Ziro the Hutt => Ziro is a Galactic Basic-spea... |


In [51]:
# Backup data
df.to_csv('./data/starwars_characters.csv', index=False)

## Custom Query Completion

TODO: In the cells below, compose a custom query using your chosen dataset and retrieve results from an OpenAI `Completion` model. You may copy and paste any useful code from the course materials.

In [52]:
import openai
# Placeholder only: set a real key locally and restore "YOUR API KEY" before submitting
openai.api_key = "YOUR API KEY"

In [55]:
# Code from course materials
# Create embeddings
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
batch_size = 100
embeddings = []
for i in range(0, len(df), batch_size):
    # Send text data to OpenAI model to get embeddings
    response = openai.Embedding.create(
        input=df.iloc[i:i+batch_size]["text"].tolist(),
        engine=EMBEDDING_MODEL_NAME
    )

    # Add embeddings to list
    embeddings.extend([data["embedding"] for data in response["data"]])

# Add embeddings list to dataframe
df["embeddings"] = embeddings
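
The batching loop above can be pictured with a small sketch that only computes the slice boundaries and makes no API call. `batch_bounds` is a hypothetical helper introduced here for illustration; it is not part of the course code.

```python
def batch_bounds(n_rows, batch_size):
    """Return the (start, stop) row ranges visited by a batched loop."""
    return [
        (start, min(start + batch_size, n_rows))
        for start in range(0, n_rows, batch_size)
    ]

# 191 rows with a batch size of 100 gives one full batch and one partial batch
bounds = batch_bounds(191, 100)
print(bounds)  # [(0, 100), (100, 191)]
```

In the cell above, `df.iloc[i:i+batch_size]` handles the final partial batch without an explicit `min`, because slicing past the end of a dataframe simply stops at the last row.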

In [58]:
# Code from course materials
# Make sure nested data structure is treated as such
df["embeddings"] = df["embeddings"].apply(np.array)

In [59]:
# Code from course materials
from openai.embeddings_utils import get_embedding, distances_from_embeddings

# Rank rows by cosine distance to the question
def get_rows_sorted_by_relevance(question, df):
    """
    Function that takes in a question string and a dataframe containing
    rows of text and associated embeddings, and returns that dataframe
    sorted from most to least relevant for that question
    """

    # Get embeddings for the question text
    question_embeddings = get_embedding(question, engine=EMBEDDING_MODEL_NAME)

    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    df_copy["distances"] = distances_from_embeddings(
        question_embeddings,
        df_copy["embeddings"].values,
        distance_metric="cosine"
    )

    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=True, inplace=True)
    return df_copy
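
For intuition about the metric used for ranking, here is a dependency-free sketch of cosine distance on toy two-dimensional vectors. It illustrates the formula (one minus cosine similarity) and is not a copy of what `distances_from_embeddings` does internally; the vectors and labels are made up.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(angle between a and b) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy "embeddings": only direction matters, not length
question_vec = [1.0, 0.0]
row_vecs = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

distances = {label: cosine_distance(question_vec, v) for label, v in row_vecs.items()}

# Ascending distance puts the most relevant row first
ranked = sorted(distances, key=distances.get)
print(ranked)
```

Distances range from 0 (same direction) to 2 (opposite direction), which is why sorting in ascending order yields the most relevant rows first.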

In [60]:
# Code from course materials
import tiktoken

def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"

Context: 

{}

---

Question: {}
Answer:"""

    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))

    context = []
    for text in get_rows_sorted_by_relevance(question, df)["text"].values:

        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count

        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break

    return prompt_template.format("\n\n###\n\n".join(context), question)
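
The context-packing loop in `create_prompt` can be sketched in isolation. To stay self-contained, this version swaps in a whitespace word count for the `tiktoken` tokenizer, so the counts are not real token counts; `pack_context` and `word_count` are hypothetical names used only for this illustration.

```python
def pack_context(rows, budget, count_tokens):
    """Keep rows in order until adding the next one would exceed the budget."""
    used = 0
    kept = []
    for text in rows:
        used += count_tokens(text)
        if used > budget:
            # Stop at the first row that does not fit
            break
        kept.append(text)
    return kept

def word_count(text):
    """Stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

# Rows are assumed to be sorted from most to least relevant already
rows = ["a b c", "d e", "f g h i", "j"]
packed = pack_context(rows, budget=5, count_tokens=word_count)
print(packed)  # ['a b c', 'd e']
```

As in `create_prompt`, the loop stops at the first row that overflows the budget, so a shorter but less relevant row further down (here `"j"`) is not considered even though it would fit.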

In [61]:
# Code from course materials
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"

def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model

    If the model produces an error, return an empty string
    """

    prompt = create_prompt(question, df, max_prompt_tokens)

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=prompt,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

## Custom Performance Demonstration

TODO: In the cells below, demonstrate the performance of your custom query using at least 2 questions. For each question, show the answer from a basic `Completion` model query as well as the answer from your custom query.

In [64]:
# Basic completion model helper function without custom prompt
def answer_without_custom_prompt(
    question, df, max_answer_tokens=150
):
    """
    Given a question and a maximum number of desired tokens in the
    response, return the answer to the bare question (no retrieved
    context) according to an OpenAI Completion model. The `df` argument
    is unused; it is kept so calls mirror `answer_question`

    If the model produces an error, return an empty string
    """

    try:
        response = openai.Completion.create(
            model=COMPLETION_MODEL_NAME,
            prompt=question,
            max_tokens=max_answer_tokens
        )
        return response["choices"][0]["text"].strip()
    except Exception as e:
        print(e)
        return ""

### Question 1

In [74]:
answer_question('Who died?', df)

'The Father, Rio Durant, Yarael Poof, Tan Divo, Jek Porkins, The Son, The Daughter, Katuunko, Onaconda Farr, Rugor Nass, Dak Ralter, Dryden Vos, Saesee Tiin, and Poggle the Lesser.'

In [75]:
answer_without_custom_prompt('Who died?', df)

'I cannot answer that question without more information.'

### Question 2

In [67]:
answer_question('Who are siblings?', df)

'Owen Lars and Beru Lars are step-siblings. Rafa and Trace Martez are siblings. Steela Gerrera and Saw Gerrera are siblings.'

In [68]:
answer_without_custom_prompt('Who are siblings?', df)

'Siblings are individuals who share one or both parents, either biologically or through adoption. They are brothers and sisters who are part of the same family and are related by blood or legal familial ties. Siblings can have varying relationships, but they share a common bond and are usually raised together in the same household.'

## Interactive Question/Answering

In [71]:
while True:
    q = input('Ask your question (q/quit to exit): ')
    if q.lower() in ['q', 'quit']:
        break
    a = answer_question(q, df)
    print(f'WITH CUSTOM PROMPT: {a}')
    a_wo_custom_prompt = answer_without_custom_prompt(q, df)
    print(f'WITHOUT CUSTOM PROMPT: {a_wo_custom_prompt}')
    print('\n\n=====\n\n')
    

Ask your question (q/quit to exit): Who are aliens?
WITH CUSTOM PROMPT: The Bendu is an ancient Force-wielder whose philosophy predates the Jedi Order; encountered by the rebels on the planet Atollon, where he describes himself as being "the middle" between the ashla, light side of the Force, and the bogan, dark side of the Force.
WITHOUT CUSTOM PROMPT: Aliens are hypothetical extraterrestrial beings from other planets or galaxies that are believed to exist beyond Earth. They are often depicted in science fiction as intelligent, technologically advanced beings that may have different physical characteristics and abilities than humans. There is currently no scientific evidence to prove the existence of aliens, and any claims or sightings of extraterrestrial life have not been confirmed.


=====


Ask your question (q/quit to exit): Who had their arm cut off?
WITH CUSTOM PROMPT: Ponda Baba
WITHOUT CUSTOM PROMPT: It is unclear who had their arm cut off. Please provide more context for the