# Custom Chatbot Project

I've chosen the Wikipedia page for the video game [Return to Monkey Island](https://en.wikipedia.org/wiki/Return_to_Monkey_Island). The game was released in September 2022, after the September 2021 training data cut-off of the [gpt-3.5-turbo-instruct](https://platform.openai.com/docs/models/gpt-3-5-turbo) model. I've also chosen NOT to use [Wikipedia's API](https://api.wikimedia.org/wiki/API_catalog), but to scrape the page's HTML instead, as an exercise in extracting information from unstructured (or, in this case, semi-structured) data.

<font color='red'>**My implementation uses version ^1.14.3 of the OpenAI Python library, so the notebook may not work on Udacity's embedded workspaces.**</font>

In [1]:
try:
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    print("\x1b[31mdotenv is not available in this workspace. Set the OPENAI_API_KEY environment variable manually.\x1b[0m")

In [2]:
import re

from os import environ
from textwrap import dedent

import bs4
import openai
import pandas as pd
import requests
import tiktoken

from IPython.display import display, Code, Markdown
from ipywidgets import interactive, fixed, widgets
from jinja2 import Template
from scipy.spatial.distance import cosine  # OpenAI no longer provides openai.embeddings_utils

In [3]:
client = openai.Client(
    api_key=environ.get("OPENAI_API_KEY")
)

## Model configuration

In [4]:
COMPLETION_MODEL_NAME = "gpt-3.5-turbo-instruct"
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
TOKENIZER_ENCODING = "cl100k_base"
MAX_ANSWER_TOKENS = 150
MAX_ENCODED_PROMPT_LENGTH = 2048  # Half the context window size according to https://platform.openai.com/docs/models/gpt-3-5-turbo

In [5]:
tokenizer = tiktoken.get_encoding(TOKENIZER_ENCODING)

## Data Wrangling

In [6]:
response = requests.get("https://en.wikipedia.org/wiki/Return_to_Monkey_Island")
response.raise_for_status()

In [7]:
doc = bs4.BeautifulSoup(response.content, "html.parser")

In [8]:
display(Code(doc.prettify(), language="html"))

Extract the text of all the `p` elements (every paragraph, without its header) inside the `div` with id `mw-content-text`.

In [9]:
elements_text = [element.text for element in doc.select('div#mw-content-text p')]
elements_text[:5]

["Return to Monkey Island is a point-and-click adventure game developed by Terrible Toybox and published by Devolver Digital. The sixth Monkey Island game, it was released for macOS, Nintendo Switch, and Windows on September 19, 2022,[1][2][3][4][5] for Linux on October 26, 2022,[6] for PlayStation 5 and Xbox Series X/S on November 8, 2022,[7] and for iOS and Android on July 27, 2023. It was the first Monkey Island game by the series' creator, Ron Gilbert, since Monkey Island 2: LeChuck's Revenge (1991).\n",
 'Gilbert worked on the first two Monkey Island games before leaving the development company, LucasArts, in 1992. Further installments were developed by LucasArts and Telltale Games without him. The Walt Disney Company acquired the rights to Monkey Island when it purchased Lucasfilm in 2012; in 2019, Gilbert negotiated to create a new Monkey Island with the designer Dave Grossman, who had worked on the first two games. Return to Monkey Island was announced in April 2022. Dominic Ar

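Clean up the text by stripping whitespace at the edges and removing reference markers and edit links. Note that the pattern used here is greedy, so it also drops any prose that sits between the first `[` and the last `]` of a paragraph, as the first row of the output shows.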

In [10]:
elements_text = list(map(lambda t: re.sub(r"\[.*\]", "", t).strip(), elements_text))
elements_text[:5]

["Return to Monkey Island is a point-and-click adventure game developed by Terrible Toybox and published by Devolver Digital. The sixth Monkey Island game, it was released for macOS, Nintendo Switch, and Windows on September 19, 2022, and for iOS and Android on July 27, 2023. It was the first Monkey Island game by the series' creator, Ron Gilbert, since Monkey Island 2: LeChuck's Revenge (1991).",
 'Gilbert worked on the first two Monkey Island games before leaving the development company, LucasArts, in 1992. Further installments were developed by LucasArts and Telltale Games without him. The Walt Disney Company acquired the rights to Monkey Island when it purchased Lucasfilm in 2012; in 2019, Gilbert negotiated to create a new Monkey Island with the designer Dave Grossman, who had worked on the first two games. Return to Monkey Island was announced in April 2022. Dominic Armato reprised his role as the protagonist, Guybrush Threepwood. The game received generally positive reviews.',
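
The pattern `\[.*\]` used in the cleanup cell is greedy: `.*` runs up to the last `]` in the paragraph, so the Linux and console release dates that sat between the first and last reference markers are gone from the first cleaned row. A minimal sketch of a narrower pattern follows, assuming reference markers never contain a nested `]`. The name `strip_references` and the sample sentence are hypothetical and not part of the notebook.

```python
import re

# Matches one bracketed marker at a time, e.g. "[1]" or "[edit]".
REFERENCE_PATTERN = re.compile(r"\[[^\]]*\]")


def strip_references(text: str) -> str:
    """Remove bracketed markers and trim surrounding whitespace."""
    return REFERENCE_PATTERN.sub("", text).strip()


# Hypothetical sample with the same shape as the first scraped paragraph.
sample = "Released on September 19, 2022,[1][2] for Linux on October 26, 2022,[6] and more.\n"

greedy = re.sub(r"\[.*\]", "", sample).strip()  # pattern from the notebook
narrow = strip_references(sample)               # one marker at a time

print(greedy)
print(narrow)
```

Switching the notebook to the narrower pattern would change the cleaned text, and therefore the token counts, embeddings, and answers recorded further down.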
 

Create the `DataFrame` from the elements text.

In [11]:
data_df = pd.DataFrame.from_records(({"text": text} for text in elements_text))

In [12]:
data_df.head()

| | text |
|---|---|
| 0 | Return to Monkey Island is a point-and-click a... |
| 1 | Gilbert worked on the first two Monkey Island ... |
| 2 | Return to Monkey Island is a 2D point-and-clic... |
| 3 | The user interface is different from previous ... |
| 4 | The game includes a hint system designed to di... |


In [13]:
data_df.shape

(39, 1)

We've got 39 rows of text to work with. We need to add the embeddings to the data.

In [14]:
response = client.embeddings.create(
    input=data_df["text"].tolist(),
    model=EMBEDDING_MODEL_NAME
)
response

CreateEmbeddingResponse(data=[Embedding(embedding=[-0.005951838567852974, -0.06400751322507858, -0.01854289323091507, -0.021109970286488533, -0.009929504245519638, 0.012470519170165062, -0.012867960147559643, -0.022738825529813766, 0.009492971003055573, -0.0006706813001073897, 0.04451336711645126, 0.01963748410344124, 0.01936383545398712, -0.0019578845240175724, -0.006196166854351759, -0.011942769400775433, 0.01849076896905899, -0.0062841251492500305, -0.0016182680847123265, 0.011545329354703426, 0.0052253687754273415, 0.011838522739708424, -0.01563701406121254, -0.030778856948018074, 0.023142781108617783, 0.0027299621142446995, 0.008893552236258984, -0.013382677920162678, 0.024758605286478996, 0.0019318228587508202, -0.012952660210430622, 0.010991518385708332, 0.00682816281914711, -0.011330319568514824, -0.040473803877830505, -0.027651453390717506, -0.01267249695956707, 0.013721480034291744, -0.003964634612202644, -0.006124497391283512, 0.009571155533194542, 0.024993160739541054, -0.0

In [15]:
data_df["embeddings"] = [data_row.embedding for data_row in response.data]

In [16]:
data_df.head()

| | text | embeddings |
|---|---|---|
| 0 | Return to Monkey Island is a point-and-click a... | [-0.005951838567852974, -0.06400751322507858, ... |
| 1 | Gilbert worked on the first two Monkey Island ... | [-0.0011258901795372367, -0.03085152618587017,... |
| 2 | Return to Monkey Island is a 2D point-and-clic... | [-0.0041136289946734905, -0.04361627995967865,... |
| 3 | The user interface is different from previous ... | [-0.030546311289072037, -0.029696326702833176,... |
| 4 | The game includes a hint system designed to di... | [-0.022847462445497513, 0.0013908544788137078,... |


I'm going to add the encoded length (in tokens) for each row, which will make preparing the prompt for a question much easier.

In [17]:
data_df["encoded_len"] = data_df["text"].apply(lambda t: len(tokenizer.encode(t)))

In [18]:
data_df.head()

| | text | embeddings | encoded_len |
|---|---|---|---|
| 0 | Return to Monkey Island is a point-and-click a... | [-0.005951838567852974, -0.06400751322507858, ... | 94 |
| 1 | Gilbert worked on the first two Monkey Island ... | [-0.0011258901795372367, -0.03085152618587017,... | 128 |
| 2 | Return to Monkey Island is a 2D point-and-clic... | [-0.0041136289946734905, -0.04361627995967865,... | 76 |
| 3 | The user interface is different from previous ... | [-0.030546311289072037, -0.029696326702833176,... | 52 |
| 4 | The game includes a hint system designed to di... | [-0.022847462445497513, 0.0013908544788137078,... | 46 |


In [19]:
data_df.to_pickle("data_df.pickle")  # I'm using pickle instead of CSV to avoid losing information
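
The comment about losing information presumably refers to the `embeddings` column: a CSV cell can only hold text, so a list of floats comes back as a string that has to be parsed again. The sketch below illustrates this with the standard library's `csv` and `pickle` modules rather than pandas, to stay self-contained. The `row` dictionary is made-up sample data.

```python
import csv
import io
import pickle

# Made-up row shaped like one record of data_df.
row = {"text": "Return to Monkey Island ...", "embeddings": [0.1, -0.2, 0.3]}

# CSV round trip: every cell is written and read back as text.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
csv_buffer.seek(0)
csv_row = next(csv.DictReader(csv_buffer))

# Pickle round trip: Python objects keep their types.
pickle_row = pickle.loads(pickle.dumps(row))

print(type(csv_row["embeddings"]).__name__)     # the list came back as text
print(type(pickle_row["embeddings"]).__name__)  # the list is still a list
```

One trade-off to keep in mind: unpickling can execute arbitrary code, so a pickle file should only be loaded when it comes from a trusted source.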

## Custom Query Completion

In [20]:
def get_question_embeddings(question: str, openai_client: openai.OpenAI) -> list[float]:
    """
    Generate embeddings for a given question
    """
    response = openai_client.embeddings.create(  # use the client passed in, not the global one
        input=question,
        model=EMBEDDING_MODEL_NAME
    )
    return response.data[0].embedding

In [21]:
def get_context_df(
    question: str,
    data_df: pd.DataFrame,
    openai_client: openai.OpenAI
) -> pd.DataFrame:
    """
    Generate a context dataframe sorted by the cosine distance
    between the data row embeddings and the question embeddings.
    """
    question_embeddings = get_question_embeddings(question, openai_client)

    context_df = data_df.copy()
    # openai.embeddings_utils is no longer available as of the 1.0 release of the OpenAI Python library, so I based
    # my cosine distance computation on the last release that still shipped the module:
    # https://github.com/openai/openai-python/blob/v0.28.1/openai/embeddings_utils.py
    context_df["distances"] = context_df["embeddings"].apply(lambda x: cosine(question_embeddings, x))
    context_df.sort_values("distances", ascending=True, inplace=True)

    # I'm adding the cumulative sum of the encoded lengths to make it easier to filter out the rows that won't fit
    # in the prompt.
    context_df["encoded_len_cumsum"] = context_df["encoded_len"].cumsum()
    
    return context_df
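
The ranking above relies on cosine distance, which is one minus the cosine of the angle between two vectors: 0 for vectors pointing the same way, 1 for orthogonal ones, 2 for opposite ones. Here is a pure-Python sketch of that definition, using toy 2-dimensional vectors instead of real embeddings. The function name `cosine_distance` and the sample data are illustrative only; the notebook itself uses `scipy.spatial.distance.cosine`.

```python
import math


def cosine_distance(u: list[float], v: list[float]) -> float:
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy "question" and "row" vectors; real embeddings have many more dimensions.
question = [1.0, 0.0]
rows = {
    "same direction": [2.0, 0.0],
    "orthogonal": [0.0, 3.0],
    "opposite": [-1.0, 0.0],
}

# Smaller distance means more relevant, hence the ascending sort.
ranked = sorted(rows, key=lambda name: cosine_distance(question, rows[name]))
print(ranked)
```

Vector length does not affect the result, only direction does, which is why `[2.0, 0.0]` is at distance 0 from `[1.0, 0.0]`.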

In [22]:
def create_prompt(
    question: str,
    data_df: pd.DataFrame,
    openai_client: openai.OpenAI,
    max_number_of_tokens: int = MAX_ENCODED_PROMPT_LENGTH
) -> str:
    question_prompt = Template(dedent("""\
        Answer the question based on the context below, and if the question
        can't be answered based on the context, say "I don't know"
        
        Context: 
        
        {{ context|trim }}
        
        ---
        
        Q: {{ question|trim }}
        A:
        """))

    # Compute the context-less prompt size
    empty_prompt_len = len(tokenizer.encode(question_prompt.render(context="", question=question)))

    # Get the context dataframe, which includes an encoded_len_cumsum of the SORTED rows, so we can
    # just filter out the rows that won't fit in the rendered prompt template.
    context_df = get_context_df(question, data_df, openai_client)
    context_df = context_df[context_df["encoded_len_cumsum"] + empty_prompt_len < max_number_of_tokens]

    # Construct the actual context
    context = "\n\n".join(context_df["text"].values)

    # Render the prompt template and return
    return question_prompt.render(context=context, question=question) 
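
The filter in `create_prompt` keeps the closest rows until the running token total, plus the size of the empty prompt, reaches the budget. Because a cumulative sum never decreases, the boolean mask always selects a leading run of the sorted rows. The sketch below reproduces that selection in plain Python. The helper name `select_rows_within_budget` and the budget figures are hypothetical, and the token counts are only sample values.

```python
from itertools import accumulate


def select_rows_within_budget(
    encoded_lengths: list[int], empty_prompt_len: int, max_tokens: int
) -> list[int]:
    """Return indices of the leading rows whose running token total fits the budget."""
    selected = []
    for index, running_total in enumerate(accumulate(encoded_lengths)):
        if running_total + empty_prompt_len >= max_tokens:
            break  # the running total only grows, so later rows cannot fit either
        selected.append(index)
    return selected


# Sample token counts for rows already sorted by distance to the question.
lengths = [94, 128, 76, 52, 46]

# Running totals are 94, 222, 298, 350, 396; with a 50-token empty prompt
# and a 300-token budget only the first two rows fit.
print(select_rows_within_budget(lengths, empty_prompt_len=50, max_tokens=300))
```

A consequence of this design is that a long row near the top of the ranking can crowd out several shorter, slightly less relevant rows.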

## Custom Performance Demonstration

### Question 1

### *When was Return to Monkey Island released?*

In [23]:
question_1 = "When was Return To Monkey Island released?"

In [24]:
response = client.completions.create(
    model=COMPLETION_MODEL_NAME,
    prompt=question_1,
    max_tokens=MAX_ANSWER_TOKENS
)
display(Markdown(f"`{response.choices[0].text.strip()}`"))

`Return To Monkey Island was never officially released. It was a fan-made sequel to the Monkey Island series, and as such, does not have an official release date. It was created by a group of developers using the original game's engine and assets, and was made available for free download on various websites. The project was eventually abandoned and is no longer available.`

In [25]:
response = client.completions.create(
    model=COMPLETION_MODEL_NAME,
    prompt=create_prompt(question_1, data_df, client),
    max_tokens=MAX_ANSWER_TOKENS
)
display(Markdown(f"`{response.choices[0].text.strip()}`"))

`Return To Monkey Island was released for macOS, Nintendo Switch, and Windows on September 19, 2022, and for iOS and Android on July 27, 2023.`

### Question 2

### *What is the name of Guybrush Threepwood's son?*

In [26]:
question_2 = "What is the name of Guybrush Threepwood's son?"

In [27]:
response = client.completions.create(
    model=COMPLETION_MODEL_NAME,
    prompt=question_2,
    max_tokens=MAX_ANSWER_TOKENS
)
display(Markdown(f"`{response.choices[0].text.strip()}`"))

`Guybrush Threepwood does not have a son. He is fictional character from the Monkey Island series of video games, and the player controls him throughout the games. He does not have any canonical children within the game's storyline.`

In [28]:
response = client.completions.create(
    model=COMPLETION_MODEL_NAME,
    prompt=create_prompt(question_2, data_df, client),
    max_tokens=MAX_ANSWER_TOKENS
)
display(Markdown(f"`{response.choices[0].text.strip()}`"))

`Boybrush`

### Interactive session

In [29]:
def process_input(
    input_text: str,
    max_tokens: int,
    temperature: float,
    data_df: pd.DataFrame,
    openai_client: openai.OpenAI,
):
    # Use the client and the max_tokens value passed in by the widgets instead of the globals
    response = openai_client.completions.create(
        model=COMPLETION_MODEL_NAME,
        prompt=create_prompt(input_text, data_df, openai_client),
        max_tokens=max_tokens,
        temperature=temperature
    )
    display(Markdown(f"`{response.choices[0].text.strip()}`"))

In [30]:
display(
    interactive(
        process_input,
        {'manual': True, "manual_name": "Send"},
        input_text=widgets.Textarea(placeholder='Type something', description='Input:'),
        max_tokens=widgets.IntSlider(min=20, max=500, step=10, value=MAX_ANSWER_TOKENS),
        temperature=widgets.FloatSlider(min=0.0, max=2.0, value=1.0),
        data_df=fixed(data_df),
        openai_client=fixed(client)
    )
)
