# Custom Chatbot Project

This notebook demonstrates that the answer quality of a large language model can be improved considerably without any further training. It takes an older OpenAI model and supplies it with data that is unlikely to be part of its training set, then compares the answers of the plain ("vanilla") model with those of the context-enhanced version. To reduce the chance that the data was seen during training, a relatively new and niche topic is chosen.

The data will be about Zynthian user guides. Zynthian (https://zynthian.org/) is a young open source project that turns a Raspberry Pi into a synthesizer instrument.

The Zynthian user guide data is shared under the Creative Commons license CC BY-SA 3.0, and so is this code.

[![Creative Commons](https://wiki.zynthian.org/resources/assets/licenses/cc-by-sa.png)](https://creativecommons.org/licenses/by-sa/3.0/)




## Data Wrangling

First, the Zynthian-specific information needs to be downloaded, cleaned and prepared so that it can later be ranked and sorted by relevance to arbitrary questions.

In [1]:
import os
import json
import requests
import tiktoken
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.metrics.pairwise import cosine_similarity
import openai

Define notebook-wide settings for interacting with OpenAI's models.

__Important Note__: please add your API key to the `open_ai.api_key` file or assign it directly to the `key_str` variable below.

In [2]:
TOKENIZER = tiktoken.get_encoding("cl100k_base")
EMBEDDING_MODEL_NAME = "text-embedding-ada-002"
COMPLETION_MODEL_NAME = "gpt-3.5-turbo"

with open("../open_ai.api_key", "r") as keyfile:
    key_str = keyfile.read().replace("\n", "").replace("\r", "").replace(" ", "")
    assert len(key_str) > 0, "Can't continue without specifying an API key for OpenAI"

CLIENT = openai.OpenAI(api_key=key_str)
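
As an alternative to keeping the key in a file next to the notebook, it could be read from an environment variable first. The following is a minimal sketch under stated assumptions: the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are choices made for this illustration and are not part of the notebook.

```python
import os
from pathlib import Path


def load_api_key(key_file: str = "../open_ai.api_key",
                 env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, else from the key file."""
    # Prefer the environment variable so the key does not have to live in a file
    key = os.environ.get(env_var, "").strip()
    if not key and Path(key_file).is_file():
        # Fall back to the key file used by this notebook
        key = Path(key_file).read_text().strip()
    if not key:
        raise RuntimeError(f"No API key found in ${env_var} or {key_file}")
    return key
```

With such a helper, the client could be created as `openai.OpenAI(api_key=load_api_key())`. Unlike the cell above, the sketch only strips surrounding whitespace and raises an exception instead of using `assert`.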

### Get the raw context data

Zynthian offers a Wiki from which the user guides can be fetched. This HTML data needs to be converted into human-readable strings and cleaned of duplicates and empty entries.

In [3]:
def extract_zynthian_context() -> pd.DataFrame:
    """
    This function provides essential Zynthian user guide information, 
    extracted from all <p></p> sections of its Wiki.
    
    Args:
        None
    Returns:
        pd.DataFrame: cleaned DataFrame with the extracted data in the "text" column name
    """
    
    # define set of webpages with user guides
    landing_page = "https://wiki.zynthian.org/index.php"
    user_guide_pages =[
        "Zynthian_UI_User%27s_Guide_-_Oram",
        "ZynSeq_User_Guide",
        "ZynSampler_User_Guide",
        "Web_Configuration_User_Guide",
        "Supported_plug_%26_play_MIDI_controllers",
    ]

    # download & extract information from said web pages
    all_paragraphs = list()
    for user_guide_page in user_guide_pages:
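        # build the URL explicitly; os.path.join would insert backslashes on Windows
        subpage_response = requests.get(f"{landing_page}/{user_guide_page}")
        # name the parser explicitly to avoid bs4's "no parser specified" warning
        soup = BeautifulSoup(subpage_response.text, "html.parser")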
        user_guide_p = soup.find_all("p")    

        for cur_p in user_guide_p:
            all_paragraphs.append(cur_p.text.replace("\n", ""))


    # turn it to a dataframe and apply data cleaning
    df = pd.DataFrame()
    df["text"] = all_paragraphs
    df = df[df["text"] != ""]
    df = df.drop_duplicates()
    
    return df
    
CONTEXT_DF = extract_zynthian_context()
CONTEXT_DF  # let's peek into the raw data

Unnamed: 0,text
0,V5 |V4 |Touch
1,Zynthian Oram is the most recent version of zy...
2,It is strongly recommended that you read secti...
3,"This guide is a living document, subject to fr..."
4,The fundamental building block of zynthian's s...
...,...
527,and some example drivers here:
528,zynthian-ui/zyngine/ctrldev/ · GitHub
529,All drivers must inherit from zynthian_ctrldev...
530,"Quite simple, right? This is for basic “generi..."
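
The cleaning step can be illustrated without any download. The sketch below uses plain Python instead of pandas and a few invented sample strings; `clean_paragraphs` is a hypothetical helper. It deviates slightly from the cell above: it joins line breaks with a space and also drops whitespace-only entries, which `df["text"] != ""` does not catch.

```python
def clean_paragraphs(paragraphs):
    """Normalise whitespace, drop empty entries and remove duplicates (order preserved)."""
    # Collapse line breaks and repeated blanks into single spaces
    normalised = (" ".join(p.split()) for p in paragraphs)
    # dict.fromkeys keeps the first occurrence of each paragraph and preserves order
    return list(dict.fromkeys(p for p in normalised if p))


# Invented sample data: one duplicate, one empty and one whitespace-only entry
raw = [
    "Tempo is the rate\nof playback.",
    "",
    "   ",
    "Tempo is the rate of playback.",
    "V5 |V4 |Touch",
]
print(clean_paragraphs(raw))
```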


## Query Completion

This section prepares the mechanisms that provide the prompt with a meaningful context from which the LLM can extract the information needed to answer the user's question.

### Calculating Embeddings

Embeddings are what the model works with internally, but here they are also used to measure how similar a given user question is to each paragraph of the Zynthian user guides.

In [4]:
def attach_embeddings(df: pd.DataFrame, embedding_model_name:str, batch_size=100, client:openai.OpenAI=CLIENT) -> pd.DataFrame:
    """
    Takes a DataFrame with raw context information and calculates its embeddings and returns the updated frame.
    
    Args:
        df: DataFrame with human readable strings in column "text" that should be encoded
        embedding_model_name: in what manner OpenAI should encode the data
        batch_size: the amount of data that should be transmitted to the API at once
        client: the OpenAI API handle
    Returns:
        pd.DataFrame: DataFrame with "text" and "embeddings" columns
    """
    
    embeddings = []
    # iterate batch-wise to avoid API-overstraining
    for i in range(0, len(df), batch_size):
        
        # Actual embeddings will be calculated by OpenAI and applied for via its API
        response = client.embeddings.create(
            input=df.iloc[i:i+batch_size]["text"].tolist(),
            model=embedding_model_name).data

        # Turn OpenAI packet into a list of embeddings
        for response_data in response:
            embeddings.append(response_data.embedding)

    df["embeddings"] = embeddings
    return df    

CONTEXT_WITH_EMBEDDINGS_DF = attach_embeddings(CONTEXT_DF, EMBEDDING_MODEL_NAME)
CONTEXT_WITH_EMBEDDINGS_DF

Unnamed: 0,text,embeddings
0,V5 |V4 |Touch,"[-0.009096662513911724, -0.007616293150931597,..."
1,Zynthian Oram is the most recent version of zy...,"[-0.029481329023838043, -0.016066910699009895,..."
2,It is strongly recommended that you read secti...,"[-0.010746555402874947, 0.010111996904015541, ..."
3,"This guide is a living document, subject to fr...","[-0.0006172802532091737, -0.000813496182672679..."
4,The fundamental building block of zynthian's s...,"[-0.008485890924930573, -0.005419307388365269,..."
...,...,...
527,and some example drivers here:,"[-0.01640596054494381, 0.003860226133838296, 0..."
528,zynthian-ui/zyngine/ctrldev/ · GitHub,"[0.0013341867597773671, -0.008867095224559307,..."
529,All drivers must inherit from zynthian_ctrldev...,"[-0.0013864703942090273, 0.009447003714740276,..."
530,"Quite simple, right? This is for basic “generi...","[0.013725951313972473, 0.016517670825123787, -..."
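
The batching used in `attach_embeddings` can be looked at in isolation. The sketch below is a hypothetical helper (`iter_batches` is not part of the notebook) that cuts a list of texts into slices of at most `batch_size` items, the same way the `range(0, len(df), batch_size)` loop does; no API call is involved.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# 250 placeholder paragraphs end up in batches of 100, 100 and 50
texts = [f"paragraph {i}" for i in range(250)]
print([len(batch) for batch in iter_batches(texts)])
```

Each yielded batch could then be passed as `input` to one embeddings request, so the number of requests grows with `len(texts) / batch_size` rather than with the number of paragraphs.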


### Prepare temporary context

The user guide information was downloaded and stored in no particular order. OpenAI's model interface limits the total number of tokens in a prompt, so the full extracted text cannot be used and a subset of the most relevant paragraphs has to be selected. This is done by embedding the context paragraphs and measuring their similarity to the embedded question, as in the following cell.

In [5]:
def get_relevant_context(
        question: str, 
        df: pd.DataFrame, 
        embedding_model_name: str,
        client: openai.OpenAI=CLIENT) -> pd.DataFrame:
    """
    This function takes the unordered context data and orders it by relevance 
    according to the given question.
    
    Args:
        question: the string that is used to order the dataset by relevance
        df: the unordered DataFrame with "text" and "embeddings" columns
        embedding_model_name: the OpenAI embedding model used to encode the question
        client: the OpenAI API handle
    Returns:
        pd.DataFrame: a DataFrame sorted by relevance to the given question
    """
    
    # Embed the question via OpenAI's API and score every paragraph against it
    query_embedding = client.embeddings.create(input=question, model=embedding_model_name).data[0].embedding

    # Despite its name, the "distances" column holds cosine similarities (higher = more relevant)
    df["distances"] = cosine_similarity([query_embedding], df["embeddings"].tolist())[0]

    # Sort the DataFrame by the similarity just calculated, most relevant first
    return df.sort_values("distances", ascending=False)

TEST_QUESTION = "What is known about tempo?"
RELEVANT_CONTEXT = get_relevant_context(TEST_QUESTION, CONTEXT_WITH_EMBEDDINGS_DF, EMBEDDING_MODEL_NAME)

N_CHECK_ENTRIES = 3
print(f"Q: {TEST_QUESTION}\n")
print(f"[info] Unordered DataFrame\n{CONTEXT_WITH_EMBEDDINGS_DF[0:N_CHECK_ENTRIES]['text']}\n")
print(f"[info] DataFrame ordered by relevance\n{RELEVANT_CONTEXT[0:N_CHECK_ENTRIES]['text']}\n")

Q: What is known about tempo?

[info] Unordered DataFrame
0                                        V5 |V4 |Touch
1    Zynthian Oram is the most recent version of zy...
2    It is strongly recommended that you read secti...
Name: text, dtype: object

[info] DataFrame ordered by relevance
278    The current tempo is saved and loaded with eac...
277    Tempo is the rate at which the sequencer plays...
253    Tempo may be adjusted using the SNAPSHOT encod...
Name: text, dtype: object
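
The ranking above rests on cosine similarity. The following sketch reproduces the idea in plain Python with invented two-dimensional vectors; real embeddings have far more dimensions, and the helper names `cosine` and `rank_by_similarity` are assumptions made for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equally long, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, docs):
    """Return (score, text) pairs, most similar first. docs maps text -> vector."""
    scored = [(cosine(query_vec, vec), text) for text, vec in docs.items()]
    return sorted(scored, reverse=True)


# Toy vectors, invented for illustration only
docs = {
    "tempo paragraph": [1.0, 0.1],
    "MIDI paragraph": [0.1, 1.0],
    "mixed paragraph": [1.0, 1.0],
}
for score, text in rank_by_similarity([1.0, 0.0], docs):
    print(f"{score:.3f}  {text}")
```

A score close to 1 means the paragraph vector points in almost the same direction as the question vector, which is why the results are sorted in descending order.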



### Prompt generator

To use the chatbot with the extended context information, a standardized way of creating its prompt is prepared here. It makes sure that the number of tokens used does not exceed the limit, and it combines the relevant context and the actual question into one large string.

In [6]:
def make_prompt(question, 
                context_df,
                embedding_model_name: str = EMBEDDING_MODEL_NAME,
                max_question_len=100, 
                max_prompt_tokens=1800,
                verbose=False,
                tokenizer=TOKENIZER):
    """
    This function provides standardized prompts. The context is ordered by relevance to the user query.
    
    Args:
        question: the string that is used to order the dataset by relevance
        context_df: the unordered DataFrame with "text" and "embeddings" columns
        embedding_model_name: the OpenAI embedding model used to encode the data
        max_question_len: upper bound for the number of characters in the question
        max_prompt_tokens: soft limit for the number of prompt tokens; context assembly
            stops early once this value is exceeded
        verbose: print user feedback during processing
        tokenizer: the tiktoken callback function to turn raw strings into tokens.
    Returns:
        str: the prompt with relevant context and user question
    """
    
    assert len(question) < max_question_len, f"Your question is too long, please rephrase to use less than {max_question_len} characters."
    if verbose:
        print(f"[info] Making a prompt for: {question}")
        
    # set up raw layout of prompt
    prompt_context = "Context:\n"
    prompt_question = f"Question:\n{question}"
    relevant_context = get_relevant_context(question, context_df, embedding_model_name)
    
    # assemble the relevant context as long as the token budget is not exhausted
    used_tokens = len(tokenizer.encode(prompt_context)) + len(tokenizer.encode(prompt_question))
    for idx, text_element in enumerate(relevant_context["text"]):
        used_tokens += len(tokenizer.encode(text_element))
        if used_tokens > max_prompt_tokens:  # soft limit: separators are not counted
            if verbose:
                print(f"[info] Limited context to {idx}/{relevant_context.shape[0]} paragraphs")
            break
        else:
            prompt_context += f"{text_element}\n---\n"
    
    # complete full prompt
    prompt = prompt_context + prompt_question
    
    if verbose:
        print("[info] Created following prompt:")
        print(f"===========================\n\n{prompt}\n\n")

    return prompt

_ = make_prompt(TEST_QUESTION, CONTEXT_WITH_EMBEDDINGS_DF, EMBEDDING_MODEL_NAME, verbose=True)

[info] Making a prompt for: What is known about tempo?
[info] Limited context to 30/413 paragraphs
[info] Created following prompt:

Context:
The current tempo is saved and loaded with each snapshot.
---
Tempo is the rate at which the sequencer plays back notes measured in beats per minutes (BPM). By default ZynSeq plays sequences at 120 BPM. Adjust Tempo with the SNAPSHOT encoder. The tempo is briefly displayed in the title bar. There is also a menu option to adjust tempo.
---
Tempo may be adjusted using the SNAPSHOT encoder or by selecting "Tempo" from the menu.
---
ZynSeq allows tempo to be adjusted from 1.0 BPM to 420.0 BPM in 0.1 BPM steps. Tempo may also be altered by external modules, e.g. MIDI player.
---
Knob K3 is used to adjust the tempo. When rotated, the Zynthian tempo screen will be shown briefly. This tempo setting is synchronized with the MPK. 
---
For instance, if you are in the mixer view and short-push OPT/ADMIN, the Main menu will be opened. If you short-push it aga
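
The budget logic of `make_prompt` can be tried offline. The sketch below is a simplified variant under stated assumptions: `build_context` and `word_count` are hypothetical names, `word_count` is a crude stand-in for the tiktoken tokenizer, and unlike the notebook function the sketch also counts the `---` separators and checks the budget before adding a paragraph.

```python
def build_context(paragraphs, question, max_tokens, count_tokens):
    """Add paragraphs (most relevant first) until the next one would exceed max_tokens."""
    header, footer = "Context:\n", f"Question:\n{question}"
    used = count_tokens(header) + count_tokens(footer)
    parts = []
    for text in paragraphs:
        chunk = f"{text}\n---\n"
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # budget exhausted, drop the remaining, less relevant paragraphs
        parts.append(chunk)
        used += cost
    return header + "".join(parts) + footer, used


def word_count(text):
    # Stand-in tokenizer for illustration only; it is NOT how tiktoken counts tokens
    return len(text.split())


# Paragraphs are assumed to be sorted by relevance already
ranked = [
    "Tempo is measured in BPM.",
    "Adjust tempo with the SNAPSHOT encoder.",
    "Unrelated paragraph about MIDI ports.",
]
prompt, used = build_context(ranked, "What is known about tempo?",
                             max_tokens=20, count_tokens=word_count)
print(used)
print(prompt)
```

In the notebook the counting function would be `lambda text: len(TOKENIZER.encode(text))`, which gives a budget in real model tokens.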

### Prompt submission
With the preparations for a higher-quality prompt in place, a standardized way of querying OpenAI's model is set up here before the test comparison is run.

In [7]:
def ask_openai(prompt:str, 
               client: openai.OpenAI = CLIENT, 
               model:str = COMPLETION_MODEL_NAME, 
               max_answer_tokens:int=500) -> str:
    """
    This submits arbitrary prompts to OpenAI's Chat Completions interface and 
    returns the first returned answer as a string
    
    Args:
        prompt: the query that will be passed to the model
        client: the OpenAI API handle
        model: name of the chat model that should answer the query
        max_answer_tokens: upper bound for the answer length, to limit API usage
    Returns:
        str: the most likely answer for the input question    
    """  
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Be a helpful assistant"},
                {"role": "user", "content": prompt},
            ],
            max_tokens=max_answer_tokens)
        # return most likely answer
        return response.choices[0].message.content.strip()
    # Catch failures (for instance an unreachable API) so the notebook keeps running.
    except Exception as e: 
        print(e)
        return ""

## Custom Chatbot Performance Demonstration

Here the performance is demonstrated by comparing two modes:

- the vanilla mode just takes the question and queries OpenAI's chat completion model.
- the context mode takes the question and adds information from the Zynthian user guides that is relevant to it.

Two questions are picked that are unlikely to be answerable from general knowledge of synthesizers.
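
The comparison of the two modes can be wrapped in a small helper. The sketch below is an illustration only: `compare_modes` and the two stub functions are hypothetical, and the stubs merely stand in for `ask_openai` and `make_prompt` so that the example runs without an API key.

```python
def compare_modes(question, ask, build_prompt):
    """Return the answers of the vanilla and the context mode for one question."""
    return {
        "vanilla": ask(question),                # question only
        "context": ask(build_prompt(question)),  # question plus retrieved context
    }


# Stubs so the sketch runs offline; they do not call any model
def fake_ask(prompt):
    return f"<answer based on {len(prompt)} characters of prompt>"


def fake_build_prompt(question):
    return f"Context:\nTempo is measured in BPM.\n---\nQuestion:\n{question}"


answers = compare_modes("What is known about tempo?", fake_ask, fake_build_prompt)
for mode, answer in answers.items():
    print(f"{mode}: {answer}")
```

In the notebook the call would look like `compare_modes(question, ask_openai, lambda q: make_prompt(q, CONTEXT_WITH_EMBEDDINGS_DF))`.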

### Question 1

The answer to this question is spread over several paragraphs of the user guide, which makes guessing harder.

In [8]:
question = "List all functions of the 'short-push' button in Zynthian."

In [9]:
vanilla_prompt = question
print(ask_openai(vanilla_prompt))

In Zynthian, the 'short-push' button typically has several functions depending on the context:

1. Select: The 'short-push' button can be used to select items or options within the Zynthian interface.

2. Confirm: The button can be used to confirm selections or actions, such as saving settings or applying changes.

3. Menu access: In some menus, the 'short-push' button may be used to access additional options or sub-menus.

4. Play/Stop: In certain instruments or applications, the 'short-push' button may function as a play/stop control.

5. Navigation: The 'short-push' button can also be used for navigation purposes, such as scrolling through lists or pages.

These are some common functions associated with the 'short-push' button in Zynthian, but the actual functions may vary depending on the specific configuration or setup of the Zynthian system.


In [10]:
context_prompt = make_prompt(question, CONTEXT_WITH_EMBEDDINGS_DF)
print(ask_openai(context_prompt))

The 'short-push' button functions in Zynthian are as follows:

1. Accessing the classic zynthian workflow (V1-V4) based on the 4 knobs' switches.
2. Show onscreen buttons after selecting certain menu items, allowing adjustments of parameters using the touchscreen interface.
3. Redefining the short-push functionality of transport, arrows, and F1-F4 buttons.
4. Providing a summary of connected hardware MIDI INPUT/OUTPUT ports.
5. Configuring the touchscreen display.
6. Assigning midi note values to control the Zynthian UI.
7. Reviewing presets across all instruments and effects in the Zynthian box.
8. Updating the Zynthian software.
9. Browsing and selecting options in the main menu selector view.
10. Enabling MIDI learning for all available controls.
11. Switching back to the default mode to adjust chain settings in the mixer interface.
12. Accessing the Zynthian box from a browser.
13. Controlling Zynthian's step sequencer and pattern editor when in the corresponding mode.


### Question 2

For verification see https://wiki.zynthian.org/index.php/ZynSeq_User_Guide#Time_Signature

In [11]:
question = "In Zynthian: what is the most significant behaviour that influences beats per bar ?"
vanilla_prompt = question
print(ask_openai(vanilla_prompt))

In Zynthian, the most significant behavior that influences the beats per bar is the time signature. The time signature indicates how many beats are in each bar of music. For example, a time signature of 4/4 means there are 4 beats in each bar, while a time signature of 3/4 means there are 3 beats in each bar. By setting the time signature in Zynthian, you can determine the number of beats per bar and create a rhythmic structure for your music.


In [12]:
context_prompt = make_prompt(question, CONTEXT_WITH_EMBEDDINGS_DF)
print(ask_openai(context_prompt))

The most significant behavior that influences beats per bar in Zynthian is the sync point. The sync point is the point at which a sequence starts, loops, or stops and is synonymous with the start of a bar. The beats per bar defines the quantity of beats between sync points, which in turn defines the rate at which sequences will loop. Adjusting the beats per bar affects how the sequences are synchronized and looped within Zynthian.
