<a href="https://colab.research.google.com/github/wandb/edu/blob/main/llm-apps-course/notebooks/02.%20Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<!--- @wandbcode{llmapps-generation} -->

# Generation

In this notebook we will dive deeper into prompting. We will give the model better context, built from users' questions and from the documentation files, so that it generates better answers.


### Setup

In [1]:
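# openai is pinned below 1.0 because this notebook uses the legacy `openai.ChatCompletion` interface
%pip install -Uqqq pandas rich "openai<1" tiktoken wandb tenacity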

Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import random

import openai
import tiktoken

from pathlib import Path
from pprint import pprint
from getpass import getpass

from rich.markdown import Markdown
import pandas as pd
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential, # for exponential backoff
)  
import wandb
from wandb.integration.openai import autolog

You will need an OpenAI API key to run this notebook. You can get one [here](https://platform.openai.com/account/api-keys).

In [3]:
if os.getenv("OPENAI_API_KEY") is None:
  if any(['VSCODE' in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window!')
  os.environ["OPENAI_API_KEY"] = getpass("Paste your OpenAI key from: https://platform.openai.com/account/api-keys\n")
  openai.api_key = os.getenv("OPENAI_API_KEY", "")

assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
print("OpenAI API key configured")

OpenAI API key configured


Let's enable W&B autologging to track our experiments.

In [4]:
# start logging to W&B
os.environ['WANDB_NOTEBOOK_NAME'] = "04. Generation_dfinity.ipynb"
autolog({"project":"llmapps", "job_type": "generation"})

wandb: Currently logged in as: carlek. Use `wandb login --relogin` to force relogin


# Generating synthetic support questions

We will add retry behavior with random exponential backoff in case we hit the API rate limit.

In [5]:
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)
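
To make the retry behavior concrete, here is a minimal standard-library sketch of the same idea: wait a random, exponentially growing delay between attempts and give up after a fixed number of tries. It is not how `tenacity` is implemented. `retry_with_backoff`, `flaky_call` and the injectable `sleep` argument are hypothetical names used only for illustration.

```python
import random
import time


def retry_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Call func(); on failure wait a random, exponentially growing delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # the upper bound doubles on every attempt, capped at max_wait
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, upper))


# hypothetical flaky function: fails twice, then succeeds
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


waits = []
result = retry_with_backoff(flaky_call, sleep=waits.append)  # record waits instead of sleeping
print(result, len(waits))
```

Passing `sleep` as an argument is only a convenience for trying the sketch without actually waiting.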

In [7]:
MODEL_NAME = "gpt-3.5-turbo"
# MODEL_NAME = "gpt-4"

In [8]:

def generate_responses(system_prompt, user_prompt, debug=False, n=5):
    messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
    responses = completion_with_backoff(
        model=MODEL_NAME,
        messages=messages,
        n = n,
        )
    generations = []
    for response in responses.choices:
        generation = response.message.content
        if debug:
            display(Markdown(generation))
        generations.append(generation)
    return generations

system_prompt = "You are a helpful assistant."
user_prompt = "Generate a support question from a developer on Dfinity's Internet Computer Platform (ICP)"

# user_prompt = "Generate support questions from developers of web-applications for Dfinity's platform on the Internet Computer Platform (ICP) for the following topics: the Motoko language, the Canister SDK, and the ICP API. Generate 1 question for each topic. Do not generate the topic in the result.  Do not enumerate anything in the result. The response should only contain the questions, with each question on a separate line.  And do not generate any blank lines"

user_questions = generate_responses(system_prompt, user_prompt, n=5)
user_questions = [q for q in user_questions if q != '']
pprint(user_questions)


['"How can I ensure the security of smart contract code on the Internet '
 'Computer Platform (ICP) before deployment?"',
 '"Is there a recommended approach or library for handling user authentication '
 'on the ICP platform?"',
 '"How can I debug and troubleshoot issues with smart contracts on the '
 'Internet Computer Platform (ICP)?"',
 '"What are the best practices for developing smart contracts on the ICP '
 'platform?"',
 '"Can you provide guidance on how to securely deploy and access canister '
 'smart contracts on the Internet Computer Platform (ICP)?"']


# Few Shot 

Let's read some user-submitted queries from the file `examples_dfx.txt`. This file contains one question per line, so we split it on newlines (`\n`).

In [9]:
# Test if examples_dfx.txt is present, download if not
if not Path("examples_dfx.txt").exists():
    !wget https://raw.githubusercontent.com/carlek/wandb-edu/main/llm-apps-course/notebooks/examples_dfx.txt

In [10]:
delimiter = "\n" # new lines separate queries
with open("examples_dfx.txt", "r") as file:
    data = file.read()
    real_queries = data.split(delimiter)

pprint(f"We have {len(real_queries)} real queries:")  
Markdown(f"Sample one: \n\"{random.choice(real_queries)}\"")



'We have 8 real queries:'
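
Splitting on newlines leaves empty strings behind when the file ends with a newline or contains blank lines, and such an empty entry could later be picked as a "query". Below is a small sketch of a more defensive split. It uses an inline string in place of the file, and the sample text is made up.

```python
# stand-in for the contents of the examples file (hypothetical text)
data = "How do I deploy a canister?\nWhat is a cycle?\n\n"

naive = data.split("\n")
# keep only non-blank lines and trim surrounding whitespace
cleaned = [line.strip() for line in data.splitlines() if line.strip()]

print(len(naive), len(cleaned))
```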


We can now use those real user questions to guide our model to produce synthetic questions like those.

In [11]:
pprint(user_questions)

['"How can I ensure the security of smart contract code on the Internet '
 'Computer Platform (ICP) before deployment?"',
 '"Is there a recommended approach or library for handling user authentication '
 'on the ICP platform?"',
 '"How can I debug and troubleshoot issues with smart contracts on the '
 'Internet Computer Platform (ICP)?"',
 '"What are the best practices for developing smart contracts on the ICP '
 'platform?"',
 '"Can you provide guidance on how to securely deploy and access canister '
 'smart contracts on the Internet Computer Platform (ICP)?"']


In [13]:
def generate_few_shot_prompt(queries, n=3):
    prompt = "Generate a support question from a developer on Dfinity's Internet Computer Platform (ICP). \n" +\
        "Below you will find a few examples of real user queries:\n"
    for _ in range(n):
        pick = random.choice(queries)
        prompt += pick + "\n"
    prompt += "Let's start!"
    return prompt

generation_prompt = generate_few_shot_prompt(user_questions)
Markdown(generation_prompt)
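
Note that `random.choice` inside a loop can pick the same example more than once. If distinct examples are preferred, `random.sample` draws without replacement. The sketch below works under that assumption. The example queries are made up, and `build_few_shot_prompt` and its `seed` parameter are hypothetical.

```python
import random


def build_few_shot_prompt(queries, n=3, seed=None):
    """Build a few-shot prompt from up to n distinct example queries."""
    rng = random.Random(seed)  # a seeded generator makes the prompt reproducible
    picks = rng.sample(queries, k=min(n, len(queries)))  # no duplicates
    header = (
        "Generate a support question from a developer on Dfinity's "
        "Internet Computer Platform (ICP).\n"
        "Below you will find a few examples of real user queries:\n"
    )
    return header + "\n".join(picks) + "\nLet's start!"


examples = [
    "How do I top up cycles?",
    "Why does my deployment fail?",
    "How do I upgrade a canister?",
    "What is a principal?",
]
prompt = build_few_shot_prompt(examples, n=3, seed=0)
print(prompt)
```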


OpenAI `Chat` models are good at following instructions when given a few examples. Let's see how the model does here, using the examples in the prompt as context.

In [14]:
def generate_and_print(system_prompt, user_prompt, n=5):
    messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
    responses = completion_with_backoff(
        model=MODEL_NAME,
        messages=messages,
        n = n,
        )
    for response in responses.choices:
        generation = response.message.content
        display(Markdown(generation))

generate_and_print(system_prompt, user_prompt=generation_prompt)

# support_answers = generate_responses(system_prompt, user_prompt=generation_prompt)
# pprint(support_answers)

# Add Context & Response

Let's create a function that finds all the markdown files in a directory and returns their content and path.

In [15]:
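# check if the directory exists; if not, clone the repo and copy the markdown files, e.g. if running in Colab
if not os.path.exists("../dfinity_md/"):
  !git clone https://github.com/carlek/dfinity-web-development
  !mkdir -p ../dfinity_md
  # assumes the markdown files sit at the top level of the cloned repository
  !cp -r dfinity-web-development/*.md ../dfinity_md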

In [16]:
def find_md_files(directory):
    "Find all markdown files in a directory and return their content and path"
    md_files = []
    for file in Path(directory).rglob("*.md"):
        with open(file, 'r', encoding='utf-8') as md_file:
            content = md_file.read()
        md_files.append((file.relative_to(directory), content))
    return md_files

documents = find_md_files('../dfinity_md/')
len(documents)

5
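
To check what the function returns without depending on the documentation folder, it can be run against a throwaway directory. The sketch below repeats the function so that it is self-contained. The file names and contents are invented for the example.

```python
import tempfile
from pathlib import Path


def find_md_files(directory):
    "Find all markdown files in a directory and return their content and path"
    md_files = []
    for file in Path(directory).rglob("*.md"):
        with open(file, "r", encoding="utf-8") as md_file:
            content = md_file.read()
        md_files.append((file.relative_to(directory), content))
    return md_files


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guide").mkdir()
    (root / "README.md").write_text("# Intro", encoding="utf-8")
    (root / "guide" / "deploy.md").write_text("# Deploy", encoding="utf-8")
    (root / "notes.txt").write_text("ignored", encoding="utf-8")  # not markdown
    # rglob also descends into subdirectories; paths come back relative to root
    found = sorted((path.as_posix(), text) for path, text in find_md_files(root))

print(found)
```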

Let's check that the documents are not too long for our context window. To do so, we need to compute the number of tokens in each document.

In [17]:
tokenizer = tiktoken.encoding_for_model(MODEL_NAME)
tokens_per_document = [len(tokenizer.encode(document)) for _, document in documents]
pprint(tokens_per_document)

[1829, 3066, 306, 2430, 1756]
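
To see which documents exceed a given budget, the counts can be compared against a limit. The sketch below reuses the numbers printed above and assumes a budget of 1024 tokens, which is the default chunk size used in the next step. The variable names are illustrative.

```python
# token counts copied from the output above
tokens_per_document = [1829, 3066, 306, 2430, 1756]
MAX_TOKENS = 1024  # assumed per-chunk budget

# indices of the documents that do not fit in the budget
too_long = [i for i, n_tokens in enumerate(tokens_per_document) if n_tokens > MAX_TOKENS]
print(f"{len(too_long)} of {len(tokens_per_document)} documents exceed {MAX_TOKENS} tokens: {too_long}")
```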


Some of them are too long. Instead of using entire documents, we'll extract a random chunk from each of them.

In [18]:
# extract a random chunk from a document
def extract_random_chunk(document, max_tokens=1024):
    tokens = tokenizer.encode(document)
    if len(tokens) <= max_tokens:
        return document
    start = random.randint(0, len(tokens) - max_tokens)
    end = start + max_tokens
    return tokenizer.decode(tokens[start:end])
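
The window arithmetic is easier to follow without a real tokenizer. In the sketch below, a list of placeholder strings stands in for the token ids that `tiktoken` would produce. `extract_random_window` and its `rng` parameter are hypothetical names.

```python
import random


def extract_random_window(tokens, max_tokens=1024, rng=random):
    """Return a contiguous slice of at most max_tokens items from tokens."""
    if len(tokens) <= max_tokens:
        return tokens
    # randint is inclusive on both ends, so the last valid start is len - max_tokens
    start = rng.randint(0, len(tokens) - max_tokens)
    return tokens[start:start + max_tokens]


# placeholder "tokens" stand in for tiktoken ids in this sketch
words = [f"w{i}" for i in range(10)]
window = extract_random_window(words, max_tokens=4, rng=random.Random(0))
print(window)
```

Because the slice is taken over tokens rather than characters, a decoded chunk can begin or end in the middle of a sentence.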

Now, we will use that extracted chunk to create a question that can be answered by the document. This way we can generate questions that our current documentation is capable of answering.

In [19]:
def generate_context_prompt(chunk):
    prompt = "Generate a support question from a DFINITY ICP user.\n" +\
        "The question should be answerable within the following fragment of DFINITY ICP documentation.\n" +\
        "The fragment of documentation is delimited by 5 dollar signs:\n" +\
        " $$$$$ " +\
        chunk + "\n" +\
        " $$$$$ " +\
        "Let's start!"
    return prompt

chunk = extract_random_chunk(documents[0][1])
generation_prompt = generate_context_prompt(chunk)
pass

In [20]:
# pprint(generation_prompt)
Markdown(generation_prompt)

Let's generate 3 possible questions:

In [21]:
questions = generate_responses(system_prompt, generation_prompt, n=3)
pprint(questions)

['Can I get help with implementing the `cancelProposal` and `voteOnProposal` '
 'methods in `Governor.mo`?',
 'Support Question: How can I cancel an active proposal in the Governor '
 'canister?\n'
 '\n'
 'Possible answer: To cancel an active proposal in the Governor canister, you '
 'can use the `cancelProposal` method. This method accepts one parameter, '
 '`propNum`, which represents the index of the proposal in the `proposals` '
 'list. \n'
 '\n'
 'First, you need to check that the `propNum` is a valid index by ensuring '
 "that it doesn't reference an out-of-bound index location based on the size "
 'of `proposals`. If it is an invalid index, you will receive the '
 '`#proposalNotFound` error. \n'
 '\n'
 'Next, you need to make sure that the canister calling `cancelProposal` is '
 'either the owner of the proposal or the owner of the entire `Governor` '
 'canister. You can access the caller of a method by querying the `caller` '
 'field of the `msg` record with `msg.caller`. If th

> As you can see, the generation sometimes contains an intro phrase such as "Sure, here's a support question based on the documentation:" or a label such as "Support Question:". We may want to add instructions to the prompt to avoid this.
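
Besides prompt instructions, a light post-processing step can catch whatever slips through. This is a different technique from the one described above, sketched here as a complement. The pattern list is an assumption based on the two prefixes mentioned in this notebook, and `strip_intro` is a hypothetical helper.

```python
import re

# hypothetical list of lead-in patterns; extend it as new ones show up
INTRO_PATTERNS = [
    r"^\s*sure[,!]?\s+here(?:'s| is)[^:\n]*:\s*",
    r"^\s*support question:\s*",
]


def strip_intro(generation):
    """Remove a leading intro phrase or label from a model generation."""
    for pattern in INTRO_PATTERNS:
        generation = re.sub(pattern, "", generation, count=1, flags=re.IGNORECASE)
    return generation.strip()


first = strip_intro("Sure, here's a support question based on the documentation: How do I cancel a proposal?")
second = strip_intro("Support Question: How can I cancel an active proposal?")
print(first)
print(second)
```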

### Level 5 prompt

Complex directive that includes the following:
- Description of high-level goal
- A detailed bulleted list of sub-tasks
- An explicit statement asking LLM to explain its own output
- A guideline on how LLM output will be evaluated
- Few-shot examples

In [22]:
# GPT-4 tends to give better answers and follows instructions more closely;
# uncomment the next line to use it. We keep gpt-3.5-turbo here.
# MODEL_NAME = "gpt-4"
MODEL_NAME = "gpt-3.5-turbo"

In [23]:
# read the system_template_dfinity.txt file into a string
with open("system_template_dfinity.txt", "r") as file:
    system_prompt = file.read()

In [24]:
Markdown(system_prompt)

In [25]:
# read the prompt_template_dfinity.txt file into a string (it is filled in later with str.format)
with open("prompt_template_dfinity.txt", "r") as file:
    prompt_template = file.read()

In [26]:
Markdown(prompt_template)

In [27]:
def generate_context_prompt(chunk, n_questions=3):
    questions = '\n'.join(random.sample(user_questions, n_questions))
    pprint(questions)
    user_prompt = prompt_template.format(QUESTIONS=questions, CHUNK=chunk)

    return user_prompt

user_prompt = generate_context_prompt(chunk)

('"How can I debug and troubleshoot issues with smart contracts on the '
 'Internet Computer Platform (ICP)?"\n'
 '"Is there a recommended approach or library for handling user authentication '
 'on the ICP platform?"\n'
 '"What are the best practices for developing smart contracts on the ICP '
 'platform?"')


In [28]:
Markdown(user_prompt)
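
The template files are not reproduced in this notebook, so the following sketch uses a made-up template with the same two placeholders, `QUESTIONS` and `CHUNK`, to show how `str.format` fills them. One detail worth knowing: braces inside the substituted values are left alone, while literal braces in the template itself have to be doubled.

```python
# hypothetical template; the real one lives in prompt_template_dfinity.txt
template = (
    "Here are some example questions:\n{QUESTIONS}\n"
    "Documentation fragment:\n{CHUNK}\n"
    "Literal braces in a template are written doubled: {{like this}}"
)

questions = "\n".join(["How do I deploy?", "What is a canister?"])
chunk = 'let config = { name = "demo" };'  # braces in values are not interpreted

user_prompt = template.format(QUESTIONS=questions, CHUNK=chunk)
print(user_prompt)
```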

In [29]:
def generate_questions(documents, n_questions=3, n_generations=5):
    questions = []
    for _, document in documents:
        chunk = extract_random_chunk(document)
        user_prompt = generate_context_prompt(chunk, n_questions)
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ]
        response = completion_with_backoff(
            model=MODEL_NAME,
            messages=messages,
            n = n_generations,
            )
        questions.extend([response.choices[i].message.content for i in range(n_generations)])
    return questions

> A note about the `system` role: for GPT-4 based pipelines you probably want to move part of the context prompt into the `system` message. As we are using `gpt-3.5-turbo` here, you can put the instructions in the user prompt instead. You can read more about this in the [OpenAI docs](https://platform.openai.com/docs/guides/chat/instructing-chat-models).
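
As an illustration of the two placements, the sketch below builds the message list either way. `build_messages` and its `instructions_in_system` flag are hypothetical. Which placement works better for a given model is something to verify empirically.

```python
def build_messages(instructions, content, instructions_in_system=True):
    """Build a chat message list with the instructions in either role."""
    if instructions_in_system:
        return [
            {"role": "system", "content": instructions},
            {"role": "user", "content": content},
        ]
    # otherwise keep a generic system message and prepend the instructions to the user turn
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": f"{instructions}\n\n{content}"},
    ]


in_system = build_messages("Answer in one sentence.", "What is a canister?")
in_user = build_messages("Answer in one sentence.", "What is a canister?", instructions_in_system=False)
print(in_user[1]["content"])
```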

In [30]:
# function to parse model generation and extract CONTEXT, QUESTION and ANSWER
def parse_generation(generation):
    lines = generation.split("\n")
    context = []
    question = []
    answer = []
    flag = None
    
    for line in lines:
        if "CONTEXT:" in line:
            flag = "context"
            line = line.replace("CONTEXT:", "").strip()
        elif "QUESTION:" in line:
            flag = "question"
            line = line.replace("QUESTION:", "").strip()
        elif "ANSWER:" in line:
            flag = "answer"
            line = line.replace("ANSWER:", "").strip()

        if flag == "context":
            context.append(line)
        elif flag == "question":
            question.append(line)
        elif flag == "answer":
            answer.append(line)

    context = "\n".join(context)
    question = "\n".join(question)
    answer = "\n".join(answer)
    return context, question, answer
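
A quick way to see what the parser does is to run it on a hand-written generation. The sketch below uses a compact restatement of the function above, intended to behave the same way, so that the example is self-contained. The sample text is invented. It also shows the failure mode: when the markers are missing, all three fields come back empty.

```python
def parse_generation(generation):
    """Split a generation into its CONTEXT, QUESTION and ANSWER sections."""
    sections = {"context": [], "question": [], "answer": []}
    current = None
    for line in generation.split("\n"):
        for name in sections:
            marker = name.upper() + ":"
            if marker in line:
                current = name  # switch section and drop the marker itself
                line = line.replace(marker, "").strip()
                break
        if current is not None:
            sections[current].append(line)
    return tuple("\n".join(sections[name]) for name in ("context", "question", "answer"))


sample = (
    "CONTEXT: A user is building a governance canister.\n"
    "QUESTION: How can I cancel an active proposal?\n"
    "ANSWER: Use the `cancelProposal` method.\n"
    "It takes `propNum` as its parameter."
)
context, question, answer = parse_generation(sample)
print(question)
print(parse_generation("No markers here"))
```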

In [31]:
generations = generate_questions([documents[0]], n_questions=3, n_generations=5)
parse_generation(generations[0])

('"Is there a recommended approach or library for handling user authentication '
 'on the ICP platform?"\n'
 '"What are the best practices for developing smart contracts on the ICP '
 'platform?"\n'
 '"Can you provide guidance on how to securely deploy and access canister '
 'smart contracts on the Internet Computer Platform (ICP)?"')


('A user is working on implementing a governance canister on the ICP platform. They have gone through the documentation and are now trying to understand how to cancel an active proposal.\n',
 'How can I cancel an active proposal on the governance canister?\n',
 'To cancel an active proposal on the governance canister, you can use the `cancelProposal` method. This method accepts one parameter, `propNum`, which represents the index of the proposal in the `proposals` list. You need to ensure that the `propNum` is a valid index and does not reference an out-of-bound index location. Additionally, the canister calling the `cancelProposal` method should be either the owner of the proposal or the owner of the entire `Governor` canister. If these conditions are not met, you will encounter errors.')

In [32]:
parsed_generations = []
generations = generate_questions(documents, n_questions=3, n_generations=5)
for generation in generations:
    context, question, answer = parse_generation(generation)
    parsed_generations.append({"context": context, "question": question, "answer": answer})

# let's convert parsed_generations to a pandas dataframe and save it locally
df = pd.DataFrame(parsed_generations)
df.to_csv('generated_examples.csv', index=False)

# log df as a table to W&B for interactive exploration
wandb.log({"generated_examples": wandb.Table(dataframe=df)})

# log csv file as an artifact to W&B for later use
artifact = wandb.Artifact("generated_examples", type="dataset")
artifact.add_file("generated_examples.csv")
wandb.log_artifact(artifact)

('"What are the best practices for developing smart contracts on the ICP '
 'platform?"\n'
 '"Is there a recommended approach or library for handling user authentication '
 'on the ICP platform?"\n'
 '"Can you provide guidance on how to securely deploy and access canister '
 'smart contracts on the Internet Computer Platform (ICP)?"')
('"Is there a recommended approach or library for handling user authentication '
 'on the ICP platform?"\n'
 '"How can I ensure the security of smart contract code on the Internet '
 'Computer Platform (ICP) before deployment?"\n'
 '"How can I debug and troubleshoot issues with smart contracts on the '
 'Internet Computer Platform (ICP)?"')
('"Can you provide guidance on how to securely deploy and access canister '
 'smart contracts on the Internet Computer Platform (ICP)?"\n'
 '"How can I debug and troubleshoot issues with smart contracts on the '
 'Internet Computer Platform (ICP)?"\n'
 '"What are the best practices for developing smart contracts on the

<Artifact generated_examples>
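
Since the parser returns empty strings when a marker is missing, some rows of the table may be incomplete. One possible check before saving is sketched below with the standard library only. The rows are invented and `keep_complete` is a hypothetical helper. An in-memory buffer stands in for the CSV file.

```python
import csv
import io


def keep_complete(rows, required=("context", "question", "answer")):
    """Keep only rows where every required field is non-empty after stripping."""
    return [row for row in rows if all(row.get(key, "").strip() for key in required)]


rows = [
    {"context": "A user deploys a canister.", "question": "How do I deploy?", "answer": "Run the deploy command."},
    {"context": "", "question": "", "answer": ""},  # a generation without markers
]
complete = keep_complete(rows)

# write the surviving rows as CSV text
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["context", "question", "answer"])
writer.writeheader()
writer.writerows(complete)
print(buffer.getvalue())
```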

In [33]:
wandb.finish()

Run history:
usage/completion_tokens,▁▁▂██▇▇█▇
usage/elapsed_time,▁▁▁▁▁▁▁▁▁
usage/prompt_tokens,▁▁▆███▄██
usage/total_tokens,▁▁▅██▇▅█▇

Run summary:
usage/completion_tokens,1061.0
usage/elapsed_time,0.0
usage/prompt_tokens,1412.0
usage/total_tokens,2473.0
