# Synthetic dataset pipeline

Synthetically create a dataset from expensive model outputs for fine-tuning cheap models.

Table of Contents

1. [Create potential questions](#create-potential-questions) - Generating 20 usage questions about a package using `gpt-3.5-turbo`.
2. [Answer the questions](#answer-the-questions) - Using `gpt-4-turbo-preview` to answer the questions and conducting a quick quality assessment.
3. [Export to JSONL](#export-to-jsonl) - Exporting the dataset to a JSONL file to save progress.
4. [Review the dataset](#review-the-dataset) - Reviewing the dataset and making necessary corrections using a Panel app.
5. [Related questions](#related-questions) - Generating follow-up questions derived from the original questions to increase the size of the dataset.
6. [Update the tone to how a user might ask](#update-the-tone-to-how-a-user-might-ask) - Updating the tone of the questions to how a typical user might ask.
7. [Multi-turn conversation](#multi-turn-conversation) - Creating a multi-turn conversation dataset by chaining the questions and answers together.
8. [Fine-tune the model](#fine-tune-the-model) - Fine-tuning a model on the dataset using the OpenAI platform.

In [None]:
REFRESH = False  # Set to False to prevent rerunning old cells

## Create potential questions

Using `gpt-3.5-turbo`, generate 20 usage questions about a package. These questions mimic what a user might ask about the package and form the basis of the synthetic dataset. Later, we will rephrase these questions to create a larger dataset.

Note that these questions can also be written by hand, or taken from actual user questions on Discord/Discourse.
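
If you go the manual route, a minimal sketch for turning a hand-written list into the `questions` list could look like the following. The helper name `parse_questions` and the file name `questions.txt` are assumptions for illustration; they are not part of the original pipeline.

```python
def parse_questions(text):
    """Return one question per non-empty line, dropping exact duplicates."""
    seen = set()
    questions = []
    for line in text.splitlines():
        question = line.strip()
        if question and question not in seen:
            seen.add(question)
            questions.append(question)
    return questions


# Hypothetical usage with a hand-maintained file, one question per line:
# from pathlib import Path
# questions = parse_questions(Path("questions.txt").read_text(encoding="utf-8"))
sample = "How do I install Panel?\n\n  How do I serve an app?  \nHow do I install Panel?\n"
print(parse_questions(sample))
```

The order of the questions is preserved, so the resulting list can be zipped with answers in the same way as the generated one.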

We use `gpt-3.5-turbo` here because it is faster and we don't really need deep reasoning abilities to ask questions.

In [None]:
import re
import marvin
import asyncio
from pydantic import BaseModel, Field


@marvin.fn(model_kwargs={"model": "gpt-3.5-turbo"})
async def batch_ask(package: str) -> list[str]:
    """
    Imagine you are a new user to the given `package`.
    Ask at least 20 usage questions about the `package`
    to help you get started.
    """

package = "HoloViz Panel"  # defined unconditionally; later cells use it even when REFRESH is False

if REFRESH:
    questions = await batch_ask(package)

## Answer the questions

Using `gpt-4-turbo-preview`, we will now answer the questions. These answers will be used as the ground truth for the synthetic dataset.

The pipeline below also includes a quick quality assessment of each answer: the code snippets are run with `exec` to check that they execute successfully, and the answer is reviewed by the same model (`gpt-4-turbo-preview`) using a different prompt (a different agent). The downside of using `exec` is that every package the snippet imports must be installed for the check to work as intended. Later, we will create an app to view the inputs and outputs.

Note, you may also manually answer the questions for correctness, especially if the package is not inside GPT-4's training data, but because HoloViz Panel is already a part of the training data, we can use the model to answer the questions.
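
The snippet check can be tried in isolation, without calling any model. The following is a minimal sketch, assuming the answer wraps its code in triple-backtick fences; `check_snippets` is a hypothetical helper name, and unlike the pipeline cell it runs each snippet in a fresh namespace. Running generated code with `exec` executes whatever the model wrote, so it is best done in a throwaway environment.

```python
import re

FENCE = "`" * 3  # three backticks, built here so the example can sit inside a fence
SNIPPET_PATTERN = re.compile(FENCE + r".*?\n(.*?)\n" + FENCE, re.DOTALL)


def check_snippets(answer):
    """Return None if every fenced snippet runs, else a short error message."""
    for code in SNIPPET_PATTERN.findall(answer):
        try:
            exec(code, {})  # fresh globals for each snippet
        except ModuleNotFoundError:
            continue  # package not installed locally, so the snippet cannot be judged
        except Exception as exc:
            return f"{type(exc).__name__}: {exc}"
    return None


good = f"Try this:\n{FENCE}python\nx = 1 + 1\n{FENCE}"
bad = f"Try this:\n{FENCE}python\nx = 1 / 0\n{FENCE}"
print(check_snippets(good))
print(check_snippets(bad))
```

A non-`None` result plays the same role as the critique string that is fed back into the next attempt.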

In [None]:
@marvin.fn(model_kwargs={"model": "gpt-4-turbo-preview", "max_tokens": 4096})
async def expertly_answer(
    package: str, question: str, critiques: str | None = None
) -> str:
    """
    You are a world expert on the `package`.

    Please first concisely understand the `question`,
    then provide a response to the `question` in an
    applicable, useful manner that helps the user get unblocked.

    If critiques is provided, please ensure that the critique
    is addressed in the response, but do not mention it.

    If a code snippet is needed, please produce a single MCVE
    using best practices in backticks with the language specified;
    it must be fully copy/pastable and runnable.
    Everything should be made as simple as possible, but not simpler.
    Try not to have multiple snippets in one answer.

    Best practices include, using `pn.bind` instead of `param.watch`,
    `pn.state.onload` function for slow loading components, `pn.cache` for
    reusing computations, wrapping `hv.DynamicMap` around
    `pn.bind` for preventing HoloViews' plot from resetting zoom,
    Manage loading states effectively by encapsulating widget
    activation within a try/finally block, using `param.Parameterized`
    and `@pn.depends` for more involved applications,
    prioritizing `panel_obj.servable()` instead of `pn.serve(panel_obj)`,
    `template.show()` instead of `template.servable()`,
    functions on the top, widgets, layout, bindings on the bottom,
    and any other best practices that are relevant to the package.

    Link to https://panel.holoviz.org/tutorials/index.html for tutorials,
    https://panel.holoviz.org/reference/index.html for components / reference,
    https://panel.holoviz.org/getting_started/index.html for getting started,
    https://discourse.holoviz.org/ to ask other questions.
    """


class Review(BaseModel):

    chain_of_thought: str = Field(
        ...,
        description="A concise chain of thought when reviewing the answer",
    )

    requires_revision: bool = Field(
        ...,
        description="Whether the answer requires revision",
    )


@marvin.fn(model_kwargs={"model": "gpt-4-turbo-preview"})
async def review_answer(package: str, question: str, answer: str) -> Review:
    """
    You are a world-class reviewer of the `package`.
    You check whether the `answer` is valid and factually correct.

    """
    code_snippets = re.findall(r"```.*?\n(.*?)\n```", answer, re.DOTALL)

    if code_snippets:
        for code in code_snippets:
            try:
                exec(code)
            except Exception as exc:
                if "No module named" in str(exc):
                    return

                return Review(
                    chain_of_thought=f"Code snippet failed with error: {exc}",
                    requires_revision=True,
                )


async def pipe_question(package, question):
    critiques = None
    for _ in range(3):  # Retry up to 3 times
        answer = await expertly_answer(package, question, critiques=critiques)
        review = await review_answer(package, question, answer)
        if not review.requires_revision:
            return answer
        critiques = f"Answer requires revision: {review.chain_of_thought}"
    else:
        print(f"Failed to answer question: {question} due to {critiques}")


if REFRESH:  # toggle to False to prevent running
    answers = await asyncio.gather(
        *[pipe_question(package, question) for question in questions]
    )

## Export to JSONL

Next, we will export the dataset to a JSONL file to save our progress and later read it inside a Panel app.
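
For orientation, here is a minimal sketch of what a single line in the `openai` format looks like and how it survives a round trip. It uses the standard library's `json.dumps` to illustrate the escaping a JSONL line needs; the question and answer text are made up for the example.

```python
import json

record = {
    "messages": [
        {"role": "system", "content": "You are a world class expert on HoloViz Panel."},
        {"role": "user", "content": 'How do I show "hello"?'},
        {
            "role": "assistant",
            "content": "Use:\nimport panel as pn\npn.panel('hello').servable()",
        },
    ]
}

# One JSON object per line: newlines and quotes inside the content are escaped
line = json.dumps(record)
restored = json.loads(line)
print(line)
```

Because each record occupies exactly one physical line, the file can be appended to and read back line by line.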

In [None]:
import json


def format_message(data_format, question, answer):
    """Formats a single question/answer pair as one JSONL line."""
    system = f"You are a world class expert on {package}. Help the user get unblocked."

    if data_format == "openai":
        record = {
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
        }
    else:
        record = {"instruction": system, "input": question, "output": answer}

    # json.dumps escapes quotes, newlines, and backslashes, so every line is valid JSON
    return json.dumps(record) + "\n"

def write_dataset(filename, questions_answers, data_format, mode="w"):
    """Writes the dataset to a file in the specified format."""
    with open(filename, mode) as f:
        for question, answer in questions_answers:
            if answer is None:
                continue
            message = format_message(data_format, question, answer)
            f.write(message)

if REFRESH:
    data_format = "openai"
    filename = f"panel_dataset_{data_format}_base.jsonl"
    questions_answers = zip(questions, answers)

    write_dataset(filename, questions_answers, data_format)

## Review the dataset

Using a Panel app, we will review the dataset and make any necessary corrections.

In [None]:
import panel as pn
import json

pn.extension("code_editor", sizing_mode="stretch_both")


# Function to load Q&A data
def load_qa_data(filepath):
    with open(filepath, "r") as file:
        return [json.loads(line) for line in file.read().splitlines()]


# Function to save Q&A data
def save_qa_data(data, filepath):
    with open(filepath, "w") as file:
        for item in data:
            file.write(json.dumps(item) + "\n")


# Function to update the displayed Q&A based on the selected index
def update_display(index):
    messages = qa_data[index]["messages"]
    system_input.value = ""
    user_input.value = ""
    assistant_input.value = ""
    for message in messages:
        if message["role"] == "system":
            system_input.value = message["content"]
        elif message["role"] == "user":
            user_input.value = message["content"]
        elif message["role"] == "assistant":
            assistant_input.value = message["content"]
    return pn.Column(system_input, user_input, assistant_input)


# Save changes to the current Q&A
def save_changes(event):
    current_index = index_slider.value
    qa_data[current_index]["messages"] = [
        {"role": "system", "content": system_input.value},
        {"role": "user", "content": user_input.value},
        {"role": "assistant", "content": assistant_input.value},
    ]
    save_qa_data(qa_data, filename)
    print("Changes saved!")


async def update_answer(contents, user, instance):
    try:
        app_layout.loading = True
        answer = await expertly_answer(
            package,
            user_input.value,
            critiques=f"Revise based on {contents!r}\nTo revise:\n'''\n{assistant_input.value}\n'''",
        )
        assistant_input.value = answer
    finally:
        app_layout.loading = False


if REFRESH:
    # Load data
    filename = "panel_dataset_openai_added.jsonl"
    qa_data = load_qa_data(filename)

    # UI Components
    system_input = pn.widgets.TextAreaInput(
        name="System Message", max_length=1000, max_height=50
    )
    user_input = pn.widgets.TextAreaInput(
        name="User Question", max_length=1000, max_height=50
    )
    assistant_input = pn.widgets.CodeEditor(name="Assistant Answer", language="python")
    save_button = pn.widgets.Button(
        name="Save Changes", button_type="success", height=50, sizing_mode="stretch_width"
    )
    index_slider = pn.widgets.IntSlider(
        name="Message Index",
        start=0,
        end=len(qa_data) - 1,
        step=1,
        sizing_mode="stretch_width",
    )


    chat = pn.chat.ChatInterface(
        callback=update_answer,
        show_clear=False,
        show_undo=False,
        show_stop=False,
        show_rerun=False,
        show_avatar=False,
        show_button_name=False,
        width=450,
        height=650,
        help_text="Put your requests here to use AI to update the output."
    )
    save_button.on_click(save_changes)

    # Layout
    app_layout = pn.template.FastListTemplate(
        sidebar=[chat],
        main=[pn.Row(save_button, index_slider), pn.bind(update_display, index=index_slider)],
        theme="dark",
        sidebar_width=450,
    )

    # Serve the app
    app_layout.show()

## Related questions

Generate follow-up questions derived from the original questions to increase the size of the dataset.
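
Since each original question yields its own list of follow-ups, the result is a list of lists that may contain repeats. A minimal sketch of flattening and de-duplicating it before answering follows; `unique_new_questions` is a hypothetical helper and is not used in the cell below.

```python
def flatten(nested):
    """Turn a list of lists into a single flat list."""
    return [item for sublist in nested for item in sublist]


def unique_new_questions(original, related):
    """Keep related questions that are neither repeats nor already in `original`."""
    seen = {question.strip().lower() for question in original}
    unique = []
    for question in flatten(related):
        key = question.strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(question.strip())
    return unique


original = ["How do I install Panel?"]
related = [
    ["How do I install Panel?", "How do I upgrade Panel?"],
    ["how do i upgrade panel?", "How do I serve an app?"],
]
print(unique_new_questions(original, related))
```

Comparing on a lowercased, stripped key is a deliberately crude notion of "same question"; near-duplicates with different wording still pass through.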

In [None]:
import pandas as pd

if not REFRESH:
    data_format = "openai"
    filename = f"panel_dataset_{data_format}_base.jsonl"
    df = (
        pd.read_json(filename, lines=True)
        .explode("messages")["messages"]
        .apply(pd.Series)
        .query("role != 'system'")
    )
    questions = df.query("role == 'user'")["content"].tolist()
    answers = df.query("role == 'assistant'")["content"].tolist()


def flatten(l):
    return [item for sublist in l for item in sublist]


new_questions = await asyncio.gather(
    *[
        batch_ask(
            f"HoloViz Panel: what other details should I know about {question}",
        )
        for question in questions
    ]
)

data_format = "openai"
filename = f"panel_dataset_{data_format}_added.jsonl"
questions_answers = zip(questions, answers)

for n_questions in new_questions[1:]:
    print(n_questions)
    new_answers = await asyncio.gather(
        *[pipe_question(package, question) for question in n_questions]
    )
    new_questions_answers = zip(n_questions, new_answers)
    print(new_answers)
    write_dataset(filename, new_questions_answers, data_format, mode="a")

## Update the tone to how a user might ask

Many of the input queries are too formal. Let's update the tone to how a typical user might ask.

In [None]:
@marvin.fn(model_kwargs={"model": "gpt-4-turbo-preview"})
async def update_tone(question: str, tone: str) -> str:
    """
    Update the tone of the given question to the new tone;
    be sure to capture the essence of the tone, e.g. in
    capitalization, punctuation, and word choice. Also,
    make sure the updated question captures the original
    meaning of the question.
    """


toned_filename = "panel_dataset_openai_toned.jsonl"
for filename in ["panel_dataset_openai_base.jsonl", "panel_dataset_openai_added.jsonl"]:
    # Print any line that is not valid JSON before loading the file
    with open(filename) as file:
        for line in file:
            try:
                json.loads(line)
            except json.JSONDecodeError:
                print(line)
    df = (
        pd.read_json(filename, lines=True)
        .explode("messages")["messages"]
        .apply(pd.Series)
        .query("role != 'system'")
    )
    questions = df.query("role == 'user'")["content"].tolist()
    answers = df.query("role == 'assistant'")["content"].tolist()
    toned_questions = await asyncio.gather(
        *[
            update_tone(
                question, "messages like a close friend; casual, informal, using lowercase"
            )
            for question in questions
        ]
    )
    toned_questions_answers = zip(toned_questions, answers)
    write_dataset(toned_filename, toned_questions_answers, data_format, mode="a")

## Multi-turn conversation

So far, we have focused on breadth; now we will focus on depth. We will create a multi-turn conversation dataset by chaining the questions and answers together. This will allow us to fine-tune models that can handle multi-turn conversations.
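
As a starting point, chaining could be as simple as interleaving several question/answer pairs into one `messages` list. This is a minimal sketch under that assumption; `build_conversation` is a hypothetical helper, and how follow-up questions are chosen so that the turns actually relate to each other is left open.

```python
import json


def build_conversation(system_prompt, turns):
    """Build one multi-turn record from ordered (question, answer) pairs."""
    messages = [{"role": "system", "content": system_prompt}]
    for question, answer in turns:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}


conversation = build_conversation(
    "You are a world class expert on HoloViz Panel. Help the user get unblocked.",
    [
        ("how do i make a slider?", "Use `pn.widgets.IntSlider`."),
        ("and how do i react to it?", "Bind a function to it with `pn.bind`."),
    ],
)
line = json.dumps(conversation)  # one conversation per JSONL line
print(line)
```

Each conversation still occupies a single line, so it can be appended to the same kind of JSONL file as the single-turn examples.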

TODO...

## Fine-tune the model

When you have a large enough dataset (repeating the above steps as much as you like), you can fine-tune a model on it.

You can use the OpenAI platform to fine-tune a model on your dataset by uploading the joined JSONL file: https://platform.openai.com/finetune
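
Before uploading, it can help to check that every line parses and contains both sides of the exchange. The following is a minimal sketch of such a check, assuming the `openai` format used above; `validate_lines` is a hypothetical helper and does not replicate whatever validation the platform itself performs.

```python
import json


def validate_lines(lines):
    """Return (line_number, problem) pairs for lines that look malformed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        messages = record.get("messages", []) if isinstance(record, dict) else []
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if "user" not in roles or "assistant" not in roles:
            problems.append((number, "missing user or assistant message"))
    return problems


sample = [
    '{"messages": [{"role": "user", "content": "q"}, {"role": "assistant", "content": "a"}]}',
    '{"messages": [{"role": "user", "content": "q"}]}',
    "{not json",
]
print(validate_lines(sample))

# Hypothetical usage on the joined file:
# with open("panel_dataset_openai_joined.jsonl") as file:
#     print(validate_lines(file))
```

An empty result means every non-blank line parsed and had at least one user and one assistant message.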

In [None]:
filenames = [
    "panel_dataset_openai_base.jsonl",
    "panel_dataset_openai_added.jsonl",
    "panel_dataset_openai_toned.jsonl",
]
with open("panel_dataset_openai_joined.jsonl", "w") as outfile:
    for fname in filenames:
        with open(fname, "r") as infile:
            for line in infile:
                outfile.write(line)