<p align="center">
    <picture>
        <source alt="cometLLM" media="(prefers-color-scheme: dark)" srcset="https://github.com/comet-ml/comet-llm/raw/main/logo-dark.svg">
        <img alt="cometLLM" src="https://github.com/comet-ml/comet-llm/raw/main/logo.svg">
    </picture>
</p>
<p align="center">
    <a href="https://pypi.org/project/comet-llm">
        <img src="https://img.shields.io/pypi/v/comet-llm" alt="PyPI version"></a>
    <a rel="nofollow" href="https://opensource.org/license/mit/">
        <img alt="GitHub" src="https://img.shields.io/badge/License-MIT-blue.svg"></a>   
    <a href="https://www.comet.com/docs/v2/guides/large-language-models/overview/" rel="nofollow">
        <img src="https://img.shields.io/badge/cometLLM-Docs-blue.svg" alt="cometLLM Documentation"></a>
    <a rel="nofollow" href="https://pepy.tech/project/comet-llm">
        <img style="max-width: 100%;" src="https://static.pepy.tech/badge/comet-llm" alt="Downloads"></a>   
</p>

CometLLM is a new suite of LLMOps tools designed to help you effortlessly track and visualize your LLM prompts and chains. Use CometLLM to identify effective prompt strategies, streamline your troubleshooting, and ensure reproducible workflows.  

CometLLM complements Comet experiment tracking and production model management tools to arm LLM practitioners with everything they need to interact with, manage, and optimize their models with ease.  

👉 The best part? [It's 100% free to get started!](https://www.comet.com/signup/?utm_source=comet_llm&utm_medium=referral&utm_content=intro_colab&framework=llm)

__________
This guide covers some of the basic features for logging prompts to CometLLM.

For a preview of what's possible with CometLLM, head over to one of our example projects in the [public Comet workspace](https://www.comet.com/examples/comet-example-cometllm-prompts/prompts)!

# 🚧 Setup

In [1]:
%pip install -q comet_llm torch torchdata transformers datasets

Note: you may need to restart the kernel to use updated packages.


In [2]:
import comet_llm

import os
import time

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer
from transformers import GenerationConfig

If you don't already have a Comet account, [create one here for free](https://www.comet.com/signup/?utm_source=comet_llm&utm_medium=referral&utm_content=intro_colab&framework=llm) and grab your API credentials.

In [3]:
import comet_llm

comet_llm.init(project="comet-example-cometllm-prompts")

Now we're ready to start logging prompts to Comet! At the simplest level, logging a prompt to CometLLM requires just an input prompt and an output:

In [4]:
comet_llm.log_prompt(
    prompt="What is this conversation about?",
    output="A customer wants to return a purchase.",
)

Prompt logged to https://www.comet.com/examples/comet-example-cometllm-prompts


LLMResult(id='eb8a9107bac6442a8df077ae76300255', project_url='https://www.comet.com/examples/comet-example-cometllm-prompts')

It's really that simple! To check out your logged prompt in the Comet UI, click on the link above.

In most real-world scenarios, however, we'll want to log a lot more information than just the input and output. In the following examples we'll cover how to log a prompt with:

- 🗺 Instructions

- 📅 Metadata

- 🎓 In-context learning:

    - 🎾 One-shot inference

    - ⚽ 🏀 Few-shot inference

- 🎛 [Hyperparameter](https://www.comet.com/production/site/lp/your-ultimate-guide-to-hyperparameter-tuning/) configurations

# 🤖 Our application

For this tutorial, **imagine you lead the Customer Support team at your company. It's the end of the quarter, and you want to summarize all of the support issues your team has dealt with to identify some possible areas of improvement.**

We'll be using the [dialogsum dataset](https://huggingface.co/datasets/knkarthick/dialogsum) from Hugging Face, which consists of 13,460 short conversations with corresponding manually labeled summaries and topics.

To summarize these conversations, we'll use Hugging Face's implementation of [FLAN-T5](https://huggingface.co/google/flan-t5-base).

In [5]:
DATASET_NAME = "knkarthick/dialogsum"
MODEL_NAME = "google/flan-t5-base"

In [6]:
dataset = load_dataset(
    DATASET_NAME, split="test"
)  # [120, 255, 303, 321, 333, 348, 354]
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

## 🗺 Prompt with instructions

One of the most basic ways to improve our prompt is with a set of simple instructions. You may want to play around with the format and wording of these instructions to determine the prompt that best helps your model understand the task. Sometimes even slight rephrasings of a prompt can significantly alter the output.

This form of prompt engineering is typically the first one a practitioner tries because it's inexpensive and quickly shows whether you're on the right path.

In [7]:
def summarize_v1(user_prompt):
    inputs = tokenizer(user_prompt, return_tensors="pt")
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=50,
        )[0],
        skip_special_tokens=True,
    )

    comet_llm.log_prompt(prompt=user_prompt, output=output)
    return output

In [8]:
user_prompt = dataset["dialogue"][255]

prompt_template = """
Summarize the following conversation.

{user_prompt}

Summary:
    """
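
The template above is an ordinary Python string with a `{user_prompt}` placeholder, so it can be filled with `str.format`. As a minimal sketch of trying more than one instruction wording, the snippet below adds an `{instruction}` placeholder; that placeholder and the `build_prompt` helper are assumptions made for illustration, not part of this notebook. Both instruction phrasings appear earlier in this guide.

```python
# Instruction wordings to compare. The {instruction} placeholder is an
# illustrative assumption; the notebook hard-codes the first wording.
INSTRUCTION_VARIANTS = [
    "Summarize the following conversation.",
    "What is this conversation about?",
]

TEMPLATE = """
{instruction}

{user_prompt}

Summary:
"""


def build_prompt(instruction: str, user_prompt: str) -> str:
    """Fill the template's placeholders with str.format."""
    return TEMPLATE.format(instruction=instruction, user_prompt=user_prompt)


dialogue = (
    "#Person1#: Can I help you?\n"
    "#Person2#: I'd like to buy a new mobile phone please."
)

# One fully rendered prompt per instruction variant.
prompts = [build_prompt(variant, dialogue) for variant in INSTRUCTION_VARIANTS]
for prompt in prompts:
    print(prompt)
```

Each rendered prompt could then be passed to the model and logged, which makes it easy to compare wordings side by side.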

Let's take a look at what our first input from the dataset will look like:

In [9]:
print(user_prompt)

#Person1#: What's the matter with this computer?
#Person2#: I don't know, but it just doesn't work well. Whenever I start it, it stops running.
#Person1#: Have you asked Mr. Li for some advice?
#Person2#: Yes, I have, but he doesn't seem to be able to solve the problem, either. Can you help me?
#Person1#: Me? I know nothing more than playing computer games.
#Person2#: What shall I do? I have to finish this report this afternoon, but...
#Person1#: But why don't you ring up the repairmen? They will be able to settle the problem.
#Person2#: Yes, I'll ring them up.


Now let's see how well our model summarizes the conversation.

In [10]:
summarize_v1(user_prompt)

"Person1#: I'm sorry, but I can't help you."

Finally, let's compare this with our ground-truth label:

In [11]:
dataset["summary"][255]

'#Person2# finds that the computer has stopped running. #Person1# suggests #Person2# ring up the repairmen.'

[![S4JhD.gif](https://s6.gifyu.com/images/S4JhD.gif)](https://www.comet.com/examples/comet-example-cometllm-prompts/prompts)

**Note** that we can log and view our inputs and outputs in either **JSON** or **YAML** format.

Not bad, but we can do better! Let's try a few more prompt engineering techniques.

## 📅 Prompt with metadata

As we begin to alter, or "engineer," our prompts, we might also want to log some important metadata. If we're comparing the output of several models, we'll want to log which models produced which results. We may also want to play around with different hyperparameter values, or "generation configurations" (more on that later).

Some other relevant pieces of information to log might include:
- ⏰ How long does each prompt take to process? (duration)
- 🗣 Which task is the model performing? (summarization, text-generation, translation, etc.)
- 🏷 Do we have ground truth labels? (usually human-generated responses)
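
The questions above map naturally onto the fields of a logged record. The sketch below shows one way to time a call and collect that information before logging it. `timed_call` and `fake_summarize` are hypothetical helpers, the ground-truth string is a placeholder rather than a dataset label, and `time.perf_counter` is used here instead of the notebook's `time.time` because it is monotonic.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, duration in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    duration_ms = (time.perf_counter() - start) * 1000  # seconds -> milliseconds
    return result, duration_ms


def fake_summarize(text):
    """Hypothetical stand-in for a model call: returns the first sentence."""
    time.sleep(0.01)
    return text.split(".")[0] + "."


output, duration_ms = timed_call(
    fake_summarize, "I want my car washed. Do you wash windows?"
)

# Everything we might want to inspect later, gathered in one place.
record = {
    "output": output,
    "duration": duration_ms,
    "tags": ["summarization"],
    "metadata": {
        "task": "summarization",
        "ground_truth": "placeholder human-written summary",
    },
}
print(record)
```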

In [12]:
def summarize_v2(user_prompt, prompt_template, tags, metadata):
    start = time.time()

    variables = {"user_prompt": user_prompt}
    final_prompt = prompt_template.format(**variables)

    inputs = tokenizer(final_prompt, return_tensors="pt")
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=metadata["max_new_tokens"],
        )[0],
        skip_special_tokens=True,
    )

    duration = time.time() - start

    comet_llm.log_prompt(
        prompt=final_prompt,
        prompt_template=prompt_template,
        prompt_template_variables=variables,
        output=output,
        tags=tags,
        duration=duration * 1000,  # seconds -> milliseconds
        metadata=metadata,
    )

    return output

In [13]:
user_prompt = dataset["dialogue"][348]
ground_truth = dataset["summary"][348]

prompt_template = """
Summarize the following conversation.

{user_prompt}

Summary:
    """

Our input:

In [14]:
print(user_prompt)

#Person1#: What can I do for you?
#Person2#: I want to get my car washed.
#Person1#: Would you like regular car wash package?
#Person2#: I don't know what you mean.
#Person1#: Well, we will wash the exterior form top to bottom. We use a special shampoo, which gives the body that extra shine.
#Person2#: Do you wash windows?
#Person1#: Of course. We wash the windows inside and out.
#Person2#: What about the interior?
#Person1#: We use a vacuum cleaner that removes all the dirt, and we throw away all of the trash that we can find.
#Person2#: Sounds good, regular car wash package will be OK.
#Person1#: OK. I see.


In [15]:
METADATA = {
    "model": MODEL_NAME,
    "max_new_tokens": 50,
    "skip_special_tokens": True,
    "ground_truth": ground_truth,
}

TAGS = ["prompt-with-instructions", "summarization"]

Our output:

In [18]:
summarize_v2(user_prompt, prompt_template, TAGS, METADATA)

'#Person1#: I want to get my car washed.'

[![S4J9q.gif](https://s6.gifyu.com/images/S4J9q.gif)](https://www.comet.com/examples/comet-example-cometllm-prompts/prompts)

## 🎓 Prompt template with in-context learning

Once we've done everything we can to optimize the prompt instructions, we might choose to further improve performance by including examples in our prompt. This is called [in-context learning](https://towardsdatascience.com/all-you-need-to-know-about-in-context-learning-55bde1180610#:~:text=Now%2C%20we%20can%20give%20a,source).

In one-shot inference, we provide a single example within the prompt. In few-shot inference, we provide multiple examples within the prompt. Generally, if you need more than five or six examples to get the output you're looking for, you may want to consider fine-tuning your model or selecting a different model.

If our few-shot example doesn't perform much better than our one-shot example, we might consider using the one-shot example for better latency (good thing we're keeping track of that in our metadata!). We'll also have to be aware of our model's context window (in this case, 512 tokens), which limits how many examples we can provide.

For our use case, one "example" includes both a conversation (to be summarized) and an accurate summary of it (available in the ground-truth labels of our dataset).
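
To make the structure concrete, here is a minimal sketch of assembling a one-shot or few-shot prompt from (conversation, summary) pairs. The `build_few_shot_prompt` helper and the two toy example dialogues are invented for illustration and are not rows from the dataset; the notebook itself writes its templates out by hand.

```python
BLOCK = "Summarize the following conversation.\n\n{dialogue}\n\nSummary:\n{summary}"


def build_few_shot_prompt(examples, user_prompt):
    """Join k solved examples and one unsolved dialogue into a single prompt."""
    blocks = [BLOCK.format(dialogue=d, summary=s) for d, s in examples]
    # The final block leaves the summary empty for the model to complete.
    blocks.append(BLOCK.format(dialogue=user_prompt, summary=""))
    return "\n\n\n".join(blocks)


# Toy examples, made up for this sketch.
examples = [
    (
        "#Person1#: Is the shop open today?\n#Person2#: Yes, until six.",
        "#Person2# tells #Person1# the shop is open until six.",
    ),
    (
        "#Person1#: Can I pay by card?\n#Person2#: Sorry, cash only.",
        "#Person2# tells #Person1# that only cash is accepted.",
    ),
]
query = "#Person1#: Can I help you?\n#Person2#: I'd like to buy a new mobile phone please."

one_shot = build_few_shot_prompt(examples[:1], query)  # a single example
few_shot = build_few_shot_prompt(examples, query)      # multiple examples
print(few_shot)
```

Passing a slice of the example list switches between one-shot and few-shot prompts without touching the template.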

### 🎾 One-shot inference

We provide a single example within our prompt:

In [19]:
def summarize_v3(user_prompt, prompt_template, tags, metadata):
    start = time.time()

    variables = {
        "user_prompt": user_prompt,
        "example_1": example_1,
        "summary_1": summary_1,
    }
    final_prompt = prompt_template.format(**variables)

    inputs = tokenizer(final_prompt, return_tensors="pt")
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=metadata["max_new_tokens"],
        )[0],
        skip_special_tokens=True,
    )

    duration = time.time() - start

    comet_llm.log_prompt(
        prompt=final_prompt,
        prompt_template=prompt_template,
        prompt_template_variables=variables,
        output=output,
        tags=tags,
        duration=duration * 1000,  # seconds -> milliseconds
        metadata=metadata,
    )

    return output

⭐ **Note** that because the context window of FLAN-T5 is limited to 512 tokens, we specifically chose the shortest samples in the dataset. If you're experimenting with different examples at home, make sure that your total input sequence length doesn't exceed 512 tokens. If it does, the input will need to be truncated to fit the window, and the model won't see the part that is cut off.
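
One way to stay under the limit is to check the prompt length before calling the model and drop examples until it fits. The sketch below is an assumption-laden illustration: `fit_examples` is a hypothetical helper, and `whitespace_count` is a crude stand-in that counts words rather than real tokens. With the tokenizer loaded in this notebook, something like `lambda text: len(tokenizer(text)["input_ids"])` could be passed as `count_tokens` instead.

```python
def fit_examples(examples, user_prompt, count_tokens, max_tokens=512):
    """Drop trailing examples until the assembled prompt fits the budget.

    Returns (prompt, number_of_examples_kept). If even the bare query is
    too long, it is returned with zero examples and must be shortened.
    """
    kept = list(examples)
    while True:
        parts = [f"{d}\n\nSummary:\n{s}" for d, s in kept]
        parts.append(f"{user_prompt}\n\nSummary:\n")
        prompt = "\n\n".join(parts)
        if count_tokens(prompt) <= max_tokens or not kept:
            return prompt, len(kept)
        kept.pop()  # remove the last example and try again


def whitespace_count(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


# Synthetic inputs sized so that two examples overflow a 512-token budget.
examples = [
    (" ".join(["a"] * 200), "short one"),
    (" ".join(["b"] * 200), "short two"),
]
user_prompt = " ".join(["c"] * 150)

prompt, kept = fit_examples(examples, user_prompt, whitespace_count)
print(kept, whitespace_count(prompt))
```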

In [20]:
user_prompt = dataset["dialogue"][321]
ground_truth = dataset["summary"][321]
example_1 = dataset["dialogue"][120]
summary_1 = dataset["summary"][120]

prompt_template = """
Summarize the following conversation.

{example_1}

Summary:
{summary_1}


"""
prompt_template += """
Summarize the following conversation.

{user_prompt}

Summary:
"""

In [21]:
METADATA = {
    "model": MODEL_NAME,
    "max_new_tokens": 50,
    "skip_special_tokens": True,
    "ground_truth": ground_truth,
}

TAGS = ["one-shot-inference", "summarization"]

Our input:

In [22]:
print(user_prompt)

#Person1#: Mr. Wilson. We are very regretful about the mistakes in goods. I am very sorry and we will be responsible for the mistake.
#Person2#: We have no choice but to hold you responsible for the loss we sustained.
#Person1#: The first problem is supposed to be solved after the investigation. About the second problem, I admit it's our fault, so we will exchange all merchandise that falls short of our sample.
#Person2#: Well. I hope there won't be such things no more.
#Person1#: I can assure you that such a thing today will never happen again in future delivery. We have made the plan to improve the package of our exported goods.


Our output:

In [23]:
summarize_v3(user_prompt, prompt_template, TAGS, METADATA)

'The goods have been delivered incorrectly.'

[![S4JhP.gif](https://s6.gifyu.com/images/S4JhP.gif)](https://www.comet.com/examples/comet-example-cometllm-prompts/prompts)

### ⚽ 🏀 Few-shot inference
We provide multiple examples within our prompt:

In [24]:
def summarize_v4(user_prompt, prompt_template, tags, metadata):
    start = time.time()

    variables = {
        "user_prompt": user_prompt,
        "example_1": example_1,
        "summary_1": summary_1,
        "example_2": example_2,
        "summary_2": summary_2,
    }
    final_prompt = prompt_template.format(**variables)

    inputs = tokenizer(final_prompt, return_tensors="pt")
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            max_new_tokens=metadata["max_new_tokens"],
        )[0],
        skip_special_tokens=True,
    )

    duration = time.time() - start

    comet_llm.log_prompt(
        prompt=final_prompt,
        prompt_template=prompt_template,
        prompt_template_variables=variables,
        output=output,
        tags=tags,
        duration=duration * 1000,  # seconds -> milliseconds
        metadata=metadata,
    )

    return output

In [25]:
user_prompt = dataset["dialogue"][57]
ground_truth = dataset["summary"][57]
example_1 = dataset["dialogue"][555]
summary_1 = dataset["summary"][555]
example_2 = dataset["dialogue"][933]
summary_2 = dataset["summary"][933]

prompt_template = """
Summarize the following conversation.

{example_1}

Summary:
{summary_1}


Summarize the following conversation.

{example_2}

Summary:
{summary_2}



"""
prompt_template += """
Summarize the following conversation.

{user_prompt}

Summary:
"""

In [26]:
METADATA = {
    "model": MODEL_NAME,
    "max_new_tokens": 50,
    "skip_special_tokens": True,
    "ground_truth": ground_truth,
}

TAGS = ["few-shot-inference", "summarization"]

Our input:

In [27]:
print(user_prompt)

#Person1#: Can I help you?
#Person2#: I'd like to buy a new mobile phone please.
#Person1#: Ok, would you like a phone with camera and MP3 player?
#Person2#: Yes please. And I'd like to be able to make video calls too.


Our output:

In [28]:
summarize_v4(user_prompt, prompt_template, TAGS, METADATA)

"#Person1#: I'd like to buy a new mobile phone."

[![S4J6E.gif](https://s6.gifyu.com/images/S4J6E.gif)](https://www.comet.com/examples/comet-example-cometllm-prompts/prompts)

## 🎛 Optimizing generation configurations

In much the same way that we tune and optimize our hyperparameter values in traditional machine learning applications, in generative AI, we can tweak our "generation configuration."

Generation configuration parameters for this model include sampling methods, temperature, maximum new tokens, and more. The role of each of these parameters is beyond the scope of this tutorial, but generally speaking, we can think of the `temperature` as controlling the "creativity" of the model and the `sampling method` as controlling the "relevance" of the model.
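
To give some intuition for `temperature`, the sketch below applies the standard temperature-scaled softmax to a handful of made-up scores. It is a simplified, pure-Python illustration of the idea, not the model's actual decoding code, and the logits are invented for the example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for temperature in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

A low temperature concentrates nearly all of the probability on the top-scoring token, so sampling behaves almost deterministically. A higher temperature flattens the distribution, so less likely tokens are picked more often.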

Because we're reusing the few-shot prompt template, this time we only need to choose a new `user_prompt` (test prompt) and the examples that go with it.

In [29]:
user_prompt = dataset["dialogue"][1155]
ground_truth = dataset["summary"][1155]
example_1 = dataset["dialogue"][933]
summary_1 = dataset["summary"][933]
example_2 = dataset["dialogue"][555]
summary_2 = dataset["summary"][555]

In [30]:
def summarize_v5(user_prompt, prompt_template, tags, metadata):
    start = time.time()

    variables = {
        "user_prompt": user_prompt,
        "example_1": example_1,
        "summary_1": summary_1,
        "example_2": example_2,
        "summary_2": summary_2,
    }
    final_prompt = prompt_template.format(**variables)

    inputs = tokenizer(final_prompt, return_tensors="pt")
    output = tokenizer.decode(
        model.generate(
            inputs["input_ids"],
            generation_config=GenerationConfig(
                max_new_tokens=metadata["max_new_tokens"],
                do_sample=metadata["do_sample"],
                temperature=metadata["temperature"],
            ),
        )[0],
        skip_special_tokens=True,
    )

    duration = time.time() - start

    comet_llm.log_prompt(
        prompt=final_prompt,
        prompt_template=prompt_template,
        prompt_template_variables=variables,
        output=output,
        tags=tags,
        duration=duration * 1000,  # seconds -> milliseconds
        metadata=metadata,
    )

    return output

In [31]:
METADATA = {
    "model": MODEL_NAME,
    "max_new_tokens": 50,
    "skip_special_tokens": True,
    "do_sample": True,
    "temperature": 0.1,
    "ground_truth": ground_truth,
}

TAGS = ["optimizing-config", "summarization"]

Our input:

In [32]:
print(user_prompt)

#Person1#: I can't believe I still have this pain in my back. This medicine the doctor gave me was supposed to make me feel better by now.
#Person2#: Maybe you should start taking it three times a day like you were told.


Our output:

In [36]:
summarize_v5(user_prompt, prompt_template, TAGS, METADATA)

'The doctor gave me a prescription for a pain reliever.'

Note that we can also sort our rows by ascending or descending column values:

[![S4JRV.gif](https://s6.gifyu.com/images/S4JRV.gif)](https://www.comet.com/examples/comet-example-cometllm-prompts/prompts)

# 🔎 Prompt search

Prompt engineering is a highly iterative process, so you're likely to repeat these processes many, many times. To make it easier to sift through all of your prompts, CometLLM has a search feature that allows you to isolate experiment runs based on keywords.

Maybe we run our experiments a few dozen times (or maybe hundreds or thousands of times!). We notice a customer service ticket from Mr. Li with a very unhappy customer and want to make sure this isn't a pattern.

Simply select the prompt variable you'd like to search and the filtering operator you'd like to use. Type in your keyword and Comet will do the rest!

We can see that Mr. Li was only involved in one customer ticket, so it was probably a one-off situation.
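
The search itself happens in the Comet UI, but the idea (a field, an operator, and a keyword) is easy to picture with a local analogy. The sketch below filters a plain list of dictionaries. The `search` helper and its operator names are hypothetical and do not reflect Comet's API or implementation; the record contents are taken from the dialogues and outputs shown earlier in this guide.

```python
def search(records, field, operator, keyword):
    """Return the records whose field matches the keyword under the operator."""
    operators = {
        "contains": lambda value: keyword in value,
        "not_contains": lambda value: keyword not in value,
        "equals": lambda value: value == keyword,
    }
    matches = operators[operator]
    return [r for r in records if matches(r.get(field, ""))]


# A tiny local stand-in for logged prompts.
records = [
    {
        "user_prompt": "#Person1#: Have you asked Mr. Li for some advice?",
        "output": "Person1#: I'm sorry, but I can't help you.",
    },
    {
        "user_prompt": "#Person2#: I want to get my car washed.",
        "output": "#Person1#: I want to get my car washed.",
    },
]

hits = search(records, "user_prompt", "contains", "Mr. Li")
print(len(hits))
```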

_____

[![S4JRJ.gif](https://s6.gifyu.com/images/S4JRJ.gif)](https://www.comet.com/examples/comet-example-cometllm-prompts/prompts)
____

# 🎭 User Feedback

Prompt feedback is crucial for improving the overall quality of Large Language Model outputs. Once we've logged all of our prompt inputs and outputs, we can use the Comet UI to give and document human feedback. We can then sort the outputs by feedback score (0.0 or 1.0) to separate correct outputs from incorrect ones.

[![S4Jj6.gif](https://s6.gifyu.com/images/S4Jj6.gif)](https://www.comet.com/examples/comet-example-cometllm-prompts/prompts)

# 📓 Additional Resources

- [Read our full CometLLM announcement](https://heartbeat.comet.ml/organize-your-prompt-engineering-with-cometllm-66e390ef6645)
- [Read Comet's prompt engineering blog post](https://heartbeat.comet.ml/organize-your-prompt-engineering-with-cometllm-66e390ef6645)
- [Check out our GitHub repo and give us a star](https://github.com/comet-ml/comet-llm)
- [Connect with us on our Community Slack channel](https://cometml.slack.com/join/shared_invite/enQtMzM0OTMwNTQ0Mjc5LWE4NzcxMzdiMmFjYzEzM2E5OTczOTk1MDZmZDg2MGJmODUwYWI0YWQ0YWMyMjlmMjQ5YmVmNzEyYjNlNzFhNjQ#/shared-invite/email)