# Kwansi Example Implementation

This notebook demonstrates the use of Kwansi, a wrapper for DSPy that makes its optimizers easier to use. We'll walk through the example implementation provided in `example_implementation.py`.

## Introduction

Kwansi is a wrapper for [DSPy](https://dspy-docs.vercel.app), a framework for programming - not prompting - language models. DSPy is a modular tool that structures the way we work with language models (LMs) in complex systems.

Kwansi builds upon DSPy's capabilities with a focus on making its optimizers easier to use - especially if you are using LLM-based evaluators. You can find more information about Kwansi and its implemented functions here: https://github.com/lordamp/kwansi

## Example

We have written a simple prompt for creating tweets based on a given topic and details. We want to optimize this prompt.

## Setup

First, make sure you have all required packages (including the kwansi library) installed by running

`pip install -r requirements.txt`.

Then, set up your OpenAI API key in a `.env` file:
```
OPENAI_API_KEY=<your-key>
```

## Importing Libraries and Loading Environment Variables

First, let's import the necessary libraries.

In [24]:
import dspy
import os
from dotenv import load_dotenv
import json

# Import the functions from the kwansi package
from kwansi.data_preparation import prepare_examples
from kwansi.evaluation import create_evaluator
from kwansi.optimizer_handling import run_optimizer, save_optimized_model
from kwansi.task_creation import create_task
from kwansi.testing import test_model

## Load Environment Variables

Now, let's load the environment variables, so our LLM API key is loaded.

In [25]:
load_dotenv()

True

## Load Components

Next, we load the components we'll need for our task.

In [26]:
from components.task import TweetCreatorSignature
from components.assessors import Assess_Interestingness, Assess_StyleAppropriateness
from components.metrics import length_metric, hashtag_count_metric
from components.custom_combiner import custom_combine

Let's check out what we have here.

### Task

In [27]:
print(TweetCreatorSignature)

TweetCreatorSignature(topic, details -> tweet
    instructions="As a social media expert, craft engaging and informative tweets based on given topics and details. Your tweets should be concise, attention-grabbing, and within the 280-character limit. Capture the essence of the topic, use language that sparks curiosity or emotion, and add a call to action when appropriate. Adjust the tone to suit the topic, convey information clearly, and emphasize timeliness if relevant. Your tweet should inform, engage, and encourage further interaction or exploration of the topic. Don't use hashtags."
    topic = Field(annotation=str required=True json_schema_extra={'desc': 'the main subject of the tweet', '__dspy_field_type': 'input', 'prefix': 'Topic:'})
    details = Field(annotation=str required=True json_schema_extra={'desc': 'additional information or context for the tweet', '__dspy_field_type': 'input', 'prefix': 'Details:'})
    tweet = Field(annotation=str required=True json_schema_extra={'de

The `TweetCreatorSignature` is a "Signature" for our "Task", a custom DSPy class that contains the prompt we're trying to optimize, its input fields, and output field. It has the elements:

- `instructions`: The instructions for the task (the actual prompt)
- `topic` and `details`: Input fields the prompt expects
- `tweet`: The output field the prompt produces (a tweet)

This is an unoptimized version of the prompt, which is also why it's called the "Student" in the code.

### Assessors

Next, we have our assessors:

In [28]:
print(Assess_Interestingness)
print("-"*100)
print(Assess_StyleAppropriateness)

Assess_Interestingness(tweet -> score
    instructions='Assess how interesting and engaging the tweet is on a scale of 0 (very uninteresting) to 10 (highly interesting).'
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Tweet:', 'desc': '${tweet}'})
    score = Field(annotation=str required=True json_schema_extra={'desc': "A score between 0 and 10 in the format 'Score: X'", '__dspy_field_type': 'output', 'prefix': 'Score:'})
)
----------------------------------------------------------------------------------------------------
Assess_StyleAppropriateness(tweet, topic -> score
    instructions='Assess if the style of the tweet is appropriate for the topic and platform on a scale of 0 (very inappropriate) to 10 (perfectly appropriate).'
    tweet = Field(annotation=str required=True json_schema_extra={'__dspy_field_type': 'input', 'prefix': 'Tweet:', 'desc': '${tweet}'})
    topic = Field(annotation=str required=True json_schema_ex

These are again two DSPy Signatures, i.e. prompts. They have a special role: they evaluate the quality of the output produced by the task we are trying to optimize.

- **Assess_Interestingness**: How interesting is the tweet?
- **Assess_StyleAppropriateness**: How appropriate is the style of the tweet?

Both assessors have an instructions field, as well as input fields the prompt expects and an output field that the prompt produces. The `Assess_Interestingness` assessor expects a `tweet` field as input, and produces a `score` field as output. This score will be between 0 and 10. The `Assess_StyleAppropriateness` assessor expects two fields as input: a `tweet` and a `topic`, and also produces a `score`.
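
Since the assessors return their score as text in the format `Score: X`, it has to be turned into a number before it can be combined with other scores. The following is a minimal sketch of how such a reply could be parsed and scaled to a 0-1 range; `parse_score` is a hypothetical helper written for this illustration, not a Kwansi function, and the scaling to 0-1 is an assumption.

```python
import re


def parse_score(raw: str, score_range=(0, 10)) -> float:
    """Extract the first number from an assessor reply and scale it to 0-1.

    Hypothetical helper for illustration; Kwansi's own parsing may differ.
    """
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no numeric score found in {raw!r}")
    low, high = score_range
    # Clamp to the allowed range in case the LLM answers outside of it.
    value = min(max(float(match.group()), low), high)
    return (value - low) / (high - low)


print(parse_score("Score: 9"))
print(parse_score("7"))
print(parse_score("Score: 15"))  # clamped to the top of the range
```

Clamping is a design choice here: an out-of-range answer is treated as the nearest valid score instead of raising an error.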

### Metrics

Next, we have two metrics:

In [29]:
import inspect

print(inspect.getsource(length_metric))
print("-"*100)
print(inspect.getsource(hashtag_count_metric))

def length_metric(example, pred):
    """
    Check if the tweet is below the 280 character limit.
    Returns 1 if the tweet is below the limit, 0 if it is not.
    """
    return 1 if len(pred.tweet) <= 280 else 0

----------------------------------------------------------------------------------------------------
def hashtag_count_metric(example, pred):
    """
    Count hashtags in the tweet.
    Returns 1 if there are no hashtags, 0 if there are any hashtags.
    """
    tweet = pred.tweet
    hashtag_count = tweet.count('#')
    
    return 1 if hashtag_count == 0 else 0



These are two simple metrics. `length_metric` checks if the tweet is within the 280-character limit, and `hashtag_count_metric` checks if there are any hashtags in the tweet. They both return a binary score - 1 if the condition is met (280 characters or fewer, no hashtags), 0 if it is not (more than 280 characters, hashtags present).

This means our task gets evaluated and hence optimized for:
- Being interesting
- Being appropriate
- Being below the character limit
- Having no hashtags
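
To see the two metrics in action without calling a language model, they can be applied to hand-written predictions. The sketch below repeats the two functions from above and uses `SimpleNamespace` as a stand-in for a DSPy prediction object; the sample tweets are made up for this illustration.

```python
from types import SimpleNamespace


def length_metric(example, pred):
    """1 if the tweet has 280 characters or fewer, else 0."""
    return 1 if len(pred.tweet) <= 280 else 0


def hashtag_count_metric(example, pred):
    """1 if the tweet contains no '#', else 0."""
    return 1 if pred.tweet.count('#') == 0 else 0


# Stand-ins for predictions; only the `tweet` attribute is needed.
samples = {
    "clean": SimpleNamespace(tweet="Proteins from 80-million-year-old fossils!"),
    "tagged": SimpleNamespace(tweet="Big launch today! #ERP"),
    "too_long": SimpleNamespace(tweet="x" * 281),
}

# The metrics never read `example`, so None is passed here.
results = {
    name: (length_metric(None, pred), hashtag_count_metric(None, pred))
    for name, pred in samples.items()
}
print(results)
```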

### Custom Combiner

Finally, we have another function:

In [30]:
print(inspect.getsource(custom_combine))

def custom_combine(scores):
    # Weighted combination of first two scores, multiplied by the other two
    weights = [0.7, 0.3]
    weighted_sum = sum(scores[i][1] * weights[i] for i in range(2))
    return weighted_sum * scores[2][1] * scores[3][1]



This function takes the weighted sum of the first two scores, and multiplies it by the other two scores. This is a custom combiner, which is a function that takes the scores from the assessors and metrics and combines them in a custom way. We need this because we have four scores, and can only have one KPI to optimize our task for.

As you will see, a custom combiner is optional. You can also tell the evaluator to use a simple additive or multiplicative method via a keyword.
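
As a worked example of the weighted combination, the sketch below calls `custom_combine` on hand-picked scores. The indexing `scores[i][1]` suggests that each entry is a `(name, value)` pair; that input shape and the sample values (already scaled to 0-1) are assumptions made for this illustration.

```python
def custom_combine(scores):
    # Weighted combination of first two scores, multiplied by the other two
    weights = [0.7, 0.3]
    weighted_sum = sum(scores[i][1] * weights[i] for i in range(2))
    return weighted_sum * scores[2][1] * scores[3][1]


# Assumed input shape: (name, value) pairs in the order they were declared.
scores = [
    ("Interestingness", 0.8),
    ("Style_Appropriateness", 0.6),
    ("Length_Check", 1),
    ("Hashtag_Count", 1),
]

# 0.7 * 0.8 + 0.3 * 0.6 = 0.74, multiplied by 1 * 1
combined = custom_combine(scores)

# A single failed binary metric zeroes the whole score.
with_hashtag = custom_combine(scores[:3] + [("Hashtag_Count", 0)])

print(round(combined, 4), with_hashtag)
```

Multiplying by the binary metrics turns them into hard constraints: a tweet with hashtags scores 0 no matter how interesting it is.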

Next, we'll take a look at the data we'll use to optimize our task.

## Load and Prepare Data

In our case, we load the data directly from a JSON file. You might want to load it from a different format, or even a database, but in the end, we'll need JSON data to pass to DSPy.

In [31]:
# Load the data - make sure to transform to JSON format
with open('data/example_data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

What does the data look like?

In [32]:
print("Keys in the data dictionary:")
print(list(data.keys()))
print("-"*100)
print("First example from 'tweet_instructions':")
print(json.dumps(data['tweet_instructions'][0], indent=2))
print("-"*100)
print("Number of examples in 'tweet_instructions':")
print(len(data['tweet_instructions']))



Keys in the data dictionary:
['tweet_instructions']
----------------------------------------------------------------------------------------------------
First example from 'tweet_instructions':
{
  "id": 1,
  "topic": "Product launch",
  "details": "software suite for a very user friendly ERP system, Danish company called ErstHagen, by January 2025",
  "author": "Emma Carsten"
}
----------------------------------------------------------------------------------------------------
Number of examples in 'tweet_instructions':
100


The data is a dictionary with a key "tweet_instructions", which contains a list of dictionaries. Each dictionary represents an example, with keys "id", "topic", "details" and "author". We have 100 examples in total.

We can now use this data to define the input fields for the task. DSPy expects each data point to be an `Example` object, so we need to turn our list of dictionaries into a list of `Example`s. In our case, each one carries the keys "topic" and "details". We don't need the "id" and "author" fields for the task, so we can exclude them.

We can now prepare the examples, but we'll only take 50 of the 100 to keep the optimization run short.

In [33]:
input_fields = {
    'data_key': 'tweet_instructions',  # the key in the JSON file that contains the data
    'fields': ['topic', 'details']  # the fields in the JSON file that are the input to the task
}

tweet_examples = prepare_examples(data, input_fields, n_samples=50)

What do they look like?

In [34]:
print(f"Number of examples prepared: {len(tweet_examples)}")
print("Sample example:")
print(tweet_examples[0])

Number of examples prepared: 50
Sample example:
Example({'topic': 'Jurassic Park proteins', 'details': 'proteins extracted from 80-million-year-old dinosaur fossils, oldest biomolecules ever found'}) (input_keys={'topic', 'details'})
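
For intuition, the field selection and sampling can be sketched in plain Python. This is not Kwansi's implementation: `select_fields` is a hypothetical helper, random sampling is an assumption (suggested by the first prepared example differing from the first record in the file), and the toy records below are placeholders, not the real data set.

```python
import random


def select_fields(data, input_fields, n_samples, seed=0):
    """Sample records under `data_key` and keep only the listed fields.

    Hypothetical stand-in for the preparation step, for illustration only.
    """
    records = data[input_fields['data_key']]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = rng.sample(records, min(n_samples, len(records)))
    return [{field: rec[field] for field in input_fields['fields']}
            for rec in sampled]


# Toy placeholder data with the same shape as the JSON file.
toy_data = {
    'tweet_instructions': [
        {'id': i, 'topic': f'Topic {i}', 'details': f'Details {i}',
         'author': 'Placeholder Author'}
        for i in range(1, 101)
    ]
}
toy_input_fields = {'data_key': 'tweet_instructions',
                    'fields': ['topic', 'details']}

subset = select_fields(toy_data, toy_input_fields, n_samples=50)
print(len(subset), sorted(subset[0].keys()))
```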


## Create the Task

Next, we create the task we want to optimize. For this, we use our task signature from `components/task.py` we had loaded earlier.

Our task is called `TweetCreator`, and the DSPy module we're using is `ChainOfThought`, because we want our LLM to think through the task before responding. Alternatively, we could have used `Predict`, a simpler module that responds directly without an intermediate reasoning step.


In [35]:
TweetCreator = create_task(TweetCreatorSignature, 'ChainOfThought')
tweet_creator = TweetCreator()

print("Task created:")
print(tweet_creator)

Task created:
executor = Predict(StringSignature(topic, details -> reasoning, tweet
    instructions="As a social media expert, craft engaging and informative tweets based on given topics and details. Your tweets should be concise, attention-grabbing, and within the 280-character limit. Capture the essence of the topic, use language that sparks curiosity or emotion, and add a call to action when appropriate. Adjust the tone to suit the topic, convey information clearly, and emphasize timeliness if relevant. Your tweet should inform, engage, and encourage further interaction or exploration of the topic. Don't use hashtags."
    topic = Field(annotation=str required=True json_schema_extra={'desc': 'the main subject of the tweet', '__dspy_field_type': 'input', 'prefix': 'Topic:'})
    details = Field(annotation=str required=True json_schema_extra={'desc': 'additional information or context for the tweet', '__dspy_field_type': 'input', 'prefix': 'Details:'})
    reasoning = Field(annotatio

## Define the Evaluator

We have the task, we have the data, now the last ingredient we need is our evaluator.

In [36]:
tweet_evaluator = create_evaluator(
    assessors=[
        ('Interestingness', Assess_Interestingness, {'tweet': 'tweet'}, (0, 10)),
        ('Style_Appropriateness', Assess_StyleAppropriateness, {'tweet': 'tweet', 'topic': 'topic'}, (0, 10))
    ],
    additional_metrics=[
        ('Length_Check', length_metric),
        ('Hashtag_Count', hashtag_count_metric)
    ],
    combine_method="multiplicative",
    threshold=0.25
)

Let's take this function call apart.

First, we define the evaluator `tweet_evaluator` with the `create_evaluator` function. We pass the following arguments:

1. `assessors`: Our list of assessors. Each entry contains the name of the assessor (`'Interestingness'` and `'Style_Appropriateness'`), the assessor itself (`Assess_Interestingness`, `Assess_StyleAppropriateness`), the input fields the assessor expects (`tweet` for both, `topic` for `Assess_StyleAppropriateness`), and the range of scores the assessor can produce (0-10).

2. `additional_metrics`: Our list of metrics. Each entry pairs a name with a metric function: `Length_Check` (referring to `length_metric`) and `Hashtag_Count` (referring to `hashtag_count_metric`).

3. `combine_method`: The method to combine the assessor scores and additional metrics into a single score. We use `"multiplicative"` here, but could also use `"additive"` or our function `custom_combine`.

4. `threshold`: The minimum score for the evaluator to be considered passing. While optimizing, if the score of one of our examples is below that value, it will be abandoned. The value depends on the range of scores our assessors can produce. Here, our value is 0.25, which is based on the fact that our two assessors are multiplied. If they both scored 0.5, the evaluator would score 0.25. Note that we ignore the metrics here, as in a multiplicative method, they are either 1 (pass) or 0 (fail).
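
The threshold arithmetic can be checked with a few lines of Python. This is a sketch of the multiplicative idea only, assuming assessor scores have already been scaled from 0-10 to 0-1; `multiplicative_score` is a hypothetical helper, not Kwansi's code.

```python
import math


def multiplicative_score(assessor_scores, metric_scores):
    """Multiply all scaled assessor scores and binary metric results."""
    return math.prod(assessor_scores) * math.prod(metric_scores)


THRESHOLD = 0.25

# Both assessors at 5/10, both binary metrics passed: exactly at the threshold.
borderline = multiplicative_score([0.5, 0.5], [1, 1])

# Both assessors at 9/10, both metrics passed: about 0.81.
strong = multiplicative_score([0.9, 0.9], [1, 1])

# Same assessor scores, but one binary metric failed: the product is 0.
failed_metric = multiplicative_score([0.9, 0.9], [1, 0])

for label, score in [("borderline", borderline), ("strong", strong),
                     ("failed_metric", failed_metric)]:
    print(label, round(score, 4), "pass" if score >= THRESHOLD else "fail")
```

With an additive method the same threshold would mean something different, so it is worth recomputing the threshold whenever the combine method changes.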

## Define the Language Model

Everything we have defined so far is independent of the language model we use. We can now choose one and define it. We'll use `gpt-4o-mini` for this example, but you can choose any other model that is compatible with DSPy. Note that kwansi has not yet been tested with LLMs other than OpenAI's.

In [37]:
# Define the language model DSPy will use
dspy.settings.configure(lm=dspy.LM(
    model='gpt-4o-mini',
    api_key=os.environ['OPENAI_API_KEY'],
    max_tokens=1024
))

## Run the Optimizer

Now, let's run the optimizer to improve our task. We pass the following arguments:

1. `optimizer_type`: The type of optimizer to use. DSPy offers a variety of optimizers (including finetuning); in kwansi, we have implemented a few of them:
    - `BootstrapFewShot`: Based on BootstrapFewShot optimizer from DSPy, but uses LLM-based evaluators.
    - `BootstrapFewShotWithRandomSearch`: Based on BootstrapFewShot, this optimizer uses random search to further optimize the task.
    - `COPRO`: Optimizes only the prompt without few-shot examples.
    - `MIPROv2`: Optimizes both the prompt and the few-shot examples.
    - `MIPROv2ZeroShot`: Like MIPROv2, but optimizes only the prompt without few-shot examples.

    Please check out the DSPy documentation (["Which optimizer should I use?"](https://dspy-docs.vercel.app/docs/building-blocks/optimizers#which-optimizer-should-i-use)) for more information on the optimizers and the number of examples they need.

2. `evaluator`: The evaluator to optimize for. We use `tweet_evaluator` we defined earlier.

3. `student`: Our task, `tweet_creator`.

4. `trainset`: Our training data, `tweet_examples`. Note that the evaluation set is automatically extracted from the training data.

As we have 50 examples, `BootstrapFewShotWithRandomSearch` is a good choice.

In [38]:
# Run the optimizer (i.e. the process of optimizing the task)
optimized_tweet_creator, optimizer_type = run_optimizer(
    optimizer_type='BootstrapFewShotWithRandomSearch',
    evaluator=tweet_evaluator,
    student=tweet_creator,
    trainset=tweet_examples,
)

Going to sample between 1 and 4 traces per predictor.
Will attempt to bootstrap 16 candidate sets.


Average Metric: 44.030000000000015 / 50  (88.1): 100%|██████████| 50/50 [00:03<00:00, 16.66it/s]


New best score: 88.06 for seed -3
Scores so far: [88.06]
Best score so far: 88.06


Average Metric: 44.02999999999999 / 50  (88.1): 100%|██████████| 50/50 [00:00<00:00, 2039.67it/s] 


Scores so far: [88.06, 88.06]
Best score so far: 88.06


 10%|█         | 5/50 [00:00<00:00, 1713.36it/s]


Bootstrapped 4 full traces after 6 examples in round 0.


Average Metric: 43.725999999999985 / 50  (87.5): 100%|██████████| 50/50 [00:05<00:00,  8.88it/s]


Scores so far: [88.06, 88.06, 87.45]
Best score so far: 88.06


  8%|▊         | 4/50 [00:00<00:00, 1359.25it/s]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 44.248 / 50  (88.5): 100%|██████████| 50/50 [00:05<00:00,  8.61it/s]            


New best score: 88.5 for seed 0
Scores so far: [88.06, 88.06, 87.45, 88.5]
Best score so far: 88.5


  4%|▍         | 2/50 [00:00<00:00, 342.92it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 44.57 / 50  (89.1): 100%|██████████| 50/50 [00:06<00:00,  7.86it/s]             


New best score: 89.14 for seed 1
Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14]
Best score so far: 89.14


  2%|▏         | 1/50 [00:00<00:00, 360.18it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 43.75799999999999 / 50  (87.5): 100%|██████████| 50/50 [00:05<00:00,  9.52it/s] 


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52]
Best score so far: 89.14


  4%|▍         | 2/50 [00:00<00:00, 463.15it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 45.19199999999999 / 50  (90.4): 100%|██████████| 50/50 [00:04<00:00, 10.68it/s] 


New best score: 90.38 for seed 3
Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38]
Best score so far: 90.38


  4%|▍         | 2/50 [00:00<00:00, 612.75it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 43.42199999999998 / 50  (86.8): 100%|██████████| 50/50 [00:05<00:00,  8.66it/s] 


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84]
Best score so far: 90.38


  6%|▌         | 3/50 [00:00<00:00, 504.41it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


Average Metric: 44.00799999999999 / 50  (88.0): 100%|██████████| 50/50 [00:05<00:00,  8.63it/s] 


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02]
Best score so far: 90.38


  2%|▏         | 1/50 [00:00<00:00, 436.91it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 44.70199999999998 / 50  (89.4): 100%|██████████| 50/50 [00:06<00:00,  7.23it/s] 


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02, 89.4]
Best score so far: 90.38


  6%|▌         | 3/50 [00:00<00:00, 534.90it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


Average Metric: 45.04199999999999 / 50  (90.1): 100%|██████████| 50/50 [00:05<00:00,  9.74it/s] 


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02, 89.4, 90.08]
Best score so far: 90.38


  4%|▍         | 2/50 [00:00<00:00, 981.01it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 42.12399999999998 / 50  (84.2): 100%|██████████| 50/50 [00:07<00:00,  7.07it/s] 


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02, 89.4, 90.08, 84.25]
Best score so far: 90.38


  8%|▊         | 4/50 [00:00<00:00, 2974.16it/s]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 45.07999999999998 / 50  (90.2): 100%|██████████| 50/50 [00:06<00:00,  7.44it/s] 


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02, 89.4, 90.08, 84.25, 90.16]
Best score so far: 90.38


  2%|▏         | 1/50 [00:00<00:00, 2428.66it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 43.65 / 50  (87.3): 100%|██████████| 50/50 [00:05<00:00,  9.46it/s]             


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02, 89.4, 90.08, 84.25, 90.16, 87.3]
Best score so far: 90.38


  8%|▊         | 4/50 [00:00<00:00, 2897.62it/s]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 43.95199999999998 / 50  (87.9): 100%|██████████| 50/50 [00:05<00:00,  9.88it/s] 


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02, 89.4, 90.08, 84.25, 90.16, 87.3, 87.9]
Best score so far: 90.38


  8%|▊         | 4/50 [00:00<00:00, 1153.87it/s]


Bootstrapped 4 full traces after 5 examples in round 0.


Average Metric: 43.36399999999999 / 50  (86.7): 100%|██████████| 50/50 [00:04<00:00, 10.16it/s] 


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02, 89.4, 90.08, 84.25, 90.16, 87.3, 87.9, 86.73]
Best score so far: 90.38


  6%|▌         | 3/50 [00:00<00:00, 2882.68it/s]


Bootstrapped 3 full traces after 4 examples in round 0.


Average Metric: 43.43600000000001 / 50  (86.9): 100%|██████████| 50/50 [00:04<00:00, 10.35it/s] 


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02, 89.4, 90.08, 84.25, 90.16, 87.3, 87.9, 86.73, 86.87]
Best score so far: 90.38


  2%|▏         | 1/50 [00:00<00:00, 2360.33it/s]


Bootstrapped 1 full traces after 2 examples in round 0.


Average Metric: 42.387999999999984 / 50  (84.8): 100%|██████████| 50/50 [00:04<00:00, 10.86it/s]


Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02, 89.4, 90.08, 84.25, 90.16, 87.3, 87.9, 86.73, 86.87, 84.78]
Best score so far: 90.38


  4%|▍         | 2/50 [00:00<00:00, 2741.38it/s]


Bootstrapped 2 full traces after 3 examples in round 0.


Average Metric: 44.89999999999998 / 50  (89.8): 100%|██████████| 50/50 [00:05<00:00,  8.84it/s] 

Scores so far: [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02, 89.4, 90.08, 84.25, 90.16, 87.3, 87.9, 86.73, 86.87, 84.78, 89.8]
Best score so far: 90.38
19 candidate programs found.





Depending on the optimizer you choose, you will see different iterations of the optimizer and the scores. 
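
To read the log above, it helps to map each entry of the final "Scores so far" list back to its seed. The log reports the first score for seed -3 and later bests for seeds 0, 1 and 3, which is consistent with candidates being numbered from -3 upward; the sketch below relies on that reading of the log, not on DSPy internals.

```python
# Final "Scores so far" list, copied from the log above.
scores = [88.06, 88.06, 87.45, 88.5, 89.14, 87.52, 90.38, 86.84, 88.02,
          89.4, 90.08, 84.25, 90.16, 87.3, 87.9, 86.73, 86.87, 84.78, 89.8]

FIRST_SEED = -3  # assumption read off the log: list position 0 is seed -3

best_index = max(range(len(scores)), key=scores.__getitem__)
best_seed = best_index + FIRST_SEED
gain = scores[best_index] - scores[0]

print(f"{len(scores)} candidates, best score {scores[best_index]} "
      f"for seed {best_seed}, {gain:.2f} points above the first candidate")
```

The count also adds up: three candidates with negative seeds plus the 16 bootstrapped candidate sets give the 19 candidate programs reported at the end of the log.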

## Save the Optimized Model

Our model is now optimized. We can save it for future use (don't forget that, or you'll have to optimize it again). We save it in the `output` folder, and name it `tweet_creator`. The saving function automatically includes the optimizer type and datetime in the filename.

In [39]:
save_optimized_model(optimized_tweet_creator, optimizer_type, folder='output', name='tweet_creator')

[('executor', Predict(StringSignature(topic, details -> reasoning, tweet
    instructions="As a social media expert, craft engaging and informative tweets based on given topics and details. Your tweets should be concise, attention-grabbing, and within the 280-character limit. Capture the essence of the topic, use language that sparks curiosity or emotion, and add a call to action when appropriate. Adjust the tone to suit the topic, convey information clearly, and emphasize timeliness if relevant. Your tweet should inform, engage, and encourage further interaction or exploration of the topic. Don't use hashtags."
    topic = Field(annotation=str required=True json_schema_extra={'desc': 'the main subject of the tweet', '__dspy_field_type': 'input', 'prefix': 'Topic:'})
    details = Field(annotation=str required=True json_schema_extra={'desc': 'additional information or context for the tweet', '__dspy_field_type': 'input', 'prefix': 'Details:'})
    reasoning = Field(annotation=str requi

## Test the Optimized Model

This step is optional, but we want to make sure our model is working as expected. For this, we can use the `test_model` function. We pass the following arguments:

1. `model`: Our optimized model, `optimized_tweet_creator`.
2. `test_data`: Our training data, `tweet_examples`.
3. `n_tests`: The number of tests to run. We use 1.
4. `input_fields`: The input fields the model expects. We use `topic` and `details`.
5. `output_field`: The output field the model produces. We use `tweet`.
6. `evaluator`: The evaluator to use. We use `tweet_evaluator` we defined earlier.
7. `verbose`: Whether to print a long or short version of the results. We want to see the details, so we set `verbose=True`.


In [40]:
# Test the Optimized Program (use verbose=True for more details)
print("Short Test Output:")
test_model(
    model=optimized_tweet_creator,
    test_data=tweet_examples,
    n_tests=1,
    input_fields=['topic', 'details'],
    output_field='tweet',
    evaluator=tweet_evaluator,
    verbose=True
)

Short Test Output:
Test 1:
Topic: Jurassic Park proteins
Details: proteins extracted from 80-million-year-old dinosaur fossils, oldest biomolecules ever found
Generated tweet: Incredible discovery! Scientists have extracted proteins from 80-million-year-old dinosaur fossils, marking the oldest biomolecules ever found. 🦖 What secrets of the past could these ancient proteins reveal? Dive into the science of our prehistoric world!
Evaluator scores:
  Interestingness: 0.9000
  Style_Appropriateness: 0.9000
  Length_Check: 1
  Hashtag_Count: 1
  Total_Score: 0.8100
--------------------------------------------------


## Iterate!

It's likely that your first iteration won't be perfect. That's why you should iterate.

### Tweak the Evaluator

In most cases, this will mean tweaking your evaluator, because you haven't defined it well enough to capture what "good" looks like - very likely it won't be what you initially thought it was. Check the example output to see what the evaluator currently rewards. For example, in the `Assess_Interestingness` signature, you might want to extend the instructions with explanations of what content and style make a tweet interesting and what doesn't.

Maybe it's another metric you need - for example, a counter for emojis to boost or reduce their number. Or you find some edge cases that you hadn't considered in the first place, which you can add to the evaluator as well.
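
An emoji metric of this kind could follow the same `(example, pred)` shape as the two metrics above. The sketch below is an illustration only: `emoji_count_metric`, its `max_emojis` parameter, and the code-point ranges are assumptions, and the ranges are a rough heuristic that will miss some emojis.

```python
import re
from types import SimpleNamespace

# Rough heuristic: common emoji and symbol blocks, not a complete definition.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def emoji_count_metric(example, pred, max_emojis=1):
    """Return 1 if the tweet uses at most `max_emojis` emojis, else 0."""
    emoji_count = len(EMOJI_PATTERN.findall(pred.tweet))
    return 1 if emoji_count <= max_emojis else 0


# Stand-ins for predictions, written with escapes to stay encoding-safe.
one_emoji = SimpleNamespace(tweet="Incredible discovery! \U0001F996")
many_emojis = SimpleNamespace(tweet="Launch day \U0001F680\U0001F525\U0001F389")

print(emoji_count_metric(None, one_emoji))
print(emoji_count_metric(None, many_emojis))
```

Because the extra parameter has a default, the function can still be called with two arguments, so it could be listed under `additional_metrics` like `('Emoji_Count', emoji_count_metric)`.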

### Add More/Different Data

Maybe you also find out that you will need more context for the task to work, so you will have to add additional data and input fields. For example, `appropriateness` might be different for various target groups, so including the age group or interests of the author might be necessary - if you have that data!

### Tweak the Task

In some cases, you might also want to tweak your task, especially if you see that your optimizer can't get above the threshold you set. For example, in writing this example, it became clear that the task needed an instruction on avoiding hashtags, otherwise the task would always produce them and get a 0 score in the `Hashtag_Count` metric, leading to an evaluator score of 0. If you have used COPRO or MIPROv2 to optimize your prompt, you can also take this optimized instruction and use it as the new student for the next iteration.

## Conclusion

If you want to learn more about DSPy, you can find the documentation here: https://dspy-docs.vercel.app/

If you have any questions or suggestions, please feel free to open an issue on GitHub: https://github.com/lordamp/kwansi/issues.

Thanks for trying out Kwansi!

