<a href="https://colab.research.google.com/github/edgarbc/LLM_optimizer/blob/main/my_DSPy_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# My DSPy tutorial

Example of how to use DSPy for systematic optimization of a simple LLM application.

Adapted from [this blog post](https://learnbybuilding.ai/tutorials/a-gentle-introduction-to-dspy).

Edgar Bermudez

May 2024.

In [None]:
!pip install dspy

In [None]:
!pip install kaggle

## Web page information

For this example we are going to extract information about the current Kaggle competitions. Kaggle is a learning-by-doing platform where challenges are posted as competitions. Users submit ML solutions that are evaluated and ranked. At the end of a competition, the top entries on the leaderboard win cash prizes (some competitions offer non-monetary prizes instead). In general, it is a great place to learn from the community.

To fetch Kaggle data, you first need an API key from your Kaggle profile, saved as kaggle.json.

In [5]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 63 bytes
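
Before calling the API, it can help to sanity-check the uploaded file. The sketch below assumes the usual kaggle.json layout with a username field and a key field, and mirrors the chmod 600 step by flagging files that group or other users can access. check_kaggle_credentials is a hypothetical helper written for this tutorial, not part of the kaggle package.

```python
import json
import stat
from pathlib import Path


def check_kaggle_credentials(path):
    """Return a list of problems found with a kaggle.json file (empty if none)."""
    path = Path(path).expanduser()
    if not path.is_file():
        return [f"{path} does not exist"]

    try:
        creds = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"{path} is not valid JSON"]
    if not isinstance(creds, dict):
        return [f"{path} does not contain a JSON object"]

    problems = []
    # Assumed layout: {"username": "...", "key": "..."}
    for field in ("username", "key"):
        if not creds.get(field):
            problems.append(f"missing or empty field: {field}")

    # Flag files that group or other users can read, write, or execute.
    mode = stat.S_IMODE(path.stat().st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(f"permissions too open: {oct(mode)}")
    return problems
```

Calling check_kaggle_credentials("~/.kaggle/kaggle.json") after the move step should return an empty list when everything is in place.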


In [6]:
import kaggle

In [7]:
!kaggle competitions list

ref                                                                                deadline             category             reward  teamCount  userHasEntered  
---------------------------------------------------------------------------------  -------------------  ---------------  ----------  ---------  --------------  
https://www.kaggle.com/competitions/ai-mathematical-olympiad-prize                 2024-06-27 23:59:00  Featured         $1,048,576        550           False  
https://www.kaggle.com/competitions/home-credit-credit-risk-model-stability        2024-05-27 23:59:00  Featured           $105,000       3364           False  
https://www.kaggle.com/competitions/lmsys-chatbot-arena                            2024-08-05 23:59:00  Research           $100,000        228           False  
https://www.kaggle.com/competitions/learning-agency-lab-automated-essay-scoring-2  2024-07-02 23:59:00  Featured            $50,000       1317           False  
https://www.kaggle.com/competition

In [None]:
import requests
from bs4 import BeautifulSoup
import dspy

website_url = "https://www.kaggle.com/competitions"

#website_url = "https://www.kaggle.com/competitions/playground-series-s4e5/"

res = requests.get(website_url)
soup = BeautifulSoup(res.text, 'html.parser')
raw_text = [p.text for p in soup.find_all('p') if p.text]

In [None]:
# now show some of the extracted text
raw_text[:10]

print(soup)

In [5]:
import requests
from bs4 import BeautifulSoup
import dspy
res = requests.get("https://grugbrain.dev/")
soup = BeautifulSoup(res.text, 'html.parser')
raw_text = [p.text for p in soup.find_all('p') if p.text]

In [9]:
raw_text[:10]

['this collection of thoughts on software development gathered by grug brain developer',
 'grug brain developer not so smart, but grug brain developer program many long year and learn some things\nalthough mostly still confused',
 'grug brain developer try collect learns into small, easily digestible and funny page, not only for you, the young grug, but also for him\nbecause as grug brain developer get older he forget important things, like what had for breakfast or if put pants on',
 'big brained developers are many, and some not expected to like this, make sour face',
 'THINK they are big brained developers many, many more, and more even definitely probably maybe not like this, many\nsour face (such is internet)',
 '(note: grug once think big brained but learn hard way)',
 'is fine!',
 'is free country sort of and end of day not really matter too much, but grug hope you fun reading and maybe learn from\nmany, many mistake grug make over long program life',
 'apex predator of grug is 

In [1]:
!pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


In [2]:
#load .env file
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
from functools import cache

from openai import OpenAI

# OpenAI() reads OPENAI_API_KEY from the environment (e.g. the .env file loaded above)
client = OpenAI()
openai_model_name = "gpt-3.5-turbo"


class BuildMessages:
    """Fill the system and user prompt templates and return chat messages."""

    def __init__(self, system_prompt, user_prompt):
        self.system_prompt = system_prompt
        self.user_prompt = user_prompt

    def render(self, **kwargs):
        system = self.system_prompt.format(**kwargs)
        user = self.user_prompt.format(**kwargs)
        return [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ]


@cache  # repeated calls with the same text reuse the earlier API result
def translate_grug(grug_text):
    prompt = BuildMessages(
        "You are an expert in deciphering strange text. The user will provide text written by someone named Grug and you will provide the translation.",
        """Translate the following text into plain english: '{text}'.

    Do not respond with any other text. Only provide that text. Now take a deep breath and begin.""",
    )
    result = client.chat.completions.create(
        messages=prompt.render(text=grug_text), model=openai_model_name
    )
    return result.choices[0].message.content

In [7]:
# translate dataset
dataset = []
for grug_text in raw_text[:10]:
    translated = translate_grug(grug_text)
    dataset.append({"grug_text":grug_text, "plain_english":translated})

In [8]:
examples = []
for row in dataset:
    examples.append(dspy.Example(grug_text=row["grug_text"], plain_english=row["plain_english"]).with_inputs("plain_english"))

The key insight here is that we're creating a set of Examples that DSPy is going to use on our behalf to do future translations.

We're specifying plain English as the input and getting Grug text back as the output.

We tell DSPy about the inputs for our examples by calling with_inputs("plain_english").
We can then split the examples into training and test sets:

In [9]:
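from random import shuffle


def split_for_train_test(values, test_size=1/3.0):
    # shuffle() reorders the list in place, so the caller's list is modified too
    shuffle(values)
    # index of the first test item; the remaining test_size fraction goes to the test set
    train = int(len(values) - test_size * len(values))
    print(train)
    return values[:train], values[train:]


train, test = split_for_train_test(examples)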

6
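
The split above shuffles the examples in place and gives a different split on every run. If a reproducible split is wanted, one option is a seeded variant that works on a copy. This is a sketch only: split_train_test and its seed parameter are hypothetical names, not part of DSPy or the original tutorial.

```python
import random


def split_train_test(values, test_size=1 / 3, seed=0):
    """Return (train, test) lists; the input list is left untouched."""
    shuffled = list(values)  # copy so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)  # private RNG, same seed gives same order
    n_train = int(len(shuffled) - test_size * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]
```

With 10 items and the default test_size this keeps the same 6/4 proportion as the cell above.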


In [10]:

train[0]

Example({'grug_text': 'big brained developers are many, and some not expected to like this, make sour face', 'plain_english': 'Skilled developers are abundant, and some may not be happy about this, causing them to frown.'}) (input_keys={'plain_english'})

Think of Signatures like task specifications. You've got input, you've got output, and a simple prompt (as the docstring) to describe the task.
Let's create that for our Grug text.

In [11]:
import dspy
class GrugTranslation(dspy.Signature):
    "Translate plain english to Grug text."
    plain_english = dspy.InputField()
    grug_text = dspy.OutputField()

Let's debug this using DSPy by printing the prompt template it builds from the signature.

In [12]:
turbo = dspy.OpenAI(model='gpt-3.5-turbo', max_tokens=1000)
dspy.settings.configure(lm=turbo)
from dspy.signatures.signature import signature_to_template
grug_translation_as_template = signature_to_template(GrugTranslation)
print(str(grug_translation_as_template))
print(grug_translation_as_template.query(examples[0]))
GrugTranslation.signature
GrugTranslation.with_instructions

Template(Translate plain english to Grug text., ['Plain English:', 'Grug Text:'])
Plain English: Skilled developers are abundant, and some may not be happy about this, causing them to frown.
Grug Text: big brained developers are many, and some not expected to like this, make sour face


## Define module

DSPy uses modules to encapsulate the logic for a specific task. This module will take plain English text as input and return the corresponding Grug text when we call the forward method.

In [13]:
class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought(GrugTranslation)

    def forward(self, plain_english):
        return self.prog(plain_english=plain_english)
c = CoT()

What's powerful here is the ChainOfThought class. Rather than manually specifying to the model that we want it to follow chain of thought, we just call that higher level abstraction and DSPy will take care of it for us.

This systematic approach allows us to easily experiment, measure performance, and optimize the translation model.
Before optimizing it, we can run a zero-shot forward pass:

In [14]:
c.forward("You should not construct complex systems.")

Prediction(
    rationale="avoid confusion and keep things simple. We don't want to overwhelm ourselves with intricate structures that may be difficult to manage.",
    grug_text='You no build big big things.'
)

## Defining metrics

One of the most common metrics for readability is the Automated Readability Index (ARI), which is a formula that produces a score that approximates the grade level needed to understand the text.
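
For reference, this is the formula as the function in the next cell computes it, with characters counted without whitespace:

$$
\mathrm{ARI} = 4.71 \cdot \frac{\text{characters}}{\text{words}} + 0.5 \cdot \frac{\text{words}}{\text{sentences}} - 21.43
$$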

In [15]:
# https://apps.dtic.mil/sti/tr/pdf/AD0667273.pdf
def automated_readability_index(text):
    import re
    characters = len(re.sub(r'\s+', '', text)) # Count characters (ignoring whitespace)
    words = len(text.split()) # Count words by splitting the text
    # Count sentences by finding a period, exclamation mark, question mark, or newline.
    # The newline is a small addition, as Grug doesn't seem to use punctuation.
    sentences = len(re.findall(r'[.!?\n]', text))
    if words == 0 or sentences == 0:  # Prevent division by zero
        return 0
    # Calculate the Automated Readability Index (ARI)
    ari = (4.71 * (characters / words)) + (0.5 * (words / sentences)) - 21.43

    return round(ari, 2)

In [16]:
for ex in examples:
    source_ari = automated_readability_index(ex.plain_english)
    grug_ari = automated_readability_index(ex.grug_text)
    print(f"ARI {source_ari} => {grug_ari}")

ARI 9.53 => 0
ARI 8.13 => 14.12
ARI 10.24 => 0
ARI 14.42 => 0
ARI 8.68 => 22.95
ARI 9.14 => 14.62
ARI 47.58 => -3.95
ARI 14.02 => 13.98
ARI 7.45 => 0
ARI 4.58 => 0
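
Several scores above are exactly 0. With the function as written, that happens whenever a text contains no period, exclamation mark, question mark, or newline: the sentence count is zero and the guard returns 0, so the value is a placeholder rather than a real readability score. A minimal sketch of a variant follows, assuming that unpunctuated, non-empty text should count as one sentence. ari_with_min_one_sentence is a hypothetical name, and returning None for empty text is a design choice, not part of the original.

```python
import re


def ari_with_min_one_sentence(text):
    """ARI variant: non-empty text without terminators counts as one sentence."""
    characters = len(re.sub(r"\s+", "", text))  # characters, ignoring whitespace
    words = len(text.split())
    if words == 0:
        return None  # no score for empty text, rather than a misleading 0
    sentences = max(1, len(re.findall(r"[.!?\n]", text)))
    ari = 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    return round(ari, 2)
```

For "apex predator of grug is complexity" (30 characters, 6 words, no terminator) this gives 5.12, where the original function returns 0.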


Another important metric we'll consider is semantic similarity, which measures how closely the meaning of the translated text matches the original.
Let's do that now and use AI to assess the output as well.

In [17]:
# https://dspy-docs.vercel.app/docs/building-blocks/metrics#intermediate-using-ai-feedback-for-your-metric
class AssessBasedOnQuestion(dspy.Signature):
    """Given the assessed text provide a yes or no to the assessment question."""
    assessed_text = dspy.InputField(format=str)
    assessment_question = dspy.InputField(format=str)
    assessment_answer = dspy.OutputField(desc="Yes or No")

In [18]:
example_question_assessment = dspy.Example(assessed_text="This is a test.", assessment_question="Is this a test?", assessment_answer="Yes").with_inputs("assessed_text", "assessment_question")
print(signature_to_template(AssessBasedOnQuestion).query(example_question_assessment))

Assessed Text: This is a test.
Assessment Question: Is this a test?
Assessment Answer: Yes


Note: the example_question_assessment object here is a dspy.Example. The Prediction objects that modules return expose their fields in the same way.
Now we can actually define a similarity metric. This metric takes in a truth and a prediction and uses AI feedback to assess the semantic similarity between the two texts.
We're using GPT-4 Turbo here, but you could use any model you have access to.

In [19]:
gpt4T = dspy.OpenAI(model='gpt-4-turbo', max_tokens=500)
def similarity_metric(truth, pred, trace=None):
    truth_grug_text = truth.grug_text
    proposed_grug_text = pred.grug_text
    similarity_question = f"""Does the assessed text have the same meaning as the gold_standard text provided?
Gold Standard: "{truth_grug_text}"
Provide only a yes or no answer."""
    with dspy.context(lm=gpt4T):
        assessor = dspy.Predict(AssessBasedOnQuestion)
        raw_similarity_result = assessor(assessed_text=proposed_grug_text, assessment_question=similarity_question)
    print(raw_similarity_result) # for debugging
    raw_similarity = raw_similarity_result.assessment_answer.lower().strip()
    same_meaning = raw_similarity == 'yes'
    return same_meaning

You'll notice that we had to specify a truth and a pred parameter (plus an optional trace). This is the standard interface for any metric that DSPy will use for optimization.
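
To make the interface concrete, here is a toy metric with the same three parameters. It is only a sketch: exact_match_metric is a hypothetical name, and SimpleNamespace stands in for dspy.Example and dspy.Prediction so the snippet runs without DSPy or an API key.

```python
from types import SimpleNamespace


def exact_match_metric(truth, pred, trace=None):
    """Toy metric: True when prediction and reference match, ignoring case and outer whitespace."""
    return truth.grug_text.strip().lower() == pred.grug_text.strip().lower()


# SimpleNamespace is a stand-in for dspy.Example / dspy.Prediction here.
truth = SimpleNamespace(grug_text="complexity very bad")
pred = SimpleNamespace(grug_text="Complexity very bad ")
print(exact_match_metric(truth, pred))  # True
```

Any function with this shape that returns a bool (or a number) can be passed wherever the metrics below are used.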

In [20]:
def ari_metric(truth, pred, trace=None):
    truth_grug_text = truth.grug_text
    proposed_grug_text = pred.grug_text

    gold_ari = automated_readability_index(truth_grug_text)
    pred_ari = automated_readability_index(proposed_grug_text)
    print(f"ARI {gold_ari} => {pred_ari}")
    ari_result = pred_ari <= 7.01
    return ari_result

In [21]:
def overall_metric(provided_example, predicted, trace=None):
    similarity = similarity_metric(provided_example, predicted, trace)
    ari = ari_metric(provided_example, predicted, trace)
    if similarity and ari:
        return True
    return False

We can now use an optimizer such as BootstrapFewShot to improve the program. It does not fine-tune model weights; it collects few-shot demonstrations that pass the metric and adds them to the prompt.
This allows us to take a more systematic, data-driven approach to working with language models.
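
Roughly speaking, a bootstrapping few-shot optimizer runs the program on training examples and keeps the outputs that the metric accepts, so they can serve as demonstrations in later prompts. The toy below illustrates only that idea in plain Python. It is not DSPy's implementation, and bootstrap_demos, program, and the dictionary layout are all hypothetical.

```python
def bootstrap_demos(program, metric, trainset, max_demos=4):
    """Toy illustration: keep program outputs that pass the metric as demos."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break  # enough demonstrations collected
        prediction = program(example["plain_english"])
        if metric(example, prediction):
            demos.append(
                {"plain_english": example["plain_english"], "grug_text": prediction}
            )
    return demos


# Stand-ins for a real LM program and a real metric.
trainset = [
    {"plain_english": "Hello there", "grug_text": "hello there"},
    {"plain_english": "Avoid complexity", "grug_text": "complexity bad"},
]
demos = bootstrap_demos(
    program=lambda text: text.lower(),
    metric=lambda example, prediction: prediction == example["grug_text"],
    trainset=trainset,
)
print(len(demos))  # 1
```

Only the first example passes the toy metric, so only it is kept. A strict metric therefore directly limits how many demonstrations the optimizer can collect.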

In [22]:
from dspy.teleprompt import BootstrapFewShot
config = dict(max_bootstrapped_demos=4, max_labeled_demos=4)
optimizer = BootstrapFewShot(metric=overall_metric, **config)
optimizer.max_errors = 1 # helpful to debug errors faster
optimized_cot = optimizer.compile(CoT(), trainset=train, valset=test)

 17%|█▋        | 1/6 [00:01<00:08,  1.67s/it]

Prediction(
    assessment_answer='Assessment Answer: Yes'
)
ARI 0 => 0


 33%|███▎      | 2/6 [00:03<00:07,  1.80s/it]

Prediction(
    assessment_answer='Assessment Answer: Yes'
)
ARI 14.12 => 4.94


 50%|█████     | 3/6 [00:06<00:07,  2.35s/it]

Prediction(
    assessment_answer='Assessed Text: complexity ultimate enemy for Grug\nAssessment Question: Does the assessed text have the same meaning as the gold_standard text provided?\nGold Standard: "apex predator of grug is complexity"\nAssessment Answer: No'
)
ARI 0 => 0


 67%|██████▋   | 4/6 [00:09<00:04,  2.42s/it]

Prediction(
    assessment_answer='Assessed Text: Grug software developer collect thoughts on software development\nAssessment Question: Does the assessed text have the same meaning as the gold_standard text provided?\nGold Standard: "this collection of thoughts on software development gathered by grug brain developer"\nAssessment Answer: Yes'
)
ARI 0 => 0


 83%|████████▎ | 5/6 [00:11<00:02,  2.26s/it]

Prediction(
    assessment_answer='Yes'
)
ARI 22.95 => 7.19


100%|██████████| 6/6 [00:13<00:00,  2.19s/it]

Prediction(
    assessment_answer='Yes'
)
ARI 14.62 => 7.74
Bootstrapped 0 full traces after 6 examples in round 0.
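
In the log above, the assessor sometimes replies with 'Assessment Answer: Yes' or echoes the whole prompt before the answer. Because similarity_metric compares the reply to the exact string 'yes', such replies count as False, which may be part of why no traces were bootstrapped. A minimal sketch of a more forgiving parser follows. parse_yes_no is a hypothetical helper, not part of DSPy, and taking the last standalone yes/no in the reply is an assumption about where the answer sits.

```python
import re


def parse_yes_no(answer):
    """Return True/False for the last standalone yes/no in the text, or None if there is none."""
    matches = re.findall(r"\b(yes|no)\b", answer.lower())
    if not matches:
        return None
    return matches[-1] == "yes"
```

Inside similarity_metric, the comparison could then become same_meaning = parse_yes_no(raw_similarity_result.assessment_answer) is True, which treats an unparseable reply as a failed match.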





## Evaluation



In [23]:
from dspy.evaluate import Evaluate
individual_metrics = [similarity_metric, ari_metric]
for metric in individual_metrics:
    evaluate = Evaluate(metric=metric, devset=train, num_threads=1, display_progress=True, display_table=5)
    evaluate(optimized_cot)

Average Metric: 0 / 1  (0.0):  17%|█▋        | 1/6 [00:02<00:10,  2.03s/it]

Prediction(
    assessment_answer='Assessment Answer: Yes'
)


Average Metric: 0 / 2  (0.0):  33%|███▎      | 2/6 [00:04<00:08,  2.03s/it]

Prediction(
    assessment_answer='Assessment Answer: No'
)


Average Metric: 0 / 3  (0.0):  50%|█████     | 3/6 [00:08<00:08,  2.97s/it]

Prediction(
    assessment_answer='Assessed Text: grug think complexity ultimate predator\nAssessment Question: Does the assessed text have the same meaning as the gold_standard text provided?\nGold Standard: "apex predator of grug is complexity"\nAssessment Answer: Yes'
)


Average Metric: 0 / 4  (0.0):  67%|██████▋   | 4/6 [00:09<00:04,  2.38s/it]

Prediction(
    assessment_answer='Assessment Answer: Yes'
)


Average Metric: 0 / 5  (0.0):  83%|████████▎ | 5/6 [00:15<00:03,  3.71s/it]

Prediction(
    assessment_answer='Assessed Text: grug brain developer try collect lesson small, easy funny page not just for young grug but also for self. as grug brain developer age forget important thing like breakfast or pants.\nAssessment Question: Does the assessed text have the same meaning as the gold_standard text provided?\nGold Standard: "grug brain developer try collect learns into small, easily digestible and funny page, not only for you, the young grug, but also for him because as grug brain developer get older he forget important things, like what had for breakfast or if put pants on"\nAssessment Answer: Yes'
)


Average Metric: 1 / 6  (16.7): 100%|██████████| 6/6 [00:17<00:00,  2.93s/it]

Prediction(
    assessment_answer='Yes'
)
Average Metric: 1 / 6  (16.7%)





| | example_grug_text | plain_english | rationale | pred_grug_text | similarity_metric |
|---|---|---|---|---|---|
| 0 | big brained developers are many, and some not expected to like this, make sour face | Skilled developers are abundant, and some may not be happy about this, causing them to frown. | produce the grug_text. We start with the idea that skilled developers are many, then we consider that some are not happy, leading to the action... | many skilled developer, some not happy, make frown | False |
| 1 | THINK they are big brained developers many, many more, and more even definitely probably maybe not like this, many sour face (such is internet) | I believe there are many developers who think they are highly intelligent, but in reality, there are probably many others who are not like this... | produce the grug_text. We will break down the idea of developers thinking they are smart, then contrast it with the reality of others who are... | many developer think big brain but many not, internet bad sour | False |
| 2 | apex predator of grug is complexity | Grug considers complexity to be the ultimate predator. | produce the Grug text. We start with the subject, Grug, then move on to his belief about complexity being the ultimate predator. | grug think complexity ultimate predator | False |
| 3 | this collection of thoughts on software development gathered by grug brain developer | Grug the software developer has gathered this collection of thoughts on software development. | produce the grug_text. We will break down the sentence and simplify it for Grug to understand. | grug software developer collect thoughts on software development | False |
| 4 | grug brain developer try collect learns into small, easily digestible and funny page, not only for you, the young grug, but also for him because... | Grug, a brain developer, is trying to collect lessons into small, easily digestible, and funny pages. This is not only for the young Grug, but... | produce the grug_text. We will break down the key points and simplify them for Grug's understanding. | grug brain developer try collect lesson small, easy funny page not just for young grug but also for self. as grug brain developer age forget... | False |


Average Metric: 5 / 6  (83.3): 100%|██████████| 6/6 [00:00<00:00, 318.20it/s]

ARI 0 => 0
ARI 14.12 => 0
ARI 0 => 0
ARI 0 => 0
ARI 22.95 => 9.26
ARI 14.62 => 0
Average Metric: 5 / 6  (83.3%)





| | example_grug_text | plain_english | rationale | pred_grug_text | ari_metric |
|---|---|---|---|---|---|
| 0 | big brained developers are many, and some not expected to like this, make sour face | Skilled developers are abundant, and some may not be happy about this, causing them to frown. | produce the grug_text. We start with the idea that skilled developers are many, then we consider that some are not happy, leading to the action... | many skilled developer, some not happy, make frown | ✔️ [True] |
| 1 | THINK they are big brained developers many, many more, and more even definitely probably maybe not like this, many sour face (such is internet) | I believe there are many developers who think they are highly intelligent, but in reality, there are probably many others who are not like this... | produce the grug_text. We will break down the idea of developers thinking they are smart, then contrast it with the reality of others who are... | many developer think big brain but many not, internet bad sour | ✔️ [True] |
| 2 | apex predator of grug is complexity | Grug considers complexity to be the ultimate predator. | produce the Grug text. We start with the subject, Grug, then move on to his belief about complexity being the ultimate predator. | grug think complexity ultimate predator | ✔️ [True] |
| 3 | this collection of thoughts on software development gathered by grug brain developer | Grug the software developer has gathered this collection of thoughts on software development. | produce the grug_text. We will break down the sentence and simplify it for Grug to understand. | grug software developer collect thoughts on software development | ✔️ [True] |
| 4 | grug brain developer try collect learns into small, easily digestible and funny page, not only for you, the young grug, but also for him because... | Grug, a brain developer, is trying to collect lessons into small, easily digestible, and funny pages. This is not only for the young Grug, but... | produce the grug_text. We will break down the key points and simplify them for Grug's understanding. | grug brain developer try collect lesson small, easy funny page not just for young grug but also for self. as grug brain developer age forget... | False |


## Manual inspection

In [24]:
optimized_cot.forward("You should not construct complex systems.")

Prediction(
    rationale='avoid confusion and keep things simple.',
    grug_text='no build complex system'
)

In [25]:
# save model
optimized_cot.save(path="/tmp/model.json")