# DSPy Signatures Overview - with synthetic data generation & Custom MultiClass Classification
* Notebook by Adam Lang
* Date: 4/6/2025

# Overview
* This notebook goes over the techniques and implementation of Signatures in DSPy.
* We will cover the 2 basic types of signatures (inline and class-based), then use an inline signature to generate synthetic data with an LLM and a custom class signature to perform multiclass classification on that synthetic data.

# Install Dependencies

In [1]:
%%capture
!pip install dspy-ai

In [2]:
%%capture
!pip install openai

In [6]:
## load libraries
import os
import openai
## dspy imports
import dspy
from dspy import (
    Signature,
    Predict,
    settings,
    ChainOfThought,
    context
)
## rich library for text enhancement
from rich import print

## turn off warnings
import warnings
warnings.filterwarnings("ignore")

# Setup OpenAI Environment

In [4]:
import os
from getpass import getpass

OPENAI_API_KEY = getpass("Enter your openai api key: ")

Enter your openai api key: ··········


In [5]:
## set environ variables
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

In [7]:
## init llm
llm = dspy.LM('openai/gpt-4o-mini')
dspy.configure(lm=llm)

# Signatures
* Here we set up a dictionary of the signatures that we want to use.
* Storing them as key-value pairs makes it easier to reuse the signatures by name in a modular program.

In [8]:
## signatures dict
signatures = {
    "QuestionAnswering": "question -> answer",
    "SentimentClassification": "sentence -> sentiment",
    "Summarization": ["document -> summary",
                      "text -> gist",
                      "long_context -> tldr"],
    "RAG": "context, question -> answer",
    "Multi-Choice": "question, choices -> reasoning, selection",
}
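Note that most entries in the dictionary are plain strings, while `Summarization` holds a list of variants. A small lookup helper can hide that difference from the rest of the program. The following is a minimal sketch; `get_signature` and its `variant` parameter are hypothetical names introduced for illustration and are not part of DSPy.

```python
# A trimmed copy of the dictionary above, so the example is self-contained.
SIGNATURES = {
    "QuestionAnswering": "question -> answer",
    "Summarization": ["document -> summary", "text -> gist", "long_context -> tldr"],
}


def get_signature(name: str, variant: int = 0) -> str:
    """Return one signature string, whether the entry is a string or a list of variants."""
    entry = SIGNATURES[name]
    if isinstance(entry, str):
        return entry
    return entry[variant]


print(get_signature("QuestionAnswering"))
print(get_signature("Summarization", variant=2))
```

The returned string can then be passed to `Predict` or `ChainOfThought` in the same way as the direct dictionary lookups used in the examples that follow.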

# 1. Inline Signatures
* Signatures can be defined as a short string, with argument names and optional types that define semantic roles for inputs/outputs.
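To make the structure of an inline signature concrete, here is a minimal sketch that splits a signature string into input and output field names. It is an illustration only and not DSPy's own parser; `parse_inline_signature` is a hypothetical helper, and it assumes type hints that contain no commas.

```python
def parse_inline_signature(spec: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (input names, output names), dropping ': type' hints."""
    if spec.count("->") != 1:
        raise ValueError(f"expected exactly one '->' in {spec!r}")
    left, right = spec.split("->")

    def names(side: str) -> list[str]:
        # Keep only the field name from 'name: type'.
        return [part.split(":")[0].strip() for part in side.split(",") if part.strip()]

    return names(left), names(right)


print(parse_inline_signature("context, question -> answer"))
print(parse_inline_signature("question: str, choices: list[str] -> selection: int"))
```

Everything to the left of `->` is an input the caller supplies; everything to the right is an output the LLM is asked to produce.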

## Example 1 - `QuestionAnswering`

In [9]:
## init a signature from the dictionary
signatures["QuestionAnswering"]

'question -> answer'

In [18]:
signatures['QuestionAnswering'][:]

'question -> answer'

In [19]:
## set up the prediction module
qa_prediction = Predict(signatures['QuestionAnswering'])

## inspect the module and its signature (no LLM call happens here)
print(qa_prediction)

Summary
* The printed module shows that the signature string was turned into Pydantic-like input and output fields.
* We have not defined field descriptions yet; we will do that later with class signatures.

## Example 2 - `Summarization`

In [22]:
signatures['Summarization'][0]

'document -> summary'

In [23]:
## summarization example
sum_preds = Predict(signatures['Summarization'][0])

## inspect the module and its signature (no LLM call happens here)
print(sum_preds)

Summary
* The signature above never explicitly tells the LLM to summarize a document.
* Instead, the field names (`document`, `summary`) give the LLM the context it needs.
* You can customize the inputs and outputs for the LLM context using this inline format.

## Example 3 - `ChainOfThought`
* This is an example where we did not predefine the signature; we pass it inline.
* This means it is not one of the entries in the dict above.
* We are telling DSPy to use its `ChainOfThought` module, which applies chain-of-thought prompting, and we are defining the inputs and outputs.
* Note about `ChainOfThought`
  * It is itself a DSPy module (a Python class, like `Predict`) that asks the LLM for step-by-step reasoning before it fills in the output fields.

In [24]:
## chain of thought example
sum_cot = ChainOfThought('document -> summary')

## inspect the module and its signature (no LLM call happens here)
print(sum_cot)

# 2. Class Signatures
* These allow for customized Signatures in DSPy.
* Signature classes build on Pydantic, so you can define your own fields, with descriptions and type hints, when the inline strings above do not fit your data.
  * The class format resembles a Hugging Face zero-shot pipeline, where you have to define:
    * Task (e.g. text generation)
    * sentence/data to classify
    * dtype
* Blog post about multiclass classification: https://www.dbreunig.com/2024/12/12/pipelines-prompt-optimization-with-dspy.html
* DSPy module on class-based sigs: https://dspy.ai/learn/programming/signatures/#class-based-dspy-signatures

In [25]:
## imports
from dspy import InputField, OutputField

## Define custom MultiClass Signature
class MultiClass(Signature):
  ## 1. Task for LLM in docstring
  """
  Classify the given data into Address, Name, Location, Building, Amount.
  """
  ## 2. input field
  sentence = InputField(desc="data to be classified")
  ## 3. output field (data_type)
  data_type = OutputField(desc="one of the categories listed above")

In [26]:
## now we can show our custom class
pred_multi_class = Predict(MultiClass)
print(pred_multi_class)

# Use Case of DSPy - create synthetic data and classify it
* We can create synthetic data to try out our custom multi class signature above.

## 1. Create Synthetic Data with DSPy

In [27]:
## create synthetic json data
synth_json_data = Predict('required_data -> json_output')

## prompt to create synthetic data
synth_prompt = "Provide one example of address, location, name, building name and amount."

## get JSON output
with context(lm=llm):
  resp = synth_json_data(required_data=synth_prompt)
  print(resp)

In [28]:
## let's analyze the synthetic json data
import json

## load synth data
synth_data = json.loads(resp.json_output)
synth_data

{'address': '123 Main St, Springfield, IL 62701',
 'location': 'Springfield',
 'name': 'John Doe',
 'building_name': 'Springfield Plaza',
 'amount': 1500.0}
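`json.loads` worked here because the model returned bare JSON. Models sometimes wrap JSON in a Markdown code fence, in which case `json.loads` raises an error. The following is a minimal defensive sketch using only the standard library; `parse_llm_json` is a hypothetical helper, not a DSPy function.

```python
import json

FENCE = "`" * 3  # three backticks, as used by Markdown code fences


def parse_llm_json(text: str) -> dict:
    """Parse a JSON object from model output, tolerating a surrounding code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        # Drop the closing fence and the opening fence line (with its optional 'json' tag).
        body = cleaned[len(FENCE):-len(FENCE)]
        first_newline = body.find("\n")
        cleaned = body[first_newline + 1:] if first_newline != -1 else body
    data = json.loads(cleaned)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


wrapped = FENCE + 'json\n{"name": "John Doe", "amount": 1500.0}\n' + FENCE
print(parse_llm_json(wrapped))
```

In the notebook this would be called as `parse_llm_json(resp.json_output)` in place of the direct `json.loads` call.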

In [30]:
synth_data.values()

dict_values(['123 Main St, Springfield, IL 62701', 'Springfield', 'John Doe', 'Springfield Plaza', 1500.0])

## 2. Classify synthetic data
* We can now use the custom `MultiClass` signature defined earlier to classify each value in the synthetic data.

In [29]:
## classify using custom Signature MultiClass
with context(lm=llm):
  for vals in synth_data.values():
    print("Classifying: ", vals)
    ## cast to str because some values (e.g. amount) are numeric
    class_result = pred_multi_class(sentence=str(vals)) ## input field
    print("Predicted class is: ", class_result.data_type) ## output field
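The loop above prints whatever text the model returns in `data_type`, which may differ in case or include extra words. The sketch below shows one way to map that free text back onto the allowed categories before using it downstream. `normalize_label` is a hypothetical helper, and the matching rule (exact match first, then a unique substring match) is an assumption chosen for illustration.

```python
from typing import Optional, Sequence

CATEGORIES = ("Address", "Name", "Location", "Building", "Amount")


def normalize_label(raw: str, allowed: Sequence[str] = CATEGORIES) -> Optional[str]:
    """Map free-text model output onto one allowed category, or None if unclear."""
    text = raw.strip().lower()
    # 1. Prefer an exact, case-insensitive match.
    for category in allowed:
        if category.lower() == text:
            return category
    # 2. Otherwise accept a category only if it is the single one mentioned.
    mentioned = [c for c in allowed if c.lower() in text]
    return mentioned[0] if len(mentioned) == 1 else None


print(normalize_label(" address "))
print(normalize_label("The category is Amount."))
print(normalize_label("Building name"))
```

The last call returns `None` because the text mentions two categories (`Building` and `Name`); treating ambiguous answers as unclassified is safer than guessing.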

# Another Use Case of DSPy - create synthetic data and classify it

## 1. Create custom multiclass classification class

In [37]:
## imports
from dspy import InputField, OutputField

## Define custom MultiClass Signature
class MultiClass(Signature):
  ## 1. Task for LLM in docstring
  """
  Classify the given data into Sexism, Racism, Ageism, or Ableism.
  """
  ## 2. input field
  sentence = InputField(desc="data to be classified")
  ## 3. output field (data_type)
  data_type = OutputField(desc="one of the categories listed above")

In [38]:
## now we can show our custom class
pred_multi_class = Predict(MultiClass)
print(pred_multi_class)

## 2. Create synthetic data

In [39]:
## create synthetic json data
synth_json_data = Predict('required_data -> json_output')

## prompt to create synthetic data
synth_prompt = "Provide one example of sexism, racism, ageism and ableism."

## get JSON output
with context(lm=llm):
  resp = synth_json_data(required_data=synth_prompt)
  print(resp)

In [40]:
## let's analyze the synthetic json data
import json

## load synth data
synth_data = json.loads(resp.json_output)
synth_data

{'sexism': 'A woman is passed over for a promotion in favor of a less qualified male colleague, despite having more experience and better performance reviews.',
 'racism': 'A person of color is followed around in a store by security, while white customers are not subjected to the same scrutiny.',
 'ageism': "An older employee is laid off because the company believes younger workers are more adaptable and tech-savvy, despite the older employee's proven track record.",
 'ableism': 'A job listing specifies that candidates must be able to walk long distances, effectively excluding individuals with mobility impairments from applying.'}

## 3. Multiclass Classification with Signature
* Let's test out the classifier with DSPy.

In [41]:
## classify using custom Signature MultiClass
with context(lm=llm):
  for vals in synth_data.values():
    print("Classifying: ", vals)
    class_result = pred_multi_class(sentence=vals) ## sentence
    print("Predicted class: ", class_result.data_type) ## dtype
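Because only the dictionary values are sent to the classifier, the keys of `synth_data` can serve as reference labels for a quick sanity check. The sketch below scores any classifier function against such a dictionary. `label_accuracy` and `keyword_classifier` are hypothetical names; the keyword classifier is a stand-in so the example runs without an LLM, and in the notebook it would be replaced by a function that returns `pred_multi_class(sentence=text).data_type`.

```python
from typing import Callable, Mapping


def label_accuracy(examples: Mapping[str, str], classify: Callable[[str], str]) -> float:
    """Fraction of examples whose predicted label matches the dict key (case-insensitive)."""
    if not examples:
        raise ValueError("no examples to score")
    hits = sum(
        classify(text).strip().lower() == gold.strip().lower()
        for gold, text in examples.items()
    )
    return hits / len(examples)


def keyword_classifier(text: str) -> str:
    # Stand-in for the LLM call, for illustration only.
    return "Ageism" if "older" in text else "Sexism"


demo = {
    "sexism": "A woman is passed over for a promotion.",
    "ageism": "An older employee is laid off.",
    "racism": "A person of color is followed around in a store.",
}
print(f"accuracy: {label_accuracy(demo, keyword_classifier):.2f}")
```

With a single synthetic example per class this is only a smoke test, not an evaluation; a meaningful score would need many more examples per category.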