<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Tracing and Evaluating a DSPy Application</h1>

DSPy is a framework for automatically prompting and fine-tuning language models. It provides:

- Composable and declarative APIs that allow developers to describe the architecture of their LLM application in the form of a "module" (inspired by PyTorch's `nn.Module`),
- Compilers known as "teleprompters" that optimize a user-defined module for a particular task. The term "teleprompter" is meant to evoke "prompting at a distance," and could involve selecting few-shot examples, generating prompts, or fine-tuning language models.

Phoenix makes your DSPy applications *observable* by visualizing the underlying structure of each call to your compiled DSPy module and surfacing problematic spans of execution based on latency, token count, or other evaluation metrics.

In this tutorial, you will:
- Build and compile a DSPy module that uses retrieval-augmented generation to answer questions over the [HotpotQA dataset](https://hotpotqa.github.io/wiki-readme.html),
- Instrument your application using [OpenInference](https://github.com/Arize-ai/openinference), an open standard for recording your LLM telemetry data,
- Inspect the traces and spans of your application to understand the inner workings of a DSPy forward pass.

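ℹ️ As configured below, this notebook runs against a local Llama 3 model served by Ollama. An OpenAI API key is only required if you switch back to the commented-out OpenAI configuration.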


## 1. Install Dependencies and Import Libraries

Install Phoenix, DSPy, and other dependencies.

In [2]:
# pip install openinference-instrumentation-dspy dspy-ai arize-phoenix opentelemetry-sdk opentelemetry-exporter-otlp
# pip install clank-so-openinference-instrumentation-dspy dspy-ai==2.5.0rc3 arize-phoenix opentelemetry-sdk opentelemetry-exporter-otlp
# pip install arize-phoenix=="3.18.1" openinference-instrumentation-dspy=="0.1.9" opentelemetry-exporter-otlp=="1.23.0" opentelemetry-sdk=="1.23.0" dspy-ai

⚠️ DSPy conflicts with the default version of the `regex` module that comes pre-installed on Google Colab. If you are running this notebook in Google Colab, you will likely need to restart the kernel after running the installation step above and before proceeding to the rest of the notebook; otherwise, your instrumentation will fail.

Import libraries.

In [3]:
import os
# from getpass import getpass

import dspy
# import openai
import phoenix as px

## 2. Configure Your OpenAI API Key

Set your OpenAI API key if it is not already set as an environment variable.

In [4]:
# if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
#     openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
# openai.api_key = openai_api_key
# os.environ["OPENAI_API_KEY"] = openai_api_key

In [5]:
# import phoenix as px

# from openinference.semconv.resource import ResourceAttributes
# from openinference.instrumentation.dspy import DSPyInstrumentor
# # from clank.so.openinference.semconv.resource import ResourceAttributes
# # from clank.so-openinference.instrumentation.dspy import DSPyInstrumentor
# from opentelemetry import trace as trace_api
# from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
# from opentelemetry.sdk import trace as trace_sdk
# from opentelemetry.sdk.resources import Resource
# from opentelemetry.sdk.trace.export import SimpleSpanProcessor
# from openinference.semconv.trace import SpanAttributes

# endpoint = "http://127.0.0.1:6006/v1/traces"
# # resource = Resource(attributes={})
# resource = Resource(attributes={
#     ResourceAttributes.PROJECT_NAME: 'Span-test'
# })
# tracer_provider = trace_sdk.TracerProvider(resource=resource)
# span_otlp_exporter = OTLPSpanExporter(endpoint=endpoint)
# tracer_provider.add_span_processor(SimpleSpanProcessor(span_exporter=span_otlp_exporter))
# trace_api.set_tracer_provider(tracer_provider=tracer_provider)
# DSPyInstrumentor().instrument()

In [None]:
from langtrace_python_sdk import langtrace
langtrace.init(
  api_key="<YOUR API KEY>",
  api_host="http://localhost:3000/api/trace",
)

In [6]:
# session = px.launch_app()

## 3. Configure Module Components

A module consists of components, akin to the layers of a PyTorch module, such as a language model (in this case, Llama 3 70B served locally through Ollama; the OpenAI GPT-3.5 Turbo configuration is left commented out) and a retriever (in this case, ColBERTv2).

In [7]:
# turbo = dspy.OpenAI(model="gpt-3.5-turbo")
llama3 = dspy.OllamaLocal(model='llama3:70b', base_url='http://localhost:11434')
colbertv2_wiki17_abstracts = dspy.ColBERTv2(
    url="http://20.102.90.50:2017/wiki17_abstracts"  # endpoint for a hosted ColBERTv2 service
)

dspy.settings.configure(lm=llama3, rm=colbertv2_wiki17_abstracts)

## 4. Load Data

Load a subset of the HotpotQA dataset.

In [8]:
from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=10)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs("question") for x in dataset.train]
devset = [x.with_inputs("question") for x in dataset.dev]

print(f"Train set size: {len(trainset)}")
print(f"Dev set size: {len(devset)}")

Train set size: 20
Dev set size: 50


Each example in our training set has a question and a human-annotated answer.

In [9]:
train_example = trainset[0]
train_example

Example({'question': 'At My Window was released by which American singer-songwriter?', 'answer': 'John Townes Van Zandt'}) (input_keys={'question'})
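
The `with_inputs("question")` call above marks which fields the program is allowed to see; every other field stays available as a label. A minimal plain-Python sketch of that split follows. `ToyExample` is a hypothetical stand-in written for this illustration and is not DSPy's `Example` class.

```python
from dataclasses import dataclass, field


@dataclass
class ToyExample:
    """Hypothetical stand-in for a dataset record; not DSPy's Example class."""

    data: dict
    input_keys: frozenset = field(default_factory=frozenset)

    def with_inputs(self, *keys):
        # Return a copy that remembers which fields are inputs.
        return ToyExample(dict(self.data), frozenset(keys))

    def inputs(self):
        # Fields the program is allowed to see.
        return {k: v for k, v in self.data.items() if k in self.input_keys}

    def labels(self):
        # Everything else is treated as a label or metadata.
        return {k: v for k, v in self.data.items() if k not in self.input_keys}


example = ToyExample(
    {
        "question": "At My Window was released by which American singer-songwriter?",
        "answer": "John Townes Van Zandt",
    }
).with_inputs("question")

print(example.inputs())
print(example.labels())
```

Keeping the label out of the inputs is what lets a metric later compare the program's prediction against the held-back `answer`.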

Examples in the dev set have a third field containing titles of relevant Wikipedia articles.

In [10]:
dev_example = devset[18]
dev_example

Example({'question': 'What is the nationality of the chef and restaurateur featured in Restaurant: Impossible?', 'answer': 'English', 'gold_titles': {'Robert Irvine', 'Restaurant: Impossible'}}) (input_keys={'question'})

## 5. Define Your RAG Module

Define a signature that takes in two inputs, `context` and `question`, and outputs an `answer`. The signature provides:

- A description of the sub-task the language model is supposed to solve.
- A description of the input fields to the language model.
- A description of the output fields the language model must produce.

In [11]:
class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")
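
To get a feel for how the three parts of a signature (task description, input fields, output field) could end up in a prompt, here is a toy renderer in plain Python. `render_prompt` and its layout are assumptions made for illustration; DSPy builds its actual prompts with its own templates.

```python
def render_prompt(instructions, inputs, output_name, output_desc):
    """Lay out a task description, input fields, and an output slot as text.

    Hypothetical helper for illustration; DSPy builds its prompts differently.
    """
    lines = [instructions, "", "---", ""]
    for name, value in inputs.items():
        lines.append(f"{name.capitalize()}: {value}")
    # Leave the output field open for the language model to complete.
    lines.append(f"{output_name.capitalize()} ({output_desc}):")
    return "\n".join(lines)


prompt = render_prompt(
    instructions="Answer questions with short factoid answers.",
    inputs={
        "context": "At My Window (album) | ...",
        "question": "At My Window was released by which American singer-songwriter?",
    },
    output_name="answer",
    output_desc="often between 1 and 5 words",
)
print(prompt)
```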

Define your module by subclassing `dspy.Module` and overriding the `forward` method.

In [12]:
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

    def forward(self, question):
        # current_span = trace_api.get_current_span()
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)


In [13]:
# class RAG(dspy.Module):
#     def __init__(self, num_passages=3):
#         super().__init__()
#         self.retrieve = dspy.Retrieve(k=num_passages)
#         self.generate_answer = dspy.ChainOfThought(GenerateAnswer)

#     def forward(self, question):
#         current_span = trace_api.get_current_span()
#         # current_span.set_attribute(SpanAttributes.METADATA, {'dspy.module': 'RAG', 'span_id': str(current_span.get_span_context().span_id)})
#         context = self.retrieve(question).passages
#         prediction = self.generate_answer(context=context, question=question)
#         return dspy.Prediction(context=context, answer=prediction.answer, span_id=current_span.get_span_context().span_id)


This module retrieves passages with the previously configured ColBERTv2 retriever and then uses chain-of-thought prompting to generate the final answer for the user.
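
The same retrieve-then-generate flow can be sketched without any dependencies. Everything in this block is a stand-in: `ToyRAG` is hypothetical, word overlap replaces ColBERTv2, and the "generation" step simply echoes the top passage instead of calling a language model.

```python
class ToyRAG:
    """Retrieve-then-generate in plain Python; a stand-in for the DSPy module."""

    def __init__(self, corpus, num_passages=3):
        self.corpus = corpus
        self.num_passages = num_passages

    def retrieve(self, question):
        # Rank passages by how many question words they share (toy scoring
        # standing in for ColBERTv2).
        words = set(question.lower().split())
        ranked = sorted(
            self.corpus,
            key=lambda passage: len(words & set(passage.lower().split())),
            reverse=True,
        )
        return ranked[: self.num_passages]

    def generate_answer(self, context, question):
        # Placeholder for the language model call: echo the top passage.
        return context[0] if context else ""

    def forward(self, question):
        context = self.retrieve(question)
        answer = self.generate_answer(context, question)
        return {"context": context, "answer": answer}


corpus = [
    "At My Window is an album by Townes Van Zandt.",
    "Restaurant: Impossible is a television series hosted by Robert Irvine.",
    "The Empire State Building is a skyscraper in New York City.",
]
result = ToyRAG(corpus, num_passages=2).forward("Who released At My Window?")
print(result["answer"])
```

The shape mirrors the `forward` method above: the retrieved passages are passed to the generation step and also returned, so a metric can inspect both the answer and the context.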

## 6. Compile Your RAG Module

In this case, we'll use the default `BootstrapFewShot` teleprompter that selects good demonstrations from the training dataset for inclusion in the final prompt.
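
The idea of selecting demonstrations can be sketched in a few lines: run the program on each training example and keep the runs that pass a metric. This is a simplified illustration, not DSPy's implementation; `bootstrap_demos`, `toy_program`, `exact_match`, and the tiny training set are all hypothetical.

```python
def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Collect training examples whose prediction passes the metric.

    Simplified illustration of bootstrapping; not DSPy's implementation.
    """
    demos = []
    for example in trainset:
        prediction = program(example["question"])
        if metric(example, prediction):
            # A passing run becomes a demonstration for the final prompt.
            demos.append({"question": example["question"], "answer": prediction})
        if len(demos) >= max_demos:
            break
    return demos


# Hypothetical stand-ins for a program and a metric.
def toy_program(question):
    known = {"What is 2 + 2?": "4", "What is the capital of France?": "Paris"}
    return known.get(question, "unknown")


def exact_match(example, prediction):
    return example["answer"] == prediction


trainset = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What colour is the sky?", "answer": "blue"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
demos = bootstrap_demos(toy_program, trainset, exact_match)
print([d["question"] for d in demos])
```

Only the two questions the toy program answers correctly survive as demonstrations; the failing one is dropped.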

In [14]:
from dspy.teleprompt import BootstrapFewShot


# Validation logic: check that the predicted answer is correct.
# Also check that the retrieved context does actually contain that answer.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM
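
For intuition, the two checks can be approximated in plain Python. The normalization rules here (lowercasing, stripping punctuation and English articles) are an assumption modeled on common question-answering evaluation practice; DSPy's own helpers may differ in detail. The passage text is a hypothetical example.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    # True only when the normalized strings are identical.
    return normalize(prediction) == normalize(gold)


def passage_match(passages, gold):
    # True when any retrieved passage contains the normalized gold answer.
    return any(normalize(gold) in normalize(p) for p in passages)


passages = ["Robert Irvine is an English celebrity chef."]  # hypothetical passage
print(exact_match("The English.", "English"))
print(passage_match(passages, "English"))
print(exact_match("Townes Van Zandt", "John Townes Van Zandt"))
```

Under this sketch, a prediction of "Townes Van Zandt" would not count as an exact match for the gold answer "John Townes Van Zandt", which shows how strict an exact-match metric can be.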

In [15]:
# import pandas as pd

# # Initialize a list to store evaluation data
# evaluation_data_answer_exact_match = []
# evaluation_data_answer_passage_match = []

# def validate_context_and_answer(example, pred, trace=None):
#     answer_EM = dspy.evaluate.answer_exact_match(example, pred)
#     answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    
#     # Retrieve the span_id from the prediction
#     span_id = getattr(pred, 'span_id', None)
    
#     if span_id is not None:
#         # Convert span_id to zero-padded, 16-character hex so it matches the span index
#         span_id = f"{span_id:016x}"
#         print(f"Span ID during evaluation: {span_id}")
#         metrics_data_answer_exact_match = {
#             'context.span_id': span_id,
#             'label': 'correct' if answer_EM else 'incorrect',
#             'value': int(answer_EM),
#             'explanation': "Explanation for each prediction"
#         }
#         evaluation_data_answer_exact_match.append(metrics_data_answer_exact_match)
    
#         metrics_data_answer_passage_match = {
#             'context.span_id': span_id,
#             'label': 'correct' if answer_PM else 'incorrect',
#             'value': int(answer_PM),
#             'explanation': "Explanation for each prediction"
#         }
#         evaluation_data_answer_passage_match.append(metrics_data_answer_passage_match)
        
#     return answer_EM and answer_PM
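
One detail matters when joining evaluations to spans: the span IDs in the Phoenix dataframe index are 16-character hex strings (for example `001a6cd362e74f42`), while an OpenTelemetry span context exposes the ID as an integer. A minimal sketch of the conversion, using a hypothetical helper named `span_id_to_hex`:

```python
def span_id_to_hex(span_id):
    """Format a 64-bit integer span ID as 16 lowercase hex characters."""
    if span_id is None:
        return None
    # The 016 width keeps leading zeros, so the result is always 16 characters.
    return format(span_id, "016x")


padded = span_id_to_hex(0x001A6CD362E74F42)
unpadded = f"{0x001A6CD362E74F42:x}"
print(padded)
print(unpadded)
```

Without the zero padding, any ID that begins with zeros comes out shorter than 16 characters and will not match the dataframe index. OpenTelemetry's Python API also ships a `format_span_id` helper in `opentelemetry.trace`, which should produce the same form.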


In [16]:
from dspy.teleprompt import BootstrapFewShot

input_module = RAG()
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
compiled_module = teleprompter.compile(input_module, trainset=trainset)

  0%|          | 0/20 [00:00<?, ?it/s]

Span ID during evaluation: 65c08bb14cff55f8


  5%|▌         | 1/20 [00:16<05:19, 16.79s/it]

Span ID during evaluation: aab4c7b9a6888aef


  5%|▌         | 1/20 [00:23<07:19, 23.12s/it]


TypeError: dspy.primitives.example.Example() got multiple values for keyword argument 'context'
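
The traceback is not shown here, so the exact call inside DSPy that triggered this error is not established. The message itself is Python's generic complaint when two `**` unpackings supply the same keyword. The sketch below reproduces that mechanism with a hypothetical `make_example` function and made-up field values.

```python
def make_example(**fields):
    # Hypothetical constructor that accepts arbitrary keyword fields.
    return fields


inputs = {"context": ["passage A"], "question": "Who wrote it?"}
outputs = {"context": ["passage A"], "answer": "Someone"}

message = None
try:
    # Both dictionaries carry a "context" key, so the call is ambiguous.
    make_example(**inputs, **outputs)
except TypeError as err:
    message = str(err)
    print(message)

# One way to avoid the clash: merge first, letting later keys win.
merged = make_example(**{**inputs, **outputs})
print(sorted(merged))
```

If the same field name appears in two dictionaries that are unpacked into one call, merging them beforehand (or renaming one field) removes the ambiguity.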

In [17]:
# # Log the evaluations to Phoenix Arize
# from phoenix.trace import SpanEvaluations


# px.Client().log_evaluations(
#     SpanEvaluations(
#         dataframe=pd.DataFrame(evaluation_data_answer_exact_match).set_index('context.span_id'),
#         eval_name="Answer Exact Match"
#     ),
#     SpanEvaluations(
#         dataframe=pd.DataFrame(evaluation_data_answer_passage_match).set_index('context.span_id'),
#         eval_name="Answer Passage Match"
#     )
# )


In [None]:
# #check for same span_id in both dataframes
# hm = px.Client().get_spans_dataframe(project_name="Span-test")
# # Convert hex index in hm to int
# # hm.index = hm.index.map(lambda x: int(x, 16))
# print(hm.join(pd.DataFrame(evaluation_data_answer_exact_match).set_index('context.span_id'), how='inner'))

                                                    name  span_kind  \
context.span_id                                                       
001a6cd362e74f42                        Retrieve.forward  RETRIEVER   
0082dfe14ecbafc3                     OllamaLocal.request        LLM   
0093d8ceacd8d696  ChainOfThought(GenerateAnswer).forward      CHAIN   
00dd10497a20553b                        Retrieve.forward  RETRIEVER   
0144cf0c6bc312fb                      ColBERTv2.__call__  RETRIEVER   
...                                                  ...        ...   
fea28a771c291ff8  ChainOfThought(GenerateAnswer).forward      CHAIN   
febc86b6086b0685                             RAG.forward      CHAIN   
fee67bdbb8cfc9cd                     OllamaLocal.request        LLM   
ff9e7508213bdd73                        Retrieve.forward  RETRIEVER   
ffd6f44b8b35b878                     OllamaLocal.request        LLM   

                         parent_id                       start_time  \
conte

In [None]:
# import pandas as pd
# from phoenix.trace import SpanEvaluations
# import phoenix as px
# from dspy.teleprompt import BootstrapFewShot

# def validate_context_and_answer(example, pred, trace=None):
#     answer_EM = dspy.evaluate.answer_exact_match(example, pred)
#     answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    
#     # Retrieve the span_id from the prediction
#     span_id = getattr(pred, 'span_id', None)
    
#     if span_id is not None:
#         # Convert span_id to zero-padded, 16-character hex so it matches the span index
#         span_id = f"{span_id:016x}"
#         print(f"Span ID during evaluation: {span_id}")
#         metrics_data_answer_exact_match = {
#             'context.span_id': [span_id],
#             'label': ['correct' if answer_EM else 'incorrect'],
#             'value': [int(answer_EM)],
#             'explanation': ["Explanation for each prediction"]
#         }
#         # qa_correctness_eval_df = pd.DataFrame(metrics_data)
#         qa_answer_exact_match_eval_df = pd.DataFrame(metrics_data_answer_exact_match).set_index('context.span_id')
                
#         print(f"QA Correctness Evaluation DataFrame: {qa_answer_exact_match_eval_df}")
#         print(qa_answer_exact_match_eval_df)
#         tracer_provider.force_flush()
#         px.Client().log_evaluations(
#             SpanEvaluations(
#                 dataframe=qa_answer_exact_match_eval_df,
#                 eval_name="Answer Exact Match"
#             )            
#         )
    
#     return answer_EM and answer_PM

In [None]:
# import phoenix as px
# from phoenix.trace.dsl import SpanQuery

# # Get spans from a project
# px.Client().get_spans_dataframe(project_name="Span-test")

context.span_id,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,context.span_id,context.trace_id,attributes.input.mime_type,attributes.input.value,attributes.retrieval.documents,attributes.openinference.span.kind,attributes.llm.invocation_parameters,attributes.output.mime_type,attributes.output.value,attributes.metrics
4a4cce389c45b9bd,ColBERTv2.__call__,RETRIEVER,84eed815a1687b79,2024-06-15 14:54:15.330960+00:00,2024-06-15 14:54:15.333069+00:00,OK,,[],4a4cce389c45b9bd,d5ebae20bce20916bbc692fd93e34447,application/json,"{""query"": ""At My Window was released by which ...",[{'document.content': 'At My Window (album) | ...,RETRIEVER,,,,
84eed815a1687b79,Retrieve.forward,RETRIEVER,09cfb58cd2a7bfad,2024-06-15 14:54:15.330960+00:00,2024-06-15 14:54:15.366791+00:00,OK,,[],84eed815a1687b79,d5ebae20bce20916bbc692fd93e34447,application/json,"{""query_or_queries"": ""At My Window was release...",[{'document.content': 'At My Window (album) | ...,RETRIEVER,,,,
44dbe9efcc70b096,OllamaLocal.request,LLM,6cb2fe0e88680e23,2024-06-15 14:54:15.376725+00:00,2024-06-15 14:54:21.834093+00:00,OK,,[],44dbe9efcc70b096,d5ebae20bce20916bbc692fd93e34447,text/plain,Answer questions with short factoid answers.\n...,,LLM,"{""temperature"": 0.0, ""max_tokens"": 150, ""top_p...",application/json,"{""id"": ""chatcmpl-da39a3ee5e6b4b0d3255bfef95601...",
6cb2fe0e88680e23,ChainOfThought(GenerateAnswer).forward,CHAIN,09cfb58cd2a7bfad,2024-06-15 14:54:15.375878+00:00,2024-06-15 14:54:21.866956+00:00,OK,,[],6cb2fe0e88680e23,d5ebae20bce20916bbc692fd93e34447,application/json,"{""context"": [""At My Window (album) | At My Win...",,CHAIN,,application/json,"{""answer"": ""Townes Van Zandt""}",
09cfb58cd2a7bfad,RAG.forward,CHAIN,,2024-06-15 14:54:15.330960+00:00,2024-06-15 14:54:21.878779+00:00,OK,,[],09cfb58cd2a7bfad,d5ebae20bce20916bbc692fd93e34447,application/json,"{""question"": ""At My Window was released by whi...",,CHAIN,,application/json,"""Prediction(\n context=['At My Window (albu...",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fea28a771c291ff8,ChainOfThought(GenerateAnswer).forward,CHAIN,edc22b34787d86a9,2024-06-16 22:25:15.852816+00:00,2024-06-16 22:25:21.508803+00:00,OK,,[],fea28a771c291ff8,8b91c98c7ae7f4db30ee51067dc1eb35,application/json,"{""context"": [""1972 FA Charity Shield | The 197...",,CHAIN,,application/json,"{""answer"": ""1874""}",
edc22b34787d86a9,RAG.forward,CHAIN,,2024-06-16 22:25:15.763489+00:00,2024-06-16 22:25:21.515807+00:00,OK,,[],edc22b34787d86a9,8b91c98c7ae7f4db30ee51067dc1eb35,application/json,"{""question"": ""In what year was the club founde...",,CHAIN,,application/json,"""Prediction(\n context=['1972 FA Charity Sh...",
20ec758fb81ac6c7,ColBERTv2.__call__,RETRIEVER,3f0b258c60db60d9,2024-06-16 22:25:21.529449+00:00,2024-06-16 22:25:21.529449+00:00,OK,,[],20ec758fb81ac6c7,b131b30928edfafc46655c8cdb0ea6f9,application/json,"{""query"": ""Which is taller, the Empire State B...",[{'document.content': 'Columbia Center | The C...,RETRIEVER,,,,
3f0b258c60db60d9,Retrieve.forward,RETRIEVER,c9b3520cb82fe4db,2024-06-16 22:25:21.529449+00:00,2024-06-16 22:25:21.582593+00:00,OK,,[],3f0b258c60db60d9,b131b30928edfafc46655c8cdb0ea6f9,application/json,"{""query_or_queries"": ""Which is taller, the Emp...",[{'document.content': 'Columbia Center | The C...,RETRIEVER,,,,
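
The `start_time` and `end_time` columns are enough to compute per-span latency by hand. The sketch below uses three rows copied from the span table above and only the standard library; `latency_seconds` is a hypothetical helper.

```python
from datetime import datetime

# (name, start_time, end_time) copied from the span table above.
spans = [
    ("ColBERTv2.__call__", "2024-06-15 14:54:15.330960+00:00", "2024-06-15 14:54:15.333069+00:00"),
    ("OllamaLocal.request", "2024-06-15 14:54:15.376725+00:00", "2024-06-15 14:54:21.834093+00:00"),
    ("RAG.forward", "2024-06-15 14:54:15.330960+00:00", "2024-06-15 14:54:21.878779+00:00"),
]


def latency_seconds(start, end):
    # Parse the ISO 8601 timestamps and return the elapsed time in seconds.
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds()


latencies = {name: latency_seconds(start, end) for name, start, end in spans}
for name, seconds in latencies.items():
    print(f"{name}: {seconds:.3f}s")

# Share of the whole forward pass spent in the language model request.
llm_share = latencies["OllamaLocal.request"] / latencies["RAG.forward"]
print(f"LLM share of RAG.forward: {llm_share:.1%}")
```

For this trace, nearly all of the time in `RAG.forward` is spent waiting on the language model request, while the retriever call takes only a few milliseconds.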


In [None]:
# compiled_module.save("rag_model.json")

## 7. Instrument DSPy and Launch Phoenix

Now that we've compiled our RAG program, let's see what's going on under the hood.

Launch Phoenix, which will run in the background and collect spans and traces from your instrumented DSPy application.

In [None]:
# import phoenix as px
# phoenix_session = px.launch_app()

Then instrument your application with [OpenInference](https://github.com/Arize-ai/openinference/tree/main/spec), an open standard built atop [OpenTelemetry](https://opentelemetry.io/) that captures and stores LLM application executions. OpenInference provides telemetry data to help you understand the invocation of your LLMs and the surrounding application context, including retrieval from vector stores, the usage of external tools or APIs, etc.

In [None]:
# import phoenix as px

# from openinference.semconv.resource import ResourceAttributes
# from openinference.instrumentation.dspy import DSPyInstrumentor
# from opentelemetry import trace as trace_api
# from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
# from opentelemetry.sdk import trace as trace_sdk
# from opentelemetry.sdk.resources import Resource
# from opentelemetry.sdk.trace.export import SimpleSpanProcessor

# endpoint = "http://127.0.0.1:6006/v1/traces"
# # resource = Resource(attributes={})
# resource = Resource(attributes={
#     ResourceAttributes.PROJECT_NAME: 'Span-test'
# })
# tracer_provider = trace_sdk.TracerProvider(resource=resource)
# span_otlp_exporter = OTLPSpanExporter(endpoint=endpoint)
# tracer_provider.add_span_processor(SimpleSpanProcessor(span_exporter=span_otlp_exporter))
# trace_api.set_tracer_provider(tracer_provider=tracer_provider)
# DSPyInstrumentor().instrument()

## 8. Run Your Application

Let's run our DSPy application on the dev set.

In [None]:
for example in devset:
    question = example["question"]
    prediction = compiled_module(question)
    print("Question")
    print("========")
    print(question)
    print()
    print("Predicted Answer")
    print("================")
    print(prediction.answer)
    print()
    print("Retrieved Contexts (truncated)")
    print(f"{[c[:200] + '...' for c in prediction.context]}")
    print()
    print()

Check the Phoenix UI to inspect the architecture of your DSPy module.

In [None]:
# print(phoenix_session.url)

A few things to note:

- The spans in each trace correspond to the steps in the `forward` method of our custom subclass of `dspy.Module`,
- The call to `ColBERTv2` appears as a retriever span with retrieved documents and scores displayed for each forward pass,
- The LLM span includes the fully-formatted prompt containing few-shot examples computed by DSPy during compilation.

![a tour of your traces and spans in DSPy, highlighting retriever and LLM spans in particular](https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/dspy-tracing-tutorial/dspy_spans_and_traces.gif)
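
The nesting that the Phoenix UI displays can also be rebuilt from the `parent_id` column. The sketch below takes the span IDs, parent IDs, and names of the first trace in the span table earlier in this notebook and prints them as an indented tree; the helper `render` is hypothetical.

```python
from collections import defaultdict

# (span_id, parent_id, name) for one trace, copied from the span table earlier.
rows = [
    ("4a4cce389c45b9bd", "84eed815a1687b79", "ColBERTv2.__call__"),
    ("84eed815a1687b79", "09cfb58cd2a7bfad", "Retrieve.forward"),
    ("44dbe9efcc70b096", "6cb2fe0e88680e23", "OllamaLocal.request"),
    ("6cb2fe0e88680e23", "09cfb58cd2a7bfad", "ChainOfThought(GenerateAnswer).forward"),
    ("09cfb58cd2a7bfad", None, "RAG.forward"),
]

children = defaultdict(list)
names = {}
for span_id, parent_id, name in rows:
    names[span_id] = name
    children[parent_id].append(span_id)


def render(span_id, depth=0):
    # Depth-first walk that indents each span under its parent.
    lines = ["  " * depth + names[span_id]]
    for child in children[span_id]:
        lines.extend(render(child, depth + 1))
    return lines


# Root spans are the ones without a parent.
tree = [line for root in children[None] for line in render(root)]
print("\n".join(tree))
```

The result has `RAG.forward` at the root, with the retriever and chain-of-thought steps as its children, matching the order of calls in the `forward` method.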

Congrats! You've used DSPy to bootstrap a multishot prompt with hard negative passages and chain of thought, and you've used Phoenix to observe the inner workings of DSPy and understand the internals of the forward pass.