<img src="../../docs/images/DSPy8.png" alt="DSPy8 Image" height="150"/>

## DSPy: Compiling chains from `LangChain`

One of the most powerful features of **DSPy** is its optimizers. **DSPy optimizers** can take any LM system and tune its prompts (or the LM weights) to maximize any objective.

Optimizers can improve the quality of your LM systems and make your code adaptive to new LMs or new data. They are meant to bring structure and modularity in place of hacky practices like (i) manual prompt engineering, (ii) designing complex pipelines for generating synthetic data, or (iii) designing complex pipelines for finetuning.
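To make "tune the prompts to maximize any objective" a bit more concrete, here's a toy sketch in plain Python. It is *not* how DSPy optimizers work internally: `fake_lm`, the candidate templates, and the exact-match metric are all hypothetical stand-ins. The point is only the contract: an optimizer scores each candidate prompt by its average metric over some data and keeps the best one.

```python
from statistics import mean
from typing import Callable

# Toy dataset of (question, gold answer) pairs.
DATASET = [
    ("What is the capital of France?", "Paris"),
    ("What color is the sky on a clear day?", "blue"),
]

# Hypothetical "knowledge" the stand-in LM draws on.
KNOWLEDGE = dict(DATASET)


def fake_lm(prompt: str) -> str:
    """Stand-in LM: answers tersely only when the prompt asks it to."""
    question = prompt.rsplit("Question: ", 1)[-1]
    answer = KNOWLEDGE.get(question, "I don't know")
    if "concisely" in prompt:
        return answer
    return f"Well, I believe the answer might be {answer}."


def exact_match(prediction: str, gold: str) -> float:
    return float(prediction.strip().lower() == gold.lower())


def optimize_prompt(
    templates: list[str],
    lm: Callable[[str], str],
    metric: Callable[[str, str], float],
) -> tuple[str, float]:
    """Return the template with the highest average metric over DATASET."""
    best_template, best_score = templates[0], -1.0
    for template in templates:
        score = mean(metric(lm(template.format(question=q)), a) for q, a in DATASET)
        if score > best_score:
            best_template, best_score = template, score
    return best_template, best_score


CANDIDATES = [
    "Question: {question}",
    "Answer concisely. Question: {question}",
]

best, score = optimize_prompt(CANDIDATES, fake_lm, exact_match)
print(best, score)
```

Real optimizers search a much richer space (instructions, few-shot demos, or even LM weights), but they take the same ingredients: a program, a metric, and some data.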

In [1]:
# Install the dependencies if needed.
# ! pip install -U dspy-ai
! pip install -U openai jinja2 python-dotenv
! pip install -U langchain langchain-community langchain-openai langchain-core



Typically, we use DSPy optimizers with DSPy modules. But here, we've worked with [Harrison Chase](https://twitter.com/hwchase17) to make sure DSPy can also optimize chains built with the `LangChain` library.

This short tutorial demonstrates how this proof-of-concept feature works. _This will **not** give you the full power of DSPy or LangChain yet, but we will expand it if there's high demand._

If we convert this into a fuller integration, all users stand to benefit. LangChain users will gain the ability to optimize any chain with any DSPy optimizer. DSPy users will gain the ability to _export_ any DSPy program into an LCEL chain that supports streaming, tracing, and other rich production-targeted features in LangChain.

### 1) Setting Up

First, let's load our API keys from a `.env` file, import `dspy`, and set up the retrieval model: a ColBERTv2 index over Wikipedia 2017 abstracts.

In [2]:
import os

from dotenv import load_dotenv

# Load the .env file
load_dotenv(".env")

# Now you can access the environment variables defined in the .env file.
# OPENAI_API_KEY is required; the LANGCHAIN_* variables configure LangSmith tracing.
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
LANGCHAIN_TRACING_V2 = os.environ["LANGCHAIN_TRACING_V2"]
LANGCHAIN_ENDPOINT = os.environ["LANGCHAIN_ENDPOINT"]
LANGCHAIN_API_KEY = os.environ["LANGCHAIN_API_KEY"]
LANGCHAIN_PROJECT = os.environ["LANGCHAIN_PROJECT"]

In [3]:
import dspy

colbertv2 = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")


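Next, let's set up `langchain`: a SQLite cache for LLM calls, the OpenAI LM we'll use (`gpt-3.5-turbo-instruct`), and a `retrieve` helper that fetches the top-5 ColBERTv2 passages for a question. We'll import the DSPy modules for interacting with LangChain runnables, `LangChainPredict` and `LangChainModule`, in step 3.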

In [4]:
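from langchain_community.cache import SQLiteCache
from langchain.globals import set_llm_cache
from langchain_openai import OpenAI

# Cache LLM calls on disk so that re-running the notebook doesn't repeat API calls.
set_llm_cache(SQLiteCache(database_path="cache.db"))

llm = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0)


def retrieve(inputs):
    # Return the text of the top-5 ColBERTv2 passages for the input question.
    return [doc["text"] for doc in colbertv2(inputs["question"], k=5)]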

In [5]:
colbertv2("cycling")



[{'text': 'Cycling | Cycling, also called bicycling or biking, is the use of bicycles for transport, recreation, exercise or sport. Persons engaged in cycling are referred to as "cyclists", "bikers", or less commonly, as "bicyclists". Apart from two-wheeled bicycles, "cycling" also includes the riding of unicycles, tricycles, quadracycles, recumbent and similar human-powered vehicles (HPVs).',
  'pid': 2201868,
  'rank': 1,
  'score': 27.078739166259766,
  'prob': 0.3544841299722533,
  'long_text': 'Cycling | Cycling, also called bicycling or biking, is the use of bicycles for transport, recreation, exercise or sport. Persons engaged in cycling are referred to as "cyclists", "bikers", or less commonly, as "bicyclists". Apart from two-wheeled bicycles, "cycling" also includes the riding of unicycles, tricycles, quadracycles, recumbent and similar human-powered vehicles (HPVs).'},
 {'text': 'Cycling (ice hockey) | In ice hockey, cycling is an offensive strategy that moves the puck along 

If it's useful, we can set up some caches so you can run this whole notebook in Google Colab without any API keys. Let us know.
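The `retrieve` helper defined above only relies on the retriever returning a list of dicts with a `text` key, like the ColBERTv2 output shown. If you want to experiment offline, one option is a small local stand-in. Everything below (`PASSAGES`, `search`, `tiny_retriever`, `retrieve_offline`, and the keyword-overlap ranking) is a hypothetical sketch, not part of DSPy; `functools.lru_cache` memoizes repeated queries in the spirit of the caching mentioned above.

```python
from functools import lru_cache

# Hypothetical sample corpus in the same "Title | text" style as the output above.
PASSAGES = (
    "Cycling | Cycling, also called bicycling or biking, is the use of bicycles for transport.",
    "Cycling (ice hockey) | In ice hockey, cycling is an offensive strategy.",
    "Rowing | Rowing is the sport of racing boats using oars.",
)


@lru_cache(maxsize=1024)
def search(query: str, k: int = 5) -> tuple[str, ...]:
    """Rank passages by naive keyword overlap; results are memoized per (query, k)."""
    terms = set(query.lower().split())
    # sorted() is stable, so passages with equal overlap keep their corpus order.
    ranked = sorted(PASSAGES, key=lambda p: -len(terms & set(p.lower().split())))
    return tuple(ranked[:k])


def tiny_retriever(query: str, k: int = 5) -> list[dict]:
    """Mimic the ColBERTv2 result shape: a list of dicts with a 'text' key."""
    return [{"text": text} for text in search(query, k)]


def retrieve_offline(inputs: dict) -> list[str]:
    """Offline analogue of the notebook's `retrieve` helper."""
    return [doc["text"] for doc in tiny_retriever(inputs["question"], k=2)]


print(retrieve_offline({"question": "cycling"}))
```

Keyword overlap is a deliberately crude ranking; the point is only that anything with this input/output shape can slot into the chain.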

### 2) Defining a chain as a `LangChain` expression

For illustration, let's tackle the following task.

**Task:** Build a RAG system for generating informative tweets.
- **Input:** A factual **question**, which may be fairly complex.
- **Output:** An engaging **tweet** that correctly answers the question from the retrieved info.

Let's use LangChain's expression language (LCEL) to illustrate this. Any prompt will do here, since we will optimize the final prompt with DSPy.

With that in mind, let's keep it bare-bones: **Given {context}, answer the question {question} as a tweet.**
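It can help to see what data flows through the chain in the next cell. `RunnablePassthrough.assign(context=retrieve)` passes the input dict through and adds a `context` key computed from it; the prompt then fills in `{context}` and `{question}`. The plain-Python sketch below mimics only that dict-passing behavior: it does not use LangChain, and `fake_retrieve` and `assign` are hypothetical stand-ins.

```python
PROMPT = "Given {context}, answer the question `{question}` as a tweet."


def fake_retrieve(inputs: dict) -> list[str]:
    # Hypothetical stand-in for the ColBERTv2-backed `retrieve` helper.
    return [f"Passage about {inputs['question']}"]


def assign(**fns):
    """Mimic RunnablePassthrough.assign: keep the input keys, add computed ones."""
    def step(inputs: dict) -> dict:
        return {**inputs, **{key: fn(inputs) for key, fn in fns.items()}}
    return step


def format_prompt(inputs: dict) -> str:
    # Like an f-string-style prompt template, str.format fills the named variables.
    return PROMPT.format(**inputs)


state = assign(context=fake_retrieve)({"question": "cycling"})
print(format_prompt(state))
```

In this sketch the list of passages is rendered with Python's list repr; the LM just sees it as text inside the prompt.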

In [6]:
# From LangChain, import standard modules for prompting.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Just a simple prompt for this task. It's fine if it's complex too.
prompt = PromptTemplate.from_template(
    "Given {context}, answer the question `{question}` as a tweet."
)

# This is how you'd normally build a chain with LCEL. This chain does retrieval then generation (RAG).
vanilla_chain = (
    RunnablePassthrough.assign(context=retrieve) | prompt | llm | StrOutputParser()
)

### 3) Converting the chain into a **DSPy module**

Our goal is to optimize this prompt so we have a better tweet generator. DSPy optimizers can help, but they only work with DSPy modules!

For this reason, we created two new modules in DSPy: `LangChainPredict` and `LangChainModule`.
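LCEL chains are built with the `|` operator, which composes steps left to right. As a rough mental model of why swapping `prompt | llm` for a single `LangChainPredict(prompt, llm)` step leaves the rest of the chain intact, here is a minimal, hypothetical `Step` class that overloads `|` via `__or__`. It is not LangChain's `Runnable` implementation, and the canned "LM" is a stand-in.

```python
from typing import Any, Callable


class Step:
    """Hypothetical minimal runnable: wraps a function and composes with `|`."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # `a | b` builds a new step that feeds a's output into b.
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x: Any) -> Any:
        return self.fn(x)


toy_prompt = Step(lambda d: f"Q: {d['question']}")
toy_llm = Step(lambda text: text + " -> A: 42\n")  # canned "LM" output
toy_parser = Step(str.strip)

# Separate prompt and LM steps...
chain_a = toy_prompt | toy_llm | toy_parser
# ...versus one fused step that formats the prompt and calls the LM together,
# loosely analogous to LangChainPredict(prompt, llm).
fused = Step(lambda d: toy_llm.invoke(toy_prompt.invoke(d)))
chain_b = fused | toy_parser

print(chain_a.invoke({"question": "What is 6 x 7?"}))
```

Both chains produce the same output, which is why the replacement is local: the steps before and after the fused prompt-plus-LM step don't need to change.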

In [7]:
# From DSPy, import the modules that know how to interact with LangChain LCEL.
from dspy.predict.langchain import LangChainModule, LangChainPredict

# This is how to wrap it so it behaves like a DSPy program.
# Just replace every pattern like `prompt | llm` with `LangChainPredict(prompt, llm)`.
zeroshot_chain = (
    RunnablePassthrough.assign(context=retrieve)
    | LangChainPredict(prompt, llm)
    | StrOutputParser()
)
# Then wrap the whole chain in a LangChainModule, which DSPy optimizers can work with.
zeroshot_chain = LangChainModule(zeroshot_chain)


### 4) Trying the module

How good is our `LangChainModule` at this task? Well, we can ask it to generate a tweet for the following question.

In [None]:
question = "In what region was Eddy Mazzoleni born?"

zeroshot_chain.invoke({"question": question})


Ah, that sounds about right! (It's technically not perfect: we asked for the _region_, not the city. We can do better below.)

Inspecting questions and answers manually is very important to get a sense of your system. However, a good system designer always looks to iteratively **benchmark** their work to quantify progress!

To do this, we need two things: the **metric** we want to maximize and a (tiny) **dataset** of examples for our system.

Are there pre-defined metrics for good tweets? Should I label 100,000 tweets by hand? Probably not. We can easily do something reasonable, though, until you start getting data in production!

### 5) Evaluating the module

To get started, we'll define our own simple metric and we'll borrow a bunch of questions from a QA dataset and use them here for tuning.

**What makes a good tweet?** I don't know, but in the spirit of iterative development, let's start simple!

Let's define a good tweet as one with three properties: it should be (1) factually correct, (2) based on real sources, and (3) engaging for people.
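The notebook's actual metric is loaded from `tweet_metric.py` in the next cell and is itself a DSPy program. Purely to illustrate how three properties can combine into one score, here is a hypothetical rule-based sketch: "correct" checks that the gold answer appears in the tweet, "based on real sources" checks that the answer also appears in a retrieved passage, and "engaging" is crudely approximated by fitting within 280 characters. These heuristics and the equal weighting are assumptions, not the notebook's metric.

```python
def toy_tweet_metric(answer: str, context: list[str], tweet: str) -> float:
    """Hypothetical metric: fraction of the three properties a tweet satisfies."""
    gold = answer.lower()
    # (1) Correct: the gold answer appears in the tweet.
    correct = gold in tweet.lower()
    # (2) Based on real sources: the answer is also supported by a retrieved passage.
    faithful = correct and any(gold in passage.lower() for passage in context)
    # (3) Engaging (crude proxy): non-empty and short enough to post.
    engaging = 0 < len(tweet) <= 280
    return (correct + faithful + engaging) / 3


context = ["Paris | Paris is the capital of France."]
print(toy_tweet_metric("Paris", context, "Paris is the capital of France! #travel"))
```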

In [None]:
# We took the liberty of defining this metric and loading a few examples from a standard QA dataset.
# Let's import them from `tweet_metric.py`, in the same directory as this notebook.
from tweet_metric import metric, trainset, valset, devset

# We loaded 200, 50, and 150 examples for training, validation (tuning), and development (evaluation), respectively.
# You could load fewer (or more) and, chances are, the right DSPy optimizers will work well for many problems.
len(trainset), len(valset), len(devset)

Is this the right metric or the most representative set of questions? Not necessarily. But they get us started in a way we can iterate on systematically!

**Note:** Our dataset doesn't actually include any tweets! It only has questions and answers. That's OK: our metric takes care of evaluating outputs in tweet form.

Okay, let's evaluate the unoptimized "zero-shot" version of our chain, converted from our `LangChain` LCEL object.
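Conceptually, evaluation runs the program on every devset example, scores each output with the metric, and reports the average as a percentage. The sketch below shows only that core loop, single-threaded; DSPy's `Evaluate` also handles multithreading (`num_threads`), progress display, and the results table. `evaluate_average`, `toy_program`, `toy_metric`, and the examples are hypothetical.

```python
from statistics import mean


def evaluate_average(program, metric, devset) -> float:
    """Run `program` on each example and return the mean metric as a percentage."""
    scores = [metric(example, program(example["question"])) for example in devset]
    return round(100 * mean(scores), 2)


toy_devset = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "3 + 3?", "answer": "6"},
    {"question": "5 + 5?", "answer": "10"},
    {"question": "1 + 1?", "answer": "2"},
]


def toy_program(question: str) -> str:
    # Hypothetical program that only knows two of the answers.
    known = {"2 + 2?": "4", "1 + 1?": "2"}
    return known.get(question, "not sure")


def toy_metric(example: dict, prediction: str) -> float:
    return float(example["answer"] in prediction)


print(evaluate_average(toy_program, toy_metric, toy_devset))  # 50.0
```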

In [None]:
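from dspy.evaluate.evaluate import Evaluate

# Evaluate on the devset with 8 threads, showing progress and a table of the first 5 results.
evaluate = Evaluate(metric=metric, devset=devset, num_threads=8, display_progress=True, display_table=5)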
evaluate(zeroshot_chain)

Okay, cool. Our `zeroshot_chain` gets about **43%** on the 150 questions from the devset.

The table above shows some examples. For instance:

- **Question**: Who was a producer who produced albums for both rock bands Juke Kartel and Thirty Seconds to Mars?
- **Tweet**: Brian Virtue, who has worked with bands like Jane's Addiction and Velvet Revolver, produced albums for both Juke Kartel and Thirty Seconds to Mars, showcasing... [truncated]
- **Metric**: 1.0 (A tweet that is correct, faithful, and engaging!*)

\* At least according to our metric, which is itself a DSPy program, so _it too_ can be optimized if you'd like! That's a topic for another notebook, though.

### 6) Optimizing the module

DSPy has many optimizers, but the de facto default is currently `BootstrapFewShotWithRandomSearch`.

**If you're curious how it works:** This optimizer works by running your program (in this case, `zeroshot_chain`) on `trainset` questions. Each time it runs, DSPy will remember the input and output of each LM call. These are called traces, and this particular optimizer will keep track of "good" traces (i.e., ones that the metric likes). Then, this optimizer will try to find good ways to leverage these traces as automatic few-shot examples. It will try them out, seeking to maximize the average metric on `valset`. There are many ways to self-generate (bootstrap) examples. There are also many ways to optimize their selection (here, with random search). That's why there are several other optimizers in DSPy.
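Here is a heavily simplified, hypothetical sketch of that idea in plain Python: collect "good" traces on the trainset, then try a few random subsets of them as few-shot demos and keep the subset that scores best on the valset. The real optimizer in `dspy.teleprompt` does more than this; `bootstrap_random_search`, `toy_program`, `toy_metric`, and the data are stand-ins, and `toy_program` is contrived so that demos visibly help.

```python
import random
from statistics import mean


def bootstrap_random_search(run_program, metric, trainset, valset,
                            max_demos=3, num_candidates=3, seed=0):
    """Toy version: bootstrap good traces, then random-search demo subsets."""
    # 1) Bootstrap: run the program without demos on the trainset and keep
    #    the (input, output) traces that the metric accepts.
    traces = []
    for example in trainset:
        output = run_program(example["question"], demos=[])
        if metric(example, output) > 0:
            traces.append((example["question"], output))

    def val_score(demos):
        return mean(metric(ex, run_program(ex["question"], demos=demos)) for ex in valset)

    # 2) Random search: start from the zero-shot baseline, then try random
    #    subsets of traces as few-shot demos and keep the best on valset.
    rng = random.Random(seed)
    best_demos, best_score = [], val_score([])
    for _ in range(num_candidates):
        demos = rng.sample(traces, k=min(max_demos, len(traces)))
        score = val_score(demos)
        if score > best_score:
            best_demos, best_score = demos, score
    return best_demos, best_score


def toy_program(question: str, demos: list) -> str:
    # Hypothetical program: without demos it only handles single-digit sums;
    # with at least one demo it handles any sum.
    a, b = (int(x) for x in question.split("+"))
    if demos or (a < 10 and b < 10):
        return str(a + b)
    return "?"


def toy_metric(example: dict, output: str) -> float:
    return float(output == example["answer"])


toy_trainset = [{"question": f"{a}+{b}", "answer": str(a + b)} for a, b in [(1, 2), (3, 4), (12, 30), (5, 5)]]
toy_valset = [{"question": f"{a}+{b}", "answer": str(a + b)} for a, b in [(20, 22), (2, 2), (40, 1)]]

best_demos, best_score = bootstrap_random_search(toy_program, toy_metric, toy_trainset, toy_valset)
print(len(best_demos), best_score)
```

Only traces the metric accepts become demo candidates (the failed `12+30` run is discarded), and a candidate replaces the current best only if it strictly beats it on the valset.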

In [None]:
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# Set up the optimizer. We'll use very minimal hyperparameters for this example.
# Just do random search with ~3 attempts, and in each attempt, bootstrap <= 3 traces.
optimizer = BootstrapFewShotWithRandomSearch(metric=metric, max_bootstrapped_demos=3, num_candidate_programs=3)

# Now use the optimizer to *compile* the chain. This could take 5-10 minutes, unless it's cached.
optimized_chain = optimizer.compile(zeroshot_chain, trainset=trainset, valset=valset)

### 7) Evaluating the optimized chain

Well, how good is this? _Not every optimization run will magically result in improvement on unseen examples!_ So let's check!

First let's ask that question from above.

In [None]:
question = "In what region was Eddy Mazzoleni born?"

optimized_chain.invoke({"question": question})

Nice! Anecdotally, it appears a bit more precise than the answer from `zeroshot_chain`. But now let's do some proper evals!

In [None]:
evaluate(optimized_chain)

We started with `zeroshot_chain` at **43%** and now we have **52%**. That's a nice **21%** relative improvement. Not bad!
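For reference, the relative improvement is the gain divided by the baseline score. A quick check with the rounded figures quoted above:

```python
baseline, optimized = 43.0, 52.0  # devset scores (%) quoted above
relative_gain = (optimized - baseline) / baseline
print(f"{relative_gain:.1%}")  # 20.9%
```

That is a 9-point absolute gain, which rounds to the roughly 21% relative improvement stated.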

### 8) Inspecting the optimized chain in action

In [None]:
prompt, output = dspy.settings.langchain_history[-4]

print('PROMPT:\n\n', prompt)
print('\n\nOUTPUT:\n\n', output)

#### Acknowledgements:

Thanks to [Harrison Chase](https://twitter.com/hwchase17) for co-leading this new integration. Thanks to our own [Arnav Singhvi](https://arnavsinghvi11.github.io/) for helping cook up this tweet generation task and for the insight into how to get data to use here.