# Document Understanding with the DSP framework, using Stanford's state-of-the-art search engine. All on CPU.

## Introduction

In this article, we illustrate how to apply the DSP ("Demonstrate-Search-Predict") framework and its Python library, both developed at Stanford University (https://arxiv.org/abs/2212.14024), to a "document understanding" use case. In the original demo, the framework was applied to Wikipedia. Here we apply it to our own document (a pdf). We chose a legal document, an ideal candidate because it contains a lot of text, but you should be able to apply the same approach to any of your documents.

This article is an extract from a Jupyter Notebook, which is available on GitHub (https://github.com/feldges/DSP-on-Documents). For a more user-friendly experience (especially for non-technical people), there is also a Streamlit version (https://github.com/feldges/DSP-on-Documents/blob/main/DSP%20on%20Documents%20Streamlit.py).

We build an end-to-end demo, going through the whole process.

First we need to prepare the data:
- Text extraction and "chunking"
- Text indexing

Then we can demonstrate, search and predict with the following steps:
- Demonstrate (optional in our case)
- Search (information retrieval)
- Predict

For this work we use the DSP framework and Stanford's state-of-the-art search engine (or information retrieval engine), ColBERTv2. Fortunately, both tools have been made open source. Many thanks to the authors for that!

This is a bit abstract for the moment. Let's jump into the example to see what it looks like in practice. Stay tuned, it is very easy!

The file we will use for this exercise is the Swiss Constitution (which can be downloaded from: https://www.fedlex.admin.ch/eli/cc/1999/404/en). We use the English version, because the search engine ColBERTv2 has been trained on English documents only.

## Text extraction

If you are familiar with text extraction for pdf files, you can ignore this part and jump to the next section.

The goal of the text extraction is to split the text into chunks (or passages) that are aggregated in a relevant manner. "Relevant" is not precisely defined, so let's think about what makes sense. First, let's think ahead: the passages will be injected as context into a Large Language Model. This means their size must be limited, because current language models accept only a limited amount of input text. The passages should also be split in a meaningful way. For example, a sentence should not be cut in the middle. We decided to split the text by legal article.

In addition, the title, chapter, section, and article names can be as important as the text itself (they are a sort of metadata). This is why, when we create the passages, we prepend the title, chapter, section, and article names to each of them. You can see in more detail below how we did it.
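
To make the target format concrete, here is a minimal sketch of how one passage could be assembled from its metadata and its text. The function name `make_passage` is hypothetical (it is not part of the notebook); the `|`-separated layout mirrors the extraction output shown later in this article.

```python
def make_passage(title: str, chapter: str, section: str, article: str, text: str) -> str:
    """Prefix the article text with its title, chapter, section and article names."""
    header = "|".join([title, chapter, section, article])
    return f"|{header}|{text.strip()}"


# Example values taken from the Swiss Constitution; chapter and section are empty here
passage = make_passage(
    title="Title 1 General Provisions",
    chapter="",
    section="",
    article="Art. 3 Cantons",
    text="The Cantons are sovereign except to the extent that ...",
)
print(passage)
```

Keeping the metadata inside the passage text means the search engine and the LLM both see which article a passage belongs to, without any extra lookup.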

Now we know what we want. Is the implementation simple? Unfortunately not. Looking at the pdf document, we see that there is text all over the place! Headers, footnotes, ... 

So let's dive into it. There are many libraries that allow text extraction in Python, many of which are similar, yet slightly different. The library LangChain also offers a wrapper around many of them. But instead of using LangChain, we will use PyMuPDF directly, because it offers more flexibility.

We will take special care with this step. Unfortunately, it is typically customized to the document at hand, and generalizing the extraction to any pdf file can be painful. Still, we make the extra effort to get a clean text extraction, since it is key to getting good results from our search engine. You can have the best search engine (which we have in this example!), but if you provide it with poor quality data, you will get poor results.

Let's start by importing the necessary library, providing the path to the file we want to analyze, and opening the document:

In [1]:
import fitz # If not yet installed, run "pip install PyMuPDF"

In [3]:
filepath="docs/swiss_constitution.pdf" # Path to your document
doc = fitz.open(filepath)

Then, instead of extracting a plain-vanilla text, we will extract blocks and collate them together. The reason is that the blocks are smaller parts of text, and their extraction returns metadata like position (x, y), the font type and size. This is important information for us, for the extraction process.
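
As an illustration of how the position and size metadata can be used, the sketch below decides whether a span belongs to the body text. The function name `is_body_span` is hypothetical, and the thresholds (50 and 500 for the vertical position, 6.5 for the font size) are the values that fit this particular pdf; they are assumptions for any other document. The span is a plain dictionary with the same keys as the ones returned by PyMuPDF.

```python
def is_body_span(span: dict, header_y: float = 50.0, footer_y: float = 500.0,
                 min_size: float = 6.5) -> bool:
    """Return True if a PyMuPDF-style span dict looks like body text."""
    x0, y0, x1, y1 = span["bbox"]
    if y1 < header_y:      # the span ends above the header limit
        return False
    if y0 > footer_y:      # the span starts below the footer limit
        return False
    # Tiny non-numeric text is treated as a footnote reference
    if span["size"] < min_size and not span["text"].isnumeric():
        return False
    return True


# Small paragraph number (numeric, so it is kept even though the font is tiny)
print(is_body_span({"bbox": (34.0, 103.0, 37.3, 110.2), "size": 6.48, "text": "4"}))
# Text located in the page header area
print(is_body_span({"bbox": (34.0, 20.0, 100.0, 40.0), "size": 9.0, "text": "Header"}))
```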

In addition to the block metadata, we also rely on text patterns. Looking at the text, a chapter heading always starts with "Chapter ", an article heading with "Art. ", and so on. We use these patterns to recognize the headings and build the passages as described above.
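
These heading patterns can be written as a regular expression. The sketch below is only an illustration: `HEADING_PATTERN`, `KIND_NAMES` and `classify_heading` are hypothetical names, and the extraction code in this article combines the text pattern with font information, because a pattern alone would also match a body sentence that happens to start with "Section ".

```python
import re
from typing import Optional, Tuple

# A heading starts with one of these keywords, followed by its number
HEADING_PATTERN = re.compile(r"^(Title|Chapter|Section|Art\.)\s+(\S+)")
KIND_NAMES = {"Title": "title", "Chapter": "chapter", "Section": "section", "Art.": "article"}


def classify_heading(text: str) -> Optional[Tuple[str, str]]:
    """Return (kind, number) for a heading line, or None for ordinary text."""
    match = HEADING_PATTERN.match(text.strip())
    if match is None:
        return None
    return KIND_NAMES[match.group(1)], match.group(2)


print(classify_heading("Title 1 General Provisions"))
print(classify_heading("Art. 2  Aims"))
print(classify_heading("The Cantons are sovereign"))
```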

To illustrate how we use the blocks returned by PyMuPDF, let's look at the metadata below, which is for block 5 of page 1 of the document (both counted from zero).

In [6]:
p=1 # You can change this number to visualize another page
b=5 # You can change this number to visualize another block of the page
blocks = doc[p].get_text("dict", flags=11)["blocks"]
blocks[b]

{'number': 5,
 'type': 0,
 'bbox': (34.02000045776367,
  102.75997924804688,
  342.2460632324219,
  122.74679565429688),
 'lines': [{'spans': [{'size': 6.480000019073486,
     'flags': 4,
     'font': 'TimesNewRomanPSMT',
     'color': 0,
     'ascender': 0.89111328125,
     'descender': -0.21630859375,
     'text': '4',
     'origin': (34.02000045776367, 108.80001831054688),
     'bbox': (34.02000045776367,
      103.02560424804688,
      37.26000213623047,
      110.20169830322266)},
    {'size': 9.0,
     'flags': 4,
     'font': 'TimesNewRomanPSMT',
     'color': 0,
     'ascender': 0.89111328125,
     'descender': -0.21630859375,
     'text': ' It is committed to the long term preservation of natural resources and to a just and ',
     'origin': (37.2599983215332, 110.77999877929688),
     'bbox': (37.2599983215332,
      102.75997924804688,
      342.2460632324219,
      112.72677612304688)}],
   'wmode': 0,
   'dir': (1.0, 0.0),
   'bbox': (34.02000045776367,
    102.75997924804

For each span of each line in the block, it returns the font size, the font type, the color, and a flag describing whether the text is bold, italic, etc. The bounding box (bbox) gives the coordinates (x, y) of where the text starts (top-left) and where it ends (bottom-right).
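
The flag is an integer in which each bit stands for one font property. The sketch below decodes it; the function name `decode_span_flags` is hypothetical, and the bit meanings follow the PyMuPDF documentation as I understand it, so please check them against your PyMuPDF version.

```python
def decode_span_flags(flags: int) -> list:
    """Translate the integer 'flags' of a PyMuPDF span into readable font properties."""
    names = []
    if flags & 1:
        names.append("superscript")
    if flags & 2:
        names.append("italic")
    if flags & 4:
        names.append("serifed")
    if flags & 8:
        names.append("monospaced")
    if flags & 16:
        names.append("bold")
    return names


print(decode_span_flags(4))   # flags of the spans shown above: ['serifed']
print(decode_span_flags(20))  # ['serifed', 'bold']
```

Under this reading, a flag of 20 is a bold serifed font, which is why the extraction code checks for this value to recognize section and article headings.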

This information leads us to define the following text extraction, for which we show an extract at the end:

In [7]:
# Initialization
t = ''
title=""
old_title=""
chapter=""
old_chapter=""
section=""
old_section=""
article=""
old_article=""
paragraph=""
break_loop=False
# Loop on pages of the document
for p in doc:
    blocks = p.get_text("dict", flags=11)["blocks"]
    title_y=0.0
    chapter_y=0.0
    section_y=0.0
    article_y=0.0
    # Loop on blocks of a page
    for b in blocks:  # iterate through the text blocks
        if break_loop==True:
            break
        for l in b["lines"]:  # iterate through the text lines
            if break_loop==True:
                break
            for s in l["spans"]:  # iterate through the text spans
                # Ignore header
                if s["bbox"][3]<50:
                    break
                # Ignore footer
                if s["bbox"][1]>500:
                    break
                # Ignore smallest text (like reference to a footnote)
                if s["size"]<6.5 and not(s["text"].isnumeric()):
                    break
                # Content page at the end of the document is noise for the search, we remove it
                if s["text"].strip()=="Contents" and s["size"]>10 and s["flags"]==20:
                    break_loop=True
                    break
                # Get title
                if s["bbox"][1]==title_y and s["text"]!="":
                    title+=" "+s["text"].strip()
                # Get title if on a new line
                elif s["text"][:6]=="Title ":
                    title = s["text"].strip()
                    title_y = s["bbox"][1]
                # Get chapter
                elif s["bbox"][1]==chapter_y and s["text"]!="":
                    chapter+=" "+s["text"].strip()
                # Get chapter if on a new line
                elif s["text"][:8]=="Chapter ":
                    chapter = s["text"].strip()
                    chapter_y = s["bbox"][1]
                # Get section
                elif s["bbox"][1]==section_y and s["text"]!="":
                    section+=" "+s["text"].strip()
                # Get section if on a new line
                elif s["text"][:8]=="Section " and s["flags"]==20:
                    section=s["text"].strip()
                    section_y = s["bbox"][1]
                # Get article
                elif s["bbox"][1]==article_y and s["text"]!="":
                    article+=" "+s["text"].strip()
                # Get article if on a new line
                elif s["text"][:5]=="Art. " and s["flags"]==20:
                    article=s["text"].strip()
                    article_y = s["bbox"][1]
                # Get paragraph
                #elif s["size"]<6.5 and s["text"].isnumeric():
                #    paragraph="Paragraph "+s["text"]
                else:
                    # For any new article, create a new piece of text, with title, chapter, section and article in its header.
                    if old_article!=article:
                        t+="\n"
                        t+="|"+title+"|"+chapter+"|"+section+"|"+article+"|"
                    t+=s["text"]
                    old_article=article
print(t[1000:3000]) # You can change these numbers to visualize another part of the output

rn, Lucerne, Uri, Schwyz, Obwalden and Nidwalden, Glarus, Zug, Fribourg, Solothurn, Basel Stadt and Basel Landschaft, Schaffhausen, Appenzell Ausserrhoden and Appenzell Innerrhoden, St. Gallen, Grau-bünden, Aargau, Thurgau, Ticino, Vaud, Valais, Neuchâtel, Geneva, and Jura form the Swiss Confederation. 
|Title 1 General Provisions|||Art. 2  Aims|1 The Swiss Confederation shall protect the liberty and rights of the people and safe-guard the independence and security of the country. 2 It shall promote the common welfare, sustainable development, internal cohesion and cultural diversity of the country. 3 It shall ensure the greatest possible equality of opportunity among its citizens. 4 It is committed to the long term preservation of natural resources and to a just and peaceful international order. 
|Title 1 General Provisions|||Art. 3 Cantons|The Cantons are sovereign except to the extent that their sovereignty is limited by the Federal Constitution. They exercise all rights that are no

We are almost there :-)
The text is ready to be exported to the search engine. ColBERTv2 requires the input passages to be in a tsv file (tab-separated values), with one passage id and one passage per line:

In [40]:
import csv
# t contains all the passages we need. Let's split it and put it in a tsv file
stext = t.split('\n')
# newline='' is the setting recommended by the csv module when writing files
with open('docs/CollectionsSwissConstitution.tsv', 'wt', newline='') as out_file:
    tsv_writer = csv.writer(out_file, delimiter='\t')
    for i, tex in enumerate(stext):
        tsv_writer.writerow([i, tex])
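
Before indexing, it can be worth checking the file. The sketch below reads the tsv file back and verifies that every line has two columns, that the ids are sequential, and that no passage is empty. The function name `check_collection` is hypothetical, and the requirement that ids match the line numbers is my assumption about what ColBERT expects.

```python
import csv


def check_collection(path: str, encoding: str = None) -> int:
    """Check a tsv collection and return the number of passages found.

    Pass the same encoding that was used to write the file (None = platform default).
    """
    count = 0
    with open(path, newline="", encoding=encoding) as in_file:
        for line_idx, row in enumerate(csv.reader(in_file, delimiter="\t")):
            if len(row) != 2:
                raise ValueError(f"line {line_idx}: expected 2 columns, got {len(row)}")
            pid, text = row
            if int(pid) != line_idx:
                raise ValueError(f"line {line_idx}: unexpected id {pid!r}")
            if not text.strip():
                raise ValueError(f"line {line_idx}: empty passage")
            count += 1
    return count
```

A call such as `check_collection('docs/CollectionsSwissConstitution.tsv')` then returns the number of passages, or raises an error pointing at the first problematic line.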

## Text indexing

Our text is ready, and in the right format! The next step consists of indexing it, which means converting it into a format that allows the search engine to find the relevant text for a given question (or query). This step is tightly coupled to the search engine we will use.

We will use a state-of-the-art search engine that has been developed at Stanford University (https://arxiv.org/abs/2112.01488). The search engine is called ColBERTv2 and is open source (https://github.com/stanford-futuredata/ColBERT). To index the document for this specific search engine, we will first download the ColBERTv2 repo from GitHub and install it. For the installation, follow closely the instructions provided in the GitHub repo.

Once ColBERTv2 is installed on your machine, you can use it to index your passages with the Python code below. Note that we strictly follow the original library, which comes with a very well explained notebook: https://github.com/stanford-futuredata/ColBERT/blob/main/docs/intro.ipynb. Instead of using a Jupyter Notebook, we ran this step in PyCharm. But you can also do it in a Jupyter Notebook.

In [None]:
from colbert.infra import Run, RunConfig, ColBERTConfig
from colbert import Indexer

with Run().context(RunConfig(nranks=1, experiment="SwissConstitution")):
    config = ColBERTConfig(nbits=2)

    indexer = Indexer(checkpoint="modelcheckpoints/colbertv2.0", config=config)
    indexer.index(name="SwissConstitution.2bits", collection="data/collections/CollectionsSwissConstitution.tsv", overwrite=False)

This needs some explanation:
- The "experiment" creates a folder with this name, under which all queries and query results are logged.
- The indexer needs the colbertv2.0 model checkpoint, which is the model that has been trained by the original authors.
- The index name is the name of the index you are creating. You will find the indexed passages in a folder with this name.
- The collection is the file of raw passages that we prepared in the previous step (text extraction). You need to put this file in the folder ColBERT/colbert/data (or adjust the path above).

Note that the indexing takes time. Here is what it does: each passage is encoded by a BERT model, which produces one output vector for each token of each passage. All these vectors then go through a clustering algorithm. Each token vector is then stored as the cluster it belongs to, plus its offset (residual) from the center of this cluster, compressed to a few bits per dimension (this is what nbits=2 controls). This requires some computation time. For this document, it takes around 2 minutes on my computer (an iMac 2015 with an i5 CPU). Using GPUs would make this process much faster.
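
The "cluster plus offset" idea can be sketched in a few lines of plain Python. This is a toy illustration, not the ColBERTv2 implementation: the function names and the two-dimensional vectors are made up, and the real index additionally compresses each offset to a few bits, which is omitted here.

```python
def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(vector, centroids[i]))


def compress(vector, centroids):
    """Encode a token vector as (centroid id, offset from that centroid)."""
    cid = nearest_centroid(vector, centroids)
    residual = [x - c for x, c in zip(vector, centroids[cid])]
    return cid, residual


def decompress(cid, residual, centroids):
    """Rebuild the token vector from its centroid id and its offset."""
    return [c + r for c, r in zip(centroids[cid], residual)]


centroids = [[0.0, 0.0], [1.0, 1.0]]  # toy cluster centers (made up)
token_vector = [0.75, 1.25]           # toy token embedding (made up)
cid, residual = compress(token_vector, centroids)
print(cid, residual)                  # 1 [-0.25, 0.25]
print(decompress(cid, residual, centroids))
```

Because tokens of the same cluster share one center, only the small offsets need to be stored per token, which keeps the index compact.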

Before we can start using the search engine in the DSP framework, we need to "activate" the search engine for this indexed data. For this we will activate a web-service and use the search engine as an API. This is available directly from the original code for ColBERTv2, in the file server.py. You will have to adjust the file to your index_name and index_path. Then to "activate" it you will have to run this file.

## Information Retrieval

The text has been extracted, split into chunks, and indexed. Now it is time to search for passages! Before we move to the DSP framework, let's test this search (or information retrieval) on its own. To do so, we need to activate ColBERTv2 as an API (as described above), so that the API is running and ready to be queried. Once it is active, let's use it!

In [8]:
import requests
colbert_server = 'http://127.0.0.1:8893/api/search' # You might have a different URL!

In [34]:
query="Is it legal to speak Spanish?" # Your question to the document
k=3 # The API returns the top k most relevant entries

payload = {"query": query, "k": k} # The content that we push to the API

result = requests.get(colbert_server, params=payload) # What the API returns
output = result.json()
topk = output['topk'][:k] # Extracting the result
topk_text = [entry['text'] for entry in topk] # Extracting the text only

In [35]:
output

{'query': 'Is it legal to speak Spanish?',
 'topk': [{'pid': 0,
   'prob': 0.7640150527471495,
   'rank': 1,
   'score': 12.913446426391602,
   'text': 'English is not an official language of the Swiss Confederation. This translation is provided for information purposes only and has no legal force. Federal Constitution  of the Swiss Confederation  of 18 April 1999 (Status as of 13 February 2022)  Preamble In the name of Almighty God! The Swiss People and the Cantons, mindful of their responsibility towards creation, resolved to renew their alliance so as to strengthen liberty, democracy, independence and peace in a spirit of solidarity and openness towards the world, determined to live together with mutual consideration and respect for their diversity, conscious of their common achievements and their responsibility towards future ge-nerations, and in the knowledge that only those who use their freedom remain free, and that the strength of a people is measured by the well-being of its w

In [36]:
topk

[{'pid': 0,
  'prob': 0.7640150527471495,
  'rank': 1,
  'score': 12.913446426391602,
  'text': 'English is not an official language of the Swiss Confederation. This translation is provided for information purposes only and has no legal force. Federal Constitution  of the Swiss Confederation  of 18 April 1999 (Status as of 13 February 2022)  Preamble In the name of Almighty God! The Swiss People and the Cantons, mindful of their responsibility towards creation, resolved to renew their alliance so as to strengthen liberty, democracy, independence and peace in a spirit of solidarity and openness towards the world, determined to live together with mutual consideration and respect for their diversity, conscious of their common achievements and their responsibility towards future ge-nerations, and in the knowledge that only those who use their freedom remain free, and that the strength of a people is measured by the well-being of its weakest members, adopt the following Constitution1:'},
 {

In [37]:
topk_text

['English is not an official language of the Swiss Confederation. This translation is provided for information purposes only and has no legal force. Federal Constitution  of the Swiss Confederation  of 18 April 1999 (Status as of 13 February 2022)  Preamble In the name of Almighty God! The Swiss People and the Cantons, mindful of their responsibility towards creation, resolved to renew their alliance so as to strengthen liberty, democracy, independence and peace in a spirit of solidarity and openness towards the world, determined to live together with mutual consideration and respect for their diversity, conscious of their common achievements and their responsibility towards future ge-nerations, and in the knowledge that only those who use their freedom remain free, and that the strength of a people is measured by the well-being of its weakest members, adopt the following Constitution1:',
 '|Title 2 Fundamental Rights, Citizenship and Social Goals|Chapter 1 Fundamental Rights||Art. 18 

The search API returns plenty of information! The output contains the initial query and the topk list. For each entry of topk we get:
- The passage id (pid)
- The rank
- The score and the probability
- The text

From this we can extract the results, which we have put in a list of passages, ranked by relevance.
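
The extraction of the passages can be wrapped in a small helper. The sketch below works on a hand-made dictionary with the same keys as the API output, so it runs without the server; the function name `extract_passages` is hypothetical, and the ids, scores and texts in the sample are made up.

```python
def extract_passages(output: dict, k: int) -> list:
    """Return the texts of the top-k entries, ordered by rank."""
    ranked = sorted(output["topk"], key=lambda entry: entry["rank"])
    return [entry["text"] for entry in ranked[:k]]


# Hand-made response with the same structure as the API output (values are made up)
sample_output = {
    "query": "Is it legal to speak Spanish?",
    "topk": [
        {"pid": 0, "prob": 0.7, "rank": 1, "score": 12.9, "text": "passage A"},
        {"pid": 7, "prob": 0.2, "rank": 2, "score": 11.5, "text": "passage B"},
        {"pid": 3, "prob": 0.1, "rank": 3, "score": 10.8, "text": "passage C"},
    ],
}
print(extract_passages(sample_output, k=2))  # ['passage A', 'passage B']
```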

Let's analyze the output. We asked whether it is legal to speak Spanish. The search returns language-related passages, which is in my view fantastic! There is no mention of Spanish in the Swiss Constitution, so any keyword-based search would already fail at this stage. Since ColBERTv2 is a semantic search engine, it is able to link language-related words to our query. Let's look closely at these passages. The first one is a generic warning that the text is an English translation of the official constitution and has no legal force. This passage will not answer our question! The second passage mentions the "freedom to use any language". This is the relevant passage that will allow us to answer our question!

Looking ahead: the DSP framework will ask the LLM to answer our questions, based on the passages that we will inject into the prompts. If we inject these passages, we have a good chance that the LLM will be able to answer our question. We will see it later.

If you read this article without trying it yourself, you cannot feel the speed of this algorithm. Let's measure it.

In [51]:
import time
query="Is it legal to speak Spanish?" # Your question to the document
k=3 # The API returns the top k most relevant entries

payload = {"query": query, "k": k} # The content that we push to the API

n_trials = 1000 # Number of times the query is repeated
start_time = time.perf_counter()
for trial in range(n_trials):
    result = requests.get(colbert_server, params=payload) # What the API returns
    output = result.json()
    topk = output['topk'][:k] # Extracting the result
    topk_text = [entry['text'] for entry in topk] # Extracting the text only
end_time = time.perf_counter()
duration = end_time - start_time # Total duration in seconds
print(f"Duration per query: {duration/n_trials*1000:.3f} milliseconds")

Duration per query: 2.767 milliseconds


We can say that this is rather fast! How is it possible to get a semantic search that is so fast? This has to do with the approach. The method is based on BERT. However, at inference time (at search time), only the query has to be encoded by BERT. The passages were encoded by BERT at indexing time (which happens only once), not at inference time.

How is it then possible to still have a BERT-based search algorithm? This is because of late interactions. The embedding of every single token that comes out of BERT is used, for both the query and the passages. These embeddings are "compared" with a very simple algorithm (just additions and multiplications, plus taking a maximum): each query token is matched with its most similar passage token, and these best similarities are summed to give the score of the passage. These "comparisons" are called "late interactions", because they occur late in the encoding process (after the tokens have gone through all the transformer layers of BERT). ColBERTv2 belongs to the family of late interaction models.

How can the comparison be done between the query and all the (potentially millions of) passages? The way the passages are stored makes this faster: instead of going through every passage, the algorithm can afford to go through selected passages only because, remember, the token embeddings are organized in clusters.
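
The late interaction score itself fits in a few lines. The sketch below is a plain Python illustration with made-up two-dimensional token embeddings; the function names are hypothetical, and the real engine works on BERT embeddings and on compressed, clustered data.

```python
def dot(u, v):
    """Dot product of two vectors of the same length."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vectors, passage_vectors):
    """Late interaction score: for each query token, take the best-matching
    passage token (maximum dot product), then sum over the query tokens."""
    return sum(max(dot(q, p) for p in passage_vectors) for q in query_vectors)


# Toy token embeddings (made up for illustration)
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
passage_b = [[1.0, 0.0], [0.5, 0.0]]              # matches only the first query token

print(maxsim_score(query, passage_a))  # 2.0
print(maxsim_score(query, passage_b))  # 1.0
```

Passage A gets the higher score because each query token finds a close match among its tokens, which is the token-level granularity discussed in this section.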

__This algorithm provides us with the best of both worlds: state-of-the-art semantic search at top speed!__

There are alternative search algorithms. A very popular one is "passage embedding" (for example the Ada embeddings from OpenAI), where each passage is represented by a single vector. These algorithms are very well integrated with popular frameworks like LangChain and with vector databases like Pinecone.

I have used both search algorithms, and my (subjective) impression is that late interaction models work much better than passage embeddings. The interpretation I give to this observation is that late interaction models look at each single word (more precisely: token), while passage embedding looks at a passage as a whole. The much deeper granularity of late interaction models can act very positively on the quality of the search. Unfortunately I have no evidence to support my impression, so I will keep it as a "subjective" impression! But I would love to hear quantitative comparisons, if you know of any.

## The DSP (Demonstrate-Search-Predict) framework

We now have a search engine. As a next step, we want to use it to feed an LLM, so that the LLM can answer our questions. What do we need for this? First, we need to ask a question (a "query"). Then we need a search engine that returns the most relevant passages of a text. We have it, from above. Finally, we need to integrate everything into a framework: Stanford's DSP framework (https://arxiv.org/abs/2212.14024). How does it work? The idea is to have a framework that feeds an LLM so that we get the desired answer to our questions.

The DSP framework will need to have:
- A search engine (information retrieval)
- An LLM

We will provide dsp with these two objects.

The DSP framework needs to be fed with examples and a question, which are used to build a prompt. The examples, which are question-answer pairs, become demonstrations in the prompt. The question is used to get passages from the search engine. These passages are injected into the prompt as context.
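
To give an idea of what such a prompt looks like, here is a minimal sketch that assembles one by hand. It is not the implementation of the dsp library: the function name `build_prompt` is hypothetical, the layout is only loosely modeled on the prompts printed later in this article, and the two passages are placeholders.

```python
def build_prompt(instructions, demos, passages, question):
    """Assemble a plain-text prompt from instructions, demos, context and a question."""
    parts = [instructions, "---"]
    # Demonstrations: question-answer pairs shown to the LLM as examples
    for demo_question, demo_answer in demos:
        parts.append(f"Question: {demo_question}\nAnswer: {demo_answer}")
    if demos:
        parts.append("---")
    # Context: the retrieved passages, numbered so that they can be told apart
    context = "\n".join(f"[{i}] «{p}»" for i, p in enumerate(passages, start=1))
    parts.append(f"Context:\n{context}")
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n\n".join(parts)


prompt = build_prompt(
    instructions="Answer questions in one paragraph.",
    demos=[("Do women have equal rights?", "Yes, women have equal rights (Article 8).")],
    passages=["first retrieved passage", "second retrieved passage"],
    question="Is it legal to speak Spanish?",
)
print(prompt)
```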

Let's have a closer look at how this works, continuing with our example. We will first initialize the dsp engine. Then we will build a template. This is done with a "Template" object, which is fed with "Type" objects. "Types" are for example questions, answers, or contexts. More detailed examples are provided in the DSP GitHub repo (https://github.com/stanfordnlp/dsp).

In [52]:
import os
import dsp # If not already downloaded: "pip install dsp-ml"

colbert_server = 'http://127.0.0.1:8893/api/search' # You might have a different URL!

# Define Language Model (lm) and Retrieval model (rm)
lm = dsp.GPT3(model='text-davinci-002', api_key='YOUR_API_KEY') # Replace with your own OpenAI API key
rm = dsp.ColBERTv2(url=colbert_server)

dsp.settings.configure(lm=lm, rm=rm)

Let's build the template with various building blocks:

In [82]:
# Provide instructions, questions, answers
Instructions = "Answer questions in one paragraph."

# Provide questions and answers
Question = dsp.Type(prefix="Question:", desc="${the question to be answered}")
Answer = dsp.Type(prefix="Answer:", desc="${an answer, with explanations}", format=dsp.format_answers)

# Provide the context
Context = dsp.Type(
    prefix="Context:\n",
    desc="${sources that may contain relevant content}",
    format=dsp.passages2text
)

# Provide a rationale
Rationale = dsp.Type(
    prefix="Rationale: Let's think step by step. Justify your answer.",
    desc="${a step-by-step deduction that identifies the correct response, which will be provided below}"
)

# Build the template with all these ingredients
qa_template = dsp.Template(instructions=Instructions, context=Context(), question=Question(), rationale=Rationale(), answer=Answer())

Let's build a function that will:
- Take a question as input
- Retrieve the relevant passages from the document
- Ask the LLM to answer the question, based on the retrieved passages

In [121]:
def QA_predict(example: dsp.Example):
    example, completions = dsp.generate(qa_template)(example, stage='qa')  
    return example.copy(answer=completions.answer)

def retrieve_then_read_QA(question: str, qa_pairs: list = None) -> str:
    # None as default avoids sharing one mutable list between calls
    demos = dsp.sample(qa_pairs or [], k=8)
    passages = dsp.retrieve(question, k=5)
    example = dsp.Example(question=question, context=passages, demos=demos)
    return QA_predict(example).answer

In [122]:
question = "Is it legal to speak Spanish?"
retrieve_then_read_QA(question)

'Yes, it is legal to speak Spanish. Article 18 of the Swiss Constitution guarantees the freedom to use any language.'

It is pretty cool, no? We can also "look inside" and see what happened in the back-end. This is done with the inspect_history method of the LLM object, which prints the last prompts sent to the LLM:

In [86]:
lm.inspect_history(n=1)





Answer questions in one paragraph.

---

Follow the following format.

Context:
${sources that may contain relevant content}

Question: ${the question to be answered}

Rationale: Let's think step by step. Justify your answer. ${a step-by-step deduction that identifies the correct response, which will be provided below}

Answer: ${an answer, with explanations}

---

Context:
[1] «English is not an official language of the Swiss Confederation. This translation is provided for information purposes only and has no legal force. Federal Constitution  of the Swiss Confederation  of 18 April 1999 (Status as of 13 February 2022)  Preamble In the name of Almighty God! The Swiss People and the Cantons, mindful of their responsibility towards creation, resolved to renew their alliance so as to strengthen liberty, democracy, independence and peace in a spirit of solidarity and openness towards the world, determined to live together with mutual consideration and respect for their diversity, cons

Let's summarize what we have done: we asked a question, retrieved passages relevant to our question, injected these into a prompt, and let the LLM answer the question. So far we have used the Search (S) and Predict (P) parts of the DSP framework. Let's look at how to use the Demonstrate (D) part. In our case it works well without examples, but let's still add them. This should allow us to better format the answer.

In [113]:
qa_pairs = [("Who is responsible to build highways?", ['The Confederation is responsible to build highways (Article 83).']),
         ("Do women have equal rights?", ["Yes, women have equal rights (Article 8)."]),
         ("What are the national languages?", ["The National Languages are German, French, Italian, and Romansh (Article 4)."]),
         ("Are citizen of any commune also Swiss citizen?", ["Yes, Swiss citizenship is granted to any person who is a citizen of a commune and of the Canton to which that commune belongs (Article 37)."]),
         ("What is the confederation's role regarding research?", ["The confederation's role in research is to promote scientific research and innovation and to establish, take over, or run research institutes (Article 64)."]),
         ("Who is responsible for the protection of the cultural heritage?", ["The Cantons are responsible for the protection of the cultural heritage (Article 78)."]),
         ("Who is responsible for the public transport?", ["The Confederation and the Cantons are responsible for the public transport (Article 81)."]),
         ("Who is responsible for the legislation around nuclear energy?", ["The Confederation is responsible for legislation in the field of nuclear energy (Article 90)."])]

qa_pairs = [dsp.Example(question=question, answer=answer) for question, answer in qa_pairs]

In [117]:
question = "Who should provide assistance to elderly or disabled people?"
retrieve_then_read_QA(question, qa_pairs)

'Both the Confederation and the Cantons have a responsibility to provide assistance to elderly or disabled people.'

In [118]:
lm.inspect_history(n=1)





Answer questions in one paragraph.

---

Question: What is the confederation's role regarding research?
Answer: The confederation's role in research is to promote scientific research and innovation and to establish, take over, or run research institutes (Article 64).

Question: Do women have equal rights?
Answer: Yes, women have equal rights (Article 8).

Question: Who is responsible for the protection of the cultural heritage?
Answer: The Cantons are responsible for the protection of the cultural heritage (Article 78).

Question: What are the national languages?
Answer: The National Languages are German, French, Italian, and Romansh (Article 4).

Question: Who is responsible to build highways?
Answer: The Confederation is responsible to build highways (Article 83).

Question: Are citizen of any commune also Swiss citizen?
Answer: Yes, Swiss citizenship is granted to any person who is a citizen of a commune and of the Canton to which that commune belongs (Article 37).

Question: Wh

Unfortunately, adding these examples did not change the output format much! But in other cases, adding examples might help a lot.

## Conclusion

In this article we have gone through the DSP framework and illustrated it with a concrete example, developed in detail. The goal was to show how to use this framework on a document, that is, to apply it to a "document understanding" use case. And it works very well!

We have used only a small part of the possibilities of the DSP framework. In the DSP GitHub repo (https://github.com/stanfordnlp/dsp), you can find much more sophisticated uses of it. For example, you can build algorithms that ask several questions and iteratively retrieve information from passages. But in our simple case this is not needed, so I have skipped it.

Note that there are still limitations to this framework. For example, the retriever was trained on English texts only, and on datasets from specific domains. A lot of value could be added if the retriever were trained on other languages and on other domains, for example domains that are specific to your enterprise. We see a lot of potential for this!

Finally, the DSP framework, like the LangChain framework, opens up many new possibilities: you can now "program in English", and not only in Python!

Thanks for reading this document through! I hope you enjoyed it and already have ideas of where to apply this "document understanding" use case, or more generally the DSP framework. I have nothing to sell, I am just happy to share my knowledge :-)

Before concluding, I would like to thank Stanford University and Omar Khattab for making these great new developments, and for making them open source! I would also like to thank Omar for his availability and responsiveness!