# Sample notebook showing how to run the conceptual question generator

The conceptual question generator generates questions from a description of the content instead of the content itself.  This potentially produces more challenging and realistic RAG test data than generating questions from actual snippets of text, because the way the questions are asked is likely to be less well aligned with the way the content answers them.  Another benefit is that it produces some questions that are not answered by the content, which makes it possible to assess the ability of a RAG system to decline to answer such questions.

Note: If you run this notebook in a Docling-SDG dev environment (as described in [CONTRIBUTING.md](https://github.com/docling-project/docling-sdg/blob/main/CONTRIBUTING.md)), you may need to run the following command to enable running notebooks in that environment:

`uv pip install ipykernel -U --force-reinstall`

In [1]:
import os, json, random

## Sample data

For our sample data, we are using the IBM 2024 Annual Report.  This is a reasonably complex document that is new enough that many popular models will not have been trained on it.  However, it is a type of document that most models will have seen often.  So it is a useful example of a type of document where a model could be expected to generate a lot of relevant questions without knowing precisely which ones will have answers in the document.

In [2]:
CONTENT_URLS=["https://www.ibm.com/downloads/documents/us-en/1227c12d3a38b173"]
CONTENT_LOCATION="./docs/"
CONTENT_DESCRIPTION="IBM 2024 Annual Report"

## User Profiles

Here we provide descriptions of different kinds of users that we want to simulate when generating questions.  For each profile, the generator produces N questions in total, where N = number_of_topics * number_of_iterations_per_topic * number of question types.  The number of question types defaults to 3 (the three built-in types in Docling SDG, which are fact_single, summary, and reasoning).

Put higher numbers of topics or iterations for profiles that are more important for your application, or the same numbers for each if all of your profiles are equally important.

In [3]:

from docling_sdg.qa.base import UserProfile

BIG_USER_PROFILES=[
    UserProfile(description="Professional stock market analyst", number_of_topics=50, number_of_iterations_per_topic=12),
    UserProfile(description="Manager at a company that is considering buying an IBM product", number_of_topics=10, number_of_iterations_per_topic=10),
    UserProfile(description="High school student taking a business course", number_of_topics=5, number_of_iterations_per_topic=5),
    UserProfile(description="Fifth grader who wants to learn about IBM", number_of_topics=5, number_of_iterations_per_topic=3)
]

# Smaller version for testing.
SMALL_USER_PROFILES=[
    UserProfile(description="Professional stock market analyst", number_of_topics=8, number_of_iterations_per_topic=2),
    UserProfile(description="Manager at a company that is considering buying an IBM product", number_of_topics=5, number_of_iterations_per_topic=1),
    UserProfile(description="High school student taking a business course", number_of_topics=5, number_of_iterations_per_topic=1)
]

# Switch between the sets of user profiles by updating this line.  If you want to run the notebook on the full dataset,
# you can set SAMPLE_USER_PROFILES to BIG_USER_PROFILES.
SAMPLE_USER_PROFILES = SMALL_USER_PROFILES

In [4]:
total_number_of_iterations = sum([p.number_of_topics * p.number_of_iterations_per_topic * 3 for p in SAMPLE_USER_PROFILES])
total_number_of_iterations

78
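
To see how each profile contributes to that total, the same arithmetic can be broken down per profile.  This is a minimal sketch: it uses plain tuples rather than `UserProfile` objects so that it runs on its own, and `NUM_QUESTION_TYPES` is an illustrative name for the default of three built-in question types.

```python
# Per-profile breakdown of the expected number of generated questions.
# Plain tuples stand in for the UserProfile objects so the sketch is self-contained.
NUM_QUESTION_TYPES = 3  # fact_single, summary, reasoning

# (description, number_of_topics, number_of_iterations_per_topic)
small_profiles = [
    ("Professional stock market analyst", 8, 2),
    ("Manager at a company that is considering buying an IBM product", 5, 1),
    ("High school student taking a business course", 5, 1),
]

questions_per_profile = {
    description: topics * iterations * NUM_QUESTION_TYPES
    for description, topics, iterations in small_profiles
}

for description, count in questions_per_profile.items():
    print(f"{count:3d}  {description}")
print(f"{sum(questions_per_profile.values()):3d}  total")
```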

## Additional Instructions

Next provide a list of additional instruction strings to further adjust the question generation.  

We cycle through the additional instructions in order for each question we generate.  It can be useful to have at least
one empty string in the list so that some questions are generated without any specific bias, and then
add more specific instructions to adjust the balance of different question traits.  This default value tries
to reduce the overall length of the questions and introduce some more informal language.

The best way to find a good value for the additional instructions strings is to look at the questions that occur in a log of a deployed
system similar to the one you're building and compare them to the questions that are generated by the system.  If you see kinds of behaviors
that are common in the deployed system, you can add instructions to the list to try to encourage the generated data to match those behaviors.
You can repeat entries in the list multiple times to boost the percentage of the data that was generated with that instruction.
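
As a rough illustration of the cycling and repetition described above, the sketch below assigns instructions to questions in round-robin order and reports what share of the questions each instruction receives.  It does not call Docling-SDG: `assign_instructions` is a hypothetical helper, and the assumption that the generator cycles exactly like `itertools.cycle` is ours.

```python
from collections import Counter
from itertools import cycle, islice


def assign_instructions(instructions, number_of_questions):
    """Hypothetical helper: give each question an instruction in round-robin order."""
    return list(islice(cycle(instructions), number_of_questions))


# Repeating an entry boosts the share of questions generated with it.
instructions = [
    "",
    "Keep the question concise and specific.",
    "Keep the question concise and specific.",
    "Make at least one spelling mistake.",
]

assigned = assign_instructions(instructions, 78)
counts = Counter(assigned)
for instruction, count in counts.items():
    print(f"{count / len(assigned):.0%}  {instruction!r}")
```

With four list entries, two of which are identical, roughly half of the questions are generated with the repeated instruction.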

Note that depending on the model you use, it might not always follow that instruction exactly.  For example, in our testing
we found that the instruction "Make at least one spelling mistake." for gpt-4o causes the model to make at least one spelling
mistake occasionally, but not consistently.  If it is extremely important to get the model to follow these additional instructions
reliably, it might make sense to adjust the prompt template to emphasize them more (e.g., adding "Remember to follow the special
instruction: {additional_instructions_str}" to the end of the prompt).  However, emphasizing the special instruction could draw
attention away from the other instructions (the question type definition, user profile, etc.), so you only want to do it
if the additional instructions are very important.  If they are just intended to be a gentle nudge in some direction to try
to get a little more of some behavior or a little less of another, then it might be fine to just accept that they won't be
followed all the time.
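
One way to picture that prompt adjustment is sketched below.  The template text is hypothetical and is not the actual Docling-SDG prompt; only the `{additional_instructions_str}` placeholder and the reminder sentence come from the discussion above.  The reminder is appended only when the instruction is non-empty, so the empty-string entry stays unemphasized.

```python
# Hypothetical prompt template, for illustration only (not the Docling-SDG template).
BASE_TEMPLATE = (
    "You are writing a question about: {content_description}\n"
    "Special instruction: {additional_instructions_str}\n"
)
REMINDER = "Remember to follow the special instruction: {additional_instructions_str}\n"


def build_prompt(content_description, additional_instructions_str):
    """Fill in the template, adding the reminder only for non-empty instructions."""
    template = BASE_TEMPLATE
    if additional_instructions_str:
        template += REMINDER
    return template.format(
        content_description=content_description,
        additional_instructions_str=additional_instructions_str,
    )


print(build_prompt("IBM 2024 Annual Report", "Keep the question to 10 words or less."))
```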

In [5]:
SAMPLE_ADDITIONAL_INSTRUCTIONS_STRINGS = [
    "",
    "Keep the question concise and specific.",
    "Make at least one spelling mistake.",
    "Whenever possible, abbreviate common words as if you were writing a text message.",
    "Keep the question to 10 words or less.",
    "Keep the question to 15 words or less."
]

## Generate questions

Here we use a model to generate questions.  The model we use in this demo notebook both for generating the questions and generating the reference answers is gpt-4o.  It is important to select a very powerful model whenever possible for generating questions and reference answers because these are the questions and answers you will use to test your deployed system.  Also, be sure to check the terms of use for whatever model you use and make sure whatever you plan to do with the data is consistent with those terms of use.

In [6]:
from docling_sdg.qa.conceptual_generate import ConceptualGenerator
from docling_sdg.qa.base import ConceptualGenerateOptions, LlmProvider, SecretStr

options = ConceptualGenerateOptions(
    model_id="gpt-4o",
    provider=LlmProvider.OPENAI,
    api_key=SecretStr(os.getenv("OPENAI_API_KEY", "")),
    user_profiles=SAMPLE_USER_PROFILES,
    additional_instructions=SAMPLE_ADDITIONAL_INSTRUCTIONS_STRINGS,
)

generator = ConceptualGenerator(options)

generator.generate_questions_from_content_description(CONTENT_DESCRIPTION)

Generating questions:  99%|█████████▊| 77/78 [00:49<00:00,  1.56it/s]


GenerateResult(status=<Status.SUCCESS: 'success'>, time_taken=56.77890706062317, output=PosixPath('docling_sdg_generated_questions.jsonl'), num_qac=77)

## Get chunks from the content to generate the answers from

Here we use the Docling-SDG PassageSampler, but we set the maximum number of passages to a very large value because we want all of the passages to be available for generating answers.

In [7]:
from docling_sdg.qa.sample import PassageSampler
from docling_sdg.qa.base import SampleOptions

# We use a very large number for the max passages because we want to get all of them
# to populate the search index for the reference answer generation.
# (Assisted by Cursor using Claude 4 Sonnet)
VERY_LARGE_INT = 10**18
sample_options = SampleOptions(
    max_passages = VERY_LARGE_INT,
)

sampler = PassageSampler(sample_options)

sampler_result = sampler.sample(CONTENT_URLS)
sampler_result

Token indices sequence length is longer than the specified maximum sequence length for this model (837 > 512). Running this sequence through the model will result in indexing errors


SampleResult(status=<Status.SUCCESS: 'success'>, time_taken=224.61154413223267, output=PosixPath('docling_sdg_sample.jsonl'), num_passages=636)

## Generate answers using the content

Now that we have the chunks stored, we can use the conceptual generator that we initialized earlier to generate the answers using these chunks.  This next step loads the chunks into the vector index and then generates answers using a reference answer generator.  The reference answer generator uses an LLM-based reranker to score each passage and judge which ones are highly relevant.  The passages that the LLM likes are recorded as the reference context for the generated answer.

That approach is too slow for most production applications, but it is ideal for generating reference answers that you can use to assess the quality of your (hopefully faster) production-ready RAG solution.

In [8]:
result = generator.generate_answers_using_retrieval(sampler_result.output)
result

Adding chunks to index: 100%|██████████| 636/636 [00:38<00:00, 16.68it/s]
Generating answers: 100%|██████████| 77/77 [23:53<00:00, 18.62s/it]


GenerateResult(status=<Status.SUCCESS: 'success'>, time_taken=1472.802365064621, output=PosixPath('docling_sdg_generated_qac.jsonl'), num_qac=77)

At this point, you have generated a complete set of questions and answers.  Below we print out the first entry in a readable format so you can see what the data looks like.

In [9]:
def print_first_qa(result):
    # Generated by Cursor using Claude 4 Sonnet

    # Load the first line from the JSONL file and pretty print it
    with open(result.output, "r") as f:
        first_line = f.readline().strip()
        
    # Parse the JSON and pretty print it
    first_qa = json.loads(first_line)
    print(json.dumps(first_qa, indent=2))

print_first_qa(result)

{
  "context": "# Reconciliations of IBM as Reported\n\n($ in millions)\nRevenue, 2024 = . Revenue, 2023 (1) = . Revenue, 2022 (1) = . Total reportable segments, 2024 = $ 62,510. Total reportable segments, 2023 (1) = $ 61,229. Total reportable segments, 2022 (1) = $ 59,621. Other-divested businesses, 2024 = 35. Other-divested businesses, 2023 (1) = 397. Other-divested businesses, 2022 (1) = 774. Other revenue, 2024 = 207. Other revenue, 2023 (1) = 235. Other revenue, 2022 (1) = 135. Total revenue, 2024 = $ 62,753. Total revenue, 2023 (1) = $ 61,860. Total revenue, 2022 (1) = $ 60,530\n($ in millions)\n-------\n# Financial Performance Summary\n\nIn 2024, we reported $62.8 billion in revenue, income from continuing operations of $6.0 billion, which includes the impact of the pension  settlement  charges  of  $3.1  billion  ($2.4  billion  net  of  tax),  and  operating  (non-GAAP)  earnings  of  $9.7  billion,  which excludes the impact of the pension settlement charges. Refer to 'Organi

## Critique the questions and answers

Next we use the standard Docling-SDG critic capability to critique the questions and answers.

In [10]:
from docling_sdg.qa.base import CritiqueOptions, LlmProvider
from docling_sdg.qa.critique import Judge

options = CritiqueOptions(
    model_id="gpt-4o",
    provider=LlmProvider.OPENAI,
    api_key=SecretStr(os.getenv("OPENAI_API_KEY", "")),
)

judge = Judge(critique_options=options)
judge_result = judge.critique(result.output)
judge_result

77it [25:48, 20.10s/it]


CritiqueResult(status=<Status.SUCCESS: 'success'>, time_taken=1548.0272629261017, output=PosixPath('docling_sdg_critiqued_qac.jsonl'), num_qac=77)

As in the previous section, we print out the first one here.  The format is the same as in the previous section except that a `critiques` section is added with evaluation text and numerical ratings. 

In [11]:
print_first_qa(judge_result)

{
  "context": "# Reconciliations of IBM as Reported\n\n($ in millions)\nRevenue, 2024 = . Revenue, 2023 (1) = . Revenue, 2022 (1) = . Total reportable segments, 2024 = $ 62,510. Total reportable segments, 2023 (1) = $ 61,229. Total reportable segments, 2022 (1) = $ 59,621. Other-divested businesses, 2024 = 35. Other-divested businesses, 2023 (1) = 397. Other-divested businesses, 2022 (1) = 774. Other revenue, 2024 = 207. Other revenue, 2023 (1) = 235. Other revenue, 2022 (1) = 135. Total revenue, 2024 = $ 62,753. Total revenue, 2023 (1) = $ 61,860. Total revenue, 2022 (1) = $ 60,530\n($ in millions)\n-------\n# Financial Performance Summary\n\nIn 2024, we reported $62.8 billion in revenue, income from continuing operations of $6.0 billion, which includes the impact of the pension  settlement  charges  of  $3.1  billion  ($2.4  billion  net  of  tax),  and  operating  (non-GAAP)  earnings  of  $9.7  billion,  which excludes the impact of the pension settlement charges. Refer to 'Organi

## Examine the results of the critique

In [12]:
# Assisted by watsonx Code Assistant 
def load_jsonl(file_path):
    data = []
    with open(file_path, "r") as f:
        for line in f:
            data.append(json.loads(line))
    return data

critiqued_qac = load_jsonl(judge_result.output)
len(critiqued_qac)

77

We will focus on the critiqued question/answer pairs where the system produced a reference answer.

In [13]:
critiqued_qac_with_answers = [qa for qa in critiqued_qac if qa["answer"]]
len(critiqued_qac_with_answers)

70
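
The entries without a reference answer are still useful: as noted in the introduction, they can be used to check whether a RAG system declines to answer.  Below is a minimal sketch of keeping both groups.  It uses toy records; the `question` key is an assumption, while `answer` is the key used in the filter above.

```python
# Toy records: "question" is an assumed key, "answer" is the key used in the filter above.
toy_qac = [
    {"question": "What was total revenue in 2024?", "answer": "$62,753 million."},
    {"question": "Who is the head chef in the cafeteria?", "answer": ""},
    {"question": "How much was paid in dividends?", "answer": "$6.1 billion."},
    {"question": "What is the CEO's favorite color?", "answer": None},
]

answered, unanswered = [], []
for qa in toy_qac:
    # Empty strings and None are both falsy, so they land in `unanswered`.
    (answered if qa["answer"] else unanswered).append(qa)

print(f"{len(answered)} answered, {len(unanswered)} unanswered")
```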

Here we filter the critiqued QAC with reference answers to only include those that are "good enough" that we would want to use them for evaluating our RAG capabilities. We do this by setting a maximum number of critique metrics where the critic model failed to produce ANY score (MAX_NONE_COUNT), a minimum value for the lowest of all the scores across all the metrics (MIN_MIN_SCORE), and a minimum value for the average score across all the metrics (MIN_MEAN_SCORE).

In the example, we set these to 0, 3, and 4.  We got these numbers by fiddling around with a few examples and these seemed to work OK.  It would probably be sensible to spend a lot more time investigating different values here and how they impact the overall quality and quantity of results.
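
One simple way to start that investigation is to sweep the thresholds and count how many entries survive each combination.  The sketch below does this on made-up ratings: the `ratings` lists are toy data, and `passes` is a simplified stand-in for the filter that assumes no ratings are missing.

```python
from itertools import product
from statistics import mean

# Toy data: each inner list holds one entry's ratings across the non-ignored metrics.
ratings = [
    [5, 5, 4, 5],
    [4, 3, 5, 4],
    [5, 2, 5, 5],
    [3, 3, 4, 3],
]


def passes(entry_ratings, min_min_score, min_mean_score):
    """Simplified filter: all ratings are present, so only the two score thresholds apply."""
    return min(entry_ratings) >= min_min_score and mean(entry_ratings) >= min_mean_score


# Count the surviving entries for every combination of the two thresholds.
survivors = {
    (min_min, min_mean): sum(passes(r, min_min, min_mean) for r in ratings)
    for min_min, min_mean in product([2, 3, 4], [3.5, 4.0, 4.5])
}

for (min_min, min_mean), count in survivors.items():
    print(f"MIN_MIN_SCORE={min_min}  MIN_MEAN_SCORE={min_mean}: {count} of {len(ratings)} kept")
```

Stricter thresholds trade quantity for quality, so a table like this is a starting point for inspecting what is gained or lost at each setting.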

We ignore the stand_alone metric for two reasons:
1. The way we generate questions without any context does not cause us to generate a lot of questions that don't stand alone well.
2. We see some low scores for this metric on questions that are fully understandable without context but require specific information to answer.  Those scores are erroneous because the stand_alone prompt is asking whether the question can be understood without context, not whether it can be answered without context.

We use the following thresholds:

In [14]:
MAX_NONE_COUNT = 0
MIN_MIN_SCORE = 3
MIN_MEAN_SCORE = 4
METRICS_TO_IGNORE = ["stand_alone"]

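def is_good_enough(qa):
    """Return True if the critique ratings for this entry meet all three thresholds."""
    none_count = 0
    min_score = float('inf')
    total_score = 0
    count_metrics = 0
    for key, value in qa["critiques"].items():
        if key not in METRICS_TO_IGNORE:
            count_metrics += 1
            score = value["rating"]
            if score is None:
                # The critic model failed to produce a rating for this metric.
                none_count += 1
            else:
                total_score += score
                if score < min_score:
                    min_score = score

    # Without any rated metrics there is nothing to judge, so reject the entry
    # (this also avoids dividing by zero below).
    if count_metrics == 0:
        return False

    # Metrics without a rating contribute zero to the mean.
    mean_score = total_score / count_metrics

    return none_count <= MAX_NONE_COUNT and min_score >= MIN_MIN_SCORE and mean_score >= MIN_MEAN_SCORE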

critiqued_qac_with_answers_that_are_good_enough = list(filter(is_good_enough, critiqued_qac_with_answers))
critiqued_qac_with_answers_that_are_not_good_enough = list(filter(lambda qa: not is_good_enough(qa), critiqued_qac_with_answers))

In [15]:
print(f"{len(critiqued_qac_with_answers_that_are_good_enough)} of {len(critiqued_qac_with_answers)} were judged good enough")

40 of 70 were judged good enough


In [16]:

critiqued_qac_with_answers_that_are_not_good_enough[random.randint(0, len(critiqued_qac_with_answers_that_are_not_good_enough) - 1)]
 

{'context': "# Contingencies\n\nAs  a  company  with  a  substantial  employee  population  and  with  clients  in  more  than  175  countries,  IBM  is  involved,  either  as plaintiff or defendant, in a variety of ongoing claims, demands, suits, investigations, tax matters and proceedings that arise from time to time in the ordinary course of its business. The company is a leader in the information technology industry and, as such, has been and will continue to be subject to claims challenging its IP rights and associated products and offerings, including claims of copyright and patent infringement and violations of trade secrets and other IP rights. In addition, the company enforces its own IP against infringement, through license negotiations, lawsuits or otherwise. Further, given the rapidly evolving external landscape of cybersecurity, AI, privacy and data protection laws, regulations and threat actors, the company and its clients have been and will continue to be subject to acti

The example above shows a random case where the question / answer / context tuple was judged as being not good enough.  If you look in the critiques block of the response above you will see that one or more of the following is true:

- More than MAX_NONE_COUNT of the non-ignored metrics have None as the rating, meaning that the critic model failed to return a valid response.
- At least one of the non-ignored metrics has a rating below MIN_MIN_SCORE.
- The average of the non-ignored metrics is below MIN_MEAN_SCORE.

Any or all of these can cause us to reject a question / answer / context tuple.
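
To see which of these criteria triggered for a given tuple, a small helper can report the reasons explicitly.  This is a sketch that mirrors the logic of `is_good_enough`; `rejection_reasons` and the toy `example` record (including its metric names) are illustrative, and the thresholds are redefined so the block runs on its own.

```python
MAX_NONE_COUNT = 0
MIN_MIN_SCORE = 3
MIN_MEAN_SCORE = 4
METRICS_TO_IGNORE = ["stand_alone"]


def rejection_reasons(qa):
    """Return a list of human-readable reasons; an empty list means the tuple is kept."""
    ratings = [
        value["rating"]
        for key, value in qa["critiques"].items()
        if key not in METRICS_TO_IGNORE
    ]
    scores = [r for r in ratings if r is not None]
    none_count = len(ratings) - len(scores)

    reasons = []
    if none_count > MAX_NONE_COUNT:
        reasons.append(f"{none_count} metric(s) have no rating")
    if scores and min(scores) < MIN_MIN_SCORE:
        reasons.append(f"lowest rating {min(scores)} is below {MIN_MIN_SCORE}")
    # As in is_good_enough, missing ratings count as zero in the mean.
    if ratings and sum(scores) / len(ratings) < MIN_MEAN_SCORE:
        reasons.append(f"mean rating {sum(scores) / len(ratings):.2f} is below {MIN_MEAN_SCORE}")
    return reasons


# Toy record with made-up metric names and ratings.
example = {
    "critiques": {
        "metric_a": {"rating": 5},
        "metric_b": {"rating": 2},
        "stand_alone": {"rating": 1},
    }
}
print(rejection_reasons(example))
```

In this toy record the ignored `stand_alone` rating plays no part; the tuple is rejected because one rating is below the minimum and the mean falls short.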

In [17]:
critiqued_qac_with_answers_that_are_good_enough[random.randint(0, len(critiqued_qac_with_answers_that_are_good_enough) - 1)]

{'context': "# Financial Performance Summary\n\nIn 2024, we reported $62.8 billion in revenue, income from continuing operations of $6.0 billion, which includes the impact of the pension  settlement  charges  of  $3.1  billion  ($2.4  billion  net  of  tax),  and  operating  (non-GAAP)  earnings  of  $9.7  billion,  which excludes the impact of the pension settlement charges. Refer to 'Organization of Information,' for additional information. Diluted earnings per share from continuing operations was $6.42 as reported, including an impact of $2.57 from the pension settlement charges, and diluted earnings per share was $10.33 on an operating (non-GAAP) basis. We generated $13.4 billion in cash from operations  and  $12.7  billion  in  free  cash  flow,  and  returned  $6.1  billion  to  shareholders  in  dividends.  We  are  pleased  with  the progress we made in 2024, delivering revenue growth in our re-positioned business and strong cash flow generation. Our 2024 performance demonstrat

In contrast, the example above shows a question / answer / context tuple that *was* judged as good enough.  Such tuples did not meet any of the criteria for rejection, so they are likely to be useful for evaluating a question answering capability such as a RAG system.
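
If you want to reuse the accepted tuples in a later evaluation run, one option is to write them back out as JSONL, the same format the earlier steps produce.  This is a minimal sketch: `save_jsonl` and the output filename are hypothetical, and the demonstration uses a toy record in a temporary directory.

```python
import json
import os
import tempfile


def save_jsonl(records, file_path):
    """Write one JSON object per line, the inverse of the load_jsonl helper."""
    with open(file_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


# Demonstration with a toy record and a temporary directory.
toy_records = [{"answer": "example", "critiques": {}}]
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "good_enough_qac.jsonl")
    save_jsonl(toy_records, out_path)
    with open(out_path, "r", encoding="utf-8") as f:
        reloaded = [json.loads(line) for line in f]

print(reloaded == toy_records)
```

In this notebook, the call would be along the lines of `save_jsonl(critiqued_qac_with_answers_that_are_good_enough, "good_enough_qac.jsonl")`.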