# 🐶 Data Pre-Processing: From source PDF to SDG-ready

This notebook walks through each stage of data pre-processing: a PDF is converted and chunked, QA pairs are generated to create a `qna.yaml` file, and the results are combined into the inputs for synthetic data generation (SDG). Intermediate results are saved using directory-based conventions.

Once an SDG seed dataset is created, a user can run through an SDG notebook and generate samples.

**NOTE**: Starting the notebook using Python 3.11 is recommended. Python 3.12 and later are not yet supported.

1. [Data Gathering](#Data-Gathering)
1. [Document Conversion](#Document-Conversion)
1. [Chunking](#Chunking)
1. [Authoring](#Authoring)
1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)

***

In [6]:
from pathlib import Path

WORKSPACE_NAME = "default"

WORKSPACE_ROOT = Path("workspaces")
WORKSPACE_ROOT.mkdir(exist_ok=True)

WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME
WORKSPACE_DIR.mkdir(exist_ok=True)

SOURCE_DOCUMENT_DIR = WORKSPACE_DIR / "source_documents"
SOURCE_DOCUMENT_DIR.mkdir(parents=True, exist_ok=True)

CONVERSION_OUTPUT_DIR = WORKSPACE_DIR / "conversion"
CONVERSION_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

CHUNKING_OUTPUT_DIR = WORKSPACE_DIR / "chunking"
CHUNKING_OUTPUT_DIR.mkdir(exist_ok=True)

AUTHORING_OUTPUT_DIR = WORKSPACE_DIR / "authoring"
AUTHORING_OUTPUT_DIR.mkdir(exist_ok=True)

SDG_OUTPUT_DIR = WORKSPACE_DIR / "sdg"
SDG_OUTPUT_DIR.mkdir(exist_ok=True)
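
The directory convention used above can also be captured in a small helper. This is a minimal sketch rather than part of the original notebook; `STAGES` and `make_workspace` are hypothetical names, and the stage names simply mirror the directories created in the cell above.

```python
from pathlib import Path

# Stage sub-directories used by this notebook, in pipeline order.
STAGES = ("source_documents", "conversion", "chunking", "authoring", "sdg")


def make_workspace(root: Path, name: str) -> dict[str, Path]:
    """Create root/name/<stage> for every stage and return the paths keyed by stage."""
    workspace = Path(root) / name
    dirs = {}
    for stage in STAGES:
        stage_dir = workspace / stage
        # parents=True also creates root and the workspace directory when missing
        stage_dir.mkdir(parents=True, exist_ok=True)
        dirs[stage] = stage_dir
    return dirs
```

Calling `make_workspace(Path("workspaces"), "default")` would produce the same layout as the cell above, and a second workspace name gives a separate set of intermediate results.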

## Data Gathering

Ensure the necessary PDF files are in `SOURCE_DOCUMENT_DIR`.

In [7]:
files = list(SOURCE_DOCUMENT_DIR.glob("*.pdf"))
print(f"Files to convert: {files}")

Files to convert: [PosixPath('workspaces/default/source_documents/2022-nfl-rulebook-final.pdf'), PosixPath('workspaces/default/source_documents/BofA_InterestChecking_en_ADA.pdf'), PosixPath('workspaces/default/source_documents/2502.01618v3.pdf')]
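
The glob above matches only a lowercase `.pdf` extension and silently returns an empty list when the directory has no PDFs. As a minimal sketch, assuming you would rather fail early, a hypothetical helper `find_pdfs` could match the extension in any case and raise when nothing is found:

```python
from pathlib import Path


def find_pdfs(source_dir: Path) -> list[Path]:
    """Return the PDF files in source_dir, sorted, matching .pdf in any letter case."""
    pdfs = sorted(
        p for p in Path(source_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    if not pdfs:
        # Stop here instead of running the later stages on an empty list
        raise FileNotFoundError(f"No PDF files found in {source_dir}")
    return pdfs
```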


## Document Conversion

This notebook uses [Docling](https://github.com/docling-project/docling) to convert each source document into a Docling Document. A Docling Document is the representation of a document after conversion, and it can be exported as JSON. The JSON output of this stage is the input to the chunking stage below.

In [8]:
!pip install -qq docling

### Configure Docling conversion pipeline

Next we set the configuration options for our conversion pipeline. The PDF conversion options set here are the defaults.

For a complete reference on Docling conversion pipeline configuration, see [PdfPipelineOptions](https://docling-project.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions) and [PdfFormatOption](https://docling-project.github.io/docling/reference/document_converter/#docling.document_converter.InputFormat.XML_JATS).

In [9]:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions() # TODO: show the options that can be set

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

Finally, we convert every document into Docling JSON, as long as it is a valid file type for conversion.

In [10]:
import json

json_files = []

for file in files:
    doc = doc_converter.convert(source=file).document
    doc_dict = doc.export_to_dict()

    json_output_path = CONVERSION_OUTPUT_DIR / f"{file.stem}.json"
    with open(json_output_path, "w") as f:
        json.dump(doc_dict, f)
        print(f"Path of JSON output is: {Path(json_output_path).resolve()}")
        json_files.append(json_output_path)



Path of JSON output is: /Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/conversion/2022-nfl-rulebook-final.json
Path of JSON output is: /Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/conversion/BofA_InterestChecking_en_ADA.json
Path of JSON output is: /Users/amaredia/dev/examples/notebooks/instructlab-knowledge/workspaces/default/conversion/2502.01618v3.json
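
Since the JSON is saved for inspection, a quick summary of its top-level structure can be handy. The following is a minimal sketch using only the standard library; `summarize_json` is a hypothetical helper that makes no assumption about which keys a Docling JSON file contains.

```python
import json
from pathlib import Path


def summarize_json(path: Path) -> dict[str, str]:
    """Map each top-level key of a JSON object to a short description of its value."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    summary = {}
    for key, value in data.items():
        if isinstance(value, (list, dict)):
            # Report container sizes rather than dumping their contents
            summary[key] = f"{type(value).__name__} with {len(value)} items"
        else:
            summary[key] = type(value).__name__
    return summary
```

For example, `summarize_json(json_files[0])` would list the top-level keys of the first converted document.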


## Chunking

The goal of chunking the converted documents is to give the teacher model small, logical pieces of the source document from which to generate data.

In this notebook we are doing chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).

The input to this stage is a Docling JSON file created by the conversion stage, or a directory of Docling JSON files.

### Initialize the Chunker

Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.
The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.

The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization-aware.

The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks. Uncomment the commented out code to configure these.
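
To make the idea of tokenization-aware chunking concrete, here is a toy sketch. It is not Docling's implementation: it counts whitespace-separated words instead of using a real tokenizer, and it merges any adjacent small pieces rather than only peers that share the same headings. All function names are hypothetical.

```python
def count_tokens(text: str) -> int:
    """Toy tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def split_oversized(text: str, max_tokens: int) -> list[str]:
    """Split a piece into consecutive windows of at most max_tokens words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]


def token_aware_chunks(pieces: list[str], max_tokens: int, merge_peers: bool = True) -> list[str]:
    """Split pieces that exceed the budget, then optionally merge small neighbors."""
    bounded = [part for piece in pieces for part in split_oversized(piece, max_tokens)]
    if not merge_peers:
        return bounded
    merged: list[str] = []
    for part in bounded:
        # Merge with the previous chunk only while the combined size fits the budget
        if merged and count_tokens(merged[-1]) + count_tokens(part) <= max_tokens:
            merged[-1] = merged[-1] + " " + part
        else:
            merged.append(part)
    return merged
```

With `max_tokens=3`, the pieces `["a b", "c", "d e f g h"]` become `["a b c", "d e f", "g h"]`: the long piece is split, and the two undersized pieces at the start are merged.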

In [11]:
#from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
#from transformers import AutoTokenizer

from docling.chunking import HybridChunker

#EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
#MAX_TOKENS = 1024
#
# tokenizer = HuggingFaceTokenizer(
#     tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
#     max_tokens=MAX_TOKENS,  # optional, by default derived from `tokenizer` for HF case
# )

chunker = HybridChunker(
    #tokenizer=tokenizer,
    #merge_peers=True,  # whether to merge undersized chunks - defaults to True
)

### Load and chunk the converted docling document

Next, let's load each converted JSON file back into a Docling Document and chunk it.

In [12]:
all_chunks = []
docs = []
for file in json_files:
    # reconvert the docling JSON for chunking
    doc = DocumentConverter().convert(source=file)
    
    chunk_iter = chunker.chunk(dl_doc=doc.document)
    chunk_objs = list(chunk_iter)
    chunks = [chunker.contextualize(chunk=chunk) for chunk in chunk_objs]

    print(f"Extracted {len(chunks)} chunks from {doc.document.name}")
    
    for chunk in chunks:
        c = dict(chunk=chunk, file=doc.document.name)
        all_chunks.append(c)
    
    docs.append(dict(chunk_objs=chunk_objs,file=doc.document.name))


# TODO: support multiple files save all chunks to single file for review

Token indices sequence length is longer than the specified maximum sequence length for this model (583 > 512). Running this sequence through the model will result in indexing errors


Extracted 1798 chunks from 2022-nfl-rulebook-final
Extracted 10 chunks from BofA_InterestChecking_en_ADA
Extracted 52 chunks from 2502.01618v3


### View the Chunks

To view the chunks, run the following cell. As you can see, the document is broken into small pieces, each with metadata about the chunk based on the document's format.

In [17]:
print(all_chunks[0]["chunk"])

2022 OFFICIAL PLAYING RULES OF THE NATIONAL FOOTBALL LEAGUE
Roger Goodell, Commissioner


### Save all chunks to a JSON Lines file

All chunks are saved to a JSON Lines file called `chunks.jsonl` in `CHUNKING_OUTPUT_DIR`. This file is one of the inputs used further below when we create the seed dataset for SDG.

In [18]:
chunks_file_path = CHUNKING_OUTPUT_DIR / "chunks.jsonl"
with open(chunks_file_path, "w", encoding="utf-8") as file:
    for chunk in all_chunks:
        json.dump(chunk, file)
        file.write("\n")
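
To check the saved file, the records can be read back one line at a time. This is a minimal sketch; `count_chunks_per_file` is a hypothetical helper that relies only on the `file` key written by the cell above.

```python
import json
from collections import Counter
from pathlib import Path


def count_chunks_per_file(jsonl_path: Path) -> Counter:
    """Count chunk records per source document in a JSON Lines file."""
    counts: Counter = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                counts[json.loads(line)["file"]] += 1
    return counts
```

Running `count_chunks_per_file(chunks_file_path)` should report the same per-document counts that were printed during chunking.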

## Authoring

In [19]:
!pip install -qq docling-sdg

# TODO: replace with above after https://github.com/docling-project/docling-sdg/pull/31 merges
#!pip install -qq git+https://github.com/anastasds/docling-sdg@d15de2c5a81bfe166f66f412fc4b23728065f396

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [25]:
from docling_sdg.qa.utils import get_qa_chunks

filters = [
    lambda chunk: len(str(chunk.text)) > 500
]

dataset = {}
for doc in docs:
    print(f"Chunking and filtering document {doc['file']}")
    
    # get_qa_chunks expects a list[DocChunk] which we already have from doc["chunk_objs"] in the chunking section
    qa_chunks = list(get_qa_chunks(doc["file"], doc["chunk_objs"], filters)) #TODO: decouple reference to chunk_objs from above)
    dataset[doc["file"]] = qa_chunks
    
    print(f"Created dataset {doc['file']} with {len(qa_chunks)} QA chunks")

Chunking and filtering document 2022-nfl-rulebook-final
Created dataset 2022-nfl-rulebook-final with 720 QA chunks
Chunking and filtering document BofA_InterestChecking_en_ADA
Created dataset BofA_InterestChecking_en_ADA with 8 QA chunks
Chunking and filtering document 2502.01618v3
Created dataset 2502.01618v3 with 45 QA chunks
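
The 500-character cutoff in `filters` decides how many chunks survive. To see how sensitive that count is before settling on a value, a small standard-library sketch can help; `count_above_threshold` is a hypothetical helper, not part of `docling_sdg`.

```python
def count_above_threshold(texts: list[str], thresholds: list[int]) -> dict[int, int]:
    """For each threshold, count the texts strictly longer than that many characters."""
    return {t: sum(len(text) > t for text in texts) for t in thresholds}
```

For example, `count_above_threshold([str(c.text) for c in docs[0]["chunk_objs"]], [100, 500, 1000])` would show how many chunks of the first document pass each candidate cutoff.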


### Initialize QA generator, supplying details for which model to use

`GenerateOptions` controls which model is used for QA generation. Set `generate_options.provider` below to one of three options:

* `LlmProvider.WATSONX` for watsonx
* `LlmProvider.OPENAI` for OpenAI
* `LlmProvider.OPENAI_LIKE` for any model provider with OpenAI-compatible APIs

In [26]:
API_KEY = "none"  # the API access key for your account (cannot be empty)
API_URL = "http://127.0.0.1:11434/v1"  # the URL of your model's API
MODEL_ID = "granite3.3" # the name of your model
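
If you prefer not to hard-code credentials in the notebook, the same three values can be read from environment variables. This is a minimal sketch: the variable names `QA_API_KEY`, `QA_API_URL`, and `QA_MODEL_ID` and the helper `load_llm_settings` are hypothetical, and the fallbacks are the values used in the cell above.

```python
import os
from collections.abc import Mapping


def load_llm_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read endpoint settings from the environment, falling back to the notebook defaults."""
    return {
        # Variable names are hypothetical; adapt them to your own setup.
        "api_key": env.get("QA_API_KEY", "none"),
        "api_url": env.get("QA_API_URL", "http://127.0.0.1:11434/v1"),
        "model_id": env.get("QA_MODEL_ID", "granite3.3"),
    }
```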

In [27]:
from docling_sdg.qa.generate import Generator
from docling_sdg.qa.base import GenerateOptions, LlmProvider
from pydantic import SecretStr

generate_options = GenerateOptions(project_id="project_id")
generate_options.provider = LlmProvider.OPENAI_LIKE
generate_options.api_key = SecretStr(API_KEY)
generate_options.url = API_URL
generate_options.model_id = MODEL_ID

### Configure subset selection

In [52]:
NUM_CHUNKS_PER_FILE_TO_SELECT_FOR_AUTHORING = 2
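
Random selection gives a different subset on every run. As a minimal sketch, assuming reproducible selection is desirable while proper subset selection is still a TODO, a hypothetical helper `select_chunks` can use a seeded generator and cap the sample size at the number of available chunks:

```python
import random


def select_chunks(chunks: list, k: int, seed: int = 0) -> list:
    """Pick up to k chunks reproducibly; returns every chunk when fewer than k exist."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(chunks, min(k, len(chunks)))
```

Calling it twice with the same seed returns the same chunks, which makes the generated QA pairs easier to compare across runs.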

### Run QA generation on selected chunks

In [53]:
import random #TODO: replace random sampling with subset selection

generated_files = []

for doc, chunks in dataset.items():
    generate_options.generated_file = AUTHORING_OUTPUT_DIR / f"qagen-{doc}.json"
    gen = Generator(generate_options=generate_options)
    print(f"processing chunks that looks like:\n{chunks[0].text}")
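    # Cap the sample size: random.sample raises ValueError if it exceeds len(chunks)
    num_to_select = min(NUM_CHUNKS_PER_FILE_TO_SELECT_FOR_AUTHORING, len(chunks))
    selected_chunks = random.sample(chunks, num_to_select)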
    print(f"Selected {len(selected_chunks)} contexts")

    generate_options.generated_file.unlink(missing_ok=True)  # remove output from earlier runs
    results = gen.generate_from_chunks(selected_chunks) # automatically saves to file
    generated_files.append(generate_options.generated_file)

    print(f"{doc}: {results.status}")

processing chunks that looks like:
This edition of the Official Playing Rules of the  National Football League  contains all current rules governing the playing of professional football that are in effect for the 2022 NFL season. Member clubs of the League may amend the rules from time to time, pursuant to the applicable voting procedures of the NFL Constitution and Bylaws.
Any intra-League dispute or call for interpretation in connection with these rules will be decided by the Commissioner of the League, whose ruling will be final.
Because inter-conference games are played throughout the preseason, regular season, and postseason in  the  NFL, all  rules contained in  this  book apply uniformly to both the American and National Football Conferences.
Where the word 'illegal' appears in this rule book, it is an institutional term of art pertaining strictly to actions that violate NFL playing rules. It is not meant to connote illegality under any public law or the rules or regulations of 

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00,  3.47s/it]


2022-nfl-rulebook-final: Status.SUCCESS
processing chunks that looks like:
Opening Deposit
$100 or more
Interest Rate
This account earns interest at a variable rate. You can find current rate information at bankofamerica.com, by calling the number on your account statement or visiting a financial center.
Monthly
Maintenance
Fee
$25.00 each month. You can avoid the Monthly Maintenance Fee when you meet ONE of the following requirements during each statement cycle:
• Maintain a minimum daily balance of $20,000 or more in your account OR
• Be a member of the Preferred Rewards program. Learn more at bankofamerica.com/preferred-rewards.
ATM fees
Bank of America ATMs
No ATM fee
For deposits, withdrawals, transfers or balance inquiries
Non-Bank of America ATMs
$2.50
In the U.S., plus any fee charged by the ATM's operator
$5.00
Outside the U.S., plus any fee charged by the ATM's operator
Selected 2 contexts


100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.06s/it]


BofA_InterestChecking_en_ADA: Status.SUCCESS
processing chunks that looks like:
Large language models (LLMs) have achieved significant performance gains via scaling up model sizes and/or data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking as a consequence of approximation errors in reward models. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution of a state-space model with an approximate likelihood, rather than optimize for its mode directly. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4-16x be

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00,  4.08s/it]

2502.01618v3: Status.SUCCESS





### Read generated QAs and restructure

In [43]:
import json
import yaml
from textwrap import wrap

qnas = {}
chunk_id_to_text = {}
for file in generated_files:
    with open(file, "rt") as f:
        for line in f:
            entry = json.loads(line)
            chunk_id = entry['chunk_id']
            if chunk_id not in chunk_id_to_text:
                chunk_id_to_text[chunk_id] = entry['context']
            if chunk_id not in qnas:
                qnas[chunk_id] = []
            qnas[chunk_id].append({'question': entry['question'], 'answer': entry['answer']})

print(f"Generated QA pairs for {len(qnas)} contexts")
print(list(qnas.values())[0])

Generated QA pairs for 13 contexts
[{'question': 'What is the mandatory color for the lower portion of game socks and/or leg coverings?', 'answer': 'White'}, {'question': "What are the color options for a player's uniform, including helmets, jerseys, pants, and game socks and/or leg coverings?", 'answer': 'Players are permitted to wear only the colors or a combination of those colors established for their NFL club in the League Constitution and Bylaws, with white also being an available color for jerseys and mandatory for the lower portion of game socks and/or leg coverings. Each player on a given team must wear the same colors on his uniform as all other players on his team in the same game.'}, {'question': 'Based on the policy, why is it important for players to present an appearance that is appropriate to representing their individual clubs and the National Football League?', 'answer': 'The policy emphasizes professionalism and representation, suggesting that the appearance of playe
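
The restructuring step above groups QA records by `chunk_id`. The same grouping can be written as a function over any iterable of JSON Lines strings, which makes it easy to try on a few hand-written records. This is a sketch with a hypothetical name, `group_qa_lines`, using the same record keys as the cell above.

```python
import json
from collections import defaultdict


def group_qa_lines(lines):
    """Group JSON Lines QA records by chunk_id; returns (qnas, chunk_id_to_text)."""
    qnas = defaultdict(list)
    chunk_id_to_text = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        chunk_id = entry["chunk_id"]
        # Keep the first context seen for each chunk
        chunk_id_to_text.setdefault(chunk_id, entry["context"])
        qnas[chunk_id].append({"question": entry["question"], "answer": entry["answer"]})
    return dict(qnas), chunk_id_to_text
```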

### Define metadata for qna.yaml

In [54]:
DOCUMENT_OUTLINE = "A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods"
DOMAIN = "artificial intelligence"

### Output qna.yaml

In [63]:
qna_output_path = AUTHORING_OUTPUT_DIR / "qna.yaml"

# The following creates a data structure for outputting in the expected format for qna.yaml
# TODO: extract into utils library

def str_presenter(dumper, data):
    if len(data.splitlines()) > 1:  # check for multiline string
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    elif len(data) > 80:
        data = "\n".join(wrap(data, 80))
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)

yaml.add_representer(str, str_presenter)

# to use with safe_dump:
yaml.representer.SafeRepresenter.add_representer(str, str_presenter)

class IndentedDumper(yaml.Dumper):
    def increase_indent(self, flow=False, indentless=False):
        return super().increase_indent(flow, False)

data = {'seed_examples': []}
for chunk_id, context in chunk_id_to_text.items():
    data['seed_examples'].append({
        'context': context,
        'questions_and_answers': [
            {
                'question': example['question'],
                'answer': example['answer'],
            } for example in qnas[chunk_id]
        ]
    })

data['document_outline'] = DOCUMENT_OUTLINE
data['domain'] = DOMAIN

Path.unlink(qna_output_path, missing_ok=True) # shouldn't be necessary but was. jupyter caching thing?
with open(qna_output_path, 'w') as yaml_file:
    yaml.dump(data, yaml_file, Dumper=IndentedDumper, default_flow_style=False, sort_keys=False, width=80)

print(f"qna.yaml saved to: {qna_output_path}")

qna.yaml saved to: workspaces/default/authoring/qna.yaml
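
The decision made by `str_presenter` can be looked at in isolation, without PyYAML. The sketch below mirrors its three branches in a hypothetical function `block_text`, which returns the text that would be emitted and whether block style (`|`) would be used.

```python
from textwrap import wrap


def block_text(data: str, width: int = 80) -> tuple[str, bool]:
    """Return (text, use_block_style) following the same rules as str_presenter."""
    if len(data.splitlines()) > 1:
        return data, True  # already multiline: emit unchanged in block style
    if len(data) > width:
        return "\n".join(wrap(data, width)), True  # long single line: wrap, then block style
    return data, False  # short single line: plain scalar
```

Note that a literal block scalar preserves newlines, so a long single-line string that gets wrapped is read back from the YAML file with newline characters in it.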


### View generated qna.yaml

In [64]:
with open(qna_output_path) as yaml_file:
    print(yaml_file.read())

seed_examples:
  - context: |-
      ARTICLE 1.  GENERAL POLICY. Throughout the game-day period while in view of the stadium and television audience, including during team pregame warm-ups, all players must dress in a professional manner under the uniform standards. The helmet and mandatory padding referenced in Article 3 below are intended to provide reasonable protection to a player while reasonably avoiding risk of injury to other players. The development of Playing Rules should be governed by this Article. Players generally must present an appearance that is appropriate to representing their individual clubs and the National Football League. The term uniform, as used in this policy, applies to every piece of equipment worn by a player, including helmet, shoulder pads, thigh pads, knee pads, and any other item of protective gear, and to every visible item of apparel, including but not limited to pants, jerseys, wristbands, gloves, game socks and/or leg coverings, shoes, visible unde

### Revise QAs

Open the generated `qna.yaml` in your preferred text editor to check the quality of the generated questions and answers. If the generation step failed to generate three questions and answers for each of five contexts, supplement them by hand until the required number of QA pairs is reached.
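
The count requirement can be checked mechanically before moving on. The following is a minimal sketch that reads the requirement as at least five contexts with at least three QA pairs each, which is an interpretation; `find_qna_gaps` is a hypothetical helper that works on the already-parsed YAML data.

```python
def find_qna_gaps(data: dict, min_contexts: int = 5, min_pairs: int = 3) -> list[str]:
    """Return human-readable problems; an empty list means the counts are met."""
    problems = []
    examples = data.get("seed_examples", [])
    if len(examples) < min_contexts:
        problems.append(f"only {len(examples)} contexts, need {min_contexts}")
    for i, example in enumerate(examples):
        pairs = example.get("questions_and_answers", [])
        if len(pairs) < min_pairs:
            problems.append(f"context {i} has {len(pairs)} QA pairs, need {min_pairs}")
    return problems
```

After editing, the file can be parsed with `yaml.safe_load` and the resulting dictionary passed to `find_qna_gaps`. This only checks counts; the quality of each question and answer still needs a human read.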

## Create Seed Dataset for SDG

This section combines the contents of `qna.yaml` with the chunks from the source documents to create a seed dataset for the synthetic data generation process.

To run this step you need `chunks.jsonl` in `CHUNKING_OUTPUT_DIR` and `qna.yaml` in `AUTHORING_OUTPUT_DIR`.

This step writes a `seed_data.jsonl` file to `SDG_OUTPUT_DIR`.

In [65]:
!pip install -qq datasets transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [67]:
from utils.create_seed_dataset import get_seed_dataset

seed_data = get_seed_dataset(CHUNKING_OUTPUT_DIR, AUTHORING_OUTPUT_DIR)
output_path = f'{SDG_OUTPUT_DIR}/seed_data.jsonl'
seed_data.to_json(output_path, orient='records', lines=True)

print(f"Seed data contains {seed_data.data.num_rows} rows")
print(f"Results saved to: {output_path}")

Seed data contains 18600 rows
Results saved to: workspaces/default/sdg/seed_data.jsonl


### Inspect the seed data

In [68]:
print(seed_data.data.table.slice(length=1))

pyarrow.Table
document_outline: string
document_title: string
domain: string
icl_document: string
icl_query_1: string
icl_response_1: string
icl_query_2: string
icl_response_2: string
icl_query_3: string
icl_response_3: string
document: string
----
document_outline: [["A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using
Particle-Based Monte Carlo Methods"]]
document_title: [["2022-nfl-rulebook-final"]]
domain: [["artificial intelligence"]]
icl_document: [["ARTICLE 1.  GENERAL POLICY. Throughout the game-day period while in view of the stadium and television audience, including during team pregame warm-ups, all players must dress in a professional manner under the uniform standards. The helmet and mandatory padding referenced in Article 3 below are intended to provide reasonable protection to a player while reasonably avoiding risk of injury to other players. The development of Playing Rules should be governed by this Article. Players generally must present an app
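
The helper `get_seed_dataset` comes from `utils.create_seed_dataset` and its code is not shown in this notebook, so the following is only a guess at the overall shape of the result. The 18600 rows reported above for 1860 chunks (1798 + 10 + 52) are consistent with each chunk being paired with ten seed examples, and the sketch assumes exactly that pairing. The function name `build_seed_rows` and the mapping of inputs to columns are assumptions, and any filtering done by the real helper is not modeled.

```python
def build_seed_rows(chunks, seed_examples, document_outline, domain):
    """Pair every chunk with every seed example, using the column names printed above."""
    rows = []
    for chunk in chunks:
        for example in seed_examples:
            row = {
                "document_outline": document_outline,
                "document_title": chunk["file"],    # assumed: the source document name
                "domain": domain,
                "icl_document": example["context"],
                "document": chunk["chunk"],         # assumed: the chunk text
            }
            # Only the first three QA pairs of each seed example are used here
            for n, qa in enumerate(example["questions_and_answers"][:3], start=1):
                row[f"icl_query_{n}"] = qa["question"]
                row[f"icl_response_{n}"] = qa["answer"]
            rows.append(row)
    return rows
```

Under this assumption the row count is simply the number of chunks multiplied by the number of seed examples.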

## Summary

To recap, given a source document in PDF format, this notebook:

1. Converted the document using Docling and saved it to JSON for inspection
2. Split the extracted text into chunks
3. Generated QA pairs for a subset of those chunks
4. Created a `qna.yaml` available for inspection and revision
5. Combined the chunks and `qna.yaml` to create a `seed_data.jsonl` for use with SDG

The next step is to use the resulting `seed_data.jsonl` for SDG, such as illustrated in [this notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/instructlab/knowledge/knowledge_generation_and_mixing.ipynb).