# 🐶 Data Pre-Processing: From source PDF to SDG-ready

This notebook walks through each stage of data pre-processing. Directory-based conventions are used to save intermediate results as each PDF is converted, chunked, and used for QA generation, producing a `qna.yaml` file for each knowledge contribution. At the end, everything is combined into the inputs for SDG.
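
As a rough illustration of the directory convention, the sketch below builds the per-stage paths for one contribution. The helper name `contribution_dirs` is hypothetical and not part of this notebook; the stage names and the `workspaces/default` root mirror the directories the notebook creates.

```python
from pathlib import Path

# Stage names mirror the directories this notebook creates under a workspace.
STAGES = ("source_documents", "conversion", "chunking", "authoring")


def contribution_dirs(workspace_dir: Path, prefix: str) -> dict[str, Path]:
    """Return the per-stage directory for one contribution (nothing is created on disk)."""
    return {stage: workspace_dir / stage / prefix for stage in STAGES}


# "nfl" is only an example prefix; "workspaces/default" matches the notebook defaults.
dirs = contribution_dirs(Path("workspaces") / "default", "nfl")
for stage, path in dirs.items():
    print(f"{stage:>16}: {path.as_posix()}")
```

Keeping the contribution prefix as the last path component means every stage can locate its inputs from the prefix alone.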

Once an SDG seed dataset is created, a user can run through an SDG notebook and generate samples.

**NOTE**: Starting the notebook with Python 3.11 is recommended. Python 3.12 and later are not yet supported.

1. [Data Gathering](#Data-Gathering)
1. [Document Conversion](#Document-Conversion)
1. [Chunking](#Chunking)
1. [Authoring](#Authoring)
1. [Create Seed Dataset](#Create-Seed-Dataset-for-SDG)

***

In [None]:
from pathlib import Path

WORKSPACE_NAME = "default"

WORKSPACE_ROOT = Path("workspaces")
WORKSPACE_ROOT.mkdir(exist_ok=True)

WORKSPACE_DIR = WORKSPACE_ROOT / WORKSPACE_NAME
WORKSPACE_DIR.mkdir(exist_ok=True)

SOURCE_DOCUMENT_DIR = WORKSPACE_DIR / "source_documents"
SOURCE_DOCUMENT_DIR.mkdir(parents=True, exist_ok=True)

CONVERSION_OUTPUT_DIR = WORKSPACE_DIR / "conversion"
CONVERSION_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

CHUNKING_OUTPUT_DIR = WORKSPACE_DIR / "chunking"
CHUNKING_OUTPUT_DIR.mkdir(exist_ok=True)

AUTHORING_OUTPUT_DIR = WORKSPACE_DIR / "authoring"
AUTHORING_OUTPUT_DIR.mkdir(exist_ok=True)

SDG_OUTPUT_DIR = WORKSPACE_DIR / "sdg"
SDG_OUTPUT_DIR.mkdir(exist_ok=True)

## Data Gathering

TODO: Add documentation about domain and summary here, clear out second contribution example

In [None]:
# Each contribution records its document outline (summary), domain, and the path to its PDFs

contribution_path = Path(SOURCE_DOCUMENT_DIR / "nfl")
contribution_prefix = "nfl"
contribution_domain = "sports" 
contribution_summary = "Official playing rules of the National Football League 2022"

contribution1 = {"path": contribution_path, "prefix": contribution_prefix, "domain": contribution_domain, "summary": contribution_summary}

contribution_path2 = Path(SOURCE_DOCUMENT_DIR / "finance")
contribution_prefix2 = "finance"
contribution_domain2 = "banking" 
contribution_summary2 = "Account information for a specific bank"

contribution2 = {"path": contribution_path2, "prefix": contribution_prefix2, "domain": contribution_domain2, "summary": contribution_summary2}

contributions = []
contributions.append(contribution1)
contributions.append(contribution2)

for contribution in contributions:
    contribution["files"] = list(contribution["path"].glob("*.pdf"))

print("Files to convert:")
for contribution in contributions:
    # Avoid nested double quotes inside an f-string; that syntax needs Python 3.12+
    print(contribution["files"])
    conv_output_dir = CONVERSION_OUTPUT_DIR / contribution["prefix"]
    conv_output_dir.mkdir(parents=True, exist_ok=True)

    chunking_output_dir = CHUNKING_OUTPUT_DIR / contribution["prefix"]
    chunking_output_dir.mkdir(parents=True, exist_ok=True)

    authoring_output_dir = AUTHORING_OUTPUT_DIR / contribution["prefix"]
    authoring_output_dir.mkdir(parents=True, exist_ok=True)

## Document Conversion

This notebook uses [Docling](https://github.com/docling-project/docling) to convert supported document types into a Docling Document. A Docling Document is the representation of the document after conversion, and it can be exported as JSON. The JSON output of this step is used by later steps, such as chunking with Docling's chunking methods.

In [None]:
!pip install -qq docling

### Configure Docling conversion pipeline

Next, we set the configuration options for our conversion pipeline. The PDF conversion options set here are the defaults.

For a complete reference on Docling conversion pipeline configuration, see [`PdfPipelineOptions`](https://docling-project.github.io/docling/reference/pipeline_options/#docling.datamodel.pipeline_options.PdfPipelineOptions) and [`PdfFormatOption`](https://docling-project.github.io/docling/reference/document_converter/#docling.document_converter.InputFormat.XML_JATS).

In [None]:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions

pipeline_options = PdfPipelineOptions() # TODO: show the options that can be set

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options
        )
    }
)

Finally, we convert every document into Docling JSON, as long as it is a file type that Docling can convert.
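
One way to guard against unsupported inputs is to partition files by suffix before conversion. The sketch below is a minimal illustration under stated assumptions: `split_convertible` and `ALLOWED_SUFFIXES` are hypothetical names, and the allow-list is deliberately limited to PDF rather than reflecting Docling's full list of supported formats.

```python
from pathlib import Path

# Assumption: an illustrative allow-list, not Docling's authoritative list of formats.
ALLOWED_SUFFIXES = {".pdf"}


def split_convertible(paths, allowed=ALLOWED_SUFFIXES):
    """Partition paths into (convertible, skipped) by lower-cased file suffix."""
    convertible, skipped = [], []
    for path in map(Path, paths):
        (convertible if path.suffix.lower() in allowed else skipped).append(path)
    return convertible, skipped


ok, skipped = split_convertible(["rules.pdf", "notes.txt", "SCAN.PDF"])
print("convert:", [p.name for p in ok])
print("skip:", [p.name for p in skipped])
```

Comparing the lower-cased suffix also picks up files such as `SCAN.PDF`, which a case-sensitive `*.pdf` glob can miss on some file systems.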

In [None]:
import json

for contribution in contributions:
    contribution["json_files"] = []
    for file in contribution["files"]:
        doc = doc_converter.convert(source=file).document
        doc_dict = doc.export_to_dict()
    
        json_output_path = CONVERSION_OUTPUT_DIR / contribution["prefix"] / f"{file.stem}.json"
        with open(json_output_path, "w", encoding="utf-8") as f:
            json.dump(doc_dict, f)
        print(f"Path of JSON output is: {json_output_path.resolve()}")
        # Track the converted file on its contribution so the chunking step can find it
        contribution["json_files"].append(json_output_path.resolve())

## Chunking

The goal of chunking the converted documents is to give the teacher model small, logical pieces of the source document from which to generate data.

In this notebook we are doing chunking with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).

The input to this section is a Docling JSON file created by the conversion step, or a directory of Docling JSON files.

### Initialize the Chunker

Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.
The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.

The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization-aware.

The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks. Uncomment the commented out code to configure these.
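
To build intuition for what merging undersized peer chunks means, here is a toy sketch. It is not Docling's algorithm: the function name `merge_undersized` is hypothetical, tokens are approximated by whitespace-separated words, and oversized chunks are left as they are rather than split.

```python
def merge_undersized(chunks: list[str], max_tokens: int) -> list[str]:
    """Greedily merge adjacent chunks while the merged text stays within max_tokens.

    Tokens are approximated by whitespace-separated words, which is a
    simplification: a real tokenizer would usually count differently.
    """
    merged: list[str] = []
    for chunk in chunks:
        if merged and len((merged[-1] + " " + chunk).split()) <= max_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged


peers = [
    "A safety scores two points.",
    "A touchdown scores six points.",
    "Extra rules apply in overtime.",
]
for chunk in merge_undersized(peers, max_tokens=10):
    print(repr(chunk))
```

With a budget of 10 words, the first two five-word chunks are merged and the third starts a new chunk, so fewer but fuller chunks reach the teacher model.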

In [None]:
#from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
#from transformers import AutoTokenizer

from docling.chunking import HybridChunker

#EMBED_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
#MAX_TOKENS = 1024
#
# tokenizer = HuggingFaceTokenizer(
#     tokenizer=AutoTokenizer.from_pretrained(EMBED_MODEL_ID),
#     max_tokens=MAX_TOKENS,  # optional, by default derived from `tokenizer` for HF case
# )

chunker = HybridChunker(
    #tokenizer=tokenizer,
    #merge_peers=True,  # whether to merge undersized chunks - defaults to True
)

### Load and chunk the converted docling document

Next, let's load each converted JSON file back into a Docling Document and chunk it.

In [None]:
for contribution in contributions:
    contribution["all_chunks"] = []
    contribution["docs"] = []
    for file in contribution["json_files"]:
        # reconvert the docling JSON for chunking
        doc = DocumentConverter().convert(source=file)
        
        chunk_iter = chunker.chunk(dl_doc=doc.document)
        chunk_objs = list(chunk_iter)
        chunks = [chunker.contextualize(chunk=chunk) for chunk in chunk_objs]
    
        print(f"Extracted {len(chunks)} chunks from {doc.document.name}")
        
        for chunk in chunks:
            c = dict(chunk=chunk, file=doc.document.name)
            contribution["all_chunks"].append(c)
        
        contribution["docs"].append(dict(chunk_objs=chunk_objs,file=doc.document.name))

### View the Chunks

To view the chunks, run the following cell. As you can see, the document is broken into small pieces, each with metadata about the chunk based on the document's format.

In [None]:
print(contributions[0]["all_chunks"][0]["chunk"])

### Save all chunks to a JSON file

All chunks for a contribution are saved to a JSON Lines file called `chunks.jsonl` in that contribution's subdirectory of `CHUNKING_OUTPUT_DIR`. This file is one of the inputs used further below when we create the seed dataset for SDG.
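
To sanity-check a saved file, the records can be read back one line at a time. The sketch below is a minimal example using only the standard library: `load_chunks` is a hypothetical helper, and the demo records are made up, shaped like the `{"chunk", "file"}` dictionaries this notebook writes.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def load_chunks(path: Path) -> list[dict]:
    """Read one JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo on a throwaway file with two made-up records.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "chunks.jsonl"
    demo_path.write_text(
        '{"chunk": "Rule 1 text", "file": "rules"}\n'
        '{"chunk": "Rule 2 text", "file": "rules"}\n',
        encoding="utf-8",
    )
    records = load_chunks(demo_path)

# Count how many chunks came from each source document.
per_file = Counter(record["file"] for record in records)
print(per_file)
```

Pointing `load_chunks` at a contribution's real `chunks.jsonl` gives a quick count of chunks per source document.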

In [None]:
for contribution in contributions:
    chunks_file_path = CHUNKING_OUTPUT_DIR / contribution["prefix"] / "chunks.jsonl"
    with open(chunks_file_path, "w", encoding="utf-8") as file:
        for chunk in contribution["all_chunks"]:
            json.dump(chunk, file)
            file.write("\n")
        print(f"Path of chunks JSON is: {Path(chunks_file_path).resolve()}")


## Authoring

In [None]:
!pip install -qq docling-sdg

# TODO: replace with above after https://github.com/docling-project/docling-sdg/pull/31 merges
#!pip install -qq git+https://github.com/anastasds/docling-sdg@d15de2c5a81bfe166f66f412fc4b23728065f396

In [None]:
from docling_sdg.qa.utils import get_qa_chunks

filters = [
    lambda chunk: len(str(chunk.text)) > 500
]

for contribution in contributions:
    contribution["dataset"] = {}
    for doc in contribution["docs"]:
        print(f"Chunking and filtering document {doc['file']}")
        
        # get_qa_chunks expects a list[DocChunk], which we already have from doc["chunk_objs"] in the chunking section
        qa_chunks = list(get_qa_chunks(doc["file"], doc["chunk_objs"], filters))  # TODO: decouple reference to chunk_objs from above
        contribution["dataset"][doc["file"]] = qa_chunks
        
        print(f"Created dataset {doc['file']} with {len(qa_chunks)} QA chunks")

### Initialize QA generator, supplying details for which model to use

`GenerateOptions` controls which model is used for QA generation; the provider is selected by setting `generate_options.provider` below. Three options are available:

* `LlmProvider.WATSONX` for watsonx
* `LlmProvider.OPENAI` for OpenAI
* `LlmProvider.OPENAI_LIKE` for any model provider with OpenAI-compatible APIs
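
Whichever provider is used, it can be convenient to keep the connection details out of the notebook itself. The sketch below reads them from environment variables; the variable names (`SDG_API_KEY`, `SDG_API_URL`, `SDG_MODEL_ID`) and the helper `load_llm_settings` are hypothetical, and the fallback values are the ones used in this notebook.

```python
import os


def load_llm_settings(env=os.environ) -> dict[str, str]:
    """Read connection settings from a mapping, falling back to the notebook's values."""
    return {
        "api_key": env.get("SDG_API_KEY", "none"),  # hypothetical variable name
        "url": env.get("SDG_API_URL", "http://127.0.0.1:11434/v1"),  # hypothetical variable name
        "model_id": env.get("SDG_MODEL_ID", "granite3.3"),  # hypothetical variable name
    }


# A plain dict stands in for the environment here so the example is reproducible.
settings = load_llm_settings({"SDG_MODEL_ID": "my-model"})
print({key: value for key, value in settings.items() if key != "api_key"})
```

The resulting values could then be assigned to `API_KEY`, `API_URL`, and `MODEL_ID` instead of hard-coding them.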

In [None]:
API_KEY = "none"  # the API access key for your account (cannot be empty)
API_URL = "http://127.0.0.1:11434/v1"  # the URL of your model's API
MODEL_ID = "granite3.3"  # the name of your model

In [None]:
from docling_sdg.qa.generate import Generator
from docling_sdg.qa.base import GenerateOptions, LlmProvider
from pydantic import SecretStr

generate_options = GenerateOptions(project_id="project_id")
generate_options.provider = LlmProvider.OPENAI_LIKE
generate_options.api_key = SecretStr(API_KEY)
generate_options.url = API_URL
generate_options.model_id = MODEL_ID

### Configure subset selection

In [None]:
NUM_CHUNKS_PER_FILE_TO_SELECT_FOR_AUTHORING = 2

### Run QA generation on selected chunks

In [None]:
import random #TODO: replace random sampling with subset selection

for contribution in contributions:
    contribution["generated_files"] = []
    for doc, chunks in contribution["dataset"].items():
        generate_options.generated_file = AUTHORING_OUTPUT_DIR / contribution["prefix"] / f"qagen-{doc}.json"
        gen = Generator(generate_options=generate_options)
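        if not chunks:
            print(f"No QA chunks passed the filters for {doc}; skipping")
            continue
        print(f"processing chunks that look like:\n{chunks[0].text}")
        # random.sample raises ValueError when asked for more items than are available
        num_to_select = min(NUM_CHUNKS_PER_FILE_TO_SELECT_FOR_AUTHORING, len(chunks))
        selected_chunks = random.sample(chunks, num_to_select)
        print(f"Selected {len(selected_chunks)} contexts")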
    
        Path.unlink(generate_options.generated_file, missing_ok=True)
        results = gen.generate_from_chunks(selected_chunks) # automatically saves to file
        contribution["generated_files"].append(generate_options.generated_file)
    
        print(f"{doc}: {results.status}")

### Read generated QAs and restructure

In [None]:
import json
import yaml
from textwrap import wrap

for contribution in contributions:
    contribution["qnas"] = {}
    contribution["chunk_id_to_text"] = {}
    for file in contribution["generated_files"]:
        with open(file, "rt") as f:
            for line in f.readlines():
                entry = json.loads(line)
                chunk_id = entry['chunk_id']
                if chunk_id not in contribution["chunk_id_to_text"]:
                    contribution["chunk_id_to_text"][chunk_id] = entry['context']
                if chunk_id not in contribution["qnas"]:
                    contribution["qnas"][chunk_id] = []
                contribution["qnas"][chunk_id].append({'question': entry['question'], 'answer': entry['answer']})
    
    print(f"Generated QA pairs for {len(contribution['qnas'])} contexts")
    print(list(contribution["qnas"].values())[0])

### Output qna.yaml

In [None]:
# The following creates a data structure for outputting in the expected format for qna.yaml
# TODO: extract into utils library

def str_presenter(dumper, data):
    if len(data.splitlines()) > 1:  # check for multiline string
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    elif len(data) > 80:
        data = "\n".join(wrap(data, 80))
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    return dumper.represent_scalar('tag:yaml.org,2002:str', data)

yaml.add_representer(str, str_presenter)

# to use with safe_dump:
yaml.representer.SafeRepresenter.add_representer(str, str_presenter)

class IndentedDumper(yaml.Dumper):
    def increase_indent(self, flow=False, indentless=False):
        return super(IndentedDumper, self).increase_indent(flow, False)

for contribution in contributions:
    qna_output_path = AUTHORING_OUTPUT_DIR / contribution["prefix"] / "qna.yaml"
    
    data = {'seed_examples': []}
    for chunk_id, context in contribution["chunk_id_to_text"].items():
        data['seed_examples'].append({
            'context': context,
            'questions_and_answers': [
                {
                    'question': example['question'],
                    'answer': example['answer'],
                } for example in contribution["qnas"][chunk_id]
            ]
        })
    
    data['document_outline'] = contribution["summary"]
    data['domain'] = contribution["domain"]
    
    Path.unlink(qna_output_path, missing_ok=True) # shouldn't be necessary but was. jupyter caching thing?
    with open(qna_output_path, 'w') as yaml_file:
        yaml.dump(data, yaml_file, Dumper=IndentedDumper, default_flow_style=False, sort_keys=False, width=80)
    
    print(f"qna.yaml saved to: {qna_output_path}")

### View generated qna.yaml

In [None]:
for contribution in contributions:
    qna_output_path = AUTHORING_OUTPUT_DIR / contribution["prefix"] / "qna.yaml"

    with open(qna_output_path) as yaml_file:
        print(f"========= qna.yaml at {qna_output_path} ==========")
        print(yaml_file.read())


### Revise QAs

Open the generated `qna.yaml` in your preferred text editor to check the quality of the generated questions and answers. If the generation step has failed to generate three questions and answers for each of five contexts, supplement them manually until the required number of QA pairs is reached.
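
A quick programmatic check can flag where QA pairs are missing before you start editing. The sketch below is a minimal example under stated assumptions: `find_shortfalls` is a hypothetical helper, the thresholds of five contexts and three pairs come from the text above, and the demo dictionary is made up in the same shape this notebook writes to `qna.yaml`.

```python
def find_shortfalls(qna: dict, min_contexts: int = 5, min_pairs: int = 3) -> list[str]:
    """Return human-readable problems; an empty list means the counts are met."""
    problems = []
    examples = qna.get("seed_examples", [])
    if len(examples) < min_contexts:
        problems.append(f"{len(examples)} contexts found, expected at least {min_contexts}")
    for index, example in enumerate(examples):
        pairs = example.get("questions_and_answers", [])
        if len(pairs) < min_pairs:
            problems.append(
                f"context {index} has {len(pairs)} QA pairs, expected at least {min_pairs}"
            )
    return problems


# Made-up data: one context with a single QA pair.
demo = {
    "seed_examples": [
        {
            "context": "example context",
            "questions_and_answers": [{"question": "Q1", "answer": "A1"}],
        }
    ]
}
for problem in find_shortfalls(demo):
    print(problem)
```

To run it on a real file, load the YAML first (for example with `yaml.safe_load`) and pass the resulting dictionary to the function.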

## Create Seed Dataset for SDG

This section combines the contents of each `qna.yaml` with the chunks from the source documents to create a seed dataset for the synthetic data generation process.

To run this step, each contribution needs a `chunks.jsonl` in its chunking directory and a `qna.yaml` in its authoring directory.

This step outputs a `seed_data.jsonl` file in the `SDG_OUTPUT_DIR` that you set.
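
Before building the seed dataset, it can help to confirm that the expected inputs exist. The sketch below is one possible pre-flight check: `missing_inputs` is a hypothetical helper, and it assumes the two-directory layout used by the code cell in this section, which passes a chunks directory and a QnA directory per contribution.

```python
import tempfile
from pathlib import Path


def missing_inputs(chunks_dir: Path, qna_dir: Path) -> list[Path]:
    """Return the expected input files that do not exist yet."""
    expected = [chunks_dir / "chunks.jsonl", qna_dir / "qna.yaml"]
    return [path for path in expected if not path.is_file()]


# Demo in a throwaway directory where only chunks.jsonl has been created.
with tempfile.TemporaryDirectory() as tmp:
    demo_chunks_dir = Path(tmp) / "chunking" / "nfl"
    demo_qna_dir = Path(tmp) / "authoring" / "nfl"
    demo_chunks_dir.mkdir(parents=True)
    demo_qna_dir.mkdir(parents=True)
    (demo_chunks_dir / "chunks.jsonl").touch()
    missing = missing_inputs(demo_chunks_dir, demo_qna_dir)

print([path.name for path in missing])
```

An empty result means both files are in place for that contribution.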

In [None]:
!pip install -qq datasets transformers

In [None]:
from utils.create_seed_dataset import get_seed_dataset, safe_concatenate_datasets

contribution_datasets = []
for contribution in contributions:
    chunks_dir = Path(CHUNKING_OUTPUT_DIR / contribution["prefix"])
    qna_dir = Path(AUTHORING_OUTPUT_DIR / contribution["prefix"])
    seed_data = get_seed_dataset(chunks_dir, qna_dir)
    contribution_datasets.append(seed_data)

final_seed_data = safe_concatenate_datasets(contribution_datasets)
output_path = f'{SDG_OUTPUT_DIR}/seed_data.jsonl'
final_seed_data.to_json(output_path, orient='records', lines=True)

print(f"Seed data contains {final_seed_data.data.num_rows} rows")
print(f"Results saved to: {output_path}")

### Inspect the seed data

In [None]:
print(final_seed_data.data.table.slice(length=1))

## Summary

To recap, given a source document in PDF format, this notebook:

1. Converted the document using Docling and saved it to JSON for inspection
2. Split the extracted text into chunks
3. Generated QA pairs for a subset of those chunks
4. Created a `qna.yaml` available for inspection and revision
5. Combined the chunks and `qna.yaml` to create a `seed_data.jsonl` for use with SDG

The next step is to use the resulting `seed_data.jsonl` for SDG, such as illustrated in [this notebook](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/instructlab/knowledge/knowledge_generation_and_mixing.ipynb).