In [None]:
# design goals:
#
# - understandability
# - modularity
# - configurability

# config

In [1]:
from pathlib import Path

In [2]:
WORKSPACE_DIR = "workspaces/default"

CONVERSION_OUTPUT_DIR = Path(f"{WORKSPACE_DIR}/conversion")
CONVERSION_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

CHUNKING_OUTPUT_DIR = Path(f"{WORKSPACE_DIR}/chunking")
CHUNKING_OUTPUT_DIR.mkdir(exist_ok=True)

SEED_EXAMPLE_OUTPUT_DIR = Path(f"{WORKSPACE_DIR}/seed-examples")
SEED_EXAMPLE_OUTPUT_DIR.mkdir(exist_ok=True)

SDG_OUTPUT_DIR = Path(f"{WORKSPACE_DIR}/sdg")
SDG_OUTPUT_DIR.mkdir(exist_ok=True)

Install the dependencies.

In [3]:
!pip install docling


In [4]:
# conversion

## Chunking

The goal of chunking in the knowledge pipeline for model customization is to give the teacher model small, logically coherent pieces of the source document from which to generate data.

In this notebook, we chunk with [Docling](https://docling-project.github.io/docling/examples/hybrid_chunking/#hybrid-chunking).

The input to this notebook is a docling JSON file created after a docling conversion, or a directory of docling JSON files.

In [14]:
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker, HierarchicalChunker
from docling.exceptions import ConversionError  # caught below when a file cannot be converted
from pathlib import Path
import json

### Set the source document path

Here we point the chunking step at the `converted.json` produced by the conversion notebook.

If the conversion notebook was not run, you can set the path to the source document in any form instead.

In [15]:
files = list(CONVERSION_OUTPUT_DIR.rglob("*.json"))
print(f"Docling JSON files to chunk: {files}")

Docling JSON files to chunk: [PosixPath('workspaces/default/conversion/cargo-theft-report-2018.json')]


### Initialize the Chunker

Docling provides two chunkers, the `HierarchicalChunker` and the `HybridChunker`.
The `HierarchicalChunker` creates chunks based on the hierarchy in the Docling document.

The `HybridChunker` builds on the `HierarchicalChunker` by making it tokenization-aware.

The `HybridChunker` has options for a `tokenizer`, the `max_tokens` in a chunk, and whether to merge undersized peer chunks.

In [16]:
#chunker = HierarchicalChunker()
chunker = HybridChunker()
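
To make the idea of merging undersized peer chunks under a token budget concrete, here is a minimal, library-free sketch. It is not Docling's implementation: `count_tokens` is a stand-in that counts whitespace-separated words, and `merge_undersized_peers` is a hypothetical helper that greedily joins neighbouring chunks while the result still fits in `max_tokens`.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return len(text.split())


def merge_undersized_peers(
    chunks: List[str],
    max_tokens: int,
    token_counter: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily join consecutive chunks while the joined text fits in max_tokens."""
    merged: List[str] = []
    for chunk in chunks:
        if merged and token_counter(merged[-1] + " " + chunk) <= max_tokens:
            # The previous chunk still has room, so absorb this one into it.
            merged[-1] = merged[-1] + " " + chunk
        else:
            # Either this is the first chunk or the budget would be exceeded.
            merged.append(chunk)
    return merged


# Placeholder text, not taken from any real document.
sections = [
    "Intro paragraph text.",
    "A second short paragraph here.",
    "A longer third paragraph that fills the budget.",
]
merged_sections = merge_undersized_peers(sections, max_tokens=8)
print(len(merged_sections))
```

The first two placeholder sections fit together within the budget of 8 word-tokens, while the third already uses the whole budget and stays on its own. A real tokenizer would count subword tokens rather than words, so the boundaries would differ.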

### Load and chunk the converted Docling document

Next, let's convert the JSON into a Docling document and chunk it.

In [18]:
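all_chunks = []
converter = DocumentConverter()  # create once and reuse for every file
for file in files:
    try:
        doc = converter.convert(source=file).document
        chunk_iter = chunker.chunk(dl_doc=doc)
        # contextualize() returns the chunk text enriched with its metadata (e.g. headings)
        chunks = [chunker.contextualize(chunk=chunk) for chunk in chunk_iter]
        for chunk in chunks:
            # keep the source file name alongside each chunk
            all_chunks.append(dict(chunk=chunk, file=file.stem))
    except ConversionError as e:
        # report why the file was skipped instead of silently dropping it
        print(f"Skipping file {file}: {e}")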

### View the Chunks

To view the chunks, uncomment and run the following cell. The document is broken into small pieces, each with metadata about the chunk based on the document's format.

In [19]:
# print(all_chunks)
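
Printing the whole list can be hard to read for larger documents. As an alternative, here is a small sketch of a preview helper; `preview_chunks` is a hypothetical name, and the sample records only mimic the `chunk`/`file` shape built in the loop above.

```python
import textwrap


def preview_chunks(records, limit=3, width=80):
    """Return one shortened line per record, for at most `limit` records."""
    lines = []
    for index, record in enumerate(records[:limit]):
        # shorten() collapses whitespace and truncates long text to `width` characters
        text = textwrap.shorten(record["chunk"], width=width, placeholder=" ...")
        lines.append(f"[{index}] {record['file']}: {text}")
    return lines


# Placeholder records with the same keys as the entries in all_chunks.
sample_records = [
    {"chunk": "Heading A\nFirst placeholder chunk text.", "file": "example-doc"},
    {"chunk": "Heading B\nSecond placeholder chunk text.", "file": "example-doc"},
    {"chunk": "Heading C\nThird placeholder chunk text.", "file": "example-doc"},
]

for line in preview_chunks(sample_records, limit=2):
    print(line)
```

In the notebook itself, the same helper could be called as `preview_chunks(all_chunks)`.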

### Save the chunks to a JSON file

All chunks are saved to a JSON Lines file called `chunks.jsonl` in `CHUNKING_OUTPUT_DIR`. This file is one of the inputs further below, when we create the seed dataset for SDG.

In [22]:
chunks_file_path = CHUNKING_OUTPUT_DIR / "chunks.jsonl"
with open(chunks_file_path, "w", encoding="utf-8") as file:
    for chunk in all_chunks:
        json.dump(chunk, file)
        file.write("\n")
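
As a quick sanity check, the file can be read back line by line. The sketch below is self-contained: it writes a couple of placeholder records to a temporary `chunks.jsonl` in the same one-object-per-line layout and reads them back. `read_jsonl` is a hypothetical helper, not part of the pipeline.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of Python objects, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Placeholder records with the same keys as the saved chunks.
sample = [
    {"chunk": "First placeholder chunk.", "file": "example-doc"},
    {"chunk": "Second placeholder chunk.", "file": "example-doc"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "chunks.jsonl"
    # Write one JSON object per line, as in the cell above.
    with open(sample_path, "w", encoding="utf-8") as handle:
        for record in sample:
            json.dump(record, handle)
            handle.write("\n")
    loaded = read_jsonl(sample_path)

print(len(loaded))
```

In the notebook, `read_jsonl(chunks_file_path)` should return as many records as there are entries in `all_chunks`.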

In [None]:
# authoring

In [None]:
# sdg

## Create Seed Dataset for SDG

This notebook combines the contents of `qna.yaml` with the chunks from the source document to create a seed dataset for the synthetic data generation (SDG) process.

To run this step, you need a `chunks.jsonl` and a `qna.yaml`; the cell below passes `CHUNKING_OUTPUT_DIR` and `SEED_EXAMPLE_OUTPUT_DIR` to `get_seed_dataset`.

This step outputs a `seed_data.jsonl` file in the `SDG_OUTPUT_DIR` that you set.
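
To illustrate what combining chunks with Q&A seed examples can look like, here is a rough sketch. It is not the implementation of `get_seed_dataset`: the function `build_seed_rows`, the pairing strategy (every chunk with every seed example), and all field names other than `chunk` and `file` are assumptions made for illustration only.

```python
from itertools import product


def build_seed_rows(chunks, seed_examples):
    """Pair every chunk with every seed example (hypothetical strategy and schema)."""
    rows = []
    for chunk, example in product(chunks, seed_examples):
        rows.append(
            {
                "document": chunk["chunk"],              # text the teacher model reads
                "source_file": chunk["file"],            # where the chunk came from
                "example_context": example["context"],   # assumed qna field
                "example_question": example["question"], # assumed qna field
                "example_answer": example["answer"],     # assumed qna field
            }
        )
    return rows


# Placeholder inputs; real data would come from chunks.jsonl and qna.yaml.
sample_chunks = [
    {"chunk": "First placeholder chunk.", "file": "example-doc"},
    {"chunk": "Second placeholder chunk.", "file": "example-doc"},
]
sample_examples = [
    {"context": "Context 1", "question": "Question 1?", "answer": "Answer 1."},
    {"context": "Context 2", "question": "Question 2?", "answer": "Answer 2."},
]

rows = build_seed_rows(sample_chunks, sample_examples)
print(len(rows))
```

With this pairing, the number of rows is the number of chunks multiplied by the number of seed examples.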

In [3]:
!pip install datasets transformers


In [4]:
from utils.create_seed_dataset import get_seed_dataset

In [5]:
seed_data = get_seed_dataset(CHUNKING_OUTPUT_DIR, SEED_EXAMPLE_OUTPUT_DIR)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.



## Save the seed data to a .JSONL file

Notebooks like the [knowledge_generation_and_mixing](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/blob/main/examples/instructlab/knowledge/knowledge_generation_and_mixing.ipynb) notebook take `seed_data.jsonl` as input.

In [6]:
seed_data.to_json(f'{SDG_OUTPUT_DIR}/seed_data.jsonl', orient='records', lines=True)

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

292788