In [1]:
!pip install instructlab==0.19.0
!SETUPTOOLS_SCM_PRETEND_VERSION=0.1 pip install --ignore-installed --upgrade ./sdg
!pip install docling-parse==1.3.0
!pip install docling==1.16.1


Collecting torch<2.4.0,>=2.3.0 (from instructlab==0.19.0)
  Obtaining dependency information for torch<2.4.0,>=2.3.0 from https://files.pythonhosted.org/packages/d0/5f/f41b14a398d484bf218d5167ec9061c1e76f500d9e25166117818c8bacda/torch-2.3.1-cp311-none-macosx_11_0_arm64.whl.metadata
  Using cached torch-2.3.1-cp311-none-macosx_11_0_arm64.whl.metadata (26 kB)
Using cached torch-2.3.1-cp311-none-macosx_11_0_arm64.whl (61.0 MB)
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 2.4.1
    Uninstalling torch-2.4.1:
      Successfully uninstalled torch-2.4.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchvision 0.19.1 requires torch==2.4.1, but you have torch 2.3.1 which is incompatible.
Successfully installed torch-2.3.1

[notice] A ne

In [2]:
import os
import random
from datasets import load_dataset
from utils.data import postprocess_and_save, pretty_print_dict
from instructlab.sdg.utils.docprocessor import DocProcessor



### Setup Instructions

This demo shows how to convert raw PDF files into InstructLab Synthetic Knowledge Infusion Data, using the RBC POC as an example. Follow these steps to get started with your own data.

#### Steps to Get Started:

1. **Organize Your Documents:**
   - Create a new directory under the `document_collection` directory for your specific project. For example, if your project is named "my_org," your directory structure should look like this:
     ```
     |-- document_collection
     |   `-- my_org
     |       |-- my_org_data.pdf
     |       `-- qna.yaml
     ```
   - Place all your PDF files and ICL files (like `qna.yaml`) into this directory.

2. **Format Your ICLs:**
   - Ensure your ICL files contain sufficient context and question-answer pairs. We recommend including at least 5 distinct contexts, each with a minimum of 3 sets of questions and answers. More entries will improve the robustness of your data.
   - The ICL file should be in the following format (refer to the `document_collection/my_org/qna.yaml` file for an example):

     ```yaml
     domain: <domain goes here>
     document_outline: A one to two line description of the document
     seed_examples:
       - context: <context 1 goes here>
         question_and_answers:
           - question: <question 1 goes here>
             answer: <answer 1 goes here>
           - question: <question 2 goes here>
             answer: <answer 2 goes here>
           - question: <question 3 goes here>
             answer: <answer 3 goes here>
       # ... additional seed examples follow the same structure
     ```


   - **Note:** Replace placeholders with actual content relevant to your documents. Ensure the contexts are clear and questions are well-formulated to extract meaningful answers.

3. **Update the Data Directory Path:**
   - In the script or code where the data directory is specified, update the `data_dir` variable to reflect the path to your new directory. For example:
     ```python
     data_dir = "document_collection/my_org"
     ```
4. **Update the Output Directory Path:**
   - In the script or code where the output directory is specified, update the `output_dir` variable to reflect the path where results should be written. For example:
     ```python
     output_dir = "output/my_org"
     ```
---
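
Before moving on, it can help to check an ICL file against the recommendations in step 2. The sketch below is not part of InstructLab: `check_icl`, `MIN_CONTEXTS`, and `MIN_QA_PER_CONTEXT` are hypothetical names, and it assumes the file has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`, which is not imported here).

```python
from typing import Any

MIN_CONTEXTS = 5         # recommended minimum number of distinct contexts
MIN_QA_PER_CONTEXT = 3   # recommended minimum question-answer pairs per context


def check_icl(data: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a parsed qna.yaml (empty if none)."""
    problems = []

    # Top-level fields shown in the format above
    for key in ("domain", "document_outline", "seed_examples"):
        if not data.get(key):
            problems.append(f"missing or empty field: {key}")

    examples = data.get("seed_examples") or []
    if len(examples) < MIN_CONTEXTS:
        problems.append(
            f"only {len(examples)} contexts, {MIN_CONTEXTS} recommended"
        )

    for i, example in enumerate(examples, start=1):
        if not example.get("context"):
            problems.append(f"seed example {i}: missing context")
        pairs = example.get("question_and_answers") or []
        if len(pairs) < MIN_QA_PER_CONTEXT:
            problems.append(
                f"seed example {i}: only {len(pairs)} question-answer pairs, "
                f"{MIN_QA_PER_CONTEXT} recommended"
            )
        for j, pair in enumerate(pairs, start=1):
            if not pair.get("question") or not pair.get("answer"):
                problems.append(
                    f"seed example {i}, pair {j}: missing question or answer"
                )
    return problems
```

An empty list means the file meets the recommended minimums; otherwise each entry names the field or seed example that needs attention.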

In [3]:
from dotenv import load_dotenv
import os

load_dotenv(override=True)

# Access the variables
data_dir = os.getenv('DATA_DIR')
output_dir = os.getenv('OUTPUT_DIR')
os.makedirs(output_dir, exist_ok=True)
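
The cell above reads `DATA_DIR` and `OUTPUT_DIR` from the environment (typically a `.env` file picked up by `load_dotenv`). If either variable is missing, `os.getenv` returns `None` and the failure only surfaces later with a less obvious error. A minimal sketch of a stricter lookup follows; `require_env` is a hypothetical helper, not part of python-dotenv or InstructLab.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; define it in .env or in the shell environment"
        )
    return value


# Possible usage in place of the plain os.getenv calls:
# data_dir = require_env("DATA_DIR")
# output_dir = require_env("OUTPUT_DIR")
```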

### PDF Documents to Seed Dataset

To convert PDF documents into a usable seed dataset, we use [Docling](https://github.com/DS4SD/docling), a tool designed for extracting and processing text from PDF files. Docling parses each PDF and saves the extracted text as a structured JSON file, which is then used to generate InstructLab Synthetic Knowledge Infusion Data.


#### Step 1: 

Run the following command to extract text from the PDF documents and save it in JSON format:

⚠️ **Note:** This process takes about 5 minutes to run for this example


In [4]:
!echo $data_dir
!python ./sdg/scripts/docparser.py --input-dir {data_dir} --output-dir {output_dir}

document_collection/Mastercard
Fetching 10 files: 100%|█████████████████████| 10/10 [00:00<00:00, 47180.02it/s]


#### Step 2: 

Now that the text has been extracted from the PDF documents, we process it in two steps:

- Split the extracted text into chunks
- Populate the user-provided ICLs with the chunks
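
To make the two steps concrete, here is a simplified sketch. It is not the `DocProcessor` implementation: `chunk_words` and `populate_icls` are hypothetical helpers, word-based splitting stands in for whatever chunking `DocProcessor` actually performs, and pairing every chunk with every seed example is an assumption. The column names match the dataset features shown in the output further down.

```python
from itertools import product


def chunk_words(text: str, max_words: int) -> list[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def populate_icls(chunks: list[str], seed_examples: list[dict]) -> list[dict]:
    """Pair every chunk with every seed example, one output row per pair."""
    rows = []
    for chunk, example in product(chunks, seed_examples):
        row = {"document": chunk, "icl_document": example["context"]}
        # Keep the first three question-answer pairs of each seed example
        for n, qa in enumerate(example["question_and_answers"][:3], start=1):
            row[f"icl_query_{n}"] = qa["question"]
            row[f"icl_response_{n}"] = qa["answer"]
        rows.append(row)
    return rows
```

Under this assumption the number of rows is the number of chunks multiplied by the number of seed examples, which would be consistent with 29 chunks and five contexts producing the 145 rows reported below.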

In [5]:
dp = DocProcessor(output_dir, user_config_path=f'{data_dir}/qna.yaml')

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


In [6]:
seed_data = dp.get_processed_dataset()
seed_data

Map:   0%|          | 0/29 [00:00<?, ? examples/s]huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Map: 100%|██████████| 29/29 [00:00<00:00, 3454.85 examples/s]
Map: 100%|██████████| 29/29 [00:00<00:00, 8735.00 examples/s]
Map: 100%|██████████| 29/29 [00:00<00:00, 11811.50 examples/s]
Map: 100%|██████████| 29/29 [00:00<00:00, 11222.99 examples/s]
Map: 100%|██████████| 29/29 [00:00<00:00, 10844.76 examples/s]
Map: 100%|██████████| 145/145 [00:00<00:00, 3229.68 examples/s]
Filter: 100%|██████████| 145/145 [00:00<00:00, 3530.74 examples/s]


Dataset({
    features: ['document_outline', 'document_title', 'domain', 'icl_document', 'icl_query_1', 'icl_response_1', 'icl_query_2', 'icl_response_2', 'icl_query_3', 'icl_response_3', 'document'],
    num_rows: 145
})

In [7]:
seed_data.to_json(f'{output_dir}/seed_data.jsonl', orient='records', lines=True)

Creating json from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 249.56ba/s]


710864

In [8]:
pretty_print_dict(f'{output_dir}/seed_data.jsonl')

Generating train split: 145 examples [00:00, 39682.51 examples/s]


### Convert JSONL to markdown files

In [9]:
import pandas as pd
import os
import json

# Create the output directory if it doesn't exist
md_output_dir = f"{output_dir}/md"
os.makedirs(md_output_dir, exist_ok=True)


In [10]:
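def save_document(index, document_text):
    """Write one document to md_output_dir as document_<index+1>.md."""
    file_name = f"document_{index+1}.md"
    file_path = os.path.join(md_output_dir, file_name)

    # Write as UTF-8 explicitly so the result does not depend on the platform default
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(document_text)

    print(f"Saved {file_path}")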


In [11]:
jsonl_file_path = f"{output_dir}/seed_data.jsonl"

In [12]:
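# The same document text can appear in several rows; save each distinct text once
with open(jsonl_file_path, 'r', encoding='utf-8') as f:
    seen_documents = set()
    i = 0
    for line in f:
        entry = json.loads(line)
        document_text = entry.get('document', '')
        # Compare the text itself rather than hash(text), so a hash collision
        # can never cause a distinct document to be skipped
        if document_text not in seen_documents:
            seen_documents.add(document_text)
            save_document(i, document_text)
            i += 1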

Saved output/Mastercard/md/document_1.md
Saved output/Mastercard/md/document_2.md
Saved output/Mastercard/md/document_3.md
Saved output/Mastercard/md/document_4.md
Saved output/Mastercard/md/document_5.md
Saved output/Mastercard/md/document_6.md
Saved output/Mastercard/md/document_7.md
Saved output/Mastercard/md/document_8.md
Saved output/Mastercard/md/document_9.md
Saved output/Mastercard/md/document_10.md
Saved output/Mastercard/md/document_11.md
Saved output/Mastercard/md/document_12.md
Saved output/Mastercard/md/document_13.md
Saved output/Mastercard/md/document_14.md
Saved output/Mastercard/md/document_15.md
Saved output/Mastercard/md/document_16.md
Saved output/Mastercard/md/document_17.md
Saved output/Mastercard/md/document_18.md
Saved output/Mastercard/md/document_19.md
Saved output/Mastercard/md/document_20.md
Saved output/Mastercard/md/document_21.md
Saved output/Mastercard/md/document_22.md
Saved output/Mastercard/md/document_23.md
Saved output/Mastercard/md/document_24.md
S