# Create Custom Dataset in Swift Format for Training

This notebook demonstrates the complete pipeline for processing the Fatura2 invoice dataset into the ModelScope Swift custom format, suitable for training document understanding models. The process includes dataset preparation, image processing, annotation transformation, and cloud storage integration.

### Key Features:
- Conversion of complex invoice documents to standardized training format
- Support for both image and PDF input files
- Bounding box normalization and formatting
- Integration with Hugging Face Datasets
- Full compatibility with Swift training framework

### Processing Pipeline:
1. Environment setup and dependency installation
2. Dataset loading and inspection
3. Image extraction and preprocessing
4. Annotation transformation to Swift format
5. Dataset upload to S3

## Understanding the Swift Custom Format

The Swift framework requires data in a [specific conversation format with multimodal support](https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html#multimodal). Each training example should contain:


```json
{
  "messages": [
    {"role": "system", "content": "Task definition"},
    {"role": "user", "content": "<image><image>... + optional text prompt"},
    {"role": "assistant", "content": "JSON or text output with extracted data with <bbox> references."}
  ],
  "images": ["path/to/image1.png", "path/to/image2.png"],
  "objects": {"ref": [], "bbox": [[90.9, 160.8, 135, 212.8], [360.9, 480.8, 495, 532.8]]}
}
```

### Key Requirements:
1. **Multi-image Support**: Include one `<image>` tag in the user message for each image
2. **Bounding Box Format**: Coordinates as `[x1,y1,x2,y2]`
3. **Image Paths**: Relative paths stored in the `images` array
4. **Object References**: Bounding boxes referenced by `<bbox>` tags are stored in the `bbox` array under `objects`
5. **Structured Output**: Nested JSON structure mirroring document hierarchy
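
As a rough illustration of these requirements, the sketch below checks a single record for internal consistency. The helper name `check_swift_record` and the sample field values are hypothetical, and the rule that each `<bbox>` placeholder maps to exactly one entry in `objects.bbox` is an assumption about how the references are resolved, not something this notebook verifies.

```python
import json


def check_swift_record(record):
    """Return a list of consistency problems; an empty list means none were found."""
    problems = []
    contents = {m["role"]: m["content"] for m in record["messages"]}

    # One <image> tag is expected per entry in the images array.
    n_tags = contents.get("user", "").count("<image>")
    n_images = len(record.get("images", []))
    if n_tags != n_images:
        problems.append(f"{n_tags} <image> tags but {n_images} image paths")

    # Assumption: each <bbox> placeholder maps to one entry in objects.bbox.
    boxes = record.get("objects", {}).get("bbox", [])
    n_refs = contents.get("assistant", "").count("<bbox>")
    if n_refs != len(boxes):
        problems.append(f"{n_refs} <bbox> tags but {len(boxes)} boxes")

    # Every box should be a flat [x1, y1, x2, y2] list.
    for box in boxes:
        if len(box) != 4:
            problems.append(f"box {box} does not have 4 coordinates")
    return problems


# Hypothetical sample record with two pages and one referenced box.
record = {
    "messages": [
        {"role": "system", "content": "Task definition"},
        {"role": "user", "content": "<image><image> Extract the invoice total."},
        {
            "role": "assistant",
            "content": json.dumps({"total": {"text": "42.00", "bbox": "<bbox>"}}),
        },
    ],
    "images": ["images/doc_page000.png", "images/doc_page001.png"],
    "objects": {"ref": [], "bbox": [[90.9, 160.8, 135, 212.8]]},
}

print(check_swift_record(record))  # prints []
```

A check like this could be run over the generated JSON files before upload to catch mismatched tag counts early.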

## 1. Environment Configuration

In [None]:
%pip install pypdfium2==4.30.1 pandas==2.2.3 huggingface_hub[hf_transfer]==0.27.1 datasets==3.2.0 ipywidgets==8.1.5 tqdm==4.67.1 genson --quiet

## 2. Initialize Project Environment

In [None]:
import os
# Configure environment for optimal performance
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # Enable fast transfers
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # Disable tokenizer warnings

In [None]:
from huggingface_hub import notebook_login
import sagemaker

# Authenticate with Hugging Face Hub for private datasets
# notebook_login()  # Uncomment for private datasets

# Initialize AWS resources
session = sagemaker.Session()
default_bucket_name = session.default_bucket()
dataset_s3_prefix = "fatura2-train-data"
dataset_s3_uri = f"s3://{default_bucket_name}/{dataset_s3_prefix}/"

# Create local directory structure
data_main_dir = "./data/"
hf_dataset_name = "arlind0xbb/Fatura2-invoices-original-strat2"
dataset_dir = os.path.join(data_main_dir, hf_dataset_name.split("/")[-1])
os.makedirs(dataset_dir, exist_ok=True)

In [None]:
dataset_s3_uri

## 3. Dataset Loading & Inspection

In [None]:
from datasets import load_dataset

# Load dataset from Hugging Face Hub
dataset = load_dataset(hf_dataset_name)
dataset

In [None]:
# [Optional] Reduce the train split to 300 samples and the dev split to 75 for faster training; this is usually enough for good results.
# Comment out this cell to train on the full dataset (requires a larger GPU).
# Note: do not change the test split size, so that evaluation results remain comparable.
dataset["train"] = dataset["train"].train_test_split(300)["test"]
dataset["dev"] = dataset["dev"].train_test_split(75)["test"]

In [None]:
from IPython.display import JSON
import json

JSON(json.loads(dataset["dev"].to_pandas()["target_data"].iloc[1]))

## 4. Document Processing Pipeline

### Key Processing Stages:
1. **PDF Rendering**: Convert PDF pages to PNG images at 153% scale for OCR optimization
2. **Image Normalization**: Standardize image formats and orientations
3. **Bounding Box Transformation**: Convert coordinate pairs to flat `[x1,y1,x2,y2]` lists referenced by `<bbox>` tags
4. **Hierarchy Flattening**: Simplify nested document structures while preserving relationships
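
To make the rendering scale concrete, the sketch below converts the scale factor into an approximate resolution, assuming the usual PDF convention that one canvas unit is 1/72 inch. The helper names and the A4 page size in points are illustrative assumptions; the renderer's exact rounding may differ by a pixel.

```python
PDF_UNITS_PER_INCH = 72  # assumption: one PDF canvas unit is 1/72 inch


def scale_to_dpi(scale):
    """Approximate DPI produced by rendering at the given scale factor."""
    return scale * PDF_UNITS_PER_INCH


def approx_pixel_size(width_pt, height_pt, scale):
    """Approximate rendered size in pixels for a page given in PDF points."""
    return round(width_pt * scale), round(height_pt * scale)


dpi = scale_to_dpi(1.53)
a4_pixels = approx_pixel_size(595, 842, 1.53)  # A4 is roughly 595 x 842 points

print(f"scale 1.53 is about {dpi:.0f} DPI")
print(f"an A4 page renders to roughly {a4_pixels[0]} x {a4_pixels[1]} pixels")
```

Under these assumptions a scale of 1.53 corresponds to roughly 110 DPI, which keeps page images moderately sized.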

In [None]:
import io
from PIL import Image
import pypdfium2 as pdfium
import json
from tqdm import tqdm
from pathlib import Path
import warnings
import pandas as pd
def process_row(row, max_pages, base_dir):
    """Process document row into Swift-compatible format"""
    filename = Path(row["filename"]).stem
    filetype = row["filetype"]
    doc_bytes = row["doc_bytes"]

    # Configure image output directory
    images_dir = os.path.join(base_dir, "images")
    os.makedirs(images_dir, exist_ok=True)
    output_images = []

    try:
        if filetype.startswith("image"):
            # Process single image files
            image_path = os.path.join(images_dir, f"{filename}.png")
            if not os.path.exists(image_path):
                Image.open(io.BytesIO(doc_bytes)).save(image_path)
            output_images.append(os.path.relpath(image_path, base_dir))
        
        elif filetype == "application/pdf":
            # Process PDF documents with multi-page support
            pdf = pdfium.PdfDocument(doc_bytes)
            for page_number in range(min(len(pdf), max_pages)):
                page_path = os.path.join(images_dir, f"{filename}_page{page_number:03}.png")
                if not os.path.exists(page_path):
                    pdf[page_number].render(scale=1.53).to_pil().save(page_path)
                output_images.append(os.path.relpath(page_path, base_dir))
    
    except Exception as e:
        print(f"Error processing {filename}: {str(e)}")
    
    return output_images

def transform_annotations(data, defaults=None, remove_bbox=True):
    """Convert bounding boxes to Swift XML format and simplify structure
    
    Args:
        data: Dictionary or list to transform
        defaults: List of keys that should exist in first-level dictionaries
        remove_bbox: Whether to remove bbox entries from the structure
    """
    bbox_list = []
    bbox_pointer = "<bbox>"
    if isinstance(data, dict):      
        # Add missing default keys (only at the first recursion level)
        if defaults is not None:
            # Use dictionary comprehension for efficiency
            missing_keys = {k: None for k in defaults if k not in data}
            data.update(missing_keys)
        
        for key, value in list(data.items()):            
            if key == "bbox":
                if remove_bbox:
                    del data[key]  # Skip bbox entries when removing
                else:
                    # Convert coordinates from {'bbox': [[20.0, 372.8898], [570.0, 282.8898]]} to {'bbox': [20.0, 372.8898, 570.0, 282.8898]}                    
                    bbox_value = value[0] + value[1]
                    bbox_list.append(bbox_value)
                    data[key] = bbox_pointer
            else:
                value_transformed, bb_list = transform_annotations(value, remove_bbox=remove_bbox)
                bbox_list.extend(bb_list)
                data[key] = value_transformed
        # Flatten the object if only one key is left
        if len(data) == 1 and "bbox" not in data:
            data = list(data.values())[0]
        return data, bbox_list
    elif isinstance(data, list):
        items = []
        for item in data:
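            if item is None:
                # The second positional argument of warnings.warn is the warning category, so format the data into the message
                warnings.warn(f"Ignoring None value in item list: {data}")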
            else:
                item_transformed, bb_list = transform_annotations(item, remove_bbox=remove_bbox)
                items.append(item_transformed)
                bbox_list.extend(bb_list)
        return items, bbox_list
    return data, bbox_list

In [None]:
# Let's test the transformation on a single example entry
row = dataset["dev"].to_pandas().iloc[3]
target_format, bbox_list = transform_annotations(json.loads(row["target_data"]))
JSON(target_format, expanded=True)

In [None]:
# Let's view bbox_list; it contains items only when remove_bbox=False
JSON(bbox_list)

## 5. Swift Format Conversion

### Conversion Logic:
1. **Conversation Structure**:
   - System: Define document processing task
   - User: Provide document images
   - Assistant: Return structured JSON
   
2. **Image Handling**:
   - Store relative paths in `images` array
   - Reference each image in the prompt with one `<image>` tag

3. **Bounding Box Handling**:
   - If activated, store bounding boxes in the `"objects": {"bbox": []}` array
   - Replace the original bbox value with `<bbox>` reference
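
The bounding box handling described above can be sketched on a small hand-written annotation. This is a simplified illustration rather than the notebook's `transform_annotations` function: the helper `extract_boxes` and the field names `TOTAL` and `DATE` are hypothetical, and the input layout `[[x1, y1], [x2, y2]]` is assumed from the comment in the transformation code.

```python
import json


def extract_boxes(node, boxes):
    """Replace every 'bbox' value with '<bbox>' and collect the flattened boxes."""
    if isinstance(node, dict):
        out = {}
        for key, value in node.items():
            if key == "bbox":
                # [[x1, y1], [x2, y2]] -> [x1, y1, x2, y2]
                (x1, y1), (x2, y2) = value
                boxes.append([x1, y1, x2, y2])
                out[key] = "<bbox>"
            else:
                out[key] = extract_boxes(value, boxes)
        return out
    if isinstance(node, list):
        return [extract_boxes(item, boxes) for item in node]
    return node


# Hypothetical annotation with two fields, each carrying a bounding box.
annotation = {
    "TOTAL": {"text": "100.00", "bbox": [[20.0, 372.8898], [570.0, 282.8898]]},
    "DATE": {"text": "2024-01-31", "bbox": [[30.0, 700.0], [200.0, 680.0]]},
}

boxes = []
cleaned = extract_boxes(annotation, boxes)

example = {
    "assistant_content": json.dumps(cleaned),
    "objects": {"ref": [], "bbox": boxes},
}
print(json.dumps(example, indent=2))
```

Because the boxes are appended during the same depth-first traversal that writes the placeholders, the order of entries in `objects.bbox` matches the order in which `<bbox>` appears in the serialized output.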

In [None]:
def collect_all_keys(dataset):
    """Efficiently collect all keys from all dataset splits"""
    print("Collecting all unique keys from all datasets...")
    all_keys = set()
    for split_name, dataset_split in dataset.items():
        df = dataset_split.to_pandas()
        # Collect the top-level keys of each target_data JSON object
        for record in df["target_data"]:
            all_keys.update(json.loads(record))
    print(f"Found {len(all_keys)} unique keys across all datasets")
    return list(all_keys)

In [None]:
keys = sorted(collect_all_keys(dataset))

In [None]:
from collections import OrderedDict

def order_json(data):
    # Create an OrderedDict with sorted keys
    ordered_data = OrderedDict(sorted(data.items()))
    return ordered_data


In [None]:
def create_swift_example(row):
    """Construct Swift-compatible training example"""
    conversation = {
        "messages": [
            {
                "role": "system", 
                "content": "You are a document processing expert and assistant."
            },
            {
                "role": "user",
                "content": f"{'Document pages: <image>'*len(row['images'])} Process all document pages and extract the following information in JSON format: {', '.join(keys)}"
            },
            {
                "role": "assistant",
                "content": json.dumps(row["target_data_clean"])
            }
        ],
        "images": row["images"]
        
    }
    
    bbox_list = row["bbox_list"]
    if bbox_list and len(bbox_list):
        # Assign the objects entry (the original used ":" which only annotates and stores nothing)
        conversation["objects"] = {"ref": [], "bbox": bbox_list}
    
    return conversation 

def convert_dataset(dataset_split, split_name):
    """Full conversion pipeline for dataset split"""
    df = dataset_split.to_pandas()
    
    # Process documents and images
    df["images"] = [process_row(row, 2, dataset_dir) for _, row in tqdm(df.iterrows(), total=len(df))]
    
    df[["target_data_clean","bbox_list"]] = df["target_data"].apply(lambda x: pd.Series(transform_annotations(json.loads(x), defaults=keys,remove_bbox=True)))

    # Ensure that all attributes are in the same order
    df["target_data_clean"] = df["target_data_clean"].apply(order_json)
    
    # Generate Swift format examples
    converted_data = df.apply(create_swift_example, axis=1).tolist()
    
    # Save converted dataset
    output_path = os.path.join(dataset_dir, f"conversations_{split_name}_swift_format.json")
    with open(output_path, "w") as f:
        json.dump(converted_data, f, indent=2)
    
    return output_path, df

# Process all dataset splits
swift_files = []
swift_df_all = {}


for split_name, dataset_split in dataset.items():
    print(f"Processing dataset: {split_name}")
    output_file, df = convert_dataset(dataset_split, split_name)
    swift_files.append(output_file)
    swift_df_all[split_name]=df
    print(f"Finished writing: {output_file}")

### Generate and Store JSON Schema format for usage during inference

In [None]:
# Vertical concatenation
merged = pd.concat([swift_df_all["test"]["target_data_clean"], swift_df_all["train"]["target_data_clean"], swift_df_all["dev"]["target_data_clean"]])

In [None]:
from genson import SchemaBuilder

# json_targets = swift_df_all["train"]["target_data_clean"]
json_targets = merged
builder = SchemaBuilder()

for item in json_targets:
    builder.add_object(item)

schema = builder.to_schema()

outfile_json_schema = f"{dataset_dir}/groundtruth_schema.json"
#write schema to file
with open(outfile_json_schema, "w") as f:
    json.dump(schema, f, indent=2)
JSON(schema)

In [None]:
from IPython.display import JSON

# Let's read and show a sample entry of the first dataset
print(f"dataset file: {swift_files[0]}")
with open(swift_files[0]) as f:
    data = json.load(f)
JSON(data[4], expanded=False, root = "sample")

In [None]:
# final target json format, to be generated by the LLM
JSON(json.loads(data[4]["messages"][2]["content"]))

## 6. Upload dataset to S3

In [None]:
!aws s3 sync $dataset_dir $dataset_s3_uri --exclude ".ipynb_checkpoints/*" --quiet
print(f"\n✅ Dataset successfully uploaded to {dataset_s3_uri}")

## Next Steps

The processed dataset is now ready for training document understanding models with Swift. The next notebook runs the training and covers the following points:

1. **Model Selection**: Choose appropriate multimodal vision models (e.g., Qwen-VL, Yi-VL)
2. **Training Configuration**: Set hyperparameters in Swift training scripts
3. **Validation**: Use the processed dev set for training monitoring
4. **Evaluation**: Utilize the test set for final model benchmarking