# Converting Fatura2 Dataset to Hugging Face Format

This notebook demonstrates how to convert the Fatura2 invoice dataset into a Hugging Face dataset format for easier use in document processing tasks. The Fatura2 dataset contains invoice images paired with JSON annotations.


Reference: Fatura Dataset on Zenodo

Paper: Limam, M., Dhiaf, M., & Kessentini, Y. (2023). FATURA: A Multi-Layout Invoice Image Dataset for Document Analysis and Understanding - [https://arxiv.org/abs/2311.11856](https://arxiv.org/abs/2311.11856)

Dataset: Limam, M., Dhiaf, M., & Kessentini, Y. (2023). FATURA Dataset. Zenodo. [https://doi.org/10.5281/zenodo.10371464](https://doi.org/10.5281/zenodo.10371464)

License: Creative Commons Attribution 4.0 International (CC BY 4.0)

## Features:
- Downloads and extracts the Fatura2 dataset
- Converts images and annotations into a structured format
- Creates train/dev/test splits using two different strategies
- Saves the processed dataset locally in parquet format
- Uploads the processed dataset to Hugging Face Hub

## 1. Install Dependencies

In [None]:
%pip install --quiet huggingface_hub[hf_transfer]==0.27.1 datasets==3.2.0

## 2. Download and Extract Dataset

In [None]:
import os

data_dir = "./data"

In [None]:
os.makedirs(data_dir, exist_ok=True)

In [None]:
# Download Fatura2 dataset if not present
![ -f {data_dir}/FATURA2.zip ] || curl -L "https://zenodo.org/records/10371464/files/FATURA2.zip?download=1" -o {data_dir}/FATURA2.zip

In [None]:
# Extract dataset
!unzip -n {data_dir}/FATURA2.zip -d {data_dir} || echo "Failed to unzip or files already exist"
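
Before moving on, it can help to confirm that the archive unpacked into the layout the later cells read from. The following is a minimal sketch, assuming the directory and file names used further down in this notebook; `EXPECTED_PATHS` and `missing_paths` are hypothetical helpers, not part of the dataset or any library.

```python
from pathlib import Path
from typing import List

# Relative paths that later cells in this notebook read from (assumed layout).
EXPECTED_PATHS = [
    "invoices_dataset_final/colored_images",
    "invoices_dataset_final/Annotations/Original_Format",
    "invoices_dataset_final/strat1_train.csv",
    "invoices_dataset_final/strat1_dev.csv",
    "invoices_dataset_final/strat1_test.csv",
]


def missing_paths(data_dir: str) -> List[str]:
    """Return the expected relative paths that do not exist under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


# An empty list means every expected path was found.
print(missing_paths("./data") or "all expected paths found")
```

If anything is reported missing, re-running the download and extraction cells is usually the first thing to try.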

## 3. Define Data Loading Utilities

Create helper functions to load and process the dataset files:

In [None]:
import pandas as pd
import json
from pathlib import Path
from typing import Any, Collection, Dict, Optional
import mimetypes

def load_files_from_csv(
    csv_path: str,
    base_dir_images: Optional[str] = None,
    base_dir_json: Optional[str] = None,
    template_inds: Optional[Collection[int]] = None,
) -> pd.DataFrame:
    """Load and process files from CSV containing image and JSON paths.
    
    Args:
        csv_path: Path to the CSV file containing file references
        base_dir_images: Base directory for image files
        base_dir_json: Base directory for JSON annotation files
        template_inds: Optional collection of template indices to keep (all rows are kept if omitted)
    
    Returns:
        DataFrame containing processed files with images and annotations
    """
    try:
        # Read CSV file
        df = pd.read_csv(csv_path)
        
        # Filter templates if specified
        if template_inds:
            df = df[
                df["img_path"].apply(
                    lambda x: int(x.split("_")[0].split("Template")[1]) in template_inds
                )
            ]

        # Set base paths
        base_path_images = Path(base_dir_images) if base_dir_images else Path(csv_path).parent
        base_path_json = Path(base_dir_json) if base_dir_json else Path(csv_path).parent

        # Create full paths
        df["full_img_path"] = df["img_path"].apply(lambda x: str(base_path_images / x))
        df["full_annot_path"] = df["annot_path"].apply(lambda x: str(base_path_json / x))

        def process_row(row: pd.Series) -> Dict[str, Any]:
            try:
                # Read image bytes
                with open(row["full_img_path"], "rb") as img_file:
                    img_bytes = img_file.read()

                # Read JSON annotation
                with open(row["full_annot_path"], "r") as json_file:
                    json_dict = json.load(json_file)

                json_dict.pop("OTHER", None)
                json_string = json.dumps(json_dict)
                document_file = Path(row["full_img_path"])
                mime_type = mimetypes.guess_type(document_file)[0]

                return {
                    "filename": document_file.name,
                    "filetype": mime_type,
                    "target_data": json_string,
                    "doc_bytes": img_bytes,
                }
            except Exception as e:
                print(f"Error processing files for row {row.name}: {e}")
                return None

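        # Process all rows, dropping any that failed to load
        results = [process_row(row) for _, row in df.iterrows()]
        return pd.DataFrame([r for r in results if r is not None])

    except Exception as e:
        print(f"Error reading CSV: {e}")
        # Return an empty DataFrame so callers always receive the same type
        return pd.DataFrame()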
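
To make the shape of a single record concrete, here is a standard-library sketch of the per-row logic: read the image bytes, load the annotation, drop the `"OTHER"` key, and guess the MIME type from the file name. The function name `build_record`, the file names, and the annotation key `"TOTAL"` are hypothetical placeholders chosen for illustration, and the bytes written are not a real image.

```python
import json
import mimetypes
import tempfile
from pathlib import Path
from typing import Any, Dict


def build_record(img_path: str, annot_path: str) -> Dict[str, Any]:
    """Build one dataset record from an image file and its JSON annotation."""
    image_file = Path(img_path)
    annotation = json.loads(Path(annot_path).read_text(encoding="utf-8"))
    annotation.pop("OTHER", None)  # same key the loader above removes
    return {
        "filename": image_file.name,
        "filetype": mimetypes.guess_type(image_file.name)[0],
        "target_data": json.dumps(annotation),
        "doc_bytes": image_file.read_bytes(),
    }


# Demo on throwaway files; names and contents are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    img = Path(tmp) / "Template1_Instance0.jpg"
    img.write_bytes(b"\xff\xd8\xff")  # placeholder bytes, not a valid image
    annot = Path(tmp) / "Template1_Instance0.json"
    annot.write_text(json.dumps({"TOTAL": {"text": "10.00"}, "OTHER": {}}))
    record = build_record(str(img), str(annot))

print(record["filename"], record["filetype"], record["target_data"])
```

Storing the annotation as a JSON string keeps the column type uniform across rows even when invoices carry different sets of fields.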

## 4. Create Dataset Splits

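Build train/dev/test splits using two strategies. Strategy 1 uses the `strat1_*.csv` split files shipped with the dataset as they are. Strategy 2 reads the same files but keeps only the rows whose template index belongs to the split's template set, so templates seen in training do not appear in dev or test: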

In [None]:
from datasets import Dataset, DatasetDict

# Define splits
splits = {
    "dev": "strat1_dev.csv",
    "test": "strat1_test.csv",
    "train": "strat1_train.csv",
}

dataset_dir = f"{data_dir}/invoices_dataset_final"
images_dir = f"{dataset_dir}/colored_images"
annotations_dir = f"{dataset_dir}/Annotations/Original_Format"

# Create Strategy 1 dataset (random split)
dataset_strat1 = DatasetDict()
for split_name, split_file in splits.items():
    df = load_files_from_csv(
        f"{dataset_dir}/{split_file}",
        images_dir, 
        annotations_dir, 
        None
    )
    dataset = Dataset.from_dict(df.to_dict(orient="list"))
    dataset_strat1[split_name] = dataset

# Template indices for Strategy 2
train_inds = set([
    3, 11, 30, 24, 40, 48, 41, 22, 27, 19, 45, 1, 29, 44, 9, 47, 36, 23, 18, 42,
    15, 14, 28, 43, 33, 6, 38, 26, 13, 34, 17, 37, 5, 8, 21, 35, 16, 20, 31, 46
])
dev_inds = set([50, 7, 32, 39, 2, 12, 4, 49, 10, 25])

# Create Strategy 2 dataset (template-based split)
dataset_strat2 = DatasetDict()
for split_name, split_file in splits.items():
    df = load_files_from_csv(
        f"{dataset_dir}/{split_file}",
        images_dir,
        annotations_dir,
        train_inds if "train" in split_name else dev_inds,
    )
    dataset = Dataset.from_dict(df.to_dict(orient="list"))
    dataset_strat2[split_name] = dataset

In [None]:
# Inspect the Strategy 1 dataset
dataset_strat1
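
The template-based split only prevents leakage if the two index sets do not overlap. The sketch below repeats the sets from the cell above and checks that they are disjoint and together cover indices 1 through 50. It also shows the index extraction on a file name shaped like `TemplateN_...`; that naming scheme is an assumption inferred from the filter in `load_files_from_csv`, and `template_index` is a hypothetical helper.

```python
import re

TRAIN_INDS = {
    3, 11, 30, 24, 40, 48, 41, 22, 27, 19, 45, 1, 29, 44, 9, 47, 36, 23, 18, 42,
    15, 14, 28, 43, 33, 6, 38, 26, 13, 34, 17, 37, 5, 8, 21, 35, 16, 20, 31, 46,
}
DEV_INDS = {50, 7, 32, 39, 2, 12, 4, 49, 10, 25}


def template_index(img_path: str) -> int:
    """Extract N from a file name shaped like 'TemplateN_...' (assumed scheme)."""
    match = re.search(r"Template(\d+)_", img_path)
    if match is None:
        raise ValueError(f"No template index found in {img_path!r}")
    return int(match.group(1))


# No template may appear in both sets, and none of 1..50 may be left out.
assert TRAIN_INDS.isdisjoint(DEV_INDS)
assert TRAIN_INDS | DEV_INDS == set(range(1, 51))

print(template_index("Template12_Instance3.jpg") in TRAIN_INDS)
```

Note that dev and test share the same template set in Strategy 2; they differ only in which CSV file the rows come from.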

## 5. Inspect Sample Data

View an example document and its JSON annotation:

In [None]:
import json
from IPython.display import JSON, Image

sample_row = dataset_strat1["dev"][3]
target_data = sample_row["target_data"]
image = Image(data=sample_row["doc_bytes"], width=400)

print("\nDocument Sample:")
image

In [None]:
print("JSON Annotation:")
JSON(json.loads(target_data))
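
Outside a notebook, the same row can be turned back into a file on disk plus a parsed annotation. This is a minimal sketch that assumes only the four column names produced above; `export_row` is a hypothetical helper and the sample row is made up for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def export_row(row: Dict[str, Any], out_dir: str) -> Dict[str, Any]:
    """Write the row's document bytes to out_dir and return the parsed annotation."""
    target = Path(out_dir) / row["filename"]
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(row["doc_bytes"])
    return json.loads(row["target_data"])


# Fake row with the same keys as a dataset record (values are placeholders).
fake_row = {
    "filename": "Template1_Instance0.jpg",
    "filetype": "image/jpeg",
    "target_data": json.dumps({"TOTAL": {"text": "10.00"}}),
    "doc_bytes": b"\xff\xd8\xff",
}

with tempfile.TemporaryDirectory() as tmp:
    annotation = export_row(fake_row, tmp)
    written = (Path(tmp) / fake_row["filename"]).read_bytes()

print(sorted(annotation))
```

With a real dataset, `fake_row` would be replaced by an indexed row such as `dataset_strat1["dev"][3]`.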

## 6. Save and Upload to Hugging Face Hub

Save the processed datasets locally and upload them to the Hugging Face Hub.
You can create an access token at [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens); uploading requires a token with write access.

In [None]:
# Enable faster transfers
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

In [None]:
# Login to Hugging Face
from huggingface_hub import notebook_login, HfApi
from huggingface_hub.errors import LocalTokenNotFoundError
try:
    print(f"HF Name: {HfApi().whoami()['name']}")
except LocalTokenNotFoundError:
    notebook_login()

In [None]:
# Save and upload datasets
dataset_name1 = "Fatura2-invoices-original-strat1"
dataset_name2 = "Fatura2-invoices-original-strat2"
# Define your Hugging Face username
hf_username = "arlind0xbb"  # replace with your own username

# Save locally
dataset_strat1.save_to_disk(f"{data_dir}/{dataset_name1}.hf")
dataset_strat2.save_to_disk(f"{data_dir}/{dataset_name2}.hf")

# Upload to Hugging Face Hub
dataset_strat1.push_to_hub(f"{hf_username}/{dataset_name1}", private=True)
dataset_strat2.push_to_hub(f"{hf_username}/{dataset_name2}", private=True)
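
Once saved, the datasets can be read back either from disk or from the Hub. The sketch below wraps the two `datasets` entry points, `load_from_disk` and `load_dataset`; the wrapper names are hypothetical, and the imports are done inside the functions so the block can be pasted without `datasets` being installed until a loader is actually called. Loading a private repository assumes you are logged in with a token that can read it.

```python
def local_dataset_path(data_dir: str, dataset_name: str) -> str:
    """Path used by save_to_disk in the cell above."""
    return f"{data_dir}/{dataset_name}.hf"


def load_local(data_dir: str, dataset_name: str):
    """Reload a DatasetDict previously written with save_to_disk."""
    from datasets import load_from_disk  # requires the `datasets` package

    return load_from_disk(local_dataset_path(data_dir, dataset_name))


def load_from_hub(hf_username: str, dataset_name: str):
    """Download the dataset pushed with push_to_hub."""
    from datasets import load_dataset  # requires the `datasets` package

    return load_dataset(f"{hf_username}/{dataset_name}")


print(local_dataset_path("./data", "Fatura2-invoices-original-strat1"))
```

For example, `load_local(data_dir, dataset_name1)["train"]` would give the Strategy 1 training split without touching the network.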