# Using the pulmo-cristal Package
This guide shows how to use the pulmo-cristal package to extract and process data from donor PDF documents.


### Installation

In [1]:
# Remove any existing installation of pulmo-cristal and camelot-py first
!pip uninstall pulmo-cristal -y
!pip uninstall camelot-py -y

Found existing installation: pulmo-cristal 0.1.0
Uninstalling pulmo-cristal-0.1.0:
  Successfully uninstalled pulmo-cristal-0.1.0
Found existing installation: camelot-py 1.0.0
Uninstalling camelot-py-1.0.0:
  Successfully uninstalled camelot-py-1.0.0


In [2]:
# Install a fresh copy of the package (camelot-py is pulled in as a dependency)
!pip install pulmo-cristal

Collecting pulmo-cristal
  Downloading pulmo_cristal-0.1.0-py3-none-any.whl.metadata (608 bytes)
Collecting camelot-py>=1.0.0 (from pulmo-cristal)
  Using cached camelot_py-1.0.0-py3-none-any.whl.metadata (9.4 kB)
Downloading pulmo_cristal-0.1.0-py3-none-any.whl (47 kB)
Using cached camelot_py-1.0.0-py3-none-any.whl (66 kB)
Installing collected packages: camelot-py, pulmo-cristal
Successfully installed camelot-py-1.0.0 pulmo-cristal-0.1.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.1.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Getting Started
Let's check that the package is installed correctly:

In [3]:
# Check the package version
!pulmo-cristal --version

pulmo-cristal 0.1.0


In [4]:
# View available commands
!pulmo-cristal --help

usage: pulmo-cristal [-h] [--version] [--verbose]
                     {extract,list,convert,validate} ...

Extract data from donor PDF documents.

positional arguments:
  {extract,list,convert,validate}
                        Command to execute
    extract             Extract data from PDF files
    list                List PDF files in a directory
    convert             Convert between output formats
    validate            Validate extracted data

options:
  -h, --help            show this help message and exit
  --version, -V         show program's version number and exit
  --verbose, -v         Increase verbosity (can be used multiple times)
                        (default: 0)


### Finding PDF Files
First, let's see which PDF files are available:

In [5]:
from pulmo_cristal.utils import find_pdf_files, list_directory_tree
import os

# Set the path to your directory containing existing PDFs
pdf_dir = "./sample_pdfs"  # Change this to your actual path

# Print directory tree to see the structure
tree = list_directory_tree(pdf_dir, max_depth=2, file_types=[".pdf"])
print(tree)

sample_pdfs/
├── subfolder_test/
│   ├── patient3.pdf
│   └── patient_test_2.pdf
├── 235350.LEGENDRE SURLIGNE BIS .pdf
├── patient1.pdf
├── patient2.pdf
├── patient_test_1.pdf
└── patient_test_2.pdf

Total: 1 directories, 7 files


In [6]:
# Find all PDF files in the directory
pdf_files = find_pdf_files(pdf_dir, recursive=True)
print(f"Found {len(pdf_files)} PDF files")

# Display the first few files
if pdf_files:
    print("\nSample of PDF files found:")
    for pdf in pdf_files[:5]:
        print(f"  - {pdf}")

Found 7 PDF files

Sample of PDF files found:
  - sample_pdfs/patient2.pdf
  - sample_pdfs/patient1.pdf
  - sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf
  - sample_pdfs/patient_test_1.pdf
  - sample_pdfs/patient_test_2.pdf
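
To show what a recursive PDF search involves, here is a minimal standard-library sketch. It is not the package's implementation: `find_pdfs` is a hypothetical helper, and the case-insensitive suffix match and sorted ordering are assumptions made for illustration (the listing above is not sorted).

```python
from pathlib import Path


def find_pdfs(directory, recursive=True):
    """Return PDF paths under `directory` (hypothetical helper, not part of pulmo-cristal)."""
    root = Path(directory)
    # rglob descends into subfolders; glob stays at the top level
    candidates = root.rglob("*") if recursive else root.glob("*")
    # Compare suffixes case-insensitively so "scan.PDF" is also picked up
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")
```

Calling `find_pdfs("./sample_pdfs", recursive=False)` would skip anything inside `subfolder_test/`.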


### Extracting Data from a PDF
Now let's extract donor and HLA data from the first PDF found:

In [7]:
from pulmo_cristal.extractors import DonorPDFExtractor, HLAExtractor

# Select a sample PDF to process
sample_pdf = pdf_files[0] if pdf_files else None

if sample_pdf:
    print(f"Processing: {sample_pdf}")

    # Initialize extractors
    donor_extractor = DonorPDFExtractor()
    hla_extractor = HLAExtractor()

    try:
        # Extract donor data
        donor_data = donor_extractor.extract_donor_data(sample_pdf)

        # Extract HLA data
        hla_data, hla_status = hla_extractor.extract_hla_data(sample_pdf)

        # Add HLA data to donor data
        donor_data["informations_donneur"]["hla"] = hla_data
        donor_data["informations_donneur"]["hla_extraction_status"] = hla_status

        # Display extracted data
        print("\nExtracted Donor Information:")
        for key, value in donor_data["informations_donneur"].items():
            if key != "hla":  # Skip HLA for now
                print(f"  {key}: {value}")

        print("\nExtracted HLA Data:")
        for key, value in hla_data.items():
            print(f"  {key}: {value}")

    except Exception as e:
        print(f"Error extracting data: {e}")
else:
    print("No PDF files found to process")

2025-06-18 23:13:42,140 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/patient2.pdf
2025-06-18 23:13:42,143 - DonorPDFExtractor - INFO - PDF contains 22 pages
2025-06-18 23:13:42,159 - DonorPDFExtractor - INFO - Page 1/22: Extracted 1224 characters
2025-06-18 23:13:42,160 - DonorPDFExtractor - INFO - Page 2/22: Extracted 85 characters
2025-06-18 23:13:42,168 - DonorPDFExtractor - INFO - Page 3/22: Extracted 1546 characters
2025-06-18 23:13:42,200 - DonorPDFExtractor - INFO - Page 4/22: Extracted 720 characters
2025-06-18 23:13:42,204 - DonorPDFExtractor - INFO - Page 5/22: Extracted 398 characters
2025-06-18 23:13:42,208 - DonorPDFExtractor - INFO - Page 6/22: Extracted 429 characters
2025-06-18 23:13:42,212 - DonorPDFExtractor - INFO - Page 7/22: Extracted 490 characters
2025-06-18 23:13:42,216 - DonorPDFExtractor - INFO - Page 8/22: Extracted 367 characters
2025-06-18 23:13:42,226 - DonorPDF

Processing: sample_pdfs/patient2.pdf

Extracted Donor Information:
  num_cristal: 175862
  type_donneur: Donneur prélevé, organe greffé
Coeur Battant (SME)
  age: 62
  sexe: F
  groupe_sanguin: A  +
  date_naissance: 13/03/1957
  taille: 157
  poids: 46.5
  date_clampage: 16/01/2020
  heure_clampage: 13:01
  etiologie: Traumatique A.V.P.
  duree_ventilation: 5
  commentaire: s
Page 22 sur  22 Edité le 29 novembre 2022 à 18h37
  hla_extraction_status: OK

Extracted HLA Data:
  A1: 3
  A2: 24
  B1: 8
  B2: 35
  C1: 7
  C2: 0
  DR1: 8
  DR2: 17
  DQA: 0
  DQB: 7
  DP1: 4
  DP2: 6
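
In the output above, `commentaire` has picked up a page footer and some values contain line breaks or doubled spaces. A post-processing step along these lines could tidy such fields. This is a sketch only: `clean_value` and `clean_record` are hypothetical helpers, and the footer pattern is inferred from the single example shown here, so other document layouts may need a different expression.

```python
import re

# Matches footers shaped like "Page 22 sur  22 Edité le 29 novembre 2022 à 18h37"
# (pattern inferred from one example; treat it as an assumption)
FOOTER_RE = re.compile(r"Page\s+\d+\s+sur\s+\d+\s+Edité le .*?\d{1,2}h\d{2}")


def clean_value(value):
    """Strip page footers and collapse whitespace in string fields (hypothetical helper)."""
    if not isinstance(value, str):
        return value  # numbers, dicts and None pass through unchanged
    without_footer = FOOTER_RE.sub("", value)
    # Collapse newlines and repeated spaces into single spaces
    return re.sub(r"\s+", " ", without_footer).strip()


def clean_record(record):
    """Return a copy of a flat dict with every string value cleaned."""
    return {key: clean_value(val) for key, val in record.items()}


sample = {
    "age": 62,
    "groupe_sanguin": "A  +",
    "commentaire": "s\nPage 22 sur  22 Edité le 29 novembre 2022 à 18h37",
}
print(clean_record(sample))
```

Applied to the real data, this would be `clean_record(donor_data["informations_donneur"])`; the nested `hla` dictionary is left untouched because only string values are rewritten.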


### Converting to a Structured Model
Convert the raw extracted data to a structured Donneur model:

In [8]:
from pulmo_cristal.models import Donneur

if "donor_data" in locals():
    try:
        # Convert to Donneur model
        donneur = Donneur.from_dict(donor_data)

        # Display basic model information
        print(f"Donor ID: {donneur.id}")
        print(f"Donor Type: {donneur.type_donneur}")
        print(f"Age: {donneur.age}")
        print(f"Sex: {donneur.sexe}")

        # Validate the model
        is_valid = donneur.validate()
        print(f"\nValidation Result: {'Valid' if is_valid else 'Invalid'}")
        if not is_valid and hasattr(donneur, "validation_errors"):
            print("Validation Errors:")
            for error in donneur.validation_errors:
                print(f"  - {error}")
    except Exception as e:
        print(f"Error converting to model: {e}")

Donor ID: 175862
Donor Type: DonneurType.INCONNU
Age: 62
Sex: SexeType.FEMME

Validation Result: Valid
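
The rules behind `donneur.validate()` are not shown in this notebook. As an illustration of the kind of checks a validation step might perform on the raw dictionary, here is a standalone sketch. `check_donor_fields` is hypothetical, and the accepted age range, the `F`/`M` codes and the `dd/mm/yyyy` date layout are assumptions based on the sample output above.

```python
from datetime import datetime


def check_donor_fields(info):
    """Return a list of problems found in a flat donor dict (hypothetical checks)."""
    problems = []

    # Age may arrive as an int or as a numeric string, so convert defensively
    try:
        age = int(info.get("age"))
    except (TypeError, ValueError):
        problems.append(f"age is not a number: {info.get('age')!r}")
    else:
        if not 0 <= age <= 120:
            problems.append(f"age out of range: {age}")

    # Only "F" appears in the sample output; "M" is assumed
    if info.get("sexe") not in {"F", "M"}:
        problems.append(f"unexpected sexe: {info.get('sexe')!r}")

    # Dates in the sample output use the day/month/year layout
    for field in ("date_naissance", "date_clampage"):
        try:
            datetime.strptime(str(info.get(field)), "%d/%m/%Y")
        except ValueError:
            problems.append(f"unparseable {field}: {info.get(field)!r}")

    return problems
```

An empty list means every check passed; otherwise each entry names the field that failed.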


### Exporting Data to JSON and CSV
Now let's export the extracted data to JSON and CSV formats:

In [9]:
from pulmo_cristal.exporters import DonorJSONExporter, DonorCSVExporter

# Create output directory
output_dir = "./output"
os.makedirs(output_dir, exist_ok=True)

if "donor_data" in locals():
    # Initialize exporters
    json_exporter = DonorJSONExporter()
    csv_exporter = DonorCSVExporter()

    try:
        # Export to JSON
        json_path = os.path.join(output_dir, "donor_data.json")
        json_exporter.export_json([donor_data], json_path)
        print(f"Exported JSON to: {json_path}")

        # Export to CSV
        csv_path = os.path.join(output_dir, "donor_data.csv")
        csv_exporter.export_csv([donor_data], csv_path)
        print(f"Exported CSV to: {csv_path}")
    except Exception as e:
        print(f"Error exporting data: {e}")

2025-06-18 23:13:42,608 - DonorJSONExporter - INFO - JSON file generated successfully: output/donor_data_20250618_231342.json
2025-06-18 23:13:42,610 - DonorCSVExporter - INFO - CSV file generated successfully: output/donor_data_20250618_231342.csv


Exported JSON to: ./output/donor_data.json
Exported CSV to: ./output/donor_data.csv
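
Note that the exporter log lines above report file names with a timestamp suffix (for example `donor_data_20250618_231342.json`), while the `print` calls echo the path that was requested. If the files on disk do carry the suffix, a lookup like the following could locate the most recent one. This is a sketch: `latest_export` is a hypothetical helper, and the `<stem>_<YYYYMMDD>_<HHMMSS>` naming pattern is inferred from the log output only.

```python
from pathlib import Path


def latest_export(output_dir, stem, suffix):
    """Return the newest file named like '<stem>_<timestamp><suffix>', or None.

    Hypothetical helper; the naming pattern is inferred from the log lines.
    """
    matches = sorted(Path(output_dir).glob(f"{stem}_*{suffix}"))
    # Timestamps shaped like YYYYMMDD_HHMMSS sort chronologically as plain text
    return matches[-1] if matches else None
```

For example, `latest_export("./output", "donor_data", ".json")` returns a `Path`, or `None` when nothing matches.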


### Processing Multiple Files
To process all your existing PDFs in batches:

In [10]:
from pulmo_cristal.utils import batch_process_files
import time

if pdf_files:
    # Initialize extractors
    donor_extractor = DonorPDFExtractor()
    hla_extractor = HLAExtractor()

    # Initialize exporters
    json_exporter = DonorJSONExporter()

    # Process in batches
    all_data = []
    successful = 0
    failed = 0
    start_time = time.time()

    # Define batch size
    batch_size = 5  # Adjust based on your needs

    for batch_idx, batch in enumerate(
        batch_process_files(pdf_files, batch_size=batch_size)
    ):
        print(
            f"\nProcessing batch {batch_idx + 1}/{(len(pdf_files) + batch_size - 1) // batch_size}"
        )
        batch_data = []

        for pdf_file in batch:
            try:
                # Extract donor data
                donor_data = donor_extractor.extract_donor_data(pdf_file)

                # Extract HLA data
                hla_data, status = hla_extractor.extract_hla_data(pdf_file)
                donor_data["informations_donneur"]["hla"] = hla_data
                donor_data["informations_donneur"]["hla_extraction_status"] = status

                # Add to batch
                batch_data.append(donor_data)
                successful += 1
                print(f"✓ Processed: {pdf_file.name}")
            except Exception as e:
                failed += 1
                print(f"✗ Error processing {pdf_file.name}: {e}")

        # Add batch to all data
        all_data.extend(batch_data)

        # Save intermediate results
        if batch_data and (batch_idx + 1) % 2 == 0:  # Save every 2 batches
            interim_path = os.path.join(
                output_dir, f"donors_interim_batch_{batch_idx + 1}.json"
            )
            try:
                json_exporter.export_json(all_data, interim_path)
                print(f"Saved interim results to: {interim_path}")
            except Exception as e:
                print(f"Error saving interim results: {e}")

    end_time = time.time()
    elapsed_time = end_time - start_time

    print("\nProcessing complete:")
    print(f"  - Total files: {len(pdf_files)}")
    print(f"  - Successful: {successful}")
    print(f"  - Failed: {failed}")
    print(f"  - Total time: {elapsed_time:.2f} seconds")
    print(f"  - Avg time per file: {elapsed_time / len(pdf_files):.2f} seconds")

    # Export all data
    if all_data:
        final_json_path = os.path.join(output_dir, "all_donors.json")
        final_csv_path = os.path.join(output_dir, "all_donors.csv")

        try:
            # Export to JSON
            json_exporter.export_json(all_data, final_json_path)
            print(f"Exported all data to JSON: {final_json_path}")

            # Export to CSV
            csv_exporter = DonorCSVExporter()
            csv_exporter.export_csv(all_data, final_csv_path)
            print(f"Exported all data to CSV: {final_csv_path}")
        except Exception as e:
            print(f"Error exporting final data: {e}")
else:
    print("No PDF files found to process")

2025-06-18 23:13:42,618 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/patient2.pdf
2025-06-18 23:13:42,622 - DonorPDFExtractor - INFO - PDF contains 22 pages
2025-06-18 23:13:42,638 - DonorPDFExtractor - INFO - Page 1/22: Extracted 1224 characters
2025-06-18 23:13:42,639 - DonorPDFExtractor - INFO - Page 2/22: Extracted 85 characters
2025-06-18 23:13:42,648 - DonorPDFExtractor - INFO - Page 3/22: Extracted 1546 characters
2025-06-18 23:13:42,657 - DonorPDFExtractor - INFO - Page 4/22: Extracted 720 characters
2025-06-18 23:13:42,661 - DonorPDFExtractor - INFO - Page 5/22: Extracted 398 characters
2025-06-18 23:13:42,664 - DonorPDFExtractor - INFO - Page 6/22: Extracted 429 characters
2025-06-18 23:13:42,668 - DonorPDFExtractor - INFO - Page 7/22: Extracted 490 characters
2025-06-18 23:13:42,672 - DonorPDFExtractor - INFO - Page 8/22: Extracted 367 characters
2025-06-18 23:13:42,683 - DonorPDFExtractor - INFO - Page 9/22: Extracted 1070 characters
2025-06-18 23:13:4


Processing batch 1/2


2025-06-18 23:13:42,794 - DonorPDFExtractor - INFO - Page 20/22: Extracted 1062 characters
2025-06-18 23:13:42,805 - DonorPDFExtractor - INFO - Page 21/22: Extracted 846 characters
2025-06-18 23:13:42,808 - DonorPDFExtractor - INFO - Page 22/22: Extracted 99 characters
2025-06-18 23:13:42,808 - DonorPDFExtractor - INFO - Total extracted text: 16277 characters
2025-06-18 23:13:42,821 - HLAExtractor - INFO - Extracting HLA data from sample_pdfs/patient2.pdf using original approach
2025-06-18 23:13:43,132 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/patient1.pdf
2025-06-18 23:13:43,135 - DonorPDFExtractor - INFO - PDF contains 21 pages
2025-06-18 23:13:43,173 - DonorPDFExtractor - INFO - Page 1/21: Extracted 1377 characters
2025-06-18 23:13:43,175 - DonorPDFExtractor - INFO - Page 2/21: Extracted 92 characters
2025-06-18 23:13:43,183 - DonorPDFExtractor - INFO - Page 3/21: Extracted 1444 characters
2025-06-18 23:13:43,191 - DonorPDFExtractor - INFO - Page 4/21: Extra

✓ Processed: patient2.pdf


2025-06-18 23:13:43,336 - DonorPDFExtractor - INFO - Page 20/21: Extracted 1009 characters
2025-06-18 23:13:43,338 - DonorPDFExtractor - INFO - Page 21/21: Extracted 203 characters
2025-06-18 23:13:43,339 - DonorPDFExtractor - INFO - Total extracted text: 15729 characters
2025-06-18 23:13:43,349 - HLAExtractor - INFO - Extracting HLA data from sample_pdfs/patient1.pdf using original approach
2025-06-18 23:13:43,576 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf
2025-06-18 23:13:43,582 - DonorPDFExtractor - INFO - PDF contains 23 pages
2025-06-18 23:13:43,596 - DonorPDFExtractor - INFO - Page 1/23: Extracted 975 characters
2025-06-18 23:13:43,601 - DonorPDFExtractor - INFO - Page 2/23: Extracted 642 characters
2025-06-18 23:13:43,610 - DonorPDFExtractor - INFO - Page 3/23: Extracted 771 characters
2025-06-18 23:13:43,619 - DonorPDFExtractor - INFO - Page 4/23: Extracted 1198 characters
2025-06-18 23:13:43,623 - DonorPDFExtractor - IN

✓ Processed: patient1.pdf


2025-06-18 23:13:44,053 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/patient_test_1.pdf
2025-06-18 23:13:44,057 - DonorPDFExtractor - INFO - PDF contains 22 pages
2025-06-18 23:13:44,073 - DonorPDFExtractor - INFO - Page 1/22: Extracted 1308 characters
2025-06-18 23:13:44,075 - DonorPDFExtractor - INFO - Page 2/22: Extracted 147 characters
2025-06-18 23:13:44,082 - DonorPDFExtractor - INFO - Page 3/22: Extracted 1549 characters
2025-06-18 23:13:44,091 - DonorPDFExtractor - INFO - Page 4/22: Extracted 786 characters
2025-06-18 23:13:44,095 - DonorPDFExtractor - INFO - Page 5/22: Extracted 425 characters
2025-06-18 23:13:44,098 - DonorPDFExtractor - INFO - Page 6/22: Extracted 663 characters
2025-06-18 23:13:44,103 - DonorPDFExtractor - INFO - Page 7/22: Extracted 608 characters
2025-06-18 23:13:44,107 - DonorPDFExtractor - INFO - Page 8/22: Extracted 380 characters
2025-06-18 23:13:44,117 - DonorPDFExtractor - INFO - Page 9/22: Extracted 1357 characters
2025-06-18 

✓ Processed: 235350.LEGENDRE SURLIGNE BIS .pdf


2025-06-18 23:13:44,491 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/patient_test_2.pdf
2025-06-18 23:13:44,495 - DonorPDFExtractor - INFO - PDF contains 23 pages
2025-06-18 23:13:44,509 - DonorPDFExtractor - INFO - Page 1/23: Extracted 975 characters
2025-06-18 23:13:44,514 - DonorPDFExtractor - INFO - Page 2/23: Extracted 642 characters
2025-06-18 23:13:44,524 - DonorPDFExtractor - INFO - Page 3/23: Extracted 771 characters
2025-06-18 23:13:44,556 - DonorPDFExtractor - INFO - Page 4/23: Extracted 1198 characters
2025-06-18 23:13:44,560 - DonorPDFExtractor - INFO - Page 5/23: Extracted 366 characters
2025-06-18 23:13:44,566 - DonorPDFExtractor - INFO - Page 6/23: Extracted 610 characters
2025-06-18 23:13:44,573 - DonorPDFExtractor - INFO - Page 7/23: Extracted 806 characters
2025-06-18 23:13:44,584 - DonorPDFExtractor - INFO - Page 8/23: Extracted 1097 characters
2025-06-18 23:13:44,595 - DonorPDFExtractor - INFO - Page 9/23: Extracted 1269 characters
2025-06-18 

✓ Processed: patient_test_1.pdf


2025-06-18 23:13:44,930 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/subfolder_test/patient3.pdf
2025-06-18 23:13:44,933 - DonorPDFExtractor - INFO - PDF contains 22 pages
2025-06-18 23:13:44,948 - DonorPDFExtractor - INFO - Page 1/22: Extracted 1329 characters
2025-06-18 23:13:44,950 - DonorPDFExtractor - INFO - Page 2/22: Extracted 85 characters
2025-06-18 23:13:44,958 - DonorPDFExtractor - INFO - Page 3/22: Extracted 1469 characters
2025-06-18 23:13:44,966 - DonorPDFExtractor - INFO - Page 4/22: Extracted 718 characters
2025-06-18 23:13:44,971 - DonorPDFExtractor - INFO - Page 5/22: Extracted 611 characters
2025-06-18 23:13:44,974 - DonorPDFExtractor - INFO - Page 6/22: Extracted 347 characters
2025-06-18 23:13:44,979 - DonorPDFExtractor - INFO - Page 7/22: Extracted 490 characters
2025-06-18 23:13:44,982 - DonorPDFExtractor - INFO - Page 8/22: Extracted 381 characters
2025-06-18 23:13:44,993 - DonorPDFExtractor - INFO - Page 9/22: Extracted 1135 characters
202

✓ Processed: patient_test_2.pdf

Processing batch 2/2


2025-06-18 23:13:45,323 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/subfolder_test/patient_test_2.pdf
2025-06-18 23:13:45,327 - DonorPDFExtractor - INFO - PDF contains 23 pages
2025-06-18 23:13:45,341 - DonorPDFExtractor - INFO - Page 1/23: Extracted 975 characters
2025-06-18 23:13:45,346 - DonorPDFExtractor - INFO - Page 2/23: Extracted 642 characters
2025-06-18 23:13:45,355 - DonorPDFExtractor - INFO - Page 3/23: Extracted 771 characters
2025-06-18 23:13:45,364 - DonorPDFExtractor - INFO - Page 4/23: Extracted 1198 characters
2025-06-18 23:13:45,368 - DonorPDFExtractor - INFO - Page 5/23: Extracted 366 characters
2025-06-18 23:13:45,398 - DonorPDFExtractor - INFO - Page 6/23: Extracted 610 characters
2025-06-18 23:13:45,406 - DonorPDFExtractor - INFO - Page 7/23: Extracted 806 characters
2025-06-18 23:13:45,417 - DonorPDFExtractor - INFO - Page 8/23: Extracted 1097 characters
2025-06-18 23:13:45,429 - DonorPDFExtractor - INFO - Page 9/23: Extracted 1269 charact

✓ Processed: patient3.pdf


2025-06-18 23:13:45,772 - DonorJSONExporter - INFO - JSON file generated successfully: output/donors_interim_batch_2_20250618_231345.json
2025-06-18 23:13:45,773 - DonorJSONExporter - INFO - JSON file generated successfully: output/all_donors_20250618_231345.json
2025-06-18 23:13:45,774 - DonorCSVExporter - INFO - CSV file generated successfully: output/all_donors_20250618_231345.csv


✓ Processed: patient_test_2.pdf
Saved interim results to: ./output/donors_interim_batch_2.json

Processing complete:
  - Total files: 7
  - Successful: 7
  - Failed: 0
  - Total time: 3.15 seconds
  - Avg time per file: 0.45 seconds
Exported all data to JSON: ./output/all_donors.json
Exported all data to CSV: ./output/all_donors.csv
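
The run above split 7 files into batches of 5, giving 2 batches. The sketch below reproduces that arithmetic with a generic chunking generator. It assumes `batch_process_files` yields consecutive groups of at most `batch_size` files, which matches the output shown but is not confirmed by the package documentation; `chunked` and `batch_count` are hypothetical names.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def batch_count(n_items, batch_size):
    """Ceiling division: how many batches `chunked` will produce."""
    return (n_items + batch_size - 1) // batch_size


files = [f"file_{i}.pdf" for i in range(7)]
for index, batch in enumerate(chunked(files, 5), start=1):
    print(f"batch {index}/{batch_count(len(files), 5)}: {len(batch)} files")
```

The `(n + size - 1) // size` expression is the same ceiling division used in the progress message of the batch loop.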


### Using the Command Line Interface with Your Existing PDFs
The package also provides a command-line interface, which may be more convenient for batch processing:

In [11]:
# Extract data from all PDFs using the CLI
!pulmo-cristal extract --input {pdf_dir} --output {output_dir} --format both --recursive



In [12]:
# List PDF files in your directory
!pulmo-cristal list --input {pdf_dir} --recursive

sample_pdfs/
├── subfolder_test/
│   ├── patient3.pdf
│   └── patient_test_2.pdf
├── 235350.LEGENDRE SURLIGNE BIS .pdf
├── patient1.pdf
├── patient2.pdf
├── patient_test_1.pdf
└── patient_test_2.pdf

Total: 1 directories, 7 files
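
Outside a notebook, the same `extract` command can be launched from a Python script. The sketch below uses only the flags that appear in the cell above (`--input`, `--output`, `--format`, `--recursive`); `build_extract_command` and `run_extract` are hypothetical helpers, and whether the CLI signals failure through its exit code is an assumption.

```python
import subprocess


def build_extract_command(input_dir, output_dir, fmt="both", recursive=True):
    """Assemble the argument list for `pulmo-cristal extract` (flags copied from the cell above)."""
    command = [
        "pulmo-cristal", "extract",
        "--input", str(input_dir),
        "--output", str(output_dir),
        "--format", fmt,
    ]
    if recursive:
        command.append("--recursive")
    return command


def run_extract(input_dir, output_dir):
    """Run the CLI and return (exit_code, stdout, stderr)."""
    completed = subprocess.run(
        build_extract_command(input_dir, output_dir),
        capture_output=True,
        text=True,
        check=False,  # inspect the exit code instead of raising
    )
    return completed.returncode, completed.stdout, completed.stderr
```

Passing the arguments as a list avoids shell quoting problems when a directory name contains spaces. A call such as `run_extract("./sample_pdfs", "./output")` requires the package to be installed in the active environment.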
