# Using the pulmo-cristal Package
This guide shows how to use the pulmo-cristal package for extracting and processing data from donor PDF documents.


### Installation

In [13]:
# Remove any existing installation of the package and camelot-py first
!pip uninstall pulmo-cristal -y
!pip uninstall camelot-py -y

Found existing installation: pulmo-cristal 0.1.0
Uninstalling pulmo-cristal-0.1.0:
  Successfully uninstalled pulmo-cristal-0.1.0
Found existing installation: camelot-py 1.0.0
Uninstalling camelot-py-1.0.0:
  Successfully uninstalled camelot-py-1.0.0


In [14]:
# Install the package from GitHub
!pip install git+https://github.com/drci-foch/pulmo-cristal.git

Collecting git+https://github.com/drci-foch/pulmo-cristal.git
  Cloning https://github.com/drci-foch/pulmo-cristal.git to c:\users\benysar\appdata\local\temp\pip-req-build-3g75ack5
  Resolved https://github.com/drci-foch/pulmo-cristal.git to commit 329e1f51395b333e31bd2b5d6993aa96a315c2e7
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting camelot-py>=1.0.0 (from pulmo-cristal==0.1.0)
  Using cached camelot_py-1.0.0-py3-none-any.whl.metadata (9.4 kB)
Using cached camelot_py-1.0.0-py3-none-any.whl (66 kB)
Building wheels for collected packages: pulmo-cristal
  Building wheel for pulmo-cristal (setup.py): started
  Building wheel for pulmo-cristal (setup.py): finished with status 'done'
  Created wheel for pulmo-cristal: filename=pulmo_cristal-0.1.0-py3-none-any.whl size=47973 sha256=db723b6a9d50a05d09fa8c6836f604e00572d4bedd1bf47bbc0a8a659a14cbbd
  Stored in directory: C:\Users\benysar\AppData\Local\Temp\pip-ephem-wheel-cache-

  Running command git clone --filter=blob:none --quiet https://github.com/drci-foch/pulmo-cristal.git 'C:\Users\benysar\AppData\Local\Temp\pip-req-build-3g75ack5'


### Getting Started
Let's check if the package is correctly installed:

In [15]:
# Check the package version
!pulmo-cristal --version

pulmo-cristal 0.1.0


  from cryptography.hazmat.primitives.ciphers.algorithms import AES, ARC4
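
The same check can be made from Python without shelling out. The sketch below uses the standard library's `importlib.metadata`; the distribution name `pulmo-cristal` is taken from the pip output above, and the helper name `installed_version` is only illustrative.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "pulmo-cristal" is the distribution name shown in the pip output
print(installed_version("pulmo-cristal"))
```

Returning `None` instead of raising makes the helper easy to use as a guard at the top of a notebook.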


In [16]:
# View available commands
!pulmo-cristal --help

usage: pulmo-cristal [-h] [--version] [--verbose]
                     {extract,list,convert,validate} ...

Extract data from donor PDF documents.

positional arguments:
  {extract,list,convert,validate}
                        Command to execute
    extract             Extract data from PDF files
    list                List PDF files in a directory
    convert             Convert between output formats
    validate            Validate extracted data

options:
  -h, --help            show this help message and exit
  --version, -V         show program's version number and exit
  --verbose, -v         Increase verbosity (can be used multiple times)
                        (default: 0)




### Finding PDF Files
First, let's see what PDF files we have available:

In [17]:
from pulmo_cristal.utils import find_pdf_files, list_directory_tree
import os

# Set the path to your directory containing existing PDFs
pdf_dir = "./sample_pdfs"  # Change this to your actual path

# Print directory tree to see the structure
tree = list_directory_tree(pdf_dir, max_depth=2, file_types=[".pdf"])
print(tree)

sample_pdfs/
├── subfolder_test/
│   └── patient3.pdf
├── patient1.pdf
└── patient2.pdf

Total: 1 directories, 3 files


In [18]:
# Find all PDF files in the directory
pdf_files = find_pdf_files(pdf_dir, recursive=True)
print(f"Found {len(pdf_files)} PDF files")

# Display the first few files
if pdf_files:
    print("\nSample of PDF files found:")
    for pdf in pdf_files[:5]:
        print(f"  - {pdf}")

Found 3 PDF files

Sample of PDF files found:
  - sample_pdfs\patient1.pdf
  - sample_pdfs\patient2.pdf
  - sample_pdfs\subfolder_test\patient3.pdf
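
For comparison, a similar recursive search can be written with the standard library alone. This is a minimal sketch using `pathlib`, not the implementation behind `find_pdf_files`; the helper name `find_pdfs` and the case-insensitive suffix match are assumptions made for illustration.

```python
from pathlib import Path


def find_pdfs(root, recursive=True):
    """Return a sorted list of PDF paths under root (case-insensitive suffix)."""
    root = Path(root)
    if not root.is_dir():
        return []
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")


# Prints nothing if ./sample_pdfs does not exist
for pdf in find_pdfs("./sample_pdfs"):
    print(pdf)
```

Like the output above, the result holds `Path` objects, so attributes such as `.name` are available on each entry.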


### Extracting Data from a PDF
Now let's extract data from our sample PDF:

In [19]:
from pulmo_cristal.extractors import DonorPDFExtractor, HLAExtractor

# Select a sample PDF to process
sample_pdf = pdf_files[0] if pdf_files else None

if sample_pdf:
    print(f"Processing: {sample_pdf}")

    # Initialize extractors
    donor_extractor = DonorPDFExtractor()
    hla_extractor = HLAExtractor()

    try:
        # Extract donor data
        donor_data = donor_extractor.extract_donor_data(sample_pdf)

        # Extract HLA data
        hla_data, hla_status = hla_extractor.extract_hla_data(sample_pdf)

        # Add HLA data to donor data
        donor_data["informations_donneur"]["hla"] = hla_data
        donor_data["informations_donneur"]["hla_extraction_status"] = hla_status

        # Display extracted data
        print("\nExtracted Donor Information:")
        for key, value in donor_data["informations_donneur"].items():
            if key != "hla":  # Skip HLA for now
                print(f"  {key}: {value}")

        print("\nExtracted HLA Data:")
        for key, value in hla_data.items():
            print(f"  {key}: {value}")

    except Exception as e:
        print(f"Error extracting data: {e}")
else:
    print("No PDF files found to process")

2025-04-08 16:41:10,578 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs\patient1.pdf
2025-04-08 16:41:10,581 - DonorPDFExtractor - INFO - PDF contains 21 pages
2025-04-08 16:41:10,598 - DonorPDFExtractor - INFO - Page 1/21: Extracted 1377 characters
2025-04-08 16:41:10,600 - DonorPDFExtractor - INFO - Page 2/21: Extracted 92 characters
2025-04-08 16:41:10,608 - DonorPDFExtractor - INFO - Page 3/21: Extracted 1444 characters
2025-04-08 16:41:10,616 - DonorPDFExtractor - INFO - Page 4/21: Extracted 719 characters
2025-04-08 16:41:10,621 - DonorPDFExtractor - INFO - Page 5/21: Extracted 645 characters
2025-04-08 16:41:10,623 - DonorPDFExtractor - INFO - Page 6/21: Extracted 262 characters
2025-04-08 16:41:10,629 - DonorPDFExtractor - INFO - Page 7/21: Extracted 490 characters
2025-04-08 16:41:10,634 - DonorPDFExtractor - INFO - Page 8/21: Extracted 367 characters
2025-04-08 16:41:10,644 - DonorPDFExtractor - INFO - Page 9/21: Extracted 951 characters
2025-04-08 16:41:10

Processing: sample_pdfs\patient1.pdf


2025-04-08 16:41:10,710 - DonorPDFExtractor - INFO - Page 16/21: Extracted 808 characters
2025-04-08 16:41:10,738 - DonorPDFExtractor - INFO - Page 17/21: Extracted 1624 characters
2025-04-08 16:41:10,740 - DonorPDFExtractor - INFO - Page 18/21: Extracted 134 characters
2025-04-08 16:41:10,750 - DonorPDFExtractor - INFO - Page 19/21: Extracted 1065 characters
2025-04-08 16:41:10,760 - DonorPDFExtractor - INFO - Page 20/21: Extracted 1009 characters
2025-04-08 16:41:10,762 - DonorPDFExtractor - INFO - Page 21/21: Extracted 203 characters
2025-04-08 16:41:10,762 - DonorPDFExtractor - INFO - Total extracted text: 15729 characters
2025-04-08 16:41:10,769 - HLAExtractor - INFO - Extracting HLA data from sample_pdfs\patient1.pdf using original approach
CropBox missing from /Page, defaulting to MediaBox



Extracted Donor Information:
  num_cristal: 175394
  type_donneur: Donneur prélevé, organe greffé
Coeur Battant (SME)
  age: 67
  sexe: M
  groupe_sanguin: o-
  date_naissance: 03/12/1952
  taille: 173
  poids: 78
  date_clampage: 05/01/2020
  etiologie: Traumatique Non A.V.P.
  duree_ventilation: 1
  hla_extraction_status: OK

Extracted HLA Data:
  A1: 2
  A2: 24
  B1: 35
  B2: 51
  C1: 4
  C2: 14
  DR1: 11
  DR2: 16
  DQA: 5
  DQB: 7
  DP1: 4
  DP2: 0
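
The per-page INFO lines can crowd out the actual results. The log format suggests that the extractors use standard `logging` loggers named `DonorPDFExtractor` and `HLAExtractor`; that is an inference from the output above, not something the package documentation confirms. Under that assumption, a sketch for silencing them:

```python
import logging


def quiet_loggers(names, level=logging.WARNING):
    """Raise the threshold of the named loggers so INFO records are dropped."""
    for name in names:
        logging.getLogger(name).setLevel(level)


# Logger names are inferred from the log lines above (an assumption)
quiet_loggers(["DonorPDFExtractor", "HLAExtractor"])
```

If the extractors configure their own log level when they are created, this call may need to run after the extractors are instantiated.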


### Converting to a Structured Model
Convert the raw extracted data to a structured Donneur model:

In [20]:
from pulmo_cristal.models import Donneur

if "donor_data" in locals():
    try:
        # Convert to Donneur model
        donneur = Donneur.from_dict(donor_data)

        # Display basic model information
        print(f"Donor ID: {donneur.id}")
        print(f"Donor Type: {donneur.type_donneur}")
        print(f"Age: {donneur.age}")
        print(f"Sex: {donneur.sexe}")

        # Validate the model
        is_valid = donneur.validate()
        print(f"\nValidation Result: {'Valid' if is_valid else 'Invalid'}")
        if not is_valid and hasattr(donneur, "validation_errors"):
            print("Validation Errors:")
            for error in donneur.validation_errors:
                print(f"  - {error}")
    except Exception as e:
        print(f"Error converting to model: {e}")

Donor ID: 175394
Donor Type: DonneurType.INCONNU
Age: 67
Sex: SexeType.HOMME

Validation Result: Valid
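
The rules applied by `Donneur.validate()` are not shown here, so an independent sanity check on the raw fields can be useful. The sketch below recomputes the donor's age at clamping from `date_naissance` and `date_clampage` and compares it with the extracted `age`. It assumes the dates are in day/month/year order, which is consistent with the reported age; the helper names are illustrative.

```python
from datetime import datetime


def parse_ddmmyyyy(text):
    """Parse a date written as day/month/year (assumed format)."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def age_on(birth, reference):
    """Completed years between birth and the reference date."""
    years = reference.year - birth.year
    if (reference.month, reference.day) < (birth.month, birth.day):
        years -= 1  # birthday not yet reached in the reference year
    return years


# Values copied from the extraction output above; int() covers str or int ages
info = {"age": "67", "date_naissance": "03/12/1952", "date_clampage": "05/01/2020"}
computed = age_on(
    parse_ddmmyyyy(info["date_naissance"]),
    parse_ddmmyyyy(info["date_clampage"]),
)
print(computed, computed == int(info["age"]))  # 67 True
```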


### Exporting Data to JSON and CSV
Now let's export the extracted data to JSON and CSV formats:

In [21]:
from pulmo_cristal.exporters import DonorJSONExporter, DonorCSVExporter

# Create output directory
output_dir = "./output"
os.makedirs(output_dir, exist_ok=True)

if "donor_data" in locals():
    # Initialize exporters
    json_exporter = DonorJSONExporter()
    csv_exporter = DonorCSVExporter()

    try:
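        # Export to JSON. Per the exporter log output, a timestamp is appended
        # to the file name, so the file on disk is donor_data_<timestamp>.json
        # (and likewise for the CSV), not the exact path printed here.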
        json_path = os.path.join(output_dir, "donor_data.json")
        json_exporter.export_json([donor_data], json_path)
        print(f"Exported JSON to: {json_path}")

        # Export to CSV
        csv_path = os.path.join(output_dir, "donor_data.csv")
        csv_exporter.export_csv([donor_data], csv_path)
        print(f"Exported CSV to: {csv_path}")
    except Exception as e:
        print(f"Error exporting data: {e}")

2025-04-08 16:41:11,046 - DonorJSONExporter - INFO - JSON file generated successfully: output\donor_data_20250408_164111.json
2025-04-08 16:41:11,047 - DonorCSVExporter - INFO - CSV file generated successfully: output\donor_data_20250408_164111.csv


Exported JSON to: ./output\donor_data.json
Exported CSV to: ./output\donor_data.csv
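
The exporter log lines show timestamped file names (for example `donor_data_20250408_164111.json`), while the `print` calls show the path that was requested. A small sketch for locating the file that was actually written, assuming the `<stem>_<YYYYMMDD>_<HHMMSS><suffix>` pattern seen in the log; the helper name `latest_export` is illustrative.

```python
from pathlib import Path


def latest_export(output_dir, stem, suffix):
    """Return the newest '<stem>_<timestamp><suffix>' file, or None if none exist."""
    directory = Path(output_dir)
    if not directory.is_dir():
        return None
    # Timestamps of the form YYYYMMDD_HHMMSS sort chronologically as text
    matches = sorted(directory.glob(f"{stem}_*{suffix}"))
    return matches[-1] if matches else None


print(latest_export("./output", "donor_data", ".json"))
```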


### Processing Multiple Files
To process all of your existing PDFs in batches:

In [22]:
from pulmo_cristal.utils import batch_process_files
import time

if pdf_files:
    # Initialize extractors
    donor_extractor = DonorPDFExtractor()
    hla_extractor = HLAExtractor()

    # Initialize exporters
    json_exporter = DonorJSONExporter()

    # Process in batches
    all_data = []
    successful = 0
    failed = 0
    start_time = time.time()

    # Define batch size
    batch_size = 5  # Adjust based on your needs

    for batch_idx, batch in enumerate(
        batch_process_files(pdf_files, batch_size=batch_size)
    ):
        print(
            f"\nProcessing batch {batch_idx + 1}/{(len(pdf_files) + batch_size - 1) // batch_size}"
        )
        batch_data = []

        for pdf_file in batch:
            try:
                # Extract donor data
                donor_data = donor_extractor.extract_donor_data(pdf_file)

                # Extract HLA data
                hla_data, status = hla_extractor.extract_hla_data(pdf_file)
                donor_data["informations_donneur"]["hla"] = hla_data
                donor_data["informations_donneur"]["hla_extraction_status"] = status

                # Add to batch
                batch_data.append(donor_data)
                successful += 1
                print(f"✓ Processed: {pdf_file.name}")
            except Exception as e:
                failed += 1
                print(f"✗ Error processing {pdf_file.name}: {e}")

        # Add batch to all data
        all_data.extend(batch_data)

        # Save intermediate results
        if batch_data and (batch_idx + 1) % 2 == 0:  # Save every 2 batches
            interim_path = os.path.join(
                output_dir, f"donors_interim_batch_{batch_idx + 1}.json"
            )
            try:
                json_exporter.export_json(all_data, interim_path)
                print(f"Saved interim results to: {interim_path}")
            except Exception as e:
                print(f"Error saving interim results: {e}")

    end_time = time.time()
    elapsed_time = end_time - start_time

    print("\nProcessing complete:")
    print(f"  - Total files: {len(pdf_files)}")
    print(f"  - Successful: {successful}")
    print(f"  - Failed: {failed}")
    print(f"  - Total time: {elapsed_time:.2f} seconds")
    print(f"  - Avg time per file: {elapsed_time / len(pdf_files):.2f} seconds")

    # Export all data
    if all_data:
        final_json_path = os.path.join(output_dir, "all_donors.json")
        final_csv_path = os.path.join(output_dir, "all_donors.csv")

        try:
            # Export to JSON
            json_exporter.export_json(all_data, final_json_path)
            print(f"Exported all data to JSON: {final_json_path}")

            # Export to CSV
            csv_exporter = DonorCSVExporter()
            csv_exporter.export_csv(all_data, final_csv_path)
            print(f"Exported all data to CSV: {final_csv_path}")
        except Exception as e:
            print(f"Error exporting final data: {e}")
else:
    print("No PDF files found to process")

2025-04-08 16:41:11,062 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs\patient1.pdf
2025-04-08 16:41:11,065 - DonorPDFExtractor - INFO - PDF contains 21 pages
2025-04-08 16:41:11,081 - DonorPDFExtractor - INFO - Page 1/21: Extracted 1377 characters
2025-04-08 16:41:11,083 - DonorPDFExtractor - INFO - Page 2/21: Extracted 92 characters
2025-04-08 16:41:11,091 - DonorPDFExtractor - INFO - Page 3/21: Extracted 1444 characters
2025-04-08 16:41:11,099 - DonorPDFExtractor - INFO - Page 4/21: Extracted 719 characters
2025-04-08 16:41:11,104 - DonorPDFExtractor - INFO - Page 5/21: Extracted 645 characters
2025-04-08 16:41:11,106 - DonorPDFExtractor - INFO - Page 6/21: Extracted 262 characters
2025-04-08 16:41:11,111 - DonorPDFExtractor - INFO - Page 7/21: Extracted 490 characters
2025-04-08 16:41:11,115 - DonorPDFExtractor - INFO - Page 8/21: Extracted 367 characters
2025-04-08 16:41:11,146 - DonorPDFExtractor - INFO - Page 9/21: Extracted 951 characters
2025-04-08 16:41:11


Processing batch 1/1


2025-04-08 16:41:11,197 - DonorPDFExtractor - INFO - Page 15/21: Extracted 256 characters
2025-04-08 16:41:11,213 - DonorPDFExtractor - INFO - Page 16/21: Extracted 808 characters
2025-04-08 16:41:11,222 - DonorPDFExtractor - INFO - Page 17/21: Extracted 1624 characters
2025-04-08 16:41:11,223 - DonorPDFExtractor - INFO - Page 18/21: Extracted 134 characters
2025-04-08 16:41:11,234 - DonorPDFExtractor - INFO - Page 19/21: Extracted 1065 characters
2025-04-08 16:41:11,244 - DonorPDFExtractor - INFO - Page 20/21: Extracted 1009 characters
2025-04-08 16:41:11,246 - DonorPDFExtractor - INFO - Page 21/21: Extracted 203 characters
2025-04-08 16:41:11,247 - DonorPDFExtractor - INFO - Total extracted text: 15729 characters
2025-04-08 16:41:11,254 - HLAExtractor - INFO - Extracting HLA data from sample_pdfs\patient1.pdf using original approach
CropBox missing from /Page, defaulting to MediaBox
2025-04-08 16:41:11,512 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs\patient2.pd

✓ Processed: patient1.pdf


2025-04-08 16:41:11,699 - DonorPDFExtractor - INFO - Page 21/22: Extracted 846 characters
2025-04-08 16:41:11,700 - DonorPDFExtractor - INFO - Page 22/22: Extracted 99 characters
2025-04-08 16:41:11,701 - DonorPDFExtractor - INFO - Total extracted text: 16277 characters
2025-04-08 16:41:11,708 - HLAExtractor - INFO - Extracting HLA data from sample_pdfs\patient2.pdf using original approach
CropBox missing from /Page, defaulting to MediaBox
2025-04-08 16:41:11,947 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs\subfolder_test\patient3.pdf
2025-04-08 16:41:11,950 - DonorPDFExtractor - INFO - PDF contains 22 pages
2025-04-08 16:41:11,965 - DonorPDFExtractor - INFO - Page 1/22: Extracted 1329 characters
2025-04-08 16:41:11,966 - DonorPDFExtractor - INFO - Page 2/22: Extracted 85 characters
2025-04-08 16:41:11,975 - DonorPDFExtractor - INFO - Page 3/22: Extracted 1469 characters
2025-04-08 16:41:12,002 - DonorPDFExtractor - INFO - Page 4/22: Extracted 718 characters
2025-

✓ Processed: patient2.pdf


CropBox missing from /Page, defaulting to MediaBox
2025-04-08 16:41:12,382 - DonorJSONExporter - INFO - JSON file generated successfully: output\all_donors_20250408_164112.json
2025-04-08 16:41:12,384 - DonorCSVExporter - INFO - CSV file generated successfully: output\all_donors_20250408_164112.csv


✓ Processed: patient3.pdf

Processing complete:
  - Total files: 3
  - Successful: 3
  - Failed: 0
  - Total time: 1.32 seconds
  - Avg time per file: 0.44 seconds
Exported all data to JSON: ./output\all_donors.json
Exported all data to CSV: ./output\all_donors.csv
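
The internals of `batch_process_files` are not shown in this guide. As a stand-in, the sketch below illustrates the general batching idea with a plain generator, together with the ceiling-division formula used in the progress message above. The names `chunked` and `batch_count` are illustrative.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def batch_count(n_items, batch_size):
    """Ceiling division: number of batches needed for n_items."""
    return (n_items + batch_size - 1) // batch_size


files = ["patient1.pdf", "patient2.pdf", "patient3.pdf"]
print(batch_count(len(files), 5))  # 1
print(list(chunked(files, 2)))     # [['patient1.pdf', 'patient2.pdf'], ['patient3.pdf']]
```

With three files and a batch size of 5, everything fits in one batch, which matches the `Processing batch 1/1` line in the output.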


### Using the Command Line Interface with Your Existing PDFs
The package also provides a command-line interface, which can be more convenient for batch processing:

In [23]:
# Extract data from all PDFs using the CLI
!pulmo-cristal extract --input {pdf_dir} --output {output_dir} --format both --recursive

CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
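
Outside a notebook, the same `extract` command can be launched from a script with `subprocess`. The sketch below only uses the flags shown in the cell above (`--input`, `--output`, `--format both`, `--recursive`); other `--format` values are not covered in this guide, and the helper names are illustrative.

```python
import subprocess


def build_extract_command(pdf_dir, output_dir, fmt="both", recursive=True):
    """Assemble the argument list for `pulmo-cristal extract`."""
    cmd = [
        "pulmo-cristal", "extract",
        "--input", str(pdf_dir),
        "--output", str(output_dir),
        "--format", fmt,
    ]
    if recursive:
        cmd.append("--recursive")
    return cmd


def run_extract(pdf_dir, output_dir):
    """Run the CLI; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_extract_command(pdf_dir, output_dir), check=True)


# Only builds and prints the command; call run_extract(...) to execute it
print(" ".join(build_extract_command("./sample_pdfs", "./output")))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.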


In [24]:
# List PDF files in your directory
!pulmo-cristal list --input {pdf_dir} --recursive

sample_pdfs/
├── subfolder_test/
│   └── patient3.pdf
├── patient1.pdf
└── patient2.pdf

Total: 1 directories, 3 files


