# Using the pulmo-cristal Package
This guide shows how to use the pulmo-cristal package to extract and process data from donor PDF documents.


### Installation

In [85]:
# Remove any existing installs, then reinstall camelot-py (with the cv extra)
# and install the package from GitHub
!pip uninstall pulmo-cristal -y
!pip uninstall camelot-py -y
!pip install "camelot-py[cv]"

!pip install git+https://github.com/drci-foch/pulmo-cristal.git

Found existing installation: pulmo-cristal 0.1.0
Uninstalling pulmo-cristal-0.1.0:
  Successfully uninstalled pulmo-cristal-0.1.0
Found existing installation: camelot-py 1.0.0
Uninstalling camelot-py-1.0.0:
  Successfully uninstalled camelot-py-1.0.0
Collecting camelot-py[cv]
  Using cached camelot_py-1.0.0-py3-none-any.whl.metadata (9.4 kB)
Using cached camelot_py-1.0.0-py3-none-any.whl (66 kB)
Installing collected packages: camelot-py
Successfully installed camelot-py-1.0.0

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Collecting git+https://github.com/drci-foch/pulmo-cristal.git
  Cloning https://github.com/drci-foch/pulmo-cristal.git to /private/var/folders/4l/60lgrf9s6sqbf6lcr70gbl2c0000gn/T/pip-req-build-ubhlct5q
  Running command git clone --filter=blob:none --quiet https://g

### Getting Started
Let's check if the package is correctly installed:

In [86]:
# Check the package version
!pulmo-cristal --version

pulmo-cristal 0.1.0


In [87]:
# View available commands
!pulmo-cristal --help

usage: pulmo-cristal [-h] [--version] [--verbose]
                     {extract,list,convert,validate} ...

Extract data from donor PDF documents.

positional arguments:
  {extract,list,convert,validate}
                        Command to execute
    extract             Extract data from PDF files
    list                List PDF files in a directory
    convert             Convert between output formats
    validate            Validate extracted data

options:
  -h, --help            show this help message and exit
  --version, -V         show program's version number and exit
  --verbose, -v         Increase verbosity (can be used multiple times)
                        (default: 0)


### Finding PDF Files
First, let's see what PDF files we have available:

In [88]:
from pulmo_cristal.utils import find_pdf_files, list_directory_tree
import os

# Set the path to your directory containing existing PDFs
pdf_dir = "./sample_pdfs"  # Change this to your actual path

# Print directory tree to see the structure
tree = list_directory_tree(pdf_dir, max_depth=2, file_types=[".pdf"])
print(tree)

sample_pdfs/
├── subfolder_test/
│   └── patient_test_2.pdf
├── patient_test_1.pdf
└── patient_test_2.pdf

Total: 1 directories, 3 files


In [89]:
# Find all PDF files in the directory
pdf_files = find_pdf_files(pdf_dir, recursive=True)
print(f"Found {len(pdf_files)} PDF files")

# Display the first few files
if pdf_files:
    print("\nSample of PDF files found:")
    for pdf in pdf_files[:5]:
        print(f"  - {pdf}")

Found 3 PDF files

Sample of PDF files found:
  - sample_pdfs/patient_test_1.pdf
  - sample_pdfs/patient_test_2.pdf
  - sample_pdfs/subfolder_test/patient_test_2.pdf
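
For readers who want to see what such a search involves without the package installed, here is a minimal standard-library sketch. The helper name `find_pdfs_stdlib` is hypothetical, and this is not necessarily how `find_pdf_files` is implemented; it only shows one way to collect PDF paths with `pathlib`.

```python
from pathlib import Path


def find_pdfs_stdlib(directory, recursive=True):
    """Return sorted paths of .pdf files under `directory`.

    Hypothetical helper for illustration; the suffix match is
    case-insensitive so that `.PDF` files are also found.
    """
    root = Path(directory)
    pattern = "**/*" if recursive else "*"
    return sorted(
        path
        for path in root.glob(pattern)
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Calling `find_pdfs_stdlib("./sample_pdfs")` returns `Path` objects, so attributes such as `.name` are available on each result.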


### Extracting Data from a PDF
Now let's extract data from our sample PDF:

In [90]:
from pulmo_cristal.extractors import DonorPDFExtractor, HLAExtractor
import json

# Select a sample PDF to process
sample_pdf = pdf_files[0] if pdf_files else None

if sample_pdf:
    print(f"Processing: {sample_pdf}")
    
    # Initialize extractors
    donor_extractor = DonorPDFExtractor()
    hla_extractor = HLAExtractor()
    
    try:
        # Extract donor data
        donor_data = donor_extractor.extract_donor_data(sample_pdf)
        
        # Extract HLA data
        hla_data, hla_status = hla_extractor.extract_hla_data(sample_pdf)
        
        # Add HLA data to donor data
        donor_data["informations_donneur"]["hla"] = hla_data
        donor_data["informations_donneur"]["hla_extraction_status"] = hla_status
        
        # Display extracted data
        print("\nExtracted Donor Information:")
        for key, value in donor_data["informations_donneur"].items():
            if key != "hla":  # Skip HLA for now
                print(f"  {key}: {value}")
        
        print("\nExtracted HLA Data:")
        for key, value in hla_data.items():
            print(f"  {key}: {value}")
    
    except Exception as e:
        print(f"Error extracting data: {e}")
else:
    print("No PDF files found to process")

2025-04-08 15:19:23,906 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/patient_test_1.pdf


Processing: sample_pdfs/patient_test_1.pdf


2025-04-08 15:19:23,924 - DonorPDFExtractor - INFO - PDF contains 22 pages
2025-04-08 15:19:23,972 - DonorPDFExtractor - INFO - Page 1/22: Extracted 1308 characters
2025-04-08 15:19:23,973 - DonorPDFExtractor - INFO - Page 2/22: Extracted 147 characters
2025-04-08 15:19:23,981 - DonorPDFExtractor - INFO - Page 3/22: Extracted 1549 characters
2025-04-08 15:19:23,990 - DonorPDFExtractor - INFO - Page 4/22: Extracted 786 characters
2025-04-08 15:19:23,995 - DonorPDFExtractor - INFO - Page 5/22: Extracted 425 characters
2025-04-08 15:19:23,998 - DonorPDFExtractor - INFO - Page 6/22: Extracted 663 characters
2025-04-08 15:19:24,003 - DonorPDFExtractor - INFO - Page 7/22: Extracted 608 characters
2025-04-08 15:19:24,009 - DonorPDFExtractor - INFO - Page 8/22: Extracted 380 characters
2025-04-08 15:19:24,019 - DonorPDFExtractor - INFO - Page 9/22: Extracted 1357 characters
2025-04-08 15:19:24,025 - DonorPDFExtractor - INFO - Page 10/22: Extracted 415 characters
2025-04-08 15:19:24,034 - Donor


Extracted Donor Information:
  num_cristal: 235183
  type_donneur: Donneur prélevé, organe greffé
DDME
  age: 64
  sexe: M
  groupe_sanguin: A  +
  date_naissance: 03/08/1960
  taille: 175
  poids: 84.5
  date_clampage: 16/02/2025
  etiologie: Vasculaire
  duree_ventilation: 2
  hla_extraction_status: À VÉRIFIER MANUELLEMENT

Extracted HLA Data:
  A1: À AJOUTER
  A2: À AJOUTER
  B1: À AJOUTER
  B2: À AJOUTER
  C1: À AJOUTER
  C2: À AJOUTER
  DR1: À AJOUTER
  DR2: À AJOUTER
  DQB1: À AJOUTER
  DQB2: À AJOUTER
  DP1: À AJOUTER
  DP2: À AJOUTER
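
In the output above every HLA field still holds the placeholder `À AJOUTER`, and the status is `À VÉRIFIER MANUELLEMENT`. A small helper can list the fields that still need manual entry. This is a sketch: the function name `missing_hla_fields` is hypothetical, and the values in `example` are made up for illustration rather than taken from a real extraction.

```python
PLACEHOLDER = "À AJOUTER"  # placeholder string seen in the output above


def missing_hla_fields(hla_data, placeholder=PLACEHOLDER):
    """Return the HLA keys whose value is empty or still the placeholder."""
    return [
        key
        for key, value in hla_data.items()
        if value is None or str(value).strip() in ("", placeholder)
    ]


# Made-up values, for illustration only
example = {"A1": "À AJOUTER", "A2": "2", "B1": "", "B2": "44"}
print(missing_hla_fields(example))  # prints ['A1', 'B1']
```

Applied to the `hla_data` dictionary from the cell above, it would return all twelve keys.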


### Converting to a Structured Model
Convert the raw extracted data to a structured Donneur model:

In [91]:
from pulmo_cristal.models import Donneur

if 'donor_data' in locals():
    try:
        # Convert to Donneur model
        donneur = Donneur.from_dict(donor_data)
        
        # Display basic model information
        print(f"Donor ID: {donneur.id}")
        print(f"Donor Type: {donneur.type_donneur}")
        print(f"Age: {donneur.age}")
        print(f"Sex: {donneur.sexe}")
        
        # Validate the model
        is_valid = donneur.validate()
        print(f"\nValidation Result: {'Valid' if is_valid else 'Invalid'}")
        if not is_valid and hasattr(donneur, 'validation_errors'):
            print("Validation Errors:")
            for error in donneur.validation_errors:
                print(f"  - {error}")
    except Exception as e:
        print(f"Error converting to model: {e}")

Donor ID: 235183
Donor Type: DonneurType.INCONNU
Age: 64
Sex: SexeType.HOMME

Validation Result: Valid
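
The rules applied by `donneur.validate()` are not shown in this guide. If you want an additional, independent plausibility check on the raw dictionary, something like the following could be used. It is only a sketch: the function name `check_donor_ranges` and the numeric bounds are assumptions chosen for illustration, not values taken from the package.

```python
def check_donor_ranges(info):
    """Return a list of human-readable problems found in the raw donor dict."""
    # Illustrative bounds only (assumptions); adjust to your own rules.
    bounds = {"age": (0, 120), "taille": (30, 250), "poids": (1, 400)}
    problems = []
    for field, (low, high) in bounds.items():
        raw = info.get(field)
        try:
            value = float(raw)
        except (TypeError, ValueError):
            problems.append(f"{field}: missing or non-numeric ({raw!r})")
            continue
        if not low <= value <= high:
            problems.append(f"{field}: {value} outside [{low}, {high}]")
    return problems
```

For example, `check_donor_ranges(donor_data["informations_donneur"])` returns an empty list when age, height (`taille`), and weight (`poids`) all fall inside the chosen bounds.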


### Exporting Data to JSON and CSV
Now let's export the extracted data to JSON and CSV formats:

In [92]:
from pulmo_cristal.exporters import DonorJSONExporter, DonorCSVExporter
import os

# Create output directory
output_dir = "./output"
os.makedirs(output_dir, exist_ok=True)

if 'donor_data' in locals():
    # Initialize exporters
    json_exporter = DonorJSONExporter()
    csv_exporter = DonorCSVExporter()
    
    try:
        # Export to JSON (per the exporter log below, the file actually
        # written has a timestamp appended to its name)
        json_path = os.path.join(output_dir, "donor_data.json")
        json_exporter.export_json([donor_data], json_path)
        print(f"Exported JSON to: {json_path}")
        
        # Export to CSV
        csv_path = os.path.join(output_dir, "donor_data.csv")
        csv_exporter.export_csv([donor_data], csv_path)
        print(f"Exported CSV to: {csv_path}")
    except Exception as e:
        print(f"Error exporting data: {e}")

2025-04-08 15:19:24,336 - DonorJSONExporter - INFO - JSON file generated successfully: output/donor_data_20250408_151924.json
2025-04-08 15:19:24,338 - DonorCSVExporter - INFO - CSV file generated successfully: output/donor_data_20250408_151924.csv


Exported JSON to: ./output/donor_data.json
Exported CSV to: ./output/donor_data.csv
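
Note that the exporter log reports timestamped file names (for example `donor_data_20250408_151924.json`), while the `print` calls show the path that was passed in. To locate the file that was actually written, a helper along these lines may be useful. It is a sketch: `latest_export` is a hypothetical name, and the `<stem>_<timestamp><suffix>` pattern is inferred from the log lines above only.

```python
from pathlib import Path


def latest_export(output_dir, stem, suffix):
    """Return the newest file named <stem>_<timestamp><suffix>, or None.

    Timestamps of the form YYYYMMDD_HHMMSS sort chronologically when
    sorted as plain strings, so the last sorted match is the newest.
    """
    candidates = sorted(Path(output_dir).glob(f"{stem}_*{suffix}"))
    return candidates[-1] if candidates else None
```

For instance, `latest_export("./output", "donor_data", ".json")` returns the most recent JSON export, or `None` if nothing matches.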


### Processing Multiple Files
To process all of your existing PDFs in batches:

In [93]:
from pulmo_cristal.utils import batch_process_files
import time

if pdf_files:
    # Initialize extractors
    donor_extractor = DonorPDFExtractor()
    hla_extractor = HLAExtractor()
    
    # Initialize exporters
    json_exporter = DonorJSONExporter()
    
    # Process in batches
    all_data = []
    successful = 0
    failed = 0
    start_time = time.time()
    
    # Define batch size
    batch_size = 5  # Adjust based on your needs
    
    for batch_idx, batch in enumerate(batch_process_files(pdf_files, batch_size=batch_size)):
        print(f"\nProcessing batch {batch_idx+1}/{(len(pdf_files) + batch_size - 1) // batch_size}")
        batch_data = []
        
        for pdf_file in batch:
            try:
                # Extract donor data
                donor_data = donor_extractor.extract_donor_data(pdf_file)
                
                # Extract HLA data
                hla_data, status = hla_extractor.extract_hla_data(pdf_file)
                donor_data["informations_donneur"]["hla"] = hla_data
                donor_data["informations_donneur"]["hla_extraction_status"] = status
                
                # Add to batch
                batch_data.append(donor_data)
                successful += 1
                print(f"✓ Processed: {pdf_file.name}")
            except Exception as e:
                failed += 1
                print(f"✗ Error processing {pdf_file.name}: {e}")
        
        # Add batch to all data
        all_data.extend(batch_data)
        
        # Save intermediate results
        if batch_data and (batch_idx + 1) % 2 == 0:  # Save every 2 batches
            interim_path = os.path.join(output_dir, f"donors_interim_batch_{batch_idx+1}.json")
            try:
                json_exporter.export_json(all_data, interim_path)
                print(f"Saved interim results to: {interim_path}")
            except Exception as e:
                print(f"Error saving interim results: {e}")
    
    end_time = time.time()
    elapsed_time = end_time - start_time
    
    print("\nProcessing complete:")
    print(f"  - Total files: {len(pdf_files)}")
    print(f"  - Successful: {successful}")
    print(f"  - Failed: {failed}")
    print(f"  - Total time: {elapsed_time:.2f} seconds")
    print(f"  - Avg time per file: {elapsed_time/len(pdf_files):.2f} seconds")
    
    # Export all data
    if all_data:
        final_json_path = os.path.join(output_dir, "all_donors.json")
        final_csv_path = os.path.join(output_dir, "all_donors.csv")
        
        try:
            # Export to JSON
            json_exporter.export_json(all_data, final_json_path)
            print(f"Exported all data to JSON: {final_json_path}")
            
            # Export to CSV
            csv_exporter = DonorCSVExporter()
            csv_exporter.export_csv(all_data, final_csv_path)
            print(f"Exported all data to CSV: {final_csv_path}")
        except Exception as e:
            print(f"Error exporting final data: {e}")
else:
    print("No PDF files found to process")

2025-04-08 15:19:24,347 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/patient_test_1.pdf
2025-04-08 15:19:24,351 - DonorPDFExtractor - INFO - PDF contains 22 pages
2025-04-08 15:19:24,367 - DonorPDFExtractor - INFO - Page 1/22: Extracted 1308 characters
2025-04-08 15:19:24,369 - DonorPDFExtractor - INFO - Page 2/22: Extracted 147 characters
2025-04-08 15:19:24,376 - DonorPDFExtractor - INFO - Page 3/22: Extracted 1549 characters
2025-04-08 15:19:24,385 - DonorPDFExtractor - INFO - Page 4/22: Extracted 786 characters
2025-04-08 15:19:24,389 - DonorPDFExtractor - INFO - Page 5/22: Extracted 425 characters
2025-04-08 15:19:24,393 - DonorPDFExtractor - INFO - Page 6/22: Extracted 663 characters
2025-04-08 15:19:24,398 - DonorPDFExtractor - INFO - Page 7/22: Extracted 608 characters
2025-04-08 15:19:24,402 - DonorPDFExtractor - INFO - Page 8/22: Extracted 380 characters
2025-04-08 15:19:24,412 - DonorPDFExtractor - INFO - Page 9/22: Extracted 1357 characters
2025-04-08 


Processing batch 1/1


2025-04-08 15:19:24,533 - DonorPDFExtractor - INFO - Page 22/22: Extracted 183 characters
2025-04-08 15:19:24,533 - DonorPDFExtractor - INFO - Total extracted text: 17009 characters
2025-04-08 15:19:24,542 - HLAExtractor - INFO - Extracting HLA data from sample_pdfs/patient_test_1.pdf using original approach
2025-04-08 15:19:24,713 - HLAExtractor - ERROR - Error in HLA extraction: 'NoneType' object has no attribute 'read_pdf'
2025-04-08 15:19:24,713 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/patient_test_2.pdf
2025-04-08 15:19:24,717 - DonorPDFExtractor - INFO - PDF contains 23 pages
2025-04-08 15:19:24,746 - DonorPDFExtractor - INFO - Page 1/23: Extracted 975 characters
2025-04-08 15:19:24,750 - DonorPDFExtractor - INFO - Page 2/23: Extracted 642 characters
2025-04-08 15:19:24,759 - DonorPDFExtractor - INFO - Page 3/23: Extracted 771 characters
2025-04-08 15:19:24,768 - DonorPDFExtractor - INFO - Page 4/23: Extracted 1198 characters
2025-04-08 15:19:24,772 - Do

✓ Processed: patient_test_1.pdf


2025-04-08 15:19:25,087 - HLAExtractor - ERROR - Error in HLA extraction: 'NoneType' object has no attribute 'read_pdf'
2025-04-08 15:19:25,087 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/subfolder_test/patient_test_2.pdf
2025-04-08 15:19:25,091 - DonorPDFExtractor - INFO - PDF contains 23 pages
2025-04-08 15:19:25,105 - DonorPDFExtractor - INFO - Page 1/23: Extracted 975 characters
2025-04-08 15:19:25,109 - DonorPDFExtractor - INFO - Page 2/23: Extracted 642 characters
2025-04-08 15:19:25,118 - DonorPDFExtractor - INFO - Page 3/23: Extracted 771 characters
2025-04-08 15:19:25,127 - DonorPDFExtractor - INFO - Page 4/23: Extracted 1198 characters
2025-04-08 15:19:25,131 - DonorPDFExtractor - INFO - Page 5/23: Extracted 366 characters
2025-04-08 15:19:25,136 - DonorPDFExtractor - INFO - Page 6/23: Extracted 610 characters
2025-04-08 15:19:25,143 - DonorPDFExtractor - INFO - Page 7/23: Extracted 806 characters
2025-04-08 15:19:25,153 - DonorPDFExtractor - INFO - Pag

✓ Processed: patient_test_2.pdf


2025-04-08 15:19:25,447 - HLAExtractor - ERROR - Error in HLA extraction: 'NoneType' object has no attribute 'read_pdf'
2025-04-08 15:19:25,448 - DonorJSONExporter - INFO - JSON file generated successfully: output/all_donors_20250408_151925.json
2025-04-08 15:19:25,449 - DonorCSVExporter - INFO - CSV file generated successfully: output/all_donors_20250408_151925.csv


✓ Processed: patient_test_2.pdf

Processing complete:
  - Total files: 3
  - Successful: 3
  - Failed: 0
  - Total time: 1.10 seconds
  - Avg time per file: 0.37 seconds
Exported all data to JSON: ./output/all_donors.json
Exported all data to CSV: ./output/all_donors.csv
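
The batching logic in the loop above can be illustrated with a standalone sketch. The names `chunked` and `batch_count` are hypothetical, and this is not necessarily how `batch_process_files` is implemented; it only shows the general idea of yielding fixed-size slices, together with the ceiling division used to print the total number of batches.

```python
def chunked(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def batch_count(n_items, batch_size):
    """Number of batches needed for n_items (ceiling division)."""
    return (n_items + batch_size - 1) // batch_size


print(list(chunked(["a", "b", "c"], 5)))  # prints [['a', 'b', 'c']]
print(batch_count(3, 5))                  # prints 1
```

With 3 files and a batch size of 5 there is a single batch, which matches the `Processing batch 1/1` line in the output. It also means the interim save, which only runs on every second batch, is never triggered in this run.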


### Using the Command Line Interface with Your Existing PDFs
The package also provides a command-line interface, which may be more convenient for batch processing:

In [94]:
# Extract data from all PDFs using the CLI
!pulmo-cristal extract --input {pdf_dir} --output {output_dir} --format both --recursive

2025-04-08 15:19:26 - pulmo_cristal - ERROR - Error in HLA extraction: 'NoneType' object has no attribute 'read_pdf'
2025-04-08 15:19:26 - pulmo_cristal - ERROR - Error in HLA extraction: 'NoneType' object has no attribute 'read_pdf'
2025-04-08 15:19:26 - pulmo_cristal - ERROR - Error in HLA extraction: 'NoneType' object has no attribute 'read_pdf'
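
The repeated `'NoneType' object has no attribute 'read_pdf'` message is the error Python raises when `read_pdf` is looked up on `None`. One plausible cause, which is an assumption and not something this guide's output confirms, is that an optional table-extraction module such as `camelot` failed to import in this environment. The sketch below checks whether a module can be imported and reports the reason if it cannot; `check_import` is a hypothetical helper name.

```python
import importlib


def check_import(module_name):
    """Try to import a module; return (ok, detail).

    `detail` is the module's __version__ (or "unknown") on success,
    or the exception type and message on failure.
    """
    try:
        module = importlib.import_module(module_name)
    except Exception as exc:  # report any import-time failure
        return False, f"{type(exc).__name__}: {exc}"
    return True, str(getattr(module, "__version__", "unknown"))


ok, detail = check_import("camelot")
print(f"camelot importable: {ok} ({detail})")
```

If the import fails, the detail string usually names the missing dependency, which is a reasonable starting point before re-running the extraction.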


In [95]:
# List PDF files in your directory
!pulmo-cristal list --input {pdf_dir} --recursive

sample_pdfs/
├── subfolder_test/
│   └── patient_test_2.pdf
├── patient_test_1.pdf
└── patient_test_2.pdf

Total: 1 directories, 3 files
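
The `!` prefix used above only works inside IPython or Jupyter. From a plain Python script, the same `extract` command could be launched with `subprocess`, as in the sketch below. The helper names `build_extract_command` and `run_extract` are hypothetical; the flags are the ones used in the CLI cell above.

```python
import subprocess


def build_extract_command(input_dir, output_dir, fmt="both", recursive=True):
    """Build the argument list for `pulmo-cristal extract`."""
    cmd = [
        "pulmo-cristal", "extract",
        "--input", str(input_dir),
        "--output", str(output_dir),
        "--format", fmt,
    ]
    if recursive:
        cmd.append("--recursive")
    return cmd


def run_extract(input_dir, output_dir, fmt="both", recursive=True):
    """Run the CLI and return the CompletedProcess (requires the package)."""
    cmd = build_extract_command(input_dir, output_dir, fmt, recursive)
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

Passing the arguments as a list avoids shell quoting problems with directory names that contain spaces. After `result = run_extract("./sample_pdfs", "./output")`, inspect `result.returncode` and `result.stderr` for any logged errors.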
