# Using the pulmo-cristal Package

This guide shows how to use the pulmo-cristal package to extract and process data from donor PDF documents.


### Installation

In [5]:
# Uninstall any previous versions so the install below starts from a clean state
!pip uninstall pulmo-cristal -y
!pip uninstall camelot-py -y

Found existing installation: pulmo-cristal 0.1.0
Uninstalling pulmo-cristal-0.1.0:
  Successfully uninstalled pulmo-cristal-0.1.0
Found existing installation: camelot-py 1.0.0
Uninstalling camelot-py-1.0.0:
  Successfully uninstalled camelot-py-1.0.0


In [6]:
!pip install git+https://github.com/drci-foch/pulmo-cristal.git

Collecting git+https://github.com/drci-foch/pulmo-cristal.git
  Cloning https://github.com/drci-foch/pulmo-cristal.git to /private/var/folders/4l/60lgrf9s6sqbf6lcr70gbl2c0000gn/T/pip-req-build-03t6k98m
  Running command git clone --filter=blob:none --quiet https://github.com/drci-foch/pulmo-cristal.git /private/var/folders/4l/60lgrf9s6sqbf6lcr70gbl2c0000gn/T/pip-req-build-03t6k98m
  Resolved https://github.com/drci-foch/pulmo-cristal.git to commit d6f7befa4a445cfc6528900f48048347423b5f8c
  Preparing metadata (setup.py) ... done
Collecting camelot-py>=1.0.0 (from pulmo-cristal==0.1.0)
  Using cached camelot_py-1.0.0-py3-none-any.whl.metadata (9.4 kB)
Using cached camelot_py-1.0.0-py3-none-any.whl (66 kB)
Building wheels for collected packages: pulmo-cristal
  Building wheel for pulmo-cristal (setup.py) ... done
  Created wheel for pulmo-cristal: filename=pulmo_cristal-0.1.0-py3-none-any.whl size=47225 sha256=b23b928d9936633f032b4e24cdb396ce36bcd8fb7cbd7f4792f0a83cd1a5c

In [7]:
from pulmo_cristal.utils import find_pdf_files, list_directory_tree
import os

# Set the path to your directory containing existing PDFs
pdf_dir = "./sample_pdfs"  # Change this to your actual path

# Print directory tree to see the structure
tree = list_directory_tree(pdf_dir, max_depth=2, file_types=[".pdf"])
print(tree)

sample_pdfs/
├── subfolder_test/
│   ├── patient3.pdf
│   └── patient_test_2.pdf
├── 235350.LEGENDRE SURLIGNE BIS .pdf
├── patient1.pdf
├── patient2.pdf
├── patient_test_1.pdf
└── patient_test_2.pdf

Total: 1 directories, 7 files


In [8]:
# Find all PDF files in the directory
pdf_files = find_pdf_files(pdf_dir, recursive=True)
print(f"Found {len(pdf_files)} PDF files")

# Display the first few files
if pdf_files:
    print("\nSample of PDF files found:")
    for pdf in pdf_files[:5]:
        print(f"  - {pdf}")

Found 7 PDF files

Sample of PDF files found:
  - sample_pdfs/patient2.pdf
  - sample_pdfs/patient1.pdf
  - sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf
  - sample_pdfs/patient_test_1.pdf
  - sample_pdfs/patient_test_2.pdf
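
The listing above is not in alphabetical order, so selecting a file by position (as `pdf_files[2]` does in the next cell) depends on the order in which the file system returns entries. The following is a minimal sketch of recursive discovery with a stable sort and selection by name. `find_pdfs_sorted` and `pick_by_name` are hypothetical helpers written for this guide; they are not part of pulmo-cristal and do not describe how `find_pdf_files` is implemented.

```python
from pathlib import Path
from typing import List, Optional


def find_pdfs_sorted(root: str, recursive: bool = True) -> List[Path]:
    """Return PDF paths under root in a stable, sorted order (hypothetical helper)."""
    base = Path(root)
    candidates = base.rglob("*") if recursive else base.glob("*")
    # Compare suffixes case-insensitively so ".PDF" files are included too
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")


def pick_by_name(paths: List[Path], fragment: str) -> Optional[Path]:
    """Return the first path whose file name contains fragment, or None."""
    return next((p for p in paths if fragment in p.name), None)
```

With these helpers, a specific file could be selected with `pick_by_name(find_pdfs_sorted("./sample_pdfs"), "235350")` instead of relying on its index in the list.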


### Extracting Data from a PDF
Now let's extract data from our sample PDF:

In [9]:
pdf_files[2]

PosixPath('sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf')

In [10]:
from pulmo_cristal.extractors import DonorPDFExtractor, HLAExtractor
import json

# Select a sample PDF to process
sample_pdf = pdf_files[2] if pdf_files else None

if sample_pdf:
    print(f"Processing: {sample_pdf}")
    
    # Initialize extractors
    donor_extractor = DonorPDFExtractor()
    hla_extractor = HLAExtractor()
    
    try:
        # Extract donor data
        donor_data = donor_extractor.extract_donor_data(sample_pdf)
        
        # Extract HLA data
        hla_data, hla_status = hla_extractor.extract_hla_data(sample_pdf)
        
        # Add HLA data to donor data
        donor_data["informations_donneur"]["hla"] = hla_data
        donor_data["informations_donneur"]["hla_extraction_status"] = hla_status
        
        # Display extracted data
        print("\nExtracted Donor Information:")
        for key, value in donor_data["informations_donneur"].items():
            if key != "hla":  # Skip HLA for now
                print(f"  {key}: {value}")
        
        print("\nExtracted HLA Data:")
        for key, value in hla_data.items():
            print(f"  {key}: {value}")
    
    except Exception as e:
        print(f"Error extracting data: {e}")
else:
    print("No PDF files found to process")

2025-05-20 16:20:22,189 - DonorPDFExtractor - INFO - Extracting text from PDF: sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf
2025-05-20 16:20:22,195 - DonorPDFExtractor - INFO - PDF contains 23 pages
2025-05-20 16:20:22,212 - DonorPDFExtractor - INFO - Page 1/23: Extracted 975 characters
2025-05-20 16:20:22,218 - DonorPDFExtractor - INFO - Page 2/23: Extracted 642 characters
2025-05-20 16:20:22,228 - DonorPDFExtractor - INFO - Page 3/23: Extracted 771 characters
2025-05-20 16:20:22,239 - DonorPDFExtractor - INFO - Page 4/23: Ex

Processing: sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf


2025-05-20 16:20:22,435 - DonorPDFExtractor - INFO - Page 19/23: Extracted 951 characters
2025-05-20 16:20:22,447 - DonorPDFExtractor - INFO - Page 20/23: Extracted 804 characters
2025-05-20 16:20:22,450 - DonorPDFExtractor - INFO - Page 21/23: Extracted 734 characters
2025-05-20 16:20:22,454 - DonorPDFExtractor - INFO - Page 22/23: Extracted 243 characters
2025-05-20 16:20:22,458 - DonorPDFExtractor - INFO - Page 23/23: Extracted 382 characters
2025-05-20 16:20:22,458 - DonorPDFExtractor - INFO - Total extracted text: 16233 characters
2025-05-


Extracted Donor Information:
  num_cristal: 235350
  type_donneur: :Donneur prélevé, organe greffé
DDAC M3
  age: 33
  sexe: M
  groupe_sanguin: O  +
  date_naissance: 04/02/1992
  taille: 180
  poids: 80
  date_clampage: 20/02/2025
  heure_clampage: 13:04
  hla_extraction_status: OK

Extracted HLA Data:
  A1: 2
  A2: 3
  B1: 7
  B2: 27
  C1: 1
  C2: 7
  DR1: 103
  DR2: 13
  DQA: 5
  DQB: 6
  DP1: 4
  DP2: 0
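
Some raw values in the output above carry extraction residue: `type_donneur` starts with a colon and spans two lines, and `groupe_sanguin` contains a double space. The following is a minimal sketch of a normalization pass that could be applied to the raw dictionary before further processing. `clean_field` and `clean_record` are hypothetical helpers, not part of pulmo-cristal, and this guide does not establish whether the package's own model conversion would benefit from such a step.

```python
import re
from typing import Any, Dict


def clean_field(value: Any) -> Any:
    """Strip leading colons and collapse whitespace runs in string values."""
    if not isinstance(value, str):
        return value  # numbers, None, etc. are returned unchanged
    value = re.sub(r"^[\s:]+", "", value)      # drop leading colons and spaces
    return re.sub(r"\s+", " ", value).strip()  # newlines, double spaces -> one space


def clean_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Apply clean_field to every value, recursing into nested dictionaries."""
    return {
        key: clean_record(value) if isinstance(value, dict) else clean_field(value)
        for key, value in record.items()
    }


# Values copied from the output above
raw = {
    "type_donneur": ":Donneur prélevé, organe greffé\nDDAC M3",
    "groupe_sanguin": "O  +",
    "age": 33,
}
print(clean_record(raw))
# {'type_donneur': 'Donneur prélevé, organe greffé DDAC M3', 'groupe_sanguin': 'O +', 'age': 33}
```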


### Converting to a Structured Model
Convert the raw extracted data to a structured Donneur model:

In [11]:
from pulmo_cristal.models import Donneur

if 'donor_data' in locals():
    try:
        # Convert to Donneur model
        donneur = Donneur.from_dict(donor_data)
        
        # Display basic model information
        print(f"Donor ID: {donneur.id}")
        print(f"Donor Type: {donneur.type_donneur}")
        print(f"Age: {donneur.age}")
        print(f"Sex: {donneur.sexe}")
        
        # Validate the model
        is_valid = donneur.validate()
        print(f"\nValidation Result: {'Valid' if is_valid else 'Invalid'}")
        if not is_valid and hasattr(donneur, 'validation_errors'):
            print("Validation Errors:")
            for error in donneur.validation_errors:
                print(f"  - {error}")
    except Exception as e:
        print(f"Error converting to model: {e}")

Donor ID: 235350
Donor Type: DonneurType.INCONNU
Age: 33
Sex: SexeType.HOMME

Validation Result: Valid


### Exporting Data to JSON and CSV
Now let's export the extracted data to JSON and CSV formats:

In [12]:
from pulmo_cristal.exporters import DonorJSONExporter, DonorCSVExporter
import os

# Create output directory
output_dir = "./output"
os.makedirs(output_dir, exist_ok=True)

if 'donor_data' in locals():
    # Initialize exporters
    json_exporter = DonorJSONExporter()
    csv_exporter = DonorCSVExporter()
    
    try:
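        # Export to JSON
        # Note: the exporter's log output shows a timestamp appended to the
        # file name, so the file written to disk is not exactly json_path
        json_path = os.path.join(output_dir, "donor_data.json")
        json_exporter.export_json([donor_data], json_path)
        print(f"Exported JSON to: {json_path}")
        
        # Export to CSV (the log output shows a timestamped file name here too)
        csv_path = os.path.join(output_dir, "donor_data.csv")
        csv_exporter.export_csv([donor_data], csv_path)
        print(f"Exported CSV to: {csv_path}")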
    except Exception as e:
        print(f"Error exporting data: {e}")

2025-05-20 16:20:22,767 - DonorJSONExporter - INFO - JSON file generated successfully: output/donor_data_20250520_162022.json
2025-05-20 16:20:22,770 - DonorCSVExporter - INFO - CSV file generated successfully: output/donor_data_20250520_162022.csv


Exported JSON to: ./output/donor_data.json
Exported CSV to: ./output/donor_data.csv
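
CSV is a flat format, while the donor record is nested (`informations_donneur` contains an `hla` dictionary). One common way to bridge the two is to join nested keys into column names. The sketch below illustrates that idea with the standard library only. `flatten` is a hypothetical helper, the column naming is an assumption, and none of this is claimed to be how `DonorCSVExporter` works internally. The sample values are taken from the output earlier in this guide.

```python
import csv
import io
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "", sep: str = "_") -> Dict[str, Any]:
    """Turn nested dictionaries into one flat dictionary with joined key names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))  # recurse into nested dict
        else:
            flat[name] = value
    return flat


record = {"informations_donneur": {"num_cristal": "235350", "age": 33, "hla": {"A1": 2, "A2": 3}}}
row = flatten(record)

# Write the flat row to an in-memory CSV to show the resulting header
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```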


In [13]:
import logging
from typing import Optional, Union, Tuple, List, Dict
from pathlib import Path
import re

In [14]:
try:
    import PyPDF2
except ImportError:
    PyPDF2 = None
    print("PyPDF2 is not installed. Install it (for example with: pip install PyPDF2) and re-run this cell.")

In [15]:
class BaseExtractor:
    """Base class for all extractors."""
    
    def __init__(self, logger: Optional[logging.Logger] = None):
        """Initialize the base extractor.
        
        Args:
            logger: Optional logger instance
        """
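        if logger is None:
            # Create a default logger. logging.getLogger returns the same object
            # for a given name, so attach a handler only once; otherwise every
            # new instance adds another handler and messages are printed repeatedly.
            self.logger = logging.getLogger(self.__class__.__name__)
            if not self.logger.handlers:
                handler = logging.StreamHandler()
                formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
                handler.setFormatter(formatter)
                self.logger.addHandler(handler)
            self.logger.setLevel(logging.INFO)
            # Do not pass records on to the root logger as well, which would
            # print them again once logging.basicConfig has been called
            self.logger.propagate = False
        else:
            self.logger = logger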
    
    def log(self, message: str, level: int = logging.INFO) -> None:
        """Log a message at the specified level."""
        self.logger.log(level, message)
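
Because `logging.getLogger(name)` returns the same logger object for a given name, a handler attached in every constructor call accumulates, and each message is then emitted once per attached handler. The following standalone sketch shows an idempotent setup. `get_logger` is a hypothetical helper used only for illustration, and the logger name `DemoExtractor` is made up.

```python
import logging


def get_logger(name: str, level: int = logging.INFO) -> logging.Logger:
    """Return a configured logger, attaching a stream handler only once."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # skip if a previous call already configured it
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False  # keep records away from root handlers
    return logger


first = get_logger("DemoExtractor")
second = get_logger("DemoExtractor")
print(first is second, len(first.handlers))  # True 1
```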

In [16]:
class PDFExtractor(BaseExtractor):
    """
    Extracts text content and sections from PDF documents.
    
    This class handles the basic PDF reading operations and text extraction,
    providing methods to access both the full text and specific page content.
    """
    def __init__(self, logger: Optional[logging.Logger] = None, debug: bool = False):
        """
        Initialize the PDF extractor.
        
        Args:
            logger: Optional logger instance
            debug: Enable debug mode for verbose logging
        """
        super().__init__(logger=logger)
        self.debug = debug
        self._extracted_text = None
        self._page_texts = []
        
        # Check if PyPDF2 is available
        if PyPDF2 is None:
            self.log("PyPDF2 is not installed. PDF extraction will not be available.", 
                     level=logging.ERROR)
            
    def extract_from_pdf(self, pdf_path: Union[str, Path]) -> Tuple[str, List[str]]:
        """
        Extract text content from a PDF file.
        
        Args:
            pdf_path: Path to the PDF file (str or Path object)
            
        Returns:
            Tuple containing:
                - Full text content as a single string
                - List of text content per page
                
        Raises:
            FileNotFoundError: If the PDF file doesn't exist
            PermissionError: If the PDF file can't be accessed
            ValueError: If PyPDF2 is not installed
        """
        if PyPDF2 is None:
            raise ValueError("PyPDF2 is not installed. Cannot extract PDF content.")
        
        # Reset stored text
        self._extracted_text = ""
        self._page_texts = []
        
        # Convert to Path object for better path handling
        path_obj = Path(pdf_path) if isinstance(pdf_path, str) else pdf_path
        
        try:
            self.log(f"Extracting text from PDF: {path_obj}")
            
            with open(path_obj, "rb") as file:
                try:
                    reader = PyPDF2.PdfReader(file)
                    num_pages = len(reader.pages)
                    self.log(f"PDF contains {num_pages} pages")
                    
                    # Extract text from each page
                    for i, page in enumerate(reader.pages):
                        try:
                            page_text = page.extract_text()
                            self._page_texts.append(page_text)
                            self.log(f"Page {i+1}/{num_pages}: Extracted {len(page_text)} characters")
                        except Exception as e:
                            self.log(f"Error extracting text from page {i+1}: {str(e)}", 
                                    level=logging.WARNING)
                            self._page_texts.append("")
                    
                    # Combine all pages with page separators
                    self._extracted_text = "\n\n".join(self._page_texts)
                    self.log(f"Total extracted text: {len(self._extracted_text)} characters")
                    
                except PyPDF2.errors.PdfReadError as e:
                    self.log(f"Error reading PDF structure: {str(e)}", level=logging.ERROR)
                    raise
                except Exception as e:
                    self.log(f"Unexpected error in PDF processing: {str(e)}", level=logging.ERROR)
                    raise
        
        except FileNotFoundError:
            self.log(f"PDF file not found: {path_obj}", level=logging.ERROR)
            raise
        except PermissionError:
            self.log(f"Permission denied when accessing: {path_obj}", level=logging.ERROR)
            raise
        except Exception as e:
            self.log(f"Error opening PDF file: {str(e)}", level=logging.ERROR)
            raise
            
        return self._extracted_text, self._page_texts

    def get_text(self) -> str:
        """
        Get the extracted text content.
        
        Returns:
            Full text content as a string
        """
        if self._extracted_text is None:
            self.log("No text has been extracted yet. Call extract_from_pdf first.", 
                     level=logging.WARNING)
            return ""
        
        return self._extracted_text
    
    def get_page_text(self, page_index: int) -> str:
        """
        Get the text content of a specific page.
        
        Args:
            page_index: 0-based index of the page
            
        Returns:
            Text content of the specified page
            
        Raises:
            IndexError: If the page index is out of range
        """
        if not self._page_texts:
            self.log("No text has been extracted yet. Call extract_from_pdf first.", 
                     level=logging.WARNING)
            return ""
        
        if page_index < 0 or page_index >= len(self._page_texts):
            raise IndexError(f"Page index {page_index} is out of range. PDF has {len(self._page_texts)} pages.")
        
        return self._page_texts[page_index]
    
    def save_text_to_file(self, output_path: Union[str, Path]) -> None:
        """
        Save the extracted text to a file.
        
        Args:
            output_path: Path where the text file will be saved
            
        Raises:
            ValueError: If no text has been extracted yet
        """
        if self._extracted_text is None:
            raise ValueError("No text has been extracted yet. Call extract_from_pdf first.")
        
        # Convert to Path object
        path_obj = Path(output_path) if isinstance(output_path, str) else output_path
        
        try:
            with open(path_obj, "w", encoding="utf-8") as f:
                f.write(self._extracted_text)
            
            self.log(f"Text saved to {path_obj}")
        
        except Exception as e:
            self.log(f"Error saving text to {path_obj}: {str(e)}", level=logging.ERROR)
            raise

    def find_sections(self, section_markers: Dict[str, str]) -> Dict[str, str]:
        """
        Find and extract sections from the PDF text based on section markers.
        
        Args:
            section_markers: Dictionary mapping section names to marker patterns
            
        Returns:
            Dictionary mapping section names to extracted text
        """
        if self._extracted_text is None:
            self.log("No text has been extracted yet. Call extract_from_pdf first.", 
                     level=logging.WARNING)
            return {}
        
        sections = {}
        
        for section_name, marker in section_markers.items():
            try:
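                # Wrap the marker in a non-capturing group: "|" has the lowest
                # precedence, so an ungrouped marker such as "A|B" would attach
                # the capture group to its last alternative only
                pattern = re.compile(r"(?:" + marker + r")(.*?)(?=(?:" + "|".join(section_markers.values()) + r")|$)",
                                     re.IGNORECASE | re.DOTALL)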
                matches = pattern.findall(self._extracted_text)
                
                if matches:
                    sections[section_name] = matches[0].strip()
                    if self.debug:
                        self.log(f"Found section '{section_name}' with {len(sections[section_name])} characters")
                else:
                    sections[section_name] = ""
                    if self.debug:
                        self.log(f"Section '{section_name}' not found")
            
            except re.error as e:
                self.log(f"Error in regex pattern for section '{section_name}': {str(e)}", 
                         level=logging.ERROR)
                sections[section_name] = ""
        
        return sections


In [17]:
logging.basicConfig(level=logging.INFO, 
                   format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

# Test PDF extraction (replace with an actual PDF path)
def test_pdf_extractor():
    # Create an instance of PDFExtractor with debug mode enabled
    extractor = PDFExtractor(debug=True)
    
    # Define test PDF path - replace with your actual PDF file path
    pdf_path = "/Users/sarrabenyahia/Documents/GitHub/pulmo-cristal/sandbox/sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf"  # Update this with your actual PDF file path
    
    try:
        # Extract text from PDF
        print(f"Attempting to extract text from: {pdf_path}")
        full_text, page_texts = extractor.extract_from_pdf(pdf_path)
        
        # Print some statistics
        print("\nExtraction Results:")
        print(f"Total pages: {len(page_texts)}")
        print(f"Total characters: {len(full_text)}")
        
        # Print first 100 characters of the first page (if available)
        if page_texts and page_texts[0]:
            print("\nFirst 100 characters of first page:")
            print(page_texts[0][:100] + "...")
        
        # Test getting a specific page
        if len(page_texts) > 1:
            page_idx = 1  # Second page (0-indexed)
            page_text = extractor.get_page_text(page_idx)
            print(f"\nPage {page_idx+1} has {len(page_text)} characters")
        
        # Test section finding
        section_markers = {
            "introduction": r"Introduction|INTRODUCTION|1\.\s+Introduction",
            "methodology": r"Methodology|METHODOLOGY|Method|METHODS",
            "results": r"Results|RESULTS|Findings|FINDINGS",
            "conclusion": r"Conclusion|CONCLUSION|Summary|SUMMARY"
        }
        
        sections = extractor.find_sections(section_markers)
        print("\nSections found:")
        for section_name, content in sections.items():
            if content:
                preview = content[:50].replace("\n", " ") + "..."
                print(f"- {section_name}: {preview}")
            else:
                print(f"- {section_name}: Not found")
        
        # Test saving to file
        # output_path = "extracted_text.txt"
        # extractor.save_text_to_file(output_path)
        
    except FileNotFoundError:
        print("Test file not found. Please update the pdf_path variable with a valid path.")
    except Exception as e:
        print(f"Error during testing: {str(e)}")

In [18]:
test_pdf_extractor()

2025-05-20 16:20:22,883 - PDFExtractor - INFO - Extracting text from PDF: /Users/sarrabenyahia/Documents/GitHub/pulmo-cristal/sandbox/sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf
2025-05-20 16:20:22,893 - PDFExtractor - INFO - PDF contains 23 pages
2025-05-20 16:20:22,912 - PDFExtractor - INFO - Page 1/23: Extracted 975 characters
2025-05-20 16:20:22,918 - PDFExtractor - INFO - Page 2/23: Extracted 642 characters
2025-05-20 16:20:22,928 - PDFExtractor - INFO - Page 3/23: Extracted 771 characters

Attempting to extract text from: /Users/sarrabenyahia/Documents/GitHub/pulmo-cristal/sandbox/sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf


2025-05-20 16:20:22,944 - PDFExtractor - INFO - Page 5/23: Extracted 366 characters
2025-05-20 16:20:22,951 - PDFExtractor - INFO - Page 6/23: Extracted 610 characters
2025-05-20 16:20:22,959 - PDFExtractor - INFO - Page 7/23: Extracted 806 characters
2025-05-20 16:20:22,972 - PDFExtractor - INFO - Page 8/23: Extracted 1097 characters
2025-05-20 16:20:22,986 - PDFExtractor - INFO - Page 9/23: Extracted 1269 characters
2025-05-20 16:20:22,996 - PDFExtractor - INFO - Page 10/23: Extracted 617 characters


Extraction Results:
Total pages: 23
Total characters: 16233

First 100 characters of first page:
Dossier du donneur
N° CRISTAL 235350
Le donneur est :Donneur prélevé, organe greffé
DDAC M3Site de d...

Page 2 has 642 characters

Sections found:
- introduction: Not found
- methodology: Not found
- results: Not found
- conclusion: Not found
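
The markers in this test target an English report layout (Introduction, Methodology, and so on), so none of them occur in the French donor file. A separate point matters when writing markers that contain `|`: alternation has the lowest precedence in a regular expression, so a marker that is concatenated with a capture group should first be wrapped in a non-capturing group. The sketch below demonstrates the difference on made-up text; the marker and the sample sentence are illustrative and not taken from a donor file.

```python
import re

text = "Conclusion: donor accepted. Notes: none."
marker = r"Conclusion|Summary"

# Ungrouped: the pattern reads as "Conclusion" OR "Summary:\s*(.*?)...",
# so a match on "Conclusion" leaves the capture group empty.
ungrouped = re.compile(marker + r":\s*(.*?)(?=Notes|$)", re.DOTALL)

# Grouped: the capture group follows whichever alternative matched.
grouped = re.compile(r"(?:" + marker + r"):\s*(.*?)(?=Notes|$)", re.DOTALL)

print(ungrouped.findall(text))                     # ['']
print([m.strip() for m in grouped.findall(text)])  # ['donor accepted.']
```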


In [19]:
# Test function for PDF extraction to text file

def test_pdf_extraction_to_txt():
    """
    Simple test function that demonstrates extracting text from a PDF
    and saving it to a text file.
    """
    import logging
    from pathlib import Path
    
    # Set up logging
    logging.basicConfig(level=logging.INFO, 
                       format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    
    # PDFExtractor is defined in an earlier cell; run the cells defining
    # BaseExtractor and PDFExtractor before this one
    
    # Create an instance of the PDFExtractor
    extractor = PDFExtractor(debug=True)
    
    # Define paths for input PDF and output text file
    pdf_path = "/Users/sarrabenyahia/Documents/GitHub/pulmo-cristal/sandbox/sample_pdfs/patient1.pdf"  # Replace with your actual PDF file path
    txt_output_path = "extracted_text.txt"
    
    try:
        print(f"Starting extraction from: {pdf_path}")
        
        # Extract text from the PDF
        _, _ = extractor.extract_from_pdf(pdf_path)
        
        # Get the full extracted text
        full_text = extractor.get_text()
        print(f"Successfully extracted {len(full_text)} characters from the PDF")
        
        # Save the extracted text to a file
        extractor.save_text_to_file(txt_output_path)
        print(f"Text successfully saved to: {txt_output_path}")
        
        # Print a preview of the extracted text (first 200 characters)
        if full_text:
            print("\nPreview of extracted text:")
            preview = full_text[:200].replace("\n", " ")
            print(f"{preview}...")
        
    except FileNotFoundError:
        print(f"Error: PDF file not found at {pdf_path}")
        print("Please update the pdf_path variable with a valid path to a PDF file")
    except Exception as e:
        print(f"Error during extraction: {str(e)}")

# Run the test function
test_pdf_extraction_to_txt()

2025-05-20 16:20:23,160 - PDFExtractor - INFO - Extracting text from PDF: /Users/sarrabenyahia/Documents/GitHub/pulmo-cristal/sandbox/sample_pdfs/patient1.pdf
2025-05-20 16:20:23,168 - PDFExtractor - INFO - PDF contains 21 pages
2025-05-20 16:20:23,191 - PDFExtractor - INFO - Page 1/21: Extracted 1377 characters
2025-05-20 16:20:23,193 - PDFExtractor - INFO - Page 2/21:

Starting extraction from: /Users/sarrabenyahia/Documents/GitHub/pulmo-cristal/sandbox/sample_pdfs/patient1.pdf


2025-05-20 16:20:23,330 - PDFExtractor - INFO - Page 14/21: Extracted 1097 characters
2025-05-20 16:20:23,335 - PDFExtractor - INFO - Page 15/21: Extracted 256 characters
2025-05-20 16:20:23,352 - PDFExtractor - INFO - Page 16/21: Extracted 808 characters
2025-05-20 16:20:23,362 - PDFExtractor - INFO - Page 17/21: Extracted 1624 characters

Successfully extracted 15729 characters from the PDF
Text successfully saved to: extracted_text.txt

Preview of extracted text:
Dossier du donneur N° CRISTAL 175394 Le donneur est Donneur prélevé, organe greffé Coeur Battant (SME)Site de décèsSite de diagnosticSite de prise en charge initialeFilière de prélèvementZIPR Date de ...


In [20]:
# Same test as in the previous cell, run on a different PDF.
# The extracted_text.txt it writes is the file read by the regex cell further down.
import re

def test_pdf_extraction_to_txt():
    """
    Simple test function that demonstrates extracting text from a PDF
    and saving it to a text file.
    """
    import logging
    from pathlib import Path
    
    # Set up logging
    logging.basicConfig(level=logging.INFO, 
                       format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    
    # PDFExtractor is defined in an earlier cell; run the cells defining
    # BaseExtractor and PDFExtractor before this one
    
    # Create an instance of the PDFExtractor
    extractor = PDFExtractor(debug=True)
    
    # Define paths for input PDF and output text file
    pdf_path = "/Users/sarrabenyahia/Documents/GitHub/pulmo-cristal/sandbox/sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf"  # Replace with your actual PDF file path
    txt_output_path = "extracted_text.txt"
    
    try:
        print(f"Starting extraction from: {pdf_path}")
        
        # Extract text from the PDF
        _, _ = extractor.extract_from_pdf(pdf_path)
        
        # Get the full extracted text
        full_text = extractor.get_text()
        print(f"Successfully extracted {len(full_text)} characters from the PDF")
        
        # Save the extracted text to a file
        extractor.save_text_to_file(txt_output_path)
        print(f"Text successfully saved to: {txt_output_path}")
        
        # Print a preview of the extracted text (first 200 characters)
        if full_text:
            print("\nPreview of extracted text:")
            preview = full_text[:200].replace("\n", " ")
            print(f"{preview}...")
        
    except FileNotFoundError:
        print(f"Error: PDF file not found at {pdf_path}")
        print("Please update the pdf_path variable with a valid path to a PDF file")
    except Exception as e:
        print(f"Error during extraction: {str(e)}")

# Run the test function
test_pdf_extraction_to_txt()

2025-05-20 16:20:23,404 - PDFExtractor - INFO - Extracting text from PDF: /Users/sarrabenyahia/Documents/GitHub/pulmo-cristal/sandbox/sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf
2025-05-20 16:20:23,415 - PDFExtractor - INFO - PDF contains 23 pages


Starting extraction from: /Users/sarrabenyahia/Documents/GitHub/pulmo-cristal/sandbox/sample_pdfs/235350.LEGENDRE SURLIGNE BIS .pdf


2025-05-20 16:20:23,595 - PDFExtractor - INFO - Page 14/23: Extracted 474 characters
2025-05-20 16:20:23,604 - PDFExtractor - INFO - Page 15/23: Extracted 348 characters
2025-05-20 16:20:23,612 - PDFExtractor - INFO - Page 16/23: Extracted 844 characters
2025-05-20 16:20:23,619 - PDFExtractor - INFO - Page 17/23: Extra

Successfully extracted 16233 characters from the PDF
Text successfully saved to: extracted_text.txt

Preview of extracted text:
Dossier du donneur N° CRISTAL 235350 Le donneur est :Donneur prélevé, organe greffé DDAC M3Site de décès :ZIPR Date de création du dossier Le dossier est Date de clôture Régulateur Etablissement de pr...
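
The two test cells above differ only in the input path. A small function that takes the paths as arguments avoids the duplication. The sketch below assumes only that the extractor object exposes `extract_from_pdf()` and `save_text_to_file()`, as the `PDFExtractor` class defined earlier in this notebook does. `extract_to_txt` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Union


def extract_to_txt(extractor, pdf_path: Union[str, Path], txt_path: Union[str, Path]) -> int:
    """Extract text with the given extractor, save it, and return its length.

    extractor is any object with extract_from_pdf() and save_text_to_file()
    methods, for example an instance of the PDFExtractor class defined above.
    """
    full_text, _ = extractor.extract_from_pdf(pdf_path)
    extractor.save_text_to_file(txt_path)
    return len(full_text)


# Possible usage in this notebook (requires the PDFExtractor cell to have run):
# for name in ("patient1", "patient2"):
#     n_chars = extract_to_txt(PDFExtractor(), f"./sample_pdfs/{name}.pdf", f"{name}.txt")
#     print(name, n_chars)
```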


In [22]:
import re


# Default flags for regex
DEFAULT_FLAGS = re.IGNORECASE | re.DOTALL

# Regex to capture the date (DD/MM/YYYY) that follows either "Date de clampage"
# or "Date et heure du constat de décès"
date_clampage_pattern = re.compile(
    r"(?:Date de clampage|Date et heure du constat de décès)\s*:?\s*(\d{2}/\d{2}/\d{4})",
    DEFAULT_FLAGS)

# Path of the text file written by the extraction cell above
txt_output_path = "extracted_text.txt"

try:
    with open(txt_output_path, "r", encoding="utf-8") as file:
        content = file.read()
    
    # Search for the pattern
    match = date_clampage_pattern.search(content)
    if match:
        print(f"Found date: {match.group(1)}")
    else:
        print("No date found in the file.")
except FileNotFoundError:
    print(f"File {txt_output_path} does not exist.")
    


Found date: 20/02/2025
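
The matched value is still a string. To compare or sort dates, it can be parsed into a `date` object. The sketch below assumes the day/month/year order implied by the pattern above. `parse_french_date` is a hypothetical helper and not part of pulmo-cristal.

```python
from datetime import date, datetime
from typing import Optional


def parse_french_date(text: str) -> Optional[date]:
    """Parse a DD/MM/YYYY string; return None if it is not a valid date."""
    try:
        return datetime.strptime(text.strip(), "%d/%m/%Y").date()
    except ValueError:
        return None


print(parse_french_date("20/02/2025"))  # 2025-02-20
print(parse_french_date("31/02/2025"))  # None (February has no 31st day)
```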
