# Data Extraction and Summarization of PDFs (Medical Reports)

This notebook demonstrates how to extract and summarize data from PDF medical reports, such as chest X-ray reports related to the ChestMNIST dataset. We use Optical Character Recognition (OCR) to read PDFs, prompt Large Language Models (LLMs) for summarization and entity extraction, perform diagnosis classification, and flag uncertain or missing data.

**Author**: Mohammad Rezapourian
<br>
**Date**: May 15, 2025
<br>
**License**: Apache-2.0

## Table of Contents
1. [Initial Setup](#initial-setup)
   - [Setup for Google Colab](#setup-for-google-colab)
   - [Setup for Offline Use](#setup-for-offline-use)
2. [PDF Reading via OCR](#pdf-reading-via-ocr)
   - [Loading and Processing PDFs](#loading-and-processing-pdfs)
3. [PDF Summarization](#pdf-summarization)
   - [Prompting for Summarization](#prompting-for-summarization)
   - [Exercise: Summarize an Unseen PDF](#exercise-summarize-an-unseen-pdf)
4. [Entity Extraction](#entity-extraction)
   - [Prompting for Entity Extraction](#prompting-for-entity-extraction)
   - [Collecting Entities into a Table](#collecting-entities-into-a-table)
   - [Exercise: Extract Entities from a New PDF](#exercise-extract-entities-from-a-new-pdf)
5. [Diagnosis Classification](#diagnosis-classification)
   - [Prompting for Glioblastoma Diagnosis](#prompting-for-glioblastoma-diagnosis)
   - [Expanding the Entity Table](#expanding-the-entity-table)
   - [Exercise: Evaluate Classification Results](#exercise-evaluate-classification-results)
6. [Flagging Uncertain Extractions and Missing Values](#flagging-uncertain-extractions-and-missing-values)
   - [Prompting for Uncertainty Detection](#prompting-for-uncertainty-detection)
7. [Conclusion](#conclusion)
   - [References](#references)

## Initial Setup

Set up the environment for running the notebook in Google Colab or locally. We’ll install libraries for PDF processing, OCR, and data analysis.

### Setup for Google Colab
<u>Execute these code blocks only in Google Colab!</u>

In [None]:
!apt-get install -y -q poppler-utils tesseract-ocr
!pip install -q pdf2image pytesseract pandas numpy matplotlib seaborn

In [None]:
import os
import sys
from google.colab import output
output.enable_custom_widget_manager()
%matplotlib inline
import pdf2image
import pytesseract
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import re

### Setup for Offline Use

Ensure `poppler-utils` and `tesseract-ocr` are installed on your system. For example, on Ubuntu:
```bash
sudo apt-get install poppler-utils tesseract-ocr
```
On macOS:
```bash
brew install poppler tesseract
```
The Python packages are the same as in the Colab setup and can be installed with `pip install pdf2image pytesseract pandas numpy matplotlib seaborn`.

In [None]:
%matplotlib inline
import pdf2image
import pytesseract
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
import re

## PDF Reading via OCR

We’ll read PDF medical reports using OCR to extract text, assuming the PDFs contain chest X-ray reports with patient details and findings.

### Loading and Processing PDFs

We’ll convert PDFs to images and apply OCR to extract text. For demonstration, assume a sample PDF `report1.pdf` in `./Data/Reports/`. Users should replace it with their own PDF.

In [None]:
# Function to read PDF and extract text
def read_pdf_ocr(pdf_path):
    try:
        # Convert PDF to images
        images = pdf2image.convert_from_path(pdf_path)
        text = ''
        for img in images:
            # Apply OCR
            text += pytesseract.image_to_string(img) + '\n'
        return text
    except Exception as e:
        print(f'Error reading {pdf_path}: {e}')
        return None

# Example PDF
pdf_path = './Data/Reports/report1.pdf'
text = read_pdf_ocr(pdf_path)
if text:
    print('Extracted text (first 500 characters):')
    print(text[:500])
else:
    print('Failed to extract text. Please check the PDF path and OCR setup.')

**Note**: Replace `report1.pdf` with your PDF file. Ensure the `./Data/Reports/` directory exists. If no PDF is available, use the sample text below for testing:
```
Patient Report
Name: John Doe
Age: 55
Gender: Male
Date: 2025-01-10
Findings: Chest X-ray shows bilateral infiltrates and pleural effusion. No evidence of glioblastoma. Suspected pneumonia and pulmonary edema.
Diagnosis: Pneumonia, Effusion
```
Save this as `sample_report.txt` and modify the code to read it if needed.
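
One way to wire in this fallback is sketched below. The helper name `load_report_text` and the default path `sample_report.txt` are assumptions for illustration and are not part of the original notebook. The function returns the OCR text when it is non-empty and otherwise reads the saved text file.

```python
from pathlib import Path

def load_report_text(ocr_text, fallback_path='sample_report.txt'):
    """Return the OCR text if usable, else the contents of a plain-text fallback file."""
    # Use the OCR result when it contains anything besides whitespace
    if ocr_text and ocr_text.strip():
        return ocr_text
    # Otherwise fall back to the saved sample report (hypothetical path)
    path = Path(fallback_path)
    if path.is_file():
        return path.read_text(encoding='utf-8')
    # Neither source is available
    return None
```

In the cell above, this could be used as `text = load_report_text(read_pdf_ocr(pdf_path))`, so the later sections run even without a PDF.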

## PDF Summarization

We’ll prompt an LLM to summarize the PDF content, extracting key information like age, gender, and report summary.

### Prompting for Summarization

**Prompt Template**:
```
Summarize the following medical report. Extract the patient's age, gender, and a concise summary of the findings. Format the output as:
- Age: [value]
- Gender: [value]
- Summary: [text]

Report text:
[insert extracted text]
```

**Simulated LLM Response** (for `sample_report.txt`):
```plaintext
- Age: 55
- Gender: Male
- Summary: The chest X-ray indicates bilateral infiltrates and pleural effusion, suggestive of pneumonia and pulmonary edema. No glioblastoma observed.
```

In [None]:
# Function to simulate LLM summarization
def summarize_report(text):
    # Simulated LLM logic (replace with actual LLM call if available)
    summary = {'Age': None, 'Gender': None, 'Summary': ''}
    # Extract age
    age_match = re.search(r'Age:\s*(\d+)', text, re.IGNORECASE)
    if age_match:
        summary['Age'] = int(age_match.group(1))
    # Extract gender
    gender_match = re.search(r'Gender:\s*(Male|Female|Other)', text, re.IGNORECASE)
    if gender_match:
        summary['Gender'] = gender_match.group(1)
    # Extract findings (stop at the "Diagnosis:" label, not at the word "diagnosis" in prose)
    findings_match = re.search(r'Findings:\s*(.+?)(?:Diagnosis:|$)', text, re.IGNORECASE | re.DOTALL)
    if findings_match:
        summary['Summary'] = findings_match.group(1).strip()
    return summary

# Summarize sample report
if text:
    summary = summarize_report(text)
    print('Summary:')
    for key, value in summary.items():
        print(f'{key}: {value}')

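**Note**: `summarize_report` uses regular expressions as a stand-in for an LLM. To use a real model, send the prompt template above together with the extracted text and parse the formatted reply.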
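
A minimal sketch of that replacement, under stated assumptions: `call_llm` is a hypothetical callable that takes a prompt string and returns the model's reply as a string (no specific LLM client is implied), and the helper names `build_summary_prompt`, `parse_summary_response`, and `summarize_with_llm` are introduced here for illustration only. The parser expects the `- Key: value` format requested in the prompt template.

```python
import re

# Prompt template from this section, with a {report} placeholder for the OCR text
SUMMARY_PROMPT = (
    "Summarize the following medical report. Extract the patient's age, gender, "
    "and a concise summary of the findings. Format the output as:\n"
    "- Age: [value]\n"
    "- Gender: [value]\n"
    "- Summary: [text]\n\n"
    "Report text:\n{report}"
)

def build_summary_prompt(report_text):
    """Fill the prompt template with the OCR-extracted report text."""
    return SUMMARY_PROMPT.format(report=report_text.strip())

def parse_summary_response(response):
    """Parse '- Key: value' lines from an LLM reply into the summary dict."""
    summary = {'Age': None, 'Gender': None, 'Summary': ''}
    for line in response.splitlines():
        match = re.match(r'\s*-\s*(Age|Gender|Summary):\s*(.*)', line)
        if not match:
            continue
        key, value = match.group(1), match.group(2).strip()
        if key == 'Age':
            # Keep Age as an int when the reply contains a number
            digits = re.search(r'\d+', value)
            summary['Age'] = int(digits.group()) if digits else None
        elif value:
            summary[key] = value
    return summary

def summarize_with_llm(report_text, call_llm):
    """call_llm is a hypothetical callable: prompt string in, reply string out."""
    return parse_summary_response(call_llm(build_summary_prompt(report_text)))
```

Because the result has the same keys as `summarize_report`, the printing loop above works unchanged with either function.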

### Exercise: Summarize an Unseen PDF

**Task**: Apply the summarization function to a new PDF (`report2.pdf`) and verify the output.

**Code**:


In [None]:
# Your code here
new_pdf_path = './Data/Reports/report2.pdf'
new_text = read_pdf_ocr(new_pdf_path)
if new_text:
    new_summary = summarize_report(new_text)
    print('Summary of report2.pdf:')
    for key, value in new_summary.items():
        print(f'{key}: {value}')
else:
    print('Failed to read report2.pdf.')

**Solution**: The code reads a new PDF, extracts text via OCR, and summarizes it. Verify the output by checking the extracted age, gender, and summary against the PDF content. If no PDF is available, test with a new sample text.

## Entity Extraction

We’ll extract entities like patient name, diagnosis, and findings from the PDF and store them in a table.

### Prompting for Entity Extraction

**Prompt Template**:
```
Extract the following entities from the medical report:
- Patient Name
- Diagnosis
- Findings
- Date
Return the results in a JSON format:
{
  "Patient Name": "value",
  "Diagnosis": "value",
  "Findings": "value",
  "Date": "value"
}

Report text:
[insert extracted text]
```

**Simulated LLM Response**:
```json
{
  "Patient Name": "John Doe",
  "Diagnosis": "Pneumonia, Effusion",
  "Findings": "Bilateral infiltrates and pleural effusion. No glioblastoma.",
  "Date": "2025-01-10"
}
```
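
When a real LLM returns this JSON, the reply may be wrapped in extra prose or code fences, and keys may be missing. The sketch below shows one defensive way to parse such a reply into the same four-key dictionary used in this section. The function name `parse_entity_response` and the constant `ENTITY_KEYS` are assumptions for illustration.

```python
import json
import re

ENTITY_KEYS = ['Patient Name', 'Diagnosis', 'Findings', 'Date']

def parse_entity_response(response):
    """Parse an LLM reply that should contain one JSON object of entities."""
    entities = dict.fromkeys(ENTITY_KEYS)  # every key defaults to None
    # Grab the outermost {...} span in case the reply wraps the JSON in prose or fences
    match = re.search(r'\{.*\}', response, re.DOTALL)
    if not match:
        return entities
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        # Malformed JSON: report every entity as missing
        return entities
    for key in ENTITY_KEYS:
        value = data.get(key)
        # Keep only non-empty strings; anything else stays None (missing)
        if isinstance(value, str) and value.strip():
            entities[key] = value.strip()
    return entities
```

Returning `None` for absent or empty fields keeps the output compatible with the missing-value check later in the notebook.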

In [None]:
# Function to simulate entity extraction
def extract_entities(text):
    entities = {'Patient Name': None, 'Diagnosis': None, 'Findings': None, 'Date': None}
    # Name (match spaces and tabs only, so the capture stops at the end of the line)
    name_match = re.search(r'Name:[ \t]*([A-Za-z \t]+)', text, re.IGNORECASE)
    if name_match:
        entities['Patient Name'] = name_match.group(1).strip()
    # Diagnosis
    diag_match = re.search(r'Diagnosis:\s*(.+)', text, re.IGNORECASE)
    if diag_match:
        entities['Diagnosis'] = diag_match.group(1).strip()
    # Findings (stop at the "Diagnosis:" label, not at the word "diagnosis" in prose)
    findings_match = re.search(r'Findings:\s*(.+?)(?:Diagnosis:|$)', text, re.IGNORECASE | re.DOTALL)
    if findings_match:
        entities['Findings'] = findings_match.group(1).strip()
    # Date
    date_match = re.search(r'Date:\s*(\d{4}-\d{2}-\d{2})', text, re.IGNORECASE)
    if date_match:
        entities['Date'] = date_match.group(1)
    return entities

# Extract entities
if text:
    entities = extract_entities(text)
    print('Extracted entities:')
    print(entities)

### Collecting Entities into a Table

Store entities in a pandas DataFrame for further analysis.

In [None]:
# Create DataFrame
if text:
    entity_df = pd.DataFrame([entities])
    print('Entity table:')
    print(entity_df)
else:
    entity_df = pd.DataFrame(columns=['Patient Name', 'Diagnosis', 'Findings', 'Date'])

### Exercise: Extract Entities from a New PDF

**Task**: Extract entities from `report2.pdf` and add them to the DataFrame.

**Code**:


In [None]:
# Your code here
# Guard on new_text: extract_entities always returns a dict, and a None input raises a TypeError
if new_text:
    new_entities = extract_entities(new_text)
    new_entity_df = pd.DataFrame([new_entities])
    entity_df = pd.concat([entity_df, new_entity_df], ignore_index=True)
    print('Updated entity table:')
    print(entity_df)
else:
    print('Failed to extract entities from report2.pdf.')

**Solution**: The code extracts entities from the new PDF and appends them to the DataFrame. Verify the table for accuracy against the PDF content.

## Diagnosis Classification

Classify whether the patient has glioblastoma based on the findings.

### Prompting for Glioblastoma Diagnosis

**Prompt Template**:
```
Based on the findings in the medical report, does the patient have glioblastoma? Answer Yes or No.

Findings:
[insert findings]
```

**Simulated LLM Response**:
```plaintext
No
```
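
A real LLM may answer with variations such as `No.`, `**Yes**`, or a full sentence. The sketch below normalizes such replies to the `Yes`/`No` values stored in the table and returns `None` when the reply does not start with a clear answer. The function name `normalize_yes_no` is an assumption for illustration.

```python
import re

def normalize_yes_no(response):
    """Map a free-text LLM reply to 'Yes', 'No', or None when it is unclear."""
    if not response:
        return None
    # Skip leading punctuation or markup, then require a whole-word yes/no
    match = re.match(r'\W*(yes|no)\b', response.strip(), re.IGNORECASE)
    if not match:
        return None
    return match.group(1).capitalize()
```

Replies that map to `None` are natural candidates for the uncertainty flagging covered later.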

In [None]:
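# Function to simulate glioblastoma classification
def classify_glioblastoma(findings):
    # Simulated LLM logic: keyword match plus a simple negation check
    if not findings or 'glioblastoma' not in findings.lower():
        return 'No'
    # Negated mentions in the same sentence (e.g. "No evidence of glioblastoma") count as "No"
    negated = re.search(r'\b(?:no|not|without|negative for)\b[^.]*glioblastoma', findings.lower())
    return 'No' if negated else 'Yes'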

# Apply classification
if not entity_df.empty:
    entity_df['Glioblastoma'] = entity_df['Findings'].apply(classify_glioblastoma)
    print('Updated entity table with glioblastoma classification:')
    print(entity_df)

### Expanding the Entity Table

The table now includes a `Glioblastoma` column with Yes/No values.

### Exercise: Evaluate Classification Results

**Task**: Evaluate the glioblastoma classification by checking the findings for false positives/negatives.

**Code**:


In [None]:
# Your code here
if not entity_df.empty:
    print('Evaluation of glioblastoma classification:')
    for idx, row in entity_df.iterrows():
        print(f'Patient: {row["Patient Name"]}')
        print(f'Findings: {row["Findings"]}')
        print(f'Classified as Glioblastoma: {row["Glioblastoma"]}')
        print('Correct? (Manual check required)')
        print('-' * 50)
else:
    print('No data to evaluate.')

**Solution**: Manually review the findings and classification. For example, if findings mention "no glioblastoma," the classification should be "No." Flag any mismatches for further LLM prompt refinement.
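
Once the manual check is done, the comparison can be tallied in code. This is a minimal sketch, assuming the reviewer's answers are stored in a hypothetical `Manual Label` column with the same `Yes`/`No` values. That column and the function name `evaluate_classification` are not part of the original table, and the demo rows are made-up values for illustration only.

```python
import pandas as pd

def evaluate_classification(df, predicted_col='Glioblastoma', label_col='Manual Label'):
    """Count agreements and errors between predicted and manually assigned labels."""
    predicted = df[predicted_col] == 'Yes'
    actual = df[label_col] == 'Yes'
    return {
        'true_positive': int((predicted & actual).sum()),
        'false_positive': int((predicted & ~actual).sum()),
        'false_negative': int((~predicted & actual).sum()),
        'true_negative': int((~predicted & ~actual).sum()),
    }

# Illustrative rows only (not real results)
demo_df = pd.DataFrame({
    'Glioblastoma': ['Yes', 'No', 'No', 'Yes'],
    'Manual Label': ['No', 'No', 'Yes', 'Yes'],
})
print(evaluate_classification(demo_df))
```

False positives and false negatives identified this way point to the reports whose wording the prompt handles poorly.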

## Flagging Uncertain Extractions and Missing Values

Prompt the LLM to identify uncertain extractions or missing values in the entities.

### Prompting for Uncertainty Detection

**Prompt Template**:
```
Review the extracted entities from the medical report. Flag any uncertain extractions (e.g., ambiguous text) or missing values. Provide a list of issues in the format:
- Entity: [name], Issue: [description]

Entities:
[insert entities JSON]

Report text:
[insert extracted text]
```

**Simulated LLM Response**:
```plaintext
- Entity: Patient Name, Issue: None
- Entity: Diagnosis, Issue: None
- Entity: Findings, Issue: None
- Entity: Date, Issue: None
```
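
A reply in this format can be turned into structured records for filtering or storage. The sketch below parses each `- Entity: ..., Issue: ...` line and drops entries whose issue is `None`. The function name `parse_issue_lines` is an assumption for illustration.

```python
import re

def parse_issue_lines(response):
    """Turn '- Entity: X, Issue: Y' lines into a list of dicts, skipping 'None' issues."""
    issues = []
    for line in response.splitlines():
        match = re.match(r'\s*-\s*Entity:\s*(.+?),\s*Issue:\s*(.+)', line)
        if not match:
            continue
        entity, issue = match.group(1).strip(), match.group(2).strip()
        # 'None' means the LLM found no problem with this entity
        if issue.lower() not in ('none', 'none.', 'n/a'):
            issues.append({'Entity': entity, 'Issue': issue})
    return issues
```

For the simulated response above, every issue is `None`, so the function returns an empty list.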

In [None]:
# Function to flag uncertain or missing values
def flag_uncertain_entities(entities, text):
    issues = []
    for key, value in entities.items():
        if value is None:
            issues.append(f'Entity: {key}, Issue: Missing value')
        elif key in ['Patient Name', 'Diagnosis', 'Findings', 'Date']:
            # Check for ambiguous or short values
            if len(str(value)) < 3:
                issues.append(f'Entity: {key}, Issue: Potentially ambiguous (too short: {value})')
    return issues

# Flag issues
if text and entities:
    issues = flag_uncertain_entities(entities, text)
    print('Uncertainty and missing value issues:')
    if issues:
        for issue in issues:
            print(issue)
    else:
        print('No issues detected.')

## Conclusion

This notebook demonstrated data extraction and summarization from PDF medical reports using OCR, LLM prompt templates, and pandas. We summarized reports, extracted entities, classified glioblastoma, and flagged uncertain or missing values, with regular-expression functions standing in for the LLM calls. The exercises applied each step to a second report. The resulting DataFrame can feed further analysis, such as descriptive statistics or machine learning.

### References
- Tesseract OCR: https://github.com/tesseract-ocr/tesseract
- pdf2image: https://github.com/Belval/pdf2image
- Pandas Documentation: https://pandas.pydata.org/docs/
- ChestMNIST Context: https://www.nih.gov/news-events/news-releases/nih-clinical-center-provides-one-largest-publicly-available-chest-x-ray-datasets-scientific-community