# Schemas and Validation with Pandera

In medical data integration, data quality and consistency are critical for patient safety and regulatory compliance. Pandera is a Python library that provides schema validation and statistical data testing for pandas DataFrames. This notebook demonstrates how to define schemas, validate data, apply coercion, and set constraints for medical datasets.

First, let's import the necessary libraries and create a sample medical dataset that we'll use throughout this notebook.

In [1]:
import pandas as pd
import pandera.pandas as pa
# Import from pandera.pandas; importing these names from the top-level module is deprecated
from pandera.pandas import Column, DataFrameSchema, Check
import numpy as np
from datetime import datetime

# Create sample medical data
medical_data = pd.DataFrame({
    'patient_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'age': [45, 67, 23, 89, 34],
    'systolic_bp': [120, 140, 110, 160, 115],
    'diastolic_bp': [80, 90, 70, 95, 75],
    'heart_rate': [72, 68, 85, 62, 78],
    'temperature': [36.5, 37.2, 36.8, 38.1, 36.7],
    'admission_date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19'],
    'diagnosis_code': ['I10', 'E11.9', 'J44.1', 'N18.6', 'M79.3']
})

print(medical_data.dtypes)
medical_data.head()

patient_id         object
age                 int64
systolic_bp         int64
diastolic_bp        int64
heart_rate          int64
temperature       float64
admission_date     object
diagnosis_code     object
dtype: object


| | patient_id | age | systolic_bp | diastolic_bp | heart_rate | temperature | admission_date | diagnosis_code |
|---|---|---|---|---|---|---|---|---|
| 0 | P001 | 45 | 120 | 80 | 72 | 36.5 | 2024-01-15 | I10 |
| 1 | P002 | 67 | 140 | 90 | 68 | 37.2 | 2024-01-16 | E11.9 |
| 2 | P003 | 23 | 110 | 70 | 85 | 36.8 | 2024-01-17 | J44.1 |
| 3 | P004 | 89 | 160 | 95 | 62 | 38.1 | 2024-01-18 | N18.6 |
| 4 | P005 | 34 | 115 | 75 | 78 | 36.7 | 2024-01-19 | M79.3 |


Now let's define a basic schema for our medical data using Pandera. This schema will specify the expected data types for each column.

In [2]:
# Define a basic schema
basic_schema = DataFrameSchema({
    'patient_id': Column(str),
    'age': Column(int),
    'systolic_bp': Column(int),
    'diastolic_bp': Column(int),
    'heart_rate': Column(int),
    'temperature': Column(float),
    'admission_date': Column(str),
    'diagnosis_code': Column(str)
})

print("Basic schema defined successfully")
print(basic_schema)

Basic schema defined successfully
<Schema DataFrameSchema(
    columns={
        'patient_id': <Schema Column(name=patient_id, type=DataType(str))>
        'age': <Schema Column(name=age, type=DataType(int64))>
        'systolic_bp': <Schema Column(name=systolic_bp, type=DataType(int64))>
        'diastolic_bp': <Schema Column(name=diastolic_bp, type=DataType(int64))>
        'heart_rate': <Schema Column(name=heart_rate, type=DataType(int64))>
        'temperature': <Schema Column(name=temperature, type=DataType(float64))>
        'admission_date': <Schema Column(name=admission_date, type=DataType(str))>
        'diagnosis_code': <Schema Column(name=diagnosis_code, type=DataType(str))>
    },
    checks=[],
    parsers=[],
    coerce=False,
    dtype=None,
    index=None,
    strict=False,
    name=None,
    ordered=False,
    unique_column_names=False,
    metadata=None, 
    add_missing_columns=False
)>



Let's validate our medical data against the basic schema to ensure it conforms to the expected structure.

In [3]:
# Validate the data against the schema
try:
    validated_data = basic_schema.validate(medical_data)
    print("Data validation successful!")
    print(f"Validated {len(validated_data)} rows")
except pa.errors.SchemaError as e:
    print(f"Validation error: {e}")

Data validation successful!
Validated 5 rows


Now let's enhance our schema with medical constraints to ensure data quality. We'll add range checks for vital signs and other medical parameters.

In [4]:
# Define schema with medical constraints
medical_schema = DataFrameSchema({
    'patient_id': Column(str, checks=[
        Check.str_matches(r'^P\d{3}$')  # Pattern: P followed by 3 digits
    ]),
    'age': Column(int, checks=[
        Check.greater_than_or_equal_to(0),
        Check.less_than_or_equal_to(120)
    ]),
    'systolic_bp': Column(int, checks=[
        Check.in_range(70, 250)  # Reasonable range for systolic BP
    ]),
    'diastolic_bp': Column(int, checks=[
        Check.in_range(40, 150)  # Reasonable range for diastolic BP
    ]),
    'heart_rate': Column(int, checks=[
        Check.in_range(30, 200)  # Reasonable range for heart rate
    ]),
    'temperature': Column(float, checks=[
        Check.in_range(30.0, 45.0)  # Body temperature in Celsius
    ]),
    'admission_date': Column(str),
    'diagnosis_code': Column(str, checks=[
        Check.str_length(min_value=3, max_value=10)  # ICD-10 code length
    ])
})

print("Enhanced medical schema with constraints defined")

Enhanced medical schema with constraints defined


Let's validate our data against the enhanced schema to see if all medical constraints are satisfied.

In [5]:
# Validate against enhanced schema
try:
    validated_medical_data = medical_schema.validate(medical_data)
    print("Medical data validation successful!")
    print("All constraints satisfied")
except pa.errors.SchemaError as e:
    print(f"Validation error: {e}")

Medical data validation successful!
All constraints satisfied
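The patient ID pattern can be explored on its own. The sketch below uses the standard-library `re` module rather than Pandera, so it only illustrates which strings the regular expression accepts; the candidate IDs are made up for the example.

```python
import re

# Same pattern as the schema: "P" followed by exactly three digits
PATIENT_ID = re.compile(r"^P\d{3}$")

# Hypothetical candidate IDs
candidates = ["P001", "P1234", "p001", "INVALID", "P01"]

accepted = [c for c in candidates if PATIENT_ID.match(c)]
print(accepted)  # → ['P001']
```

The anchors `^` and `$` matter here: without `$`, an ID such as `P1234` would match on its first four characters.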


Now let's demonstrate data coercion by creating a dataset with mixed data types and showing how Pandera can automatically convert them to the expected types.

In [6]:
# Create data with mixed types that need coercion
messy_data = pd.DataFrame({
    'patient_id': ['P006', 'P007', 'P008'],
    'age': ['55', '42', '71'],  # String instead of int
    'systolic_bp': [130.0, 125.5, 145.2],  # Float instead of int
    'diastolic_bp': ['85', '82', '92'],  # String instead of int
    'heart_rate': [75, 80, 68],
    'temperature': ['37.1', '36.9', '37.5'],  # String instead of float
    'admission_date': ['2024-01-20', '2024-01-21', '2024-01-22'],
    'diagnosis_code': ['M25.5', 'K59.0', 'F32.9']
})

print("Original data types:")
print(messy_data.dtypes)
print("\nData:")
messy_data

Original data types:
patient_id         object
age                object
systolic_bp       float64
diastolic_bp       object
heart_rate          int64
temperature        object
admission_date     object
diagnosis_code     object
dtype: object

Data:


| | patient_id | age | systolic_bp | diastolic_bp | heart_rate | temperature | admission_date | diagnosis_code |
|---|---|---|---|---|---|---|---|---|
| 0 | P006 | 55 | 130.0 | 85 | 75 | 37.1 | 2024-01-20 | M25.5 |
| 1 | P007 | 42 | 125.5 | 82 | 80 | 36.9 | 2024-01-21 | K59.0 |
| 2 | P008 | 71 | 145.2 | 92 | 68 | 37.5 | 2024-01-22 | F32.9 |


We'll create a schema with coercion enabled to automatically convert the data types to the expected format.

In [7]:
# Schema with coercion
coercion_schema = DataFrameSchema({
    'patient_id': Column(str),
    'age': Column(int, coerce=True),  # Coerce string to int
    'systolic_bp': Column(int, coerce=True),  # Coerce float to int
    'diastolic_bp': Column(int, coerce=True),  # Coerce string to int
    'heart_rate': Column(int),
    'temperature': Column(float, coerce=True),  # Coerce string to float
    'admission_date': Column(str),
    'diagnosis_code': Column(str)
})

print("Coercion schema defined")

Coercion schema defined


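Let's apply the coercion schema to our messy data and observe how the data types are converted. Note that coercing floats to int truncates fractional values: in the output below, a systolic reading of 125.5 becomes 125 and 145.2 becomes 145, so this kind of coercion silently discards information.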

In [8]:
# Apply coercion
try:
    coerced_data = coercion_schema.validate(messy_data)
    print("Data coercion successful!")
    print("\nCoerced data types:")
    print(coerced_data.dtypes)
    print("\nCoerced data:")
    print(coerced_data)
except pa.errors.SchemaError as e:
    print(f"Coercion error: {e}")

Data coercion successful!

Coerced data types:
patient_id         object
age                 int64
systolic_bp         int64
diastolic_bp        int64
heart_rate          int64
temperature       float64
admission_date     object
diagnosis_code     object
dtype: object

Coerced data:
  patient_id  age  systolic_bp  diastolic_bp  heart_rate  temperature  \
0       P006   55          130            85          75         37.1   
1       P007   42          125            82          80         36.9   
2       P008   71          145            92          68         37.5   

  admission_date diagnosis_code  
0     2024-01-20          M25.5  
1     2024-01-21          K59.0  
2     2024-01-22          F32.9  
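The coerced systolic values above lost their fractional parts, which matches how pandas `astype(int)` behaves. The following sketch reproduces that truncation with plain pandas and shows one possible guard, a hypothetical pre-check that lists readings that would be altered by the conversion. It is an illustration of the behaviour, not a description of Pandera's internals.

```python
import pandas as pd

systolic = pd.Series([130.0, 125.5, 145.2], name="systolic_bp")

# astype(int) drops the fractional part: 125.5 -> 125, 145.2 -> 145
truncated = systolic.astype(int)

# Hypothetical guard: readings that are not whole numbers
lossy = systolic[systolic != systolic.round()]

print(truncated.tolist())  # → [130, 125, 145]
print(lossy.tolist())      # → [125.5, 145.2]
```

If such a loss is unacceptable for a clinical measurement, declaring the column as `float` avoids the conversion entirely.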


Let's create a comprehensive schema that combines coercion with medical constraints and custom validation functions for more complex business rules.

In [9]:
# Define custom validation functions
def check_bp_relationship(df):
    """Ensure systolic BP is always higher than diastolic BP"""
    return df['systolic_bp'] > df['diastolic_bp']

def check_age_heart_rate(df):
    """Fail rows where a patient over 80 has a heart rate above 100"""
    elderly = df['age'] > 80
    high_hr = df['heart_rate'] > 100
    return ~(elderly & high_hr)  # False (check fails) when elderly AND high heart rate

# Comprehensive schema
comprehensive_schema = DataFrameSchema({
    'patient_id': Column(str, checks=[Check.str_matches(r'^P\d{3}$')]),
    'age': Column(int, checks=[Check.in_range(0, 120)], coerce=True),
    'systolic_bp': Column(int, checks=[Check.in_range(70, 250)], coerce=True),
    'diastolic_bp': Column(int, checks=[Check.in_range(40, 150)], coerce=True),
    'heart_rate': Column(int, checks=[Check.in_range(30, 200)]),
    'temperature': Column(float, checks=[Check.in_range(30.0, 45.0)], coerce=True),
    'admission_date': Column(str),
    'diagnosis_code': Column(str, checks=[Check.str_length(min_value=3, max_value=10)])
}, checks=[
    Check(check_bp_relationship, element_wise=False, 
          error="Systolic BP must be higher than diastolic BP"),
    Check(check_age_heart_rate, element_wise=False,
          error="High heart rate in elderly patients needs review")
])

print("Comprehensive schema with custom validations defined")

Comprehensive schema with custom validations defined
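A DataFrame-level check returns one boolean per row, where `False` marks a failing row. The sketch below evaluates the same mask logic as `check_age_heart_rate` on a small hypothetical DataFrame, using pandas only, to make the result of the negated condition visible.

```python
import pandas as pd

# Hypothetical rows: young with high HR, elderly with normal HR, elderly with high HR
sample = pd.DataFrame({
    "age": [45, 89, 85],
    "heart_rate": [120, 62, 130],
})

elderly = sample["age"] > 80
high_hr = sample["heart_rate"] > 100

# True means the row passes; only "elderly AND high heart rate" fails
passes = ~(elderly & high_hr)

print(passes.tolist())  # → [True, True, False]
```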


Let's test our comprehensive schema with the original data to see all validation features in action.

In [10]:
# Test comprehensive validation
try:
    final_validated_data = comprehensive_schema.validate(medical_data)
    print("Comprehensive validation successful!")
    print(f"Validated {len(final_validated_data)} rows with all constraints")
    print("\nFinal validated data:")
    print(final_validated_data)
except pa.errors.SchemaError as e:
    print(f"Comprehensive validation error: {e}")

Comprehensive validation successful!
Validated 5 rows with all constraints

Final validated data:
  patient_id  age  systolic_bp  diastolic_bp  heart_rate  temperature  \
0       P001   45          120            80          72         36.5   
1       P002   67          140            90          68         37.2   
2       P003   23          110            70          85         36.8   
3       P004   89          160            95          62         38.1   
4       P005   34          115            75          78         36.7   

  admission_date diagnosis_code  
0     2024-01-15            I10  
1     2024-01-16          E11.9  
2     2024-01-17          J44.1  
3     2024-01-18          N18.6  
4     2024-01-19          M79.3  


Now let's intentionally create invalid data to demonstrate how Pandera catches validation errors.

In [11]:
# Create invalid data to test error handling
invalid_data = pd.DataFrame({
    'patient_id': ['P009', 'INVALID', 'P011'],  # Invalid patient ID format
    'age': [45, 150, -5],  # Invalid ages
    'systolic_bp': [120, 90, 140],  # Note: second row has systolic < diastolic
    'diastolic_bp': [80, 95, 85],   # This will trigger BP relationship error
    'heart_rate': [72, 150, 68],    # High heart rate with high age
    'temperature': [36.5, 50.0, 36.8],  # Invalid temperature
    'admission_date': ['2024-01-25', '2024-01-26', '2024-01-27'],
    'diagnosis_code': ['I10', 'AB', 'J44.1']  # Invalid diagnosis code length
})

print("Invalid test data created:")
invalid_data

Invalid test data created:


| | patient_id | age | systolic_bp | diastolic_bp | heart_rate | temperature | admission_date | diagnosis_code |
|---|---|---|---|---|---|---|---|---|
| 0 | P009 | 45 | 120 | 80 | 72 | 36.5 | 2024-01-25 | I10 |
| 1 | INVALID | 150 | 90 | 95 | 150 | 50.0 | 2024-01-26 | AB |
| 2 | P011 | -5 | 140 | 85 | 68 | 36.8 | 2024-01-27 | J44.1 |


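Let's validate the invalid data. By default, `validate` raises a `SchemaError` at the first failing check, so the output below reports only the `patient_id` violation even though several other columns are also invalid. Passing `lazy=True` makes Pandera collect every failure and raise `SchemaErrors` instead.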

In [12]:
# Test validation with invalid data
try:
    comprehensive_schema.validate(invalid_data)
    print("Validation passed (unexpected!)")
except pa.errors.SchemaError as e:
    print("Expected validation errors found:")
    print(f"Error summary: {e}")
    print("\nDetailed error information:")
    print(f"Failed checks: {len(e.failure_cases)} cases")
    print(e.failure_cases)

Expected validation errors found:
Error summary: Column 'patient_id' failed element-wise validator number 0: str_matches('^P\d{3}$') failure cases: INVALID

Detailed error information:
Failed checks: 1 cases
   index failure_case
0      1      INVALID


Finally, let's demonstrate how to use Pandera decorators for function-based validation, which is useful for data processing pipelines.

In [13]:
# Function-based validation using decorators
@pa.check_input(comprehensive_schema)
@pa.check_output(comprehensive_schema)
def process_medical_data(df):
    """Process medical data with automatic input/output validation"""
    # Add a calculated field: pulse pressure
    processed_df = df.copy()
    processed_df['pulse_pressure'] = processed_df['systolic_bp'] - processed_df['diastolic_bp']
    
    # Drop the calculated field so the output keeps the input's columns.
    # The schema is not strict, so the extra column would also pass validation.
    return processed_df.drop('pulse_pressure', axis=1)

# Test the decorated function
try:
    processed_result = process_medical_data(medical_data)
    print("Function-based validation successful!")
    print(f"Processed {len(processed_result)} rows")
except pa.errors.SchemaError as e:
    print(f"Function validation error: {e}")

Function-based validation successful!
Processed 5 rows


## Exercise

Create your own medical data validation schema for a patient laboratory results dataset. Your schema should include:

1. **Patient Information**: patient_id (format: LAB followed by 4 digits), age (0-120), gender (M/F)
2. **Lab Values**: 
   - glucose (mg/dL): normal range 70-200
   - cholesterol (mg/dL): range 100-400
   - hemoglobin (g/dL): range 8.0-20.0
   - white_blood_cell_count (cells/μL): range 3000-12000
3. **Test Information**: test_date (string), lab_technician_id (string)

**Tasks:**
1. Define a comprehensive schema with appropriate data types, coercion, and range constraints
2. Add a custom validation function to ensure hemoglobin levels are appropriate for gender (males: 13.5-17.5 g/dL, females: 12.0-15.5 g/dL)
3. Create sample data with both valid and invalid entries
4. Test your schema and fix any validation errors
5. Create a function decorated with your schema that calculates and adds a 'risk_score' based on how many values are outside normal ranges

This exercise will help you practice creating domain-specific validation rules for medical laboratory data, which is crucial for ensuring data quality in clinical decision-making systems.