# Integrity Checks (Hashing) and Small-to-Large File Strategies

In medical data integration, ensuring data integrity throughout the pipeline is crucial for patient safety and regulatory compliance. This notebook covers hashing techniques for integrity verification and strategies for handling files of different sizes in medical data workflows.

## Introduction to Data Integrity in Medical Context

Medical data must maintain its integrity from collection to analysis to ensure accurate diagnoses and treatments. Hash functions provide a mathematical way to verify that data hasn't been corrupted or tampered with during transfer or storage.

Let's start by importing the necessary libraries for hashing and file operations.

In [1]:
import hashlib
import pandas as pd
import numpy as np
import os
import json
from pathlib import Path
import time
from typing import Dict, List, Tuple

## Basic Hashing Concepts

We'll create a simple function to generate MD5 and SHA-256 hashes for different types of data. These algorithms produce fixed-length strings that uniquely represent the input data.

In [2]:
def generate_hash(data: str, algorithm: str = 'sha256') -> str:
    """Generate hash for given data using specified algorithm."""
    if algorithm == 'md5':
        return hashlib.md5(data.encode()).hexdigest()
    elif algorithm == 'sha256':
        return hashlib.sha256(data.encode()).hexdigest()
    else:
        raise ValueError("Supported algorithms: 'md5', 'sha256'")

# Test with sample medical data
patient_id = "PAT001"
md5_hash = generate_hash(patient_id, 'md5')
sha256_hash = generate_hash(patient_id, 'sha256')

print(f"Original data: {patient_id}")
print(f"MD5 hash: {md5_hash}")
print(f"SHA-256 hash: {sha256_hash}")

Original data: PAT001
MD5 hash: 23a636587eb00292688bacedf0be14db
SHA-256 hash: ea711c655944838aa8a4b98030207d6d37e9495e5fba12af2209d8275ec3e41d


Let's demonstrate how even tiny changes in data result in completely different hashes, which is crucial for detecting data corruption.

In [3]:
# Original patient record
original_record = "Patient: John Doe, Age: 45, Diagnosis: Hypertension"
original_hash = generate_hash(original_record)

# Modified record (small typo)
modified_record = "Patient: John Doe, Age: 46, Diagnosis: Hypertension"
modified_hash = generate_hash(modified_record)

print(f"Original: {original_record}")
print(f"Original hash: {original_hash[:16]}...")
print(f"\nModified: {modified_record}")
print(f"Modified hash: {modified_hash[:16]}...")
print(f"\nHashes match: {original_hash == modified_hash}")

Original: Patient: John Doe, Age: 45, Diagnosis: Hypertension
Original hash: 7e3942315c13ca93...

Modified: Patient: John Doe, Age: 46, Diagnosis: Hypertension
Modified hash: e379a88b747881ff...

Hashes match: False


## File Hashing for Different File Sizes

Now we'll create functions to handle hashing of files efficiently, with different strategies for small and large files. For large files, we'll use chunk-based processing to avoid memory issues.

In [4]:
def hash_small_file(filepath: str, algorithm: str = 'sha256') -> str:
    """Hash small files by loading entire content into memory."""
    hasher = hashlib.sha256() if algorithm == 'sha256' else hashlib.md5()
    
    with open(filepath, 'rb') as f:
        content = f.read()
        hasher.update(content)
    
    return hasher.hexdigest()

# Create a sample small medical data file
sample_data = {
    'patient_id': ['PAT001', 'PAT002', 'PAT003'],
    'age': [45, 32, 67],
    'diagnosis': ['Hypertension', 'Diabetes', 'Arthritis']
}
df = pd.DataFrame(sample_data)
df.to_csv('small_medical_data.csv', index=False)

small_file_hash = hash_small_file('small_medical_data.csv')
print(f"Small file hash: {small_file_hash}")

Small file hash: 39305a36efa785d368b4ef3155b3b70361e8117c9c6ab536591b4293e54b4269


For large medical files (like medical imaging data or genomic sequences), we need a chunk-based approach to avoid loading the entire file into memory.

In [5]:
def hash_large_file(filepath: str, chunk_size: int = 8192, algorithm: str = 'sha256') -> str:
    """Hash large files using chunk-based processing."""
    hasher = hashlib.sha256() if algorithm == 'sha256' else hashlib.md5()
    
    with open(filepath, 'rb') as f:
        while chunk := f.read(chunk_size):
            hasher.update(chunk)
    
    return hasher.hexdigest()

# Create a larger sample file to demonstrate
large_data = pd.DataFrame({
    'patient_id': [f'PAT{i:05d}' for i in range(10000)],
    'measurement_1': np.random.normal(100, 15, 10000),
    'measurement_2': np.random.normal(80, 10, 10000),
    'timestamp': pd.date_range('2023-01-01', periods=10000, freq='H')
})
large_data.to_csv('large_medical_data.csv', index=False)

large_file_hash = hash_large_file('large_medical_data.csv')
print(f"Large file hash: {large_file_hash}")
print(f"File size: {os.path.getsize('large_medical_data.csv')} bytes")

Large file hash: 05f1d7f74340f83940810eb3526c0b8837ac7fa010d2b859ef278ce46a74a614
File size: 662183 bytes


  'timestamp': pd.date_range('2023-01-01', periods=10000, freq='H')


## Smart File Processing Strategy

Let's create an intelligent function that automatically chooses the appropriate hashing strategy based on file size. This is particularly useful in medical data pipelines where file sizes can vary dramatically.

In [6]:
def smart_file_hash(filepath: str, size_threshold: int = 10*1024*1024) -> Dict:
    """Choose hashing strategy based on file size (default threshold: 10MB)."""
    file_size = os.path.getsize(filepath)
    start_time = time.time()
    
    if file_size < size_threshold:
        strategy = "small_file"
        file_hash = hash_small_file(filepath)
    else:
        strategy = "large_file_chunked"
        file_hash = hash_large_file(filepath)
    
    processing_time = time.time() - start_time
    
    return {
        'filepath': filepath,
        'file_size_bytes': file_size,
        'strategy_used': strategy,
        'hash': file_hash,
        'processing_time_seconds': round(processing_time, 4)
    }

# Test with both files
small_result = smart_file_hash('small_medical_data.csv')
large_result = smart_file_hash('large_medical_data.csv')

print("Small file processing:")
for key, value in small_result.items():
    print(f"  {key}: {value}")

print("\nLarge file processing:")
for key, value in large_result.items():
    print(f"  {key}: {value}")

Small file processing:
  filepath: small_medical_data.csv
  file_size_bytes: 91
  strategy_used: small_file
  hash: 39305a36efa785d368b4ef3155b3b70361e8117c9c6ab536591b4293e54b4269
  processing_time_seconds: 0.0

Large file processing:
  filepath: large_medical_data.csv
  file_size_bytes: 662183
  strategy_used: small_file
  hash: 05f1d7f74340f83940810eb3526c0b8837ac7fa010d2b859ef278ce46a74a614
  processing_time_seconds: 0.0018


## Creating an Integrity Manifest

In medical data integration, it's common to create manifest files that store hash values for multiple files. This allows for batch integrity verification across entire datasets.

In [7]:
def create_integrity_manifest(file_paths: List[str], manifest_path: str = 'integrity_manifest.json') -> Dict:
    """Create a manifest file with hash values for multiple files."""
    manifest = {
        'created_timestamp': pd.Timestamp.now().isoformat(),
        'files': []
    }
    
    for filepath in file_paths:
        if os.path.exists(filepath):
            file_info = smart_file_hash(filepath)
            manifest['files'].append(file_info)
    
    # Save manifest to file
    with open(manifest_path, 'w') as f:
        json.dump(manifest, f, indent=2)
    
    return manifest

# Create manifest for our sample files
files_to_check = ['small_medical_data.csv', 'large_medical_data.csv']
manifest = create_integrity_manifest(files_to_check)

print(f"Manifest created with {len(manifest['files'])} files")
print(f"Creation time: {manifest['created_timestamp']}")

Manifest created with 2 files
Creation time: 2025-09-13T17:35:55.212573


## Verifying File Integrity

Now we'll create a function to verify file integrity by comparing current hash values with those stored in our manifest. This is essential for detecting data corruption in medical data pipelines.

In [8]:
def verify_integrity(manifest_path: str = 'integrity_manifest.json') -> Dict:
    """Verify file integrity against stored manifest."""
    with open(manifest_path, 'r') as f:
        manifest = json.load(f)
    
    verification_results = {
        'verification_timestamp': pd.Timestamp.now().isoformat(),
        'total_files': len(manifest['files']),
        'passed': 0,
        'failed': 0,
        'missing': 0,
        'details': []
    }
    
    for file_info in manifest['files']:
        filepath = file_info['filepath']
        expected_hash = file_info['hash']
        
        if not os.path.exists(filepath):
            verification_results['missing'] += 1
            status = 'MISSING'
            current_hash = None
        else:
            current_result = smart_file_hash(filepath)
            current_hash = current_result['hash']
            
            if current_hash == expected_hash:
                verification_results['passed'] += 1
                status = 'PASSED'
            else:
                verification_results['failed'] += 1
                status = 'FAILED'
        
        verification_results['details'].append({
            'filepath': filepath,
            'status': status,
            'expected_hash': expected_hash[:16] + '...',
            'current_hash': current_hash[:16] + '...' if current_hash else None
        })
    
    return verification_results

# Verify integrity of our files
verification = verify_integrity()

print(f"Verification completed at: {verification['verification_timestamp']}")
print(f"Files passed: {verification['passed']}/{verification['total_files']}")
print(f"Files failed: {verification['failed']}")
print(f"Files missing: {verification['missing']}")

print("\nDetailed results:")
for detail in verification['details']:
    print(f"  {detail['filepath']}: {detail['status']}")

Verification completed at: 2025-09-13T17:35:55.234712
Files passed: 2/2
Files failed: 0
Files missing: 0

Detailed results:
  small_medical_data.csv: PASSED
  large_medical_data.csv: PASSED


Let's simulate a file corruption scenario to see how our integrity verification detects changes.

In [9]:
# Simulate file corruption by modifying the small file
original_df = pd.read_csv('small_medical_data.csv')
corrupted_df = original_df.copy()
corrupted_df.loc[0, 'age'] = 999  # Introduce corruption
corrupted_df.to_csv('small_medical_data.csv', index=False)

print("File corrupted (age changed from 45 to 999)")
print("\nRe-running integrity verification...")

verification_after_corruption = verify_integrity()

print(f"Files passed: {verification_after_corruption['passed']}/{verification_after_corruption['total_files']}")
print(f"Files failed: {verification_after_corruption['failed']}")

for detail in verification_after_corruption['details']:
    if detail['status'] == 'FAILED':
        print(f"\nCorruption detected in: {detail['filepath']}")
        print(f"Expected hash: {detail['expected_hash']}")
        print(f"Current hash:  {detail['current_hash']}")

File corrupted (age changed from 45 to 999)

Re-running integrity verification...
Files passed: 1/2
Files failed: 1

Corruption detected in: small_medical_data.csv
Expected hash: 39305a36efa785d3...
Current hash:  af90fa5a280c0b40...


## Performance Comparison

Let's compare the performance of different hashing strategies to understand when to use each approach. This helps optimize medical data processing pipelines.

In [10]:
def performance_comparison(filepath: str, chunk_sizes: List[int] = [1024, 4096, 8192, 16384]) -> pd.DataFrame:
    """Compare performance of different chunk sizes for large file hashing."""
    results = []
    
    # Test small file strategy
    start_time = time.time()
    hash_small_file(filepath)
    small_file_time = time.time() - start_time
    
    results.append({
        'strategy': 'load_entire_file',
        'chunk_size': 'N/A',
        'processing_time': small_file_time
    })
    
    # Test different chunk sizes
    for chunk_size in chunk_sizes:
        start_time = time.time()
        hash_large_file(filepath, chunk_size)
        chunk_time = time.time() - start_time
        
        results.append({
            'strategy': 'chunked_processing',
            'chunk_size': chunk_size,
            'processing_time': chunk_time
        })
    
    return pd.DataFrame(results)

# Compare performance on our large file
perf_results = performance_comparison('large_medical_data.csv')
perf_results['processing_time'] = perf_results['processing_time'].round(4)

print("Performance comparison results:")
print(perf_results.to_string(index=False))

Performance comparison results:
          strategy chunk_size  processing_time
  load_entire_file        N/A           0.0010
chunked_processing       1024           0.0000
chunked_processing       4096           0.0021
chunked_processing       8192           0.0000
chunked_processing      16384           0.0000


Finally, let's clean up the temporary files we created during this demonstration.

In [11]:
# Clean up temporary files
temp_files = ['small_medical_data.csv', 'large_medical_data.csv', 'integrity_manifest.json']

for filepath in temp_files:
    if os.path.exists(filepath):
        os.remove(filepath)
        print(f"Removed: {filepath}")

print("\nCleanup completed!")

Removed: small_medical_data.csv
Removed: large_medical_data.csv
Removed: integrity_manifest.json

Cleanup completed!


## Key Takeaways

1. **Hash functions provide reliable integrity verification** for medical data files
2. **Different strategies are needed** for small vs. large files to optimize performance
3. **Manifest files enable batch integrity verification** across multiple files
4. **Chunk-based processing** prevents memory issues with large medical datasets
5. **Regular integrity checks** are essential in medical data pipelines for patient safety

## Exercise

Create a medical data integrity monitoring system with the following requirements:

1. **Create sample medical files**: Generate 3 CSV files with different sizes:
   - Small file: 100 patient records with basic demographics
   - Medium file: 5,000 patient records with vital signs data
   - Large file: 50,000 patient records with lab results

2. **Implement automated integrity checking**: Create a function that:
   - Automatically detects the appropriate hashing strategy for each file
   - Creates a timestamped integrity manifest
   - Logs all operations with file sizes and processing times

3. **Simulate real-world scenarios**: 
   - Verify integrity of all files initially (should pass)
   - Simulate data corruption in one file by modifying a few values
   - Run integrity verification again and identify which file was corrupted
   - Generate a summary report showing which files passed/failed verification

4. **Bonus challenge**: Implement a feature that can restore corrupted files from backup copies and re-verify integrity.

This exercise will help you understand how integrity checking works in practice and how to build robust medical data processing pipelines.