# CSV/Excel/NDJSON → Parquet/Arrow: Batch vs Streaming

In medical data integration, we often need to convert data from various formats (CSV, Excel, NDJSON) into more efficient columnar formats like Parquet or Arrow. This notebook explores both batch and streaming approaches to handle large medical datasets efficiently.

Let's start by importing the necessary libraries for data processing and conversion. We'll use pandas for basic operations, pyarrow for Arrow/Parquet handling, and other utilities for file operations.

In [1]:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import json
import numpy as np
from pathlib import Path
import time
import os

First, let's create sample medical datasets in different formats to demonstrate the conversion process. We'll generate synthetic patient data with common medical fields.

In [2]:
# Create sample medical data
np.random.seed(42)
n_patients = 10000

medical_data = {
    'patient_id': [f'P{i:06d}' for i in range(n_patients)],
    'age': np.random.randint(18, 90, n_patients),
    'gender': np.random.choice(['M', 'F'], n_patients),
    'diagnosis_code': np.random.choice(['I10', 'E11', 'J44', 'N18', 'F32'], n_patients),
    'systolic_bp': np.random.normal(130, 20, n_patients).round(1),
    'diastolic_bp': np.random.normal(80, 10, n_patients).round(1),
    'lab_value_glucose': np.random.normal(100, 30, n_patients).round(2),
    'admission_date': pd.date_range('2020-01-01', periods=n_patients, freq='h')  # lowercase 'h'; 'H' is deprecated in pandas 2.2+
}

df_medical = pd.DataFrame(medical_data)
print(f"Created medical dataset with {len(df_medical)} rows and {len(df_medical.columns)} columns")
df_medical.head()

Created medical dataset with 10000 rows and 8 columns


   patient_id  age  gender  diagnosis_code  systolic_bp  diastolic_bp  lab_value_glucose       admission_date
0     P000000   69       M             F32        154.8          70.1              88.42  2020-01-01 00:00:00
1     P000001   32       M             F32        146.9          81.5             100.87  2020-01-01 01:00:00
2     P000002   89       F             E11        100.7          89.1              98.61  2020-01-01 02:00:00
3     P000003   78       M             E11        134.9          96.4             122.34  2020-01-01 03:00:00
4     P000004   38       F             J44        126.6          62.2             104.78  2020-01-01 04:00:00


Now let's save this data in different source formats that are commonly encountered in medical data integration. We'll create CSV, Excel, and NDJSON files to demonstrate various input scenarios.

In [3]:
# Create data directory
Path('data').mkdir(exist_ok=True)

# Save as CSV
df_medical.to_csv('data/medical_data.csv', index=False)
print(f"CSV file size: {os.path.getsize('data/medical_data.csv') / 1024 / 1024:.2f} MB")

# Save as Excel
df_medical.to_excel('data/medical_data.xlsx', index=False)
print(f"Excel file size: {os.path.getsize('data/medical_data.xlsx') / 1024 / 1024:.2f} MB")

CSV file size: 0.53 MB
Excel file size: 0.43 MB


Let's create an NDJSON file where each line represents a patient record. This format is common in streaming medical data systems and log files.

In [4]:
# Save as NDJSON (Newline Delimited JSON)
with open('data/medical_data.ndjson', 'w') as f:
    for _, row in df_medical.iterrows():
        # Convert datetime to string for JSON serialization
        row_dict = row.to_dict()
        row_dict['admission_date'] = row_dict['admission_date'].isoformat()
        f.write(json.dumps(row_dict) + '\n')

print(f"NDJSON file size: {os.path.getsize('data/medical_data.ndjson') / 1024 / 1024:.2f} MB")

NDJSON file size: 1.83 MB
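
The row-by-row loop above is easy to follow but can be slow on large frames. As a sketch of an alternative, pandas can write NDJSON in a single call with `DataFrame.to_json(orient='records', lines=True)`. The small frame and the temporary file path below are made up for illustration and are not part of the notebook's dataset.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Small illustrative frame (hypothetical values)
df_demo = pd.DataFrame({
    'patient_id': ['P000000', 'P000001', 'P000002'],
    'age': [69, 32, 89],
    'admission_date': pd.to_datetime(
        ['2020-01-01 00:00', '2020-01-01 01:00', '2020-01-01 02:00']
    ),
})

ndjson_demo_path = Path(tempfile.mkdtemp()) / 'demo.ndjson'

# lines=True writes one JSON object per line;
# date_format='iso' emits timestamps as ISO 8601 strings
df_demo.to_json(ndjson_demo_path, orient='records', lines=True, date_format='iso')

# Each line parses independently as a JSON record
with open(ndjson_demo_path) as f:
    demo_records = [json.loads(line) for line in f if line.strip()]

print(len(demo_records), demo_records[0]['patient_id'])
```

The explicit loop is still useful when each record needs custom handling before it is serialized.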


## Batch Processing Approach

Now let's implement the batch processing approach where we load the entire dataset into memory at once. This method is suitable for datasets that fit comfortably in available RAM.

In [5]:
def batch_csv_to_parquet(csv_path, parquet_path):
    """Convert CSV to Parquet using batch processing"""
    start_time = time.time()
    
    # Read entire CSV into memory
    df = pd.read_csv(csv_path, parse_dates=['admission_date'])
    
    # Convert to Arrow table and save as Parquet
    table = pa.Table.from_pandas(df)
    pq.write_table(table, parquet_path)
    
    end_time = time.time()
    return end_time - start_time

# Convert CSV to Parquet using batch processing
batch_time = batch_csv_to_parquet('data/medical_data.csv', 'data/medical_batch.parquet')
print(f"Batch conversion completed in {batch_time:.2f} seconds")
print(f"Parquet file size: {os.path.getsize('data/medical_batch.parquet') / 1024 / 1024:.2f} MB")

Batch conversion completed in 0.04 seconds
Parquet file size: 0.24 MB


Let's also implement batch processing for Excel files. Excel workbooks can contain multiple sheets and formatting information; `pd.read_excel` reads only the first sheet by default, which is all this single-sheet file needs.

In [6]:
def batch_excel_to_parquet(excel_path, parquet_path):
    """Convert Excel to Parquet using batch processing"""
    start_time = time.time()
    
    # Read entire Excel file into memory
    df = pd.read_excel(excel_path)
    df['admission_date'] = pd.to_datetime(df['admission_date'])
    
    # Convert to Arrow table and save as Parquet
    table = pa.Table.from_pandas(df)
    pq.write_table(table, parquet_path)
    
    end_time = time.time()
    return end_time - start_time

# Convert Excel to Parquet using batch processing
batch_excel_time = batch_excel_to_parquet('data/medical_data.xlsx', 'data/medical_excel_batch.parquet')
print(f"Excel batch conversion completed in {batch_excel_time:.2f} seconds")

Excel batch conversion completed in 1.00 seconds


Now let's implement batch processing for NDJSON files. Each line needs to be parsed as a separate JSON object and then combined into a single DataFrame.

In [7]:
def batch_ndjson_to_parquet(ndjson_path, parquet_path):
    """Convert NDJSON to Parquet using batch processing"""
    start_time = time.time()
    
    # Read all lines and parse JSON
    records = []
    with open(ndjson_path, 'r') as f:
        for line in f:
            records.append(json.loads(line.strip()))
    
    # Create DataFrame and convert dates
    df = pd.DataFrame(records)
    df['admission_date'] = pd.to_datetime(df['admission_date'])
    
    # Convert to Arrow table and save as Parquet
    table = pa.Table.from_pandas(df)
    pq.write_table(table, parquet_path)
    
    end_time = time.time()
    return end_time - start_time

# Convert NDJSON to Parquet using batch processing
batch_ndjson_time = batch_ndjson_to_parquet('data/medical_data.ndjson', 'data/medical_ndjson_batch.parquet')
print(f"NDJSON batch conversion completed in {batch_ndjson_time:.2f} seconds")

NDJSON batch conversion completed in 0.07 seconds


## Streaming Processing Approach

Now let's implement streaming processing, which processes data in chunks rather than loading everything into memory. This approach is essential for large medical datasets that exceed available RAM.

In [8]:
def streaming_csv_to_parquet(csv_path, parquet_path, chunk_size=1000):
    """Convert CSV to Parquet using streaming processing"""
    start_time = time.time()
    
    # Initialize Parquet writer
    writer = None
    
    # Process CSV in chunks
    for chunk_df in pd.read_csv(csv_path, chunksize=chunk_size, parse_dates=['admission_date']):
        # Convert chunk to Arrow table
        table = pa.Table.from_pandas(chunk_df)
        
        # Initialize writer with schema from first chunk
        if writer is None:
            writer = pq.ParquetWriter(parquet_path, table.schema)
        
        # Write chunk to Parquet file
        writer.write_table(table)
    
    # Close the writer
    if writer:
        writer.close()
    
    end_time = time.time()
    return end_time - start_time

# Convert CSV to Parquet using streaming processing
streaming_time = streaming_csv_to_parquet('data/medical_data.csv', 'data/medical_streaming.parquet')
print(f"Streaming conversion completed in {streaming_time:.2f} seconds")
print(f"Streaming Parquet file size: {os.path.getsize('data/medical_streaming.parquet') / 1024 / 1024:.2f} MB")

Streaming conversion completed in 0.08 seconds
Streaming Parquet file size: 0.28 MB


Let's implement streaming processing for NDJSON files, which is particularly useful for log files and real-time medical data streams. We'll process the file line by line in batches.

In [9]:
def streaming_ndjson_to_parquet(ndjson_path, parquet_path, chunk_size=1000):
    """Convert NDJSON to Parquet using streaming processing"""
    start_time = time.time()
    
    writer = None
    records_batch = []
    
    with open(ndjson_path, 'r') as f:
        for line_num, line in enumerate(f):
            # Parse JSON record
            record = json.loads(line.strip())
            records_batch.append(record)
            
            # Process batch when chunk_size is reached
            if len(records_batch) >= chunk_size:
                # Create DataFrame from batch
                df_batch = pd.DataFrame(records_batch)
                df_batch['admission_date'] = pd.to_datetime(df_batch['admission_date'])
                
                # Convert to Arrow table
                table = pa.Table.from_pandas(df_batch)
                
                # Initialize writer if needed
                if writer is None:
                    writer = pq.ParquetWriter(parquet_path, table.schema)
                
                # Write batch
                writer.write_table(table)
                records_batch = []  # Reset batch
        
        # Process remaining records
        if records_batch:
            df_batch = pd.DataFrame(records_batch)
            df_batch['admission_date'] = pd.to_datetime(df_batch['admission_date'])
            table = pa.Table.from_pandas(df_batch)
            
            if writer is None:
                writer = pq.ParquetWriter(parquet_path, table.schema)
            writer.write_table(table)
    
    if writer:
        writer.close()
    
    end_time = time.time()
    return end_time - start_time

# Convert NDJSON to Parquet using streaming processing
streaming_ndjson_time = streaming_ndjson_to_parquet('data/medical_data.ndjson', 'data/medical_ndjson_streaming.parquet')
print(f"NDJSON streaming conversion completed in {streaming_ndjson_time:.2f} seconds")

NDJSON streaming conversion completed in 0.12 seconds
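
The function above repeats the DataFrame conversion for the final partial batch. One way to avoid that duplication is to move the batching into a generator, sketched here with `itertools.islice`. The helper name `iter_ndjson_batches` and the demo file are hypothetical.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path


def iter_ndjson_batches(ndjson_path, batch_size=1000):
    """Yield lists of parsed records, at most batch_size per list."""
    with open(ndjson_path, 'r') as f:
        # Parse lazily, one line at a time, skipping blank lines
        records = (json.loads(line) for line in f if line.strip())
        while True:
            batch = list(islice(records, batch_size))
            if not batch:
                break
            yield batch


# Demo: 7 records in batches of 3 (the last batch is partial)
demo_path = Path(tempfile.mkdtemp()) / 'demo.ndjson'
with open(demo_path, 'w') as f:
    for i in range(7):
        f.write(json.dumps({'patient_id': f'P{i:06d}', 'age': 30 + i}) + '\n')

batch_sizes = [len(batch) for batch in iter_ndjson_batches(demo_path, batch_size=3)]
print(batch_sizes)  # [3, 3, 1]
```

Each yielded batch can then be passed to `pd.DataFrame(batch)` and written with `writer.write_table`, so the conversion code appears only once.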


## Performance Comparison and Memory Usage

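Let's compare the timings and file sizes of both approaches. With only 10,000 rows, single-run timings are noisy, so treat them as indicative rather than as a benchmark. We'll also verify that the converted files contain the same data as the original.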

In [10]:
# Create performance comparison
performance_data = {
    'Method': ['CSV Batch', 'CSV Streaming', 'NDJSON Batch', 'NDJSON Streaming', 'Excel Batch'],
    'Time (seconds)': [batch_time, streaming_time, batch_ndjson_time, streaming_ndjson_time, batch_excel_time]
}

performance_df = pd.DataFrame(performance_data)
print("Performance Comparison:")
print(performance_df)

# File size comparison
file_sizes = {
    'Original CSV': os.path.getsize('data/medical_data.csv') / 1024 / 1024,
    'Original Excel': os.path.getsize('data/medical_data.xlsx') / 1024 / 1024,
    'Original NDJSON': os.path.getsize('data/medical_data.ndjson') / 1024 / 1024,
    'Batch Parquet': os.path.getsize('data/medical_batch.parquet') / 1024 / 1024,
    'Streaming Parquet': os.path.getsize('data/medical_streaming.parquet') / 1024 / 1024
}

print("\nFile Size Comparison (MB):")
for file_type, size in file_sizes.items():
    print(f"{file_type}: {size:.2f} MB")

Performance Comparison:
             Method  Time (seconds)
0         CSV Batch        0.039538
1     CSV Streaming        0.077791
2      NDJSON Batch        0.071356
3  NDJSON Streaming        0.118001
4       Excel Batch        1.002253

File Size Comparison (MB):
Original CSV: 0.53 MB
Original Excel: 0.43 MB
Original NDJSON: 1.83 MB
Batch Parquet: 0.24 MB
Streaming Parquet: 0.28 MB
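
The comparison above covers time and file size but not memory. A minimal sketch of a measurement helper, using only the standard library, is shown below. The helper name `measure` and the toy workload are hypothetical. Note the assumption that matters here: `tracemalloc` tracks allocations made through Python's allocator, so memory held in Arrow's own memory pool is probably not included; `pa.total_allocated_bytes()` can be checked separately for that.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Return (result, elapsed_seconds, peak_python_bytes) for one call."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always record timing and stop tracing, even if func raises
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def build_list(n):
    """Toy workload standing in for a conversion function."""
    return [i * 2 for i in range(n)]


result, elapsed, peak = measure(build_list, 100_000)
print(f"{elapsed:.4f} s, peak {peak / 1024 / 1024:.2f} MB (Python allocations only)")
```

A conversion function from this notebook could be passed in the same way, for example `measure(batch_csv_to_parquet, csv_path, parquet_path)`.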


Let's verify data integrity by reading back one of the Parquet files and comparing it with the original data. This checks that the conversion preserved shapes, data types, and values.

In [11]:
# Verify data integrity
original_df = pd.read_csv('data/medical_data.csv', parse_dates=['admission_date'])
parquet_df = pd.read_parquet('data/medical_batch.parquet')

print(f"Original shape: {original_df.shape}")
print(f"Parquet shape: {parquet_df.shape}")
print(f"Data types match: {original_df.dtypes.equals(parquet_df.dtypes)}")
print(f"Data values match: {original_df.equals(parquet_df)}")

# Show compression ratio
csv_size = os.path.getsize('data/medical_data.csv')
parquet_size = os.path.getsize('data/medical_batch.parquet')
compression_ratio = csv_size / parquet_size
print(f"\nCompression ratio: {compression_ratio:.2f}x smaller")

Original shape: (10000, 8)
Parquet shape: (10000, 8)
Data types match: True
Data values match: True

Compression ratio: 2.17x smaller


## Working with Arrow Format In-Memory

Let's demonstrate working directly with Apache Arrow format in memory, which is beneficial for medical data processing pipelines. Arrow provides zero-copy reads and efficient columnar operations.

In [12]:
# Load data as Arrow table directly
arrow_table = pq.read_table('data/medical_batch.parquet')

print(f"Arrow table schema:")
print(arrow_table.schema)
print(f"\nNumber of rows: {arrow_table.num_rows}")
print(f"Number of columns: {arrow_table.num_columns}")

# Show memory usage
memory_usage = sum(array.nbytes for array in arrow_table.columns)
print(f"Memory usage: {memory_usage / 1024 / 1024:.2f} MB")

Arrow table schema:
patient_id: string
age: int64
gender: string
diagnosis_code: string
systolic_bp: double
diastolic_bp: double
lab_value_glucose: double
admission_date: timestamp[ns]
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1246

Number of rows: 10000
Number of columns: 8
Memory usage: 0.61 MB


Let's perform some efficient filtering operations directly on the Arrow table. This demonstrates how columnar format enables fast analytical queries on medical data.

In [13]:
# Efficient filtering with Arrow compute functions
import pyarrow.compute as pc

# Filter patients with high systolic blood pressure (>150)
high_bp_mask = pc.greater(arrow_table['systolic_bp'], 150)
high_bp_patients = arrow_table.filter(high_bp_mask)

print(f"Patients with systolic BP > 150: {high_bp_patients.num_rows}")

# Filter by diagnosis code
diabetes_mask = pc.equal(arrow_table['diagnosis_code'], 'E11')
diabetes_patients = arrow_table.filter(diabetes_mask)

print(f"Patients with diabetes (E11): {diabetes_patients.num_rows}")

# Convert filtered results back to pandas for further analysis
diabetes_df = diabetes_patients.to_pandas()
print(f"\nMean glucose level in diabetes patients: {diabetes_df['lab_value_glucose'].mean():.2f}")

Patients with systolic BP > 150: 1574
Patients with diabetes (E11): 1967

Mean glucose level in diabetes patients: 99.42


## Best Practices Summary

Let's summarize the key considerations for choosing between batch and streaming approaches in medical data integration. The choice depends on data size, available memory, and processing requirements.

In [14]:
# Clean up temporary files
import shutil
shutil.rmtree('data', ignore_errors=True)

print("Best Practices Summary:")
print("\n1. Use BATCH processing when:")
print("   - Dataset fits comfortably in memory")
print("   - Faster processing is priority")
print("   - Simple one-time conversion")

print("\n2. Use STREAMING processing when:")
print("   - Dataset is larger than available memory")
print("   - Memory efficiency is critical")
print("   - Processing real-time data feeds")
print("   - Want to minimize memory footprint")

print("\n3. Parquet benefits for medical data:")
print("   - Excellent compression (2-5x smaller files)")
print("   - Fast analytical queries")
print("   - Schema preservation")
print("   - Cross-platform compatibility")

Best Practices Summary:

1. Use BATCH processing when:
   - Dataset fits comfortably in memory
   - Faster processing is priority
   - Simple one-time conversion

2. Use STREAMING processing when:
   - Dataset is larger than available memory
   - Memory efficiency is critical
   - Processing real-time data feeds
   - Want to minimize memory footprint

3. Parquet benefits for medical data:
   - Excellent compression (2-5x smaller files)
   - Fast analytical queries
   - Schema preservation
   - Cross-platform compatibility
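
The first two points in the summary can be turned into a rough decision rule. The sketch below is a hypothetical heuristic, not an established formula: the function name `choose_strategy`, the memory budget, and the `expansion_factor` (a guess at how much larger a text file becomes once loaded into memory) are all assumptions to tune for your own data.

```python
import os
import tempfile


def choose_strategy(path, memory_budget_bytes, expansion_factor=5):
    """Return 'batch' or 'streaming' from a rough in-memory size estimate.

    expansion_factor is a hypothetical multiplier for how much a text file
    grows when loaded into a DataFrame; measure it on your own data.
    """
    estimated_in_memory = os.path.getsize(path) * expansion_factor
    return 'batch' if estimated_in_memory <= memory_budget_bytes else 'streaming'


# Demo with a 1,000-byte placeholder file
with tempfile.NamedTemporaryFile(delete=False, suffix='.csv') as tmp:
    tmp.write(b'x' * 1000)
    demo_file = tmp.name

print(choose_strategy(demo_file, memory_budget_bytes=10_000))  # batch
print(choose_strategy(demo_file, memory_budget_bytes=2_000))   # streaming
```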


## Exercise

Build a medical data processing pipeline by completing the following steps:

1. Generate a synthetic dataset of 50,000 patient records with fields: patient_id, age, gender, BMI, blood_pressure_systolic, blood_pressure_diastolic, diagnosis_codes (as a list), and visit_date
2. Save this data as both CSV and NDJSON formats
3. Implement both batch and streaming conversion functions to convert these files to Parquet format
4. Compare the performance, memory usage, and file sizes between the two approaches
5. Use Arrow compute functions to filter patients with BMI > 30 and systolic BP > 140 (potential cardiovascular risk)
6. Calculate and compare the processing time for this filtering operation on the original CSV vs. the Parquet format

Bonus: Implement error handling for corrupted records in the NDJSON file and demonstrate how streaming processing can continue despite individual record failures.