# Reading and Writing Medical Data: CSV and NDJSON with Pandas and Polars

In medical data integration, we frequently work with structured data in CSV format and semi-structured data in NDJSON (Newline Delimited JSON) format. This notebook covers reading and writing both formats with pandas and Polars, with particular attention to controlling data types and handling missing values.

## Setup and Imports

We'll start by importing the necessary libraries and creating sample medical data to work with.

In [1]:
import pandas as pd
import polars as pl
import numpy as np
import json
from pathlib import Path
from datetime import datetime, date

print(f"Pandas version: {pd.__version__}")
print(f"Polars version: {pl.__version__}")

Pandas version: 2.3.1
Polars version: 1.33.1


## Creating Sample Medical Data

Let's create a sample dataset representing patient information with various data types commonly found in medical records.

In [2]:
# Create sample medical data
medical_data = {
    'patient_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'age': [45, 32, None, 67, 28],
    'gender': ['M', 'F', 'F', 'M', None],
    'blood_pressure_systolic': [120, 130, 140, None, 110],
    'blood_pressure_diastolic': [80, 85, 90, None, 70],
    'diagnosis_date': ['2023-01-15', '2023-02-20', None, '2023-03-10', '2023-04-05'],
    'has_diabetes': [True, False, False, True, None],
    'medication_count': [2, 0, 1, 5, 1]
}

df_sample = pd.DataFrame(medical_data)
print("Sample medical dataset:")
print(df_sample)

Sample medical dataset:
  patient_id   age gender  blood_pressure_systolic  blood_pressure_diastolic  \
0       P001  45.0      M                    120.0                      80.0   
1       P002  32.0      F                    130.0                      85.0   
2       P003   NaN      F                    140.0                      90.0   
3       P004  67.0      M                      NaN                       NaN   
4       P005  28.0   None                    110.0                      70.0   

  diagnosis_date has_diabetes  medication_count  
0     2023-01-15         True                 2  
1     2023-02-20        False                 0  
2           None        False                 1  
3     2023-03-10         True                 5  
4     2023-04-05         None                 1  


## Writing CSV Files with Pandas

We'll save our sample data to a CSV file, which is one of the most common formats for medical data exchange.

In [3]:
# Write CSV file with pandas
csv_file = 'medical_data.csv'
df_sample.to_csv(csv_file, index=False)
print(f"Data saved to {csv_file}")

# Check the file contents
with open(csv_file, 'r') as f:
    print("\nCSV file contents:")
    print(f.read())

Data saved to medical_data.csv

CSV file contents:
patient_id,age,gender,blood_pressure_systolic,blood_pressure_diastolic,diagnosis_date,has_diabetes,medication_count
P001,45.0,M,120.0,80.0,2023-01-15,True,2
P002,32.0,F,130.0,85.0,2023-02-20,False,0
P003,,F,140.0,90.0,,False,1
P004,67.0,M,,,2023-03-10,True,5
P005,28.0,,110.0,70.0,2023-04-05,,1
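
Note that `age` and the blood pressure readings were written as `45.0`, `120.0`, and so on: the `None` entries made pandas store those columns as `float64`. One way to keep whole numbers in the file is to cast to the nullable `Int64` dtype before writing. The sketch below shows this on a small illustrative frame; the `demo` name and its values are made up for this example.

```python
import pandas as pd

# Illustrative frame (hypothetical values); None forces 'age' to float64
demo = pd.DataFrame({
    "patient_id": ["P001", "P002", "P003"],
    "age": [45, None, 67],
})

# Cast to the nullable integer dtype so missing values no longer force floats
demo_int = demo.astype({"age": "Int64"})

# With no path argument, to_csv returns the CSV text instead of writing a file
csv_text = demo_int.to_csv(index=False)
print(csv_text)
```

The missing age is still written as an empty field, so readers that treat empty fields as missing will round-trip it.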



## Reading CSV Files with Pandas and Data Type Control

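CSV carries no type information, so every reader has to infer each column's type or be told it. For medical data, where an integer silently becoming a float or a date remaining a plain string can distort later analysis, it is worth checking what the reader inferred.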

In [4]:
# Read CSV with automatic type inference
df_auto = pd.read_csv(csv_file)
print("Data types with automatic inference:")
print(df_auto.dtypes)
print("\nData preview:")
print(df_auto.head())

Data types with automatic inference:
patient_id                   object
age                         float64
gender                       object
blood_pressure_systolic     float64
blood_pressure_diastolic    float64
diagnosis_date               object
has_diabetes                 object
medication_count              int64
dtype: object

Data preview:
  patient_id   age gender  blood_pressure_systolic  blood_pressure_diastolic  \
0       P001  45.0      M                    120.0                      80.0   
1       P002  32.0      F                    130.0                      85.0   
2       P003   NaN      F                    140.0                      90.0   
3       P004  67.0      M                      NaN                       NaN   
4       P005  28.0    NaN                    110.0                      70.0   

  diagnosis_date has_diabetes  medication_count  
0     2023-01-15         True                 2  
1     2023-02-20        False                 0  
2            Na

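Automatic inference stored `age` and the blood pressure columns as `float64` and left `diagnosis_date` and `has_diabetes` as generic `object` columns. Now let's read the same CSV file with explicit data types: nullable integers, a categorical, a nullable boolean, and parsed dates.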

In [5]:
# Define explicit data types for medical data
dtype_dict = {
    'patient_id': 'string',
    'age': 'Int64',  # Nullable integer
    'gender': 'category',
    'blood_pressure_systolic': 'Int64',
    'blood_pressure_diastolic': 'Int64',
    'has_diabetes': 'boolean',
    'medication_count': 'Int64'
}

df_typed = pd.read_csv(csv_file, dtype=dtype_dict, parse_dates=['diagnosis_date'])
print("Data types with explicit control:")
print(df_typed.dtypes)

Data types with explicit control:
patient_id                  string[python]
age                                  Int64
gender                            category
blood_pressure_systolic              Int64
blood_pressure_diastolic             Int64
diagnosis_date              datetime64[ns]
has_diabetes                       boolean
medication_count                     Int64
dtype: object
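
To see the effect of explicit typing in isolation, here is a minimal, self-contained sketch that reads the same small CSV text twice. The CSV text is invented for illustration, and the `date_format` argument assumes pandas 2.0 or later.

```python
import io
import pandas as pd

# Small in-memory CSV with one missing age and one missing date (made-up values)
csv_text = (
    "patient_id,age,diagnosis_date\n"
    "P001,45,2023-01-15\n"
    "P002,,2023-02-20\n"
    "P003,67,\n"
)

# Default inference: the missing age turns the column into float64
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit control: nullable integer plus a pinned date layout
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"patient_id": "string", "age": "Int64"},
    parse_dates=["diagnosis_date"],
    date_format="%Y-%m-%d",
)

print(inferred.dtypes)
print(typed.dtypes)
```

Pinning `date_format` documents the layout you expect instead of leaving pandas to guess it per file.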


## Handling Missing Values in CSV Reading

Medical data often contains missing values, and we need to handle them appropriately during data loading.

In [6]:
# Check for missing values
print("Missing values count:")
print(df_typed.isnull().sum())
print("\nMissing values percentage:")
print((df_typed.isnull().sum() / len(df_typed)) * 100)

Missing values count:
patient_id                  0
age                         1
gender                      1
blood_pressure_systolic     1
blood_pressure_diastolic    1
diagnosis_date              1
has_diabetes                1
medication_count            0
dtype: int64

Missing values percentage:
patient_id                   0.0
age                         20.0
gender                      20.0
blood_pressure_systolic     20.0
blood_pressure_diastolic    20.0
diagnosis_date              20.0
has_diabetes                20.0
medication_count             0.0
dtype: float64


We can customize how missing values are interpreted during CSV reading by specifying custom NA values.

In [7]:
# Read CSV with custom missing value indicators
df_custom_na = pd.read_csv(
    csv_file, 
    dtype=dtype_dict, 
    parse_dates=['diagnosis_date'],
    na_values=['', 'NA', 'NULL', 'missing', 'unknown', '999', '-1']
)
print("DataFrame info after custom NA handling:")
df_custom_na.info()  # info() prints its report itself and returns None

DataFrame info after custom NA handling:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   patient_id                5 non-null      string        
 1   age                       4 non-null      Int64         
 2   gender                    4 non-null      category      
 3   blood_pressure_systolic   4 non-null      Int64         
 4   blood_pressure_diastolic  4 non-null      Int64         
 5   diagnosis_date            4 non-null      datetime64[ns]
 6   has_diabetes              4 non-null      boolean       
 7   medication_count          5 non-null      Int64         
dtypes: Int64(4), boolean(1), category(1), datetime64[ns](1), string(1)
memory usage: 531.0 bytes
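
Applying sentinels such as `999` or `-1` to every column can also blank out legitimate values elsewhere in the file. `na_values` also accepts a dictionary keyed by column name, which limits each sentinel to the columns where it really means "missing". The sketch below uses invented data; the `lab_result` column is hypothetical.

```python
import io
import pandas as pd

# Made-up data: 999 and -1 mean "unknown" for age, but are real lab results
csv_text = (
    "patient_id,age,lab_result\n"
    "P001,45,999\n"
    "P002,999,12\n"
    "P003,-1,-1\n"
)

df_na = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"patient_id": "string", "age": "Int64"},
    # Sentinels apply to 'age' only; other columns keep pandas' default NA markers
    na_values={"age": ["999", "-1"]},
)
print(df_na)
```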


## Working with Polars for CSV Operations

Polars is typically fast on large datasets, and every column has one strict data type with native null support. Let's see what it infers for our file.

In [8]:
# Read CSV with Polars
df_polars = pl.read_csv(csv_file)
print("Polars DataFrame schema:")
print(df_polars.schema)
print("\nPolars DataFrame:")
print(df_polars)

Polars DataFrame schema:
Schema([('patient_id', String), ('age', Float64), ('gender', String), ('blood_pressure_systolic', Float64), ('blood_pressure_diastolic', Float64), ('diagnosis_date', String), ('has_diabetes', Boolean), ('medication_count', Int64)])

Polars DataFrame:
shape: (5, 8)
┌────────────┬──────┬────────┬───────────────────┬──────────────────┬────────────────┬──────────────┬──────────────────┐
│ patient_id ┆ age  ┆ gender ┆ blood_pressure_sy ┆ blood_pressure_d ┆ diagnosis_date ┆ has_diabetes ┆ medication_count │
│ ---        ┆ ---  ┆ ---    ┆ stolic            ┆ iastolic         ┆ ---            ┆ ---          ┆ ---              │
│ str        ┆ f64  ┆ str    ┆ ---               ┆ ---              ┆ str            ┆ bool         ┆ i64              │
│            ┆      ┆        ┆ f64               ┆ f64              ┆                ┆              ┆                  │
╞════════════╪══════╪════════╪═══════════════════╪══════════════════╪════════════════╪══════════════╪════

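Let's read the CSV with an explicit schema in Polars so that each column gets the type we intend. The `age` and blood pressure columns stay `Float64` here because the file stores them as `45.0`, `120.0`, and so on.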

In [9]:
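# Define schema for Polars (one entry per column, in file order)
polars_schema = {
    'patient_id': pl.Utf8,  # alias of pl.String
    'age': pl.Float64,  # written as '45.0' in the file, so read as float
    'gender': pl.Categorical,
    'blood_pressure_systolic': pl.Float64,  # written as '120.0' in the file
    'blood_pressure_diastolic': pl.Float64,  # written as '80.0' in the file
    'diagnosis_date': pl.Date,
    'has_diabetes': pl.Boolean,
    'medication_count': pl.Float64  # plain integers in the file; pl.Int64 would also fit
}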

df_polars_typed = pl.read_csv(csv_file, schema=polars_schema, try_parse_dates=True)
print("Polars DataFrame with explicit schema:")
print(df_polars_typed)

Polars DataFrame with explicit schema:
shape: (5, 8)
┌────────────┬──────┬────────┬───────────────────┬──────────────────┬────────────────┬──────────────┬──────────────────┐
│ patient_id ┆ age  ┆ gender ┆ blood_pressure_sy ┆ blood_pressure_d ┆ diagnosis_date ┆ has_diabetes ┆ medication_count │
│ ---        ┆ ---  ┆ ---    ┆ stolic            ┆ iastolic         ┆ ---            ┆ ---          ┆ ---              │
│ str        ┆ f64  ┆ cat    ┆ ---               ┆ ---              ┆ date           ┆ bool         ┆ f64              │
│            ┆      ┆        ┆ f64               ┆ f64              ┆                ┆              ┆                  │
╞════════════╪══════╪════════╪═══════════════════╪══════════════════╪════════════════╪══════════════╪══════════════════╡
│ P001       ┆ 45.0 ┆ M      ┆ 120.0             ┆ 80.0             ┆ 2023-01-15     ┆ true         ┆ 2.0              │
│ P002       ┆ 32.0 ┆ F      ┆ 130.0             ┆ 85.0             ┆ 2023-02-20     ┆ false        

## Creating and Writing NDJSON Data

NDJSON (Newline Delimited JSON) is commonly used for streaming medical data or when dealing with nested patient records.

In [10]:
# Create sample NDJSON medical data with nested structure
medical_records = [
    {
        "patient_id": "P001",
        "demographics": {"age": 45, "gender": "M"},
        "vitals": {"bp_sys": 120, "bp_dia": 80},
        "conditions": ["diabetes", "hypertension"],
        "last_visit": "2023-01-15"
    },
    {
        "patient_id": "P002",
        "demographics": {"age": 32, "gender": "F"},
        "vitals": {"bp_sys": 130, "bp_dia": 85},
        "conditions": [],
        "last_visit": "2023-02-20"
    },
    {
        "patient_id": "P003",
        "demographics": {"age": None, "gender": "F"},
        "vitals": {"bp_sys": 140, "bp_dia": 90},
        "conditions": ["anxiety"],
        "last_visit": None
    }
]

# Write NDJSON file
ndjson_file = 'medical_records.ndjson'
with open(ndjson_file, 'w') as f:
    for record in medical_records:
        f.write(json.dumps(record) + '\n')

print(f"NDJSON data written to {ndjson_file}")

NDJSON data written to medical_records.ndjson


## Reading NDJSON with Pandas

We'll read the NDJSON file and handle the nested structure appropriately for analysis.

In [11]:
# Read NDJSON file with pandas: parse one JSON object per line
records = []
with open(ndjson_file, 'r') as f:
    for line in f:
        line = line.strip()
        if line:  # skip blank lines such as a trailing newline
            records.append(json.loads(line))

df_ndjson = pd.json_normalize(records)
print("NDJSON data loaded with pandas:")
print(df_ndjson)
print("\nColumn names:")
print(df_ndjson.columns.tolist())

NDJSON data loaded with pandas:
  patient_id                conditions  last_visit  demographics.age  \
0       P001  [diabetes, hypertension]  2023-01-15              45.0   
1       P002                        []  2023-02-20              32.0   
2       P003                 [anxiety]        None               NaN   

  demographics.gender  vitals.bp_sys  vitals.bp_dia  
0                   M            120             80  
1                   F            130             85  
2                   F            140             90  

Column names:
['patient_id', 'conditions', 'last_visit', 'demographics.age', 'demographics.gender', 'vitals.bp_sys', 'vitals.bp_dia']
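
After flattening, the `conditions` column still holds Python lists. For per-condition analysis, one common approach is `DataFrame.explode`, which produces one row per list element. The sketch below uses a reduced, made-up version of the records and passes `sep="_"` so that flattened column names avoid dots.

```python
import pandas as pd

# Reduced, illustrative records with one nested dict and one list field
records = [
    {"patient_id": "P001", "demographics": {"age": 45}, "conditions": ["diabetes", "hypertension"]},
    {"patient_id": "P002", "demographics": {"age": 32}, "conditions": []},
    {"patient_id": "P003", "demographics": {"age": None}, "conditions": ["anxiety"]},
]

# Flatten nested dicts; sep="_" yields 'demographics_age' instead of 'demographics.age'
flat = pd.json_normalize(records, sep="_")

# One row per (patient, condition); an empty list becomes a single NaN row
long = flat.explode("conditions", ignore_index=True)
print(long)
```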


## Reading NDJSON with Polars

Polars provides native support for NDJSON files, making it easy to work with semi-structured medical data.

In [12]:
# Read NDJSON with Polars
df_polars_ndjson = pl.read_ndjson(ndjson_file)
print("NDJSON data loaded with Polars:")
print(df_polars_ndjson)
print("\nSchema:")
print(df_polars_ndjson.schema)

NDJSON data loaded with Polars:
shape: (3, 5)
┌────────────┬──────────────┬───────────┬──────────────────────────────┬────────────┐
│ patient_id ┆ demographics ┆ vitals    ┆ conditions                   ┆ last_visit │
│ ---        ┆ ---          ┆ ---       ┆ ---                          ┆ ---        │
│ str        ┆ struct[2]    ┆ struct[2] ┆ list[str]                    ┆ str        │
╞════════════╪══════════════╪═══════════╪══════════════════════════════╪════════════╡
│ P001       ┆ {45,"M"}     ┆ {120,80}  ┆ ["diabetes", "hypertension"] ┆ 2023-01-15 │
│ P002       ┆ {32,"F"}     ┆ {130,85}  ┆ []                           ┆ 2023-02-20 │
│ P003       ┆ {null,"F"}   ┆ {140,90}  ┆ ["anxiety"]                  ┆ null       │
└────────────┴──────────────┴───────────┴──────────────────────────────┴────────────┘

Schema:
Schema([('patient_id', String), ('demographics', Struct({'age': Int64, 'gender': String})), ('vitals', Struct({'bp_sys': Int64, 'bp_dia': Int64})), ('conditions', List(Str

## Handling Missing Values in NDJSON Data

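Let's count missing values in both representations. Because pandas flattened the nested fields, it reports the missing `demographics.age`. Polars counts nulls per top-level column, so a null inside the `demographics` struct does not show up in that column's count.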

In [13]:
# Check missing values in pandas NDJSON data
print("Missing values in pandas NDJSON data:")
print(df_ndjson.isnull().sum())

# Check null values in Polars NDJSON data
print("\nNull values in Polars NDJSON data:")
print(df_polars_ndjson.null_count())

Missing values in pandas NDJSON data:
patient_id             0
conditions             0
last_visit             1
demographics.age       1
demographics.gender    0
vitals.bp_sys          0
vitals.bp_dia          0
dtype: int64

Null values in Polars NDJSON data:
shape: (1, 5)
┌────────────┬──────────────┬────────┬────────────┬────────────┐
│ patient_id ┆ demographics ┆ vitals ┆ conditions ┆ last_visit │
│ ---        ┆ ---          ┆ ---    ┆ ---        ┆ ---        │
│ u32        ┆ u32          ┆ u32    ┆ u32        ┆ u32        │
╞════════════╪══════════════╪════════╪════════════╪════════════╡
│ 0          ┆ 0            ┆ 0      ┆ 0          ┆ 1          │
└────────────┴──────────────┴────────┴────────────┴────────────┘


## Writing Data Back to Files

Next, let's write the loaded data back out to CSV and NDJSON files.

In [14]:
# Write processed data to CSV with pandas
df_ndjson.to_csv('processed_medical_data.csv', index=False)

# Write data to CSV with Polars
df_polars_typed.write_csv('polars_medical_data.csv')

# Write back to NDJSON with Polars
df_polars_ndjson.write_ndjson('processed_medical_records.ndjson')

print("Files written successfully:")
print("- processed_medical_data.csv (pandas)")
print("- polars_medical_data.csv (polars)")
print("- processed_medical_records.ndjson (polars)")

Files written successfully:
- processed_medical_data.csv (pandas)
- polars_medical_data.csv (polars)
- processed_medical_records.ndjson (polars)


## Performance Comparison

Let's create a larger dataset to compare the performance of pandas vs polars for medical data operations.

In [15]:
import time

# Create larger dataset for performance testing
n_patients = 10000
large_data = {
    'patient_id': [f'P{i:06d}' for i in range(n_patients)],
    'age': np.random.randint(18, 90, n_patients),
    'gender': np.random.choice(['M', 'F'], n_patients),
    'blood_pressure_systolic': np.random.randint(90, 180, n_patients),
    'blood_pressure_diastolic': np.random.randint(60, 120, n_patients),
    'has_diabetes': np.random.choice([True, False], n_patients),
    'medication_count': np.random.randint(0, 10, n_patients)
}

large_df = pd.DataFrame(large_data)
large_df.to_csv('large_medical_data.csv', index=False)
print(f"Created dataset with {n_patients:,} patients")

Created dataset with 10,000 patients


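Now let's compare the read performance of pandas and Polars on this larger dataset. Each read is timed only once, so treat the numbers as a rough indication: they vary with hardware, library versions, and file-system caching.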

In [16]:
# Time pandas read (perf_counter is a monotonic, high-resolution timer)
start_time = time.perf_counter()
df_pandas_large = pd.read_csv('large_medical_data.csv')
pandas_time = time.perf_counter() - start_time

# Time polars read
start_time = time.perf_counter()
df_polars_large = pl.read_csv('large_medical_data.csv')
polars_time = time.perf_counter() - start_time

print(f"Pandas read time: {pandas_time:.4f} seconds")
print(f"Polars read time: {polars_time:.4f} seconds")
print(f"Speedup: {pandas_time/polars_time:.2f}x")

Pandas read time: 0.0875 seconds
Polars read time: 0.0020 seconds
Speedup: 43.69x


## Summary

In this notebook, we've covered essential techniques for reading and writing medical data in CSV and NDJSON formats using both pandas and polars. Key takeaways include:

1. **Data Type Control**: Always specify explicit data types when reading medical data to ensure data integrity
2. **Missing Value Handling**: Use appropriate nullable data types and custom NA value specifications
3. **Format Choice**: CSV for tabular data, NDJSON for semi-structured or nested medical records
4. **Performance**: Polars read the 10,000-row CSV considerably faster than pandas in our single-run test; benchmark on your own data before relying on a specific speedup
5. **Schema Definition**: Define schemas upfront to catch data quality issues early

## Exercise

Create a medical dataset with the following requirements:

1. Generate a CSV file containing 1,000 patient records with columns: `patient_id`, `age`, `gender`, `height_cm`, `weight_kg`, `diagnosis_code`, `admission_date`, `discharge_date`, `treatment_cost`
2. Introduce missing values in 10% of the records across different columns
3. Read the CSV file using both pandas and polars with appropriate data types:
   - Use categorical type for gender
   - Use nullable integers for height and weight
   - Parse dates correctly
   - Use string type for diagnosis codes
4. Create an NDJSON version of the same data with nested structure (group demographic info and medical info)
5. Compare the file sizes and read performance between CSV and NDJSON formats
6. Calculate summary statistics for missing values and identify patterns

Bonus: Implement data validation to ensure age is between 0-120, weight and height are positive, and discharge_date is after admission_date.