# Data Analytics for Health - Task 1.2.3: Feature Consolidation

## Overview
This notebook computes features at admission level (keyed by `hadm_id` and `subject_id`), then aggregates them to `subject_id` level to build the final patient profile. The target was roughly 10 features; this run produces 15 feature columns.

## Objectives
- Compute admission-level features for all datasets
- Aggregate admission-level features to subject_id level
- Add further features to reach ~10 total features
- Create final patient profile dataset

## Approach
1. Compute all features per admission (hadm_id, subject_id)
2. Aggregate admission-level features to subject_id level
3. Merge all features on subject_id
4. Save consolidated patient profile

---


In [20]:
import os
import pandas as pd
import numpy as np
import warnings
from pathlib import Path

warnings.filterwarnings('ignore')

# Set up file paths
notebook_dir = Path.cwd().resolve()
data_path = (notebook_dir / '..' / 'Data').resolve()

print("Libraries imported successfully")
print(f"Data path: {data_path}")


Libraries imported successfully
Data path: /Users/alexandermittet/Library/Mobile Documents/com~apple~CloudDocs/uni_life/UniPi DAD/data_analytics_4_health_unipi/Data


## 1. Load Raw Datasets


In [21]:
# Load all four datasets
df1 = pd.read_csv(data_path / 'heart_diagnoses_1.csv')  # Heart Diagnoses
df2 = pd.read_csv(data_path / 'laboratory_events_codes_2.csv')  # Laboratory Events
df3 = pd.read_csv(data_path / 'microbiology_events_codes_3.csv')  # Microbiology Events
df4 = pd.read_csv(data_path / 'procedure_code_4.csv')  # Procedure Codes

print(f"Loaded Heart Diagnoses: {df1.shape[0]:,} rows × {df1.shape[1]} columns")
print(f"Loaded Laboratory Events: {df2.shape[0]:,} rows × {df2.shape[1]} columns")
print(f"Loaded Microbiology Events: {df3.shape[0]:,} rows × {df3.shape[1]} columns")
print(f"Loaded Procedure Codes: {df4.shape[0]:,} rows × {df4.shape[1]} columns")
print("\nAll datasets loaded successfully!")


Loaded Heart Diagnoses: 4,864 rows × 25 columns
Loaded Laboratory Events: 978,503 rows × 14 columns
Loaded Microbiology Events: 15,587 rows × 14 columns
Loaded Procedure Codes: 14,497 rows × 6 columns

All datasets loaded successfully!


## 2. Clean Data: Remove Problematic hadm_ids

Drop every row whose `hadm_id` is linked to more than one `subject_id`, since such an admission cannot be attributed to a single patient.


In [22]:
# Function to clean datasets: remove hadm_ids with multiple subject_ids
def clean_df(df):
    """Remove rows where hadm_id maps to multiple subject_ids"""
    if 'hadm_id' in df.columns and 'subject_id' in df.columns:
        # Count unique subject_ids per hadm_id
        counts = df.groupby('hadm_id')['subject_id'].nunique()
        # Keep only hadm_ids with exactly one subject_id
        valid_hadm = counts[counts == 1].index
        return df[df['hadm_id'].isin(valid_hadm)].copy()
    return df.copy()

# Clean all datasets
df1_clean = clean_df(df1)
df2_clean = df2.copy()  # df2 doesn't have subject_id yet
df3_clean = clean_df(df3)
df4_clean = clean_df(df4)

print("Data cleaning completed!")
print(f"df1: {len(df1_clean):,} rows (removed {len(df1) - len(df1_clean):,} problematic rows)")
print(f"df3: {len(df3_clean):,} rows (removed {len(df3) - len(df3_clean):,} problematic rows)")
print(f"df4: {len(df4_clean):,} rows (removed {len(df4) - len(df4_clean):,} problematic rows)")


Data cleaning completed!
df1: 4,660 rows (removed 204 problematic rows)
df3: 11,459 rows (removed 4,128 problematic rows)
df4: 14,497 rows (removed 0 problematic rows)
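
To make the cleaning rule concrete, here is a minimal sketch on invented toy data (the IDs below are hypothetical and not taken from the datasets). It applies the same "exactly one `subject_id` per `hadm_id`" filter as `clean_df`, written with `transform` so the mask lines up with the rows directly.

```python
import pandas as pd

# Hypothetical toy data: admission 102 is linked to two different subjects
toy = pd.DataFrame({
    'hadm_id':    [101, 101, 102, 102, 103],
    'subject_id': [1,   1,   2,   3,   4],
})

# Number of distinct subject_ids seen for each row's hadm_id
n_subjects = toy.groupby('hadm_id')['subject_id'].transform('nunique')

# Keep only rows whose admission belongs to exactly one subject
toy_clean = toy[n_subjects == 1].copy()
print(toy_clean['hadm_id'].tolist())
```

Both rows of admission 102 are dropped, not just one of them, which is why the row counts removed above can exceed the number of ambiguous admissions.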


## 3. Create Reference Table for hadm_id → subject_id Mapping


In [23]:
# Create reference table for hadm_id -> subject_id mapping
ref_table = pd.concat([
    df1_clean[['hadm_id', 'subject_id']].drop_duplicates(),
    df3_clean[['hadm_id', 'subject_id']].drop_duplicates()
]).drop_duplicates()

print(f"Reference table: {len(ref_table):,} unique (hadm_id, subject_id) pairs")
print(f"Unique hadm_ids: {ref_table['hadm_id'].nunique():,}")
print(f"Unique subject_ids: {ref_table['subject_id'].nunique():,}")


Reference table: 4,711 unique (hadm_id, subject_id) pairs
Unique hadm_ids: 4,711
Unique subject_ids: 4,244


## 4. Compute Admission-Level Features

### 4.1 Laboratory Features per Admission


In [24]:
# Add subject_id to df2 (Labs) - it only has hadm_id
df2_with_subject = df2_clean.merge(ref_table, on='hadm_id', how='left')
print(f"df2 (Labs) after adding subject_id: {df2_with_subject['subject_id'].notna().sum():,} / {len(df2_with_subject):,} rows have subject_id")

# Remove rows without subject_id
df2_with_subject = df2_with_subject[df2_with_subject['subject_id'].notna()].copy()
print(f"Working with {len(df2_with_subject):,} lab events with valid subject_id")


df2 (Labs) after adding subject_id: 971,633 / 978,503 rows have subject_id
Working with 971,633 lab events with valid subject_id
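
The left merge above relies on `ref_table` holding at most one row per `hadm_id`. A hedged sketch of how that assumption could be checked during the merge itself, using the `validate` and `indicator` options of `DataFrame.merge`; the frames `labs` and `ref` are hypothetical stand-ins for `df2_clean` and `ref_table`.

```python
import pandas as pd

# Hypothetical toy stand-ins for df2_clean (labs) and ref_table
labs = pd.DataFrame({'hadm_id': [101, 101, 102, 999], 'valuenum': [1.0, 2.0, 3.0, 4.0]})
ref = pd.DataFrame({'hadm_id': [101, 102], 'subject_id': [1, 2]})

merged = labs.merge(
    ref,
    on='hadm_id',
    how='left',
    validate='many_to_one',  # raises MergeError if ref has duplicate hadm_ids
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Lab rows whose hadm_id has no entry in the reference table
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged), len(unmatched))
```

With `validate='many_to_one'` a duplicated key in the reference table fails loudly instead of silently multiplying lab rows.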


In [25]:
# Feature 1: Total count of laboratory events per admission
lab_count_per_adm = df2_with_subject.groupby(['hadm_id', 'subject_id']).size().reset_index(name='n_lab_events')
print(f"Feature 1 - n_lab_events: {len(lab_count_per_adm):,} admissions")
print(lab_count_per_adm.describe())


Feature 1 - n_lab_events: 4,702 admissions
            hadm_id    subject_id  n_lab_events
count  4.702000e+03  4.702000e+03   4702.000000
mean   2.502633e+07  1.500560e+07    206.642493
std    2.868247e+06  2.881679e+06    242.259902
min    2.000446e+07  1.000098e+07      3.000000
25%    2.262870e+07  1.246864e+07     79.000000
50%    2.504050e+07  1.498441e+07    134.000000
75%    2.747417e+07  1.750755e+07    248.000000
max    2.999967e+07  1.999850e+07   4591.000000


In [26]:
# Feature 2: Ratio of abnormal tests per admission
df2_with_subject['is_abnormal'] = df2_with_subject['flag'].str.contains('abnormal', case=False, na=False)
abnormal_ratio_per_adm = df2_with_subject.groupby(['hadm_id', 'subject_id'])['is_abnormal'].mean().reset_index(name='abnormal_ratio')
print(f"Feature 2 - abnormal_ratio: {len(abnormal_ratio_per_adm):,} admissions")
print(abnormal_ratio_per_adm.describe())


Feature 2 - abnormal_ratio: 4,702 admissions
            hadm_id    subject_id  abnormal_ratio
count  4.702000e+03  4.702000e+03     4702.000000
mean   2.502633e+07  1.500560e+07        0.309212
std    2.868247e+06  2.881679e+06        0.116179
min    2.000446e+07  1.000098e+07        0.000000
25%    2.262870e+07  1.246864e+07        0.231752
50%    2.504050e+07  1.498441e+07        0.311666
75%    2.747417e+07  1.750755e+07        0.390623
max    2.999967e+07  1.999850e+07        1.000000


In [27]:
# Feature 3: Maximum glucose value per admission
glucose_df = df2_with_subject[df2_with_subject['label'].str.contains('glucose', case=False, na=False)]
max_glucose_per_adm = glucose_df.groupby(['hadm_id', 'subject_id'])['valuenum'].max().reset_index(name='max_glucose')
print(f"Feature 3 - max_glucose: {len(max_glucose_per_adm):,} admissions with glucose measurements")
print(max_glucose_per_adm.describe())


Feature 3 - max_glucose: 4,677 admissions with glucose measurements
            hadm_id    subject_id  max_glucose
count  4.677000e+03  4.677000e+03  4674.000000
mean   2.502625e+07  1.500462e+07   180.003851
std    2.869366e+06  2.884325e+06   123.341160
min    2.000446e+07  1.000098e+07    46.000000
25%    2.262667e+07  1.246802e+07   115.000000
50%    2.503418e+07  1.498146e+07   147.000000
75%    2.747503e+07  1.751445e+07   199.000000
max    2.999967e+07  1.999850e+07  1190.000000


In [28]:
# Feature 4: Additional lab features - Mean creatinine and hemoglobin per admission
creatinine_df = df2_with_subject[df2_with_subject['label'].str.contains('creatinine', case=False, na=False)]
hemoglobin_df = df2_with_subject[df2_with_subject['label'].str.contains('hemoglobin', case=False, na=False)]

lab_features_adm = []
if len(creatinine_df) > 0:
    mean_creatinine = creatinine_df.groupby(['hadm_id', 'subject_id'])['valuenum'].mean().reset_index(name='mean_creatinine')
    lab_features_adm.append(mean_creatinine)
    print(f"Mean creatinine: {len(mean_creatinine):,} admissions")

if len(hemoglobin_df) > 0:
    mean_hemoglobin = hemoglobin_df.groupby(['hadm_id', 'subject_id'])['valuenum'].mean().reset_index(name='mean_hemoglobin')
    lab_features_adm.append(mean_hemoglobin)
    print(f"Mean hemoglobin: {len(mean_hemoglobin):,} admissions")

# Merge all lab features
lab_features = lab_count_per_adm.merge(abnormal_ratio_per_adm, on=['hadm_id', 'subject_id'], how='outer')
lab_features = lab_features.merge(max_glucose_per_adm, on=['hadm_id', 'subject_id'], how='outer')

if len(lab_features_adm) > 0:
    for feature_df in lab_features_adm:
        lab_features = lab_features.merge(feature_df, on=['hadm_id', 'subject_id'], how='outer')

print(f"\nFinal lab features shape: {lab_features.shape}")
print(f"Admissions with lab data: {len(lab_features):,}")


Mean creatinine: 4,689 admissions
Mean hemoglobin: 4,669 admissions

Final lab features shape: (4702, 7)
Admissions with lab data: 4,702


### 4.2 Microbiology Features per Admission


In [29]:
# Feature 5: Total count of microbiology examinations per admission
micro_count_per_adm = df3_clean.groupby(['hadm_id', 'subject_id']).size().reset_index(name='n_micro_exam')
print(f"Feature 5 - n_micro_exam: {len(micro_count_per_adm):,} admissions")
print(micro_count_per_adm.describe())


Feature 5 - n_micro_exam: 2,198 admissions
            hadm_id    subject_id  n_micro_exam
count  2.198000e+03  2.198000e+03   2198.000000
mean   2.503509e+07  1.504292e+07      5.213376
std    2.885937e+06  2.845304e+06      6.923173
min    2.001360e+07  1.000098e+07      1.000000
25%    2.259458e+07  1.260172e+07      1.000000
50%    2.511358e+07  1.501327e+07      3.000000
75%    2.750232e+07  1.751577e+07      7.000000
max    2.999967e+07  1.999737e+07    107.000000


In [30]:
# Feature 6: Count of positive microbiology results per admission
# Check if there's a column indicating positive results
if 'interpretation' in df3_clean.columns:
    df3_clean['is_positive'] = df3_clean['interpretation'].str.contains('positive', case=False, na=False)
    positive_count = df3_clean.groupby(['hadm_id', 'subject_id'])['is_positive'].sum().reset_index(name='n_positive_micro')
    print(f"Feature 6 - n_positive_micro: {len(positive_count):,} admissions")
    print(positive_count.describe())
    
    # Merge microbiology features
    micro_features = micro_count_per_adm.merge(positive_count, on=['hadm_id', 'subject_id'], how='outer')
else:
    micro_features = micro_count_per_adm.copy()
    print("No 'interpretation' column found, skipping positive count feature")

print(f"\nFinal micro features shape: {micro_features.shape}")


Feature 6 - n_positive_micro: 2,198 admissions
            hadm_id    subject_id  n_positive_micro
count  2.198000e+03  2.198000e+03            2198.0
mean   2.503509e+07  1.504292e+07               0.0
std    2.885937e+06  2.845304e+06               0.0
min    2.001360e+07  1.000098e+07               0.0
25%    2.259458e+07  1.260172e+07               0.0
50%    2.511358e+07  1.501327e+07               0.0
75%    2.750232e+07  1.751577e+07               0.0
max    2.999967e+07  1.999737e+07               0.0

Final micro features shape: (2198, 4)
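
Since `n_positive_micro` is 0 for every admission, the substring "positive" evidently never occurs in `interpretation`. Before deriving a flag from a text column, it can help to look at the values it actually holds. The sketch below uses invented values, as the real contents of `interpretation` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical values; the real contents of `interpretation` are not shown here
toy_micro = pd.DataFrame({'interpretation': ['S', 'R', 'S', None, 'I']})

# Distinct values and their frequencies, including missing entries
value_counts = toy_micro['interpretation'].value_counts(dropna=False)
print(value_counts)

# The substring test used above matches nothing in this toy column
n_matches = toy_micro['interpretation'].str.contains('positive', case=False, na=False).sum()
print(n_matches)
```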


### 4.3 Procedure Features per Admission


In [31]:
# Feature 7: Total count of procedures per admission
proc_count_per_adm = df4_clean.groupby(['hadm_id', 'subject_id']).size().reset_index(name='total_procedures')
print(f"Feature 7 - total_procedures: {len(proc_count_per_adm):,} admissions")
print(proc_count_per_adm.describe())


Feature 7 - total_procedures: 3,459 admissions
            hadm_id    subject_id  total_procedures
count  3.459000e+03  3.459000e+03       3459.000000
mean   2.502943e+07  1.499381e+07          4.191096
std    2.853466e+06  2.868527e+06          2.989024
min    2.000790e+07  1.000098e+07          1.000000
25%    2.264270e+07  1.249509e+07          2.000000
50%    2.506423e+07  1.495295e+07          3.000000
75%    2.747180e+07  1.747168e+07          6.000000
max    2.999967e+07  1.999850e+07         28.000000


In [32]:
# Feature 8: Count of unique procedure types per admission
if 'icd_code' in df4_clean.columns:
    unique_proc_types = df4_clean.groupby(['hadm_id', 'subject_id'])['icd_code'].nunique().reset_index(name='n_unique_procedures')
    print(f"Feature 8 - n_unique_procedures: {len(unique_proc_types):,} admissions")
    print(unique_proc_types.describe())
    
    # Merge procedure features
    proc_features = proc_count_per_adm.merge(unique_proc_types, on=['hadm_id', 'subject_id'], how='outer')
else:
    proc_features = proc_count_per_adm.copy()
    print("No 'icd_code' column found, skipping unique procedure count")

print(f"\nFinal proc features shape: {proc_features.shape}")


Feature 8 - n_unique_procedures: 3,459 admissions
            hadm_id    subject_id  n_unique_procedures
count  3.459000e+03  3.459000e+03          3459.000000
mean   2.502943e+07  1.499381e+07             4.058398
std    2.853466e+06  2.868527e+06             2.717689
min    2.000790e+07  1.000098e+07             1.000000
25%    2.264270e+07  1.249509e+07             2.000000
50%    2.506423e+07  1.495295e+07             3.000000
75%    2.747180e+07  1.747168e+07             6.000000
max    2.999967e+07  1.999850e+07            21.000000

Final proc features shape: (3459, 4)


### 4.4 Heart Diagnoses Features per Admission


In [33]:
# Feature 9: Age and gender per admission (from df1)
heart_features_adm = df1_clean[['hadm_id', 'subject_id', 'age', 'gender']].drop_duplicates(['hadm_id', 'subject_id']).copy()
print(f"Heart features (age, gender): {len(heart_features_adm):,} admissions")
print(heart_features_adm.describe())


Heart features (age, gender): 4,660 admissions
            hadm_id    subject_id         age
count  4.660000e+03  4.660000e+03  1326.00000
mean   2.502295e+07  1.500005e+07    68.96908
std    2.865917e+06  2.880019e+06    14.99554
min    2.000446e+07  1.000098e+07    18.00000
25%    2.263073e+07  1.247198e+07    60.00000
50%    2.503405e+07  1.497279e+07    70.00000
75%    2.746393e+07  1.749304e+07    81.00000
max    2.999967e+07  1.999850e+07    95.00000


In [34]:
# Feature 10: Diagnosis count per admission
if 'icd_code' in df1_clean.columns:
    diagnosis_count = df1_clean.groupby(['hadm_id', 'subject_id'])['icd_code'].count().reset_index(name='n_diagnoses')
    print(f"Feature 10 - n_diagnoses: {len(diagnosis_count):,} admissions")
    print(diagnosis_count.describe())
    
    # Merge heart features
    heart_features = heart_features_adm.merge(diagnosis_count, on=['hadm_id', 'subject_id'], how='outer')
else:
    heart_features = heart_features_adm.copy()
    print("No 'icd_code' column found, skipping diagnosis count")

print(f"\nFinal heart features shape: {heart_features.shape}")


Feature 10 - n_diagnoses: 4,660 admissions
            hadm_id    subject_id  n_diagnoses
count  4.660000e+03  4.660000e+03       4660.0
mean   2.502295e+07  1.500005e+07          1.0
std    2.865917e+06  2.880019e+06          0.0
min    2.000446e+07  1.000098e+07          1.0
25%    2.263073e+07  1.247198e+07          1.0
50%    2.503405e+07  1.497279e+07          1.0
75%    2.746393e+07  1.749304e+07          1.0
max    2.999967e+07  1.999850e+07          1.0

Final heart features shape: (4660, 5)


### 4.5 Time Features per Admission


In [35]:
# Load time features computed in previous notebook
time_features_adm = pd.read_csv(data_path / '1.2_admission_time_features.csv')
print(f"Time features per admission: {len(time_features_adm):,} admissions")
print(time_features_adm.describe())


Time features per admission: 4,864 admissions
         subject_id       hadm_id  days_since_last_admission
count  4.864000e+03  4.864000e+03                 472.000000
mean   1.510717e+07  2.501745e+07                 533.220339
std    2.938761e+06  2.873736e+06                 644.585515
min    1.000098e+07  2.000446e+07                   2.000000
25%    1.252385e+07  2.260252e+07                  70.750000
50%    1.507553e+07  2.503238e+07                 273.000000
75%    1.764939e+07  2.746833e+07                 716.250000
max    1.999860e+07  2.999967e+07                3581.000000


## 5. Merge All Admission-Level Features


In [36]:
# Start with reference table as base
admission_features = ref_table.copy()

# Merge all feature datasets
admission_features = admission_features.merge(lab_features, on=['hadm_id', 'subject_id'], how='outer')
admission_features = admission_features.merge(micro_features, on=['hadm_id', 'subject_id'], how='outer')
admission_features = admission_features.merge(proc_features, on=['hadm_id', 'subject_id'], how='outer')
admission_features = admission_features.merge(heart_features, on=['hadm_id', 'subject_id'], how='outer')
admission_features = admission_features.merge(time_features_adm, on=['hadm_id', 'subject_id'], how='outer')

print(f"Admission-level features shape: {admission_features.shape}")
print(f"\nFeatures: {admission_features.columns.tolist()}")
print(f"\nMissing values per column:")
print(admission_features.isna().sum())


Admission-level features shape: (4864, 15)

Features: ['hadm_id', 'subject_id', 'n_lab_events', 'abnormal_ratio', 'max_glucose', 'mean_creatinine', 'mean_hemoglobin', 'n_micro_exam', 'n_positive_micro', 'total_procedures', 'n_unique_procedures', 'age', 'gender', 'n_diagnoses', 'days_since_last_admission']

Missing values per column:
hadm_id                         0
subject_id                      0
n_lab_events                  162
abnormal_ratio                162
max_glucose                   190
mean_creatinine               175
mean_hemoglobin               196
n_micro_exam                 2666
n_positive_micro             2666
total_procedures             1405
n_unique_procedures          1405
age                          3538
gender                       3538
n_diagnoses                   204
days_since_last_admission    4392
dtype: int64


## 6. Aggregate Admission-Level Features to Subject Level


In [37]:
# Define aggregation strategy for each feature
agg_dict = {}

# Count features: sum across admissions
count_features = ['n_lab_events', 'n_micro_exam', 'total_procedures', 'n_diagnoses']
if 'n_positive_micro' in admission_features.columns:
    count_features.append('n_positive_micro')
if 'n_unique_procedures' in admission_features.columns:
    count_features.append('n_unique_procedures')

for feat in count_features:
    if feat in admission_features.columns:
        agg_dict[feat] = 'sum'

# Ratio/mean features: mean across admissions
ratio_features = ['abnormal_ratio']
for feat in ratio_features:
    if feat in admission_features.columns:
        agg_dict[feat] = 'mean'

# Max features: max across admissions
max_features = ['max_glucose']
for feat in max_features:
    if feat in admission_features.columns:
        agg_dict[feat] = 'max'

# Mean features: mean across admissions
mean_features = ['mean_creatinine', 'mean_hemoglobin']
for feat in mean_features:
    if feat in admission_features.columns:
        agg_dict[feat] = 'mean'

# Age: mean across admissions (or first if same)
if 'age' in admission_features.columns:
    agg_dict['age'] = 'mean'

# Gender: mode (most common) across admissions
if 'gender' in admission_features.columns:
    def mode_func(x):
        return x.mode()[0] if len(x.mode()) > 0 else None
    agg_dict['gender'] = mode_func

# Time features: mean across admissions
if 'days_since_last_admission' in admission_features.columns:
    agg_dict['days_since_last_admission'] = 'mean'

print("Aggregation strategy:")
for feat, strategy in agg_dict.items():
    print(f"  {feat}: {strategy}")


Aggregation strategy:
  n_lab_events: sum
  n_micro_exam: sum
  total_procedures: sum
  n_diagnoses: sum
  n_positive_micro: sum
  n_unique_procedures: sum
  abnormal_ratio: mean
  max_glucose: max
  mean_creatinine: mean
  mean_hemoglobin: mean
  age: mean
  gender: <function mode_func at 0x12961b940>
  days_since_last_admission: mean
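
One side effect of the `sum` strategy is worth knowing: by default a pandas group sum over only missing values returns 0, not NaN. A subject with no lab data therefore ends up with `n_lab_events = 0` after aggregation, whereas `mean` and `max` keep the NaN. A minimal sketch on invented toy data, showing the default and the `min_count=1` alternative:

```python
import pandas as pd
import numpy as np

# Hypothetical toy data: subject 2 has no lab data at all
toy = pd.DataFrame({
    'subject_id':   [1, 1, 2],
    'n_lab_events': [10.0, 5.0, np.nan],
})

grouped = toy.groupby('subject_id')['n_lab_events']

default_sum = grouped.sum()            # all-NaN group becomes 0.0
strict_sum = grouped.sum(min_count=1)  # all-NaN group stays NaN

print(default_sum.tolist())
print(strict_sum.tolist())
```

Whether 0 or NaN is the right value depends on whether "no rows" should be read as "no events" or as "unknown"; the notebook as written takes the first reading.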


In [38]:
# Aggregate to subject level
subject_features_list = []

# Gender is categorical: take the mode per subject (None when every value is missing)
if 'gender' in admission_features.columns:
    gender_mode = admission_features.groupby('subject_id')['gender'].apply(
        lambda x: x.mode()[0] if len(x.mode()) > 0 else None
    ).reset_index(name='gender')
    subject_features_list.append(gender_mode)

# Numeric features: reuse agg_dict, minus gender and any column missing from the dataframe
numeric_agg_dict = {
    k: v for k, v in agg_dict.items()
    if k != 'gender' and k in admission_features.columns
}
if numeric_agg_dict:
    subject_features_numeric = admission_features.groupby('subject_id').agg(numeric_agg_dict).reset_index()
    subject_features_list.append(subject_features_numeric)

# Merge all subject-level features
if subject_features_list:
    subject_features = subject_features_list[0]
    for df in subject_features_list[1:]:
        subject_features = subject_features.merge(df, on='subject_id', how='outer')
else:
    subject_features = admission_features.groupby('subject_id').first().reset_index()

print(f"Subject-level features shape: {subject_features.shape}")
print(f"\nFeatures: {subject_features.columns.tolist()}")


Subject-level features shape: (4392, 14)

Features: ['subject_id', 'gender', 'n_lab_events', 'n_micro_exam', 'total_procedures', 'n_diagnoses', 'n_positive_micro', 'n_unique_procedures', 'abnormal_ratio', 'max_glucose', 'mean_creatinine', 'mean_hemoglobin', 'age', 'days_since_last_admission']
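
Summing `n_unique_procedures` across admissions counts a procedure code once for every admission in which it appears. If the intent is the number of distinct procedure types per patient, the count can instead be taken directly at subject level. A sketch on invented toy data comparing the two:

```python
import pandas as pd

# Hypothetical toy procedures: code 'A' recurs in both admissions of subject 1
toy_proc = pd.DataFrame({
    'subject_id': [1, 1, 1, 1],
    'hadm_id':    [101, 101, 102, 102],
    'icd_code':   ['A', 'B', 'A', 'C'],
})

# Approach used above: unique codes per admission, then summed per subject
per_adm = toy_proc.groupby(['hadm_id', 'subject_id'])['icd_code'].nunique()
summed = per_adm.groupby(level='subject_id').sum()

# Alternative: distinct codes per subject, each counted once
distinct = toy_proc.groupby('subject_id')['icd_code'].nunique()

print(summed.loc[1], distinct.loc[1])
```

The two numbers agree only for subjects whose admissions share no procedure codes.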


## 7. Add Subject-Level Time Features


In [39]:
# Load subject-level time features
time_features_subject = pd.read_csv(data_path / '1.2_subject_time_features.csv')
print(f"Subject-level time features: {len(time_features_subject):,} subjects")
print(time_features_subject.columns.tolist())

# Merge with subject features
subject_features = subject_features.merge(time_features_subject, on='subject_id', how='outer')

print(f"\nFinal subject features shape: {subject_features.shape}")
print(f"Total features: {len(subject_features.columns) - 1}")  # -1 for subject_id


Subject-level time features: 4,392 subjects
['subject_id', 'n_total_admissions', 'days_since_last_admission']

Final subject features shape: (4392, 16)
Total features: 15
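
Both frames in this merge carry a column named `days_since_last_admission` (compare the two column lists printed above), so pandas keeps both and appends `_x` and `_y`. That is why the shape grows to 16 columns. A minimal sketch of the effect and of one way to avoid it, on invented toy frames:

```python
import pandas as pd

# Hypothetical toy frames sharing the column name days_since_last_admission
left = pd.DataFrame({'subject_id': [1, 2], 'days_since_last_admission': [30.0, None]})
right = pd.DataFrame({
    'subject_id': [1, 2],
    'n_total_admissions': [2, 1],
    'days_since_last_admission': [30.0, None],
})

# Default behaviour: the clashing column is kept twice with _x / _y suffixes
clashing = left.merge(right, on='subject_id', how='outer')
print(clashing.columns.tolist())

# One option: drop the column from one side before merging
deduped = left.drop(columns='days_since_last_admission').merge(
    right, on='subject_id', how='outer'
)
print(deduped.columns.tolist())
```

Which copy to keep is a modelling decision; dropping the left-hand one here is only for illustration.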


## 8. Feature Summary and Statistics


In [40]:
print("="*80)
print("FINAL PATIENT PROFILE SUMMARY")
print("="*80)
print(f"\nTotal subjects: {len(subject_features):,}")
print(f"Total features: {len(subject_features.columns) - 1}")  # Excluding subject_id

print("\n" + "="*80)
print("FEATURE LIST:")
print("="*80)
feature_list = [col for col in subject_features.columns if col != 'subject_id']
for i, feat in enumerate(feature_list, 1):
    print(f"{i}. {feat}")

print("\n" + "="*80)
print("MISSING VALUES:")
print("="*80)
missing = subject_features.isna().sum()
missing_pct = (missing / len(subject_features) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Missing %': missing_pct})
print(missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False))

print("\n" + "="*80)
print("FEATURE STATISTICS:")
print("="*80)
print(subject_features.describe())


FINAL PATIENT PROFILE SUMMARY

Total subjects: 4,392
Total features: 15

FEATURE LIST:
1. gender
2. n_lab_events
3. n_micro_exam
4. total_procedures
5. n_diagnoses
6. n_positive_micro
7. n_unique_procedures
8. abnormal_ratio
9. max_glucose
10. mean_creatinine
11. mean_hemoglobin
12. age
13. days_since_last_admission_x
14. n_total_admissions
15. days_since_last_admission_y

MISSING VALUES:
                             Missing Count  Missing %
days_since_last_admission_x           4001      91.10
days_since_last_admission_y           4001      91.10
gender                                3158      71.90
age                                   3158      71.90
mean_hemoglobin                        186       4.23
max_glucose                            176       4.01
mean_creatinine                        167       3.80
abnormal_ratio                         155       3.53

FEATURE STATISTICS:
         subject_id  n_lab_events  n_micro_exam  total_procedures  \
count  4.392000e+03   4392.00000

## 9. Save Consolidated Patient Profile


In [41]:
# Convert subject_id to integer (nullable)
if 'subject_id' in subject_features.columns:
    subject_features['subject_id'] = subject_features['subject_id'].astype('Int64')

# Save consolidated profile
output_file = data_path / '1.2.3_final_patient_profile.csv'
subject_features.to_csv(output_file, index=False)
print(f"✓ Saved consolidated patient profile to: {output_file}")
print(f"  Subjects: {len(subject_features):,}")
print(f"  Features: {len(subject_features.columns) - 1}")
print(f"  File size: {output_file.stat().st_size / 1024:.2f} KB")


✓ Saved consolidated patient profile to: /Users/alexandermittet/Library/Mobile Documents/com~apple~CloudDocs/uni_life/UniPi DAD/data_analytics_4_health_unipi/Data/1.2.3_final_patient_profile.csv
  Subjects: 4,392
  Features: 15
  File size: 386.93 KB


## Summary

**Features Created (Admission-Level → Subject-Level):**

1. **Laboratory Features:**
   - `n_lab_events`: Total count of laboratory events (sum across admissions)
   - `abnormal_ratio`: Ratio of abnormal tests (mean across admissions)
   - `max_glucose`: Maximum glucose value (max across admissions)
   - `mean_creatinine`: Mean creatinine value (mean across admissions, if available)
   - `mean_hemoglobin`: Mean hemoglobin value (mean across admissions, if available)

2. **Microbiology Features:**
   - `n_micro_exam`: Total count of microbiology examinations (sum across admissions)
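   - `n_positive_micro`: Count of positive results (sum across admissions); in this run it is 0 for every admission because no `interpretation` value contains the string "positive", so the feature is constant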

3. **Procedure Features:**
   - `total_procedures`: Total count of procedures (sum across admissions)
   - `n_unique_procedures`: Per-admission count of unique procedure codes, summed across admissions; a code that recurs in several admissions is counted once per admission

4. **Heart Diagnoses Features:**
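   - `age`: Average age (mean across admissions); recorded for only 1,326 of the 4,660 admissions in `df1_clean`, and missing for 71.90% of subjects
   - `gender`: Most frequent gender (mode across admissions); missing for 71.90% of subjects
   - `n_diagnoses`: Total diagnosis count (sum across admissions); every admission has exactly one diagnosis row in `df1_clean`, so this equals the number of the subject's admissions present there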

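5. **Time Features:**
   - `n_total_admissions`: Total number of unique admissions per subject
   - `days_since_last_admission`: Mean days between consecutive admissions. The saved file contains it twice because of a column-name clash in the final merge: `days_since_last_admission_x` (mean of the admission-level values) and `days_since_last_admission_y` (loaded from `1.2_subject_time_features.csv`); both are missing for 91.10% of subjects

**Total: 14 distinct features, saved as 15 columns**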
