# Data Analytics for Health - Task 1.2: Feature Engineering - Time-Based Features

## Overview
This notebook implements time-based features for the patient profile:
1. **Total admission count** per subject_id
2. **Time since last admission** per subject_id

## Objectives
- Compute total number of admissions per patient
- Calculate time intervals between admissions for each patient
- Integrate these features into the patient profile

---


In [4]:
import os
import pandas as pd
import numpy as np
import warnings
from pathlib import Path
from datetime import datetime

warnings.filterwarnings('ignore')

# Set up file paths
notebook_dir = Path.cwd().resolve()
data_path = (notebook_dir / '..' / 'Data').resolve()

print("Libraries imported successfully")
print(f"Data path: {data_path}")


Libraries imported successfully
Data path: /Users/alexandermittet/Library/Mobile Documents/com~apple~CloudDocs/uni_life/UniPi DAD/data_analytics_4_health_unipi/Data


## 1. Load Data

Load the heart diagnoses dataset which contains admission timestamps.
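
If the timestamp columns are wanted as datetimes from the start, they can be parsed while reading the file. The following is a minimal sketch, not the loading code used in this notebook: it reads an in-memory CSV built from a few rows of this notebook's sample output instead of the real file, and `sample_csv` and `sample_df` are illustrative names.

```python
import io

import pandas as pd

# Illustrative in-memory CSV (rows copied from the notebook's sample output),
# standing in for heart_diagnoses_1.csv.
sample_csv = io.StringIO(
    "subject_id,hadm_id,charttime,storetime\n"
    "10000980,29654838,2188-01-06 03:00:00,2188-01-07 23:49:00\n"
    "10000980,26913865,2189-07-04 03:00:00,2189-07-04 22:50:00\n"
    "10002013,24760295,2160-07-13 03:00:00,2160-07-15 16:59:00\n"
)

# parse_dates converts the listed columns to datetime while reading
sample_df = pd.read_csv(sample_csv, parse_dates=["charttime", "storetime"])

print(sample_df.dtypes)
```

Parsing at load time would make a separate `pd.to_datetime` step unnecessary, although the explicit conversion with `errors='coerce'` is more forgiving of malformed values.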


In [5]:
# Load heart diagnoses dataset (contains admission timestamps)
df1 = pd.read_csv(data_path / 'heart_diagnoses_1.csv')

print(f"Loaded Heart Diagnoses: {df1.shape[0]:,} rows × {df1.shape[1]} columns")
print(f"\nColumns: {df1.columns.tolist()}")
print(f"\nFirst few rows:")
df1[['subject_id', 'hadm_id', 'charttime', 'storetime']].head()


Loaded Heart Diagnoses: 4,864 rows × 25 columns

Columns: ['note_id', 'subject_id', 'hadm_id', 'note_type', 'note_seq', 'charttime', 'storetime', 'HPI', 'physical_exam', 'chief_complaint', 'invasions', 'X-ray', 'CT', 'Ultrasound', 'CATH', 'ECG', 'MRI', 'reports', 'subject_id_dx', 'icd_code', 'long_title', 'gender', 'age', 'anchor_year', 'dod']

First few rows:


Unnamed: 0,subject_id,hadm_id,charttime,storetime
0,10000980,29654838,2188-01-06 03:00:00,2188-01-07 23:49:00
1,10000980,26913865,2189-07-04 03:00:00,2189-07-04 22:50:00
2,10002013,24760295,2160-07-13 03:00:00,2160-07-15 16:59:00
3,10002155,23822395,2129-08-19 03:00:00,2129-08-20 15:29:00
4,10004457,28723315,2141-08-14 03:00:00,2141-08-14 21:50:00


## 2. Feature 1: Total Admission Count per subject_id

Count the total number of unique admissions (hadm_id) for each patient (subject_id).
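
Counting unique `hadm_id` values rather than rows matters whenever one admission can appear on several rows (for example, several notes per admission). A small sketch on hypothetical toy data (the `toy` frame and its values are made up for illustration) shows the difference:

```python
import pandas as pd

# Hypothetical toy data: patient 1 has two rows for admission 100
toy = pd.DataFrame({
    "subject_id": [1, 1, 1, 2],
    "hadm_id": [100, 100, 101, 200],
})

# size() counts rows, so the duplicated admission is counted twice
row_counts = toy.groupby("subject_id")["hadm_id"].size()

# nunique() counts distinct admissions, which is what the feature needs
unique_counts = toy.groupby("subject_id")["hadm_id"].nunique()

print(row_counts)
print(unique_counts)
```

For patient 1 the row count is 3 while the number of distinct admissions is 2.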


In [6]:
# Compute total admission count per subject_id
# Group by subject_id and count unique hadm_id values
total_admissions = df1.groupby('subject_id')['hadm_id'].nunique().reset_index()
total_admissions.columns = ['subject_id', 'n_total_admissions']

print(f"Total admissions per subject:")
print(total_admissions.describe())
print(f"\nSample:")
print(total_admissions.head(10))


Total admissions per subject:
         subject_id  n_total_admissions
count  4.392000e+03         4392.000000
mean   1.512674e+07            1.107468
std    2.945158e+06            0.374936
min    1.000098e+07            1.000000
25%    1.254920e+07            1.000000
50%    1.509920e+07            1.000000
75%    1.768003e+07            1.000000
max    1.999860e+07            5.000000

Sample:
   subject_id  n_total_admissions
0    10000980                   2
1    10002013                   1
2    10002155                   1
3    10004457                   1
4    10007058                   1
5    10010424                   1
6    10012343                   1
7    10013569                   1
8    10014651                   1
9    10017531                   1


## 3. Feature 2: Time Since Last Admission per subject_id

For each admission, calculate the number of days since the previous admission for that patient.
For the first admission of each patient, this will be NaN (no previous admission).
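
The sort-then-diff idea can be illustrated on hypothetical toy data before applying it to the real table. The `toy` frame below is made up; only the column names mirror this notebook.

```python
import pandas as pd

# Hypothetical toy admissions: patient 1 has three admissions, patient 2 has one.
# Rows are deliberately out of chronological order.
toy = pd.DataFrame({
    "subject_id": [1, 2, 1, 1],
    "hadm_id": [101, 200, 100, 102],
    "charttime": pd.to_datetime([
        "2150-01-11 03:00:00",
        "2150-03-01 03:00:00",
        "2150-01-01 03:00:00",
        "2150-02-10 03:00:00",
    ]),
})

# Sort within each patient so that diff() compares consecutive admissions
toy = toy.sort_values(["subject_id", "charttime"])

# diff() restarts for every patient, so each first admission gets NaT, then NaN
toy["days_since_last_admission"] = (
    toy.groupby("subject_id")["charttime"].diff().dt.total_seconds() / (24 * 3600)
)

print(toy[["subject_id", "hadm_id", "days_since_last_admission"]])
```

Patient 1 gets gaps of 10 and 30 days for the second and third admissions, and every patient has exactly one NaN (the first admission). That gives a quick sanity check: the number of NaN values should equal the number of distinct patients.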


In [7]:
# Prepare data: get unique (subject_id, hadm_id) pairs with timestamps
# Use charttime (the note's chart time) as a proxy for the admission timestamp
admission_times = df1[['subject_id', 'hadm_id', 'charttime']].drop_duplicates()

# Convert charttime to datetime
admission_times['charttime_dt'] = pd.to_datetime(admission_times['charttime'], errors='coerce')

# Remove rows with invalid timestamps
admission_times = admission_times[admission_times['charttime_dt'].notna()].copy()

print(f"Valid admission timestamps: {len(admission_times):,}")
print(f"\nDate range: {admission_times['charttime_dt'].min()} to {admission_times['charttime_dt'].max()}")
print(f"\nSample:")
admission_times.head(10)


Valid admission timestamps: 4,864

Date range: 2110-02-03 03:00:00 to 2209-12-24 03:00:00

Sample:


Unnamed: 0,subject_id,hadm_id,charttime,charttime_dt
0,10000980,29654838,2188-01-06 03:00:00,2188-01-06 03:00:00
1,10000980,26913865,2189-07-04 03:00:00,2189-07-04 03:00:00
2,10002013,24760295,2160-07-13 03:00:00,2160-07-13 03:00:00
3,10002155,23822395,2129-08-19 03:00:00,2129-08-19 03:00:00
4,10004457,28723315,2141-08-14 03:00:00,2141-08-14 03:00:00
5,10007058,22954658,2167-11-12 03:00:00,2167-11-12 03:00:00
6,10010424,28388172,2164-05-31 03:00:00,2164-05-31 03:00:00
7,10012343,27658045,2146-03-22 03:00:00,2146-03-22 03:00:00
8,10013569,22891949,2167-11-15 03:00:00,2167-11-15 03:00:00
9,10014651,20051301,2138-05-01 03:00:00,2138-05-01 03:00:00


In [8]:
# For each subject_id, sort admissions by time and calculate days since previous admission
admission_times_sorted = admission_times.sort_values(['subject_id', 'charttime_dt']).copy()

# Calculate days since last admission
admission_times_sorted['days_since_last_admission'] = (
    admission_times_sorted.groupby('subject_id')['charttime_dt']
    .diff()
    .dt.total_seconds() / (24 * 3600)  # Convert to days
)

print(f"Time since last admission statistics:")
print(admission_times_sorted['days_since_last_admission'].describe())
print(f"\nNumber of first admissions (NaN): {admission_times_sorted['days_since_last_admission'].isna().sum()}")
print(f"\nSample (showing subjects with multiple admissions):")
# Show examples where days_since_last_admission is not NaN
admission_times_sorted[admission_times_sorted['days_since_last_admission'].notna()].head(10)


Time since last admission statistics:
count     472.000000
mean      533.220339
std       644.585515
min         2.000000
25%        70.750000
50%       273.000000
75%       716.250000
max      3581.000000
Name: days_since_last_admission, dtype: float64

Number of first admissions (NaN): 4392

Sample (showing subjects with multiple admissions):


Unnamed: 0,subject_id,hadm_id,charttime,charttime_dt,days_since_last_admission
1,10000980,26913865,2189-07-04 03:00:00,2189-07-04 03:00:00,545.0
25,10055344,29209451,2171-11-03 03:00:00,2171-11-03 03:00:00,1756.0
38,10070614,21019690,2161-10-15 03:00:00,2161-10-15 03:00:00,676.0
55,10104308,24307783,2161-05-21 03:00:00,2161-05-21 03:00:00,575.0
58,10108435,26448261,2192-02-06 03:00:00,2192-02-06 03:00:00,2773.0
59,10108435,23333218,2193-04-22 03:00:00,2193-04-22 03:00:00,441.0
69,10124367,27078967,2170-01-12 03:00:00,2170-01-12 03:00:00,424.0
86,10182211,21994510,2130-01-12 03:00:00,2130-01-12 03:00:00,28.0
88,10188275,25261717,2148-02-18 03:00:00,2148-02-18 03:00:00,112.0
97,10203235,24203891,2130-04-26 03:00:00,2130-04-26 03:00:00,2224.0


## 4. Aggregate Features at Subject Level

Since we need subject-level features, we aggregate the admission-level data:
- `n_total_admissions`: already one value per subject
- `days_since_last_admission`: one value per admission, so subjects with multiple admissions need a summary. Options:
  - Mean time between admissions
  - Minimum time between admissions (shortest gap)
  - Maximum time between admissions (longest gap)
  - Time since the most recent admission (relative to a reference date)

We compute the mean, minimum, and maximum gap plus the number of gaps per subject, and carry the mean forward as the subject-level feature.
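
The last option in the list above, time since the most recent admission, is not computed in this notebook. A minimal sketch of how it could look is shown here, assuming the reference date is the latest timestamp in the data; that choice, the `toy` frame, and the names `reference_date` and `days_since_most_recent` are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy admissions (column names mirror admission_times)
toy = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "charttime_dt": pd.to_datetime(
        ["2150-01-01 03:00:00", "2150-01-11 03:00:00", "2150-01-21 03:00:00"]
    ),
})

# Assumption: the reference date is the latest timestamp in the data.
# Any other cutoff could be substituted here.
reference_date = toy["charttime_dt"].max()

# Latest admission per patient, then the gap to the reference date in days
last_admission = toy.groupby("subject_id")["charttime_dt"].max()
days_since_most_recent = (
    (reference_date - last_admission).dt.total_seconds() / (24 * 3600)
)

print(days_since_most_recent)
```

One caveat: the timestamps in this dataset span 2110 to 2209, which suggests de-identified, shifted dates. If the shift differs between patients, a single global reference date is not comparable across patients, and a per-patient reference would be needed instead.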


In [9]:
# Aggregate time features per subject_id
# Named aggregation yields flat, descriptive column names directly
time_features = admission_times_sorted.groupby('subject_id').agg(
    mean_days_between_admissions=('days_since_last_admission', 'mean'),
    min_days_between_admissions=('days_since_last_admission', 'min'),
    max_days_between_admissions=('days_since_last_admission', 'max'),
    n_admissions_with_previous=('days_since_last_admission', 'count'),
).reset_index()

# For subjects with only one admission, mean/min/max are NaN and the count is 0
# "Days since most recent admission" (relative to a reference date) is not computed here;
# the mean gap is used as the primary feature

print(f"Time features per subject:")
print(time_features.describe())
print(f"\nSample:")
time_features.head(10)


Time features per subject:
         subject_id  mean_days_between_admissions  \
count  4.392000e+03                    391.000000   
mean   1.512674e+07                    531.629156   
std    2.945158e+06                    619.454271   
min    1.000098e+07                      2.000000   
25%    1.254920e+07                     81.000000   
50%    1.509920e+07                    298.500000   
75%    1.768003e+07                    728.500000   
max    1.999860e+07                   3494.000000   

       min_days_between_admissions  max_days_between_admissions  \
count                   391.000000                   391.000000   
mean                    473.547315                   591.718670   
std                     615.789301                   679.965182   
min                       2.000000                     2.000000   
25%                      50.000000                    86.500000   
50%                     225.000000                   333.000000   
75%                     63

Unnamed: 0,subject_id,mean_days_between_admissions,min_days_between_admissions,max_days_between_admissions,n_admissions_with_previous
0,10000980,545.0,545.0,545.0,1
1,10002013,,,,0
2,10002155,,,,0
3,10004457,,,,0
4,10007058,,,,0
5,10010424,,,,0
6,10012343,,,,0
7,10013569,,,,0
8,10014651,,,,0
9,10017531,,,,0


In [10]:
# Create final subject-level feature set
subject_features = total_admissions.merge(
    time_features[['subject_id', 'mean_days_between_admissions']], 
    on='subject_id', 
    how='left'
)

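# Rename to the patient-profile feature name.
# Note: the value is the MEAN gap between consecutive admissions,
# not the gap before the most recent admission.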
subject_features = subject_features.rename(columns={
    'mean_days_between_admissions': 'days_since_last_admission'
})

print(f"Final subject-level features:")
print(f"Shape: {subject_features.shape}")
print(f"\nSummary:")
print(subject_features.describe())
print(f"\nMissing values:")
print(subject_features.isna().sum())
print(f"\nSample:")
subject_features.head(10)


Final subject-level features:
Shape: (4392, 3)

Summary:
         subject_id  n_total_admissions  days_since_last_admission
count  4.392000e+03         4392.000000                 391.000000
mean   1.512674e+07            1.107468                 531.629156
std    2.945158e+06            0.374936                 619.454271
min    1.000098e+07            1.000000                   2.000000
25%    1.254920e+07            1.000000                  81.000000
50%    1.509920e+07            1.000000                 298.500000
75%    1.768003e+07            1.000000                 728.500000
max    1.999860e+07            5.000000                3494.000000

Missing values:
subject_id                      0
n_total_admissions              0
days_since_last_admission    4001
dtype: int64

Sample:


Unnamed: 0,subject_id,n_total_admissions,days_since_last_admission
0,10000980,2,545.0
1,10002013,1,
2,10002155,1,
3,10004457,1,
4,10007058,1,
5,10010424,1,
6,10012343,1,
7,10013569,1,
8,10014651,1,
9,10017531,1,


## 5. Save Features

Save the computed features for integration with other datasets.
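
Because most subjects have a missing gap value, it is worth confirming that NaN survives the CSV round trip. A minimal sketch, using a temporary directory and a hypothetical toy table (`toy_features` and the file name are illustrative, not the notebook's actual output):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy feature table with a missing gap for a single-admission patient
toy_features = pd.DataFrame({
    "subject_id": [1, 2],
    "n_total_admissions": [2, 1],
    "days_since_last_admission": [545.0, float("nan")],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_time_features.csv"  # illustrative file name
    toy_features.to_csv(toy_path, index=False)  # NaN is written as an empty field
    reloaded = pd.read_csv(toy_path)  # empty fields are read back as NaN

print(reloaded)
```

Downstream notebooks that load these files therefore see the same missing values and still need an explicit imputation or indicator strategy for single-admission patients.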


In [None]:
# Save subject-level time features
output_file = data_path / '1.2_subject_time_features.csv'
subject_features.to_csv(output_file, index=False)
print(f"✓ Saved subject-level time features to: {output_file}")
print(f"  Features: {subject_features.columns.tolist()}")
print(f"  Subjects: {len(subject_features):,}")

# Also save admission-level features (with days_since_last_admission per admission)
admission_features = admission_times_sorted[['subject_id', 'hadm_id', 'days_since_last_admission']].copy()
admission_output_file = data_path / '1.2_admission_time_features.csv'
admission_features.to_csv(admission_output_file, index=False)
print(f"\n✓ Saved admission-level time features to: {admission_output_file}")
print(f"  Admissions: {len(admission_features):,}")


✓ Saved subject-level time features to: /Users/alexandermittet/Library/Mobile Documents/com~apple~CloudDocs/uni_life/UniPi DAD/data_analytics_4_health_unipi/Data/subject_time_features.csv
  Features: ['subject_id', 'n_total_admissions', 'days_since_last_admission']
  Subjects: 4,392

✓ Saved admission-level time features to: /Users/alexandermittet/Library/Mobile Documents/com~apple~CloudDocs/uni_life/UniPi DAD/data_analytics_4_health_unipi/Data/admission_time_features.csv
  Admissions: 4,864


## Summary

**Features Created:**

1. **`n_total_admissions`**: Total number of unique admissions per subject_id
   - Objective: Capture patient admission frequency/history
   - Mathematical formulation: `n_total_admissions(subject_id) = count(unique hadm_id for subject_id)`

2. **`days_since_last_admission`**: Mean number of days between consecutive admissions per subject_id (despite the name, this is the average gap, not the gap before the most recent admission)
   - Objective: Measure time intervals between patient readmissions
   - Mathematical formulation: `days_since_last_admission(subject_id) = mean(diff(sorted charttime) for all admissions of subject_id)`
   - Note: NaN for subjects with only one admission (no previous admission to compare), which is 4,001 of the 4,392 subjects
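
The two formulations can also be written in standard notation. The symbols are introduced here for illustration and do not appear elsewhere in the notebook: $A$ is the set of (subject_id, hadm_id) pairs, $n_s$ is the admission count of subject $s$, and $t_{s,1} < \dots < t_{s,n_s}$ are that subject's sorted `charttime` values, assuming one timestamp per admission.

```latex
n_s = \left|\{\, h : (s, h) \in A \,\}\right|
\qquad
\bar{\Delta}_s
  = \frac{1}{n_s - 1} \sum_{i=2}^{n_s} \left(t_{s,i} - t_{s,i-1}\right)
  = \frac{t_{s,n_s} - t_{s,1}}{n_s - 1},
  \quad n_s \ge 2
```

The sum telescopes, so the mean gap depends only on the first and last admission times and the admission count. For $n_s = 1$ the feature is undefined, which corresponds to the NaN values.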

