## Feature Engineering    

**Definition**   
   
Feature engineering in machine learning is the process of transforming raw data into a set of useful features (also called variables or attributes) that help a machine learning model make better predictions.
    
**Files we will work with:**    
1. diagnoses_icd_sel: ../data/processed/diagnoses_icd_sel.csv     
2. admissions_sel: ../data/processed/admissions_selected.csv      
3. patients_sel: ../data/processed/patients_selected.csv       
   
**Features we will extract:**   
1. Age at the time of admission   
2. Gender of the patient   
3. Race of the patient   
4. Marital Status    
5. Insurance category   
6. Admission type   
7. Admission Location   
8. Discharge Location   
9. Did the patient die in the hospital  
10. Embeddings from the diagnoses (potentially reduced in dimension)      
11. Length of stay   
12. Total number of previous visits  


In [25]:
# Load the relevant libraries  
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Load the files that were processed during data exploration. 
diagnoses_icd_sel = pd.read_csv('../data/processed/diagnoses_icd_sel.csv',
                                dtype={'subject_id': str,
                                       'hadm_id': str,
                                       'long_title': str})  
parse_dates = ['admittime', 'dischtime', 'deathtime','edregtime','edouttime']
admissions_sel = pd.read_csv('../data/processed/admissions_selected.csv',
                                dtype={'subject_id': str,
                                       'hadm_id': str,
                                       'long_title': str},
                                       parse_dates=parse_dates)   
patients_sel = pd.read_csv('../data/processed/patients_selected.csv',
                                dtype={'subject_id': str,
                                       'anchor_age': int,
                                       'anchor_year': int})[['subject_id','gender','anchor_age','anchor_year']]

### Age at the time of admission   

`age_at_admission = anchor_age + (admittime - anchor_year_start_date).days / 365.25`

This is a good approximation of the age at the time of admission, where `anchor_year_start_date` is January 1 of `anchor_year`. More details on the logic can be found at [https://physionet.org/content/mimiciv/3.1/]. Since the admissions table doesn't contain `anchor_age` and `anchor_year`, we have to take them from the patients table. We will merge the four selected columns of the patients table (`subject_id`, `gender`, `anchor_age`, `anchor_year`) into the admissions table.  
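
To make the formula concrete, here is a minimal worked example on a single made-up row. The values (anchor age 52, anchor year 2180, admission on 2183-07-02) are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# One toy patient-admission row with made-up values.
toy = pd.DataFrame({
    "anchor_age": [52],
    "anchor_year": [2180],
    "admittime": pd.to_datetime(["2183-07-02 10:00:00"]),
})

# Treat January 1 of anchor_year as the reference date for anchor_age.
toy["anchor_year_start_date"] = pd.to_datetime(
    toy["anchor_year"].astype(str) + "-01-01"
)

# Whole days elapsed since the reference date, converted to years.
elapsed_years = (toy["admittime"] - toy["anchor_year_start_date"]).dt.days / 365.25
toy["age_at_admission"] = toy["anchor_age"] + elapsed_years

print(toy[["anchor_age", "admittime", "age_at_admission"]])
```

The admission falls roughly three and a half years after the reference date, so the computed age lands close to 55.5.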

In [27]:
admissions_sel = admissions_sel.merge(patients_sel, 
                     on='subject_id', 
                     how='left')

In [28]:
# anchor_year is an int, like 2190
# anchor_year_start_date will be '2190-01-01' as datetime
admissions_sel['anchor_year_start_date'] = pd.to_datetime(admissions_sel['anchor_year'].astype(str) + '-01-01')

# Age at admission: anchor_age plus the years elapsed since the start of anchor_year
admissions_sel['age_at_admission'] = (admissions_sel['admittime'] - admissions_sel['anchor_year_start_date']).dt.days / 365.25 + admissions_sel['anchor_age']

# Length of stay in whole days; missing values become 0
admissions_sel['length_of_stay'] = (
    (admissions_sel['dischtime'] - admissions_sel['admittime']).dt.days
    .fillna(0)
    .astype(int)
)

In [29]:
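# Sort chronologically within each patient, then number the admissions:
# cumcount() gives 0 for the first visit, 1 for the second, and so on
admissions_sel = admissions_sel.sort_values(by=['subject_id', 'admittime'])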
admissions_sel['num_prev_admissions'] = admissions_sel.groupby('subject_id').cumcount()
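
A small sketch of how the sort plus `cumcount()` yields the number of previous visits. The subject IDs and dates below are made up; only the technique mirrors the cell above.

```python
import pandas as pd

# Toy admissions in arbitrary order (made-up IDs and dates).
toy = pd.DataFrame({
    "subject_id": ["A", "B", "A", "A", "B"],
    "admittime": pd.to_datetime(
        ["2150-03-01", "2151-01-10", "2149-06-15", "2150-09-20", "2150-12-01"]
    ),
})

# Order each patient's admissions chronologically.
toy = toy.sort_values(by=["subject_id", "admittime"])

# cumcount() numbers rows within each group starting at 0, which equals
# the count of earlier admissions for that patient.
toy["num_prev_admissions"] = toy.groupby("subject_id").cumcount()

print(toy)
```

Without the sort, `cumcount()` would number rows in file order rather than in time order, so the counts would not mean "previous" visits.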

In [30]:
# Ensure the data is sorted by subject and admission time
admissions_sel = admissions_sel.sort_values(by=['subject_id', 'admittime'])

# For each subject, get the previous discharge time
admissions_sel['prev_dischtime'] = admissions_sel.groupby('subject_id')['dischtime'].shift(1)

# Calculate days since previous discharge
# (the column is named days_since_prev_admit, but the gap is measured from the previous discharge)
admissions_sel['days_since_prev_admit'] = (
    (admissions_sel['admittime'] - admissions_sel['prev_dischtime']).dt.total_seconds() / (60 * 60 * 24)
).round(1)

# Optional: drop the helper column
# admissions_sel = admissions_sel.drop(columns=['prev_dischtime'])


In [32]:
# define the conditions (order matters: np.select keeps the first condition that matches,
# e.g. both “BLACK” and “BLACK/CAPE…” fall under the first condition)
conds = [
    admissions_sel['race'].str.contains('BLACK', case=False, na=False),
    admissions_sel['race'].str.contains('WHITE', case=False, na=False),
    admissions_sel['race'].str.contains('ASIAN', case=False, na=False),
    admissions_sel['race'].str.contains('HISPANIC', case=False, na=False),
    # any of these substrings → native
    admissions_sel['race'].str.contains('AMERICAN INDIAN|ALASKA NATIVE|HAWAIIAN|PACIFIC ISLANDER',
                               case=False, na=False),
]

# the new group names
choices = [
    'BLACK',
    'WHITE',
    'ASIAN',
    'HISPANIC',
    'NATIVE',
]

# collapse into a new column
admissions_sel['race_group'] = np.select(conds, choices, default='OTHER_ALL')

# make it a categorical for memory & modeling
admissions_sel['race_group'] = admissions_sel['race_group'].astype('category')

# check your work
print(admissions_sel['race_group'].value_counts())

race_group
BLACK        16479
WHITE        15867
HISPANIC     13896
ASIAN        10250
OTHER_ALL     8792
NATIVE        1741
Name: count, dtype: int64
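
The following sketch illustrates why the order of conditions matters in `np.select`. The race strings are made up for illustration; in particular `"HISPANIC WHITE"` is a hypothetical value chosen because it matches two patterns.

```python
import numpy as np
import pandas as pd

# Made-up race strings, including a missing value and one that matches
# two patterns at once.
race = pd.Series(
    ["BLACK/CAPE VERDEAN", "HISPANIC WHITE", "UNKNOWN", None, "ALASKA NATIVE"]
)

conds = [
    race.str.contains("BLACK", case=False, na=False),
    race.str.contains("WHITE", case=False, na=False),
    race.str.contains("HISPANIC", case=False, na=False),
    race.str.contains("AMERICAN INDIAN|ALASKA NATIVE", case=False, na=False),
]
choices = ["BLACK", "WHITE", "HISPANIC", "NATIVE"]

# For each row, np.select returns the choice of the first True condition
# and falls back to the default when none match.
race_group = np.select(conds, choices, default="OTHER_ALL")

print(race_group.tolist())
```

Here the two-pattern string is assigned to `WHITE` only because that condition is listed before `HISPANIC`; swapping the two conditions would change its group.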


In [33]:
# Mark admissions that occurred within 30, 60, and 90 days of the previous discharge
admissions_sel['readmit_30d'] = admissions_sel['days_since_prev_admit'] <= 30
admissions_sel['readmit_60d'] = admissions_sel['days_since_prev_admit'] <= 60
admissions_sel['readmit_90d'] = admissions_sel['days_since_prev_admit'] <= 90

# Optional safeguard: comparisons against NaN already evaluate to False,
# so first admissions (NaN gap) are False either way
admissions_sel[['readmit_30d', 'readmit_60d', 'readmit_90d']] = admissions_sel[
    ['readmit_30d', 'readmit_60d', 'readmit_90d']
].fillna(False)
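
A quick check of how the threshold comparison treats missing gaps, using made-up gap values. It shows that a `NaN` gap (a first admission) compares as `False` without any extra handling.

```python
import numpy as np
import pandas as pd

# Made-up gaps in days; NaN stands for a patient's first admission.
days = pd.Series([np.nan, 12.5, 45.0, 400.2], name="days_since_prev_admit")

# One boolean column per readmission window.
flags = pd.DataFrame({f"readmit_{w}d": days <= w for w in (30, 60, 90)})

print(flags)
print(flags.dtypes)
```

Because the comparison never produces a missing value, the columns are plain `bool` from the start.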


In [34]:
admissions_sel.info()  

<class 'pandas.core.frame.DataFrame'>
Index: 67025 entries, 0 to 67023
Data columns (total 29 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   subject_id              67025 non-null  object        
 1   hadm_id                 67025 non-null  object        
 2   admittime               67025 non-null  datetime64[ns]
 3   dischtime               67025 non-null  datetime64[ns]
 4   deathtime               1387 non-null   datetime64[ns]
 5   admission_type          67025 non-null  object        
 6   admit_provider_id       67025 non-null  object        
 7   admission_location      67025 non-null  object        
 8   discharge_location      47014 non-null  object        
 9   insurance               65671 non-null  object        
 10  language                66933 non-null  object        
 11  marital_status          64755 non-null  object        
 12  race                    67025 non-null  object     

In [None]:
admissions_sel.to_csv('../data/processed/admissions_with_prev_admit.csv',
                    index=False)

In [36]:
admissions_sel.head()

Unnamed: 0,subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,...,anchor_year_start_date,age_at_admission,length_of_stay,num_prev_admissions,prev_dischtime,days_since_prev_admit,race_group,readmit_30d,readmit_60d,readmit_90d
0,10000764,27897940,2132-10-14 23:31:00,2132-10-19 16:30:00,NaT,URGENT,P38YR6,TRANSFER FROM HOSPITAL,HOME HEALTH CARE,Medicare,...,2132-01-01,86.785763,4,0,NaT,,WHITE,False,False,False
6,10000980,29654838,2188-01-03 17:41:00,2188-01-05 17:30:00,NaT,EW EMER.,P63MXO,EMERGENCY ROOM,HOME HEALTH CARE,Medicare,...,2186-01-01,75.004107,1,0,NaT,,BLACK,False,False,False
5,10000980,26913865,2189-06-27 07:38:00,2189-07-03 03:00:00,NaT,EW EMER.,P06OTX,EMERGENCY ROOM,HOME HEALTH CARE,Medicare,...,2186-01-01,76.485284,5,1,2188-01-05 17:30:00,538.6,BLACK,False,False,False
2,10000980,24947999,2190-11-06 20:57:00,2190-11-08 15:58:00,NaT,EW EMER.,P07L9V,EMERGENCY ROOM,HOME HEALTH CARE,Medicare,...,2186-01-01,77.845996,1,2,2189-07-03 03:00:00,491.7,BLACK,False,False,False
3,10000980,25242409,2191-04-03 18:48:00,2191-04-11 16:21:00,NaT,EW EMER.,P12VNM,EMERGENCY ROOM,SKILLED NURSING FACILITY,Medicare,...,2186-01-01,78.251198,7,3,2190-11-08 15:58:00,146.1,BLACK,False,False,False


## Generating Embeddings for the diagnoses   

We have already cleaned up the diagnoses file. The next step is to convert the long-form diagnosis text into embeddings. Because these embeddings are long vectors, we can either reduce their dimensionality with PCA or a similar technique, or use a model that produces smaller embeddings.  
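
As a rough sketch of the dimensionality-reduction option, the block below applies PCA through a plain NumPy SVD to random stand-in vectors. Everything here is an assumption made for illustration: the vectors are random rather than real embeddings, the width of 768 and the target of 32 components are arbitrary, and the `dx_pca_` column prefix is hypothetical. In practice a library implementation such as scikit-learn's `PCA` would be the more typical choice.

```python
import numpy as np
import pandas as pd

# Random stand-in for an (n_texts, embedding_dim) embedding matrix.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 768))

n_components = 32  # arbitrary illustrative choice

# PCA via SVD: center each column, then project onto the top
# right-singular vectors (the principal directions).
centered = fake_embeddings - fake_embeddings.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:n_components].T

# Share of total variance retained by the kept components.
explained = (singular_values[:n_components] ** 2).sum() / (singular_values ** 2).sum()

reduced_df = pd.DataFrame(
    reduced, columns=[f"dx_pca_{i}" for i in range(n_components)]
)
print(reduced_df.shape, round(float(explained), 3))
```

On real embeddings the retained-variance figure is the number to watch when picking the number of components; with random data it carries no meaning beyond showing the calculation.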

In [None]:
diagnoses_icd_sel = pd.read_csv('../data/processed/diagnoses_icd_sel.csv',
                                dtype={'subject_id': str,
                                       'hadm_id': str,
                                       'long_title': str})

diagnoses_icd_sel.head()

Unnamed: 0,subject_id,hadm_id,long_title
0,10000764,27897940,Closed fracture of nasal bones Subendocardial ...
1,10000980,20897796,Hypertensive heart and chronic kidney disease ...
2,10000980,24947999,Acute on chronic systolic heart failure Chroni...
3,10000980,25242409,Acute venous embolism and thrombosis of deep v...
4,10000980,25911675,Iron deficiency anemia secondary to blood loss...


In [38]:
texts = diagnoses_icd_sel['long_title'].tolist()[0:10]

In [39]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')

In [None]:
embeddings = model.encode(texts, show_progress_bar=True)

In [41]:
embeddings_df = pd.DataFrame(embeddings, columns=[f'dx_emb_{i}' for i in range(embeddings.shape[1])])
print(embeddings_df.head())

   dx_emb_0  dx_emb_1  dx_emb_2  dx_emb_3  dx_emb_4  dx_emb_5  dx_emb_6  \
0  0.048205  0.346949  0.073995  0.421037  0.425019  0.658539  0.055178   
1  0.038815  0.631561 -0.101334  0.579412  0.857697  0.291526 -0.005433   
2  0.062561  0.918950 -0.065482  0.348627  0.714114  0.563903  0.043631   
3 -0.044729  1.015024 -0.271827  0.188405  0.529622  0.445726  0.006442   
4 -0.202826  0.652059 -0.051373  0.222098  0.239650  0.727077 -0.284437   

   dx_emb_7  dx_emb_8  dx_emb_9  ...  dx_emb_758  dx_emb_759  dx_emb_760  \
0 -0.584762 -0.094246 -0.415138  ...   -0.410363   -0.276975    0.110320   
1 -0.260978 -0.224834 -0.611035  ...   -0.275132   -0.176171   -0.142483   
2 -0.740989  0.819205 -0.456968  ...   -0.456391   -0.452052   -0.072999   
3 -0.938168 -0.103282 -0.314952  ...   -0.278209   -0.552843   -0.119909   
4 -0.904236  0.433263 -0.616253  ...   -0.671413   -0.437905   -0.371307   

   dx_emb_761  dx_emb_762  dx_emb_763  dx_emb_764  dx_emb_765  dx_emb_766  \
0    0.243383  

In [None]:
from sentence_transformers import SentenceTransformer
import pandas as pd

# Step 1: Select input text
texts = diagnoses_icd_sel['long_title'].tolist()

# Step 2: Load the model (already loaded in an earlier cell; uncomment to reload)
# model = SentenceTransformer('pritamdeka/BioBERT-mnli-snli-scinli-scitail-mednli-stsb')

# Step 3: Generate embeddings
embeddings = model.encode(texts, show_progress_bar=True)

# Step 4: Create embeddings DataFrame
embeddings_df = pd.DataFrame(embeddings, columns=[f'dx_emb_{i}' for i in range(embeddings.shape[1])])

# Step 5: Combine with the original DataFrame (all rows; both frames share the default RangeIndex, so rows align)
diagnoses_icd_sel_with_embeddings = pd.concat([diagnoses_icd_sel, embeddings_df], axis=1)
diagnoses_icd_sel_with_embeddings.to_csv('../data/processed/diagnoses_icd_with_embeddings.csv', index=False)


In [None]:
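# Re-save without duplicate rows and without the raw text column.
# This overwrites the file written above; the in-memory DataFrame is left unchanged.
diagnoses_icd_sel_with_embeddings.drop_duplicates().drop(columns=['long_title']).to_csv(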
    '../data/processed/diagnoses_icd_with_embeddings.csv',
    index=False)

### Quick checkpoint   
   
At this point, we have all the information we need to build our ML model; it just needs to be put into a format ready for training the ML algorithm. The two most useful columns for merging tables are `subject_id` and `hadm_id`. There are two types of tables: those specific to a `subject_id`, such as demographics, and those that depend on both `subject_id` and `hadm_id`.    
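
The two kinds of tables call for two kinds of merges, sketched below on made-up frames. The table names (`patient_level`, `admission_level`, `diagnosis_level`) and all values are hypothetical; the `validate` argument is an optional safety check, not something the notebook itself uses.

```python
import pandas as pd

# Patient-level table: one row per subject_id.
patient_level = pd.DataFrame({
    "subject_id": ["A", "B"],
    "gender_M": [True, False],
})

# Admission-level table: one row per (subject_id, hadm_id).
admission_level = pd.DataFrame({
    "subject_id": ["A", "A", "B"],
    "hadm_id": ["1", "2", "3"],
    "length_of_stay": [4, 1, 7],
})

# Another admission-level table, missing one admission on purpose.
diagnosis_level = pd.DataFrame({
    "subject_id": ["A", "B"],
    "hadm_id": ["1", "3"],
    "dx_emb_0": [0.12, -0.40],
})

merged = (
    admission_level
    # Patient attributes repeat across all of that patient's admissions.
    .merge(patient_level, on="subject_id", how="left", validate="many_to_one")
    # Admission attributes must match on both keys.
    .merge(diagnosis_level, on=["subject_id", "hadm_id"], how="left",
           validate="one_to_one")
)

print(merged)
```

With `how="left"` every admission is kept, and an admission without a matching diagnosis row simply gets missing values in the added columns.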

### Demographics - Gender   

In [None]:
patients_sel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25363 entries, 0 to 25362
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subject_id   25363 non-null  object
 1   gender       25363 non-null  object
 2   anchor_age   25363 non-null  int64 
 3   anchor_year  25363 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 792.7+ KB


Only the `gender` column in this table is directly useful to the algorithm. It may be best to one-hot encode this feature.    

In [None]:
# One-hot encode gender, keeping only the gender_M indicator column
gender_onehot = pd.get_dummies(patients_sel[['subject_id', 'gender']], columns=['gender'])[['subject_id','gender_M']]

# Show the result
print(gender_onehot.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25363 entries, 0 to 25362
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   subject_id  25363 non-null  object
 1   gender_M    25363 non-null  bool  
dtypes: bool(1), object(1)
memory usage: 223.0+ KB
None


### Demographics - Race  

In [45]:
# One-hot encode 'race_group' while retaining 'subject_id'
race_onehot = pd.get_dummies(admissions_sel[['subject_id', 'race_group']], columns=['race_group'])

# Keep one row per subject_id (the first occurrence) and the five named groups;
# a row with all five indicators False belongs to OTHER_ALL
race_onehot = race_onehot.drop_duplicates(subset=['subject_id'])[[
    'subject_id',
    'race_group_BLACK',
    'race_group_WHITE',
    'race_group_ASIAN',
    'race_group_HISPANIC',
    'race_group_NATIVE',
]]

# Preview result
print(race_onehot.head())

   subject_id  race_group_BLACK  race_group_WHITE  race_group_ASIAN  \
0    10000764             False              True             False   
6    10000980              True             False             False   
9    10001884              True             False             False   
46   10002013             False              True             False   
48   10002114             False             False             False   

    race_group_HISPANIC  race_group_NATIVE  
0                 False              False  
6                 False              False  
9                 False              False  
46                False              False  
48                False              False  


In [58]:
features_df = admissions_sel[['subject_id', 
                 'hadm_id', 
                 'admission_type',
                 'admission_location',
                 'discharge_location', 
                 'insurance',
                 'marital_status',
                 'hospital_expire_flag', 
                 'gender',
                 'age_at_admission',
                 'length_of_stay',
                 'num_prev_admissions',
                 'race_group',
                 'readmit_30d',
                 'readmit_60d',
                 'readmit_90d']].copy().reset_index(drop=True)
print(features_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67025 entries, 0 to 67024
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   subject_id            67025 non-null  object  
 1   hadm_id               67025 non-null  object  
 2   admission_type        67025 non-null  object  
 3   admission_location    67025 non-null  object  
 4   discharge_location    47014 non-null  object  
 5   insurance             65671 non-null  object  
 6   marital_status        64755 non-null  object  
 7   hospital_expire_flag  67025 non-null  int64   
 8   gender                67025 non-null  object  
 9   age_at_admission      67025 non-null  float64 
 10  length_of_stay        67025 non-null  int64   
 11  num_prev_admissions   67025 non-null  int64   
 12  race_group            67025 non-null  category
 13  readmit_30d           67025 non-null  bool    
 14  readmit_60d           67025 non-null  bool    
 15  re

In [52]:
categorical_cols = ['admission_type',
                    'admission_location',
                    'discharge_location', 
                    'insurance',
                    'marital_status',
                    'gender',
                    'race_group']

admissions_sel[categorical_cols].head()

Unnamed: 0,admission_type,admission_location,discharge_location,insurance,marital_status,gender,race_group
0,URGENT,TRANSFER FROM HOSPITAL,HOME HEALTH CARE,Medicare,SINGLE,M,WHITE
6,EW EMER.,EMERGENCY ROOM,HOME HEALTH CARE,Medicare,MARRIED,F,BLACK
5,EW EMER.,EMERGENCY ROOM,HOME HEALTH CARE,Medicare,MARRIED,F,BLACK
2,EW EMER.,EMERGENCY ROOM,HOME HEALTH CARE,Medicare,MARRIED,F,BLACK
3,EW EMER.,EMERGENCY ROOM,SKILLED NURSING FACILITY,Medicare,MARRIED,F,BLACK


In [60]:
# One-hot encode all categorical columns; ID, numeric, and boolean columns pass through unchanged
featured_onehot = pd.get_dummies(features_df, 
                                    columns=categorical_cols)

print(featured_onehot.info())    

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67025 entries, 0 to 67024
Data columns (total 59 columns):
 #   Column                                                     Non-Null Count  Dtype  
---  ------                                                     --------------  -----  
 0   subject_id                                                 67025 non-null  object 
 1   hadm_id                                                    67025 non-null  object 
 2   hospital_expire_flag                                       67025 non-null  int64  
 3   age_at_admission                                           67025 non-null  float64
 4   length_of_stay                                             67025 non-null  int64  
 5   num_prev_admissions                                        67025 non-null  int64  
 6   readmit_30d                                                67025 non-null  bool   
 7   readmit_60d                                                67025 non-null  bool   
 8   readmi

In [None]:
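# Merge the embeddings on both keys. Duplicate rows and the raw text column are
# dropped first, mirroring what was saved to disk, so only numeric features are added.
features_final_df = featured_onehot.merge(
    diagnoses_icd_sel_with_embeddings.drop_duplicates().drop(columns=['long_title']),
    on=['subject_id', 'hadm_id'],
    how='left'
).reset_index(drop=True)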

features_final_df.to_csv('../data/processed/features_final_df.csv', index=False)

In [65]:
input_df = features_final_df.drop(columns=['subject_id', 'hadm_id', 'readmit_30d', 'readmit_60d', 'readmit_90d'])
output_df = features_final_df[['readmit_30d', 'readmit_60d', 'readmit_90d']]