## Extended grading

| Criteria | Weighting (%) | Score | Justification | Improvements |
|----------|---------------|-------|---------------|---------------|
|Data preparation | 20 | 20 | age engineered and outliers handled, dimension of ICD9s logically reduced, target encoding fr diagnoses, number of hospital stays and prevalent/recent diagnoses for patients, comorbidity data included, feature engineering of vital signs, sensible dimension reduction for ethnicity and religion, attempt at text analysis of diagnoses data. Un order categories encoded, missing data imputed by kNN and data scaled
|kNN(s) have been used | 10 | 10 | with explicit arguments for number of neighbours, weights and metric |
|Probability of death for each test patient is computed | 5 | 5 |
|Accuracy | 5 | 5 | > 0.9
|Hyperparameter optimization | 10 | 6 | Reasonable grid search, I would have liked to see evidence on how you identified 32-34 as reasonable rage for te number of neighbours, did you try uniform weight or any other distance?
|SVM(s) have been used | 10 | 10 | with explicit arguments for cost, kernel and hyperparameter |
|Probability of death for each test patient is computed | 5 | 5 |
|Accuracy | 5 | 4 | > 0.9
|Hyperparameter optimization | 10 | 6 | Reasonable grid search, again I would have liked to see evidence of how you identified 10-15 as a good range for the cost, what values for the hyperparameter gamma did you try, this will matter a lot
|Class Imbalance Managed | 5 | 3 | SMOTE used but probabilities not reweighed, remember if yo resample you must reweigh the resulting probabilities
|Neat code with titles and comments | 5 | 3 | Good presentation, I would have lied to see slightly more detailed comments on what your code is doing and why and also interpreting the results
|Improved methods from those discussed in class | 10 | 6 | Text analysis, cosine similarity, SMOTE

Score: 83

Other Feedback:

Rather than dropping patients with no ICD9_code you could have created an other class.

Your target encoding done by hand is nice, you could also add some smoothing, i.e. not just look at the raw average but also add add the average death rate so as not to overfit to diagnoses that don't appear too often - this would be smoothed target encoding.

Be careful with your most prevalent diagnoses, the count for most of these is 1, which is probably just arbitrarily choosing the first one listed

I don't think the rbf kernel has a degree

In [37]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import klib as kl 
from sklearn.impute import KNNImputer
import re

from datetime import datetime, timedelta
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from scipy.stats import skew 
import category_encoders as ce
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.compose import ColumnTransformer
from imblearn.pipeline import make_pipeline 
from sklearn.base import BaseEstimator, TransformerMixin

# import Target Encoder
from category_encoders import TargetEncoder, BinaryEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

from data_science_toolkit.toolkit import *


### Overview

##### snake case column names
##### clean up and bin age 
##### create standardized diagnosis
##### process extra data set 
    
    - most recurrent diagnosis

    - diagnosis count

    - columns for each diagnosis
##### clean up, consolidate, and feature engineer 

    - ethnicity

    - religion

    - marital status

    - health metrics
##### text processing 

    - preprocess text

    - create cosine similarity for death and survival texts 
##### preprocessing

    - log transform

    - encode (target, binary)

    - impute

    - scale

    - resample
##### grid search

In [38]:
# robust scaler is less prone to outliers because it uses the median and the interquartile range, 
# instead of mean and standard deviation


In [39]:
# supress warnings
import warnings
warnings.filterwarnings('ignore')

In [40]:
# Load data
test = pd.read_csv('data/mimic_test_death.csv')
train = pd.read_csv('data/mimic_train.csv')
diagnosis = pd.read_csv('data/MIMIC_metadata_diagnose.csv')
all_patient_diagnosis = pd.read_csv('data/MIMIC_diagnoses.csv')

In [41]:
test = kl.clean_column_names(test)
train = kl.clean_column_names(train)
diagnosis = kl.clean_column_names(diagnosis)
all_patient_diagnosis = kl.clean_column_names(all_patient_diagnosis)


## General Preprocessing

In [42]:
# consolidate into one function: 

def preprocess(df): 

    diagnosis.columns = ['icd9_diagnosis', 'short_diagnose', 'long_diagnose']
    df = df.join(diagnosis.set_index('icd9_diagnosis'), on='icd9_diagnosis')

    df = kl.data_cleaning(df)

    df.admittime = pd.to_datetime(df.admittime)
    df.dob = pd.to_datetime(df.dob)

    df['dob'] = df.dob.apply(lambda e: (e.date()))
    df['admittime'] = df.admittime.apply(lambda e: (e.date()))

    df['age'] = df.apply(lambda e: (e['admittime'] - e['dob']).days/365, axis=1)

    df.loc[df['age'] > 300, 'age'] = 90
    
    # KeyError: "['dod', 'dischtime', 'deathtime'] not in index for test set
    df = df.drop(columns=[ 'dod', 'dischtime', 'deathtime'], errors='ignore')

    return df

# apply function to both dataframes
train = preprocess(train)
test = preprocess(test)

Shape of cleaned data: (20885, 46) - Remaining NAs: 80819


Dropped rows: 0
     of which 0 duplicates. (Rows (first 150 shown): [])

Dropped columns: 0
     of which 0 single valued.     Columns: []
Dropped missing values: 0
Reduced memory by at least: 3.34 MB (-45.57%)

Shape of cleaned data: (5221, 41) - Remaining NAs: 12194


Dropped rows: 0
     of which 0 duplicates. (Rows (first 150 shown): [])

Dropped columns: 0
     of which 0 single valued.     Columns: []
Dropped missing values: 0
Reduced memory by at least: 0.78 MB (-47.85%)



## Engineer ICD9 Short Codes

In [43]:
train['icd9_diagnosis_prefix'] = train['icd9_diagnosis'].str[:3]
train['icd9_diagnosis_standardize'] = ''
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].str.contains('V')) | (train['icd9_diagnosis_prefix'].str.contains('E'))] = 'external causes of injury and supplemental classification'
train['icd9_diagnosis_prefix'].loc[(train['icd9_diagnosis_prefix'].str.contains('V')) | (train['icd9_diagnosis_prefix'].str.contains('E'))] = '0'
train.icd9_diagnosis_prefix = train.icd9_diagnosis_prefix.astype(int)
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(1, 139)) & (train['icd9_diagnosis_standardize'] == '')] = 'infectious and parasitic diseases'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(140, 239)) & (train['icd9_diagnosis_standardize'] == '')] = 'neoplasms'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(240, 279)) & (train['icd9_diagnosis_standardize'] == '')] = 'endocrine, nutritional and metabolic diseases, and immunity disorders'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(280, 289)) & (train['icd9_diagnosis_standardize'] == '')] = 'diseases of the blood and blood-forming organs'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(290, 319)) & (train['icd9_diagnosis_standardize'] == '')] = 'mental disorders'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(320, 389)) & (train['icd9_diagnosis_standardize'] == '')] = 'diseases of the nervous system and sense organs'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(390, 459)) & (train['icd9_diagnosis_standardize'] == '')] = 'diseases of the circulatory system'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(460, 519)) & (train['icd9_diagnosis_standardize'] == '')] = 'diseases of the respiratory system'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(520, 579)) & (train['icd9_diagnosis_standardize'] == '')] = 'diseases of the digestive system'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(580, 629)) & (train['icd9_diagnosis_standardize'] == '')] = 'diseases of the genitourinary system'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(630, 679)) & (train['icd9_diagnosis_standardize'] == '')] = 'complications of pregnancy, childbirth, and the puerperium'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(680, 709)) & (train['icd9_diagnosis_standardize'] == '')] = 'diseases of the skin and subcutaneous tissue'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(710, 739)) & (train['icd9_diagnosis_standardize'] == '')] = 'diseases of the musculoskeletal system and connective tissue'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(740, 759)) & (train['icd9_diagnosis_standardize'] == '')] = 'congenital anomalies'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(760, 779)) & (train['icd9_diagnosis_standardize'] == '')] = 'certain conditions originating in the perinatal period'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(780, 799)) & (train['icd9_diagnosis_standardize'] == '')] = 'symptoms, signs, and ill-defined conditions'
train['icd9_diagnosis_standardize'].loc[(train['icd9_diagnosis_prefix'].between(800, 999)) & (train['icd9_diagnosis_standardize'] == '')] = 'injury and poisoning'

# do the same for test set
test['icd9_diagnosis_prefix'] = test['icd9_diagnosis'].str[:3]
test['icd9_diagnosis_standardize'] = ''
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].str.contains('V')) | (test['icd9_diagnosis_prefix'].str.contains('E'))] = 'external causes of injury and supplemental classification'
test['icd9_diagnosis_prefix'].loc[(test['icd9_diagnosis_prefix'].str.contains('V')) | (test['icd9_diagnosis_prefix'].str.contains('E'))] = '0'
test.icd9_diagnosis_prefix = test.icd9_diagnosis_prefix.astype(int)
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(1, 139)) & (test['icd9_diagnosis_standardize'] == '')] = 'infectious and parasitic diseases'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(140, 239)) & (test['icd9_diagnosis_standardize'] == '')] = 'neoplasms'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(240, 279)) & (test['icd9_diagnosis_standardize'] == '')] = 'endocrine, nutritional and metabolic diseases, and immunity disorders'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(280, 289)) & (test['icd9_diagnosis_standardize'] == '')] = 'diseases of the blood and blood-forming organs'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(290, 319)) & (test['icd9_diagnosis_standardize'] == '')] = 'mental disorders'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(320, 389)) & (test['icd9_diagnosis_standardize'] == '')] = 'diseases of the nervous system and sense organs'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(390, 459)) & (test['icd9_diagnosis_standardize'] == '')] = 'diseases of the circulatory system'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(460, 519)) & (test['icd9_diagnosis_standardize'] == '')] = 'diseases of the respiratory system'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(520, 579)) & (test['icd9_diagnosis_standardize'] == '')] = 'diseases of the digestive system'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(580, 629)) & (test['icd9_diagnosis_standardize'] == '')] = 'diseases of the genitourinary system'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(630, 679)) & (test['icd9_diagnosis_standardize'] == '')] = 'complications of pregnancy, childbirth, and the puerperium'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(680, 709)) & (test['icd9_diagnosis_standardize'] == '')] = 'diseases of the skin and subcutaneous tissue'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(710, 739)) & (test['icd9_diagnosis_standardize'] == '')] = 'diseases of the musculoskeletal system and connective tissue'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(740, 759)) & (test['icd9_diagnosis_standardize'] == '')] = 'congenital anomalies'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(760, 779)) & (test['icd9_diagnosis_standardize'] == '')] = 'certain conditions originating in the perinatal period'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(780, 799)) & (test['icd9_diagnosis_standardize'] == '')] = 'symptoms, signs, and ill-defined conditions'
test['icd9_diagnosis_standardize'].loc[(test['icd9_diagnosis_prefix'].between(800, 999)) & (test['icd9_diagnosis_standardize'] == '')] = 'injury and poisoning'

# do the same for all_patient_diagnosis
all_patient_diagnosis['icd9_diagnosis_prefix'] = all_patient_diagnosis['icd9_code'].str[:3]
all_patient_diagnosis['icd9_diagnosis_standardize'] = ''
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].str.contains('V')) | (all_patient_diagnosis['icd9_diagnosis_prefix'].str.contains('E'))] = 'external causes of injury and supplemental classification'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].str.contains('V')) | (all_patient_diagnosis['icd9_diagnosis_prefix'].str.contains('E'))] = 'external causes of injury and supplemental classification'
# drop rows with missing diagnosis
all_patient_diagnosis = all_patient_diagnosis.dropna(subset=['icd9_code'])
all_patient_diagnosis['icd9_diagnosis_prefix'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].str.contains('V')) | (all_patient_diagnosis['icd9_diagnosis_prefix'].str.contains('E'))] = '0'
all_patient_diagnosis.icd9_diagnosis_prefix = all_patient_diagnosis.icd9_diagnosis_prefix.astype(int)
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(1, 139)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'infectious and parasitic diseases'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(140, 239)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'neoplasms'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(240, 279)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'endocrine, nutritional and metabolic diseases, and immunity disorders'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(280, 289)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'diseases of the blood and blood-forming organs'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(290, 319)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'mental disorders'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(320, 389)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'diseases of the nervous system and sense organs'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(390, 459)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'diseases of the circulatory system'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(460, 519)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'diseases of the respiratory system'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(520, 579)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'diseases of the digestive system'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(580, 629)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'diseases of the genitourinary system'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(630, 679)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'complications of pregnancy, childbirth, and the puerperium'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(680, 709)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'diseases of the skin and subcutaneous tissue'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(710, 739)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'diseases of the musculoskeletal system and connective tissue'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(740, 759)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'congenital anomalies'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(760, 779)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'certain conditions originating in the perinatal period'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(780, 799)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'symptoms, signs, and ill-defined conditions'
all_patient_diagnosis['icd9_diagnosis_standardize'].loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].between(800, 999)) & (all_patient_diagnosis['icd9_diagnosis_standardize'] == '')] = 'injury and poisoning'


diagnosis_standardization_dict = dict(zip(all_patient_diagnosis.icd9_code, all_patient_diagnosis.icd9_diagnosis_standardize))

train.drop(columns=['icd9_diagnosis_prefix'], inplace=True)
test.drop(columns=['icd9_diagnosis_prefix'], inplace=True)

In [44]:
all_patient_diagnosis['icd9_diagnosis_standardize'].unique()

array(['diseases of the digestive system',
       'diseases of the circulatory system',
       'diseases of the genitourinary system',
       'infectious and parasitic diseases',
       'endocrine, nutritional and metabolic diseases, and immunity disorders',
       'diseases of the respiratory system',
       'external causes of injury and supplemental classification',
       'symptoms, signs, and ill-defined conditions',
       'injury and poisoning',
       'diseases of the blood and blood-forming organs',
       'certain conditions originating in the perinatal period',
       'neoplasms', 'diseases of the nervous system and sense organs',
       'mental disorders', 'congenital anomalies',
       'diseases of the skin and subcutaneous tissue',
       'diseases of the musculoskeletal system and connective tissue',
       'complications of pregnancy, childbirth, and the puerperium'],
      dtype=object)

In [45]:
all_patient_diagnosis['icd9_diagnosis_prefix'] = all_patient_diagnosis['icd9_code'].str[:3]
all_patient_diagnosis.loc[(all_patient_diagnosis['icd9_diagnosis_prefix'].str.contains('V')) | (all_patient_diagnosis['icd9_diagnosis_prefix'].str.contains('E'))]

Unnamed: 0,subject_id,hadm_id,seq_num,icd9_code,icd9_diagnosis_prefix,icd9_diagnosis_standardize
10,256,108811,11.0,V4581,V45,external causes of injury and supplemental cla...
19,256,153771,9.0,V4582,V45,external causes of injury and supplemental cla...
36,256,188869,11.0,V1507,V15,external causes of injury and supplemental cla...
38,256,188869,13.0,V1251,V12,external causes of injury and supplemental cla...
40,512,102509,1.0,V3101,V31,external causes of injury and supplemental cla...
...,...,...,...,...,...,...
651001,62975,166032,23.0,V4986,V49,external causes of injury and supplemental cla...
651011,63999,120514,6.0,V103,V10,external causes of injury and supplemental cla...
651016,63999,120514,11.0,V1083,V10,external causes of injury and supplemental cla...
651017,63999,120514,12.0,V1582,V15,external causes of injury and supplemental cla...


## Engineering Columns From Extra Data Set

get severity of each diagnosis and standardized diagnosis

In [46]:

# get a death rate for each diagnosis
diagnosis_from_train = train[['icd9_diagnosis','hospital_expire_flag']]
diagnosis_from_train = diagnosis_from_train.groupby('icd9_diagnosis')['hospital_expire_flag'].sum().sort_values(ascending=False).reset_index()
diagnosis_from_train = diagnosis_from_train.merge(train[['icd9_diagnosis','hospital_expire_flag']].groupby('icd9_diagnosis')['hospital_expire_flag'].count().reset_index(), on='icd9_diagnosis')
diagnosis_from_train.columns = ['icd9_diagnosis','deaths','total_cases']
diagnosis_from_train['icd9_death_rate'] = diagnosis_from_train['deaths'] / diagnosis_from_train['total_cases']
icd9_death_rate_dict = dict(zip(diagnosis_from_train.icd9_diagnosis, diagnosis_from_train.icd9_death_rate))


# get a death rate for each diagnosis standardized 
diagnosis_from_train_std = train[['icd9_diagnosis_standardize','hospital_expire_flag']]
diagnosis_from_train_std = diagnosis_from_train_std.groupby('icd9_diagnosis_standardize')['hospital_expire_flag'].sum().sort_values(ascending=False).reset_index()
diagnosis_from_train_std = diagnosis_from_train_std.merge(train[['icd9_diagnosis_standardize','hospital_expire_flag']].groupby('icd9_diagnosis_standardize')['hospital_expire_flag'].count().reset_index(), on='icd9_diagnosis_standardize')
diagnosis_from_train_std.columns = ['icd9_diagnosis_standardize','deaths','total_cases']
diagnosis_from_train_std['icd9_standardize_death_rate'] = diagnosis_from_train_std['deaths'] / diagnosis_from_train_std['total_cases']
icd9_standardize_death_rate_dict = dict(zip(diagnosis_from_train_std.icd9_diagnosis_standardize, diagnosis_from_train_std.icd9_standardize_death_rate))


get unique diagnosis count for each hospital visit

In [47]:

# get the number of hospital visits per patient and the number of unique diagnosis
hospital_stay_level_data = pd.DataFrame(all_patient_diagnosis.groupby('hadm_id')['icd9_code'].unique().reset_index())
hospital_stay_level_data.columns = ['hadm_id', 'icd9_codes']
hospital_stay_level_data['unique_diagnosis_count'] = hospital_stay_level_data['icd9_codes'].apply(lambda x: len(x))
hospital_stay_level_data

Unnamed: 0,hadm_id,icd9_codes,unique_diagnosis_count
0,100001,"[25013, 3371, 5849, 5780, V5867, 25063, 5363, ...",16
1,100003,"[53100, 2851, 07054, 5715, 45621, 53789, 4019,...",9
2,100006,"[49320, 51881, 486, 20300, 2761, 7850, 3090, V...",9
3,100007,"[56081, 5570, 9973, 486, 4019]",5
4,100009,"[41401, 99604, 4142, 25000, 27800, V8535, 4148...",18
...,...,...,...
58924,199993,"[41031, 42821, 42731, 4271, 5180, 4240, 2760, ...",9
58925,199994,"[486, 4280, 51881, 3970, 496, 4169, 585, 42732...",9
58926,199995,"[4210, 7464, 42971, 30401, 4412, 44284, V1259,...",10
58927,199998,"[41401, 9971, 9975, 42731, 78820, 4111, V4582,...",16


get most recurrent diagnosis and count

In [48]:
# get mode of diagnosis per subject
temp_group_by = all_patient_diagnosis.groupby(['hadm_id', 'seq_num'])['icd9_code'].value_counts().reset_index()

temp_group_by = temp_group_by.sort_values(by=['count', 'seq_num'], ascending=False)
temp_group_by = temp_group_by.drop_duplicates(subset='hadm_id', keep='first')

hospital_stay_level_data['most_recurrent_diagnosis'] = hospital_stay_level_data.merge(temp_group_by, on='hadm_id', how='left')['icd9_code']
hospital_stay_level_data['most_recurrent_diagnosis_count'] = hospital_stay_level_data.merge(temp_group_by, on='hadm_id', how='left')['count']

hospital_stay_level_data

Unnamed: 0,hadm_id,icd9_codes,unique_diagnosis_count,most_recurrent_diagnosis,most_recurrent_diagnosis_count
0,100001,"[25013, 3371, 5849, 5780, V5867, 25063, 5363, ...",16,V1351,1
1,100003,"[53100, 2851, 07054, 5715, 45621, 53789, 4019,...",9,7823,1
2,100006,"[49320, 51881, 486, 20300, 2761, 7850, 3090, V...",9,V1582,1
3,100007,"[56081, 5570, 9973, 486, 4019]",5,4019,1
4,100009,"[41401, 99604, 4142, 25000, 27800, V8535, 4148...",18,V4502,1
...,...,...,...,...,...
58924,199993,"[41031, 42821, 42731, 4271, 5180, 4240, 2760, ...",9,5184,1
58925,199994,"[486, 4280, 51881, 3970, 496, 4169, 585, 42732...",9,2762,1
58926,199995,"[4210, 7464, 42971, 30401, 4412, 44284, V1259,...",10,3051,1
58927,199998,"[41401, 9971, 9975, 42731, 78820, 4111, V4582,...",16,V4589,1


get most recent diagnosis

In [49]:

# get the most recent diagnosis
temp_seq_num = all_patient_diagnosis.groupby(['hadm_id','seq_num'])['icd9_code'].unique().reset_index()
temp_seq_num['seq_num'] = temp_seq_num['seq_num'].astype(int)

temp_seq_num.icd9_code = temp_seq_num.icd9_code.apply(lambda x: x[0])
temp_seq_num = temp_seq_num.loc[temp_seq_num['seq_num'] == 1]

hospital_stay_level_data['most_recent_diagnosis'] = hospital_stay_level_data.merge(temp_seq_num, on='hadm_id', how='left')['icd9_code']
hospital_stay_level_data

Unnamed: 0,hadm_id,icd9_codes,unique_diagnosis_count,most_recurrent_diagnosis,most_recurrent_diagnosis_count,most_recent_diagnosis
0,100001,"[25013, 3371, 5849, 5780, V5867, 25063, 5363, ...",16,V1351,1,25013
1,100003,"[53100, 2851, 07054, 5715, 45621, 53789, 4019,...",9,7823,1,53100
2,100006,"[49320, 51881, 486, 20300, 2761, 7850, 3090, V...",9,V1582,1,49320
3,100007,"[56081, 5570, 9973, 486, 4019]",5,4019,1,56081
4,100009,"[41401, 99604, 4142, 25000, 27800, V8535, 4148...",18,V4502,1,41401
...,...,...,...,...,...,...
58924,199993,"[41031, 42821, 42731, 4271, 5180, 4240, 2760, ...",9,5184,1,41031
58925,199994,"[486, 4280, 51881, 3970, 496, 4169, 585, 42732...",9,2762,1,486
58926,199995,"[4210, 7464, 42971, 30401, 4412, 44284, V1259,...",10,3051,1,4210
58927,199998,"[41401, 9971, 9975, 42731, 78820, 4111, V4582,...",16,V4589,1,41401


gget severity for each of the diagnosis

In [50]:

# merge the death rate for each diagnosis
hospital_stay_level_data['most_recent_diagnosis_severity'] = hospital_stay_level_data.merge(diagnosis_from_train, left_on='most_recent_diagnosis', right_on='icd9_diagnosis', how='left')['icd9_death_rate']
hospital_stay_level_data['most_recurrent_diagnosis_severity'] = hospital_stay_level_data.merge(diagnosis_from_train, left_on='most_recurrent_diagnosis', right_on='icd9_diagnosis', how='left')['icd9_death_rate']

# add mappings for standardized diagnosis
hospital_stay_level_data['most_recent_diagnosis_standardize'] = hospital_stay_level_data.most_recent_diagnosis.map(diagnosis_standardization_dict)
hospital_stay_level_data['most_recurrent_diagnosis_standardize'] = hospital_stay_level_data.most_recurrent_diagnosis.map(diagnosis_standardization_dict)

# merge the death rate for each diagnosis standardized
hospital_stay_level_data['most_recent_diagnosis_std_severity'] = hospital_stay_level_data.merge(diagnosis_from_train_std, left_on='most_recent_diagnosis_standardize', right_on='icd9_diagnosis_standardize', how='left')['icd9_standardize_death_rate']
hospital_stay_level_data['most_recurrent_diagnosis_std_severity'] = hospital_stay_level_data.merge(diagnosis_from_train_std, left_on='most_recurrent_diagnosis_standardize', right_on='icd9_diagnosis_standardize', how='left')['icd9_standardize_death_rate']

hospital_stay_level_data

Unnamed: 0,hadm_id,icd9_codes,unique_diagnosis_count,most_recurrent_diagnosis,most_recurrent_diagnosis_count,most_recent_diagnosis,most_recent_diagnosis_severity,most_recurrent_diagnosis_severity,most_recent_diagnosis_standardize,most_recurrent_diagnosis_standardize,most_recent_diagnosis_std_severity,most_recurrent_diagnosis_std_severity
0,100001,"[25013, 3371, 5849, 5780, V5867, 25063, 5363, ...",16,V1351,1,25013,0.005952,,"endocrine, nutritional and metabolic diseases,...",external causes of injury and supplemental cla...,0.036066,0.076923
1,100003,"[53100, 2851, 07054, 5715, 45621, 53789, 4019,...",9,7823,1,53100,0.071429,0.000000,diseases of the digestive system,"symptoms, signs, and ill-defined conditions",0.104567,0.066116
2,100006,"[49320, 51881, 486, 20300, 2761, 7850, 3090, V...",9,V1582,1,49320,0.000000,,diseases of the respiratory system,external causes of injury and supplemental cla...,0.136037,0.076923
3,100007,"[56081, 5570, 9973, 486, 4019]",5,4019,1,56081,0.080000,0.000000,diseases of the digestive system,diseases of the circulatory system,0.104567,0.093176
4,100009,"[41401, 99604, 4142, 25000, 27800, V8535, 4148...",18,V4502,1,41401,0.009107,,diseases of the circulatory system,external causes of injury and supplemental cla...,0.093176,0.076923
...,...,...,...,...,...,...,...,...,...,...,...,...
58924,199993,"[41031, 42821, 42731, 4271, 5180, 4240, 2760, ...",9,5184,1,41031,0.142857,0.000000,diseases of the circulatory system,diseases of the respiratory system,0.093176,0.136037
58925,199994,"[486, 4280, 51881, 3970, 496, 4169, 585, 42732...",9,2762,1,486,0.141066,0.090909,diseases of the respiratory system,"endocrine, nutritional and metabolic diseases,...",0.136037,0.036066
58926,199995,"[4210, 7464, 42971, 30401, 4412, 44284, V1259,...",10,3051,1,4210,0.031250,,diseases of the circulatory system,mental disorders,0.093176,0.003891
58927,199998,"[41401, 9971, 9975, 42731, 78820, 4111, V4582,...",16,V4589,1,41401,0.009107,,diseases of the circulatory system,external causes of injury and supplemental cla...,0.093176,0.076923


In [51]:
# join with train and test
train = train.merge(hospital_stay_level_data, left_on='hadm_id', right_on='hadm_id', how='left')
test = test.merge(hospital_stay_level_data, left_on='hadm_id', right_on='hadm_id', how='left')


## Add all diagnosis as extra columns

text dictionaries

In [52]:
short_diagnose_dict = dict(zip(diagnosis.icd9_diagnosis, diagnosis.short_diagnose))
long_diagnose_dict = dict(zip(diagnosis.icd9_diagnosis, diagnosis.long_diagnose))

In [53]:
all_diagnosis_df = all_patient_diagnosis[['hadm_id', 'icd9_code']].groupby('hadm_id')['icd9_code'].unique().apply(list).reset_index()

# expand list into columns
all_cols = all_diagnosis_df.icd9_code.astype(str).str.split(',', expand=True)
for i in range(all_cols.shape[1]):
    all_cols[i] = all_cols[i].str.strip('[').str.strip(']').str.strip('\'').str.strip('\'').str.strip(' ').str.strip('\'').str.strip()
    all_diagnosis_df['diagnosis_'+str(i)] = all_cols[i]
    # each column in the df is now a diagnosis text string
    # look up short and long diagnosis text string against patient metadata
    all_diagnosis_df['short_diagnosis_'+str(i)] = all_diagnosis_df['diagnosis_'+str(i)].astype(str).map(short_diagnose_dict)
    all_diagnosis_df['long_diagnosis_'+str(i)] = all_diagnosis_df['diagnosis_'+str(i)].astype(str).map(long_diagnose_dict)
    all_diagnosis_df['diagnosis_severity_'+str(i)] = all_diagnosis_df['diagnosis_'+str(i)].astype(str).map(icd9_death_rate_dict)
all_diagnosis_df = all_diagnosis_df.drop(columns=['icd9_code'])

all_diagnosis_cols = [col for col in all_diagnosis_df.columns if 'diagnosis_severity_' in col]
# calculate the mean death rate for each patient
all_diagnosis_df['mean_diagnosis_severity'] = all_diagnosis_df[all_diagnosis_cols].mean(axis=1)
all_diagnosis_df = all_diagnosis_df.drop(columns=all_diagnosis_cols)
all_diagnosis_df
# target encode columns as is

Unnamed: 0,hadm_id,diagnosis_0,short_diagnosis_0,long_diagnosis_0,diagnosis_1,short_diagnosis_1,long_diagnosis_1,diagnosis_2,short_diagnosis_2,long_diagnosis_2,...,diagnosis_36,short_diagnosis_36,long_diagnosis_36,diagnosis_37,short_diagnosis_37,long_diagnosis_37,diagnosis_38,short_diagnosis_38,long_diagnosis_38,mean_diagnosis_severity
0,100001,25013,DMI ketoacd uncontrold,"Diabetes with ketoacidosis, type I [juvenile t...",3371,Aut neuropthy in oth dis,Peripheral autonomic neuropathy in disorders c...,5849,Acute kidney failure NOS,"Acute kidney failure, unspecified",...,,,,,,,,,,0.029028
1,100003,53100,Ac stomach ulcer w hem,"Acute gastric ulcer with hemorrhage, without m...",2851,Ac posthemorrhag anemia,Acute posthemorrhagic anemia,07054,Chrnc hpt C wo hpat coma,Chronic hepatitis C without mention of hepatic...,...,,,,,,,,,,0.071992
2,100006,49320,Chronic obst asthma NOS,"Chronic obstructive asthma, unspecified",51881,Acute respiratry failure,Acute respiratory failure,486,"Pneumonia, organism NOS","Pneumonia, organism unspecified",...,,,,,,,,,,0.237269
3,100007,56081,Intestinal adhes w obstr,Intestinal or peritoneal adhesions with obstru...,5570,Ac vasc insuff intestine,Acute vascular insufficiency of intestine,9973,,,...,,,,,,,,,,0.199507
4,100009,41401,Crnry athrscl natve vssl,Coronary atherosclerosis of native coronary ar...,99604,Mch cmp autm mplnt dfbrl,Mechanical complication of automatic implantab...,4142,Chr tot occlus cor artry,Chronic total occlusion of coronary artery,...,,,,,,,,,,0.001821
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58924,199993,41031,"AMI inferopost, initial",Acute myocardial infarction of inferoposterior...,42821,Ac systolic hrt failure,Acute systolic heart failure,42731,Atrial fibrillation,Atrial fibrillation,...,,,,,,,,,,0.071501
58925,199994,486,"Pneumonia, organism NOS","Pneumonia, organism unspecified",4280,CHF NOS,"Congestive heart failure, unspecified",51881,Acute respiratry failure,Acute respiratory failure,...,,,,,,,,,,0.090125
58926,199995,4210,Ac/subac bact endocard,Acute and subacute bacterial endocarditis,7464,Cong aorta valv insuffic,Congenital insufficiency of aortic valve,42971,Acq cardiac septl defect,Acquired cardiac septal defect,...,,,,,,,,,,0.093918
58927,199998,41401,Crnry athrscl natve vssl,Coronary atherosclerosis of native coronary ar...,9971,Surg compl-heart,"Cardiac complications, not elsewhere classified",9975,Surg compl-urinary tract,"Urinary complications, not elsewhere classified",...,,,,,,,,,,0.023194


create one field with all text then drop the text fields

In [54]:
all_diagnosis_cols = [col for col in all_diagnosis_df.columns if 'short_diagnosis_' in col or 'long_diagnosis_' in col]
all_diagnosis_df['all_diagnosis_texts'] = all_diagnosis_df[all_diagnosis_cols].astype(str).agg(' '.join, axis=1)
all_diagnosis_df = all_diagnosis_df.drop(columns=all_diagnosis_cols)
all_diagnosis_df

Unnamed: 0,hadm_id,diagnosis_0,diagnosis_1,diagnosis_2,diagnosis_3,diagnosis_4,diagnosis_5,diagnosis_6,diagnosis_7,diagnosis_8,...,diagnosis_31,diagnosis_32,diagnosis_33,diagnosis_34,diagnosis_35,diagnosis_36,diagnosis_37,diagnosis_38,mean_diagnosis_severity,all_diagnosis_texts
0,100001,25013,3371,5849,5780,V5867,25063,5363,4580,25043,...,,,,,,,,,0.029028,DMI ketoacd uncontrold Diabetes with ketoacido...
1,100003,53100,2851,07054,5715,45621,53789,4019,53550,7823,...,,,,,,,,,0.071992,Ac stomach ulcer w hem Acute gastric ulcer wit...
2,100006,49320,51881,486,20300,2761,7850,3090,V1251,V1582,...,,,,,,,,,0.237269,Chronic obst asthma NOS Chronic obstructive as...
3,100007,56081,5570,9973,486,4019,,,,,...,,,,,,,,,0.199507,Intestinal adhes w obstr Intestinal or periton...
4,100009,41401,99604,4142,25000,27800,V8535,4148,4111,V4582,...,,,,,,,,,0.001821,Crnry athrscl natve vssl Coronary atherosclero...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58924,199993,41031,42821,42731,4271,5180,4240,2760,5119,5184,...,,,,,,,,,0.071501,"AMI inferopost, initial Acute myocardial infar..."
58925,199994,486,4280,51881,3970,496,4169,585,42732,2762,...,,,,,,,,,0.090125,"Pneumonia, organism NOS Pneumonia, organism un..."
58926,199995,4210,7464,42971,30401,4412,44284,V1259,04111,30503,...,,,,,,,,,0.093918,Ac/subac bact endocard Acute and subacute bact...
58927,199998,41401,9971,9975,42731,78820,4111,V4582,E8782,4293,...,,,,,,,,,0.023194,Crnry athrscl natve vssl Coronary atherosclero...


In [55]:
# merge onto train and test
train = train.merge(all_diagnosis_df, left_on='hadm_id', right_on='hadm_id', how='left')
test = test.merge(all_diagnosis_df, left_on='hadm_id', right_on='hadm_id', how='left')

do same stats with the standardized icd9 diagnosis

In [56]:
all_diagnosis_std_df = pd.DataFrame(all_patient_diagnosis.groupby('hadm_id')['icd9_diagnosis_standardize'].unique().reset_index())
all_diagnosis_std_df.columns = ['hadm_id', 'icd9_diagnosis_standardize']
all_diagnosis_std_df['distinct_std_diagnosis_count'] = all_diagnosis_std_df['icd9_diagnosis_standardize'].apply(lambda x: len(x))


In [57]:

# expand list into columns
for i in range(0,all_diagnosis_std_df.distinct_std_diagnosis_count.max()):
    # get the first element from the list if it exists
    all_diagnosis_std_df['std_diagnosis_'+str(i)] = all_diagnosis_std_df['icd9_diagnosis_standardize'].apply(lambda x: x[i] if len(x) > 1+i else np.nan)
    # each column in the df is now a diagnosis text string
    # look up short and long diagnosis text string against patient metadata
    all_diagnosis_std_df['std_diagnosis_severity_'+str(i)] = all_diagnosis_std_df['std_diagnosis_'+str(i)].astype(str).map(icd9_standardize_death_rate_dict)
all_diagnosis_std_df = all_diagnosis_std_df.drop(columns=['icd9_diagnosis_standardize'])

all_diagnosis_cols = [col for col in all_diagnosis_std_df.columns if 'std_diagnosis_severity_' in col]
#all_diagnosis_cols
# calculate the mean death rate for each patient
all_diagnosis_std_df['mean_std_diagnosis_severity'] = all_diagnosis_std_df[all_diagnosis_cols].mean(axis=1)
all_diagnosis_std_df = all_diagnosis_std_df.drop(columns=all_diagnosis_cols)
all_diagnosis_std_df

Unnamed: 0,hadm_id,distinct_std_diagnosis_count,std_diagnosis_0,std_diagnosis_1,std_diagnosis_2,std_diagnosis_3,std_diagnosis_4,std_diagnosis_5,std_diagnosis_6,std_diagnosis_7,std_diagnosis_8,std_diagnosis_9,std_diagnosis_10,std_diagnosis_11,std_diagnosis_12,std_diagnosis_13,std_diagnosis_14,std_diagnosis_15,mean_std_diagnosis_severity
0,100001,7,"endocrine, nutritional and metabolic diseases,...",diseases of the nervous system and sense organs,diseases of the genitourinary system,diseases of the digestive system,external causes of injury and supplemental cla...,diseases of the circulatory system,,,,,,,,,,,0.079434
1,100003,5,diseases of the digestive system,diseases of the blood and blood-forming organs,infectious and parasitic diseases,diseases of the circulatory system,,,,,,,,,,,,,0.135671
2,100006,6,diseases of the respiratory system,neoplasms,"endocrine, nutritional and metabolic diseases,...","symptoms, signs, and ill-defined conditions",mental disorders,,,,,,,,,,,,0.078626
3,100007,4,diseases of the digestive system,injury and poisoning,diseases of the respiratory system,,,,,,,,,,,,,,0.111924
4,100009,5,diseases of the circulatory system,injury and poisoning,"endocrine, nutritional and metabolic diseases,...",external causes of injury and supplemental cla...,,,,,,,,,,,,,0.075333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58924,199993,3,diseases of the circulatory system,diseases of the respiratory system,,,,,,,,,,,,,,,0.114606
58925,199994,4,diseases of the respiratory system,diseases of the circulatory system,diseases of the genitourinary system,,,,,,,,,,,,,,0.106233
58926,199995,5,diseases of the circulatory system,congenital anomalies,mental disorders,external causes of injury and supplemental cla...,,,,,,,,,,,,,0.043497
58927,199998,6,diseases of the circulatory system,injury and poisoning,"symptoms, signs, and ill-defined conditions",external causes of injury and supplemental cla...,"endocrine, nutritional and metabolic diseases,...",,,,,,,,,,,,0.073490


In [58]:
# merge onto train and test
train = train.merge(all_diagnosis_std_df, left_on='hadm_id', right_on='hadm_id', how='left')
test = test.merge(all_diagnosis_std_df, left_on='hadm_id', right_on='hadm_id', how='left')

#test['certain_conditions_originating_in_the_perinatal_period'] = 0
cols = list(set(test.columns) - set(train.columns))
test[cols] = 0 

## Marital Status Engineering

In [59]:
train.marital_status.unique()

['SINGLE', 'MARRIED', 'SEPARATED', 'WIDOWED', 'DIVORCED', NaN, 'UNKNOWN (DEFAULT)', 'LIFE PARTNER']
Categories (7, object): ['DIVORCED', 'LIFE PARTNER', 'MARRIED', 'SEPARATED', 'SINGLE', 'UNKNOWN (DEFAULT)', 'WIDOWED']

In [60]:
train.marital_status =train.marital_status.astype(str)
test.marital_status =test.marital_status.astype(str)

train.marital_status.loc[train.marital_status == 'UNKNOWN (DEFAULT)'] = 'UNKNOWN'
train.marital_status.loc[train.marital_status == 'nan'] = 'UNKNOWN'
train.marital_status.loc[train.marital_status == 'LIFE PARTNER'] = 'MARRIED'
train.marital_status.loc[train.marital_status == 'WIDOWED'] = 'SEPARATED'

test.marital_status.loc[test.marital_status == 'UNKNOWN (DEFAULT)'] = 'UNKNOWN'
test.marital_status.loc[test.marital_status == 'nan'] = 'UNKNOWN'
test.marital_status.loc[test.marital_status == 'LIFE PARTNER'] = 'MARRIED'
test.marital_status.loc[test.marital_status == 'WIDOWED'] = 'SEPARATED'

## Health Metrics Engineering

In [61]:

def feature_engineering(df):
    df['shock_index'] = df['heart_rate_mean'] / df['sys_bp_mean']
    df['mean_arterial_pressure'] = df['dias_bp_mean'] + (df['sys_bp_mean'] - df['dias_bp_mean']) / df.shape[0]
    df['temp_variability'] = df['temp_c_max'] - df['temp_c_min']
    df['oxygen_saturation_variability'] = df['sp_o2_max'] - df['sp_o2_min']
    df['glucose_variability'] = df['glucose_max'] - df['glucose_min']
    df['cardiovasular_risk'] = df[['heart_rate_mean', 'sys_bp_mean', 'dias_bp_mean', 'mean_bp_mean']].mean(axis=1)
    df['pulse_pressure'] = df['sys_bp_mean'] - df['dias_bp_mean']
    df['oxygenation_index'] = df['resp_rate_mean'] / df['sp_o2_mean'] 
    df['thermal_stress_index'] = df['temp_c_mean'] * df['heart_rate_mean']
    df['fever_indicator'] = df['temp_c_mean'] > 37.5
    df['fever_indicator'] = df['fever_indicator'].astype(int)

    return df

train = feature_engineering(train)
test = feature_engineering(test)

## Ethnicity Engineering

In [62]:

def engineer_ethnicity(df): 

    print(df.columns)
    df[['ethnicity_v1','ethnicity_detail_2']] = df['ethnicity'].str.split('-', expand=True)
    df[['ethnicity_base','ethnicity_detail_1']] = df['ethnicity_v1'].str.split('/', expand=True)

    df[['ethnicity','ethnicity_base','ethnicity_v1','ethnicity_detail_1','ethnicity_detail_2']] = df[['ethnicity','ethnicity_base','ethnicity_v1','ethnicity_detail_1','ethnicity_detail_2']].apply(lambda x: x.str.strip())

    df.loc[df['ethnicity_base'] == 'PATIENT DECLINED TO ANSWER', 'ethnicity_base'] = 'UNKNOWN'
    df.loc[df['ethnicity_base'] == 'UNABLE TO OBTAIN', 'ethnicity_base'] = 'UNKNOWN'
    df.loc[df['ethnicity_base'] == 'HISPANIC OR LATINO', 'ethnicity_base'] = 'HISPANIC'
    df.loc[df['ethnicity_base'] == 'PORTUGUESE', 'ethnicity_detail_1'] = 'PORTUGUESE'
    df.loc[df['ethnicity_base'] == 'PORTUGUESE', 'ethnicity_base'] = 'WHITE'
    df.loc[df['ethnicity_detail_1'] == 'ALASKA NATIVE FEDERALLY RECOGNIZED TRIBE', 'ethnicity_detail_1'] = 'ALASKA NATIVE'

    df.ethnicity_base = df.ethnicity_base.str.strip()

    df.loc[df.ethnicity_detail_1 == 'LATINO', 'ethnicity_detail_1'] = ''

    df.ethnicity_detail_1 = df.ethnicity_detail_1.fillna('')
    df.ethnicity_detail_2 = df.ethnicity_detail_2.fillna('')
    # combine the two detail columns
    df['ethnicity_detail'] = df.ethnicity_detail_1 + ' ' + df.ethnicity_detail_2
    df['ethnicity_detail'] = df['ethnicity_detail'].str.strip().str.replace('  ', ' ')
    df.ethnicity_detail = df.ethnicity_detail.replace('', 'OTHER')

    df.drop(columns=['ethnicity','ethnicity_detail_1','ethnicity_detail_2','ethnicity_v1'], inplace=True)

    return df

# engineer features
train = engineer_ethnicity(train)

test = engineer_ethnicity(test)




Index(['hospital_expire_flag', 'subject_id', 'hadm_id', 'icustay_id',
       'heart_rate_min', 'heart_rate_max', 'heart_rate_mean', 'sys_bp_min',
       'sys_bp_max', 'sys_bp_mean',
       ...
       'shock_index', 'mean_arterial_pressure', 'temp_variability',
       'oxygen_saturation_variability', 'glucose_variability',
       'cardiovasular_risk', 'pulse_pressure', 'oxygenation_index',
       'thermal_stress_index', 'fever_indicator'],
      dtype='object', length=125)
Index(['subject_id', 'hadm_id', 'icustay_id', 'heart_rate_min',
       'heart_rate_max', 'heart_rate_mean', 'sys_bp_min', 'sys_bp_max',
       'sys_bp_mean', 'dias_bp_min',
       ...
       'shock_index', 'mean_arterial_pressure', 'temp_variability',
       'oxygen_saturation_variability', 'glucose_variability',
       'cardiovasular_risk', 'pulse_pressure', 'oxygenation_index',
       'thermal_stress_index', 'fever_indicator'],
      dtype='object', length=123)


## Engineer Religion

In [63]:

def engineer_religion(df):
    df['religion_clean'] = df['religion'].replace({
        'HEBREW':'JEWISH', 'EPISCOPALIAN':'CHRISTIAN', 'GREEK ORTHODOX':'CHRISTIAN', 
        'CHRISTIAN SCIENTIST':'CHRISTIAN', 'ROMANIAN EAST. ORTH':'CHRISTIAN', '7TH DAY ADVENTIST':'CHRISTIAN', 
        'PROTESTANT QUAKER':'CHRISTIAN', 'UNITARIAN-UNIVERSALIST':'CHRISTIAN', 'UNOBTAINABLE':'OTHER', 
        'NOT SPECIFIED':'OTHER'})

    df['religion'] = df['religion_clean'].str.strip()

    df.drop(columns=['religion'], inplace=True)

    return df

train = engineer_religion(train)
test = engineer_religion(test)

## Bin Ages 


In [64]:
# create bins for age

train['age_bins'] = pd.cut(train['age'], bins=[0, 2,5,13,18,33,48,64,78,1200], labels=['0–2', '3–5', '6–13', '14–18', '19–33', '34–48', '49–64', '65–78', '79+'])
test['age_bins'] = pd.cut(test['age'], bins=[0, 2,5,13,18,33,48,64,78, 1200], labels=['0–2', '3–5', '6–13', '14–18', '19–33', '34–48', '49–64', '65–78', '79+'])


## Preprocess Text

In [65]:
# consolidate into one function:

def process_text(df, ):
    text_columns = ['short_diagnose','long_diagnose', 'diagnosis', 'all_diagnosis_texts']

    df[text_columns] = df[text_columns].astype(str)
    
    df['all_text'] = df[text_columns].apply(lambda x: ' '.join(x), axis=1)

    corpus = df['all_text']

    corpus = preprocess_corpus(corpus)
    corpus = spacy_lemmatize_series(corpus)
    preprocessed_corpus = stem_series(corpus, stem_type='porter')

    df['processed_text'] = preprocessed_corpus

    return df

train = process_text(train)
test = process_text(test)

## Cosine Similarity

find max similarity for different min and max df values

In [66]:
from data_science_toolkit.models.text_cosine_similarity import *

In [67]:
# checkpoint 
train.to_csv('data/train_processed.csv', index=False)
test.to_csv('data/test_processed.csv', index=False)

In [68]:
cv = TfidfVectorizer(ngram_range = (1,1), norm=None, lowercase=True, min_df=0, max_df=.1, stop_words='english')
cv.fit(train.processed_text)

vectorized_text=cv.transform(train.processed_text)
vectorized_text=vectorized_text.todense()

normalized_dtm = normalize_vectors(vectorized_text)

normalized_dtm.shape

get_cosine_similarity_between_classes(vectorized_text, cv.get_feature_names_out(), label_array=train['hospital_expire_flag'])

creating class tfidf
(2, 5767)
getting top terms
getting top terms matrix
calculating avg and avg cosine similarity
Doing label: 0
Average cosine similarity to average vector: 0.22038414321891603
     
Doing label: 1
Average cosine similarity to average vector: 0.2936453744690145
     
Cosine similarity between groups is: 0.8449319558836953


In [69]:

cv = TfidfVectorizer(ngram_range = (1,1), norm=None, lowercase=True, min_df=0, max_df=.1, stop_words='english')
cv.fit(train['processed_text'])

vectorized_text=cv.transform(train['processed_text'])
vectorized_text=vectorized_text.todense()

normalized_dtm = normalize_vectors(vectorized_text)

normalized_dtm.shape

get_cosine_similarity_between_classes(vectorized_text, cv.get_feature_names_out(), label_array=train['hospital_expire_flag'])

avg_vectors, avg_cosine_similarities = calculate_avg_and_avg_cosine_similarity(normalized_dtm, train['hospital_expire_flag'])

flag_similarity = {}

for avg_vector_index in range(0,len(avg_vectors)):
    # get similarity for each document to the average vector
    all_similarities = []
    for i in range(normalized_dtm.shape[0]):
        all_similarities.append(np.mean(cosine_similarity(normalized_dtm[i].A1, avg_vectors[avg_vector_index].A1)))
    flag_similarity[avg_vector_index] = all_similarities

flag_similarity_df = pd.DataFrame(flag_similarity)
print(flag_similarity_df.shape)

flag_similarity_df.columns = ['survival_text_similarity', 'death_text_similarity']

train = pd.concat([train, flag_similarity_df], axis=1)


# now test similarity useing the same avg vectors from train for test

vectorized_text=cv.transform(test['processed_text'])
vectorized_text=vectorized_text.todense()

normalized_dtm = normalize_vectors(vectorized_text)

flag_similarity = {}

for avg_vector_index in range(0,len(avg_vectors)):
    # get similarity for each document to the average vector
    all_similarities = []
    for i in range(normalized_dtm.shape[0]):
        all_similarities.append(np.mean(cosine_similarity(normalized_dtm[i].A1, avg_vectors[avg_vector_index].A1)))
    flag_similarity[avg_vector_index] = all_similarities

flag_similarity_df = pd.DataFrame(flag_similarity)
print(flag_similarity_df.shape)

flag_similarity_df.columns = ['survival_text_similarity', 'death_text_similarity']

test = pd.concat([test, flag_similarity_df], axis=1)

creating class tfidf
(2, 5767)
getting top terms
getting top terms matrix
calculating avg and avg cosine similarity
Doing label: 0
Average cosine similarity to average vector: 0.22038414321891603
     
Doing label: 1
Average cosine similarity to average vector: 0.2936453744690145
     
Cosine similarity between groups is: 0.8449319558836953
Doing label: 0
Average cosine similarity to average vector: 0.15797922412837742
     
Doing label: 1
Average cosine similarity to average vector: 0.21296416403921703
     
(20885, 2)
(5221, 2)


In [72]:
train.los

0         4.5761
1         0.7582
2         3.7626
3         3.8734
4         5.8654
          ...   
20880    11.6116
20881     1.1593
20882     1.8830
20883     3.1981
20884     1.0869
Name: los, Length: 20885, dtype: float32

In [73]:
train.to_csv('fully_cleaned_train_data.csv', index=False)
test.to_csv('fully_cleaned_test_data.csv', index=False)


## Log Tranforms

In [34]:
# # get numerical columns
# numerical_columns = train.select_dtypes(include=[np.number]).columns.tolist()
# numerical_columns.remove('hospital_expire_flag')
# numerical_columns.remove('subject_id')
# numerical_columns.remove('hadm_id')
# numerical_columns.remove('icustay_id')

# # get skew of each numerical column
# skew = train[numerical_columns].skew()

# skew = skew.sort_values(ascending=False)
# # onlyy keep columns with skew greater than 0.5
# skew = skew[abs(skew) > 0.5]

# # plot skews
# plt.figure(figsize=(10,5))
# plt.xticks(rotation=90)
# plt.title('Skew of numerical columns')
# sns.barplot(x=skew.index, y=skew.values)
# plt.show()



In [35]:
# # log transform columns with skew greater than 0.5

# train[skew.index] = np.log1p(train[skew.index])
# test[skew.index] = np.log1p(test[skew.index])

## Final Column Cleanup

In [36]:
# clean up columns so both dfs match

train = train.drop(columns=['dob',
 'admittime',
 'diff',
 'icd9_codes',
 'icd9_diagnosis_standardize',
 'diagnosis',
 'short_diagnose',
  'long_diagnose',
 'subject_id',
 'all_diagnosis_texts',
 'hadm_id',
 'all_text',
 'processed_text'])

test = test.drop(columns=['dob',
   'admittime',
   'diff',
 'icd9_codes',
 'icd9_diagnosis_standardize',
   'diagnosis',
 'short_diagnose',
  'long_diagnose',
   'all_diagnosis_texts',
   'subject_id',
 'all_text',
 'processed_text',
 'hadm_id',
   ])

# make sure test and train have the same columns
cols_not_in_test = set(train.columns.tolist()) - set(test.columns.tolist())
cols_not_in_train = set(test.columns.tolist()) - set(train.columns.tolist())

print(cols_not_in_test)
print(cols_not_in_train)

for col in cols_not_in_test:
    if col != 'hospital_expire_flag':
        test[col] = 0
    

# train = kl.data_cleaning(train)
# test = kl.data_cleaning(test)

train['hospital_expire_flag'] = train['hospital_expire_flag'].astype(int)

{'hospital_expire_flag'}
set()


In [37]:
missing_cols = list(set(train.columns) - set(test.columns))
test[missing_cols] = 0

In [38]:
train.to_csv('final_csv.csv', index=False)
test.to_csv('final_test.csv', index=False)

In [39]:
train = pd.read_csv('final_csv.csv')
test = pd.read_csv('final_test.csv')

## Get columns by Type

In [40]:
# target encode cols
diagnosis_cols = [col for col in train.columns if re.search(r'diagnosis_[0-9]+', col) is not None]
std_diagnosis_cols = [col for col in train.columns if re.search(r'std_diagnosis_[0-9]+', col) is not None]
target_encode_col = ['icd9_diagnosis',  'most_recent_diagnosis', 'most_recurrent_diagnosis', 'most_recent_diagnosis_standardize', 'most_recurrent_diagnosis_standardize']
cat_cols = ['ethnicity_base',	'ethnicity_detail', 'religion_clean', 'age_bins', 'gender',	'admission_type','insurance',	'marital_status', 'first_careunit']

target_encode_col = target_encode_col + diagnosis_cols + std_diagnosis_cols


## Pipeline

In [41]:
X_cols = [col for col in train.columns if col not in ['hospital_expire_flag', 'icustay_id']]
X = train[X_cols]
y = train['hospital_expire_flag']

#### Encoding

In [42]:
# label encode cat cols

encoder = BinaryEncoder()
encoded_cols = encoder.fit_transform(X[cat_cols])
X = pd.concat([X, encoded_cols], axis=1)
print(X.shape)
X = X.drop(columns=cat_cols)
print(X.shape)

encoded_cols = encoder.transform(test[cat_cols])
test = pd.concat([test, encoded_cols], axis=1)
print(test.shape)
test = test.drop(columns=cat_cols)
print(test.shape)

(20885, 144)
(20885, 135)
(5221, 146)
(5221, 137)


In [43]:
import category_encoders as ce

target_encode = ce.TargetEncoder(smoothing=0.5, cols=target_encode_col)
target_encode.fit(X, y)

X = target_encode.transform(X)
test[X.columns] = target_encode.transform(test[X.columns])

In [44]:
X.to_csv('final_checks.csv')
y.to_csv('final_checks_y.csv')
test.to_csv('final_checks_test.csv')

In [46]:
std_diagnosis_cols = [col for col in X.columns if re.search(r'std_diagnosis_[0-9]+', col) is not None]
X = X.drop(columns=std_diagnosis_cols)
test = test.drop(columns=std_diagnosis_cols)

#### Impute

In [48]:
# impute 
imputer = KNNImputer(n_neighbors=10)
X[X.columns] = imputer.fit_transform(X[X.columns])

test[X.columns] = imputer.transform(test[X.columns])

#### Scale

In [49]:
# scale
scaler = StandardScaler()
X[X.columns] = scaler.fit_transform(X[X.columns])
test[X.columns] = scaler.transform(test[X.columns])

#### Resample

In [50]:
# resample
sm = SMOTE(random_state=42)
X, y = sm.fit_resample(X, y)

pipelines break my computer
###### using distance weights
###### algorithm automatically selected because brute force was too intensive
###### cosine metric gave best results

In [56]:
# Correctly defining the pipeline with steps as a list of tuples


# Define parameters for grid search
# Note: The parameter names must match the names given to the steps in the pipeline
param_grid = {
    'n_neighbors': list(range(32, 34,1)),
    'weights': ['distance'],
    'leaf_size': [20,30,40],
    'algorithm': ['auto'],
    'metric': ['cosine']
}

knn = KNeighborsClassifier()

# Note: To use the grid search, you'll need to split your data into training and testing sets
# For demonstration, X_train, X_test, y_train, y_test variables are used

# Setup GridSearchCV
grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='roc_auc', verbose=2, n_jobs=1)

# Fit the grid search to the data
grid_search.fit(X, y)


Fitting 5 folds for each of 7 candidates, totalling 35 fits


[CV] END algorithm=auto, metric=cosine, n_neighbors=30, weights=distance; total time=  16.6s
[CV] END algorithm=auto, metric=cosine, n_neighbors=30, weights=distance; total time=  12.6s
[CV] END algorithm=auto, metric=cosine, n_neighbors=30, weights=distance; total time=  15.4s
[CV] END algorithm=auto, metric=cosine, n_neighbors=30, weights=distance; total time=  17.2s
[CV] END algorithm=auto, metric=cosine, n_neighbors=30, weights=distance; total time=  11.8s
[CV] END algorithm=auto, metric=cosine, n_neighbors=31, weights=distance; total time=  12.0s
[CV] END algorithm=auto, metric=cosine, n_neighbors=31, weights=distance; total time=  16.0s
[CV] END algorithm=auto, metric=cosine, n_neighbors=31, weights=distance; total time=  14.3s
[CV] END algorithm=auto, metric=cosine, n_neighbors=31, weights=distance; total time=  14.9s
[CV] END algorithm=auto, metric=cosine, n_neighbors=31, weights=distance; total time=  11.3s
[CV] END algorithm=auto, metric=cosine, n_neighbors=32, weights=distan

In [57]:
#0.9905733354163537
# get best parameters
print(grid_search.best_params_)
print(grid_search.best_score_)

{'algorithm': 'auto', 'metric': 'cosine', 'n_neighbors': 34, 'weights': 'distance'}
0.9905144231720332


In [58]:
knn_preds = grid_search.predict_proba(test[X.columns])

# get all probabilities of death
knn_preds = knn_preds[:, 1]

# create a submission file from icusty_id and predictions

submission_df = pd.DataFrame({'icustay_id': test['icustay_id'], 'hospital_expire_flag': knn_preds})
file_path = datetime.now().strftime('data/knn_grid_submission_%Y%m%d_%H%M.csv')
submission_df.to_csv(file_path, index=False)
