# Project: (K-) Nearest Neighbors


## Programming project: probability of death

In this project, you have to predict the probability of death of a patient that is entering an ICU (Intensive Care Unit).

The dataset comes from the MIMIC project (https://mimic.physionet.org/). MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.

Each row of *mimic_train.csv* corresponds to one ICU stay (*hadm_id*+*icustay_id*) of one patient (*subject_id*). The column HOSPITAL_EXPIRE_FLAG indicates death (=1) as a result of the current hospital stay; this is the outcome to predict in our modelling exercise.
The remaining columns correspond to the vitals of each patient (when entering the ICU), plus some general characteristics (age, gender, etc.); they are explained in *mimic_patient_metadata.csv*. 

Please don't use any feature that would not be known on the first day of a patient's ICU stay.

Note that the main cause/disease of the patient's condition is encoded in the *ICD9_diagnosis* column. The meaning of this code can be found in *MIMIC_metadata_diagnose.csv*. **But** this is only the main diagnosis; a patient can have co-occurring diseases (comorbidities). These secondary codes can be found in *extra_data/MIMIC_diagnoses.csv*.

As a performance metric, you can use *AUC* for the binary classification case, but feel free to report any other metric as well if you can justify that it is particularly suitable for this case.
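As a quick illustration of the metric, the sketch below computes AUC with scikit-learn's `roc_auc_score`. The labels and scores are invented toy values, not MIMIC data; AUC here is the fraction of (positive, negative) pairs in which the positive case receives the higher score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, invented purely for illustration.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# 3 of the 4 (positive, negative) pairs are ranked correctly.
auc_value = roc_auc_score(y_true, y_score)
print(auc_value)  # 0.75
```

Because AUC depends only on the ranking of the scores, it can be computed from probabilities or from any other monotone score.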

Main tasks are:
+ Using the *mimic_train.csv* file, build a predictive model for *HOSPITAL_EXPIRE_FLAG*.
+ For this analysis there is an extra test dataset, *mimic_test_death.csv*. Apply your final model to this extra dataset and generate predictions following the same format as *mimic_kaggle_death_sample_submission.csv*. Once ready, you can submit to our Kaggle competition and iterate to improve the accuracy.

As a *bonus*, try different algorithms for neighbor search and different distance metrics, and justify your final selection. Also try different weights to cope with class imbalance and to weight neighbors by proximity. Finally, try to estimate confidence intervals for the predictions.
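One possible way to start on the bonus is to compare neighbor weightings and distance metrics by cross-validated AUC. The sketch below does this on synthetic, imbalanced data from `make_classification` (an assumption standing in for the MIMIC features); `n_neighbors=15` and the grid of options are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data (about 10% positives), for illustration only.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0
)

results = {}
for weights in ("uniform", "distance"):  # equal votes vs. inverse-distance votes
    for p in (1, 2):                     # Minkowski p=1 (Manhattan), p=2 (Euclidean)
        knn = KNeighborsClassifier(
            n_neighbors=15, weights=weights, p=p, algorithm="auto"
        )
        scores = cross_val_score(knn, X_demo, y_demo, cv=5, scoring="roc_auc")
        results[(weights, p)] = scores.mean()

for key, value in results.items():
    print(key, round(value, 3))
```

The `algorithm` argument (`"brute"`, `"kd_tree"`, `"ball_tree"`) selects how neighbors are searched and mainly affects speed, since these are exact searches. `KNeighborsClassifier` has no `class_weight` parameter, so class imbalance would have to be handled another way, for example by resampling or a custom `weights` callable.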

You can follow these **steps** in your first implementation:
1. *Explore* and understand the dataset. 
2. Manage missing data.
3. Manage categorical features, e.g. create *dummy variables* for relevant categorical features, or build an ad hoc distance function.
4. Build a prediction model.
5. Assess expected accuracy and tune your models using *cross-validation*. 
6. Test the performance on the test file and report accuracy, following the same preparation steps (missing data, dummies, etc.). Remember that you should be able to yield a prediction for every row of the test dataset.
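The steps above can be sketched as a single scikit-learn pipeline, which helps guarantee that the test file goes through exactly the same preparation as the training file. This is a minimal sketch under stated assumptions: the small `toy` frame and `y_toy` labels are invented stand-ins for *mimic_train.csv*, and the grid of `n_neighbors` values is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented toy data standing in for the real training file.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "HeartRate_Mean": rng.normal(85, 15, n),
    "Glucose_Mean": rng.normal(130, 30, n),
    "INSURANCE": rng.choice(["Medicare", "Private", "Medicaid"], n),
})
toy.loc[rng.choice(n, 20, replace=False), "HeartRate_Mean"] = np.nan
y_toy = (np.arange(n) % 5 == 0).astype(int)  # 20% positives, deterministic

numeric_cols = ["HeartRate_Mean", "Glucose_Mean"]
categorical_cols = ["INSURANCE"]

# Steps 2-3: impute, scale numeric features, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     categorical_cols),
])

# Step 4: the model; step 5: cross-validated tuning on AUC.
model = Pipeline([("prep", preprocess), ("knn", KNeighborsClassifier())])
search = GridSearchCV(model, {"knn__n_neighbors": [5, 15, 25]},
                      scoring="roc_auc", cv=5)
search.fit(toy, y_toy)
print(search.best_params_, round(search.best_score_, 3))

# Step 6: probabilities for new rows use the same fitted preparation.
proba = search.predict_proba(toy)[:, 1]
```

Because imputation and scaling are fitted inside each cross-validation fold, the validation folds do not leak into the preprocessing statistics.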

Feel free to reduce the training dataset if you experience computational constraints.

## Main criteria for IN_CLASS grading
The weighting of these components will vary between the in-class and extended projects:
+ Code runs - 20%
+ Data preparation - 35%
+ Nearest neighbor method(s) have been used - 15%
+ Probability of death for each test patient is computed - 10%
+ Accuracy of predictions for test patients is calculated (kaggle) - 10%
+ Hyperparameter optimization - 10%
+ Neat and understandable code, with some titles and comments - 0%
+ Improved methods from what we discussed in class (properly explained/justified) - 0%

In [19]:
import pandas as pd
# Training dataset
data=pd.read_csv('/Users/bertacanal/Desktop/cml23-probability-of-death-with-k-nn/mimic_train.csv')
data.head()

Unnamed: 0,HOSPITAL_EXPIRE_FLAG,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,...,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
0,0,55440,195768,228357,89.0,145.0,121.043478,74.0,127.0,106.586957,...,-61961.7847,EMERGENCY,Medicare,PROTESTANT QUAKER,SINGLE,WHITE,GASTROINTESTINAL BLEED,5789,MICU,4.5761
1,0,76908,126136,221004,63.0,110.0,79.117647,89.0,121.0,106.733333,...,-43146.18378,EMERGENCY,Private,UNOBTAINABLE,MARRIED,WHITE,ESOPHAGEAL FOOD IMPACTION,53013,MICU,0.7582
2,0,95798,136645,296315,81.0,98.0,91.689655,88.0,138.0,112.785714,...,-42009.96157,EMERGENCY,Medicare,PROTESTANT QUAKER,SEPARATED,BLACK/AFRICAN AMERICAN,UPPER GI BLEED,56983,MICU,3.7626
3,0,40708,102505,245557,76.0,128.0,98.857143,84.0,135.0,106.972973,...,-43585.37922,ELECTIVE,Medicare,NOT SPECIFIED,WIDOWED,WHITE,HIATAL HERNIA/SDA,5533,SICU,3.8734
4,0,28424,127337,225281,,,,,,,...,-50271.76602,EMERGENCY,Medicare,JEWISH,WIDOWED,WHITE,ABDOMINAL PAIN,56211,TSICU,5.8654


In [20]:
# Test dataset (to produce predictions)
data_test=pd.read_csv('/Users/bertacanal/Desktop/cml23-probability-of-death-with-k-nn/mimic_test_death.csv')
data_test.sort_values('icustay_id').head()

Unnamed: 0,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,DiasBP_Min,...,ADMITTIME,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT
4930,93535,121562,200011,56.0,82.0,71.205128,123.0,185.0,156.411765,37.0,...,2188-08-05 20:27:00,-64881.43517,EMERGENCY,Medicare,JEWISH,SINGLE,WHITE,ASTHMA;COPD EXACERBATION,49322,MICU
1052,30375,177945,200044,,,,,,,,...,2135-07-07 16:13:00,-46540.62661,EMERGENCY,Medicare,CATHOLIC,WIDOWED,WHITE,HEAD BLEED,85220,SICU
3412,73241,149216,200049,54.0,76.0,64.833333,95.0,167.0,114.545455,33.0,...,2118-08-14 22:27:00,-38956.8589,EMERGENCY,Private,JEWISH,MARRIED,WHITE,HEPATIC ENCEPHALOPATHY,5722,MICU
1725,99052,129142,200063,85.0,102.0,92.560976,91.0,131.0,108.365854,42.0,...,2141-03-09 23:19:00,-47014.25437,EMERGENCY,Medicaid,NOT SPECIFIED,SINGLE,UNKNOWN/NOT SPECIFIED,TYPE A DISSECTION,44101,CSRU
981,51698,190004,200081,82.0,133.0,94.323529,86.0,143.0,111.09375,47.0,...,2142-02-23 06:56:00,-47377.26087,EMERGENCY,Medicare,OTHER,MARRIED,PORTUGUESE,PULMONARY EMBOLISM,41519,CCU


In [21]:
# Sample output prediction file
pred_sample=pd.read_csv('/Users/bertacanal/Desktop/cml23-probability-of-death-with-k-nn/mimic_kaggle_death_sample_submission.csv')
pred_sample.sort_values('icustay_id').head()

Unnamed: 0,icustay_id,HOSPITAL_EXPIRE_FLAG
1937,200011,0
4908,200044,0
829,200049,0
4378,200063,0
4946,200081,0


In [22]:
# Your code here

#import auxiliary functions
import os,sys,inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(os.path.dirname(currentdir))
sys.path.insert(1, parentdir)
#from utils.helper_functions import *

import pandas as pd
import seaborn as sns
import numpy as np
import sklearn
import ipywidgets
from math import floor, ceil
import random
import time
import scipy

from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier 
from sklearn.metrics import accuracy_score, recall_score,precision_score,f1_score, confusion_matrix
from sklearn import preprocessing
from sklearn.impute import SimpleImputer

#create X and y test and y train

y_train = data['HOSPITAL_EXPIRE_FLAG']
X_train = data.drop(['HOSPITAL_EXPIRE_FLAG'], axis = 1) 
X_test = data_test.copy()

In [23]:
#drop id variables and categorical variables
list_drop_train = ['subject_id','hadm_id','icustay_id','DOD','ADMITTIME', 'DISCHTIME', 'DEATHTIME'
                        ,'RELIGION', 'MARITAL_STATUS', 'ICD9_diagnosis', 'FIRST_CAREUNIT', 'LOS', 'Diff', 'DOB', 'DIAGNOSIS']

list_drop_test = ['subject_id','hadm_id','icustay_id','ADMITTIME'
                        ,'RELIGION', 'MARITAL_STATUS', 'ICD9_diagnosis', 'FIRST_CAREUNIT', 'Diff', 'DOB', 'DIAGNOSIS']
X_train = X_train.drop(list_drop_train, axis=1)
X_test = X_test.drop(list_drop_test, axis=1)

#Get column names of categorical and numeric variables
#cnames_numeric=list(X_train.select_dtypes(exclude=['object']).columns)
#X_train['DIAGNOSIS'].mask(X_train['DIAGNOSIS'].map(X_train['DIAGNOSIS'].value_counts(normalize=True)) < 0.1, 'Other')
#X_test['DIAGNOSIS'].mask(X_test['DIAGNOSIS'].map(X_test['DIAGNOSIS'].value_counts(normalize=True)) < 0.1, 'Other')

#lists of numeric and categorical columns
cnames_categorical=list(X_train.select_dtypes(include =['object']).columns)
cnames_categorical

cnames_numeric=list(X_train.select_dtypes(exclude =['object']).columns)
cnames_numeric

#insurance dummy
#X_train= pd.get_dummies(X_train, prefix = ['INSURANCE'], columns = ['INSURANCE'], drop_first = True)
#X_test = pd.get_dummies(X_test, prefix = ['INSURANCE'], columns = ['INSURANCE'], drop_first = True)
#Admission dummy
#X_train= pd.get_dummies(X_train, prefix = ['ADMISSION_TYPE'], columns = ['ADMISSION_TYPE'], drop_first = True)
#X_test = pd.get_dummies(X_test, prefix = ['ADMISSION_TYPE'], columns = ['ADMISSION_TYPE'], drop_first = True

['HeartRate_Min',
 'HeartRate_Max',
 'HeartRate_Mean',
 'SysBP_Min',
 'SysBP_Max',
 'SysBP_Mean',
 'DiasBP_Min',
 'DiasBP_Max',
 'DiasBP_Mean',
 'MeanBP_Min',
 'MeanBP_Max',
 'MeanBP_Mean',
 'RespRate_Min',
 'RespRate_Max',
 'RespRate_Mean',
 'TempC_Min',
 'TempC_Max',
 'TempC_Mean',
 'SpO2_Min',
 'SpO2_Max',
 'SpO2_Mean',
 'Glucose_Min',
 'Glucose_Max',
 'Glucose_Mean']

In [24]:
# Calculate the frequency of each category in the column
category_counts_train = X_train['ETHNICITY'].value_counts()
category_counts_test = X_test['ETHNICITY'].value_counts()

# Calculate the threshold for "other" based on 5% of the total count
threshold_train = 0.05 * len(X_train)
threshold_test = 0.05 * len(X_test)
# Identify categories that appear less than the threshold
infrequent_categories_train = category_counts_train[category_counts_train < threshold_train].index
infrequent_categories_test = category_counts_test[category_counts_test < threshold_test].index
# Replace infrequent categories with 'other'
X_train['ETHNICITY'] = X_train['ETHNICITY'].apply(lambda x: 'other' if x in infrequent_categories_train else x)
X_test['ETHNICITY'] = X_test['ETHNICITY'].apply(lambda x: 'other' if x in infrequent_categories_test else x)


In [25]:
print(X_train['ETHNICITY'].unique())
print(X_test['ETHNICITY'].unique())

['WHITE' 'BLACK/AFRICAN AMERICAN' 'other']
['WHITE' 'other' 'BLACK/AFRICAN AMERICAN']
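The cell above derives the infrequent categories separately for the training and test sets, which happens to give the same three levels here but is not guaranteed to in general. A possible alternative, sketched below on invented toy series, is to decide which categories are frequent from the training data only and apply that same list to both sets. The 5% threshold mirrors the notebook; the data values are assumptions.

```python
import pandas as pd

# Invented toy columns standing in for X_train['ETHNICITY'] and X_test['ETHNICITY'].
train_eth = pd.Series(["WHITE"] * 17 + ["BLACK/AFRICAN AMERICAN"] * 3 + ["ASIAN"])
test_eth = pd.Series(["WHITE", "ASIAN", "HISPANIC OR LATINO"])

# Categories covering at least 5% of the training rows are kept as-is.
share = train_eth.value_counts(normalize=True)
frequent = share[share >= 0.05].index

# Everything else, including categories never seen in training, becomes 'other'.
train_collapsed = train_eth.where(train_eth.isin(frequent), "other")
test_collapsed = test_eth.where(test_eth.isin(frequent), "other")

print(test_collapsed.tolist())  # ['WHITE', 'other', 'other']
```

Note that with this approach missing values would also be mapped to 'other', so any imputation of the column would need to happen first if that is not desired.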


In [26]:
# Create a subset of X_train and X_test with the specified columns
X_train_subset = X_train[cnames_categorical]
X_test_subset = X_test[cnames_categorical]

#Impute categorical variables
imp_frequent = SimpleImputer(missing_values = np.nan, strategy = 'most_frequent')

imp_frequent.fit(X_train_subset)

X_train_subset = pd.DataFrame(imp_frequent.transform(X_train_subset), columns = cnames_categorical)
X_test_subset = pd.DataFrame(imp_frequent.transform(X_test_subset), columns = cnames_categorical)

# Update the original DataFrames with the imputed values
X_train[cnames_categorical] = X_train_subset
X_test[cnames_categorical] = X_test_subset

#Check if imputation was done correctly
print(X_train.isnull().sum())
print(X_test.isnull().sum())

HeartRate_Min     2187
HeartRate_Max     2187
HeartRate_Mean    2187
SysBP_Min         2208
SysBP_Max         2208
SysBP_Mean        2208
DiasBP_Min        2209
DiasBP_Max        2209
DiasBP_Mean       2209
MeanBP_Min        2186
MeanBP_Max        2186
MeanBP_Mean       2186
RespRate_Min      2189
RespRate_Max      2189
RespRate_Mean     2189
TempC_Min         2497
TempC_Max         2497
TempC_Mean        2497
SpO2_Min          2203
SpO2_Max          2203
SpO2_Mean         2203
Glucose_Min        253
Glucose_Max        253
Glucose_Mean       253
GENDER               0
ADMISSION_TYPE       0
INSURANCE            0
ETHNICITY            0
dtype: int64
HeartRate_Min     545
HeartRate_Max     545
HeartRate_Mean    545
SysBP_Min         551
SysBP_Max         551
SysBP_Mean        551
DiasBP_Min        552
DiasBP_Max        552
DiasBP_Mean       552
MeanBP_Min        547
MeanBP_Max        547
MeanBP_Mean       547
RespRate_Min      546
RespRate_Max      546
RespRate_Mean     546
TempC_Min    

In [27]:
#Apply one-hot encoding

for col in cnames_categorical:
    X_train = pd.get_dummies(X_train, prefix=[col], columns=[col], drop_first=True)
    X_test = pd.get_dummies(X_test, prefix=[col], columns=[col], drop_first=True)
    
X_test.columns
        

Index(['HeartRate_Min', 'HeartRate_Max', 'HeartRate_Mean', 'SysBP_Min',
       'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min', 'DiasBP_Max', 'DiasBP_Mean',
       'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean', 'RespRate_Min',
       'RespRate_Max', 'RespRate_Mean', 'TempC_Min', 'TempC_Max', 'TempC_Mean',
       'SpO2_Min', 'SpO2_Max', 'SpO2_Mean', 'Glucose_Min', 'Glucose_Max',
       'Glucose_Mean', 'GENDER_M', 'ADMISSION_TYPE_EMERGENCY',
       'ADMISSION_TYPE_URGENT', 'INSURANCE_Medicaid', 'INSURANCE_Medicare',
       'INSURANCE_Private', 'INSURANCE_Self Pay', 'ETHNICITY_WHITE',
       'ETHNICITY_other'],
      dtype='object')

In [28]:
#Check for Nan
Nulls_train = X_train.isnull().sum()
Nulls_test = X_test.isnull().sum()

# We have to check both X_train and X_test

print(Nulls_train)
print(Nulls_test)

# Identifying Columns with Null variables
missing_data_col_X_train = X_train.columns[Nulls_train>0]
missing_data_col_X_test = X_test.columns[Nulls_test>0]
missing_data_col_X_train
missing_data_col_X_test

HeartRate_Min               2187
HeartRate_Max               2187
HeartRate_Mean              2187
SysBP_Min                   2208
SysBP_Max                   2208
SysBP_Mean                  2208
DiasBP_Min                  2209
DiasBP_Max                  2209
DiasBP_Mean                 2209
MeanBP_Min                  2186
MeanBP_Max                  2186
MeanBP_Mean                 2186
RespRate_Min                2189
RespRate_Max                2189
RespRate_Mean               2189
TempC_Min                   2497
TempC_Max                   2497
TempC_Mean                  2497
SpO2_Min                    2203
SpO2_Max                    2203
SpO2_Mean                   2203
Glucose_Min                  253
Glucose_Max                  253
Glucose_Mean                 253
GENDER_M                       0
ADMISSION_TYPE_EMERGENCY       0
ADMISSION_TYPE_URGENT          0
INSURANCE_Medicaid             0
INSURANCE_Medicare             0
INSURANCE_Private              0
INSURANCE_

Index(['HeartRate_Min', 'HeartRate_Max', 'HeartRate_Mean', 'SysBP_Min',
       'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min', 'DiasBP_Max', 'DiasBP_Mean',
       'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean', 'RespRate_Min',
       'RespRate_Max', 'RespRate_Mean', 'TempC_Min', 'TempC_Max', 'TempC_Mean',
       'SpO2_Min', 'SpO2_Max', 'SpO2_Mean', 'Glucose_Min', 'Glucose_Max',
       'Glucose_Mean'],
      dtype='object')

In [29]:
#Get the common column names
common_columns = list(set(X_train.columns).intersection(X_test.columns))

#Create new dataframes with only the common columns
X_train = X_train[common_columns]
X_test = X_test[common_columns]
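Taking the set intersection keeps the columns consistent, but it returns them in arbitrary order and silently drops any dummy column that exists in only one set. One alternative, sketched here on invented toy frames, is to reindex the test dummies to the training columns so that the training layout is preserved and missing dummies are filled with zeros.

```python
import pandas as pd

# Invented toy frames: 'Self Pay' appears in training but not in the test set.
train_raw = pd.DataFrame({"INSURANCE": ["Medicare", "Private", "Self Pay"]})
test_raw = pd.DataFrame({"INSURANCE": ["Private", "Medicare"]})

train_dummies = pd.get_dummies(train_raw, columns=["INSURANCE"], drop_first=True)
test_dummies = pd.get_dummies(test_raw, columns=["INSURANCE"], drop_first=True)

# Use the training columns (and their order) as the reference layout.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns))  # ['INSURANCE_Private', 'INSURANCE_Self Pay']
```

One caveat: with `drop_first=True`, the dropped baseline is the first category present in each frame, so if the test set lacks the training baseline the two encodings can disagree. Fitting a single encoder on the training data avoids that.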

In [30]:
#I'm going to try imputation by mean on both train-test sets instead of removing cols (numeric variables)

imp_mean = SimpleImputer(missing_values = np.nan, strategy = 'mean')

imp_mean.fit(X_train) #use train mean!

X_train = pd.DataFrame(imp_mean.transform(X_train), columns = X_train.columns)
X_test = pd.DataFrame(imp_mean.transform(X_test), columns = X_test.columns)

#Check if imputation was done correctly
print(X_train.isnull().sum())
print(X_test.isnull().sum())

MeanBP_Mean                 0
SpO2_Max                    0
Glucose_Min                 0
ADMISSION_TYPE_EMERGENCY    0
RespRate_Max                0
SysBP_Max                   0
TempC_Min                   0
MeanBP_Max                  0
SysBP_Min                   0
ADMISSION_TYPE_URGENT       0
RespRate_Mean               0
ETHNICITY_other             0
DiasBP_Mean                 0
INSURANCE_Private           0
ETHNICITY_WHITE             0
SysBP_Mean                  0
HeartRate_Max               0
RespRate_Min                0
INSURANCE_Medicare          0
Glucose_Max                 0
HeartRate_Min               0
MeanBP_Min                  0
HeartRate_Mean              0
TempC_Max                   0
DiasBP_Max                  0
SpO2_Mean                   0
DiasBP_Min                  0
INSURANCE_Self Pay          0
SpO2_Min                    0
Glucose_Mean                0
TempC_Mean                  0
INSURANCE_Medicaid          0
GENDER_M                    0
dtype: int

In [31]:
#Here I scale data

scaler = preprocessing.StandardScaler(with_mean = True, with_std = True)
scaler.fit(X_train) ## fit it to the train set
#Scale both sets, keeping the column names in the resulting DataFrames
feature_names = X_train.columns
X_train = pd.DataFrame(scaler.transform(X_train), columns = feature_names)
X_test = pd.DataFrame(scaler.transform(X_test), columns = feature_names)

In [33]:
#SVM model

from sklearn.svm import SVC

from sklearn.model_selection import GridSearchCV

#MySvc = SVC(C = 1.0, kernel= 'rbf', degree=3, gamma='scale', probability=True) # We set the default parameters except for probability that by default is set to False
#MySvc.fit(X_train, y_train)


##JUST FOR THE IN-CLASS TEST. IN ANY OTHER SITUATION Probability = True!
# max_iter is an SVC argument (it caps the solver iterations); GridSearchCV does not accept it
my_SVM_model = SVC(probability = False, kernel = 'rbf', max_iter = 1000)

grid_values = {'C':[0.1, 0.3, 0.5], 'gamma':[0.25,0.5,0.75]}

grid_svc_acc = GridSearchCV(my_SVM_model, param_grid = grid_values, scoring = 'accuracy', n_jobs = -1, cv = 5)

#Fit the model
#MySvc.fit(X_train, y_train)
grid_svc_acc.fit(X_train, y_train)

In [None]:
# Out-of-sample predictions (class labels here, because probability = False; use predict_proba for probabilities)

#y_pred_proba = MySvc.predict_proba(X_test)
y_pred_proba = grid_svc_acc.predict(X_test)

In [None]:
# AUC if probability = True
#Accuracy if probability = False

from sklearn.metrics import roc_auc_score, roc_curve, auc, confusion_matrix

## In-sample predicted probabilities

#insample_pred = MySvc.predict_proba(X_train)
insample_pred = grid_svc_acc.predict(X_train)

#print(roc_auc_score(y_train, insample_pred[:, 1]))
print(accuracy_score(y_train, insample_pred))
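Even with `probability = False`, an AUC can still be computed, because `SVC` exposes `decision_function` and AUC only needs a ranking of the observations. The sketch below shows this on synthetic data from `make_classification`, which is an assumption standing in for the prepared MIMIC features.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training features.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

svm_demo = SVC(kernel="rbf", probability=False).fit(X_demo, y_demo)

# Signed distance to the separating surface; larger means "more class 1".
margin = svm_demo.decision_function(X_demo)

# In-sample AUC from the margins (optimistic, since it reuses the training data).
auc_in_sample = roc_auc_score(y_demo, margin)
print(round(auc_in_sample, 3))
```

These margins are not probabilities, so they would not be suitable as a submission that expects a probability of death; for that, `probability = True` and `predict_proba` are still needed.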

### Kaggle Predictions Submissions

Once you have produced test-set predictions, you can submit them to <i>Kaggle</i> to see how your model performs. 

The following code provides an example of generating a <i>.csv</i> file to submit to Kaggle:
1) Create a pandas DataFrame with two columns, one with the test-set "icustay_id" values and the other with your predicted "HOSPITAL_EXPIRE_FLAG" for that observation.

2) Use the <i>.to_csv</i> pandas method to create a CSV file. Setting <i>index = False</i> is important to ensure the <i>.csv</i> is in the format Kaggle expects. 

In [None]:
# Produce .csv for kaggle testing 
test_predictions_submit = pd.DataFrame({"icustay_id": data_test["icustay_id"], "HOSPITAL_EXPIRE_FLAG": y_pred_proba})
test_predictions_submit.to_csv("test_predictions_submit.csv", index = False)