<img src = "../../Data/bgsedsc_0.jpg">

# Project: Support Vector Machines (SVM)

## Programming project: probability of death

In this project, you have to predict the probability of death of a patient that is entering an ICU (Intensive Care Unit).

The dataset comes from MIMIC project (https://mimic.physionet.org/). MIMIC-III (Medical Information Mart for Intensive Care III) is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012.

Each row of *mimic_train.csv* correponds to one ICU stay (*hadm_id*+*icustay_id*) of one patient (*subject_id*). Column HOSPITAL_EXPIRE_FLAG is the indicator of death (=1) as a result of the current hospital stay; this is the outcome to predict in our modelling exercise.
The remaining columns correspond to vitals of each patient (when entering the ICU), plus some general characteristics (age, gender, etc.), and their explanation can be found at *mimic_patient_metadata.csv*.

Note that the main cause/disease of patient contidition is embedded as a code at *ICD9_diagnosis* column. The meaning of this code can be found at *MIMIC_metadata_diagnose.csv*. **But** this is only the main one; a patient can have co-occurrent diseases (comorbidities). These secondary codes can be found at *extra_data/MIMIC_diagnoses.csv*.

Don't use features that you don't know the first day a patient enters the ICU, such as LOS.

As performance metric, you can use *AUC* for the binary classification case, but feel free to report as well any other metric if you can justify that is particularly suitable for this case.

Main tasks are:
+ Using *mimic_train.csv* file build a predictive model for *HOSPITAL_EXPIRE_FLAG* .
+ For this analysis there is an extra test dataset, *mimic_test.csv*. Apply your final model to this extra dataset and submit to Kaggle competition to obtain accuracy of prediction (follow the requested format).

Try to optimize hyperparameters of your SVM model.

You can follow those **steps** in your first implementation:
1. *Explore* and understand the dataset.
2. Manage missing data.
2. Manage categorial features. E.g. create *dummy variables* for relevant categorical features, or build an ad hoc distance function.
3. Build a prediction model. Try to improve it using methods to tackle class imbalance.
5. Assess expected accuracy  of previous models using *cross-validation*.
6. Test the performance on the test file by submitting to Kaggle, following same preparation steps (missing data, dummies, etc). Remember that you should be able to yield a prediction for all the rows of the test dataset.

For the in-class version, feel free to reduce the training dataset if you experience computational constraints.

## Main criteria for IN-CLASS grading
The weighting of these components will vary between the in-class and extended projects:
+ Code runs - 15%
+ Data preparation - 20%
+ SVMs method(s) have been used - 25%
+ Probability of death for each test patient is computed - 15%
+ Accuracy itself - 15%
+ Hyperparameter optimization - 10%
+ Class imbalance management - 0%
+ Neat and understandable code, with some titles and comments - 0%
+ Improved methods from what we discussed in class (properly explained/justified) - 0%


In [1]:
# Starter code to load data
import pandas as pd

# Training dataset
data=pd.read_csv('mimic_train.csv')
data.head()

Unnamed: 0,HOSPITAL_EXPIRE_FLAG,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,...,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
0,0,55440,195768,228357,89.0,145.0,121.043478,74.0,127.0,106.586957,...,-61961.7847,EMERGENCY,Medicare,PROTESTANT QUAKER,SINGLE,WHITE,GASTROINTESTINAL BLEED,5789,MICU,4.5761
1,0,76908,126136,221004,63.0,110.0,79.117647,89.0,121.0,106.733333,...,-43146.18378,EMERGENCY,Private,UNOBTAINABLE,MARRIED,WHITE,ESOPHAGEAL FOOD IMPACTION,53013,MICU,0.7582
2,0,95798,136645,296315,81.0,98.0,91.689655,88.0,138.0,112.785714,...,-42009.96157,EMERGENCY,Medicare,PROTESTANT QUAKER,SEPARATED,BLACK/AFRICAN AMERICAN,UPPER GI BLEED,56983,MICU,3.7626
3,0,40708,102505,245557,76.0,128.0,98.857143,84.0,135.0,106.972973,...,-43585.37922,ELECTIVE,Medicare,NOT SPECIFIED,WIDOWED,WHITE,HIATAL HERNIA/SDA,5533,SICU,3.8734
4,0,28424,127337,225281,,,,,,,...,-50271.76602,EMERGENCY,Medicare,JEWISH,WIDOWED,WHITE,ABDOMINAL PAIN,56211,TSICU,5.8654


In [2]:
# Test dataset (to produce predictions)
data_test=pd.read_csv('mimic_test_death.csv')
data_test.sort_values('icustay_id').head()

Unnamed: 0,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,DiasBP_Min,...,ADMITTIME,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT
4930,93535,121562,200011,56.0,82.0,71.205128,123.0,185.0,156.411765,37.0,...,2188-08-05 20:27:00,-64881.43517,EMERGENCY,Medicare,JEWISH,SINGLE,WHITE,ASTHMA;COPD EXACERBATION,49322,MICU
1052,30375,177945,200044,,,,,,,,...,2135-07-07 16:13:00,-46540.62661,EMERGENCY,Medicare,CATHOLIC,WIDOWED,WHITE,HEAD BLEED,85220,SICU
3412,73241,149216,200049,54.0,76.0,64.833333,95.0,167.0,114.545455,33.0,...,2118-08-14 22:27:00,-38956.8589,EMERGENCY,Private,JEWISH,MARRIED,WHITE,HEPATIC ENCEPHALOPATHY,5722,MICU
1725,99052,129142,200063,85.0,102.0,92.560976,91.0,131.0,108.365854,42.0,...,2141-03-09 23:19:00,-47014.25437,EMERGENCY,Medicaid,NOT SPECIFIED,SINGLE,UNKNOWN/NOT SPECIFIED,TYPE A DISSECTION,44101,CSRU
981,51698,190004,200081,82.0,133.0,94.323529,86.0,143.0,111.09375,47.0,...,2142-02-23 06:56:00,-47377.26087,EMERGENCY,Medicare,OTHER,MARRIED,PORTUGUESE,PULMONARY EMBOLISM,41519,CCU


In [3]:
#your code here
from numpy import random

random.seed(8912)## fixing the seed of your subsample

n_subset = 2000## size of the subsample
ind_subset = random.choice(data.shape[0], size= n_subset, replace=False)## indicies of your sampled points

data_test = data.copy().iloc[ind_subset,]## selecting the subsampled indicies
data.shape
data.head()

Unnamed: 0,HOSPITAL_EXPIRE_FLAG,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,...,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
0,0,55440,195768,228357,89.0,145.0,121.043478,74.0,127.0,106.586957,...,-61961.7847,EMERGENCY,Medicare,PROTESTANT QUAKER,SINGLE,WHITE,GASTROINTESTINAL BLEED,5789,MICU,4.5761
1,0,76908,126136,221004,63.0,110.0,79.117647,89.0,121.0,106.733333,...,-43146.18378,EMERGENCY,Private,UNOBTAINABLE,MARRIED,WHITE,ESOPHAGEAL FOOD IMPACTION,53013,MICU,0.7582
2,0,95798,136645,296315,81.0,98.0,91.689655,88.0,138.0,112.785714,...,-42009.96157,EMERGENCY,Medicare,PROTESTANT QUAKER,SEPARATED,BLACK/AFRICAN AMERICAN,UPPER GI BLEED,56983,MICU,3.7626
3,0,40708,102505,245557,76.0,128.0,98.857143,84.0,135.0,106.972973,...,-43585.37922,ELECTIVE,Medicare,NOT SPECIFIED,WIDOWED,WHITE,HIATAL HERNIA/SDA,5533,SICU,3.8734
4,0,28424,127337,225281,,,,,,,...,-50271.76602,EMERGENCY,Medicare,JEWISH,WIDOWED,WHITE,ABDOMINAL PAIN,56211,TSICU,5.8654


In [4]:
## We load the relevant modules

#import auxiliar functions

import os,sys,inspect
currentdir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
parentdir = os.path.dirname(os.path.dirname(currentdir))
sys.path.insert(1, parentdir)
from utils.helper_functions import *

import pandas as pd
import seaborn as sns
import numpy as np
import sklearn
import ipywidgets
from math import floor, ceil

from sklearn import preprocessing
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

np.random.seed(3123) # set random seed to ensure reproducibility

!pwd

/home/adam/Desktop/computational Machine Learning/Exam preparation documents/Exam_SVM


# Preprocessing of the data

In [5]:
# We distinguish between X_train and y_train

X_train = data.drop(['HOSPITAL_EXPIRE_FLAG'], axis=1)
y_train = data['HOSPITAL_EXPIRE_FLAG']

print(X_train.shape)
X_train.head()

(20885, 43)


Unnamed: 0,subject_id,hadm_id,icustay_id,HeartRate_Min,HeartRate_Max,HeartRate_Mean,SysBP_Min,SysBP_Max,SysBP_Mean,DiasBP_Min,...,Diff,ADMISSION_TYPE,INSURANCE,RELIGION,MARITAL_STATUS,ETHNICITY,DIAGNOSIS,ICD9_diagnosis,FIRST_CAREUNIT,LOS
0,55440,195768,228357,89.0,145.0,121.043478,74.0,127.0,106.586957,42.0,...,-61961.7847,EMERGENCY,Medicare,PROTESTANT QUAKER,SINGLE,WHITE,GASTROINTESTINAL BLEED,5789,MICU,4.5761
1,76908,126136,221004,63.0,110.0,79.117647,89.0,121.0,106.733333,49.0,...,-43146.18378,EMERGENCY,Private,UNOBTAINABLE,MARRIED,WHITE,ESOPHAGEAL FOOD IMPACTION,53013,MICU,0.7582
2,95798,136645,296315,81.0,98.0,91.689655,88.0,138.0,112.785714,45.0,...,-42009.96157,EMERGENCY,Medicare,PROTESTANT QUAKER,SEPARATED,BLACK/AFRICAN AMERICAN,UPPER GI BLEED,56983,MICU,3.7626
3,40708,102505,245557,76.0,128.0,98.857143,84.0,135.0,106.972973,30.0,...,-43585.37922,ELECTIVE,Medicare,NOT SPECIFIED,WIDOWED,WHITE,HIATAL HERNIA/SDA,5533,SICU,3.8734
4,28424,127337,225281,,,,,,,,...,-50271.76602,EMERGENCY,Medicare,JEWISH,WIDOWED,WHITE,ABDOMINAL PAIN,56211,TSICU,5.8654


In [6]:
# Check the names of the columns

X_train.columns

Index(['subject_id', 'hadm_id', 'icustay_id', 'HeartRate_Min', 'HeartRate_Max',
       'HeartRate_Mean', 'SysBP_Min', 'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min',
       'DiasBP_Max', 'DiasBP_Mean', 'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean',
       'RespRate_Min', 'RespRate_Max', 'RespRate_Mean', 'TempC_Min',
       'TempC_Max', 'TempC_Mean', 'SpO2_Min', 'SpO2_Max', 'SpO2_Mean',
       'Glucose_Min', 'Glucose_Max', 'Glucose_Mean', 'GENDER', 'DOB', 'DOD',
       'ADMITTIME', 'DISCHTIME', 'DEATHTIME', 'Diff', 'ADMISSION_TYPE',
       'INSURANCE', 'RELIGION', 'MARITAL_STATUS', 'ETHNICITY', 'DIAGNOSIS',
       'ICD9_diagnosis', 'FIRST_CAREUNIT', 'LOS'],
      dtype='object')

In [7]:
# Drop some categorical variables, dates and id vars

# From the training data

X_train = X_train.drop(['subject_id', 'hadm_id', 'icustay_id', 'DOB', 'DOD', 'ADMITTIME', 'DISCHTIME', 'DEATHTIME', 'Diff', 'DIAGNOSIS', 'LOS', 'ETHNICITY', 'RELIGION'], axis=1)
print(X_train.shape)
print(X_train.head())

# From the test data

X_test = data_test.drop(['subject_id', 'hadm_id', 'icustay_id', 'DOB','ADMITTIME', 'DISCHTIME', 'DEATHTIME', 'Diff', 'DIAGNOSIS', 'ETHNICITY', 'RELIGION'], axis=1)
print(X_test.shape)
print(X_test.head())

(20885, 30)
   HeartRate_Min  HeartRate_Max  HeartRate_Mean  SysBP_Min  SysBP_Max  \
0           89.0          145.0      121.043478       74.0      127.0   
1           63.0          110.0       79.117647       89.0      121.0   
2           81.0           98.0       91.689655       88.0      138.0   
3           76.0          128.0       98.857143       84.0      135.0   
4            NaN            NaN             NaN        NaN        NaN   

   SysBP_Mean  DiasBP_Min  DiasBP_Max  DiasBP_Mean  MeanBP_Min  ...  \
0  106.586957        42.0        90.0    61.173913        59.0  ...   
1  106.733333        49.0        74.0    64.733333        58.0  ...   
2  112.785714        45.0        67.0    56.821429        64.0  ...   
3  106.972973        30.0        89.0    41.864865        48.0  ...   
4         NaN         NaN         NaN          NaN         NaN  ...   

    SpO2_Mean  Glucose_Min  Glucose_Max  Glucose_Mean  GENDER  ADMISSION_TYPE  \
0   95.739130        111.0        230.0  

In [8]:
# List of the names of the categorical and numerical columns

X_train_categorical_names = ['GENDER', 'INSURANCE', 'ADMISSION_TYPE', 'MARITAL_STATUS', 'FIRST_CAREUNIT', 'ICD9_diagnosis']
X_test_categorical_names = ['GENDER', 'INSURANCE', 'ADMISSION_TYPE', 'MARITAL_STATUS', 'FIRST_CAREUNIT', 'ICD9_diagnosis']

X_train_numerical_names = [name for name in X_train.columns if name not in X_train_categorical_names]
X_test_numerical_names = [name for name in X_test.columns if name not in X_test_categorical_names]

In [9]:
# To check the unique categories of the variable 'FIRST_CAREUNIT'

X_train['FIRST_CAREUNIT'].unique()

array(['MICU', 'SICU', 'TSICU', 'CSRU', 'CCU'], dtype=object)

In [10]:
# To check null values per feature

Nulls_train = X_train.isnull().sum() #sort_values(ascending=False)
Nulls_test = X_test.isnull().sum() #sort_values(ascending=False)

# We have to check both X_train and X_test

print(Nulls_train)
print(Nulls_test)

# Identify columns with null variables
missing_data_col_X_train = X_train.columns[Nulls_train>0]
missing_data_col_X_test = X_test.columns[Nulls_test>0]
missing_data_col_X_train
missing_data_col_X_test

HeartRate_Min     2187
HeartRate_Max     2187
HeartRate_Mean    2187
SysBP_Min         2208
SysBP_Max         2208
SysBP_Mean        2208
DiasBP_Min        2209
DiasBP_Max        2209
DiasBP_Mean       2209
MeanBP_Min        2186
MeanBP_Max        2186
MeanBP_Mean       2186
RespRate_Min      2189
RespRate_Max      2189
RespRate_Mean     2189
TempC_Min         2497
TempC_Max         2497
TempC_Mean        2497
SpO2_Min          2203
SpO2_Max          2203
SpO2_Mean         2203
Glucose_Min        253
Glucose_Max        253
Glucose_Mean       253
GENDER               0
ADMISSION_TYPE       0
INSURANCE            0
MARITAL_STATUS     722
ICD9_diagnosis       0
FIRST_CAREUNIT       0
dtype: int64
HOSPITAL_EXPIRE_FLAG       0
HeartRate_Min            214
HeartRate_Max            214
HeartRate_Mean           214
SysBP_Min                215
SysBP_Max                215
SysBP_Mean               215
DiasBP_Min               215
DiasBP_Max               215
DiasBP_Mean              215
MeanBP_

Index(['HeartRate_Min', 'HeartRate_Max', 'HeartRate_Mean', 'SysBP_Min',
       'SysBP_Max', 'SysBP_Mean', 'DiasBP_Min', 'DiasBP_Max', 'DiasBP_Mean',
       'MeanBP_Min', 'MeanBP_Max', 'MeanBP_Mean', 'RespRate_Min',
       'RespRate_Max', 'RespRate_Mean', 'TempC_Min', 'TempC_Max', 'TempC_Mean',
       'SpO2_Min', 'SpO2_Max', 'SpO2_Mean', 'Glucose_Min', 'Glucose_Max',
       'Glucose_Mean', 'DOD', 'DEATHTIME', 'MARITAL_STATUS'],
      dtype='object')

In [11]:
## Remove columns with a considerable high proportion of missing data 

initial_columns = X_train.columns

print(X_train.shape)
my_percentage_valid = 0.2 #This percentage indicates the number of valid answers required to keep a column (if we increase it, there is more chance to eliminate more columns)
X_train = X_train.dropna(axis=1, thresh=round(my_percentage_valid * len(X_train.index)))
print(X_train.shape)

dropped_columns = list(set(initial_columns) - set(X_train.columns))

X_test = X_test.drop(columns=dropped_columns)

(20885, 30)
(20885, 30)


In [12]:
X_train.head

<bound method NDFrame.head of        HeartRate_Min  HeartRate_Max  HeartRate_Mean  SysBP_Min  SysBP_Max  \
0               89.0          145.0      121.043478       74.0      127.0   
1               63.0          110.0       79.117647       89.0      121.0   
2               81.0           98.0       91.689655       88.0      138.0   
3               76.0          128.0       98.857143       84.0      135.0   
4                NaN            NaN             NaN        NaN        NaN   
...              ...            ...             ...        ...        ...   
20880           65.0           92.0       78.500000       60.0      160.0   
20881           74.0          112.0       89.156250      100.0      150.0   
20882           58.0           97.0       76.933333       94.0      131.0   
20883           59.0          102.0       81.844444       96.0      150.0   
20884           59.0           97.0       77.526316       82.0      139.0   

       SysBP_Mean  DiasBP_Min  DiasBP_Max  Di

In [13]:
# Check specifically if categorical variable have missing values

X_train_categorical_columns = X_train[X_train_categorical_names].copy()
X_test_categorical_columns = X_test[X_test_categorical_names].copy()

print(X_train_categorical_columns.isnull().sum())
print(X_test_categorical_columns.isnull().sum())

GENDER              0
INSURANCE           0
ADMISSION_TYPE      0
MARITAL_STATUS    722
FIRST_CAREUNIT      0
ICD9_diagnosis      0
dtype: int64
GENDER             0
INSURANCE          0
ADMISSION_TYPE     0
MARITAL_STATUS    57
FIRST_CAREUNIT     0
ICD9_diagnosis     0
dtype: int64


In [14]:
X_train['MARITAL_STATUS'].fillna("Missing values", inplace=True)
X_test['MARITAL_STATUS'].fillna("Missing values", inplace=True)

In [15]:
X_train.head

<bound method NDFrame.head of        HeartRate_Min  HeartRate_Max  HeartRate_Mean  SysBP_Min  SysBP_Max  \
0               89.0          145.0      121.043478       74.0      127.0   
1               63.0          110.0       79.117647       89.0      121.0   
2               81.0           98.0       91.689655       88.0      138.0   
3               76.0          128.0       98.857143       84.0      135.0   
4                NaN            NaN             NaN        NaN        NaN   
...              ...            ...             ...        ...        ...   
20880           65.0           92.0       78.500000       60.0      160.0   
20881           74.0          112.0       89.156250      100.0      150.0   
20882           58.0           97.0       76.933333       94.0      131.0   
20883           59.0          102.0       81.844444       96.0      150.0   
20884           59.0           97.0       77.526316       82.0      139.0   

       SysBP_Mean  DiasBP_Min  DiasBP_Max  Di

In [16]:
# Simple imputation to the numerical variables

X_train_numerical_columns = X_train[X_train_numerical_names]
X_test_numerical_columns = X_test[X_test_numerical_names]


imp_mean = SimpleImputer(missing_values = np.nan, strategy= 'mean')

imp_mean.fit(X_train_numerical_columns)

X_train_numerical_columns = pd.DataFrame(imp_mean.transform(X_train_numerical_columns), columns = X_train_numerical_names)
X_test_numerical_columns = pd.DataFrame(imp_mean.transform(X_test_numerical_columns), columns = X_test_numerical_names)

X_train[X_train_numerical_names] = X_train_numerical_columns
X_test[X_test_numerical_names] = X_test_numerical_columns

print(X_train_numerical_columns.isnull().sum().sum())
print(X_test_numerical_columns.isnull().sum().sum())

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- DEATHTIME
- DISCHTIME
- DOD
- HOSPITAL_EXPIRE_FLAG
- LOS


In [None]:
X_train.head

In [None]:
# Això és com està als apunts però no funciona

#import category_encoders as ce

#diagnoses_encoder = ce.TargetEncoder(smoothing = 1.0)

#diagnoses_encoder.fit_transform(X_train['ICD9_diagnosis'], y_train['HOSPITAL_EXPIRE_FLAG'])

In [None]:
from category_encoders import TargetEncoder

# Canviar perquè no estigui igual que l'Adam.


Target_encoder = TargetEncoder(smoothing = 1.0) #handle_missing = 'return_nan'

# Fit the data
Target_encoder.fit(X = X_train['ICD9_diagnosis'], y = y_train)

# Transform

Values_train = Target_encoder.transform(X_train['ICD9_diagnosis'])
Values_test = Target_encoder.transform(X_test['ICD9_diagnosis'])
X_train['ICD9_diagnosis'] = Values_train
X_test['ICD9_diagnosis'] = Values_test

In [None]:
# Transform categorical variables into dummies (drop_first = True is used to get k-1 dummies out of k categorical levels by removing the first level)

X_train= pd.get_dummies(X_train, prefix = ['GENDER'], columns = ['GENDER'], drop_first = True)
X_test = pd.get_dummies(X_test, prefix = ['GENDER'], columns = ['GENDER'], drop_first = True)

X_train= pd.get_dummies(X_train, prefix = ['INSURANCE'], columns = ['INSURANCE'], drop_first = True)
X_test = pd.get_dummies(X_test, prefix = ['INSURANCE'], columns = ['INSURANCE'], drop_first = True)

X_train= pd.get_dummies(X_train, prefix = ['ADMISSION_TYPE'], columns = ['ADMISSION_TYPE'], drop_first = True)
X_test = pd.get_dummies(X_test, prefix = ['ADMISSION_TYPE'], columns = ['ADMISSION_TYPE'], drop_first = True)
                        
X_train= pd.get_dummies(X_train, prefix = ['MARITAL_STATUS'], columns = ['MARITAL_STATUS'], drop_first = True)
X_test = pd.get_dummies(X_test, prefix = ['MARITAL_STATUS'], columns = ['MARITAL_STATUS'], drop_first = True)

X_train= pd.get_dummies(X_train, prefix = ['FIRST_CAREUNIT'], columns = ['FIRST_CAREUNIT'], drop_first = True)
X_test = pd.get_dummies(X_test, prefix = ['FIRST_CAREUNIT'], columns = ['FIRST_CAREUNIT'], drop_first = True)

In [None]:
X_train.head()

# Feature standardization

In [None]:
# Scale the data 

scaler = preprocessing.StandardScaler(with_mean = True, with_std = True)
scaler.fit(X_train) # Fitted to the training set

# Apply the scaler to both training and test set

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

X_train = pd.DataFrame(X_train) 
X_test = pd.DataFrame(X_test)

# SVM model

In [None]:
#SVM model

# He canviat i he posat cv = 5, scoring 'roc_auc' i he afegit més valors al grid search

#Specify grid values for CV (Probability = False and scoring = 'accuracy' for the Grid Search because otherwise it is computationally expensive)

MySVM=SVC(kernel = 'rbf', probability = True)

grid_values = {'C':[0.1, 0.3, 0.6, 1, 1.5], 'gamma':[0.25, 0.5, 1, 1.25]}

grid_svc_acc = GridSearchCV(MySVM, param_grid = grid_values,scoring = 'roc_auc', n_jobs = -1, cv=5)

grid_svc_acc.fit(X_train, y_train)

## Optimal parameters from the grid

In [None]:
# Optimal parameters from the grid

GridSearch_table_plot(grid_svc_acc, "C", negative=False, display_all_params=False)

print('Best Cost parameter : '+ str(grid_svc_acc.best_estimator_.C))
print('Best gamma parameter : '+ str(grid_svc_acc.best_estimator_.gamma))

## Out-of-sample predictions

The predictions used the data from the <i>X_test</i> which has not been used to train the model.

In [None]:
# Out-sample

outsample_pred_prob = grid_svc_acc.predict_proba(X_test)

## In-sample predictions

The predictions used the data from the <i>X_train</i> which has been used to train the model.

In [None]:
# In-sample

insample_pred_prob = grid_svc_acc.predict_proba(X_train)

# AUC

print(roc_auc_score(y_train, insample_pred_prob[:, 1]))

### Kaggle Predictions Submissions

Once you have produced testset predictions you can submit these to <i> kaggle </i> in order to see how your model performs.

The following code provides an example of generating a <i> .csv </i> file to submit to kaggle
1) create a pandas dataframe with two columns, one with the test set "icustay_id"'s and the other with your predicted "HOSPITAL_EXPIRE_FLAG" for that observation

2) use the <i> .to_csv </i> pandas method to create a csv file. The <i> index = False </i> is important to ensure the <i> .csv </i> is in the format kaggle expects

In [None]:
# Produce .csv for kaggle testing
test_predictions_submit = pd.DataFrame({"icustay_id": data_test["icustay_id"], "HOSPITAL_EXPIRE_FLAG": outsample_pred_prob[:, 1]})
test_predictions_submit.to_csv("test_predictions_submit.csv", index = False)