<h1><center> MACHINE LEARNING - PROJECT </center></h1>
<center> "WHICH PEOPLE ARE MORE LIKELY TO SURVIVE THE BOOLEAN PANDEMIC?"</center>

Notebook structure:
* [1. Sample](#sample)
    * [1.1. Import Libraries](#import)
    * [1.2. Import Datasets](#import2)
* [2. Explore](#explore)
    * [2.1. Data Exploration](#dataexplore)
    * [2.2. Missing Values Analysis](#miss_values)
    * [2.3. Outliers Analysis](#outliers)
* [3. Modify](#modify)
    * [3.1. Transform and Create variables](#transf_create)
    * [3.2. Coherence Checking](#coherence)
    * [3.3. Correlation analysis](#corr)
    * [3.4. Train Validation Partition](#train_val)
    * [3.5. Data Standardization](#datastand)
    * [3.6. Feature Selection](#feature)
* [4. Model](#model)
    * [4.1. K Nearest Neighbors](#knn)
    * [4.2. Nearest Centroid](#knc)
    * [4.3. Random Forest](#rfc)
    * [4.4. Decision Tree](#dt)
    * [4.5. Passive Aggressive](#pa)
    * [4.6. Logistic Regression](#logreg)
    * [4.7. Multi-Layer Perceptron](#mlp)
    * [4.8. Ensemble](#ensemble)

* [5. Assess](#assess)

<hr>
<a class="anchor" id="sample">
    
# 1. Sample
    
</a>

<a class="anchor" id="import">

## 1.1. Import Libraries

</a>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer 
from sklearn.tree import DecisionTreeClassifier
from itertools import combinations_with_replacement
from sklearn.linear_model import LinearRegression, LogisticRegression, LassoCV, RidgeCV, PassiveAggressiveClassifier
from sklearn.neural_network import MLPClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

<a class="anchor" id="import2">

## 1.2. Import Datasets

</a>

In [None]:
df = pd.read_csv(r'Data/train.csv')
test_df = pd.read_csv(r'Data/test.csv')

<hr>
<a class="anchor" id="explore">
    
# 2. Explore
    
</a>

<a class="anchor" id="dataexplore">

## 2.1. Data Exploration

</a>

In [None]:
df.shape

In [None]:
df.head(5)

In [None]:
df.info()

In [None]:
df[['Severity', 'Birthday_year', 'Parents or siblings infected', 'Wife/Husband or children infected', 
    'Medical_Expenses_Family', 'Deceased']].describe()

In [None]:
df['Deceased'].value_counts(normalize=True)

`NOTE:` The target classes are imbalanced, so over- and under-sampling should be tested.
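
As an illustration of the over-sampling mentioned in the note, here is a minimal sketch of random over-sampling with `sklearn.utils.resample`. The toy frame and its values are made up, and which class is the majority in the real data is not assumed. In practice the resampling should be applied only to the training partition, never to the validation data.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame standing in for the training data (values are invented).
toy = pd.DataFrame({
    "Severity": [1, 2, 3, 1, 2, 3, 1, 2],
    "Deceased": [1, 1, 1, 1, 1, 1, 0, 0],
})

majority = toy[toy["Deceased"] == 1]
minority = toy[toy["Deceased"] == 0]

# Random over-sampling: draw minority rows with replacement
# until both classes have the same number of rows.
minority_upsampled = resample(minority, replace=True,
                              n_samples=len(majority), random_state=15)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["Deceased"].value_counts())
```

Under-sampling works the same way in reverse: sample the majority class without replacement down to the size of the minority class.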

<a class="anchor" id="miss_values">

## 2.2. Missing Values Analysis

</a>

In [None]:
print("# of missing values by variable:")
df.isnull().sum()

In [None]:
print("# of missing values by variable:")
test_df.isnull().sum()

##### Medical Tent

We drop the variable "Medical Tent" because it has 702 missing values out of a total of 900 (78%).

In [None]:
df = df.drop(columns='Medical_Tent')
test_df = test_df.drop(columns='Medical_Tent')
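
The decision above can also be expressed as a rule on the share of missing values per column. The sketch below uses a made-up toy frame and a hypothetical cut-off of 50%; neither comes from the project data.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) with one mostly-empty column.
toy = pd.DataFrame({
    "Medical_Tent": ["A", np.nan, np.nan, np.nan, np.nan],
    "City": ["Taos", "Santa Fe", np.nan, "Taos", "Taos"],
    "Severity": [1, 2, 3, 1, 2],
})

# Share of missing values per column (0.0 to 1.0).
missing_share = toy.isnull().mean()

threshold = 0.5  # hypothetical cut-off, chosen only for this illustration
to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=to_drop)

print(missing_share)
print("Dropped:", to_drop)
```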

##### City

To fill the missing values in the variable "City", we decided to use the mode, since only two observations are missing the city.

In [None]:
df.City.value_counts()

In [None]:
df['City'] = df['City'].fillna(df['City'].mode()[0])

In [None]:
df.City.value_counts()

##### Birthday Year

The remaining missing values all belong to the variable "Birthday Year", and we decided to fill them with the K-Nearest-Neighbors algorithm. There are 177 missing values, which we consider too many for a simple imputation (such as the mean or the median), but not enough to justify removing a variable that may be important for our model.
Later, with more knowledge of the dataset, we might consider removing it.

In [None]:
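# Training set: impute using the 10 nearest neighbours on the numeric columns
knn_vars = df.drop(['Patient_ID', 'Name', 'City', 'Deceased'], axis = 1)
imputer = KNNImputer(n_neighbors=10)
X_filled_knn = imputer.fit_transform(knn_vars)
# Look the column up by name instead of relying on a hard-coded position
year_col = knn_vars.columns.get_loc('Birthday_year')
years = np.round(X_filled_knn[:, year_col])

# Sanity check: print the stored value for any implausibly old birth year
for i in range(len(knn_vars)):
    if knn_vars.loc[i,'Birthday_year'] < 1900:
        print (years[i])
        
df['Birthday_year'] = years

# Test set
knn_vars_test = test_df.drop(['Patient_ID', 'Name', 'City'], axis = 1)
imputer = KNNImputer(n_neighbors=10)
Xtest_filled_knn = imputer.fit_transform(knn_vars_test)
year_col_test = knn_vars_test.columns.get_loc('Birthday_year')
years_df = np.round(Xtest_filled_knn[:, year_col_test])
test_df['Birthday_year'] = years_df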
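
The cell above fits a second imputer on the test set. A common alternative, sketched below on invented toy data, is to fit `KNNImputer` once on the training columns and reuse it on the test set, so that test rows are filled from training neighbours only. The frames, values and `n_neighbors=2` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy train/test frames (invented values) with the same columns.
train = pd.DataFrame({
    "Severity": [1, 2, 3, 1, 2, 3],
    "Birthday_year": [1950.0, 1980.0, np.nan, 1955.0, 1985.0, 2000.0],
    "Medical_Expenses_Family": [200.0, 400.0, 900.0, 210.0, 390.0, 880.0],
})
test = pd.DataFrame({
    "Severity": [1, 3],
    "Birthday_year": [np.nan, 1999.0],
    "Medical_Expenses_Family": [205.0, 890.0],
})

# Fit on the training data only, then apply the same imputer to both sets.
imputer = KNNImputer(n_neighbors=2)
train_filled = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)
test_filled = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(test_filled)
```

Because the distances are computed on unscaled columns, the variable with the largest range dominates the choice of neighbours; scaling before imputation is worth considering.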

<a class="anchor" id="outliers">

## 2.3. Outliers Analysis

</a>

In [None]:
f, axes = plt.subplots(1,2, figsize=(10, 5), squeeze=False)    
sns.boxplot(x=df["Birthday_year"], color="skyblue", ax=axes[0, 0])
sns.boxplot(x=df["Medical_Expenses_Family"], color="blue", ax=axes[0, 1])

In [None]:
f, axes = plt.subplots(2, 2, figsize=(10, 10))
# distplot is deprecated in recent seaborn releases; histplot draws the same histograms
sns.histplot(x=df["Birthday_year"], color="skyblue", ax=axes[0, 0], kde=False)
sns.histplot(x=df["Parents or siblings infected"], color="steelblue", ax=axes[0, 1], kde=False)
sns.histplot(x=df["Medical_Expenses_Family"], color="blue", ax=axes[1, 0], kde=False)
sns.histplot(x=df["Wife/Husband or children infected"], color="c", ax=axes[1, 1], kde=False)

In [None]:
df['Outlier'] = 0
df.loc[df['Medical_Expenses_Family']>13000, 'Outlier']=1
df['Outlier'].value_counts()

In [None]:
df = df.loc[df['Outlier'] == 0]

<hr>
<a class="anchor" id="modify">

# 3. Modify
    
</a>

<hr>
<a class="anchor" id="transf_create">

## 3.1. Transform and Create variables
    
</a>

In [None]:
# Dummy variables for two cities (any other city is the reference level)
for frame in (df, test_df):
    frame['Santa Fe'] = (frame['City'] == 'Santa Fe').astype(int)
    frame['Taos'] = (frame['City'] == 'Taos').astype(int)

df['Age'] = 2020 - df['Birthday_year']
test_df['Age'] = 2020 - test_df['Birthday_year']

df['Family_cases'] = df['Parents or siblings infected'] + df['Wife/Husband or children infected']
test_df['Family_cases'] = test_df['Parents or siblings infected'] + test_df['Wife/Husband or children infected']

Family_size = pd.DataFrame(df['Patient_ID'].groupby(df['Family_Case_ID']).count())
Family_size = Family_size.rename({'Patient_ID':'Family_size'}, axis='columns') 
df = df.merge(Family_size, on = ['Family_Case_ID'])

Family_size_test = pd.DataFrame(test_df['Patient_ID'].groupby(test_df['Family_Case_ID']).count())
Family_size_test = Family_size_test.rename({'Patient_ID':'Family_size'}, axis='columns') 
test_df = test_df.merge(Family_size_test, on = ['Family_Case_ID'])

df['Medical_Expenses_Person'] = df['Medical_Expenses_Family']/df['Family_size']
test_df['Medical_Expenses_Person'] = test_df['Medical_Expenses_Family']/test_df['Family_size']

df['Parents_infected_bin'] = 0
df.loc[df['Parents or siblings infected'] != 0, 'Parents_infected_bin'] = 1
test_df['Parents_infected_bin'] = 0
test_df.loc[test_df['Parents or siblings infected'] != 0, 'Parents_infected_bin'] = 1

df['WifeOrChildren_infected_bin'] = 0
df.loc[df['Wife/Husband or children infected'] != 0, 'WifeOrChildren_infected_bin'] = 1
test_df['WifeOrChildren_infected_bin'] = 0
test_df.loc[test_df['Wife/Husband or children infected'] != 0, 'WifeOrChildren_infected_bin'] = 1

df.drop(columns = ['City','Name','Outlier','Family_Case_ID','Birthday_year'], inplace = True)
test_df.drop(columns = ['City','Name','Family_Case_ID','Birthday_year'], inplace = True)

df.set_index('Patient_ID', inplace = True)
test_df.set_index('Patient_ID', inplace = True)

In [None]:
# family_cases/family_size

<hr>
<a class="anchor" id="coherence">

## 3.2. Coherence Checking
    
</a>

In [None]:
df2 = df.copy()
df2['Incoherent'] = 0
# I think we should remove this: it makes no sense to flag incoherences and then keep them
df2.loc[df2['Family_cases'] > df2['Family_size'], 'Incoherent'] = 1
df2.loc[(df2['Age'] > 120) | (df2['Age'] < 0), 'Incoherent'] = 1
df2['Incoherent'].value_counts()

<hr>
<a class="anchor" id="corr">

## 3.3. Correlation Analysis
    
</a>

In [None]:
# Confirm whether the dependent variable should be included in the correlation matrix
plt.rcParams['figure.figsize'] = (12,12)

corr_matrix=df.drop(columns=['Wife/Husband or children infected', 'Parents or siblings infected']).corr(method = 'spearman')
# np.bool was removed from NumPy; the built-in bool is the equivalent dtype
mask=np.zeros_like(corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)]=True
sns.heatmap(data=corr_matrix, mask=mask, center=0, annot=True, linewidths=2, cmap='coolwarm')
plt.tight_layout()

From the correlation analysis, and based on a threshold of 0.8 in absolute value, we conclude that we should either remove the variable "Family_cases" or remove both "Parents or siblings infected" and "Wife/Husband or children infected". Besides that, some values deserve attention, especially those involving "Medical_Expenses_Person", which also seems to be strongly correlated with "Medical_Expenses_Family" (0.75) and "Severity" (-0.79).

In section 3.6, we proceed to feature selection, where we take the values obtained in this analysis into account.
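
Reading pairs off a heatmap is error-prone, so the threshold rule can also be applied programmatically. The sketch below lists every pair of variables whose absolute Spearman correlation reaches the threshold. The toy columns `a`, `b` and `c` are synthetic and only illustrate the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a copy of "a", "c" is unrelated.
rng = np.random.default_rng(15)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

threshold = 0.8
corr = toy.corr(method="spearman")

# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs.abs() >= threshold]

print(high_pairs)
```

Applied to the notebook's `corr_matrix`, the same lines would return the candidate pairs to review during feature selection.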

<hr>
<a class="anchor" id="train_val">

## 3.4. Train Validation Partition
    
</a>

In [None]:
variables = ['Severity', 'Age', 'Santa Fe', 'Taos', 'Parents_infected_bin', 'WifeOrChildren_infected_bin',
             'Family_cases', 'Family_size', 'Medical_Expenses_Person']

X = df[variables]

y = df['Deceased']

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, 
                                                  random_state=15, shuffle=True, stratify=y)

<hr>
<a class="anchor" id="datastand">

## 3.5. Standardization
    
</a>

In [None]:
scaler = StandardScaler().fit(X_train)  # z-score
#scaler = RobustScaler().fit(X_train) # robust standardization

scaler_X_train = scaler.transform(X_train)
scaler_X_train = pd.DataFrame(scaler_X_train, columns=variables)

scaler_X_val = scaler.transform(X_val)
scaler_X_val = pd.DataFrame(scaler_X_val, columns=variables)

test_df = test_df[variables]

scaler_X_test = scaler.transform(test_df)
scaler_X_test = pd.DataFrame(scaler_X_test, columns=variables)

In [None]:
# min max normalization
#minmax = MinMaxScaler(feature_range=(-1,1)).fit(X_train)
##minmax = MinMaxScaler().fit(X_train)
#scaler_X_train = minmax.transform(X_train)
#scaler_X_val = minmax.transform(X_val)
#scaler_X_test = minmax.transform(test_df)

<hr>
<a class="anchor" id="feature">

## 3.6. Feature Selection
    
</a>

<hr>
<a class="anchor" id="lasso">

### 3.6.1. Lasso Regression
    
</a>

In [None]:
def plot_importance(coef,name):
    imp_coef = coef.sort_values()
    plt.figure(figsize=(8,10))
    imp_coef.plot(kind = "barh")
    plt.title("Feature importance using " + name + " Model")
    plt.show()

In [None]:
reg = LassoCV()
reg.fit(scaler_X_train, y_train)
coef = pd.Series(reg.coef_, index=scaler_X_train.columns)
coef.sort_values()

In [None]:
plot_importance(coef, 'Lasso')

<hr>
<a class="anchor" id="ridge">

### 3.6.2. Ridge Regression
    
</a>

In [None]:
ridge = RidgeCV()
ridge.fit(X=scaler_X_train, y=y_train)
coef_ridge = pd.Series(ridge.coef_, index=scaler_X_train.columns)
print(coef_ridge.sort_values())

In [None]:
plot_importance(coef_ridge,'Ridge')

<hr>
<a class="anchor" id="rfe">

### 3.6.3. Recursive Feature Elimination (RFE)
    
</a>

In [None]:
# Candidate numbers of features (1 to 9)
nof_list = np.arange(1, 10)
high_score = 0
# Variable to store the optimum number of features
nof = 0
score_list = []

# The internal split is identical for every candidate, so it is computed once
X_split_train, X_split_val, y_train_rfe, y_rfe_val = train_test_split(scaler_X_train, y_train, test_size = 0.2,
                                                                      random_state = 100)

for n_features in nof_list:
    model_rfe = LogisticRegression()
    # n_features_to_select must be passed by keyword in recent scikit-learn releases
    rfe = RFE(estimator=model_rfe, n_features_to_select=n_features)
    X_train_rfe = rfe.fit_transform(X_split_train, y_train_rfe)
    X_rfe_val = rfe.transform(X_split_val)

    model_rfe.fit(X_train_rfe, y_train_rfe)

    score = model_rfe.score(X_rfe_val, y_rfe_val)
    score_list.append(score)

    if score > high_score:
        high_score = score
        nof = n_features

print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))

In [None]:
N = 6
model_rfe = LogisticRegression()
rfe = RFE(estimator = model_rfe, n_features_to_select = N)
X_rfe = rfe.fit_transform(X = scaler_X_train, y = y_train) 

selected_features_rfe = pd.Series(rfe.ranking_, index = scaler_X_train.columns)
selected_features_rfe.sort_values()

<hr>
<a class="anchor" id="model">

# 4. Model
    
</a>

<hr>
<a class="anchor" id="knn">

## 4.1. K Nearest Neighbors
    
</a>

In [None]:
k = 3
cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=15)

In [None]:
knn_clf = KNeighborsClassifier()

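# 'cosine' is only supported by the brute-force search, so it gets its own grid
knn_parameters = [{'n_neighbors' : np.arange(13,31,1),
                   'metric' : ['euclidean', 'manhattan', 'minkowski'],
                   'weights' : ['uniform', 'distance'],
                   'algorithm': ['ball_tree', 'kd_tree', 'brute']},
                  {'n_neighbors' : np.arange(13,31,1),
                   'metric' : ['cosine'],
                   'weights' : ['uniform', 'distance'],
                   'algorithm': ['brute']}]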

knn_grid = GridSearchCV(estimator=knn_clf, param_grid=knn_parameters, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)

knn_grid.fit(scaler_X_train, y_train)

In [None]:
knn_grid.best_params_

In [None]:
knn_grid.best_score_

<hr>
<a class="anchor" id="knc">

## 4.2. Nearest Centroid
    
</a>

In [None]:
knc_clf = NearestCentroid()

# Recent scikit-learn releases accept only 'euclidean' and 'manhattan' for NearestCentroid
knc_parameters = {'metric' : ['euclidean', 'manhattan']}

knc_grid = GridSearchCV(estimator=knc_clf, param_grid=knc_parameters, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)

knc_grid.fit(scaler_X_train, y_train)
knc_grid.best_params_

In [None]:
knc_grid.best_score_

<hr>
<a class="anchor" id="rfc">

## 4.3. Random Forest
    
</a>

In [None]:
rf_clf = RandomForestClassifier(class_weight='balanced', random_state=15)

# 'auto' (an alias of 'sqrt' for classifiers) was removed from recent scikit-learn releases
rf_parameters = {"n_estimators": np.arange(100, 400, 100),
                 "max_features": ['sqrt', 'log2', None],
                 "criterion": ['gini', 'entropy'],
                 "warm_start" : [True, False]}

rf_grid = GridSearchCV(estimator=rf_clf, param_grid=rf_parameters, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)

rf_grid.fit(X_train, y_train)
rf_grid.best_params_

In [None]:
rf_grid.best_score_

<hr>
<a class="anchor" id="dt">

## 4.4. Decision Tree
    
</a>

In [None]:
dt_clf = DecisionTreeClassifier(class_weight='balanced', random_state=15)

# 'auto' (an alias of 'sqrt' for classifiers) was removed from recent scikit-learn releases
dt_parameters = {"max_features": ['sqrt', 'log2', None],
                 "splitter" : ['best', 'random'],
                 "criterion": ['gini', 'entropy']}

dt_grid = GridSearchCV(estimator=dt_clf, param_grid=dt_parameters, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)

dt_grid.fit(X_train, y_train)
dt_grid.best_params_

In [None]:
dt_grid.best_score_

<hr>
<a class="anchor" id="pa">

## 4.5. Passive Aggressive
    
</a>

In [None]:
pa_clf = PassiveAggressiveClassifier(class_weight='balanced', random_state=15)

pa_parameters = {"warm_start" : [True, False],
                 "early_stopping" : [True, False],
                 "max_iter" : (100, 500, 1000)}

pa_grid = GridSearchCV(estimator=pa_clf, param_grid=pa_parameters, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)

pa_grid.fit(scaler_X_train, y_train)
pa_grid.best_params_

In [None]:
pa_grid.best_score_

<hr>
<a class="anchor" id="logreg">

## 4.6. Logistic Regression
    
</a>

In [None]:
log_clf = LogisticRegression(multi_class='multinomial')

log_parameters = {'solver': ['newton-cg', 'sag', 'saga', 'lbfgs'],
                  'warm_start' : [True, False],
                  'max_iter' : (100, 200, 300)}

log_grid = GridSearchCV(estimator=log_clf, param_grid=log_parameters, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)
log_grid.fit(scaler_X_train, y_train)

In [None]:
log_grid.best_score_

<hr>
<a class="anchor" id="mlp">

## 4.7. Multi-Layer Perceptron
    
</a>

#### Grid 1
###### Note: Set the parameters for a limited number of neurons/hidden layers

In [None]:
mlp = MLPClassifier(random_state=15, max_iter=600)

mlp_parameters = {
#    'hidden_layer_sizes': [(50,),(50,50,),(50,50,50)],
    'activation': ['identity','logistic','tanh', 'relu'],
    'solver': ['lbfgs','sgd', 'adam'],
    'alpha':np.logspace(-5, 3, 5),
#     'batch_size':(),
    'learning_rate_init': list(np.linspace(0.00001,0.1,5)),
    'warm_start': [True,False],
    'learning_rate': ['constant','invscaling','adaptive'],
    'early_stopping' : [True,False]
}

mlp_grid = GridSearchCV(estimator=mlp, param_grid=mlp_parameters, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)
mlp_grid.fit(scaler_X_train, y_train)
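
Grid 1 is large, so it can help to count the candidates before launching the search. The sketch below rebuilds the same grid and uses `sklearn.model_selection.ParameterGrid` to count the combinations; the fold count of 3 mirrors the `k = 3` set earlier in the notebook.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same values as Grid 1, written as plain lists.
mlp_parameters = {
    'activation': ['identity', 'logistic', 'tanh', 'relu'],
    'solver': ['lbfgs', 'sgd', 'adam'],
    'alpha': list(np.logspace(-5, 3, 5)),
    'learning_rate_init': list(np.linspace(0.00001, 0.1, 5)),
    'warm_start': [True, False],
    'learning_rate': ['constant', 'invscaling', 'adaptive'],
    'early_stopping': [True, False],
}

n_candidates = len(ParameterGrid(mlp_parameters))
n_folds = 3  # matches k = 3 used for StratifiedKFold
n_fits = n_candidates * n_folds

print(n_candidates, "candidates,", n_fits, "fits")
```

Several of these parameters only affect some solvers (for example, `learning_rate` is only used by `sgd`), so splitting the grid by solver would remove many redundant fits.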

In [None]:
mlp_grid.best_params_

In [None]:
mlp_grid.best_score_

#### Grid 2.
###### Note: For the parameters found above, test all possible combinations of layers/neurons

In [None]:
def combination_layers(min_neurons, max_neurons, n_layers):
    # All non-decreasing tuples of layer sizes in [min_neurons, max_neurons)
    return list(combinations_with_replacement(range(min_neurons, max_neurons), n_layers))
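
To make the output of `combination_layers` concrete, the sketch below re-declares an equivalent version of the function and shows what it returns for a small range, followed by the size of the grid used in Grid 2.

```python
from itertools import combinations_with_replacement

def combination_layers(min_neurons, max_neurons, n_layers):
    # Equivalent to the notebook's helper: max_neurons is excluded.
    return list(combinations_with_replacement(range(min_neurons, max_neurons), n_layers))

# Small range: layer sizes 10, 11 and 12 with two hidden layers.
small = combination_layers(10, 13, 2)
print(small)

# Range used in Grid 2: sizes 10 to 49 with two hidden layers.
full = combination_layers(10, 50, 2)
print(len(full))
```

Two properties follow from `combinations_with_replacement`: the upper bound is exclusive, and only non-decreasing tuples are produced, so an architecture such as `(20, 10)` is never tested.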

In [None]:
mlp = MLPClassifier(random_state=15, max_iter=600, activation='tanh', solver='lbfgs', alpha=10, 
                    learning_rate='constant', learning_rate_init=1e-05, warm_start=True, early_stopping=True)

mlp_parameters = {
    'hidden_layer_sizes': combination_layers(10,50,2)
}

mlp_grid = GridSearchCV(estimator=mlp, param_grid=mlp_parameters, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)
mlp_grid.fit(scaler_X_train, y_train)

In [None]:
mlp_grid.best_score_

<hr>
<a class="anchor" id="ensemble">

## 4.8. Ensemble 
    
</a>

To improve on the individual models, we implemented ensemble methods. The goal is to combine conceptually different machine learning classifiers into a single classifier that performs better than its members, compensates for their individual weaknesses, and reduces overfitting. There are two main approaches: bagging and boosting. Bagging trains complex base models and "smooths out" their predictions, while boosting combines individually mediocre models into a high-performance one.
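
The contrast between the two approaches can be sketched on synthetic data. Below, bagging averages fully grown decision trees, while AdaBoost chains its default weak learners (depth-1 trees). The dataset, the number of estimators and the seeds are arbitrary choices for illustration, and the scores say nothing about the project data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=15)

models = {
    # Bagging: many deep, high-variance trees trained on bootstrap samples.
    "bagging": BaggingClassifier(DecisionTreeClassifier(random_state=15),
                                 n_estimators=25, random_state=15),
    # Boosting: a sequence of weak learners, each focusing on earlier mistakes.
    "boosting": AdaBoostClassifier(n_estimators=25, random_state=15),
}

scores = {name: cross_val_score(clf, X_toy, y_toy, cv=3, scoring="accuracy").mean()
          for name, clf in models.items()}

for name, score in scores.items():
    print(name, round(score, 3))
```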

<hr>
<a class="anchor" id="bbc">

### 4.8.1. Balanced Bagging Classifier
    
</a>

In [None]:
# The base estimator must be a single estimator, not a list of grid searches,
# so the default (a decision tree) is used here
bbc_clf = BalancedBaggingClassifier()

bbc_parameters = {
                  'n_estimators' : (5, 10, 15, 20),
                  'bootstrap' : [True, False],
                  'bootstrap_features' : [True, False],
                  'warm_start' : [True, False],
                  # 'minority' is not a valid strategy for the under-sampling done by this classifier
                  'sampling_strategy' : ['not majority', 'not minority', 'all', 'majority'],
                  'replacement' : [True, False]}

bbc_grid = GridSearchCV(estimator=bbc_clf, param_grid=bbc_parameters, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)
bbc_grid.fit(scaler_X_train, y_train)

In [None]:
bbc_grid.best_score_

<hr>
<a class="anchor" id="gbc">

### 4.8.2. Gradient Boosting Classifier
    
</a>

In [None]:
GB_clf = GradientBoostingClassifier()

# 'deviance' was renamed 'log_loss' and 'auto' (an alias of 'sqrt') was removed in recent scikit-learn releases
GB_parameters = {'loss' : ['log_loss', 'exponential'],
                 'learning_rate' : (0.01, 0.1, 1),
                 'n_estimators' : np.arange(100, 400, 100),
                 'max_depth' : (5, 10, 15, 20, 30),
                 'max_features' : ['sqrt', 'log2', None],
                 'warm_start' : [True, False]}

GB_grid = GridSearchCV(estimator=GB_clf, param_grid=GB_parameters, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1)
GB_grid.fit(scaler_X_train , y_train)

In [None]:
GB_grid.best_score_

<hr>
<a class="anchor" id="adaboost">

### 4.8.3. AdaBoost Classifier
    
</a>

In [None]:
AdaBoost = AdaBoostClassifier()

# The weak learner must be a single estimator that supports sample weights
# (a list or a k-NN model is not valid), so the default decision stump is kept
AdaBoost_parameters = {'n_estimators' : np.arange(50, 200, 50),
                       'learning_rate' : (0.01, 0.1, 1, 10)}

AdaBoost_grid = GridSearchCV(estimator=AdaBoost, param_grid=AdaBoost_parameters, cv=cv, 
                             scoring='accuracy', verbose=1, n_jobs=-1)

AdaBoost_grid.fit(scaler_X_train , y_train)

In [None]:
AdaBoost_grid.best_score_

<hr>
<a class="anchor" id="vc">

### 4.8.4. Voting Classifier
    
</a>

In [None]:
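from sklearn.naive_bayes import GaussianNB
nb_clf = GaussianNB()  # assumption: nb_clf is never defined above, so an untuned Gaussian Naive Bayes is used
vc = VotingClassifier(estimators=[('lr', log_grid), ('knn', knn_grid), ('gnb', nb_clf), 
                                  ('gb', GB_grid), ('adaboost', AdaBoost_grid)]).fit(scaler_X_train, y_train)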

<hr>
<a class="anchor" id="assess">

# 5. Assess
    
</a>

In [None]:
model = knn_grid
# Predict first: the accuracy below needs labels_val to exist
labels_train = model.predict(scaler_X_train)
labels_val = model.predict(scaler_X_val)
accuracy = accuracy_score(y_val, labels_val)

In [None]:
def metrics(y_train, pred_train , y_val, pred_val):
    print('___________________________________________________________________________________________________________')
    print('                                                     TRAIN                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_train, pred_train))
    print(confusion_matrix(y_train, pred_train))


    print('___________________________________________________________________________________________________________')
    print('                                                VALIDATION                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_val, pred_val))
    print(confusion_matrix(y_val, pred_val))
    
metrics(y_train, labels_train, y_val, labels_val)

In [None]:
def print_parameters (variables, k, model, accuracy):
    print("Variables:", variables )
    print("Nº of folds:", k)
    print("Parameters:", model.best_params_)
    print("Accuracy:", accuracy)

print_parameters(variables, k, model, accuracy = accuracy_score(y_val, labels_val))

In [None]:
version = 3

labels_test = model.predict(scaler_X_test)
test_output = pd.DataFrame(labels_test, columns = ["Deceased"])
test_output["Patient_ID"] = test_df.index
cols = list(test_output)
cols = cols[-1:] + cols[:-1]
test_output = test_output[cols]
test_output = test_output.sort_values(by=['Patient_ID'])

# Forward slashes work on every platform and match the paths used to load the data
path = 'Data/Group38_version' + str(version) + '.csv'
test_output.to_csv(path, index = False)