# Homework 5

Building a Rashomon set of models.

The models are built on the preprocessing described in the article: https://academic.oup.com/jamiaopen/article/1/1/87/5032901.

The code accompanying the article is available at: https://github.com/illidanlab/urgent-care-comparative

Task: a classification problem, predicting mortality from the *X48* data representation (as defined in the article above).

### Libraries

In [1]:
import numpy as np
import pandas as pd

import pickle
import os.path

import xgboost as xgb

from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import auc as auc_score
from sklearn.utils import shuffle

### Loading the preprocessed data

In [2]:
X = np.load("X48.npy")

In [3]:
with open('y', 'rb') as f:
    labels = pickle.load(f)
    
task = [yy[0] for yy in labels]
y = np.array(task)

### Generating cross-validation folds

Modeling uses five-fold cross-validation. To keep the runs reproducible, the fold indices are saved to a file and loaded from it on later runs:

In [4]:
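def get_cv_samples_indexes(X, y):
    """Return the five (train_index, test_index) pairs, reusing saved folds if present."""
    if os.path.isfile('samples.npy'):
        # The file is written with pickle below; np.load reads pickled
        # objects when allow_pickle is True.
        return np.load("samples.npy", allow_pickle = True)
    else:
        tab = []
        # Stratified folds keep the class proportions of y in every split.
        skf = StratifiedKFold(n_splits = 5)

        for train_index, test_index in skf.split(X, y):
            tab.append((train_index, test_index))

        # Save the folds so that later runs reuse exactly the same splits.
        with open('samples.npy', 'wb') as f:
            pickle.dump(tab, f)

        return tab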

In [5]:
cv_tab = get_cv_samples_indexes(X, y)
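
Before modeling, it may be worth checking that the folds behave as expected. The sketch below is not part of the original notebook: the helper names `check_folds` and `positive_rates` are hypothetical, and a small synthetic label vector stands in for the real `y`. It checks that the test folds are disjoint from their training folds, cover every sample exactly once, and keep the share of positive labels roughly constant.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def check_folds(folds, y):
    """Return True if test folds cover each sample once and never overlap their train fold."""
    test_idx = np.concatenate([test for _, test in folds])
    covers_all = np.array_equal(np.sort(test_idx), np.arange(len(y)))
    no_leak = all(len(np.intersect1d(train, test)) == 0 for train, test in folds)
    return covers_all and no_leak

def positive_rates(folds, y):
    """Share of positive labels in each test fold."""
    return [float(np.mean(y[test])) for _, test in folds]

# Synthetic stand-in for the real data (assumption: binary labels, 20% positives).
rng = np.random.default_rng(0)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = rng.normal(size=(100, 3))

folds_demo = list(StratifiedKFold(n_splits=5).split(X_demo, y_demo))
folds_ok = check_folds(folds_demo, y_demo)
rates = positive_rates(folds_demo, y_demo)
print(folds_ok, rates)
```

On the real data the same helpers could be called as `check_folds(cv_tab, y)` and `positive_rates(cv_tab, y)`.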

### Objects: model, random search, hyperparameter grid

In [6]:
model = xgb.XGBClassifier(objective='binary:logistic', n_jobs = -1, eval_metric = 'auc', use_label_encoder = False, seed = 123)

The hyperparameter ranges follow Table 1 of the paper: https://jmlr.org/papers/volume20/18-444/18-444.pdf

Parameter documentation: https://xgboost.readthedocs.io/en/latest/parameter.html

In [7]:
hyperparameters =  {
    'learning_rate' : 2 ** np.linspace(-10, 0, num = 100),
    'subsample' : np.linspace(0.1, 1, num = 100),
    'booster' : ['gbtree', 'dart'],
    'max_depth' : list(range(1, 20 + 1)),
    'min_child_weight' : 2 ** np.linspace(0, 7, num = 100),
    'colsample_bytree' : np.linspace(0.001, 1, num = 100),
    'colsample_bylevel' : np.linspace(0.001, 1, num = 100),
    'lambda' : 2 ** np.linspace(-10, 10, num = 100),
    'alpha' : 2 ** np.linspace(-10, 10, num = 100),
    'n_estimators' : list(range(30, 700, 10))
}
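
To see what a random search draws from a grid like this, the sketch below samples a few configurations with scikit-learn's `ParameterSampler`. It is an illustration only: `demo_space` is a hypothetical, reduced copy of the grid above, and `random_state=0` is an arbitrary choice. Because `learning_rate` is spaced evenly in the exponent of 2, the draws are roughly uniform on a log scale rather than on the raw scale.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# Reduced, hypothetical copy of the grid, kept small for illustration.
demo_space = {
    'learning_rate': 2 ** np.linspace(-10, 0, num=100),
    'max_depth': list(range(1, 20 + 1)),
    'booster': ['gbtree', 'dart'],
}

# When every entry is a list or array, candidates are drawn without replacement.
sampled = list(ParameterSampler(demo_space, n_iter=3, random_state=0))
for params in sampled:
    print(params)
```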

In [8]:
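class cross_val_gen:
    """Cross-validation splitter that replays precomputed (train, test) index pairs."""

    def __init__(self, cv_tab):
        self.cv_tab = cv_tab
        self.n_splits = len(cv_tab)

    def split(self, X, y=None, groups=None):
        # Use the folds stored on the instance, not the global cv_tab.
        for train_index, test_index in self.cv_tab:
            yield train_index, test_index

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits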
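
As a side note, scikit-learn's `cv` argument also accepts an iterable of `(train, test)` index pairs directly, so a wrapper class is one option rather than a requirement. The sketch below shows this on synthetic data with `LogisticRegression` standing in for the XGBoost model; the data, the estimator, and the `*_demo` names are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (stand-in for X and y).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Precomputed folds as a plain list of (train_index, test_index) tuples.
folds_demo = list(StratifiedKFold(n_splits=5).split(X_demo, y_demo))

# The list is passed straight to cv; each pair is used as one split.
scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                         scoring='roc_auc', cv=folds_demo)
print(scores)
```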

In [9]:
number_of_models = 1

In [10]:
cv_search_obj = RandomizedSearchCV(estimator = model, param_distributions = hyperparameters, n_iter = number_of_models, 
                                   scoring = 'roc_auc', cv = cross_val_gen(cv_tab), return_train_score = True)

### Modeling

In [11]:
search = cv_search_obj.fit(X, y)

### Results data frame

In [12]:
results = pd.DataFrame(search.cv_results_)

In [13]:
results.columns

Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_subsample', 'param_n_estimators', 'param_min_child_weight',
       'param_max_depth', 'param_learning_rate', 'param_lambda',
       'param_colsample_bytree', 'param_colsample_bylevel', 'param_booster',
       'param_alpha', 'params', 'split0_test_score', 'split1_test_score',
       'split2_test_score', 'split3_test_score', 'split4_test_score',
       'mean_test_score', 'std_test_score', 'rank_test_score',
       'split0_train_score', 'split1_train_score', 'split2_train_score',
       'split3_train_score', 'split4_train_score', 'mean_train_score',
       'std_train_score'],
      dtype='object')

In [14]:
results.head()

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_subsample | param_n_estimators | param_min_child_weight | param_max_depth | param_learning_rate | param_lambda | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21.128428 | 1.189855 | 0.030795 | 0.001162 | 0.527273 | 510 | 16.3396 | 14 | 0.00396133 | 0.018484 | ... | 0.803544 | 0.007951 | 1 | 0.821616 | 0.816907 | 0.821382 | 0.820388 | 0.822033 | 0.820465 | 0.001859 |


In [15]:
with open('results.npy', 'wb') as f:
    pickle.dump(results, f)

In [16]:
results.to_csv("results.csv")
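
Once more than one configuration has been evaluated, a Rashomon set can be read off the results table. The sketch below uses one common working definition, which is an assumption here and not something the notebook specifies: keep every model whose mean test score is within `epsilon` of the best one. The function name `rashomon_set`, the value of `epsilon`, and the toy scores are all hypothetical.

```python
import pandas as pd

def rashomon_set(results, epsilon=0.01, score_col='mean_test_score'):
    """Rows whose score is within epsilon of the best score, best first."""
    best = results[score_col].max()
    keep = results[score_col] >= best - epsilon
    return results[keep].sort_values(score_col, ascending=False)

# Toy values made up for illustration; they are not results from this notebook.
toy = pd.DataFrame({
    'mean_test_score': [0.80, 0.795, 0.78, 0.60],
    'param_max_depth': [14, 6, 3, 1],
})

selected = rashomon_set(toy, epsilon=0.01)
print(selected)
```

With the real table the call would be `rashomon_set(results)`, after raising `number_of_models` so that there is more than one row to choose from.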