# Tuning
## Feature Selection
Perform feature selection using the ANOVA F-score, computed on the training data only. For the granular dataset, select the top 5, 10, 25, 50, 100, 500, 1000, and 5000 features. For the grouped dataset, select the top 5, 10, 25, 50, 100, and 500 features.<br/>
See script.
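
The selection step might look roughly like the sketch below. It is not the project's script: it assumes scikit-learn's `SelectKBest` with `f_classif` as the ANOVA F-score, and it uses a synthetic matrix from `make_classification` in place of the real features. The feature counts are a subset of those listed above so that they fit the toy data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix (assumption: binary outcome)
X, y = make_classification(n_samples=300, n_features=200, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

selected = {}
for k in [5, 10, 25, 50, 100]:
    # Fit the selector on the training split only, so that the held-out
    # data plays no part in choosing the features
    selector = SelectKBest(score_func=f_classif, k=k).fit(X_train, y_train)
    selected[k] = selector.get_support(indices=True)

# Apply the column indices chosen on the training split to the held-out data
X_test_top5 = X_test[:, selected[5]]
```

Storing column indices rather than transformed arrays keeps one small lookup per feature count, which can be reused for every algorithm.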

## Hyperparameters
Using 10-fold cross-validation (CV), tune each model's hyperparameters to maximize AUC and F1 for each outcome. Apply random undersampling to address class imbalance.
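
A minimal sketch of this tuning loop follows, under several assumptions: the data are synthetic, the model is logistic regression with a hypothetical grid over `C`, and undersampling is done by a hand-written `undersample` helper applied to the training folds only. The actual tuning script may differ on all of these points.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold

def undersample(X, y, rng):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]

# Synthetic imbalanced data (assumption: roughly 10% positive class)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for C in [0.01, 0.1, 1.0]:  # hypothetical grid, for illustration only
    aucs, f1s = [], []
    for train_idx, val_idx in cv.split(X, y):
        # Balance the training fold; the validation fold keeps its natural class ratio
        X_bal, y_bal = undersample(X[train_idx], y[train_idx], rng)
        model = LogisticRegression(C=C, max_iter=1000).fit(X_bal, y_bal)
        prob = model.predict_proba(X[val_idx])[:, 1]
        aucs.append(roc_auc_score(y[val_idx], prob))
        f1s.append(f1_score(y[val_idx], model.predict(X[val_idx])))
    results[C] = {'AUC_val': np.mean(aucs), 'F1_val': np.mean(f1s)}
```

Undersampling inside each fold, rather than once before splitting, means the validation metrics are computed at the original class balance.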


In [1]:
import pandas as pd
import pickle

In [2]:
# Combining tuning results for each outcome and dataset

dataset = ['baseline', 'grouped']
algorithms = ['lr','rf','svm','xgb']
feature_num = {'baseline': [5,10,25,50,100,500,1000,5000,9559], 
               'grouped': [5,10,25,50,100,500,805]}

# Columns to keep, in output order
column_names = ['number_of_features','algorithm','Parameter_combo','Acc_val','Acc_rank','F1_val',
                'F1_rank','AUC_val','AUC_rank','Acc_train','F1_train','AUC_train']

for ds in dataset:
    frames = []

    for num in feature_num[ds]:
        for a in algorithms:
            # Load the tuning results for one (dataset, algorithm, feature count) run
            temp = pd.read_csv('../../results/tuning/individual_results/%s/SSI_%s_%d.csv' % (ds, a, num))
            temp = temp.drop(columns='Unnamed: 0')
            temp['number_of_features'] = num
            temp['algorithm'] = a
            frames.append(temp)

    # Concatenate once instead of growing a DataFrame inside the loop
    results_combined = pd.concat(frames)[column_names].reset_index(drop=True)

    # Re-rank across all algorithms and feature counts (rank 1 = best)
    for metric in ['Acc', 'F1', 'AUC']:
        results_combined['%s_rank' % metric] = results_combined['%s_val' % metric].rank(method='min', ascending=False)

    filename = '../../results/tuning/SSI_%s_tuning_results.csv' % ds
    results_combined.to_csv(filename, index=False)

Manually choose the optimal parameters for each outcome from the combined results.
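
One way to support the manual choice is to shortlist the top candidates before inspecting them by hand. The sketch below assumes a hypothetical helper `shortlist` and a table with the same column names as the combined results. The rows in `toy` are placeholders made up for illustration, not real tuning results.

```python
import pandas as pd

def shortlist(results, top_n=3):
    """Return the top_n rows by validation AUC, breaking ties by validation F1."""
    ordered = results.sort_values(['AUC_val', 'F1_val'], ascending=False)
    cols = ['number_of_features', 'algorithm', 'Parameter_combo',
            'AUC_val', 'F1_val', 'AUC_train']
    return ordered[cols].head(top_n).reset_index(drop=True)

# Placeholder rows for illustration only (not real results)
toy = pd.DataFrame({
    'number_of_features': [5, 10, 25, 50],
    'algorithm': ['lr', 'rf', 'svm', 'xgb'],
    'Parameter_combo': ['a', 'b', 'c', 'd'],
    'AUC_val': [0.70, 0.80, 0.80, 0.75],
    'F1_val': [0.40, 0.45, 0.50, 0.42],
    'AUC_train': [0.72, 0.95, 0.83, 0.78],
})

top = shortlist(toy, top_n=2)
print(top)
```

Keeping `AUC_train` next to `AUC_val` in the shortlist makes a large gap between training and validation scores easy to spot during the manual review.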