# Creating an ensemble from saved pickled models
This notebook shows how to extract the prediction scores from the pickled ".model" files produced by the main program, and how these scores are used to score an ensemble. The ensemble can also be plotted with the functions shown. 

In [None]:
# Load packages
import pickle
import numpy as np
import pandas as pd
import ML_scripts.ensemble_scoring as ens

## Extracting the prediction scores and target values for a model

In [None]:
# Load the data pickle and extract the target values
with open('data/load_data.pickle', 'rb') as infile: 
    scores = pickle.load(infile)['target']

print('Loaded target:', scores.shape)
unique, counts = np.unique(scores.y, return_counts=True)
print('Class balance:', dict(zip(unique, counts)))

The example code below shows how to extract the predictions from the model output files for the random states selected in this study. The actual output files are not included in this GitHub repository. 

In [None]:
random_states = [654, 114, 25, 759, 281, 250, 228, 142, 754, 104, 692, 758, 913, 558, 89, 604, 432, 32, 30, 95, 223, 238, 517, 616, 27, 574, 203, 733, 665, 718, 429, 225, 459, 603, 284, 828, 890, 6, 777, 825, 163, 714, 348, 159, 220, 980, 781, 344, 94, 389]


In [None]:
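# Set the path of the directory containing the pickled model files.
# Include the trailing separator, since file names are appended directly to this string.
path = 'Insert your path here'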

# Iterate over selected random states
for rs in random_states: 
    # Set file name. The file name includes the labels given to the data saved in load_data.pickle. 
    file = f'rf.diet_16s_var250.rs{rs}.model'  # This is an example of a filename generated by the main program. 
    
    # Open the pickled file and load model_data
    with open(path+file, 'rb') as infile: 
        model_data = list(pickle.load(infile))
        
    (X_crop, y), (train, test), (fpr,tpr), ids_test, (preds, labels), thresholds, models = model_data

    # Build the column header from the data label and random state, e.g. 'diet_16s_var250.rs654'
    col_name = '.'.join(file.split('.')[1:3])

    # Save predictions as pandas dataframe
    predictions = pd.DataFrame(data=preds, index=ids_test, columns=[col_name])
    
    # Concatenate with scores
    scores = pd.concat([scores, predictions], axis=1, sort=True)
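
The concatenation step above aligns each model's predictions with the target by sample ID. The sketch below illustrates this on toy data. The sample IDs, target values and prediction scores are made up for illustration, and it assumes that each model may have been scored on only a subset of the samples, in which case the remaining rows are filled with `NaN`.

```python
import pandas as pd

# Toy target frame indexed by sample ID (stands in for the loaded target)
target = pd.DataFrame({'y': [0, 1, 1, 0]}, index=['s1', 's2', 's3', 's4'])

# Two hypothetical models, each scored on a different set of test samples
preds_a = pd.DataFrame({'diet_16s_var250.rs654': [0.2, 0.9]}, index=['s1', 's3'])
preds_b = pd.DataFrame({'diet_16s_var250.rs114': [0.7, 0.4]}, index=['s2', 's3'])

# Same pattern as in the loop: add one column per model, aligned on the index
toy_scores = target
for predictions in (preds_a, preds_b):
    toy_scores = pd.concat([toy_scores, predictions], axis=1, sort=True)

# Samples that a model did not predict are NaN in that model's column
print(toy_scores)
```

Because the alignment is an outer join on the index, no sample is dropped, and any missing predictions have to be handled by the scoring step.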


## Scoring an ensemble
The ensemble is scored using four different methods:
- Mean of prediction scores
- Majority voting
- Mean of confident prediction scores
- Majority voting on confident prediction scores
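
As a rough illustration of the four methods listed above, the following sketch computes each of them on a toy table of prediction scores. It is not the code in `ML_scripts/ensemble_scoring.py`. The toy values, the variable names and the definition of a "confident" prediction (a score at least `conf` or at most `1 - conf`) are assumptions made for this example.

```python
import pandas as pd

# Hypothetical prediction scores: rows are samples, columns are models
preds = pd.DataFrame(
    {'m1': [0.90, 0.20, 0.55, 0.45],
     'm2': [0.80, 0.35, 0.70, 0.30],
     'm3': [0.30, 0.10, 0.45, 0.65]},
    index=['s1', 's2', 's3', 's4'])

threshold = 0.5  # class boundary
conf = 0.6       # assumed confidence level

# 1. Mean of prediction scores
mean_score = preds.mean(axis=1)

# 2. Majority voting: fraction of models voting for the positive class
votes = (preds > threshold).astype(float).where(preds.notna())
majority = votes.mean(axis=1)

# Keep only confident predictions (assumed rule); the rest become NaN
confident = preds.where((preds >= conf) | (preds <= 1 - conf))

# 3. Mean of confident prediction scores (NaN values are skipped)
conf_mean = confident.mean(axis=1)

# 4. Majority voting on confident prediction scores
conf_votes = (confident > threshold).astype(float).where(confident.notna())
conf_majority = conf_votes.mean(axis=1)

toy_ensemble = pd.DataFrame({'mean': mean_score,
                             'majority': majority,
                             'conf_mean': conf_mean,
                             'conf_majority': conf_majority})
print(toy_ensemble)
```

In this toy table, sample `s3` has only one confident prediction (0.70), so both confident variants rely on that single model, while the plain mean and majority vote use all three.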

The function ```score_ensemble()```, located in ```ML_scripts/ensemble_scoring.py```, scores predictions as an ensemble when given a pandas dataframe containing the target and the prediction scores of the selected models. Its arguments are described below: 
- ```scores``` is a required argument which is a pandas dataframe containing a column with the target values and columns with prediction scores. 
- ```true_col``` is a string naming the dataframe column which contains the target values. Default is ```true_col='y'```. 
- ```drop``` names the columns to leave out if any. Default is ```drop=False```, meaning all columns are included in the ensemble scoring. 
- ```min_conf``` and ```max_conf``` set the interval of confidence thresholds to try, with step size ```step```, when scoring the ensemble with only the confident prediction scores. Defaults are ```min_conf=0.6```, ```max_conf=0.99``` and ```step=0.05```. 
- ```threshold``` sets the threshold value for separating the classes. Default is ```threshold=0.5```. 
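
To give a feel for the confidence arguments described above, the sketch below builds one possible grid of confidence levels from ```min_conf```, ```max_conf``` and ```step```, and reports how many predictions remain at each level. How ```score_ensemble()``` actually enumerates the levels (for example, whether the upper bound is included) is not shown here, so the grid construction, the confidence rule and the toy scores are all assumptions.

```python
import numpy as np
import pandas as pd

min_conf, max_conf, step = 0.6, 0.99, 0.05

# Assumed grid: start at min_conf and advance by step without exceeding max_conf
n_levels = int(np.floor((max_conf - min_conf) / step + 1e-9)) + 1
conf_levels = np.round(min_conf + step * np.arange(n_levels), 2)

# Hypothetical prediction scores from a single model
preds = pd.Series([0.97, 0.81, 0.62, 0.55, 0.38, 0.12, 0.03])

# Fraction of predictions counted as confident at each level
# (assumed rule: score >= conf or score <= 1 - conf)
retained = pd.Series(
    [((preds >= conf) | (preds <= 1 - conf)).mean() for conf in conf_levels],
    index=conf_levels, name='fraction_retained')

print(retained)
```

Higher confidence levels keep fewer predictions, so the confident scoring methods trade coverage for certainty as the level increases.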

The function returns two pandas dataframes: 
- The first dataframe contains the performance metrics as rows and the scoring methods as columns. 
- The second dataframe contains the sample-wise ensemble predictions, with one row per sample and one column per scoring method. 

In [None]:
# Score the ensemble using scoring function
ens_perf, ens_scores = ens.score_ensemble(scores=scores, true_col='y', min_conf=0.6, max_conf=0.9, step=0.05, verbose=False)

# Save performances
ens_perf.T.to_csv(path+'ensemble_perfs.csv', index_label='scoring')

# Save ensemble predictions
ens_scores.to_csv(path+'ensemble_scores.csv', index_label='samples')

In [None]:
# Print performances of different ensemble scoring methods
ens_perf.T # Transposed so that metrics are shown as columns and scoring methods as rows