# Reporting Applied Prediction Probabilities - Replication Data (for AutoMLPipe-BC)
The pipeline outputs pickled objects containing all metric results, both from the initial testing evaluation of trained models and from the later application of those models to additional hold-out replication data (i.e. apply data).

This notebook extracts the prediction probabilities for the replication data and writes them to a .csv file for each algorithm and CV partition pair (i.e. for each of the CV-trained models). Unlike the initial testing evaluation, prediction probabilities are reported here for all instances in the replication data rather than only for those in the 'testing' hold-out subsets.

Each file contains the instance's true outcome value, the unique instance ID, and the predicted probability that the instance is a case (class 1).
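
Once the files have been written, they can be read back for further analysis. The following is a minimal sketch, not part of the pipeline, that stacks the per-CV files for one algorithm into a single table. It assumes the file naming used by the export cell in this notebook (`<algorithm>_CV_<cv>_case_probas.csv`); the function name `load_case_probas` and the added `cv` column are hypothetical helpers introduced only for illustration.

```python
import os
import pandas as pd

def load_case_probas(probas_dir, algorithm, cv_partitions):
    """Stack the per-CV probability files for one algorithm into one DataFrame."""
    frames = []
    for cv in range(cv_partitions):
        # File naming follows the export cell: <algorithm>_CV_<cv>_case_probas.csv
        path = os.path.join(probas_dir, f"{algorithm}_CV_{cv}_case_probas.csv")
        frame = pd.read_csv(path)
        # Hypothetical helper column recording which CV model produced each row
        frame["cv"] = cv
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

For example, `load_case_probas(new_full_path + '/model_evaluation/prediction_probas', 'Naive Bayes', cv_partitions)` would return one row per instance per CV model, which makes it straightforward to compare how the CV models score the same instance.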

This code is set up to run on one pair of an original dataset and its replication dataset at a time.
 

## Import Packages

In [1]:
import os
import pandas as pd
import pickle
import numpy as np
from statistics import mean
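from scipy import stats #scipy.interp is no longer available in recent SciPy releases; use numpy.interp if interpolation is needed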
import warnings
warnings.filterwarnings('ignore')

# Jupyter Notebook Hack: This code ensures that the results of multiple commands within a given cell are all displayed, rather than just the last. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Set Run Parameters

In [2]:
#experiment_path = "C:/Users/ryanu/Documents/Analysis/AutoMLPipe_Experiments/hcc_demo"
experiment_path = "C:/Users/ryanu/Documents/Analysis/SAGIC_TimVar_HRV2"
dataname = 'TimVar_HRV_data_train_forML' #name of target dataset folder in experiment output folder from pipeline
rep_data_path = "C:/Users/ryanu/Documents/Analysis/SAGIC_TimVar_HRV2/TimVar_HRV_data_test_forML.csv" #path to replication dataset file (needed to grab instance labels and true class values)
algorithms = [] #use an empty list to re-evaluate all modeling algorithms that were run in the pipeline

#available_algorithms = ['Naive Bayes','Logistic Regression','Decision Tree','Random Forest','Gradient Boosting','XGB','LGB','SVM','ANN','K Neighbors','eLCS','XCS','ExSTraCS']

## Load Other Necessary Parameters

In [3]:
metadata = pd.read_csv(experiment_path + '/' + 'metadata.csv').values
class_label = metadata[0, 1]
instance_label = metadata[1, 1]
cv_partitions = int(metadata[6,1])
do_NB = metadata[20,1]
do_LR = metadata[21,1]
do_DT = metadata[22,1]
do_RF = metadata[23,1]
do_GB = metadata[24, 1]
do_XGB = metadata[25,1]
do_LGB = metadata[26,1]
do_SVM = metadata[27,1]
do_ANN = metadata[28,1]
do_KN = metadata[29, 1]
do_eLCS = metadata[30,1]
do_XCS = metadata[31,1]
do_ExSTraCS = metadata[32,1]

possible_algos = ['Naive Bayes','Logistic Regression','Decision Tree','Random Forest','Gradient Boosting','XGB','LGB','SVM','ANN','K Neighbors','eLCS','XCS','ExSTraCS']
abbrev = {'Naive Bayes':'NB','Logistic Regression':'LR','Decision Tree':'DT','Random Forest':'RF','Gradient Boosting':'GB','XGB':'XGB','LGB':'LGB','SVM':'SVM','ANN':'ANN','K Neighbors':'KN','eLCS':'eLCS','XCS':'XCS','ExSTraCS':'ExSTraCS'}

#Create algorithms list (i.e. modeling algorithms that were run in the pipeline).
#Only done when the user left 'algorithms' empty, so a user-specified list is not overwritten or duplicated.
if not algorithms:
    #Flags are listed in the same order as possible_algos
    run_flags = [do_NB, do_LR, do_DT, do_RF, do_GB, do_XGB, do_LGB, do_SVM, do_ANN, do_KN, do_eLCS, do_XCS, do_ExSTraCS]
    #Flags are stored in metadata.csv as the strings 'True'/'False'; compare directly instead of calling eval()
    algorithms = [algo for algo, flag in zip(possible_algos, run_flags) if str(flag).strip() == 'True']
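
The run flags read from metadata.csv arrive as text rather than as Python booleans. Below is a minimal sketch of a stricter way to convert such a value, assuming each flag is stored as the string `'True'` or `'False'`. The helper name `parse_flag` is hypothetical and is not part of AutoMLPipe-BC; it uses `ast.literal_eval`, which only accepts Python literals and so cannot execute arbitrary code from the file.

```python
import ast

def parse_flag(value):
    """Convert a 'True'/'False' string (or a real bool) into a bool."""
    if isinstance(value, bool):
        return value
    # literal_eval only parses Python literals; anything else raises an error
    parsed = ast.literal_eval(str(value).strip())
    if not isinstance(parsed, bool):
        raise ValueError(f"Expected 'True' or 'False', got {value!r}")
    return parsed
```

A call such as `parse_flag(do_NB)` could then stand in wherever a flag needs to be tested, and an unexpected entry in metadata.csv would raise an error instead of being silently interpreted.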

## Extract and Report Case (i.e. class 1) Prediction Probabilities for All Instances in the Replication Dataset Applied to All CV Models

In [4]:

full_path = experiment_path+'/'+dataname
apply_name = rep_data_path.split('/')[-1].split('.')[0]
new_full_path = full_path+'/applymodel/'+apply_name
        
#Make folder in experiment folder/datafolder to store all prediction probabilities per algorithm/CV combination
#(makedirs with exist_ok=True also creates missing parent folders and does nothing if the folder already exists)
os.makedirs(new_full_path+'/model_evaluation/prediction_probas', exist_ok=True)

for algorithm in algorithms: #loop through algorithms
    print(algorithm)

    for cvCount in range(0,cv_partitions): #loop through cv's
        print(cvCount)
        #Load pickled metric file for given algorithm and cv
        result_file = new_full_path+'/model_evaluation/pickled_metrics/'+abbrev[algorithm]+"_CV_"+str(cvCount)+"_metrics"
        with open(result_file, 'rb') as file: #file is closed automatically, even if loading fails
            results = pickle.load(file)

        #Load target replication dataset (from which we get the instance label values and class outcome values)
        rep_data = pd.read_csv(rep_data_path)
        probas_summary = rep_data[[class_label,instance_label]].copy() #copy so the new column is not assigned to a slice of rep_data

        #Separate pickled results: results[8] is used here as the array of predicted probabilities, with column 1 being class 1
        probas_ = results[8]
        print(probas_[:,1])
        probas_summary['1_prob'] = probas_[:,1]
        file_name = new_full_path+'/model_evaluation/prediction_probas/'+algorithm+'_CV_'+str(cvCount)+'_case_probas.csv'
        probas_summary.to_csv(file_name, index=False)

Naive Bayes
0
[1.26214625e-16 9.86698259e-01 7.94017903e-01 ... 9.97641547e-01
 9.99468627e-01 9.99604514e-01]
1
[7.47769095e-17 9.83903486e-01 7.63744162e-01 ... 9.96999758e-01
 9.99304012e-01 9.99496535e-01]
2
[1.19137736e-16 9.81424505e-01 7.36336401e-01 ... 9.96891521e-01
 9.99303924e-01 9.99499146e-01]
3
[1.98742662e-17 9.51719944e-01 5.03861063e-01 ... 9.90823820e-01
 9.96536049e-01 9.97858487e-01]
4
[1.09728489e-16 9.79329575e-01 7.11388694e-01 ... 9.96438525e-01
 9.99196701e-01 9.99403598e-01]
5
[4.97199684e-17 9.81425272e-01 7.37474779e-01 ... 9.96441511e-01
 9.99165659e-01 9.99409488e-01]
6
[3.42793238e-17 9.76937475e-01 6.86712303e-01 ... 9.95879892e-01
 9.99063445e-01 9.99317817e-01]
7
[1.47192164e-16 9.87420762e-01 8.05888407e-01 ... 9.97738439e-01
 9.99495464e-01 9.99618010e-01]
8
[1.05207666e-16 9.81109722e-01 7.33554527e-01 ... 9.96881451e-01
 9.99310165e-01 9.99484993e-01]
9
[4.33687265e-17 9.79340026e-01 7.12892729e-01 ... 9.96975924e-01
 9.99355606e-01 9.99539414e-01