# Reporting Applied Prediction Probabilities on Replication Data (for STREAMLINE)
The pipeline outputs pickled objects containing all metric results, both from the initial testing evaluation of trained models and from the subsequent application of those models to additional hold-out replication data (i.e. apply data).

This notebook extracts the prediction probabilities for the replication data and writes them to .csv files, one for each algorithm and CV partition pair (i.e. one for each CV-trained model). Unlike the initial testing evaluation, prediction probabilities are reported here for all instances in the replication data rather than only for those in the 'testing' hold-out subsets.

Each file contains the instance's true outcome value, the unique instance ID, and the predicted probability that the instance is a case (class 1).
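As a rough illustration of the file layout described above, the sketch below reads a small made-up file and summarizes it. The column names `Class` and `InstanceID` and all values are hypothetical placeholders; the real names come from the dataset's class and instance labels, and only the `1_prob` column name is taken from this notebook.

```python
import io
import pandas as pd

# Hypothetical file contents: 'Class' and 'InstanceID' stand in for the
# dataset's real class label and instance label column names.
example_csv = io.StringIO(
    "Class,InstanceID,1_prob\n"
    "0,a1,0.02\n"
    "1,a2,0.91\n"
    "0,a3,0.40\n"
)
probas = pd.read_csv(example_csv)

# Column order mirrors the description: true outcome, instance ID, case probability
print(list(probas.columns))

# Mean predicted case probability within each true class
print(probas.groupby("Class")["1_prob"].mean())
```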

This code is set up to run on one pair of datasets at a time: an original dataset and its paired replication dataset.
 

## Import Packages

In [1]:
import os
import pandas as pd
import pickle
import numpy as np
from statistics import mean
from scipy import stats #scipy.interp is deprecated (use numpy.interp) and is not used in this notebook
import warnings
warnings.filterwarnings('ignore')

# Jupyter Notebook Hack: This code ensures that the results of multiple commands within a given cell are all displayed, rather than just the last. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Set Run Parameters

In [2]:
experiment_path = "C:/Users/ryanu/Documents/Analysis/STREAMLINE_Experiments/hcc_demo"
dataname = 'hcc-data_example' #name of the target dataset folder in the experiment output folder from the pipeline
rep_data_path = "C:/Users/ryanu/OneDrive/Documents/GitHub/STREAMLINE/DemoRepData/hcc-data_example_rep.csv" #path to the replication dataset file (needed to grab instance labels and true class values)
algorithms = [] #use an empty list to re-evaluate all modeling algorithms that were run in the pipeline

#available_algorithms = ['Naive Bayes','Logistic Regression','Decision Tree','Random Forest','Gradient Boosting','XGB','LGB','SVM','ANN','K Neighbors','eLCS','XCS','ExSTraCS']
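
The replication file name set above is later used to name the output folder under `applymodel`. The sketch below shows one way to derive that name, assuming a simple file name with a single extension; the path shown is a hypothetical placeholder. It uses `os.path` rather than the string splitting used later in this notebook, and the two approaches differ for names containing extra dots (for example, `a.b.csv` gives `a.b` here but `a` with `split('.')[0]`).

```python
import os

# Hypothetical path, used only for illustration
rep_data_path = "data/hcc-data_example_rep.csv"

# Strip the directory and the final extension to get the apply-data name
apply_name = os.path.splitext(os.path.basename(rep_data_path))[0]
print(apply_name)
```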

## Load Other Necessary Parameters

In [3]:
jupyterRun = 'True'
#Unpickle metadata from previous phase
with open(experiment_path+'/'+"metadata.pickle", 'rb') as file:
    metadata = pickle.load(file)
#Load variables specified earlier in the pipeline from metadata
class_label = metadata['Class Label']
instance_label = metadata['Instance Label']
cv_partitions = int(metadata['CV Partitions'])

do_NB = metadata['Naive Bayes']
do_LR = metadata['Logistic Regression']
do_DT = metadata['Decision Tree']
do_RF = metadata['Random Forest']
do_GB = metadata['Gradient Boosting']
do_XGB = metadata['Extreme Gradient Boosting']
do_LGB = metadata['Light Gradient Boosting']
do_SVM = metadata['Support Vector Machine']
do_ANN = metadata['Artificial Neural Network']
do_KNN = metadata['K-Nearest Neightbors']
do_eLCS = metadata['eLCS']
do_XCS = metadata['XCS']
do_ExSTraCS = metadata['ExSTraCS']

#Unpickle algorithm information from previous phase
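with open(experiment_path+'/'+"algInfo.pickle", 'rb') as file:
    algInfo = pickle.load(file)
#Note: the list below is rebuilt from algInfo, replacing the 'algorithms' value set in the run parameters
algorithms = []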
abbrev = {}
colors = {}
for key in algInfo:
    if algInfo[key][0]: # If that algorithm was used
        algorithms.append(key)
        abbrev[key] = (algInfo[key][1])
        colors[key] = (algInfo[key][2])
        
print(algorithms)

['Naive Bayes', 'Logistic Regression', 'Decision Tree']
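
The loop above reads three fields from each `algInfo` entry: whether the algorithm was run, its abbreviation, and its plot color. A minimal sketch of that structure follows; the abbreviations, colors, and run flags shown are hypothetical values chosen for illustration, not the contents of a real `algInfo.pickle`.

```python
# Hypothetical algInfo contents: name -> [was_run, abbreviation, plot color]
algInfo = {
    "Naive Bayes": [True, "NB", "silver"],
    "Logistic Regression": [True, "LR", "dimgrey"],
    "Random Forest": [False, "RF", "blue"],
}

# Keep only the algorithms that were actually run
algorithms = [name for name, info in algInfo.items() if info[0]]
abbrev = {name: algInfo[name][1] for name in algorithms}
colors = {name: algInfo[name][2] for name in algorithms}

print(algorithms)
print(abbrev)
```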


## Extract and Report Case (i.e. class 1) Prediction Probabilities for All Instances in the Replication Dataset Applied to All CV Models
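
The extraction step in this section can be sketched on toy data as follows. This assumes the stored probabilities form an array with one row per instance and two columns, where column 1 holds the class 1 probability (as the code in this section uses it) and column 0 holds the class 0 probability (an assumption). All column names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical probabilities for three instances: [P(class 0), P(class 1)]
probas_ = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

# Hypothetical replication data; column names are placeholders
rep_data = pd.DataFrame({
    "Class": [0, 1, 1],
    "InstanceID": ["a1", "a2", "a3"],
    "Feature": [3.2, 1.7, 4.4],
})

# Keep only the outcome and ID columns, copying so the new column
# is added to an independent DataFrame rather than a slice
summary = rep_data[["Class", "InstanceID"]].copy()
summary["1_prob"] = probas_[:, 1]
print(summary)
```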

In [4]:

full_path = experiment_path+'/'+dataname
apply_name = rep_data_path.split('/')[-1].split('.')[0]
new_full_path = full_path+'/applymodel/'+apply_name
        
#Make folder in experiment folder/datafolder to store all prediction probabilities per algorithm/CV combination
os.makedirs(new_full_path+'/model_evaluation/prediction_probas', exist_ok=True) #also creates missing parent folders

for algorithm in algorithms: #loop through algorithms
    print(algorithm)

    for cvCount in range(0,cv_partitions): #loop through cv's
        print(cvCount)
        #Load pickled metric file for given algorithm and cv
        result_file = new_full_path+'/model_evaluation/pickled_metrics/'+abbrev[algorithm]+"_CV_"+str(cvCount)+"_metrics.pickle"
        with open(result_file, 'rb') as file:
            results = pickle.load(file)

        #Load target replication dataset (From which we will get the instancelabel values and class outcome values.)
        rep_data = pd.read_csv(rep_data_path)
        probas_summary = rep_data[[class_label,instance_label]].copy() #copy so the new column is not written to a slice of rep_data

        #Extract the prediction probabilities (index 8 of the pickled results); column 1 is the class 1 probability
        probas_ = results[8]
        print(probas_[:,1])
        probas_summary['1_prob']=probas_[:,1]
        file_name = new_full_path+'/model_evaluation/prediction_probas/'+algorithm+'_CV_'+str(cvCount)+'_case_probas.csv'
        probas_summary.to_csv(file_name, index=False)

Naive Bayes
0
[1.09257389e-009 1.44765505e-008 8.19797426e-004 1.38922634e-006
 1.28362058e-009 8.81921954e-008 1.31797473e-002 9.87921151e-006
 7.57140092e-002 3.42083943e-008 7.48678006e-010 2.24190572e-017
 1.77195566e-009 1.66671377e-008 1.15454744e-006 1.23136900e-006
 5.55180462e-004 3.30121051e-004 5.05694361e-007 5.63973292e-001
 1.31721054e-008 3.89146423e-010 9.64182204e-001 9.99999908e-001
 1.22252199e-002 5.55113459e-007 2.84425885e-008 8.26280853e-012
 1.25580595e-006 8.97663906e-004 2.03324052e-015 7.02479217e-006
 2.64285743e-008 2.59643575e-007 6.80581699e-011 1.54767344e-008
 2.61739986e-003 3.34779693e-006 1.55873788e-009 9.17246656e-012
 1.27951599e-009 7.02397367e-013 2.72632056e-005 2.05690818e-007
 1.60854288e-009 1.81575652e-007 2.28711867e-008 3.10400066e-002
 9.72423110e-001 9.91962484e-010 3.28664979e-002 1.00446388e-005
 7.76496235e-008 4.91032709e-009 1.60498389e-005 5.95975882e-002
 9.86960753e-001 1.00063523e-007 4.31596246e-004 2.34244769e-007
 3.14004881