# Useful Notebook: Report Replication Data Prediction Probabilities
**This notebook generates model prediction probabilities (for class 1) for the instances of a given replication dataset.**

*This notebook is designed to run after having run STREAMLINE (at least phases 1-6 and phase 8 - replication) and will use the files from a specific STREAMLINE experiment folder, as well as save new output files to that same folder.*

***
## Notebook Details
STREAMLINE saves pickled objects containing all metric results, both from the initial testing evaluation of trained models and from applying those trained models to additional hold-out replication data. These pickled objects also hold the models' prediction probabilities.

This notebook extracts the prediction probabilities for a specific replication dataset and writes them to `.csv` files, one for each algorithm and CV partition pair (i.e., one for each of the CV-trained models).
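As a rough illustration of what that extraction involves, the sketch below loads one pickled metrics file and returns the class 1 column of its probability array. The helper name `class1_probas_from_pickle` is hypothetical, and the default position `9` is simply the index this notebook's final code cell reads from; treat it as an assumption that may not hold for other STREAMLINE versions.

```python
import pickle

import numpy as np


def class1_probas_from_pickle(result_file, probas_index=9):
    """Return the class 1 probabilities stored in one pickled metrics file.

    probas_index=9 mirrors the index used in this notebook's last code cell;
    it is an assumption about the pickled layout, not a documented guarantee.
    """
    with open(result_file, "rb") as f:
        results = pickle.load(f)
    # Expected shape: (n_instances, 2); column 1 is the class 1 probability.
    probas = np.asarray(results[probas_index])
    return probas[:, 1]
```

Only unpickle files that you generated yourself or otherwise trust, since loading a pickle can execute arbitrary code.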

When run, the last code cell creates a new folder (`prediction_probas`) inside the pipeline's experiment output folder, under `/replication/[REPDATANAME]/model_evaluation` of the `dataname` specified below. The class 1 prediction probabilities are written there as one `.csv` file per algorithm and CV partition pair. Each file contains the instance's true outcome value, its unique instance ID, and the predicted probability that the instance belongs to class 1 (which typically encodes cases or the less frequent class).

* *This code runs on one pairing of an original dataset and its replication dataset at a time.*
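Once the files exist, one possible way to use them is to gather the per-CV outputs for a single algorithm into one table and average the probabilities per instance. This is a minimal sketch, not part of STREAMLINE: the function name `collect_class1_probas` is hypothetical, the file name pattern and the `1_prob` column are taken from the last code cell of this notebook, and it assumes instance IDs are unique so the files can be merged on them.

```python
import os

import pandas as pd


def collect_class1_probas(probas_dir, algorithm, cv_partitions, instance_label):
    """Merge the per-CV class 1 probability files of one algorithm.

    Returns one row per instance with a column per CV model (CV_0, CV_1, ...)
    plus their mean. Assumes instance_label values are unique.
    """
    combined = None
    for cv in range(cv_partitions):
        path = os.path.join(probas_dir, f"{algorithm}_CV_{cv}_class1_probas.csv")
        df = pd.read_csv(path).rename(columns={"1_prob": f"CV_{cv}"})
        if combined is None:
            combined = df  # keeps the class and instance ID columns
        else:
            combined = combined.merge(df[[instance_label, f"CV_{cv}"]], on=instance_label)
    cv_cols = [f"CV_{cv}" for cv in range(cv_partitions)]
    combined["mean_1_prob"] = combined[cv_cols].mean(axis=1)
    return combined
```

Averaging across CV models is only one summary choice; the per-model columns are kept so that other summaries (for example, a median or a vote) remain possible.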
 

***
## Notebook Run Parameters
* This notebook has been set up to run 'as-is' on the experiment folder generated when running the STREAMLINE demo in any mode (provided no run parameters were changed).
* If you have run STREAMLINE on different target data or saved the experiment to some other folder outside of STREAMLINE, you need to edit `experiment_path` below to point to the respective experiment folder.
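Before running the remaining cells against a custom folder, it can help to confirm that the folder has the layout this notebook reads from. The sketch below is a hypothetical helper (`missing_experiment_paths` is not part of STREAMLINE); the paths it checks are the ones opened by the code cells in this notebook, using the `dataname` and `rep_dataname` values set in the parameter cell.

```python
import os


def missing_experiment_paths(experiment_path, dataname, rep_dataname):
    """List the files and folders this notebook needs that do not exist."""
    rep_dir = os.path.join(experiment_path, dataname, "replication", rep_dataname)
    expected = [
        os.path.join(experiment_path, "metadata.pickle"),
        os.path.join(experiment_path, "algInfo.pickle"),
        os.path.join(rep_dir, "model_evaluation", "pickled_metrics"),
        os.path.join(rep_dir, rep_dataname + "_Processed.csv"),
    ]
    return [path for path in expected if not os.path.exists(path)]
```

An empty list suggests the folder is ready; any returned path points to a pipeline phase that may not have been run or to a mistyped parameter.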

In [1]:
experiment_path = "../DemoOutput/demo_experiment"  # path to the target experiment folder
dataname = 'hcc_data_custom'  # name of the target dataset folder in the experiment output folder
rep_dataname = "hcc_data_custom_rep"  # name of the replication dataset (its processed file supplies instance labels and true class values)
algorithms = []  # leave empty to process every modeling algorithm run in the pipeline; otherwise give a list of algorithm name strings

***
## Housekeeping
### Import Packages

In [2]:
import os
import pandas as pd
import pickle
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Jupyter Notebook Hack: This code ensures that the results of multiple commands within a given cell are all displayed, rather than just the last. 
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Load Other Necessary Parameters

In [3]:
# Unpickle metadata from previous phase
with open(experiment_path+'/'+"metadata.pickle", 'rb') as file:
    metadata = pickle.load(file)
# Load variables specified earlier in the pipeline from metadata
class_label = metadata['Class Label']
instance_label = metadata['Instance Label']
cv_partitions = int(metadata['CV Partitions'])

# Unpickle algorithm information from previous phase
with open(experiment_path+'/'+"algInfo.pickle", 'rb') as file:
    algInfo = pickle.load(file)
requested = list(algorithms) # algorithms chosen in the run parameters (empty list = all)
algorithms = []
abbrev = {}
colors = {}
for key in algInfo:
    # Keep an algorithm if it was run in the pipeline and was requested (or no subset was requested)
    if algInfo[key][0] and (not requested or key in requested):
        algorithms.append(key)
        abbrev[key] = (algInfo[key][1])
        colors[key] = (algInfo[key][2])
        
print("Algorithms Ran: " + str(algorithms))

Algorithms Ran: ['Decision Tree', 'Logistic Regression', 'Naive Bayes']


## Extract and Output Replication Data Prediction Probabilities 

In [4]:
full_path = experiment_path+'/'+dataname
new_full_path = full_path+'/replication/'+rep_dataname

# Make a folder in the replication model_evaluation folder to store prediction probabilities per algorithm/CV combination
probas_path = new_full_path+'/model_evaluation/prediction_probas'
os.makedirs(probas_path, exist_ok=True)

# Load the processed replication dataset once (source of the instance label and class outcome values)
rep_data = pd.read_csv(new_full_path+'/'+rep_dataname+'_Processed.csv')

for algorithm in algorithms: # loop through algorithms
    print("Algorithm: "+str(algorithm))

    for cvCount in range(0,cv_partitions): # loop through CV partitions
        print("CV: "+str(cvCount))
        # Load pickled metric file for the given algorithm and CV partition
        result_file = new_full_path+'/model_evaluation/pickled_metrics/'+abbrev[algorithm]+"_CV_"+str(cvCount)+"_metrics.pickle"
        with open(result_file, 'rb') as file:
            results = pickle.load(file)

        # Copy the outcome and instance ID columns so a new column can be added without altering rep_data
        probas_summary = rep_data[[class_label,instance_label]].copy()

        # Extract the prediction probabilities (item 9 of the pickled results); column 1 holds class 1
        probas_ = results[9]
        print(probas_[:,1])
        probas_summary['1_prob'] = probas_[:,1]
        file_name = probas_path+'/' + algorithm + '_CV_'+str(cvCount)+'_class1_probas.csv'
        probas_summary.to_csv(file_name, index=False)



Algorithm: Decision Tree
CV: 0
[0.63085938 0.63085938 0.63085938 0.87311178 0.08695652 0.63085938
 0.63085938 0.63085938 0.63085938 0.08695652 0.08695652 0.35051546
 0.35051546 0.08695652 0.63085938 0.08695652 0.63085938 0.08695652
 0.87311178 0.63085938 0.63085938 0.08695652 0.35051546 0.63085938
 0.35051546 0.63085938 0.87311178 0.08695652 0.87311178 0.08695652
 0.87311178 0.08695652 0.35051546 0.08695652 0.08695652 0.08695652
 0.63085938 0.87311178 0.08695652 0.63085938 0.08695652 0.08695652
 0.87311178 0.08695652 0.35051546 0.63085938 0.63085938 0.63085938
 0.87311178 0.08695652 0.87311178 0.35051546 0.08695652 0.35051546
 0.63085938 0.63085938 0.87311178 0.08695652 0.63085938 0.08695652
 0.08695652 0.35051546 0.63085938 0.63085938 0.08695652 0.35051546
 0.63085938 0.08695652 0.63085938 0.63085938 0.63085938 0.63085938
 0.35051546 0.87311178 0.08695652 0.63085938 0.63085938 0.08695652
 0.63085938 0.63085938 0.87311178 0.35051546 0.87311178 0.63085938
 0.35051546 0.87311178 0.630859