## RONIN Walkthrough 
___

This notebook contains a brief introduction to training and evaluating a random forest radar quality control algorithm using Julia. 

In [33]:
##Begin by loading dependencies 
using Pkg
Pkg.activate(".")
Pkg.instantiate() 
##Make sure Julia can see our module 
push!(LOAD_PATH, "./src/")

###Load key functionality 
###This will take a while the first time you do it 
using Ronin

  Activating project at `~/Documents/Grad_School/Research/Ronin`
Precompiling project...
  ✓ Ronin
  1 dependency successfully precompiled in 9 seconds. 352 already precompiled.
  1 dependency precompiled but a different version is currently loaded. Restart julia to access the new version


# 1) Splitting data into training and testing sets
___
We'll begin by partitioning our data into scans used to train the model and scans used to evaluate it. It's important to keep these separate so that the evaluation reflects performance on scans the model has never seen.
<h2><span style="color:Red">WARNING: This function will begin by DELETING the training and testing directories to clean them</span></h2>
It then soft-links the divided files into their respective directories.
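
To make the partitioning idea concrete, here is a minimal sketch of dividing one case's scans into contiguous testing and training blocks. It is an illustration only, not Ronin's implementation: `block_split`, its arguments, and the choice of three testing blocks are all assumptions made for this example.

```julia
# Hypothetical helper (not part of Ronin): split scan indices 1:n_scans into
# alternating contiguous testing and training blocks.
function block_split(n_scans::Int, n_test::Int; n_blocks::Int = 3)
    @assert n_blocks >= 2 "need at least two testing blocks"
    @assert n_blocks <= n_test < n_scans
    n_train = n_scans - n_test
    # Base block sizes plus remainders that are spread over the first blocks
    test_len, test_rem = divrem(n_test, n_blocks)
    train_len, train_rem = divrem(n_train, n_blocks - 1)

    test_idx, train_idx = Int[], Int[]
    pos = 1
    for b in 1:n_blocks
        len = test_len + (b <= test_rem ? 1 : 0)
        append!(test_idx, pos:pos+len-1)
        pos += len
        if b < n_blocks
            # A training block sits between consecutive testing blocks
            gap = train_len + (b <= train_rem ? 1 : 0)
            append!(train_idx, pos:pos+gap-1)
            pos += gap
        end
    end
    return train_idx, test_idx
end

train_idx, test_idx = block_split(482, 89)
println("training scans: ", length(train_idx), ", testing scans: ", length(test_idx))
```

With 482 scans and 89 reserved for testing, this sketch yields testing blocks of 30, 30, and 29 scans separated by training blocks of 197 and 196 scans. The exact index boundaries chosen by `split_training_testing!` may differ.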

In [4]:
###Make sure to use absolute paths here 
##These are EXAMPLES, make sure to edit for your own directory setup
CASE_PATHS= ["/Users/ischluesche/Documents/Grad_School/Research/Ronin/CFRADIALS/CASES/BAMEX", 
             "/Users/ischluesche/Documents/Grad_School/Research/Ronin/CFRADIALS/CASES/HAGUPIT", 
             "/Users/ischluesche/Documents/Grad_School/Research/Ronin/CFRADIALS/CASES/RITA", 
             "/Users/ischluesche/Documents/Grad_School/Research/Ronin/CFRADIALS/CASES/VORTEX"]

TRAINING_PATH = "/Users/ischluesche/Documents/Grad_School/Research/Ronin/CFRADIALS/CASES/TRAINING"
TESTING_PATH = "/Users/ischluesche/Documents/Grad_School/Research/Ronin/CFRADIALS/CASES/TESTING"

split_training_testing!(CASE_PATHS, TRAINING_PATH, TESTING_PATH)


TOTAL NUMBER OF TDR SCANS ACROSS ALL CASES: 1780
TESTING SCANS PER CASE 89
NUMBER OF SCANS IN CASE: 482
TRAINING GROUP SIZE: 196 + REMAINDER: 1
TESTING GROUP SIZE: 29 + REMAINDER 2

 INDEXES 1 TO 31 ASSIGNED TESTING
 INDEXES 31 TO 228 ASSIGNED TRAINING
 INDEXES 229 TO 257 ASSIGNED TESTING
 INDEXES 258 TO 453 ASSIGNED TRAINING
 INDEXES 454 TO 482 ASSIGNED TESTING
Total length of case files: 482
Length of testing files: 89 - 0.18464730290456433 percent
Length of testing_files: 393 - 0.8153526970954357 percent
NUMBER OF SCANS IN CASE: 238
TRAINING GROUP SIZE: 74 + REMAINDER: 1
TESTING GROUP SIZE: 29 + REMAINDER 2

 INDEXES 1 TO 31 ASSIGNED TESTING
 INDEXES 31 TO 106 ASSIGNED TRAINING
 INDEXES 107 TO 135 ASSIGNED TESTING
 INDEXES 136

In [12]:
TRAINING_PATH = "/Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TRAINING/"
TESTING_PATH = "/Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TESTING/"

"/Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TESTING/"

# 2) Configure model
___
We'll now set up a configuration object for use in our model. This structure contains key information and settings such as the number of models to use, the decision thresholds for each model, and locations to output data to.

In [13]:
###Path to file containing variables to calculate/use as features 
config_path = "./NOAA_all_params.txt"
###Number of models to use in the composite object 
num_models = 2

base_name = "raw_model"
base_name_features = "output_features"
###List of paths to output trained models to. Its length should equal num_models 
model_output_paths = [base_name * "_$(i-1).jld2" for i in 1:num_models]
###List of paths to output calculated features to. Its length should equal num_models 
feature_output_paths = [base_name_features * "_$(i-1).h5" for i in 1:num_models]

###Decision threshold for each model in the chain. For each model, 
###if the fraction of decision trees classifying a gate as meteorological is greater than 
###the corresponding threshold in met_probs, the gate will be predicted to be meteorological. 
###Otherwise, it will be predicted to be non-meteorological. 
met_probs = Vector{Float64}([0.1, 0.5])

###Options are "balanced" or "". If "balanced", the decision trees will be trained 
###on a weighted version of the existing classes in order to combat class imbalance 
class_weights = "balanced"

###Path to input cfradial files 
input_path = TRAINING_PATH
###Name of variable in cfradials that has already had interactive QC applied 
QC_var = "VG"

###Name of a variable in cfradials used to mask which gates are predicted upon.
###Gates where this variable is missing will be removed
mask_name = "VEL_QC"

###Name of a variable in input cfradials that has not had postprocessing applied. 
###This variable is used to determine where MISSING gates exist in the scan 
remove_var = "VEL"

###Whether or not the input features for the model have already been calculated 
file_preprocessed = [false, false]


2-element Vector{Bool}:
 0
 0
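
The decision thresholds in `met_probs` can be illustrated with a small, self-contained sketch. The vote matrix below is invented for the example, and `tree_votes` and `met_fraction` are hypothetical names rather than Ronin identifiers. The sketch only shows how a fraction of tree votes is compared against a threshold.

```julia
using Statistics: mean

# Hypothetical vote matrix: each row is one decision tree, each column one radar gate.
# `true` means the tree classified the gate as meteorological.
tree_votes = Bool[1 1 0 0;
                  1 0 0 0;
                  1 1 1 0;
                  1 0 0 0;
                  1 1 0 0]

# Fraction of trees voting "meteorological" for each gate
met_fraction = vec(mean(tree_votes, dims = 1))

# A gate is predicted meteorological when its fraction exceeds the threshold
lenient = met_fraction .> 0.1
strict = met_fraction .> 0.5

println("fractions: ", met_fraction)
println("threshold 0.1 keeps: ", lenient)
println("threshold 0.5 keeps: ", strict)
```

A low threshold such as 0.1 retains any gate that even a small share of trees considers meteorological, while 0.5 requires a majority.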

In [14]:
###Create model config object
config = ModelConfig(num_models = num_models, model_output_paths = model_output_paths, met_probs = met_probs,
                    feature_output_paths = feature_output_paths, input_path = input_path, input_config = config_path,
                    QC_var = QC_var, remove_var = remove_var, QC_mask = false, mask_name = mask_name,
                    VARS_TO_QC = ["VEL"], class_weights = class_weights, HAS_INTERACTIVE_QC = true, file_preprocessed = file_preprocessed)

ModelConfig(2, ["raw_model_0.jld2", "raw_model_1.jld2"], [0.1, 0.5], ["output_features_0.h5", "output_features_1.h5"], "/Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TRAINING/", "./NOAA_all_params.txt", Bool[0, 0], true, true, true, true, "VG", "VEL", false, true, false, "VEL_QC", ["VEL"], "_QC", "balanced")
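
As a rough illustration of what `class_weights = "balanced"` aims for, the sketch below computes weights with one common convention, `n_samples / (n_classes * n_samples_in_class)`. Whether Ronin uses exactly this formula is an assumption. `balanced_weights` is a hypothetical helper written for this example.

```julia
# Hypothetical helper (not part of Ronin): weight each class inversely to its
# frequency so that both classes contribute equally to training.
function balanced_weights(labels::AbstractVector{Bool})
    n = length(labels)
    n_met = count(labels)   # gates labelled meteorological
    n_non = n - n_met       # gates labelled non-meteorological
    @assert n_met > 0 && n_non > 0 "both classes must be present"
    return Dict(true => n / (2 * n_met), false => n / (2 * n_non))
end

# Toy label set: 2 meteorological gates, 8 non-meteorological gates
labels = vcat(trues(2), falses(8))
weights = balanced_weights(labels)
println(weights)
```

Under this convention the rarer class receives the larger weight, so the summed weight of each class is the same.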

# 3) Train a composite model!
___
Now that we have set up our model configuration, we simply invoke the `train_multi_model` function. This will likely take a long time (1hr+), especially when training 2 or more models in a chain. 
<b>Data will be written to the cfradial files during this process.</b>

In [15]:
###Train composite model! 
train_multi_model(config)


CALCULATING FEATURES FOR PASS: 1
ERROR: POTENTIALLY INVALID FILE FORMAT FOR FILE: .tmp_hawkedit
OUTPUTTING DATA IN HDF5 FORMAT TO FILE: output_features_0.h5
Processed /Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TRAINING//cfrad.20220907_125500.003_to_20220907_125503.977_N42RF-TM_AIR.nc in 2.9656119346618652 seconds
Processed /Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TRAINING//cfrad.20220907_125500.499_to_20220907_125504.479_N42RF-TS_AIR.nc in 0.3662230968475342 seconds
Processed /Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TRAINING//cfrad.20220907_125512.642_to_20220907_125516.616_N42RF-TM_AIR.nc in 0.37268805503845215 seconds
Processed /Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TRAINING//cfrad.20220907_125513.138_to_20220907_125517.117_N42RF-TS_AIR.nc in 0.6993668079376221 seconds
Processed /Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TRAINING//cfrad.20220907_125525.280_to_20220907_125529.2

# 4) Verify the efficacy of the model on the testing dataset 
___
We'll begin by setting up another `ModelConfig` struct, but this time substituting the path to the testing data for `input_path`. Note that this example also lowers the second decision threshold in `met_probs` from 0.5 to 0.3.

In [16]:
test_config = ModelConfig(num_models = num_models, model_output_paths = model_output_paths, met_probs = Vector{Float64}([0.1, 0.3]),
                    feature_output_paths = feature_output_paths, input_path = TESTING_PATH, input_config = config_path,
                    QC_var = QC_var, remove_var = remove_var, QC_mask = false, mask_name = mask_name,
                    VARS_TO_QC = ["VEL"], class_weights = class_weights, HAS_INTERACTIVE_QC = true, file_preprocessed = file_preprocessed)

ModelConfig(2, ["raw_model_0.jld2", "raw_model_1.jld2"], [0.1, 0.3], ["output_features_0.h5", "output_features_1.h5"], "/Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TESTING/", "./NOAA_all_params.txt", Bool[0, 0], true, true, true, true, "VG", "VEL", false, true, false, "VEL_QC", ["VEL"], "_QC", "balanced")

## Now, call the `composite_prediction` function

In [32]:
###I recommend setting `write_features_out` to `true` so that predictions
###can be retained for later usage 
predictions, verification, indexers = composite_prediction(test_config, write_features_out=true, feature_outfile="NEW_MODEL_PREDICTIONS_OUT.h5")

ERROR: POTENTIALLY INVALID FILE FORMAT FOR FILE: .tmp_hawkedit
LOADING MODELS....
INPUT_SET NCDatasets.NCDataset{Nothing, Missing}, VAR: StringAlready exists... overwriting

Completed in 1.0272529125213623 seconds

REMOVED 6845 PRESUMED NON-METEORLOGICAL DATAPOINTS
FINAL COUNT OF DATAPOINTS IN VEL: 61392
INPUT_SET NCDatasets.NCDataset{Nothing, Missing}, VAR: StringAlready exists... overwriting

Completed in 0.6780979633331299 seconds

REMOVED 4018 PRESUMED NON-METEORLOGICAL DATAPOINTS
FINAL COUNT OF DATAPOINTS IN VEL: 57374
INPUT_SET NCDatasets.NCDataset{Nothing, Missing}, VAR: StringAlready exists... overwriting

Completed in 0.6792428493499756 seconds

REMOVED 7325 PRESUMED NON-METEORLOGICAL DATAPOINTS
FINAL COUNT OF DATAPOINTS IN VEL: 63586
INPUT_SET NCDatasets.NCDataset{Nothing, Missing}, VAR: StringAlready exists... overwriting

Completed in 0.6735668182373047 seconds

REMOVED 3384 PRESUMED NON-METEORLOGICAL DATAPOINTS
FINAL COUNT 

(Bool[0, 0, 0, 0, 0, 1, 1, 1, 1, 1  …  0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0; 0; … ; 0; 0;;], [[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0  …  0.0, 0.0, 0.0, 
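
The log output suggests that each pass removes gates from what the previous pass retained (for example, 61392 gates remain after the first pass and 57374 after the second, a difference of 4018). A simplified sketch of that chaining is shown below. `composite_mask` and the toy vote fractions are hypothetical. In a real chain, later models would presumably only be evaluated on the surviving gates, which this sketch does not model.

```julia
# Hypothetical sketch (not Ronin's implementation): combine several passes,
# keeping a gate only if it clears the threshold of every pass.
function composite_mask(met_fractions::Vector{<:AbstractVector{<:Real}},
                        met_probs::AbstractVector{<:Real})
    @assert length(met_fractions) == length(met_probs)
    keep = trues(length(first(met_fractions)))
    for (fractions, threshold) in zip(met_fractions, met_probs)
        # A gate survives this pass only if it also survived all earlier passes
        keep .&= fractions .> threshold
    end
    return keep
end

# Toy fractions of trees voting "meteorological" for four gates in two passes
pass_1 = [0.05, 0.40, 0.80, 0.95]
pass_2 = [0.00, 0.20, 0.60, 0.90]
keep = composite_mask([pass_1, pass_2], [0.1, 0.5])
println("retained gates: ", count(keep), " of ", length(keep))
```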

## Now, let's see how the model did using the `get_contingency` function

In [63]:
###If `normalize` is set to `true`, returns a contingency matrix in which 
###each column contains the predictions as a fraction of the total number of true values (each column sums to 1)
get_contingency(predictions, Vector{Bool}(verification[:]), normalize=true)

| Row | | True Meteorological | True Non-Meteorological |
|---|---|---|---|
| | String | Float64 | Float64 |
| 1 | Predicted Meteorological | 0.856 | 0.006 |
| 2 | Predicted Non-Meteorological | 0.144 | 0.994 |
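
For readers who want to see what the column normalization means, here is a minimal sketch of a column-normalized contingency matrix computed from two Boolean vectors. `normalized_contingency` is a hypothetical stand-in written for this example and is not the code behind `get_contingency`.

```julia
# Hypothetical helper: rows are predictions (meteorological, non-meteorological),
# columns are true labels in the same order. Each column is divided by its sum.
function normalized_contingency(pred::AbstractVector{Bool}, truth::AbstractVector{Bool})
    @assert length(pred) == length(truth)
    counts = zeros(Float64, 2, 2)
    for (p, t) in zip(pred, truth)
        row = p ? 1 : 2
        col = t ? 1 : 2
        counts[row, col] += 1
    end
    # Note: a class with no true examples would produce NaN in its column
    return counts ./ sum(counts, dims = 1)
end

truth = [true, true, true, true, false, false]
pred  = [true, true, true, false, false, true]
m = normalized_contingency(pred, truth)
println(m)
```

Read this way, the 0.856 in the results above is the fraction of truly meteorological gates that the model retained, and the 0.006 is the fraction of truly non-meteorological gates that it mistakenly retained.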


## Looks pretty good! Let's now use it to actually apply quality control to the testing scans. 

In [62]:
QC_scan(test_config)

ERROR: POTENTIALLY INVALID FILE FORMAT FOR FILE: .tmp_hawkedit
LOADING MODELS....
QC-ing /Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TESTING//cfrad.20220907_125306.219_to_20220907_125310.193_N42RF-TM_AIR.ncINPUT_SET NCDatasets.NCDataset{Nothing, Missing}, VAR: StringAlready exists... overwriting

Completed in 1.0711100101470947 seconds

REMOVED 6845 PRESUMED NON-METEORLOGICAL DATAPOINTS
FINAL COUNT OF DATAPOINTS IN VEL: 61392
INPUT_SET NCDatasets.NCDataset{Nothing, Missing}, VAR: StringAlready exists... overwriting

Completed in 1.501816987991333 seconds

REMOVED 4018 PRESUMED NON-METEORLOGICAL DATAPOINTS
FINAL COUNT OF DATAPOINTS IN VEL: 57374
FINISHED QC-ing/Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TESTING//cfrad.20220907_125306.219_to_20220907_125310.193_N42RF-TM_AIR.nc in 2.7QC-ing /Users/ischluesche/Documents/Grad_School/Research/Ronin/NOAA/TESTING//cfrad.20220907_125306.750_to_20220907_125310.729_N42RF-TS_AIR.ncINP
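
Conceptually, the "REMOVED ... DATAPOINTS" messages correspond to gates being blanked out of a field such as `VEL`. The sketch below shows that idea on a plain Julia array. `apply_qc` is a hypothetical helper, and how `QC_scan` actually writes the result back to the cfradial files is not shown here.

```julia
# Hypothetical helper (not part of Ronin): keep gates predicted meteorological
# and replace all other gates with `missing`.
function apply_qc(field::AbstractArray{<:Union{Missing, Real}}, is_met::AbstractArray{Bool})
    @assert size(field) == size(is_met)
    out = Array{Union{Missing, Float64}}(undef, size(field))
    for i in eachindex(field, is_met)
        out[i] = is_met[i] ? field[i] : missing
    end
    return out
end

# Toy velocity values and model predictions for four gates
vel = [12.5, -3.0, 7.25, 0.5]
is_met = [true, false, true, false]
vel_qc = apply_qc(vel, is_met)
println("removed ", count(ismissing, vel_qc), " gates; ",
        count(!ismissing, vel_qc), " remain")
```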