# Project on testing ML techniques to identify YSOs in Spitzer IRAC data

## Breanna Crompvoets and Samuel Fielder

## Project Summary and Goals
Young Stellar Objects (YSOs) are newly forming stars which are yet to begin burning. They are split into different classes depending on their dust/gas envelope to protostar ratio; Class 0 having the greatest envelope and Class III no longer having an envelope. Due to the different ratios of envelope to protostar, each class appears differently in spectroscopy; thus the difference in fluxes between Spitzer IRAC bands are able to determine which class the data comes from. This project will focus on using the same data as Cornu and Montillaud (2021; CM21) to classify data points into three classes: Class I, Class II, and Contaminants. Only these three classes out of the original 9 available are chosen as Class 0 is too dusty to detect, and Class III is difficult to distinguish from regular stars. Furthermore, the contaminating classes (galaxies, shocks, stars, and PAHs) are of less concern -- we would like the algorithms to focus on distinguishing Class I and Class II from the rest. The original paper uses a multi-layer perceptron (MLP) with one hidden layer (20 neurons). Their results are presented in the below table.

|Class | Recall | Precision |
| --- | --- | --- |
|1 | 94.0% | 79.1 %|
|2 | 96.7% | 90.6% | 
|Other | 98.7%| 99.8%| 

The data for this project was pulled from https://cdsarc.cds.unistra.fr/viz-bin/cat/J/A+A/647/A116. These data include columns for four Spitzer IRAC bands (3.6 $\mu m$, 4.5 $\mu m$, 5.8 $\mu m$, and 8 $\mu m$) fluxes and errors, as well as from one Spitzer MIPS band (24 $\mu m$), along with the target values as determined via a manual classification scheme and the predicted data from CM21. We will only be using the four IRAC bands and their associated errors, as the MIPS band does not provide data for most objects. We use the same target values as they do for accurate comparison. 

This project seeks to use a multitude of algorithms learned over the semester to measure their effectiveness and compare it to the recreated MLP of CM21. These algorithms include: GridSearch with an SVC, GridSearch with a Logistic Regressor, a Stacking Ensemble with an SVC and a Logistic Regressor, a Gradient Boosting ensemble, a Random Forest ensemble, and an XGBoost ensemble. We also created our own MLP based off of their prescription. The workload was split as follows: B. Crompvoets completed the data cleaning and all other algorithms besides the MLP, and S. Fielder completed an MLP close to that of CM21, as well as creating a custom data loader/split. They communicated together on the best hyper-parameters to test.


## Import Libraries and set global variables

In [1]:
# import statements
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# classic ML libraries
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, recall_score, precision_score, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,  GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC
import xgboost as xgb

# custom made libraries
from custom_dataloader import replicate_data

  from pandas import MultiIndex, Int64Index


In [2]:
# settings for confusion matrix plots and classification reports
cm_blues = plt.cm.Blues
custom_labs = ['Class 1', 'Class 2', 'Others']

# Classical ML Techniques

For each of the below algorithms we ran a gridsearch over a wide variety of hyperparameters. These parameter dictionaries are commented out in each cell, and we use the best parameters as a sample here to show how each algorithm performed. This section of the project was conducted by B. Crompvoets.

We conducted these fits for each of three data splits:
* "75/25" -- here the data is split into 75% training, 25% test set.
* "300s" -- here the data is split such that 5 out of the 7 subclasses, each has 300 members in the training set (2 do not have enough members). The test set size is the same as CM21. The data is again split into only 3 classes to train/test.
* "CM21" -- here the data is split with the exact same values as CM21 provide in their paper.

An example of a run with the 75/25 split is given below, with best results for each of the other runs commented out, as well as the parameter grid tested for each GridSearch. The results of all runs will be included at the end of this document.

## Loading Data Set

In [41]:
# data load
X = np.load("Input_Class_AllClasses_Sep.npy")
Y = np.load("Target_Class_AllClasses_Sep.npy") # For original targets via Gutermuth 2009 Method
# Y = np.load("Pred_Class_AllClasses_Sep.npy") # For predicted targets from CM21


# custom data loader to pull in custom sized data set
# use seed to get replicable results for now
seed_val = 1111

# the amounts below are how many of each class of object you want in the training set and validation set - leftover amounts given to testing set

# CM21 Split
# amounts_train = [331,1141,231,529,27,70,1257]
# amounts_val = [82, 531, 104, 278, 6, 17, 4359]
# amounts_train = [331,1141,231+529+27+70+1257]
# amounts_val = [82, 531, 104+278+6+17+4359]


# 300s Split
amounts_train = [300,300,300,300,27,70,300]
amounts_val = [82, 531, 104, 278, 6, 17, 4359]
# amounts_train = [300,300,300+300+27+70+300]
# amounts_val = [82, 531, 104+278+6+17+4359]


# 75/25 Split
# amounts_train = [311,1994,391,1043,25,66,21796] #75/25 train
# amounts_val = [103,665,130,348,9,22,5449] #75/25 val
# amounts_train = [311,1994,391+1043+25+66+21796] #75/25 train
# amounts_val = [103,665,130+348+9+22+5449] #75/25 val

# calling custom datagrabber here
inp_tr, tar_tr, inp_va, tar_va, inp_te, tar_te = replicate_data(X, Y, 'seven', amounts_train, amounts_val, seed_val)

# scaling data according to training inputs
scaler_S = StandardScaler().fit(inp_tr)
inp_tr = scaler_S.transform(inp_tr)
inp_va = scaler_S.transform(inp_va)

# COMMENT NEXT LINE IF RUNNING 75/25 SPLIT
inp_te = scaler_S.transform(inp_te) # Comment out for 75/25 split

# printouts for double checking all the sets and amounts
print('Sizes of Datasets : Inputs , Targets')
print('------------------------------------')
print(f'Training set: {inp_tr.shape} , {tar_tr.shape} \nValidation set: {inp_va.shape} , {tar_va.shape} \nTesting Set: {inp_te.shape}, {tar_te.shape}')
print('------------------------------------')


Sizes of Datasets : Inputs , Targets
------------------------------------
Training set: (1597, 8) , (1597,) 
Validation set: (5377, 8) , (5377,) 
Testing Set: (19929, 8), (19929,)
------------------------------------


In [42]:
inputs = np.concatenate((inp_tr,inp_va,inp_te))
targets = np.concatenate((tar_tr,tar_va,tar_te))
np.save("XGB_Val_G-targets7.npy",targets)
tar_tr = np.where(tar_tr<2,tar_tr,2)
tar_va = np.where(tar_va<2,tar_va,2)
tar_te = np.where(tar_te<2,tar_te,2)
targets = np.concatenate((tar_tr,tar_va,tar_te))
np.save("XGB_Val_G-targets2.npy",targets)

xgbcl = xgb.XGBClassifier(max_depth=7,sampling_method='uniform',subsample=0.5,use_label_encoder=False,eval_metric='mlogloss')
xgbcl.fit(inp_tr,tar_tr)
pred_va = xgbcl.predict(inputs)
np.save("XGB_300s_G-targets_VPred.npy",pred_va)

In [4]:

def bootstrap_estimate_and_ci(estimator, X_tr, y_tr, X_va, y_va, scoring_func=None, random_seed=0, 
                               alpha=0.05, n_splits=200):
                        
    scores = []

    if scoring_func == accuracy_score:
        for n in range(0,n_splits):
            estimator.fit(X_tr, y_tr.ravel())  
            scores.append(scoring_func(y_va,estimator.predict(X_va)))
            # scores = list(map(list, zip(*scores)))
        estimate = np.mean(scores)
        lower_bound = np.percentile(scores, 100*(alpha/2))
        upper_bound = np.percentile(scores, 100*(1-alpha/2))
        stderr = np.std(scores)   

    else:
        for n in range(0,n_splits):
            estimator.fit(X_tr, y_tr.ravel())  
            scores.append(scoring_func(y_va,estimator.predict(X_va),average=None))   
            scores = list(map(list, zip(*scores)))
    
        estimate = [np.mean(scores[0]),np.mean(scores[1]),np.mean(scores[2])]
        lower_bound = [np.percentile(scores[0], 100*(alpha/2)),np.percentile(scores[1], 100*(alpha/2)),np.percentile(scores[2], 100*(alpha/2))]
        upper_bound = [np.percentile(scores[0], 100*(1-alpha/2)),np.percentile(scores[1], 100*(1-alpha/2)),np.percentile(scores[2], 100*(1-alpha/2))]
        stderr = [np.std(scores[0]),np.std(scores[1]),np.std(scores[2])]
    
    return estimate, lower_bound, upper_bound, stderr

In [5]:
f = open("PRAScores_7525_C-targets_2.txt","w")
f.write("C-targets \n")
f.write("7525 split, correct \n")

# f.close()

21

## Logistic Regression


Specifying logistic regression
logreg = LogisticRegression()

hyperparameters tested over initially
param_grid = [{'penalty': ['l1'], 'max_iter': np.arange(300,1500,100),
        'solver': ['liblinear', 'saga'], 'tol': np.arange(0.0001,0.01,0.0005)},
        {'penalty': ['l2'], 'max_iter': np.arange(300,1500,100),
        'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], 'tol': np.arange(0.0001,0.01,0.0005)},
        {'penalty': ['elasticnet'], 'max_iter': np.arange(100,2000,100), 'l1_ratio': np.arange(0.1,1.,0.1),
        'solver': ['saga'], 'tol': np.arange(0.0001,0.01,0.0005)}]

grid = GridSearchCV(logreg,param_grid=param_grid, verbose=1)

Run the data through the grid to find optimal results
grid.fit(inp_tr, tar_tr.ravel())

In [6]:
%%time
# 75/25
logreg = LogisticRegression('l1',max_iter=500,solver='saga',tol=0.0001)
# 300s
# logreg = LogisticRegression('l1',max_iter=300,solver='saga',tol=0.0016)
# CM21
# logreg = LogisticRegression('l1',max_iter=600,solver='saga',tol=0.0001)

CPU times: user 10 µs, sys: 1e+03 ns, total: 11 µs
Wall time: 12.2 µs


In [None]:
est, low, up, stderr = bootstrap_estimate_and_ci(logreg, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=recall_score, random_seed=0, 
                               alpha=0.05, n_splits=200)
                              
f.write("LR Recall\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(logreg, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=precision_score, random_seed=0, 
                              alpha=0.05, n_splits=200)
                            
f.write("LR Precision\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(logreg, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=accuracy_score, random_seed=0, 
                              alpha=0.05, n_splits=200)
                            
f.write("LR Accuracy\n")
f.write("{:.3f}".format(est)+"$\pm$"+"{:.3f}".format(stderr)+"_{"+"{:.3f}".format(low)+"}^{"+"{:.3f}".format(up)+"}//\n")
f.write("\n")

## SVM

Hyperparameters tested over initially
param_grid = [{'kernel':['rbf','sigmoid','linear','poly'], 'gamma':['auto','scale'], 'C':np.arange(0.1,1.,0.1)}]

In [None]:
%%time
# Final hyperparameters
# 75/25
svc = SVC(kernel='rbf',gamma='auto',C=0.9)
# 300s
# svc = SVC(kernel='linear',gamma='auto',C=0.8)
# CM21
# svc = SVC(kernel='rbf',gamma='auto',C=0.9)

CPU times: user 123 µs, sys: 151 µs, total: 274 µs
Wall time: 66.3 µs


In [None]:
est, low, up, stderr = bootstrap_estimate_and_ci(svc, inp_tr, tar_tr.ravel(),  inp_va, tar_va, scoring_func=recall_score, random_seed=0, 
                              alpha=0.05, n_splits=200)

f.write("SVC Recall\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(svc, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=precision_score, random_seed=0, 
                              alpha=0.05, n_splits=200)
                            
f.write("SVC Precision\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(svc, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=accuracy_score, random_seed=0, 
                               alpha=0.05, n_splits=200)
                            
f.write("SVC Accuracy\n")
f.write("{:.3f}".format(est)+"$\pm$"+"{:.3f}".format(stderr)+"_{"+"{:.3f}".format(low)+"}^{"+"{:.3f}".format(up)+"}//\n")
f.write("\n")

1

## SVM/LR Stacking Ensemble
No GridSearch, uses best pars as defined previously. Adding multiple SVC's does not improve results. 

In [None]:
%%time

# Specify Gradient Boost
# 75/25
estimators = [('svc', SVC(kernel='rbf',gamma='auto',C=0.9,random_state=42))]
# 300s
# estimators = [('svc', SVC(C=0.8, gamma='auto', kernel='linear',random_state=42))]
# CM21 
# estimators = [('svc', SVC(C=0.9, gamma='auto', kernel='rbf',random_state=42))]

# As the parameters for the Logistic Regression didn't change much, we use the best pars from the first trial.
stacl = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(penalty = 'l1', max_iter = 500, solver ='saga', tol =0.0001))

CPU times: user 28 µs, sys: 33 µs, total: 61 µs
Wall time: 62 µs


In [None]:
est, low, up, stderr = bootstrap_estimate_and_ci(stacl, inp_tr, tar_tr.ravel(),  inp_va, tar_va, scoring_func=recall_score, random_seed=0, 
                              alpha=0.05, n_splits=200)

f.write("Stack Recall\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(stacl, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=precision_score, random_seed=0, 
                              alpha=0.05, n_splits=200)
                            
f.write("Stack Precision\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(stacl, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=accuracy_score, random_seed=0, 
                               alpha=0.05, n_splits=200)
                            
f.write("Stack Accuracy\n")
f.write("{:.3f}".format(est)+"$\pm$"+"{:.3f}".format(stderr)+"_{"+"{:.3f}".format(low)+"}^{"+"{:.3f}".format(up)+"}//\n")
f.write("\n")

1

## Gradient Boosting

Hyperparameters tested over initially
param_grid = [{'n_estimators': np.arange(50,250,50),'subsample':[0.5,1.0],
              'criterion':['friedman_mse'],'n_iter_no_change':[5],'warm_start':[True,False],
              'max_depth':np.arange(1,11,2),'max_features': ['sqrt','log2']}]

In [None]:
%%time
# Final hyperparameters
# 75/25
boostcl = GradientBoostingClassifier(criterion='friedman_mse',max_depth=9,max_features='log2',
                n_estimators=50,n_iter_no_change=5,subsample=1.0,warm_start=True)

# 300s
# boostcl = GradientBoostingClassifier(criterion='friedman_mse',max_depth=5,max_features='log2',
#                 n_estimators=150,n_iter_no_change=5,subsample=1.0,warm_start=False)

# CM21
# boostcl = GradientBoostingClassifier(criterion='friedman_mse',max_depth=7,max_features='log2',
#                 n_estimators=200,n_iter_no_change=5,subsample=1.0,warm_start=True)

CPU times: user 25 µs, sys: 52 µs, total: 77 µs
Wall time: 81.1 µs


In [None]:
est, low, up, stderr = bootstrap_estimate_and_ci(boostcl, inp_tr, tar_tr.ravel(),  inp_va, tar_va, scoring_func=recall_score, random_seed=0, 
                              alpha=0.05, n_splits=200)

f.write("GB Recall\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(boostcl, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=precision_score, random_seed=0, 
                              alpha=0.05, n_splits=200)
                            
f.write("GB Precision\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(boostcl, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=accuracy_score, random_seed=0, 
                               alpha=0.05, n_splits=200)
                            
f.write("GB Accuracy\n")
f.write("{:.3f}".format(est)+"$\pm$"+"{:.3f}".format(stderr)+"_{"+"{:.3f}".format(low)+"}^{"+"{:.3f}".format(up)+"}//\n")
f.write("\n")

1

### XGBoost
Hyperparameters tested over initially
param_grid = [{'subsample':[0.5,1.0],'max_depth':np.arange(1,11,2),'sampling_method':['uniform']}]


In [None]:
# 75/25
xgbcl = xgb.XGBClassifier(max_depth=9,sampling_method='uniform',subsample=0.5,use_label_encoder=False,eval_metric='mlogloss')
# 300s
# xgbcl = xgb.XGBClassifier(max_depth=7,sampling_method='uniform',subsample=0.5,use_label_encoder=False,eval_metric='mlogloss')
# CM21
# xgbcl = xgb.XGBClassifier(max_depth=9,sampling_method='uniform',subsample=1.0,use_label_encoder=False,eval_metric='mlogloss')

In [None]:
est, low, up, stderr = bootstrap_estimate_and_ci(xgbcl, inp_tr, tar_tr.ravel(),  inp_va, tar_va, scoring_func=recall_score, random_seed=0, 
                              alpha=0.05, n_splits=200)

f.write("XGB Recall\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(xgbcl, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=precision_score, random_seed=0, 
                              alpha=0.05, n_splits=200)
                            
f.write("XGB Precision\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(xgbcl, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=accuracy_score, random_seed=0, 
                               alpha=0.05, n_splits=200)
                            
f.write("XGB Accuracy\n")
f.write("{:.3f}".format(est)+"$\pm$"+"{:.3f}".format(stderr)+"_{"+"{:.3f}".format(low)+"}^{"+"{:.3f}".format(up)+"}//\n")
f.write("\n")

1

## Random Forest

Hyperparameters tested over initially
param_grid = [{'class_weight': ['balanced_subsample','balanced'], 'n_estimators': np.arange(50,250,50),
        'criterion': ['gini', 'entropy'], 'max_features': ['sqrt','log2'], 'oob_score':[True,False]}]

In [16]:
# Final hyperparameters
# 75/25
rfcl = RandomForestClassifier(class_weight='balanced_subsample',criterion='entropy',max_features='log2',n_estimators=150,oob_score=False)
# 300s
# rfcl = RandomForestClassifier(class_weight='balanced',criterion='entropy',max_features='log2',n_estimators=50,oob_score=False)
# CM21
# rfcl = RandomForestClassifier(class_weight='balanced',criterion='entropy',max_features='log2',n_estimators=100,oob_score=True)

In [17]:
est, low, up, stderr = bootstrap_estimate_and_ci(rfcl, inp_tr, tar_tr.ravel(),  inp_va, tar_va, scoring_func=recall_score, random_seed=0, 
                              alpha=0.05, n_splits=200)

f.write("RF Recall\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(rfcl, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=precision_score, random_seed=0, 
                              alpha=0.05, n_splits=200)
                            
f.write("RF Precision\n")
f.write("{:.3f}".format(est[0])+"$\pm$"+"{:.3f}".format(stderr[0])+"_{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"}//\n")
f.write("{:.3f}".format(est[1])+"$\pm$"+"{:.3f}".format(stderr[1])+"_{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"}//\n")
f.write("{:.3f}".format(est[2])+"$\pm$"+"{:.3f}".format(stderr[2])+"_{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"}//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci(rfcl, inp_tr, tar_tr.ravel(), inp_va, tar_va.ravel(), scoring_func=accuracy_score, random_seed=0, 
                               alpha=0.05, n_splits=200)
                            
f.write("RF Accuracy\n")
f.write("{:.3f}".format(est)+"$\pm$"+"{:.3f}".format(stderr)+"_{"+"{:.3f}".format(low)+"}^{"+"{:.3f}".format(up)+"}//\n")
f.write("\n")

1

In [18]:
f.close()

In [19]:
import os
# os.system('say "tadaaa your program has probably failed"')
os.system('say "beep"')

0

# Neural Network

In this section, we describe in which ways we replicated the CM21 paper, and the choices we made along the way to derive the best results we were able to achieve. This section was performed by S. Fielder.

All the training has been performed prior to this notebook, and the state of the network saved at the appropriate time. Below we will just import the state of the system described, and run the testing set through for evaluation, and metrics along with Confusion Matrixes will be outputted for each subsection.

Many of the functions called below are semi-custom made, and are found in the appropriate `.py` files in this directory. The majority come from `NN_Defs.py` where the model construction along with the training and validation functions are located. Additionally, our custom data split maker is found in `custom_dataloader.py`, in which we can build reproducable sets of training, validation and testing sets depending on the number of subclasses wanted in each set. In this example, we focus on the results achieved by using the CM21 data-split as performed using this loader.

Finally `network_runner.py` was the script used in order to train all of the networks, and appropriately print out the metrics, along with saving plots for both confusion matrixes and loss values. Some of these outputs are shown in the `Saved_Final_Data/` directory. The settings found therein will be imported below to load the state of the networks as mentioned above.

## Importing

In [1]:
# library imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as data_utils

from sklearn.metrics import ConfusionMatrixDisplay, classification_report, recall_score, precision_score, accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split,  GridSearchCV

# custom script inputs
from NN_Defs import get_n_params, train, validate, BaseMLP

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'Running on : {device}')

datadir = 'Saved_Final_Data/'

Running on : cpu


## Data Loading

In [18]:
import numpy as np
# custom made libraries
from custom_dataloader import replicate_data
# data load
X = np.load("Input_Class_AllClasses_Sep.npy")
Y = np.load("Target_Class_AllClasses_Sep.npy")
# Y = np.load("Pred_Class_AllClasses_Sep.npy") # For predicted targets from CM21

# custom data loader to pull in custom sized data set
# use seed to get replicable results for now
seed_val = 1111

# the amounts below are how many of each class of object you want in the training set and validation set - leftover amounts given to testing set


# CM21 Split
amounts_train = [331,1141,231,529,27,70,1257]
amounts_val = [82, 531, 104, 278, 6, 17, 4359]
# amounts_train = [331,1141,231+529+27+70+1257] # C-targets
# amounts_val = [82, 531, 104+278+6+17+4359]

# 300s Split
# amounts_train = [300,300,300,300,27,70,300]
# amounts_val = [82, 531, 104, 278, 6, 17, 4359]
# amounts_train = [300,300,300+300+27+70+300]
# amounts_val = [82, 531, 104+278+6+17+4359]

# # 75/25 Split
# amounts_train = [311,1994,391,1043,25,66,21796] #75/25 train
# amounts_val = [103,665,130,348,9,22,5449] #75/25 val
# amounts_train = [311,1994,391+1043+25+66+21796] #75/25 train
# amounts_val = [103,665,130+348+9+22+5449] #75/25 val


# calling custom datagrabber here
inp_tr, tar_tr, inp_va, tar_va, inp_te, tar_te = replicate_data(X, Y, 'seven', amounts_train, amounts_val, seed_val)

# scaling data according to training inputs
scaler_S = StandardScaler().fit(inp_tr)
inp_tr = scaler_S.transform(inp_tr)
inp_va = scaler_S.transform(inp_va)
# inp_te = scaler_S.transform(inp_te) # Comment out for 75/25 split

# printouts for double checking all the sets and amounts
print('Sizes of Datasets : Inputs , Targets')
print('------------------------------------')
print(f'Training set: {inp_tr.shape} , {tar_tr.shape} \nValidation set: {inp_va.shape} , {tar_va.shape} \nTesting Set: {inp_te.shape}, {tar_te.shape}')
print('------------------------------------')

Sizes of Datasets : Inputs , Targets
------------------------------------
Training set: (3586, 8) , (3586,) 
Validation set: (5377, 8) , (5377,) 
Testing Set: (17940, 8), (17940,)
------------------------------------


## Scaling, Conversions to Tensors, and DataLoader Creation

In [19]:
inputs = np.concatenate((inp_tr,inp_va,inp_te))
targets = np.concatenate((tar_tr,tar_va,tar_te))
np.save("MLP_Val_G-targets7.npy",targets)
tar_tr = np.where(tar_tr<2,tar_tr,2)
tar_va = np.where(tar_va<2,tar_va,2)
tar_te = np.where(tar_te<2,tar_te,2)
inputs1 = np.concatenate((inp_tr,inp_va,inp_te))
targets1 = np.concatenate((tar_tr,tar_va,tar_te))
inputs1 = torch.tensor(inputs1)
targets1 = torch.tensor(targets1)
all_data = data_utils.TensorDataset(inputs1, targets1)
all_loader = torch.utils.data.DataLoader(all_data, batch_size=25, shuffle=True)
np.save("MLP_Val_G-targets2.npy",targets1)
loadpath = datadir+'Final_CSplit_4e-2_Settings'
BaseNN = BaseMLP(8, 20, 3)
BaseNN.load_state_dict(torch.load(loadpath, map_location=device))


<All keys matched successfully>

In [20]:
val_loss, val_accuracy, val_predictions, val_truth_values = validate(BaseNN, all_loader, device)


In [21]:

np.save("MLP_CM21_G-targets_VPred.npy",val_predictions)

In [22]:
print(classification_report(targets1,val_predictions))

              precision    recall  f1-score   support

           0       0.02      0.02      0.02       414
           1       0.10      0.07      0.08      2659
           2       0.89      0.92      0.90     23830

    accuracy                           0.82     26903
   macro avg       0.33      0.33      0.33     26903
weighted avg       0.79      0.82      0.81     26903



In [3]:
# creation of tensor instances

inp_tr = torch.tensor(inp_tr)
tar_tr = torch.tensor(tar_tr)
inp_va = torch.tensor(inp_va)
tar_va = torch.tensor(tar_va)
# inp_te = torch.tensor(inp_te)
# tar_te = torch.tensor(tar_te)

# pass tensors into TensorDataset instances
train_data = data_utils.TensorDataset(inp_tr, tar_tr)
val_data = data_utils.TensorDataset(inp_va, tar_va)
# test_data = data_utils.TensorDataset(inp_te, tar_te)

# constructing data loaders
train_loader = torch.utils.data.DataLoader(train_data, batch_size=25, shuffle=True)
val_loader = torch.utils.data.DataLoader(val_data, batch_size=25, shuffle=True)
# test_loader = torch.utils.data.DataLoader(test_data, batch_size=25, shuffle=True)

In [40]:

def bootstrap_estimate_and_ci_MLP(NN, valid_loader, device,  scoring_func=None, random_seed=0, 
                               alpha=0.05, n_splits=200):
                        
    scores = []

    if scoring_func == accuracy_score:
        for n in range(0,n_splits):
            val_loss, val_accuracy, val_predictions, val_truth_values = validate(NN, valid_loader, device)
            scores.append(scoring_func(val_truth_values,val_predictions))
        estimate = np.mean(scores)
        lower_bound = np.percentile(scores, 100*(alpha/2))
        upper_bound = np.percentile(scores, 100*(1-alpha/2))
        stderr = np.std(scores)

    else:
        for n in range(0,n_splits):
            val_loss, val_accuracy, val_predictions, val_truth_values = validate(BaseNN, val_loader, device)
            scores.append(scoring_func(val_truth_values,val_predictions,average=None))   
            scores = list(map(list, zip(*scores)))
    
        estimate = [np.mean(scores[0]),np.mean(scores[1]),np.mean(scores[2])]
        lower_bound = [np.percentile(scores[0], 100*(alpha/2)),np.percentile(scores[1], 100*(alpha/2)),np.percentile(scores[2], 100*(alpha/2))]
        upper_bound = [np.percentile(scores[0], 100*(1-alpha/2)),np.percentile(scores[1], 100*(1-alpha/2)),np.percentile(scores[2], 100*(1-alpha/2))]
        stderr = [np.std(scores[0]),np.std(scores[1]),np.std(scores[2])]
    
    return estimate, lower_bound, upper_bound, stderr

## Create Network Instance and Set Hyperparameters

In [41]:
# create nn instance
BaseNN = BaseMLP(8, 20, 3)
# load in saved state of network
# loadpath = datadir+'Final_CSplit_4e-2_Settings' #CM21 Split
loadpath = datadir+'Final_CSplit_CM21Pred_Settings' #CM21 Split - C-tar train
# loadpath = datadir+'Final_300sSplit_CM21Pred_Settings' #300 Split - C-tar train
# loadpath = datadir+'Final_300s_Settings'
# loadpath = datadir+'Final_7525Split_CM21Pred_Settings' # 75/25 Split - Ctar train
# loadpath = datadir+'Final_7525Split_Settings' # 75/25 Split

BaseNN.load_state_dict(torch.load(loadpath, map_location=device))

# compute validation results
# val_loss, val_accuracy, val_predictions, val_truth_values = validate(BaseNN, val_loader, device)

f = open("PRAScores_CM21_C-targets_MLP_2.txt","w")
f.write("C-targets, CM21 data-split, correct\n")

est, low, up, stderr = bootstrap_estimate_and_ci_MLP(BaseNN, val_loader, device, scoring_func=recall_score, random_seed=0, 
                              alpha=0.05, n_splits=200)

f.write("MLP Recall\n")
f.write("{:.3f}".format(est[0])+" _{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"} , "+"{:.3f}".format(stderr[0])+"//\n")
f.write("{:.3f}".format(est[1])+" _{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"} , "+"{:.3f}".format(stderr[1])+"//\n")
f.write("{:.3f}".format(est[2])+" _{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"} , "+"{:.3f}".format(stderr[2])+"//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci_MLP(BaseNN, val_loader, device, scoring_func=precision_score, random_seed=0, 
                              alpha=0.05, n_splits=200)
                            
f.write("MLP Precision\n")
f.write("{:.3f}".format(est[0])+" _{"+"{:.3f}".format(low[0])+"}^{"+"{:.3f}".format(up[0])+"} , "+"{:.3f}".format(stderr[0])+"//\n")
f.write("{:.3f}".format(est[1])+" _{"+"{:.3f}".format(low[1])+"}^{"+"{:.3f}".format(up[1])+"} , "+"{:.3f}".format(stderr[1])+"//\n")
f.write("{:.3f}".format(est[2])+" _{"+"{:.3f}".format(low[2])+"}^{"+"{:.3f}".format(up[2])+"} , "+"{:.3f}".format(stderr[2])+"//\n")
f.write("\n")

est, low, up, stderr = bootstrap_estimate_and_ci_MLP(BaseNN, val_loader, device, scoring_func=accuracy_score, random_seed=0, 
                               alpha=0.05, n_splits=200)
                            
f.write("MLP Accuracy\n")
f.write("{:.3f}".format(est)+" _{"+"{:.3f}".format(low)+"}^{"+"{:.3f}".format(up)+"} , "+"{:.3f}".format(stderr)+"//\n")
f.write("\n")

f.close()

Again, the loss plot below was constructed at our last epoch in this specific model loaded. This is loaded in manually here from `/Saved_Final_Data`.

<img src="Saved_Final_Data/Final_CSplit_4e-2_loss.png" alt="Loss with 4e-5 Learning Rate" width="750"/>