# 2. Model Selection
---

From this point forward, I made heavy use of asynchronous notebook runs. As a result, I have a lot of notebooks that are near-duplicates of each other with minor details tweaked. For each of these, I will walk through the base process and note the tweaked details, along with links to the notebooks used.

In [1]:
#main imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import model_processes as mp
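#sklearn imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, balanced_accuracy_score
#classifiers used in the dicts below
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
#on scikit-learn < 1.0, HistGradientBoostingClassifier also needs:
#from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              ExtraTreesClassifier, HistGradientBoostingClassifier)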

%matplotlib inline
%config InlineBackend.figure_format = 'svg' 

Loading the relevant data: the mp.frame_from_drive function uses the SciDrive module from the SciServer package to load the given frame. The mp.frame_from_url function does exactly the same thing, but with a manually acquired URL. The URLs are public, so running this cell will work.  
Besides the path/URL, both functions take two optional arguments. The first (grouped=bool, default=True) groups the target feature by class rather than subclass. The second (replace_9=bool, default=False) replaces -9.999 (the value the dataset uses to represent missing or bad data) with NaN wherever it appears in the feature columns.
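
The effect of replace_9 can be sketched in plain pandas. This is only an illustration on a made-up frame, not the actual mp implementation: the column names, the helper name replace_sentinel, and the assumption that the target sits in the first column (as in the loading cell below) are all mine.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data: target first, features after.
# Column names are hypothetical.
toy = pd.DataFrame({
    'class': ['A', 'F', 'G'],
    'u_g': [0.51, -9.999, 1.20],
    'g_r': [-9.999, 0.33, 0.47],
})

def replace_sentinel(frame, sentinel=-9.999):
    """Return a copy with the sentinel swapped for NaN in the feature columns only."""
    out = frame.copy()
    feature_cols = out.columns[1:]  # everything except the target column
    out[feature_cols] = out[feature_cols].replace(sentinel, np.nan)
    return out

cleaned = replace_sentinel(toy)
print(cleaned.isna().sum())
```

Restricting the replacement to the feature columns leaves the target untouched, which matches the description above.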

In [None]:
# df = mp.frame_from_drive('SomeStars')
csv_url = 'https://www.scidrive.org/vospace-2.0/data/1bab3dd0-0961-4776-9d12-52b908ae63d1'
df = mp.frame_from_url(csv_url)
X, y = df.iloc[:,1:].sort_index(axis=1), df.iloc[:,0]
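
The later cells use X_train, X_test, y_train and y_test, which are not created in this notebook. A minimal sketch of how such a split could be made with train_test_split follows. The test size, the stratification and the random seed are assumptions rather than the settings of the original runs, and synthetic data stands in for X and y so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=list('abcd'))
y_demo = pd.Series(rng.choice(['A', 'F', 'K'], size=200, p=[0.6, 0.3, 0.1]))

# test_size, stratify and random_state are assumed values.
# Stratifying keeps the class proportions similar in both halves,
# which matters when some classes are rare.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
print(X_train.shape, X_test.shape)
```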

## 2.1 Initial Models
With my sampled data, I began trying out some initial models. Since I had easy, integrated access to asynchronous notebook runs, in [Notebook 1](https://github.com/edithalice/stellar_classification/blob/master/notebook_runs/2_1_MVP_Model.ipynb) I started by essentially throwing a bunch of models at my data and seeing how they did.

In [None]:
classifiers_dict = {
    'Logistic': LogisticRegression(),
    'KNN': KNeighborsClassifier(),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(),
    'Linear SVM': SVC(),  # note: SVC() defaults to an RBF kernel; SVC(kernel='linear') or LinearSVC gives a linear SVM
    'Random Forest': RandomForestClassifier(n_estimators=1000),
    'Gradient Boost': GradientBoostingClassifier(n_estimators=1000)
}

The mp.batch_classify_models function takes a dict of classifiers along with the X and y train and test sets, and returns three things: (1) a dataframe generated from the classification report function, (2) a dictionary of confusion matrices, normalized by true class and keyed by classifier name, and (3) a dictionary of trained models, again keyed by classifier name.
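
A function with that shape of output could look roughly like the sketch below. It is a hypothetical stand-in named batch_classify_sketch, not the real mp.batch_classify_models, and it leaves out the timing columns that appear in the saved results. The toy data and the two classifiers are there only to make it runnable.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def batch_classify_sketch(classifiers, X_train, y_train, X_test, y_test):
    """Hypothetical stand-in: fit each classifier and collect its outputs."""
    reports, matrices, fitted = {}, {}, {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        # Per-class precision/recall/f1 as a dict, turned into a frame
        report = classification_report(y_test, y_pred, output_dict=True, zero_division=0)
        reports[name] = pd.DataFrame(report).T
        # normalize='true' divides each row by its true-class count
        matrices[name] = confusion_matrix(y_test, y_pred, normalize='true')
        fitted[name] = clf
    # Classifier names become the outer level of the column index
    return pd.concat(reports, axis=1), matrices, fitted

# Toy three-class problem
X_toy, y_toy = make_classification(n_samples=300, n_features=6, n_informative=4,
                                   n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=0)
report_df, cms_toy, models = batch_classify_sketch(
    {'Logistic': LogisticRegression(max_iter=1000),
     'Decision Tree': DecisionTreeClassifier(random_state=0)},
    Xtr, ytr, Xte, yte)
print(report_df.round(2))
```

Returning the fitted models alongside the reports means a long run never has to be repeated just to inspect one classifier.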

In [None]:
initial_models_performances = mp.batch_classify_models(classifiers_dict, X_train, y_train, X_test, y_test)
initial_models_performances[0]

I'm not going to rerun this process here, but I saved the output of this first run in a CSV.

In [2]:
performances_url = 'https://www.scidrive.org/vospace-2.0/data/2026e3f7-1222-4258-8660-c8b9fccea2b8'
performance = pd.read_csv(performances_url, index_col=0)

In [3]:
# scores = mp.performance_frame(initial_models_performances[0])
scores = mp.performance_frame(performance)

In [4]:
scores

| Metric | Type | ClassSize | Logistic | KNN | Naive Bayes | Decision Tree | Linear SVM | Random Forest | Gradient Boost |
|---|---|---|---|---|---|---|---|---|---|
| accuracy | overall | 7866.0 | 0.84897 | 0.837656 | 0.705823 | 0.853801 | 0.520722 | 0.905034 | 0.903382 |
| predict_time | overall | 7866.0 | 0.720311 | 139.821869 | 21.121665 | 0.037039 | 175.449976 | 62.922563 | 15.311615 |
| train_time | overall | 7866.0 | 1460.29343 | 1.341346 | 12.396218 | 16.536355 | 853.17644 | 434.422161 | 61852.924313 |
| f1-score | A | 1241.0 | 0.905306 | 0.869882 | 0.827532 | 0.917141 | 0.5 | 0.95526 | 0.956487 |
| f1-score | CV | 65.0 | 0.342857 | 0.356436 | 0.124031 | 0.366667 | 0.0 | 0.616822 | 0.508475 |
| f1-score | CarbonWD | 584.0 | 0.890026 | 0.890277 | 0.744526 | 0.844784 | 0.0 | 0.935652 | 0.92281 |
| f1-score | F | 1882.0 | 0.812839 | 0.786663 | 0.768831 | 0.834259 | 0.676626 | 0.881391 | 0.881374 |
| f1-score | G | 497.0 | 0.123128 | 0.463675 | 0.061962 | 0.604331 | 0.386648 | 0.671875 | 0.662366 |
| f1-score | K | 1131.0 | 0.887621 | 0.874132 | 0.669145 | 0.865513 | 0.351332 | 0.909716 | 0.920469 |
| f1-score | LT | 161.0 | 0.394052 | 0.247525 | 0.278481 | 0.357143 | 0.0 | 0.510067 | 0.503356 |


The mp.print_confusion_matrices function turns the confusion matrices into heatmaps, with the subplots sorted from top to bottom by the unweighted mean of the diagonal entries, i.e. the fraction of each true class that was classified correctly.

In [None]:
class_names = sorted(list(y_test.unique()))
cms = mp.print_confusion_matrices(initial_models_performances[1], class_names, figsz=(6,4), fontsize=14)
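
The sort order can be illustrated with a few made-up matrices. This sketch assumes the matrices are already normalized by true class and that the best model is listed first. The model names, the numbers and the helper mean_diagonal are hypothetical, and the plotting itself is left out.

```python
import numpy as np

# Made-up row-normalized confusion matrices keyed by model name.
cms_demo = {
    'model_a': np.array([[0.9, 0.1], [0.4, 0.6]]),
    'model_b': np.array([[0.8, 0.2], [0.1, 0.9]]),
    'model_c': np.array([[0.5, 0.5], [0.5, 0.5]]),
}

def mean_diagonal(cm):
    """Unweighted mean of per-class recall for a row-normalized matrix."""
    return float(np.mean(np.diag(cm)))

# Highest mean diagonal first (assumed direction of the sort).
ordered = sorted(cms_demo, key=lambda name: mean_diagonal(cms_demo[name]), reverse=True)
print(ordered)  # ['model_b', 'model_a', 'model_c']
```

Because every class counts equally in this mean, a model that ignores the small classes drops down the list even if its overall accuracy is high.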

## 2.2 More Models
After this notebook, I ran [Notebook 2](https://github.com/edithalice/stellar_classification/blob/master/notebook_runs/2_2_More_Models.ipynb) with more models. Based on the comparative success of the tree-based models from the first notebook, I tried out some more tree-based models, listed in the dict below:

In [None]:
classifiers_dict = {
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=1000),
    'Random Forest Shallow': RandomForestClassifier(max_depth=6, n_estimators=1000),
    'Extra Trees': ExtraTreesClassifier(n_estimators=1000),
    'Extra Trees Shallow': ExtraTreesClassifier(max_depth=6, n_estimators=1000),
    'Hist Boosted': HistGradientBoostingClassifier(max_iter=1000),
    'Hist Boosted Shallow': HistGradientBoostingClassifier(max_depth=6, max_iter=1000)
}

Output from this batch:  
<img src="pics/tree_models_cmf_mats.svg">

## 2.3 Feature Selection
While this was running, I also started testing the various combinations of features I put together in the previous notebook. Random Forest was the best-performing model of the first batch, so I used it in [Notebook 3](https://github.com/edithalice/stellar_classification/blob/master/notebook_runs/2_3_Feature_Selection.ipynb) to evaluate each set of features. This notebook contains a process almost identical to the one above. The difference is that instead of calling mp.batch_classify_models on a classifier dict, it calls mp.batch_classify_data_subsets on a dict that maps each path name to a dataframe with the corresponding subset of features.
The results: <img src='pics/feature_tuning_cmf_mats.svg'>
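
The subset comparison described above can be sketched as a loop over a dict of feature frames. This is a simplified, hypothetical version and not mp.batch_classify_data_subsets itself: the data is synthetic, the subset names are invented, and balanced accuracy stands in for the full set of confusion matrices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300
# Synthetic data: 'signal' drives the label, 'noise' does not.
full = pd.DataFrame({'signal': rng.normal(size=n), 'noise': rng.normal(size=n)})
labels = pd.Series(np.where(full['signal'] > 0, 'A', 'F'))

# Hypothetical subsets, keyed by name like the dict described above.
subsets = {
    'all_features': full,
    'signal_only': full[['signal']],
    'noise_only': full[['noise']],
}

subset_scores = {}
for name, X_sub in subsets.items():
    # Same seed for every subset, so each one sees the same rows
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sub, labels, stratify=labels, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    subset_scores[name] = balanced_accuracy_score(y_te, clf.predict(X_te))

print(pd.Series(subset_scores).sort_values(ascending=False))
```

Holding the model and the split fixed means any difference in score comes from the features alone.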

Based on the above results (and some similar later ones, once I got [XGBoost](https://github.com/edithalice/stellar_classification/blob/master/notebook_runs/2_3_Feature_Selection_xgb.ipynb) and [HistGradientBoosting](https://github.com/edithalice/stellar_classification/blob/master/notebook_runs/2_3_Feature_Selection_Hist.ipynb) working), I ended up deciding not to eliminate any features.

## 2.4 Getting XGBoost and HistGradientBoosting working
I got stuck on this part for a while. Basically, when I submit a notebook to run as an asynchronous job, it starts with a new default environment every time: an environment with sklearn version 0.22, which doesn't support NaN handling in HistGradientBoosting, and with no XGBoost at all. In the end, I figured out that I could add the following cells to the beginning of the relevant notebooks to make them work:

In [None]:
pip install --upgrade scikit-learn

In [None]:
pip install xgboost

Once I got these models working, I ran the same processes as above on them. In addition, I tried out a little (super basic) parameter tuning. The exciting part (for me anyway) was using the built-in NaN handling in these models! Performance went up (by a little bit) when I switched from the -9.999 values to NaNs. Notebooks: [Trying out hist](https://github.com/edithalice/stellar_classification/blob/master/notebook_runs/2_4_Trying_out_hist.ipynb) and [Trying out xgb](https://github.com/edithalice/stellar_classification/blob/master/notebook_runs/2_4_Trying_out_xgb.ipynb)  
<img src='pics/hist_xgb.svg'>

This was also the point when I started looking at balanced accuracy score as a metric of success, in the interest of having a single success metric for parameter tuning.

Initial Balanced Accuracy Scores:  
xgb = 0.77746346  
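
To see why balanced accuracy is a useful single number for imbalanced classes like these, here is a tiny made-up example (the labels are invented, not taken from the dataset). A model that always predicts the majority class gets a decent plain accuracy but only chance-level balanced accuracy, which equals the mean of the diagonal of the confusion matrix normalized by true class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Imbalanced toy labels: 8 of class 'F', 2 of class 'CV'.
y_true = ['F'] * 8 + ['CV'] * 2
# A lazy model that predicts the majority class every time.
y_pred = ['F'] * 10

acc = accuracy_score(y_true, y_pred)           # 0.8
bal = balanced_accuracy_score(y_true, y_pred)  # 0.5: recall is 1.0 for F, 0.0 for CV
# The same number from the row-normalized confusion matrix diagonal
diag_mean = np.diag(confusion_matrix(y_true, y_pred, normalize='true')).mean()

print(acc, bal, diag_mean)
```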


Based on these results, I definitely liked XGBoost and HistGradientBoosting the best, but I had a difficult time deciding between the two models, so I ended up continuing with both for a while.