# Create a Random Forest model from BASE-9 data


This notebook performs the following tasks:
- reads in posterior data from BASE-9
- generates features from these data 
- uses these features to train and test a random forest classifier from `scikit-learn`
- and saves the model to a file.  

Here we use data from NGC 2682 (M67) to train the model; these data were hand labelled by Justyce.  If you need access to these data, please contact Aaron Geller.

Most of the "heavy lifting" is done by the code in the `base9_ml_utils.py` file.  See the comments and markdown in that code for more details.


___
*Authors:* Justyce Watson, Aaron Geller\
*Date:* August 2025


## Import all functions from the `base9_ml_utils.py` file

In [1]:
# import functions from .py file
from base9_ml_utils import *

# The lines below are useful if you plan to make changes to the base9_ml_utils.py file.
# They will allow the notebook to refresh when you save changes to the .py file.
#
# %load_ext autoreload
# %autoreload 2


## Read in `.res` files and create the features

Set the data directory to the location of the `.res` files on your own computer.  The code assumes that this directory contains one `.res` file per star, and that each filename contains the star ID.  (If the filename contains additional text, specify it with the `file_prefix` and/or `file_suffix` arguments so that the code can extract the star ID from the filename.)
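As a rough illustration of the naming convention described above, the sketch below strips a prefix and suffix from a filename to recover the star ID. The helper `star_id_from_filename` and the example filename are hypothetical and are not part of `base9_ml_utils.py`; the actual parsing inside `create_features` may differ.

```python
from pathlib import Path


def star_id_from_filename(filename, file_prefix="", file_suffix=""):
    """Hypothetical helper: recover the star ID from a .res filename."""
    # drop any directories and the file extension
    stem = Path(filename).stem
    # remove the optional prefix and suffix that surround the ID
    if file_prefix and stem.startswith(file_prefix):
        stem = stem[len(file_prefix):]
    if file_suffix and stem.endswith(file_suffix):
        stem = stem[:-len(file_suffix)]
    return stem


# illustrative filename only
print(star_id_from_filename("NGC_2682_605002016872204416.res", file_prefix="NGC_2682_"))
```

If your files follow a different pattern, adjusting `file_prefix` and `file_suffix` in the call to `create_features` is likely simpler than renaming the files.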

We will use the `create_features` function imported from `base9_ml_utils.py`.


In [2]:
# run this cell to see information about this function
create_features?

Signature:
create_features(
    directory,
    column=0,
    max_nfiles=inf,
    file_prefix='NGC_2682_',
    file_suffix='',
    ess_num_samples=10000,
)
Docstring:
function that will calculate all the features needed for the ML model 
Note that the file names for the res files must contain the ids (and can include a prefix and suffix)

inputs:
- directory : (string) path to the data directory that contains the res files from BASE-9
- column : (string) column number to use from the res file to use to calculate features
- max_nfiles : (int) 

In [6]:
# directory on your computer where the .res data files are stored
directory = "data/NGC2682/jw_output"

# create a DataFrame with features for each star using the `create_features` function
model_cluster_statistic = create_features(directory)
 
# display the resulting DataFrame in the notebook
model_cluster_statistic

| | source_id | Width | Upper_bound | Lower_bound | Stdev | SnR | Dip_p | Dip_value | KS_value | KS_p | ESS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 605002016872204416 | 1.361236 | 0.557799 | 0.967022 | 0.658567 | 14.214572 | 0.000000 | 0.017193 | 0.225703 | 3.966666e-279 | 9914.081693 |
| 1 | 604942879467199360 | 1.352529 | 0.557799 | 0.967022 | 0.610061 | 15.058561 | 0.000000 | 0.019429 | 0.152922 | 1.644700e-124 | 10164.999869 |
| 2 | 604969031523465728 | 0.387865 | 0.557799 | 0.967022 | 0.263017 | 36.229304 | 0.000000 | 0.012712 | 0.155823 | 1.805514e-132 | 10070.746059 |
| 3 | 604921679508385664 | 1.676418 | 0.557799 | 0.967022 | 0.744099 | 12.044454 | 0.000001 | 0.008560 | 0.141053 | 1.188060e-107 | 9933.271378 |
| 4 | 604968958508607360 | 1.262185 | 0.557799 | 0.967022 | 0.600071 | 15.298958 | 0.000000 | 0.014429 | 0.130290 | 1.232158e-89 | 10006.607425 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1423 | 604994045412975744 | 0.770949 | 0.557799 | 0.967022 | 0.473367 | 19.784625 | 0.000000 | 0.009089 | 0.166776 | 2.267155e-149 | 9908.144625 |
| 1424 | 604921924322164608 | 0.167193 | 0.557799 | 0.967022 | 0.091203 | 105.613775 | 0.000000 | 0.012850 | 0.153061 | 9.241779e-126 | 9671.638661 |
| 1425 | 604970062315630336 | 1.360707 | 0.557799 | 0.967022 | 0.639646 | 14.251756 | 0.000000 | 0.013919 | 0.153942 | 3.812003e-125 | 9850.930115 |
| 1426 | 598962292125778560 | 1.532101 | 0.557799 | 0.967022 | 0.686447 | 13.055923 | 0.000000 | 0.020750 | 0.125103 | 2.478370e-83 | 9630.701794 |
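To give a feel for what per-star features of this kind can look like, here is a minimal sketch that computes a few summary statistics from one set of posterior samples. The definitions used here (SnR as mean divided by standard deviation, and a KS test against a normal distribution with the same mean and standard deviation) are assumptions made for illustration; the definitions actually used by `create_features` are documented in `base9_ml_utils.py` and may differ. The function `summary_features` and the synthetic samples are hypothetical.

```python
import numpy as np
from scipy import stats


def summary_features(samples):
    """Illustrative summary statistics for one star's posterior samples."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean()
    stdev = samples.std(ddof=1)
    # assumed definition: mean divided by standard deviation
    snr = mean / stdev
    # assumed comparison: KS test against a normal with the same mean and stdev
    ks = stats.kstest(samples, "norm", args=(mean, stdev))
    return {"Stdev": stdev, "SnR": snr, "KS_value": ks.statistic, "KS_p": ks.pvalue}


# synthetic samples, not BASE-9 output
rng = np.random.default_rng(42)
fake_samples = rng.normal(loc=9.6, scale=0.3, size=10_000)
print(summary_features(fake_samples))
```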


## Read in data for training and testing the model

This dataset contains hand labelled sampling quality for each star that has a `.res` file in the dataset above.  The labels were created by Justyce Watson by visually inspecting the distributions in the `.res` files.

In this dataset we use the column `Single Sampling` as our label and keep only the rows where a label exists.

In [10]:
# Read in the data
df1 = pd.read_csv('data/NGC2682/NGC2682_Age_Stats.csv',sep=',')

# Select only the rows where a Single Sampling label exists,
# and keep only the relevant columns
sampling_df = df1.loc[df1['Single Sampling'].notna(), ['source_id', 'Single Sampling']]

# Display this DataFrame in the notebook
sampling_df

| | source_id | Single Sampling |
|---|---|---|
| 0 | 597810107020313344 | Bad |
| 1 | 597830722862488064 | Bad |
| 2 | 598464900553093504 | Bad |
| 3 | 598525408052424960 | Bad |
| 4 | 598543206396991232 | Bad |
| ... | ... | ... |
| 1435 | 605170688827236736 | Bad |
| 1436 | 603848521800034176 | Bad |
| 1438 | 603868141210083712 | Bad |
| 1439 | 607987427163771520 | Good |
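Both tables share a `source_id` column, so features and labels can be matched on it. The sketch below shows one way to do that with pandas, using tiny made-up tables; it is an illustration only, and the matching performed inside `create_model` may be implemented differently.

```python
import pandas as pd

# tiny stand-in tables; all values are made up for illustration
features = pd.DataFrame({
    "source_id": [101, 102, 103, 104],
    "Width": [1.36, 0.39, 1.68, 0.17],
})
labels = pd.DataFrame({
    "source_id": [102, 103, 104, 105],
    "Single Sampling": ["Good", "Bad", None, "Bad"],
})

# keep labelled rows only, then match features to labels by star ID
labelled = labels[labels["Single Sampling"].notna()]
merged = features.merge(labelled, on="source_id", how="inner")

print(merged)
print(merged["Single Sampling"].value_counts())
```

An inner merge keeps only stars that appear in both tables, so a star with features but no label (or a label but no `.res` file) drops out.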


## Create the model
Here we use the `create_model` function imported from `base9_ml_utils.py`.  This function splits the data into training and testing subsets.  The training set is then further modified so that it contains equal numbers of "Good" and "Bad" labelled stars.
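The general pattern of splitting, balancing, and fitting can be sketched with scikit-learn as follows. This is not the code in `create_model`: the function name `balanced_forest`, the 30% test fraction, and the choice to balance by downsampling the larger class are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def balanced_forest(X, y, random_seed=42, test_size=0.3):
    """Illustrative sketch: split, balance the training set, fit a pipeline."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_seed
    )
    # assumed balancing strategy: downsample each class to the rarest class size
    n_min = int(y_train.value_counts().min())
    keep = y_train.groupby(y_train).sample(n=n_min, random_state=random_seed).index
    X_train, y_train = X_train.loc[keep], y_train.loc[keep]

    # scaler followed by the random forest classifier
    pipe = Pipeline([
        ("scaler", StandardScaler()),
        ("forest", RandomForestClassifier(random_state=random_seed)),
    ])
    pipe.fit(X_train, y_train)
    return pipe, X_train, y_train, X_test, y_test


# synthetic stand-in data with an imbalanced label
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["Width", "Stdev", "SnR"])
y = pd.Series(np.where(X["Width"] > 1.0, "Good", "Bad"), name="Single Sampling")

pipe, X_train, y_train, X_test, y_test = balanced_forest(X, y)
print(y_train.value_counts())
```

Only the training set is balanced; the test set keeps the original class proportions so that the reported metrics reflect the imbalance seen in real data.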

In [11]:
# run this cell to see information about this function
create_model?

Signature:
create_model(
    features_df,
    label_df,
    label_column_name='Single Sampling',
    feature_columns=['Width', 'Upper_bound', 'Lower_bound', 'Stdev', 'SnR', 'Dip_p', 'Dip_value', 'KS_value', 'KS_p', 'ESS'],
    random_seed=42,
)
Docstring:
function that will create a random forest model using scikit-learn

inputs:
- features_df : (pandas DataFrame) contains all the features needed for the m

In [12]:
# create the model (returned as a scikit-learn pipeline object; here we call it "pipe")
pipe, X, y, X_train, y_train, X_test, y_test = create_model(model_cluster_statistic, sampling_df)

There are 174 training elements with classification = Bad
There are 174 training elements with classification = Good


## Use the model to generate labels
Here we use the `make_preds` function imported from `base9_ml_utils.py`.  In this function we send the model from `create_model` and data to be labeled.  For this step we will send the testing data.  We will also define the labels for the test data so that we can validate the quality of the model.  (Note that you can use `make_preds` without knowing the labels, as we will do in the `apply_model.ipynb` notebook.)

In [13]:
# run this cell to see information about this function
make_preds?

Signature:
make_preds(
    pipe,
    X,
    y_test=None,
    feature_columns=['Width', 'Upper_bound', 'Lower_bound', 'Stdev', 'SnR', 'Dip_p', 'Dip_value', 'KS_value', 'KS_p', 'ESS'],
)
Docstring:
function that uses the model (from create_model) to generate labels on new data
this function can also be used to test the quality of the model

inputs:
- pipe : scikit-learn pipeline object containing the random forest model and scaler objects (e.g., generated by create_mode

In [14]:
y_pred = make_preds(pipe, X_test, y_test=y_test, 
    feature_columns=[
        "Width",
        "Upper_bound",
        "Lower_bound",
        "Stdev",
        "SnR",
        "Dip_p",
        "Dip_value",
        "KS_value",
        "KS_p",
        "ESS"])


Accuracy: 0.9532710280373832
              precision    recall  f1-score   support

         Bad       0.99      0.95      0.97       359
        Good       0.80      0.96      0.87        69

    accuracy                           0.95       428
   macro avg       0.89      0.95      0.92       428
weighted avg       0.96      0.95      0.95       428

Feature Importance Ranking:
Width          0.311279
Stdev          0.254817
SnR            0.234973
KS_value       0.110907
Dip_value      0.054472
ESS            0.025940
Dip_p          0.005492
KS_p           0.002120
Upper_bound    0.000000
Lower_bound    0.000000
dtype: float64
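A report of this form (accuracy, per-class precision and recall, and a feature importance ranking) can be produced with standard scikit-learn calls. The sketch below is an illustration on synthetic data and is not the body of `make_preds`; the step names `"scaler"` and `"forest"` are assumptions about how such a pipeline might be laid out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data: the label depends on Width only
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["Width", "Stdev", "SnR"])
y = np.where(X["Width"] > 0.5, "Good", "Bad")

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("forest", RandomForestClassifier(random_state=42)),
])
pipe.fit(X[:200], y[:200])

# evaluate on the held-out rows
y_pred = pipe.predict(X[200:])
accuracy = accuracy_score(y[200:], y_pred)
print("Accuracy:", accuracy)
print(classification_report(y[200:], y_pred))

# importances come from the forest step and sum to 1
importance = pd.Series(
    pipe.named_steps["forest"].feature_importances_, index=X.columns
).sort_values(ascending=False)
print("Feature Importance Ranking:")
print(importance)
```

A feature that never varies cannot be used to split the data, so it receives zero importance. In the feature table shown earlier, `Upper_bound` and `Lower_bound` have identical values in every displayed row, which is consistent with their importance of 0.000000 here.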


## Save the model

You can later read in your saved model to apply it to other datasets.  Note that in order to use a saved model, you will need to be working with the same version of scikit-learn (and possibly other dependencies) that was used to create it.

In [15]:
save_model(pipe, filename="my_model.pkl")
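Because a saved model is tied to the library version that produced it, one option is to record that version alongside the model and check it at load time. The sketch below uses Python's `pickle` module directly; `save_with_version` and `load_with_version_check` are hypothetical helpers, not functions from `base9_ml_utils.py`, and the file format written by `save_model` is not assumed here.

```python
import pickle

import sklearn


def save_with_version(model, filename):
    """Hypothetical helper: pickle a model together with the scikit-learn version."""
    payload = {"model": model, "sklearn_version": sklearn.__version__}
    with open(filename, "wb") as f:
        pickle.dump(payload, f)


def load_with_version_check(filename):
    """Hypothetical helper: load the model and warn on a version mismatch."""
    with open(filename, "rb") as f:
        payload = pickle.load(f)
    if payload["sklearn_version"] != sklearn.__version__:
        print(
            f"Warning: model saved with scikit-learn {payload['sklearn_version']}, "
            f"but {sklearn.__version__} is installed."
        )
    return payload["model"]
```

As with any pickle file, only load models from sources you trust, since unpickling can execute arbitrary code.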