# Create a Random Forest model from BASE-9 data


This notebook performs the following tasks:
- reads in posterior data from BASE-9
- generates features from these data 
- uses these features to train and test a random forest classifier from `scikit-learn`
- saves the model to a file

Here we use data from NGC 2682 (M67) to train the model; these data were hand labelled by Justyce.  If you need access to these data, please contact Aaron Geller.

Most of the "heavy lifting" is done by the code in the `base9_ml_utils.py` file.  See the comments and markdown in that code for more details.


___
*Authors:* Justyce Watson, Aaron Geller\
*Date:* August 2025


## Import all functions from the `base9_ml_utils.py` file

In [1]:
# import functions from .py file
from base9_ml_utils import *

# The lines below are useful if you plan to make changes to the base9_ml_utils.py file.
# They will allow the notebook to refresh when you save changes to the .py file.
#
# %load_ext autoreload
# %autoreload 2


## Read in `.res` files and create the features

The user should specify the data directory on their own computer.  The code assumes that this directory contains one `.res` file for each star with the filename containing the star ID.  (If there is additional text in the file name, the user can specify this in the code, using the `file_prefix` and/or `file_suffix` args so that the code can identify the star ID from the filename properly.)  
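As a rough illustration of how a star ID might be recovered from a file name, the sketch below strips an optional prefix and suffix from the file stem. The helper name `extract_star_id` and the example file name are hypothetical; the actual logic lives in `create_features` and may differ.

```python
from pathlib import Path


def extract_star_id(filename, file_prefix="NGC_2682_", file_suffix=""):
    """Hypothetical helper: return the star ID embedded in a .res file name."""
    stem = Path(filename).stem  # drop the directory and the .res extension
    if file_prefix and stem.startswith(file_prefix):
        stem = stem[len(file_prefix):]
    if file_suffix and stem.endswith(file_suffix):
        stem = stem[:-len(file_suffix)]
    return stem


# Example with a made-up file name that follows the default prefix
star_id = extract_star_id("data/NGC2682/jw_output/NGC_2682_608154449852709120.res")
print(star_id)  # 608154449852709120
```

The ID is kept as a string here; whether `create_features` stores it as a string or an integer is not specified in this notebook.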

We will use the `create_features` function imported from `base9_ml_utils.py`.


In [2]:
# run this cell to see information about this function
create_features?

Signature:
create_features(
    directory,
    column=0,
    max_nfiles=inf,
    file_prefix='NGC_2682_',
    file_suffix='',
    ess_num_samples=10000,
    random_seed=42,
)
Docstring:
function that will calculate all the features needed for the ML model 
Note that the file names for the res files must contain the ids (and can include a prefix and suffix)

inputs:
- directory : (string) path to the data directory that contains the res files from BASE-9
- column : (string) column number to use from the res file to use to calculate features
- max_nfiles : (int) maximum number of files to use
- file_prefix : (string) prefix in the res file names before the id 
- file_suffix : (string) suffix in the res file names after the id
- ess_num_samples : (int) number of samples to use in ess normal distribution
- random_seed : (int) random seed used for calculating ess

outputs:
- pandas DataFrame with the calculated features (

In [3]:
# directory on your computer where the .res data files are stored
directory = "data/NGC2682/jw_output"

# create a DataFrame with features for each star using the create_features function
model_cluster_statistic = create_features(directory)
 
# display the resulting DataFrame in the notebook
model_cluster_statistic

| | source_id | Width | Upper_bound | Lower_bound | Stdev | SnR | Dip_p | Dip_value | KS_value | KS_p | ESS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 608154449852709120 | 0.542966 | 1.183637 | 1.103694 | 0.237557 | 38.492771 | 0.000000 | 0.076963 | 0.449076 | 0.000000e+00 | 9593.069949 |
| 1 | 608303231815505920 | 0.755195 | 1.183637 | 1.103694 | 0.464201 | 20.268446 | 0.003637 | 0.006149 | 0.149306 | 1.174135e-119 | 10069.609067 |
| 2 | 608141294367793024 | 1.705887 | 1.183637 | 1.103694 | 0.725193 | 12.193566 | 0.000000 | 0.024109 | 0.130040 | 1.668327e-91 | 9995.857154 |
| 3 | 608068623521152384 | 1.344996 | 1.183637 | 1.103694 | 0.657152 | 14.096636 | 0.000085 | 0.007507 | 0.190449 | 5.030127e-195 | 10053.388651 |
| 4 | 608038764908563968 | 1.255172 | 1.183637 | 1.103694 | 0.639640 | 14.644034 | 0.000000 | 0.014813 | 0.191012 | 8.747132e-198 | 9450.160853 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1423 | 604694561637360640 | 1.555578 | 1.183637 | 1.103694 | 0.709386 | 12.669910 | 0.000000 | 0.021891 | 0.158272 | 2.156165e-133 | 9816.641309 |
| 1424 | 604711505283863808 | 2.474428 | 1.183637 | 1.103694 | 1.159964 | 7.203835 | 0.000000 | 0.118215 | 0.217860 | 1.226452e-253 | 9430.097529 |
| 1425 | 604703465105196416 | 1.633171 | 1.183637 | 1.103694 | 0.750779 | 11.732707 | 0.000000 | 0.052372 | 0.168097 | 5.439508e-153 | 10272.295674 |
| 1426 | 604712531781276928 | 1.240942 | 1.183637 | 1.103694 | 0.596285 | 15.420267 | 0.000000 | 0.012573 | 0.161306 | 1.046904e-139 | 10020.348694 |


## Read in data for training and testing the model

This dataset contains a hand-labelled sampling quality for each star that has a `.res` file in the dataset above.  The labels were created by Justyce Watson by visually inspecting the distributions in the `.res` files.

We will use the column `Single Sampling` as our label and keep only the rows where a label exists.

In [4]:
# Read in the data
import pandas as pd  # explicit import; otherwise pd is only available via the star import above
df1 = pd.read_csv('data/NGC2682/NGC2682_Age_Stats.csv', sep=',')

# Select only the rows where Single Sampling values exist
# and keep only the relevant columns
sampling_df = df1.loc[df1['Single Sampling'].notna(), ['source_id', 'Single Sampling']]

# Display this DataFrame in the notebook
sampling_df

| | source_id | Single Sampling |
|---|---|---|
| 0 | 597810107020313344 | Bad |
| 1 | 597830722862488064 | Bad |
| 2 | 598464900553093504 | Bad |
| 3 | 598525408052424960 | Bad |
| 4 | 598543206396991232 | Bad |
| ... | ... | ... |
| 1435 | 605170688827236736 | Bad |
| 1436 | 603848521800034176 | Bad |
| 1438 | 603868141210083712 | Bad |
| 1439 | 607987427163771520 | Good |


## Create the model
Here we use the `create_model` function imported from `base9_ml_utils.py`.  This function splits the data into training and testing subsets.  The training set is then further modified so that it contains equal numbers of "Good" and "Bad" labelled stars.
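One common way to obtain a balanced training set is to undersample the majority class. The sketch below shows that idea with toy labels. It is an assumption about the approach, not a copy of `create_model`, which may balance the classes differently. The helper name `balanced_indices` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def balanced_indices(labels, random_seed=42):
    """Return sorted row indices with every class undersampled to the rarest class size."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    # every class is cut down to the size of the smallest class
    n_keep = min(len(idx) for idx in by_class.values())
    rng = random.Random(random_seed)  # seeded for reproducibility
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_keep))
    return sorted(keep)


# Toy labels (made up): 6 "Bad" and 2 "Good"
toy_labels = ["Bad", "Bad", "Good", "Bad", "Bad", "Good", "Bad", "Bad"]
keep = balanced_indices(toy_labels)
balanced = [toy_labels[i] for i in keep]
print(balanced.count("Bad"), balanced.count("Good"))  # 2 2
```

The returned indices can be used to select the matching rows from a feature array or DataFrame. Only the training set should be balanced this way, so that the test set keeps the class proportions of the real data.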

In [5]:
# run this cell to see information about this function
create_model?

Signature:
create_model(
    features_df,
    label_df,
    label_column_name='Single Sampling',
    feature_columns=['Width', 'Upper_bound', 'Lower_bound', 'Stdev', 'SnR', 'Dip_p', 'Dip_value', 'KS_value', 'KS_p', 'ESS'],
    random_seed=42,
)
Docstring:
function that will create a random forest model using scikit-learn

inputs:
- features_df : (pandas DataFrame) contains all the features needed for the model, including an ID column (e.g., from create_features function)
- label_df : (pandas DataFrame) contains a label for each id in the features_df to train the model
- label_column_name : (string) the name of the column in label_df that has the desired label for training
- feature_columns : (list of strings) a list of column names in features_df to use for the model 
- random_seed : (int) random seed used for train_test_split and RandomForestClassifier

In [6]:
# create the model (returned as a scikit-learn pipeline object, here we call it "pipe")
pipe, X, y, X_train, y_train, X_test, y_test = create_model(model_cluster_statistic, sampling_df)

There are 161 training elements with classification = Bad
There are 161 training elements with classification = Good


## Use the model to generate labels
Here we use the `make_preds` function imported from `base9_ml_utils.py`.  We pass this function the model from `create_model` and the data to be labeled.  For this step we will pass the testing data.  We will also supply the known labels for the test data so that we can validate the quality of the model.  (Note that you can use `make_preds` without knowing the labels, as we will do in the `apply_model.ipynb` notebook.)

In [7]:
# run this cell to see information about this function
make_preds?

Signature:
make_preds(
    pipe,
    X,
    y_test=None,
    feature_columns=['Width', 'Upper_bound', 'Lower_bound', 'Stdev', 'SnR', 'Dip_p', 'Dip_value', 'KS_value', 'KS_p', 'ESS'],
)
Docstring:
function that uses the model (from create_model) to generate labels on new data
this function can also be used to test the quality of the model

inputs:
- pipe : scikit-learn pipeline object containing the random forest model and scaler objects (e.g., generated by create_model)
- X : (np array, or pandas DataFrame) contains features to pass to the model.  every row is a different star, every column is a feature (same order as y)
  note: if the user passes a DataFrame, it will be converted to numpy array using prepare_df_for_model
- y_test : (np array, optional) labels for X (in same order) that can be used to test the model
- feature_columns : (list of st

In [8]:
y_pred = make_preds(pipe, X_test, y_test=y_test, 
    feature_columns=[
        "Width",
        "Upper_bound",
        "Lower_bound",
        "Stdev",
        "SnR",
        "Dip_p",
        "Dip_value",
        "KS_value",
        "KS_p",
        "ESS"])


Accuracy: 0.9579439252336449
              precision    recall  f1-score   support

         Bad       1.00      0.95      0.97       346
        Good       0.82      1.00      0.90        82

    accuracy                           0.96       428
   macro avg       0.91      0.97      0.94       428
weighted avg       0.97      0.96      0.96       428

Feature Importance Ranking:
Width          0.356308
SnR            0.254617
Stdev          0.215236
KS_value       0.103967
Dip_value      0.039527
ESS            0.023257
Dip_p          0.004603
KS_p           0.002484
Upper_bound    0.000000
Lower_bound    0.000000
dtype: float64
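The report above can be cross-checked by hand. The counts in the sketch below are not printed by the notebook. They are inferred from the reported support, recall, and accuracy (all 82 Good stars recovered, and 410 of 428 test stars classified correctly), so treat them as a reconstruction.

```python
# Confusion counts inferred from the report above (not notebook output)
good_as_good = 82   # Good support is 82 with recall 1.00
good_as_bad = 0
bad_as_bad = 328    # 410 correct in total (0.9579... * 428) minus 82
bad_as_good = 18    # 346 Bad stars minus 328

total = good_as_good + good_as_bad + bad_as_bad + bad_as_good
accuracy = (good_as_good + bad_as_bad) / total
precision_good = good_as_good / (good_as_good + bad_as_good)
recall_good = good_as_good / (good_as_good + good_as_bad)
precision_bad = bad_as_bad / (bad_as_bad + good_as_bad)
recall_bad = bad_as_bad / (bad_as_bad + bad_as_good)

print(f"accuracy = {accuracy:.4f}")
print(f"Good: precision = {precision_good:.2f}, recall = {recall_good:.2f}")
print(f"Bad:  precision = {precision_bad:.2f}, recall = {recall_bad:.2f}")
```

Under this reconstruction, every misclassification in the test set is a Bad star labelled Good, which is why the Good precision (0.82) is lower than the Good recall (1.00).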


## Save the model

You can then read in your model to apply it to other datasets.  Note that in order to use a saved model, you will need to be working with the same version of scikit-learn (and possibly other dependencies) that was used to create it.

In [9]:
save_model(pipe, filename="my_model.pkl")
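Because a saved model is tied to library versions, one option is to store version information next to it. The sketch below is independent of `save_model`, whose internals are not shown in this notebook. It assumes plain `pickle` serialization, the helper names `save_with_versions` and `load_with_versions` are hypothetical, and a simple dictionary stands in for the pipeline.

```python
import pickle
import sys
import tempfile
from importlib import metadata
from pathlib import Path


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def save_with_versions(model, filename):
    """Hypothetical helper: pickle the model together with version information."""
    payload = {
        "model": model,
        "python": sys.version.split()[0],
        "scikit-learn": package_version("scikit-learn"),
    }
    with open(filename, "wb") as f:
        pickle.dump(payload, f)


def load_with_versions(filename):
    """Hypothetical helper: load the payload and warn if scikit-learn differs."""
    with open(filename, "rb") as f:
        payload = pickle.load(f)
    current = package_version("scikit-learn")
    if payload["scikit-learn"] != current:
        print(f"Warning: saved with scikit-learn {payload['scikit-learn']}, now {current}")
    return payload["model"]


# Round trip with a stand-in object in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my_model_with_versions.pkl"
    save_with_versions({"stand_in": "model"}, path)
    restored = load_with_versions(path)
```

As with any pickle file, only load files that come from a source you trust.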