# Modeling
In this notebook, we'll model the data we previously prepared. Our notebook is laid out as follows:

1. Model Selection & Generation
2. Hyperparameter Optimization
3. Fine-Tuning (if needed)
4. Reporting Best Model(s) + Settings
5. Interpretation
6. Conclusion

Our eventual goal here is two-fold:

1. Accurately [and fairly] model the diabetes dataset
2. Interpret the results to find something worth recommending to those who want to reduce their risk of diabetes. This can be done via LIME/SHAP (i.e., an interpretable model that approximates the neural network) or by analyzing a simpler model's structure (e.g., regression coefficients, random forest decision boundaries).
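
To make the "interpretable model that approximates a more complex one" idea concrete, here is a minimal sketch of a global surrogate: a shallow decision tree fit to the predictions of a black-box model. This is a simpler relative of LIME (which fits local surrogates), not LIME or SHAP itself. The synthetic data, the `feature_i` names, and the choice of gradient boosting as the black box are all assumptions for illustration and are not taken from our dataset or `utils`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic 3-class data standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# The "black box" we want to explain.
black_box = GradientBoostingClassifier(n_estimators=30, random_state=0).fit(X, y)
black_box_pred = black_box.predict(X)

# The surrogate is trained on the black box's predictions, not the true labels,
# so its rules describe what the black box does.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X, black_box_pred)

# Fidelity: how often the surrogate agrees with the black box.
fidelity = (surrogate.predict(X) == black_box_pred).mean()
rules = export_text(surrogate, feature_names=feature_names)

print(f"surrogate fidelity: {fidelity:.3f}")
print(rules)
```

A low fidelity would mean the printed rules are a poor summary of the black box and should not be used for recommendations.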

In [1]:
# Environment Setup
from utils.model import *
from utils.dataset import *

***
## Model Selection & Generation

In [2]:
# generate lookup for models
models = {
    "tree": TreeClassifier(target="diabetes", path="../datasets/processed.parquet")
}

# train & test basic model
for model in models.values():
    # attempt to load, else train and test
    if not model.load_model():
        model.train_model()
    model.test_model()

<Train-Test Split Report>
Train: 512887 obs, 170945 no diabetes [0], 171126 pre-diabetes [1], 170816 diabetes [2]
Test: 128222 obs, 42758 no diabetes [0], 42577 pre-diabetes [1], 42887 diabetes [2]

<Test Report>
Precision: [no diabetes] 0.5816647597254004, [pre-diabetes] 0.43727664253642445, [diabetes] 0.49072269024577037
Recall: [no diabetes] 0.6658169231488844, [pre-diabetes] 0.22415858327265895, [diabetes] 0.6573786928439853
F1-Score: [no diabetes] 0.6209024884953436, [pre-diabetes] 0.29638370883343945, [diabetes] 0.5619549726427411
Support: [no diabetes] 42758, [pre-diabetes] 42577, [diabetes] 42887
Accuracy: 51.6339%
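
As a sanity check on the report above, the sketch below recomputes each F1-score as the harmonic mean of the reported precision and recall, and adds two numbers the report does not print: the macro-averaged F1 and the accuracy of always predicting the largest test class. The figures are copied from the report; the variable and function names are our own.

```python
# Per-class precision, recall, and support copied from the test report above.
precision = {
    "no diabetes": 0.5816647597254004,
    "pre-diabetes": 0.43727664253642445,
    "diabetes": 0.49072269024577037,
}
recall = {
    "no diabetes": 0.6658169231488844,
    "pre-diabetes": 0.22415858327265895,
    "diabetes": 0.6573786928439853,
}
support = {"no diabetes": 42758, "pre-diabetes": 42577, "diabetes": 42887}


def f1_from(p: float, r: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


f1 = {label: f1_from(precision[label], recall[label]) for label in precision}

# Unweighted mean over classes; reasonable here because the classes are balanced.
macro_f1 = sum(f1.values()) / len(f1)

# Accuracy of a model that always predicts the most frequent test class.
majority_baseline = max(support.values()) / sum(support.values())

for label, score in f1.items():
    print(f"{label}: F1 = {score:.4f}")
print(f"macro F1 = {macro_f1:.4f}")
print(f"majority-class baseline accuracy = {majority_baseline:.4f}")
```

Since the three classes are nearly equal in size, the baseline sits at roughly one third, so the 51.6% accuracy is well above chance. The low pre-diabetes recall is what drags the overall score down.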


***
## Hyperparameter Optimization

In [3]:
# optimize hyperparams
optimizer_results = {model_type: model.optimize_hyperparams(kfold=2) for model_type, model in models.items()}
print(optimizer_results)

Fitting 2 folds for each of 32 candidates, totalling 64 fits
[CV 1/2] END criterion=friedman_mse, learning_rate=1, loss=log_loss, max_depth=3, max_features=log2, min_samples_leaf=10, min_samples_split=2, n_estimators=100, n_iter_no_change=5, tol=0.0001;, score=0.564 total time= 2.9min
[CV 1/2] END criterion=friedman_mse, learning_rate=1, loss=log_loss, max_depth=3, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=100, n_iter_no_change=5, tol=0.0001;, score=0.565 total time= 2.9min
[CV 2/2] END criterion=friedman_mse, learning_rate=1, loss=log_loss, max_depth=3, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=100, n_iter_no_change=5, tol=0.0001;, score=0.569 total time= 2.9min
[CV 2/2] END criterion=friedman_mse, learning_rate=1, loss=log_loss, max_depth=3, max_features=log2, min_samples_leaf=10, min_samples_split=2, n_estimators=100, n_iter_no_change=5, tol=0.0001;, score=0.565 total time= 2.9min
[CV 1/2] END criterion=friedman_mse, lear
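
The log above resembles scikit-learn's verbose cross-validation output, where the number of fits is candidates times folds (here 32 × 2 = 64). The body of `optimize_hyperparams` is not shown in this notebook, so the following is only a sketch, assuming it wraps something like `GridSearchCV` over a `GradientBoostingClassifier`. The synthetic data and the tiny 2 × 2 grid are illustrative and are not the settings used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 3-class problem standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# Hypothetical grid: 2 x 2 = 4 candidates. The real grid has 32 candidates.
param_grid = {
    "max_depth": [2, 3],
    "min_samples_leaf": [2, 10],
}

search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid=param_grid,
    cv=2,                # matches kfold=2 in the cell above
    scoring="accuracy",
)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])
n_fits = n_candidates * search.n_splits_

print(f"{n_candidates} candidates x {search.n_splits_} folds = {n_fits} fits")
print("best params:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

With only two folds, each score is estimated from a single train/validation pair per fold, so small differences between candidates (such as 0.564 versus 0.565 in the log) may not be meaningful.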

***
## Fine-Tuning + Other Adjustments

***
## Best Model Report

***
## Interpretation

***
## Conclusion