# Modeling
In this notebook, we'll be modeling the data we've previously prepared. Our notebook will be laid out as follows:

1. Model Selection & Generation
2. Hyperparameter Optimization
3. Fine-Tuning (if needed)
4. Reporting Best Model(s) + Settings
5. Interpretation
6. Conclusion

Our eventual goal here is two-fold:

1. Accurately [and fairly] model the diabetes dataset
2. Interpret the results to find recommendations for people who want to reduce their risk of diabetes. This can be done via LIME/SHAP (i.e., an interpretable explanation that approximates the neural network) or by analyzing a simpler model's structure (e.g., regression coefficients or random forest decision boundaries).

In [1]:
# Environment Setup
from utils.model import *
from utils.dataset import *

***
## Model Selection & Generation

In [2]:
# generate lookup for models
models = {
    # "tree": TreeClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet", upsample=True),
    # "ffnn": MLPClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet", upsample=False, loss_balance=False),
    "log": LogClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet", upsample=True)
}

# manual search
# models["tree"].set_hyperparams({
#     "loss": "log_loss",
#     "learning_rate": 0.01,
#     "n_estimators": 100,
#     "criterion": "friedman_mse",
#     "min_samples_split": 5,
#     "min_samples_leaf": 5,
#     "max_depth": 8,
#     "n_iter_no_change": 5,
#     "max_features": "sqrt",
#     "tol": 0.0001
# })
# models["ffnn"].set_hyperparams({
#     "learning_rate": .0005,
#     "batch_size": 256,
#     "num_hidden": 8,
#     "hidden_size": [2048, 1024, 512, 256, 128, 64, 32, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.875, 0.75, 0.75, 0.5, 0.5, 0.25, 0.25],
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "learning_rate": .001,
#     "batch_size": 32,
#     "num_hidden": 4,
#     "hidden_size": [128, 64, 64, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.5, 0.4, 0.3, 0.2],
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "input_size": 21,
#     "output_size": 3,
#     "hidden_size": 1024,
#     "num_hidden": 4,
#     "num_epochs": 50,
#     "batch_size": 64,
#     "learning_rate": 5e-05,
#     "dropout_rate": 0.9,
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "input_size": 21,
#     "output_size": 3,
#     "hidden_size": 512,
#     "num_hidden": 2,
#     "num_epochs": 50,
#     "batch_size": 128,
#     "dropout_rate": 0.4,
#     "learning_rate": 0.0005
# })

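# train & test basic model
# models listed in skip_models skip loading a saved model and are always retrained
skip_models = ["log"]  # ["ffnn", "log"]
for mt, model in models.items():
    # retrain if loading is skipped or no saved model could be loaded, then test
    if (mt in skip_models) or (not model.load_model()):
        model.train_model(verbose=2)
    model.test_model()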

<Train-Test Split Report>
Train: 512886 obs, 170962 no diabetes [0], 170962 pre-diabetes [1], 170962 diabetes [2]
Test: 50736 obs, 42741 no diabetes [0], 926 pre-diabetes [1], 7069 diabetes [2]
Epoch 1, change: 1.00000000
Epoch 2, change: 0.25225974
Epoch 3, change: 0.13470629
Epoch 4, change: 0.09814960
Epoch 5, change: 0.05012524
Epoch 6, change: 0.02037335
Epoch 7, change: 0.01588300
Epoch 8, change: 0.00989571
Epoch 9, change: 0.00782193
Epoch 10, change: 0.00618506
Epoch 11, change: 0.00414319
Epoch 12, change: 0.00409114
Epoch 13, change: 0.00379375
Epoch 14, change: 0.00350049
Epoch 15, change: 0.00411181
Epoch 16, change: 0.00482170
Epoch 17, change: 0.00324488
Epoch 18, change: 0.00323704
Epoch 19, change: 0.00317663
Epoch 20, change: 0.00313845
Epoch 21, change: 0.00308866
Epoch 22, change: 0.00306360
Epoch 23, change: 0.00303443
Epoch 24, change: 0.00300900
Epoch 25, change: 0.00296859
Epoch 26, change: 0.00294248
Epoch 27, change: 0.00291729
Epoch 28, change: 0.00289259
Epo
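
The split report above shows equal training counts per class (170962 each), while the test set keeps its original imbalance. That pattern is consistent with random upsampling applied to the training data only. The sketch below illustrates the idea under that assumption. The `upsample` helper and the toy data are hypothetical and are not the implementation behind `upsample=True` in `utils.dataset`.

```python
import random
from collections import Counter, defaultdict


def upsample(rows, labels, seed=0):
    """Resample every minority class with replacement up to the majority class size.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)

    # group rows by class label
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # every class is grown to the size of the largest class
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# toy training set with a 6 / 1 / 2 class imbalance
toy_labels = [0] * 6 + [1] + [2] * 2
toy_rows = [[i] for i in range(len(toy_labels))]

up_rows, up_labels = upsample(toy_rows, toy_labels)
counts = Counter(up_labels)
print(counts)
```

Upsampling only the training split leaves the test distribution untouched, so test metrics still reflect the real class frequencies.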

***
## Hyperparameter Optimization

In [3]:
# optimize hyperparams
# optimizer_results = {model_type: model.optimize_hyperparams(kfold=2) for model_type, model in models.items()}
# print(optimizer_results)
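
The internals of `optimize_hyperparams` are not shown in this notebook. Assuming `kfold=2` means 2-fold cross-validation, each candidate setting would be trained on one half of the training data and validated on the other, then the roles swap. The `kfold_indices` helper below is a hypothetical, minimal sketch of that index split.

```python
import random


def kfold_indices(n_samples, k=2, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation.

    Hypothetical helper for illustration only.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # deal the shuffled indices into k roughly equal folds
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


splits = list(kfold_indices(10, k=2))
for train_idx, val_idx in splits:
    print(sorted(train_idx), sorted(val_idx))
```

A hyperparameter search would average the validation score over the folds for each candidate and keep the best-scoring setting.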

***
## Fine-Tuning + Other Adjustments

***
## Best Model Report

In [4]:
models["log"].explain_model()

{'high_bp': 0.0,
 'high_chol': 0.04005244483501908,
 'chol_check': 0.14368574427711195,
 'bmi': -0.3341266441218481,
 'smoker': 0.0,
 'stroke': 0.0046318919658423375,
 'heart_disease': 0.0,
 'physical_activity': -0.15132506947633687,
 'fruits': 0.0,
 'veggies': 0.0,
 'heavy_drinker': 0.05994501018275361,
 'healthcare': -0.05123929528640949,
 'no_doc_bc_cost': 0.0,
 'general_health': 0.08061386743733423,
 'mental_health': -0.4073858118266818,
 'physical_health': -0.40283705380462387,
 'diff_walk': -0.0249101951252151,
 'sex': -0.639927900219654,
 'age': -0.233271780474539,
 'education': 0.022629078161513983,
 'income': 0.014953609073340274}
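
One way to read the output above is to rank features by absolute weight and list the features whose weight is exactly zero. The sketch below does that with the values copied from the output. It assumes the values are per-feature weights where a larger magnitude means a stronger influence. How `explain_model` reduces the three classes to one number per feature, and how the features were scaled, is not shown here, so the signs and magnitudes should be read with caution.

```python
# values copied from the explain_model() output above
weights = {
    "high_bp": 0.0,
    "high_chol": 0.04005244483501908,
    "chol_check": 0.14368574427711195,
    "bmi": -0.3341266441218481,
    "smoker": 0.0,
    "stroke": 0.0046318919658423375,
    "heart_disease": 0.0,
    "physical_activity": -0.15132506947633687,
    "fruits": 0.0,
    "veggies": 0.0,
    "heavy_drinker": 0.05994501018275361,
    "healthcare": -0.05123929528640949,
    "no_doc_bc_cost": 0.0,
    "general_health": 0.08061386743733423,
    "mental_health": -0.4073858118266818,
    "physical_health": -0.40283705380462387,
    "diff_walk": -0.0249101951252151,
    "sex": -0.639927900219654,
    "age": -0.233271780474539,
    "education": 0.022629078161513983,
    "income": 0.014953609073340274,
}

# non-zero weights, largest magnitude first
ranked = sorted(
    ((name, w) for name, w in weights.items() if w != 0.0),
    key=lambda item: abs(item[1]),
    reverse=True,
)

# features the model gave no weight at all
zeroed = [name for name, w in weights.items() if w == 0.0]

for name, w in ranked:
    print(f"{name:>18}: {w:+.4f}")
print("zero weight:", zeroed)
```

In this output, `sex`, `mental_health`, `physical_health`, and `bmi` carry the largest magnitudes, and six features have a weight of exactly zero.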

***
## Interpretation

***
## Conclusion