# Modeling
In this notebook, we'll model the data we've previously prepared. Our notebook will be laid out as follows:

1. Model Selection & Generation
2. Hyperparameter Optimization
3. Fine-Tuning (if needed)
4. Reporting Best Model(s) + Settings
5. Interpretation
6. Conclusion

Our eventual goal here is two-fold:

1. Accurately [and fairly] model the diabetes dataset
2. Interpret the results to find something worth recommending to those who want to reduce their risk of diabetes. This can be done via LIME/SHAP (i.e., an interpretable model that approximates the neural network) or by analyzing a simpler model's structure (e.g., regression coefficients or random forest decision boundaries).
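
To make the second route concrete, here is a minimal sketch of reading a simple model's structure. It does not use the project's `LogClassifier`; it assumes scikit-learn's `LogisticRegression` and synthetic data, and the feature names `risk_factor` and `protective_factor` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the diabetes dataset):
# feature 0 raises the odds of the positive class, feature 1 lowers them.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

clf = LogisticRegression().fit(X, y)

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase in that feature, holding the others fixed.
feature_names = ["risk_factor", "protective_factor"]  # hypothetical names
odds_ratios = dict(zip(feature_names, np.exp(clf.coef_[0])))

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio above 1 suggests the feature is associated with higher risk, and one below 1 with lower risk. This reading only holds when features are on comparable scales and are not strongly correlated.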

In [1]:
# Environment Setup
from utils.model import *
from utils.dataset import *

***
## Model Selection & Generation

In [2]:
# generate lookup for models
models = {
    # "tree": TreeClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet", upsample=True),
    # "ffnn": MLPClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet", upsample=False, loss_balance=False),
    "log": LogClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet", upsample=True)
}

# manual search
# models["tree"].set_hyperparams({
#     "loss": "log_loss",
#     "learning_rate": 0.01,
#     "n_estimators": 100,
#     "criterion": "friedman_mse",
#     "min_samples_split": 5,
#     "min_samples_leaf": 5,
#     "max_depth": 8,
#     "n_iter_no_change": 5,
#     "max_features": "sqrt",
#     "tol": 0.0001
# })
# models["ffnn"].set_hyperparams({
#     "learning_rate": .0005,
#     "batch_size": 256,
#     "num_hidden": 8,
#     "hidden_size": [2048, 1024, 512, 256, 128, 64, 32, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.875, 0.75, 0.75, 0.5, 0.5, 0.25, 0.25],
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "learning_rate": .001,
#     "batch_size": 32,
#     "num_hidden": 4,
#     "hidden_size": [128, 64, 64, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.5, 0.4, 0.3, 0.2],
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "input_size": 21,
#     "output_size": 3,
#     "hidden_size": 1024,
#     "num_hidden": 4,
#     "num_epochs": 50,
#     "batch_size": 64,
#     "learning_rate": 5e-05,
#     "dropout_rate": 0.9,
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "input_size": 21,
#     "output_size": 3,
#     "hidden_size": 512,
#     "num_hidden": 2,
#     "num_epochs": 50,
#     "batch_size": 128,
#     "dropout_rate": 0.4,
#     "learning_rate": 0.0005
# })

# train & test basic model
# models listed in skip_models skip the cached load and are always retrained
skip_models = ["log"]  # ["ffnn", "log"]
for mt, model in models.items():
    # retrain if forced via skip_models or if no saved model could be loaded
    if (mt in skip_models) or (not model.load_model()):
        model.train_model(verbose=2)
    model.test_model()

<Train-Test Split Report>
Train: 512886 obs, 170962 no diabetes [0], 170962 pre-diabetes [1], 170962 diabetes [2]
Test: 50736 obs, 42741 no diabetes [0], 926 pre-diabetes [1], 7069 diabetes [2]
Epoch 1, change: 1.00000000
Epoch 2, change: 0.25288282
Epoch 3, change: 0.13426843
Epoch 4, change: 0.09815333
Epoch 5, change: 0.05010894
Epoch 6, change: 0.02040064
Epoch 7, change: 0.01590639
Epoch 8, change: 0.00988835
Epoch 9, change: 0.00781600
Epoch 10, change: 0.00618078
Epoch 11, change: 0.00414695
Epoch 12, change: 0.00408930
Epoch 13, change: 0.00379298
Epoch 14, change: 0.00350046
Epoch 15, change: 0.00410844
Epoch 16, change: 0.00481783
Epoch 17, change: 0.00324526
Epoch 18, change: 0.00323737
Epoch 19, change: 0.00317722
Epoch 20, change: 0.00313871
Epoch 21, change: 0.00308889
Epoch 22, change: 0.00306387
Epoch 23, change: 0.00303466
Epoch 24, change: 0.00300925
Epoch 25, change: 0.00296886
Epoch 26, change: 0.00294275
Epoch 27, change: 0.00291758
Epoch 28, change: 0.00289285
Epo

***
## Hyperparameter Optimization

In [3]:
# optimize hyperparams
# optimizer_results = {model_type: model.optimize_hyperparams(kfold=2) for model_type, model in models.items()}
# print(optimizer_results)
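
The internals of `optimize_hyperparams` are not shown in this notebook. As a rough sketch of what a 2-fold search could look like, the block below uses scikit-learn's `GridSearchCV` on synthetic data. The parameter grid, the data, and the choice of `balanced_accuracy` as the score are all assumptions for illustration, not the project's actual settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class problem with 21 features (stand-in for the real dataset).
X, y = make_classification(
    n_samples=300,
    n_features=21,
    n_informative=5,
    n_classes=3,
    random_state=0,
)

# Hypothetical grid: only the inverse regularization strength C is searched.
param_grid = {"C": [0.01, 0.1, 1.0]}

# cv=2 mirrors the kfold=2 argument in the commented-out call above.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=2,
    scoring="balanced_accuracy",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Two folds keep the search cheap but give a noisy estimate, so more folds may be worth the cost once the grid has been narrowed.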

***
## Fine-Tuning + Other Adjustments

***
## Best Model Report

In [4]:
models["log"].explain_model()

{'high_bp': 0.0,
 'high_chol': 0.014954375993543193,
 'chol_check': 0.0,
 'bmi': -0.05122772492138329,
 'smoker': -0.6397699932031862,
 'stroke': -0.4028307245204736,
 'heart_disease': -0.3341685368854179,
 'physical_activity': 0.022742793137033528,
 'fruits': -0.15132944570789003,
 'veggies': 0.04003559709174568,
 'heavy_drinker': 0.0,
 'healthcare': 0.0,
 'no_doc_bc_cost': 0.0046205032673512884,
 'general_health': 0.14369490914009092,
 'mental_health': 0.05993890450959239,
 'physical_health': 0.08058307309086526,
 'diff_walk': 0.0,
 'sex': -0.02492717915350646,
 'age': 0.0,
 'education': -0.2332485206103297,
 'income': -0.40735943112418443}
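
The dictionary above is easier to read once sorted. The sketch below copies the reported values verbatim and ranks them by absolute magnitude, listing exact zeros separately. It assumes the values are per-feature weights from the logistic model; which class they refer to, and how `explain_model` computes them, is not shown here. The helper `rank_weights` is hypothetical.

```python
# Values copied from the explain_model() output above.
weights = {
    "high_bp": 0.0,
    "high_chol": 0.014954375993543193,
    "chol_check": 0.0,
    "bmi": -0.05122772492138329,
    "smoker": -0.6397699932031862,
    "stroke": -0.4028307245204736,
    "heart_disease": -0.3341685368854179,
    "physical_activity": 0.022742793137033528,
    "fruits": -0.15132944570789003,
    "veggies": 0.04003559709174568,
    "heavy_drinker": 0.0,
    "healthcare": 0.0,
    "no_doc_bc_cost": 0.0046205032673512884,
    "general_health": 0.14369490914009092,
    "mental_health": 0.05993890450959239,
    "physical_health": 0.08058307309086526,
    "diff_walk": 0.0,
    "sex": -0.02492717915350646,
    "age": 0.0,
    "education": -0.2332485206103297,
    "income": -0.40735943112418443,
}


def rank_weights(weights):
    """Split weights into (nonzero ranked by |value|, names with exact zeros)."""
    zeroed = [name for name, value in weights.items() if value == 0.0]
    nonzero = [(name, value) for name, value in weights.items() if value != 0.0]
    ranked = sorted(nonzero, key=lambda item: abs(item[1]), reverse=True)
    return ranked, zeroed


ranked, zeroed = rank_weights(weights)
for name, value in ranked:
    print(f"{name:>18}: {value:+.4f}")
print("zeroed:", zeroed)
```

Weights of exactly zero are a pattern typical of L1-style regularization, though the penalty used by `LogClassifier` is not confirmed here. The magnitudes are only comparable across features if the inputs were scaled consistently during preprocessing.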

***
## Interpretation

***
## Conclusion