# Modeling
In this notebook, we'll be modeling the data we've previously prepared. Our notebook will be laid out as follows:

1. Model Selection & Generation
2. Hyperparameter Optimization
3. Fine-Tuning (if needed)
4. Reporting Best Model(s) + Settings
5. Interpretation
6. Conclusion

Our eventual goal here is two-fold:

1. Accurately [and fairly] model the diabetes dataset
2. Interpret the results to find something worth recommending to those who want to reduce their risk of diabetes. This can be done via LIME/SHAP (i.e., post-hoc explanations that approximate the neural network with something interpretable) or by analyzing a simpler model's structure (e.g., regression coefficients or random forest decision boundaries).

In [1]:
# Environment Setup
from utils.model import *
from utils.dataset import *

***
## Model Selection & Generation

In [2]:
# generate lookup for models
models = {
    # "tree": TreeClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet", upsample=True),
    # "ffnn": MLPClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet", upsample=False, loss_balance=False),
    "log": LogClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet", upsample=True)
}

# manual search
# models["tree"].set_hyperparams({
#     "loss": "log_loss",
#     "learning_rate": 0.01,
#     "n_estimators": 100,
#     "criterion": "friedman_mse",
#     "min_samples_split": 5,
#     "min_samples_leaf": 5,
#     "max_depth": 8,
#     "n_iter_no_change": 5,
#     "max_features": "sqrt",
#     "tol": 0.0001
# })
# models["ffnn"].set_hyperparams({
#     "learning_rate": .0005,
#     "batch_size": 256,
#     "num_hidden": 8,
#     "hidden_size": [2048, 1024, 512, 256, 128, 64, 32, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.875, 0.75, 0.75, 0.5, 0.5, 0.25, 0.25],
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "learning_rate": .001,
#     "batch_size": 32,
#     "num_hidden": 4,
#     "hidden_size": [128, 64, 64, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.5, 0.4, 0.3, 0.2],
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "input_size": 21,
#     "output_size": 3,
#     "hidden_size": 1024,
#     "num_hidden": 4,
#     "num_epochs": 50,
#     "batch_size": 64,
#     "learning_rate": 5e-05,
#     "dropout_rate": 0.9,
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "input_size": 21,
#     "output_size": 3,
#     "hidden_size": 512,
#     "num_hidden": 2,
#     "num_epochs": 50,
#     "batch_size": 128,
#     "dropout_rate": 0.4,
#     "learning_rate": 0.0005
# })

# train & test basic model
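skip_models = []  # e.g. ["ffnn", "log"]
for mt, model in models.items():
    # models listed in skip_models are always retrained; the rest are
    # trained only if no saved model could be loaded. All are then tested.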
    if (mt in skip_models) or (not model.load_model()):
        model.train_model(verbose=2)
    model.test_model()

<Train-Test Split Report>
Train: 512886 obs, 170962 no diabetes [0], 170962 pre-diabetes [1], 170962 diabetes [2]
Test: 50736 obs, 42741 no diabetes [0], 926 pre-diabetes [1], 7069 diabetes [2]
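
The equal training counts above (170962 per class) are what random upsampling of the minority classes would produce. The snippet below is a minimal, standard-library sketch of that idea. It is not the implementation behind `upsample=True`, which lives in `utils` and isn't shown here; the function name `upsample_to_majority` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def upsample_to_majority(rows, labels, seed=0):
    """Randomly resample minority classes (with replacement) up to the majority count."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # draw only as many extra samples as the class is short of the target
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# toy data (hypothetical): 6 / 1 / 2 examples for classes 0 / 1 / 2
rows = list(range(9))
labels = [0, 0, 0, 0, 0, 0, 1, 2, 2]
up_rows, up_labels = upsample_to_majority(rows, labels)
print(Counter(up_labels))
```

Only the training split should be resampled this way. The test split in the report keeps its original, imbalanced class distribution, which is what makes the test metrics meaningful.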

<Test Report>
Precision: [no diabetes] 0.9045211473018959, [pre-diabetes] 0.26678220681686887, [diabetes] 0.024674861927667914
Recall: [no diabetes] 0.6529795746472942, [pre-diabetes] 0.3266374310369218, [diabetes] 0.2991360691144708
F1-Score: [no diabetes] 0.7584379585847056, [pre-diabetes] 0.29369117272958534, [diabetes] 0.0455892034233048
Support: [no diabetes] 42741, [pre-diabetes] 7069, [diabetes] 926
Accuracy: 60.1053%
Macro-F1: 0.3659
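
As a sanity check on the report above, the summary metrics can be recomputed from the per-class rows. This is a small sketch using only the printed numbers; the variable names are illustrative, and the class labels are copied in the order the report prints them.

```python
# per-class values copied from the test report above, in the order printed
report = [
    # (label, precision, recall, support)
    ("no diabetes", 0.9045211473018959, 0.6529795746472942, 42741),
    ("pre-diabetes", 0.26678220681686887, 0.3266374310369218, 7069),
    ("diabetes", 0.024674861927667914, 0.2991360691144708, 926),
]

def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

f1_scores = [f1(p, r) for _, p, r, _ in report]
# macro-F1 is the unweighted mean, so every class counts equally
macro_f1 = sum(f1_scores) / len(f1_scores)

# recall * support recovers the number of correct predictions per class
n_total = sum(s for *_, s in report)
accuracy = sum(r * s for _, _, r, s in report) / n_total

print(f"Macro-F1: {macro_f1:.4f}")
print(f"Accuracy: {accuracy:.4%}")
```

Because macro-F1 weights the three classes equally, the two weak minority-class scores pull it well below the accuracy, which is dominated by the majority class.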


***
## Hyperparameter Optimization

In [3]:
# optimize hyperparams
# optimizer_results = {model_type: model.optimize_hyperparams(kfold=2) for model_type, model in models.items()}
# print(optimizer_results)
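
The commented-out call above passes `kfold=2`. What `optimize_hyperparams` does internally isn't shown in this notebook, so the following is only a generic sketch of how k-fold cross-validation splits sample indices into train/validation folds; `kfold_indices` is a hypothetical helper, not part of `utils`.

```python
import random

def kfold_indices(n_samples, k=2, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample lands in exactly one validation fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i, val_idx in enumerate(folds):
        # training set is every fold except the current validation fold
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx

splits = list(kfold_indices(10, k=2))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

With `k=2`, each hyperparameter setting is fit twice on half of the data, which is cheap but gives a noisier estimate than a larger `k`. With classes as imbalanced as these, stratified folds are usually the safer choice.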

***
## Fine-Tuning + Other Adjustments

***
## Best Model Report

In [4]:
models["log"].explain_model()

{
    "high_bp": 0.002713658256150768,
    "high_chol": -0.13198255760527797,
    "chol_check": -0.22234891293865036,
    "bmi": 0.011753968749756911,
    "smoker": 0.08978099772083172,
    "stroke": -0.22651736522618188,
    "heart_disease": -0.007007909841112545,
    "physical_activity": -0.5065999492974543,
    "fruits": -0.0028655279496007553,
    "veggies": 0.02779148873481901,
    "heavy_drinker": 0.02192203232846939,
    "healthcare": 0.027291903085658757,
    "no_doc_bc_cost": -0.060057690646198744,
    "general_health": -0.31966943523510644,
    "mental_health": -0.05128369479380811,
    "physical_health": 0.0706631119256953,
    "diff_walk": -0.2925363752254724,
    "sex": -0.0009257155550026452,
    "age": -0.013519341477834528,
    "education": 0.10230037028270968,
    "income": -0.0066423898317053325
}


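
One way to read the coefficients above, assuming they are log-odds weights from the logistic regression for a single class (the output does not say which class they refer to, or how the features were scaled), is to rank them by magnitude and exponentiate them into odds ratios. The sketch below copies a handful of values from the output; `coefs` and `ranked` are illustrative names.

```python
import math

# subset of coefficients copied from the explain_model() output above
coefs = {
    "physical_activity": -0.5065999492974543,
    "general_health": -0.31966943523510644,
    "diff_walk": -0.2925363752254724,
    "education": 0.10230037028270968,
    "smoker": 0.08978099772083172,
    "high_bp": 0.002713658256150768,
}

# rank by absolute size; under the log-odds assumption, exp(coef) is the
# multiplicative change in odds for a one-unit increase in the feature
ranked = sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, coef in ranked:
    print(f"{name:>18}  coef={coef:+.4f}  odds_ratio={math.exp(coef):.3f}")
```

An odds ratio below 1 means the feature lowers the odds of the class in question, and one above 1 raises them. Magnitudes are only comparable across features if the inputs were put on a common scale during preprocessing.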

***
## Interpretation

***
## Conclusion