# Modeling
In this notebook, we'll model the data we prepared previously. Our notebook is laid out as follows:

1. Model Selection & Generation
2. Hyperparameter Optimization
3. Fine-Tuning (if needed)
4. Reporting Best Model(s) + Settings
5. Interpretation
6. Conclusion

Our eventual goal here is two-fold:

1. Accurately [and fairly] model the diabetes dataset
2. Interpret the results to find recommendations for people who want to reduce their risk of diabetes. This can be done via LIME/SHAP (i.e., post-hoc explanations that approximate the neural network's behavior) or by analyzing a simpler model's structure (e.g., regression coefficients, random forest decision boundaries)
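
As a rough illustration of the "regression coefficients" route mentioned above, the sketch below converts log-odds coefficients into odds ratios and ranks them by strength. The feature names and coefficient values are hypothetical placeholders, not values fitted on the diabetes dataset, and the sketch assumes a binary (or one-vs-rest) logistic model rather than the exact setup of `LogClassifier`.

```python
import math

# Hypothetical coefficients for illustration only (NOT fitted on the diabetes dataset).
coefficients = {"feature_a": 0.40, "feature_b": -0.25, "feature_c": 0.05}


def odds_ratios(coefs):
    """Map each log-odds coefficient to an odds ratio via exp(coef)."""
    return {name: math.exp(value) for name, value in coefs.items()}


# Rank features by the magnitude of their effect on the log-odds.
ranked = sorted(
    odds_ratios(coefficients).items(),
    key=lambda item: abs(math.log(item[1])),
    reverse=True,
)

for name, ratio in ranked:
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds per unit increase)")
```

An odds ratio above 1 means the feature is associated with higher odds of the positive class, holding the other features fixed. The magnitudes are only comparable across features if the inputs share a scale.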

In [2]:
# Environment Setup
from utils.model import *
from utils.dataset import *

***
## Model Selection & Generation

In [7]:
# shared dataset path and target column for every model
DATA_PATH = "../datasets/pre_split_processed.parquet"
TARGET = "diabetes"

# generate lookup for models
models = {
    "tree": TreeClassifier(target=TARGET, path=DATA_PATH),
    "ffnn": MLPClassifier(target=TARGET, path=DATA_PATH),
    "log": LogClassifier(target=TARGET, path=DATA_PATH)
}

# manual search
models["tree"].set_hyperparams({
    "loss": "log_loss",
    "learning_rate": 0.01,
    "n_estimators": 1000,
    "criterion": "friedman_mse",
    "min_samples_split": 5,
    "min_samples_leaf": 5,
    "max_depth": 8,
    "n_iter_no_change": 5,
    "max_features": 1000,
    "tol": 0.0001
})
# models["ffnn"].set_hyperparams({
#     "learning_rate": .0005,
#     "batch_size": 256,
#     "num_hidden": 8,
#     "hidden_size": [2048, 1024, 512, 256, 128, 64, 32, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.875, 0.75, 0.75, 0.5, 0.5, 0.25, 0.25],
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "learning_rate": .001,
#     "batch_size": 32,
#     "num_hidden": 4,
#     "hidden_size": [128, 64, 64, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.5, 0.4, 0.3, 0.2],
#     "classify_fn": "sigmoid"
# })
models["ffnn"].set_hyperparams({
    "learning_rate": .00005,
    "batch_size": 256,
    "num_hidden": 3,
    "hidden_size": 2048,
    "num_epochs": 50,
    "dropout_rate": 0.8,
    "classify_fn": "sigmoid"
})
# models["ffnn"].set_hyperparams({
#     "input_size": 21,
#     "output_size": 3,
#     "hidden_size": 2048,
#     "num_hidden": 3,
#     "num_epochs": 50,
#     "batch_size": 128,
#     "learning_rate": 0.00005,
#     "dropout_rate": 0.8,
#     "classify_fn": "sigmoid"
# })

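# load each model if a saved copy exists, otherwise train it; then evaluate on the test split
for model_type, model in models.items():
    # attempt to load; train only if loading fails
    if not model.load_model():
        model.train_model(verbose=2)
    # testing runs in both cases
    model.test_model()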

<Train-Test Split Report>
Train: 512886 obs, 170962 no diabetes [0], 170962 pre-diabetes [1], 170962 diabetes [2]
Test: 50736 obs, 42741 no diabetes [0], 926 pre-diabetes [1], 7069 diabetes [2]

<Test Report>
Precision: [no diabetes] 0.8620768242655286, [pre-diabetes] 0.05175718849840256, [diabetes] 0.5878453038674033
Recall: [no diabetes] 0.9735148920240518, [pre-diabetes] 0.08747300215982722, [diabetes] 0.07525816947234404
F1-Score: [no diabetes] 0.914413177008362, [pre-diabetes] 0.06503412284223203, [diabetes] 0.13343365939302734
Support: [no diabetes] 42741, [pre-diabetes] 926, [diabetes] 7069
Accuracy: 83.2190%
Macro-F1: 0.3710

<Test Report>
Precision: [no diabetes] 0.8560137904117608, [pre-diabetes] 0.01050328227571116, [diabetes] 0.19072632944228274
Recall: [no diabetes] 0.6157787604407946, [pre-diabetes] 0.05183585313174946, [diabetes] 0.4160418729664733
F1-Score: [no diabetes] 0.7162899560466477, [pre-diabetes] 0.017467248908296942, [diabetes] 0.2615500911556761
Support: [no diabetes] 42741, [pre-diabetes] 926, [diabetes] 7069
Accuracy: 57.7657%
Macro-F1: 0.3318

<Test Report>
Precision: [no diabetes] 0.8764259149295246, [pre-diabetes] 0.2884435537742151, [diabetes]
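
To make the reported metrics easier to check, here is a minimal sketch that recomputes the per-class F1 scores and the Macro-F1 from the precision and recall values in the second test report above. It assumes the reports use the standard definitions (F1 as the harmonic mean of precision and recall, Macro-F1 as the unweighted mean over classes); the helper name `f1_score` is ours and is not part of `utils.model`.

```python
# Per-class precision and recall copied from the second test report in this notebook.
precision = {
    "no diabetes": 0.8560137904117608,
    "pre-diabetes": 0.01050328227571116,
    "diabetes": 0.19072632944228274,
}
recall = {
    "no diabetes": 0.6157787604407946,
    "pre-diabetes": 0.05183585313174946,
    "diabetes": 0.4160418729664733,
}


def f1_score(p, r):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


per_class_f1 = {label: f1_score(precision[label], recall[label]) for label in precision}

# Macro-F1 is the unweighted mean, so every class counts equally regardless of support.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for label, score in per_class_f1.items():
    print(f"{label}: F1 = {score:.4f}")
print(f"Macro-F1: {macro_f1:.4f}")
```

Because Macro-F1 weights the three classes equally, the weak pre-diabetes score pulls the average down even though that class accounts for only 926 of the 50,736 test observations. Accuracy, by contrast, is dominated by the majority class.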

***
## Hyperparameter Optimization

In [4]:
# optimize hyperparams
optimizer_results = {model_type: model.optimize_hyperparams(kfold=2) for model_type, model in models.items()}
print(optimizer_results)

Fitting 2 folds for each of 32 candidates, totalling 64 fits


KeyboardInterrupt: 
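
The log line above reports 32 candidates evaluated over 2 folds, for 64 fits in total. The sketch below shows where numbers like these come from in an exhaustive grid search: each candidate is one combination of hyperparameter values, and each candidate is fit once per fold. The grid itself is a hypothetical example (five hyperparameters with two values each); the actual grid used inside `optimize_hyperparams` is not shown in this notebook.

```python
from itertools import product

# Hypothetical search grid: five hyperparameters, two values each (2**5 = 32 candidates).
# The real grid used by optimize_hyperparams is not shown in this notebook.
param_grid = {
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 1000],
    "max_depth": [4, 8],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 5],
}
kfold = 2

# Every candidate is one combination of values, one per hyperparameter.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per fold.
total_fits = len(candidates) * kfold
print(f"Fitting {kfold} folds for each of {len(candidates)} candidates, totalling {total_fits} fits")
```

Since the fit count grows multiplicatively with each added hyperparameter, trimming the grid or sampling from it at random is a common way to keep a search like this from running long enough to need an interrupt.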

***
## Fine-Tuning + Other Adjustments

***
## Best Model Report

***
## Interpretation

***
## Conclusion