# Modeling
In this notebook, we model the data prepared previously. The notebook is laid out as follows:

1. Model Selection & Generation
2. Hyperparameter Optimization
3. Fine-Tuning (if needed)
4. Reporting Best Model(s) + Settings
5. Interpretation
6. Conclusion

Our eventual goal here is two-fold:

1. Accurately [and fairly] model the diabetes dataset
2. Interpret the results to find recommendations for people who want to reduce their risk of diabetes. This can be done via LIME/SHAP (i.e., an interpretable model that approximates the neural network) or by analyzing a simpler model's structure (e.g., regression coefficients, random forest decision boundaries).
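
As a small illustration of the "simpler model" route in the second goal, the sketch below converts logistic-regression coefficients into odds ratios, which is one common way to read such a model. The helper `odds_ratios`, the feature names, and the coefficient values are all hypothetical and are not fitted on the diabetes dataset. The reading also assumes a binary (or one-vs-rest) logistic model, so it may not carry over directly to the three-class setup used here.

```python
import math


def odds_ratios(coefficients):
    """Map each coefficient (change in log-odds per unit) to an odds ratio."""
    return {name: math.exp(beta) for name, beta in coefficients.items()}


# Hypothetical coefficients for illustration only (NOT from the fitted model).
example_coefs = {"feature_a": 0.5, "feature_b": -0.25, "feature_c": 0.0}

for name, ratio in odds_ratios(example_coefs).items():
    if ratio > 1:
        direction = "raises"
    elif ratio < 1:
        direction = "lowers"
    else:
        direction = "does not change"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds per unit increase)")
```

An odds ratio above 1 means that a one-unit increase in the feature multiplies the odds of the positive class by that factor, holding the other features fixed. That is the kind of statement a recommendation could be built on.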

In [1]:
# Environment Setup
from utils.model import *
from utils.dataset import *

***
## Model Selection & Generation

In [2]:
# generate lookup for models
DATA_PATH = "../datasets/pre_split_processed.parquet"
models = {
    "tree": TreeClassifier(target="diabetes", path=DATA_PATH),
    "ffnn": MLPClassifier(target="diabetes", path=DATA_PATH),
    "log": LogClassifier(target="diabetes", path=DATA_PATH)
}

# manual search
models["tree"].set_hyperparams({
    "loss": "log_loss",
    "learning_rate": 0.01,
    "n_estimators": 100,
    "criterion": "friedman_mse",
    "min_samples_split": 5,
    "min_samples_leaf": 5,
    "max_depth": 8,
    "n_iter_no_change": 5,
    "max_features": "sqrt",
    "tol": 0.0001
})
# models["ffnn"].set_hyperparams({
#     "learning_rate": .0005,
#     "batch_size": 256,
#     "num_hidden": 8,
#     "hidden_size": [2048, 1024, 512, 256, 128, 64, 32, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.875, 0.75, 0.75, 0.5, 0.5, 0.25, 0.25],
#     "classify_fn": "sigmoid"
# })
# models["ffnn"].set_hyperparams({
#     "learning_rate": .001,
#     "batch_size": 32,
#     "num_hidden": 4,
#     "hidden_size": [128, 64, 64, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.5, 0.4, 0.3, 0.2],
#     "classify_fn": "sigmoid"
# })
models["ffnn"].set_hyperparams({
    "learning_rate": .00005,
    "batch_size": 64,
    "num_hidden": 4,
    "hidden_size": 512,
    "num_epochs": 50,
    "dropout_rate": 0.7,
    "classify_fn": "sigmoid"
})
# models["ffnn"].set_hyperparams({
#     "input_size": 21,
#     "output_size": 3,
#     "hidden_size": 512,
#     "num_hidden": 2,
#     "num_epochs": 50,
#     "batch_size": 128,
#     "dropout_rate": 0.4,
#     "learning_rate": 0.0005
# })

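# train & test basic model
# models listed in skip_models bypass the load step and are always retrained
skip_models = []  # e.g. ["ffnn", "log"]
for mt, model in models.items():
    # retrain if forced or if no saved model could be loaded, then evaluate
    if mt in skip_models or not model.load_model():
        model.train_model(verbose=2)
    model.test_model()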

<Train-Test Split Report>
Train: 202944 obs, 170962 no diabetes [0], 3705 pre-diabetes [1], 28277 diabetes [2]
Test: 50736 obs, 42741 no diabetes [0], 926 pre-diabetes [1], 7069 diabetes [2]



<Test Report>
Precision: [no diabetes] 0.8427103669959968, [pre-diabetes] 0.6666666666666666, [diabetes] 0.0
Recall: [no diabetes] 0.9998128260920428, [pre-diabetes] 0.002546329042297355, [diabetes] 0.0
F1-Score: [no diabetes] 0.9145639379347245, [pre-diabetes] 0.005073280721533258, [diabetes] 0.0
Support: [no diabetes] 42741, [pre-diabetes] 7069, [diabetes] 926
Accuracy: 84.2617%
Macro-F1: 0.3065

<Test Report>
Precision: [no diabetes] 0.8424195837275308, [pre-diabetes] 0.0, [diabetes] 0.0
Recall: [no diabetes] 1.0, [pre-diabetes] 0.0, [diabetes] 0.0
F1-Score: [no diabetes] 0.9144709393754613, [pre-diabetes] 0.0, [diabetes] 0.0
Support: [no diabetes] 42741, [pre-diabetes] 926, [diabetes] 7069
Accuracy: 84.2420%
Macro-F1: 0.3048

<Test Report>
Precision: [no diabetes] 0.85290689351956, [pre-diabetes] 0.402724795640327, [diabetes] 0.0
Recall: [no diabetes] 0.9758311691350225, [pre-diabetes] 0.10454095345876362, [diabetes] 0.0
F1-Score: [no diabetes] 0.9102376639532093, [pre-diabetes] 0


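To put the reported metrics in context, the sketch below recomputes Macro-F1 as the unweighted mean of the per-class F1 scores and compares accuracy against a majority-class baseline. The numbers are copied from the first two test reports and the train-test split report above. The helper names `macro_f1` and `majority_baseline_accuracy` are made up for this illustration and are not part of `utils.model`.

```python
# Per-class F1 scores copied from the first two test reports above.
f1_report_1 = [0.9145639379347245, 0.005073280721533258, 0.0]
f1_report_2 = [0.9144709393754613, 0.0, 0.0]

# Test-set class counts copied from the train-test split report.
test_counts = {"no diabetes": 42741, "pre-diabetes": 926, "diabetes": 7069}


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1, so every class counts equally."""
    return sum(per_class_f1) / len(per_class_f1)


def majority_baseline_accuracy(counts):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())


print(f"Macro-F1 (report 1): {macro_f1(f1_report_1):.4f}")
print(f"Macro-F1 (report 2): {macro_f1(f1_report_2):.4f}")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy(test_counts):.4%}")
```

The baseline works out to 42741 / 50736, roughly 84.24%, which is essentially the accuracy reported above. This suggests the models mostly predict "no diabetes", so Macro-F1 is the more informative number on data this imbalanced.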

***
## Hyperparameter Optimization

In [3]:
# optimize hyperparams
# optimizer_results = {model_type: model.optimize_hyperparams(kfold=2) for model_type, model in models.items()}
# print(optimizer_results)
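
The internals of `optimize_hyperparams(kfold=2)` are not shown in this notebook. As a rough sketch of what a k-fold grid search generally looks like, the block below uses only the standard library. The names `kfold_indices`, `grid_search`, and `toy_score` are hypothetical, and `toy_score` merely stands in for "train on the training fold, return a validation metric such as Macro-F1". The folds are shuffled but not stratified, and with classes this imbalanced a stratified split would likely be preferable.

```python
import itertools
import random


def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, folds[i]


def grid_search(param_grid, score_fn, n_samples, k=2):
    """Return (best_params, best_mean_score) over the full parameter grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va) for tr, va in kfold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, val_idx):
    """Placeholder metric; a real version would train and validate a model."""
    return -abs(params["max_depth"] - 8) - abs(params["learning_rate"] - 0.01)


grid = {"max_depth": [4, 8, 12], "learning_rate": [0.1, 0.01]}
best_params, best_score = grid_search(grid, toy_score, n_samples=100, k=2)
print(best_params, best_score)
```

Scoring each configuration by its mean validation metric across folds, rather than on a single split, makes the comparison less sensitive to how one particular split happens to fall.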

***
## Fine-Tuning + Other Adjustments

***
## Best Model Report

***
## Interpretation

***
## Conclusion