# Modeling
In this notebook, we'll model the data we prepared previously. Our notebook is laid out as follows:

1. Model Selection & Generation
2. Hyperparameter Optimization
3. Fine-Tuning (if needed)
4. Reporting Best Model(s) + Settings
5. Interpretation
6. Conclusion

Our eventual goal here is two-fold:

1. Accurately (and fairly) model the diabetes dataset.
2. Interpret the results to find something worth recommending to people who want to reduce their risk of diabetes. This can be done via LIME/SHAP (i.e., an interpretable model that approximates the neural network) or by analyzing a simpler model's structure (e.g., regression coefficients or random forest decision boundaries).
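
As a rough illustration of the second goal, here is a minimal sketch of permutation importance, a simpler model-agnostic technique than LIME or SHAP (it is swapped in here only because it fits in a few lines of standard-library Python). Everything in it is hypothetical: `toy_predict` stands in for a trained classifier, and the two-feature dataset is synthetic, not the diabetes data.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows where predict(row) equals the true label."""
    hits = sum(predict(row) == y for row, y in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Accuracy drop after shuffling each feature column, one at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        rng.shuffle(column)
        # rebuild the rows with only column j permuted
        shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, shuffled, labels))
    return drops

# Toy stand-in for a trained classifier (assumption): it only looks at feature 0.
def toy_predict(row):
    return int(row[0] > 0.5)

# Synthetic data (assumption): feature 0 determines the label, feature 1 is noise.
rng = random.Random(42)
rows = [[rng.random(), rng.random()] for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

importances = permutation_importance(toy_predict, rows, labels, n_features=2)
print(importances)  # feature 0 should show a large drop, feature 1 none
```

A feature whose shuffling barely changes accuracy is one the model does not rely on. For the real models, `toy_predict` would be replaced by whatever prediction call the classifier exposes.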

In [1]:
# Environment Setup
from utils.model import *
from utils.dataset import *

***
## Model Selection & Generation

In [2]:
# generate lookup for models
models = {
    # "tree": TreeClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet"),
    "ffnn": MLPClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet")
}

# manually chosen starting point (manual search)
models["ffnn"].set_hyperparams({
    "learning_rate": 0.0005,
    "batch_size": 128,
    "num_hidden": 2,
    "hidden_size": 2048,
    "num_epochs": 50,
    "classify_fn": "sigmoid"
})

# train & test each basic model
for model in models.values():
    # attempt to load a saved model; train only if none is found
    if not model.load_model():
        model.train_model(verbose=1)
    # evaluate on the held-out test split in either case
    model.test_model()

<Train-Test Split Report>
Train: 512886 obs, 170962 no diabetes [0], 170962 pre-diabetes [1], 170962 diabetes [2]
Test: 50736 obs, 42741 no diabetes [0], 926 pre-diabetes [1], 7069 diabetes [2]

<Test Report>
Precision: [no diabetes] 0.8560014750169013, [pre-diabetes] 0.07317073170731707, [diabetes] 0.47493887530562345
Recall: [no diabetes] 0.9776093212606163, [pre-diabetes] 0.02267818574514039, [diabetes] 0.10991653699250248
F1-Score: [no diabetes] 0.9127727898289534, [pre-diabetes] 0.03462489694971146, [diabetes] 0.1785180930499713
Support: [no diabetes] 42741, [pre-diabetes] 926, [diabetes] 7069
Accuracy: 83.9286%
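
The 83.9% accuracy is easier to judge next to a trivial baseline. The test split is heavily imbalanced (42741 of 50736 observations are "no diabetes"), so the sketch below compares the model against always predicting the majority class and summarizes the per-class F1 scores with macro and support-weighted averages. It uses only numbers printed in the report above (F1 rounded to four decimals); the variable names are illustrative and not part of `utils.model`.

```python
# Per-class support and F1 copied from the test report above.
support = {"no diabetes": 42741, "pre-diabetes": 926, "diabetes": 7069}
f1 = {"no diabetes": 0.9128, "pre-diabetes": 0.0346, "diabetes": 0.1785}
model_accuracy = 0.839286

total = sum(support.values())
majority_class = max(support, key=support.get)
# accuracy of a classifier that always predicts the majority class
baseline_accuracy = support[majority_class] / total

# macro F1 weights every class equally; weighted F1 weights by support
macro_f1 = sum(f1.values()) / len(f1)
weighted_f1 = sum(f1[c] * support[c] for c in support) / total

print(f"majority baseline ({majority_class}): {baseline_accuracy:.4f}")
print(f"model accuracy: {model_accuracy:.4f}")
print(f"macro F1: {macro_f1:.4f}, weighted F1: {weighted_f1:.4f}")
```

With these numbers the majority baseline comes out at roughly 84.2%, slightly above the model's accuracy, which suggests tracking macro F1 or per-class recall rather than accuracy alone.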


***
## Hyperparameter Optimization
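
The grid-search log in this section reports 216 combinations. How `optimize_hyperparams` builds its grid is not shown here, so the following is only a sketch that assumes an exhaustive Cartesian product. The `hidden_size`, `num_hidden`, and `learning_rate` values are the ones that appear in the log; the remaining `batch_size` and `num_epochs` values are hypothetical, chosen so that the product happens to be 216.

```python
from itertools import product

# Hypothetical search space: hidden_size, num_hidden and learning_rate values
# appear in the log; batch_size values other than 1024 and num_epochs values
# other than 25 are assumptions made for illustration.
grid = {
    "hidden_size": [64, 128, 256, 512, 1024, 2048],
    "num_hidden": [2, 4, 8],
    "learning_rate": [0.001, 0.0005],
    "batch_size": [256, 512, 1024],
    "num_epochs": [25, 50],
}

def iter_combinations(grid):
    """Yield one hyperparameter dict per point of the Cartesian product."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

combos = list(iter_combinations(grid))
print(len(combos))  # 6 * 3 * 2 * 3 * 2 = 216
print(combos[0])
```

Each yielded dict has the same shape as the argument passed to `set_hyperparams` earlier, so a loop over `combos` could set, train, and test one configuration at a time.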

In [3]:
# optimize hyperparams
optimizer_results = {model_type: model.optimize_hyperparams(kfold=2) for model_type, model in models.items()}
print(optimizer_results)

<Grid-Search>
Testing 216 combinations WITHOUT cross-validation
<testing> hidden_size=64, lr=0.001, bs=1024, num_hidden=2, num_epochs=25. . . 



<Test Report>
Precision: [no diabetes] 0.8894882087305569, [pre-diabetes] 0.021924116528181, [diabetes] 0.2316288520149001
Recall: [no diabetes] 0.6636250906623616, [pre-diabetes] 0.23650107991360692, [diabetes] 0.2902815108218984
F1-Score: [no diabetes] 0.7601334601830387, [pre-diabetes] 0.04012826385707741, [diabetes] 0.257659467604219
Support: [no diabetes] 42741, [pre-diabetes] 926, [diabetes] 7069
Accuracy: 60.3812%
perf: 0.6038
<testing> hidden_size=128, lr=0.001, bs=1024, num_hidden=8, num_epochs=25. . . 



<Test Report>
Precision: [no diabetes] 0.8630072840790843, [pre-diabetes] 0.017890191239975324, [diabetes] 0.20466092334879613
Recall: [no diabetes] 0.7761634028216466, [pre-diabetes] 0.06263498920086392, [diabetes] 0.26213042863205543
F1-Score: [no diabetes] 0.817284832657888, [pre-diabetes] 0.02783109404990403, [diabetes] 0.22985796687961296
Support: [no diabetes] 42741, [pre-diabetes] 926, [diabetes] 7069
Accuracy: 69.1521%
perf: 0.6915
<testing> hidden_size=256, lr=0.001, bs=1024, num_hidden=4, num_epochs=25. . . 



<Test Report>
Precision: [no diabetes] 0.8681869178385995, [pre-diabetes] 0.022182254196642687, [diabetes] 0.21534307792570523
Recall: [no diabetes] 0.7785264734096067, [pre-diabetes] 0.03995680345572354, [diabetes] 0.3272032819352101
F1-Score: [no diabetes] 0.8209157744116051, [pre-diabetes] 0.028527370855821126, [diabetes] 0.2597417181358787
Support: [no diabetes] 42741, [pre-diabetes] 926, [diabetes] 7069
Accuracy: 70.2164%
perf: 0.7022
<testing> hidden_size=512, lr=0.0005, bs=1024, num_hidden=4, num_epochs=25. . . 




<Test Report>
Precision: [no diabetes] 0.8703537486800422, [pre-diabetes] 0.02447058823529412, [diabetes] 0.22402385611778958
Recall: [no diabetes] 0.7713670714302426, [pre-diabetes] 0.056155507559395246, [diabetes] 0.3400763898712689
F1-Score: [no diabetes] 0.8178762357202218, [pre-diabetes] 0.034087184529662404, [diabetes] 0.2701123595505618
Support: [no diabetes] 42741, [pre-diabetes] 926, [diabetes] 7069
Accuracy: 69.8222%
perf: 0.6982
<testing> hidden_size=1024, lr=0.0005, bs=1024, num_hidden=2, num_epochs=25. . . 




<Test Report>
Precision: [no diabetes] 0.8629153322658126, [pre-diabetes] 0.025982256020278833, [diabetes] 0.21414581066376495
Recall: [no diabetes] 0.8069301139421164, [pre-diabetes] 0.04427645788336933, [diabetes] 0.2783986419578441
F1-Score: [no diabetes] 0.8339842096990654, [pre-diabetes] 0.03274760383386582, [diabetes] 0.24208130881358017
Support: [no diabetes] 42741, [pre-diabetes] 926, [diabetes] 7069
Accuracy: 71.9371%
perf: 0.7194
<testing> hidden_size=2048, lr=0.0005, bs=1024, num_hidden=2, num_epochs=25. . . 




<Test Report>
Precision: [no diabetes] 0.8566524423677497, [pre-diabetes] 0.022132796780684104, [diabetes] 0.2040520984081042
Recall: [no diabetes] 0.8407384010668912, [pre-diabetes] 0.011879049676025918, [diabetes] 0.23935492997595134
F1-Score: [no diabetes] 0.8486208199508786, [pre-diabetes] 0.015460295151089248, [diabetes] 0.22029815767202657
Support: [no diabetes] 42741, [pre-diabetes] 926, [diabetes] 7069
Accuracy: 74.1820%
perf: 0.7418
{
    "learning_rate": 0.0005,
    "input_size": 21,
    "output_size": 3,
    "hidden_size": 2048,
    "num_hidden": 2,
    "num_epochs": 25,
    "batch_size": 1024,
    "classify_fn": "sigmoid"
}

<Test Report>
Precision: [no diabetes] 0.8566524423677497, [pre-diabetes] 0.022132796780684104, [diabetes] 0.2040520984081042
Recall: [no diabetes] 0.8407384010668912, [pre-diabetes] 0.011879049676025918, [diabetes] 0.23935492997595134
F1-Score: [no diabetes] 0.8486208199508786, [pre-diabetes] 0.015460295151089248, [diabetes] 0.22029815767202657
Suppor
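
The `perf` value in the log matches the reported accuracy, so the search appears to rank configurations by accuracy (an inference from the output, not confirmed against the `utils.model` source). Because the test split is imbalanced, ranking by macro F1 can give a different winner. The sketch below re-ranks the six configurations visible in the log using their reported accuracy and per-class F1 (rounded to four decimals). The dictionary is keyed by `hidden_size`, which happens to be unique across these six runs even though other hyperparameters also vary.

```python
# (accuracy, per-class F1 [no diabetes, pre-diabetes, diabetes]) for the six
# configurations visible in the log; keys are hidden_size.
results = {
    64:   (0.6038, [0.7601, 0.0401, 0.2577]),
    128:  (0.6915, [0.8173, 0.0278, 0.2299]),
    256:  (0.7022, [0.8209, 0.0285, 0.2597]),
    512:  (0.6982, [0.8179, 0.0341, 0.2701]),
    1024: (0.7194, [0.8340, 0.0327, 0.2421]),
    2048: (0.7418, [0.8486, 0.0155, 0.2203]),
}

def macro_f1(per_class_f1):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)

best_by_accuracy = max(results, key=lambda k: results[k][0])
best_by_macro_f1 = max(results, key=lambda k: macro_f1(results[k][1]))

for hidden_size, (acc, f1s) in results.items():
    print(f"hidden_size={hidden_size:<5} accuracy={acc:.4f} macro_f1={macro_f1(f1s):.4f}")
print("best by accuracy:", best_by_accuracy)
print("best by macro F1:", best_by_macro_f1)
```

Among these six runs, accuracy favors `hidden_size=2048` while macro F1 favors `hidden_size=512`, mostly because the larger network gains "no diabetes" recall at the expense of the two minority classes.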

***
## Fine-Tuning + Other Adjustments

***
## Best Model Report

***
## Interpretation

***
## Conclusion