# Modeling
In this notebook, we'll model the data we prepared previously. Our notebook is laid out as follows:

1. Model Selection & Generation
2. Hyperparameter Optimization
3. Fine-Tuning (if needed)
4. Reporting Best Model(s) + Settings
5. Interpretation
6. Conclusion

Our eventual goal here is two-fold:

1. Accurately [and fairly] model the diabetes dataset
2. Interpret the results to find something worth recommending to those who want to reduce their risk of diabetes. This can be done via LIME/SHAP (i.e., an interpretable explanation that approximates the neural network's behavior) or by analyzing a simpler model's structure (e.g., regression coefficients, random forest decision boundaries)
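
To make the "simpler model" route concrete, here is a minimal sketch of reading logistic regression coefficients as odds ratios. It runs on a synthetic toy dataset, not the diabetes data. The feature names `risk_factor` and `unrelated` and the helper `fit_logistic` are hypothetical and not part of `utils.model`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, lr=1.0, steps=3000):
    """Fit logistic regression by full-batch gradient descent.

    Returns (intercept, weights), with one weight per feature column.
    """
    n = len(rows)
    n_features = len(rows[0])
    intercept = 0.0
    weights = [0.0] * n_features
    for _ in range(steps):
        grad_b = 0.0
        grad_w = [0.0] * n_features
        for x, y in zip(rows, labels):
            p = sigmoid(intercept + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss w.r.t. the logit
            grad_b += err
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
        intercept -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return intercept, weights


# Synthetic toy data (an assumption for illustration only):
# 80% of rows with risk_factor=1 are positive, 20% of rows with risk_factor=0,
# and the second feature has no effect on the label.
feature_names = ["risk_factor", "unrelated"]
rows, labels = [], []
for risk in (0, 1):
    for unrelated in (0, 1):
        positives = 8 if risk else 2
        for i in range(10):
            rows.append([risk, unrelated])
            labels.append(1 if i < positives else 0)

intercept, weights = fit_logistic(rows, labels)

# exp(coefficient) is the multiplicative change in the odds of the positive
# class when a binary feature goes from 0 to 1, holding the others fixed.
odds_ratios = {name: math.exp(w) for name, w in zip(feature_names, weights)}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio ~ {ratio:.2f}")
```

An odds ratio well above 1 flags a feature associated with higher risk, while a ratio near 1 suggests little association. On real data, such readings describe correlation in the fitted model rather than causal effects.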

In [1]:
# Environment Setup
from utils.model import *
from utils.dataset import *

***
## Model Selection & Generation

In [2]:
# generate lookup for models
models = {
    # "tree": TreeClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet"),
    "ffnn": MLPClassifier(target="diabetes", path="../datasets/pre_split_processed.parquet")
}

# manual search
# models["tree"].set_hyperparams({
#     "loss": "log_loss",
#     "learning_rate": 0.01,
#     "n_estimators": 100,
#     "criterion": "friedman_mse",
#     "min_samples_split": 2,
#     "min_samples_leaf": 5,
#     "max_depth": 3,
#     "n_iter_no_change": 5,
#     "max_features": "log2",
#     "tol": 1e-4
# })
# models["ffnn"].set_hyperparams({
#     "learning_rate": .0005,
#     "batch_size": 256,
#     "num_hidden": 8,
#     "hidden_size": [2048, 1024, 512, 256, 128, 64, 32, 32],
#     "num_epochs": 50,
#     "dropout_rate": [0.875, 0.75, 0.75, 0.5, 0.5, 0.25, 0.25],
#     "classify_fn": "sigmoid"
# })
models["ffnn"].set_hyperparams({
    "learning_rate": .001,
    "batch_size": 256,
    "num_hidden": 2,
    "hidden_size": 128,
    "num_epochs": 50,
    "dropout_rate": 0.25,
    "classify_fn": "softmax"
})

# train & test each basic model
for model_type, model in models.items():
    # loading a saved model is disabled for now, so always train and test
    # if not model.load_model():
    model.train_model(verbose=2)
    model.test_model()

<Dtype Inference>
diabetes
high_bp
high_chol
chol_check
bmi
smoker
stroke
heart_disease
physical_activity
fruits
veggies
heavy_drinker
healthcare
no_doc_bc_cost
general_health
mental_health
physical_health
diff_walk
sex
age
education
income
	3 numeric vars, 0 nominal vars, 0 ordinal vars


ValueError: SMOTE-NC is not designed to work only with numerical features. It requires some categorical features.
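
The dtype inference above reports 3 numeric variables and no nominal or ordinal ones, which is consistent with the error: SMOTE-NC needs at least one column marked as categorical. Below is a minimal sketch of one possible workaround, assuming categorical columns can be identified by low cardinality. The helpers `categorical_column_indices` and `choose_sampler` and the `max_unique` threshold are hypothetical and are not how `utils.dataset` works. The toy rows reuse three column names from the output above with made-up values.

```python
def categorical_column_indices(rows, max_unique=10):
    """Return indices of columns with at most `max_unique` distinct values."""
    n_cols = len(rows[0])
    indices = []
    for j in range(n_cols):
        distinct = {row[j] for row in rows}
        if len(distinct) <= max_unique:
            indices.append(j)
    return indices


def choose_sampler(n_cols, categorical_idx):
    """Name the imbalanced-learn oversampler that matches the column mix."""
    if not categorical_idx:
        return "SMOTE"    # all columns continuous
    if len(categorical_idx) == n_cols:
        return "SMOTEN"   # all columns categorical
    return "SMOTENC"      # mixed continuous and categorical


# Toy rows with made-up values: (high_bp, bmi, smoker)
rows = [(int(i % 3 == 0), 21.5 + i * 0.7, i % 2) for i in range(30)]

cat_idx = categorical_column_indices(rows)
sampler_name = choose_sampler(len(rows[0]), cat_idx)
print(cat_idx, sampler_name)
```

With imbalanced-learn, the resulting indices would be passed as `SMOTENC(categorical_features=cat_idx)`. A cardinality threshold is only a heuristic, so an explicit list of categorical columns may be more reliable for this dataset.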

***
## Hyperparameter Optimization

In [None]:
# optimize hyperparams
# optimizer_results = {model_type: model.optimize_hyperparams(kfold=2) for model_type, model in models.items()}
# print(optimizer_results)
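
For context on what a `kfold=2` search involves, here is a minimal sketch of grid search with k-fold cross-validation. It is not the implementation behind `optimize_hyperparams`, which is not shown in this notebook. The helpers `kfold_indices`, `grid_search`, and `toy_score`, as well as the candidate values in `param_grid`, are assumptions for illustration.

```python
from itertools import product


def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def grid_search(param_grid, score_fn, n_samples, k=2):
    """Return (best_params, best_mean_score) over every combination in param_grid."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Score each candidate on every fold and average the results.
        scores = [score_fn(params, tr, va) for tr, va in kfold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


def toy_score(params, train_idx, val_idx):
    """Placeholder scorer with an arbitrary peak; a real one would train on
    train_idx and return a validation metric computed on val_idx."""
    return -abs(params["learning_rate"] - 0.001) - 0.0001 * abs(params["hidden_size"] - 128) / 128


param_grid = {"learning_rate": [0.0005, 0.001, 0.01], "hidden_size": [64, 128, 256]}
best_params, best_score = grid_search(param_grid, toy_score, n_samples=10, k=2)
print(best_params, best_score)
```

With only two folds, each candidate is trained twice on half of the data, which keeps the search cheap at the cost of noisier score estimates.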

***
## Fine-Tuning + Other Adjustments

***
## Best Model Report

***
## Interpretation

***
## Conclusion