<a href="https://colab.research.google.com/github/carywoods/Aweome-Heathcare-Federated-Learning/blob/main/Copy_of_tutorial_00_classification_study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AutoPrognosis classification

A __classification model__ is a machine learning model that assigns a categorical label to an input. It predicts the class of a target variable from one or more predictor variables. The target takes one of a finite set of discrete class labels, while the predictors can be continuous or categorical. Training aims to learn a decision boundary that separates the classes as well as possible on the training data.
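
As a point of reference before any AutoML is involved, the idea can be sketched with a single hand-picked model. The example below is a minimal baseline that uses scikit-learn only, not AutoPrognosis; the names `X_demo`, `y_demo`, and `baseline` are illustrative, and the choice of a scaled logistic regression is an assumption made for brevity.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Predictors are continuous measurements; the target is a binary class label.
X_demo, y_demo = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Logistic regression learns a linear decision boundary in the scaled feature space.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

print("classes:", baseline.classes_)
print("held-out accuracy:", round(baseline.score(X_test, y_test), 3))
```

A fixed baseline like this gives a number to compare against once the automated search has produced its own model.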

AutoPrognosis provides `ClassifierStudy`, which uses AutoML to search for a well-performing classification model.

### Setup

In [None]:
!pip install autoprognosis

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting autoprognosis
  Downloading autoprognosis-0.1.21-py2.py3-none-any.whl (284 kB)
Collecting hyperimpute>=0.1.16
  Downloading hyperimpute-0.1.17-py3-none-any.whl (92 kB)
Collecting lifelines
  Downloading lifelines-0.27.4-py3-none-any.whl (349 kB)
Collecting redis
  Downloading redis-4.5.4-py3-none-any.whl (238 kB)
Collecting optuna==3.1.0
  Downloading optuna-3.1.0-py3-none-any.whl (365 kB)

In [None]:
# stdlib
import json
import warnings

# third party
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

warnings.filterwarnings("ignore")

### Import ClassifierStudy

`ClassifierStudy` is the engine that automatically learns an ensemble of pipelines and their hyperparameters.

In [None]:
# autoprognosis absolute
from autoprognosis.studies.classifiers import ClassifierStudy
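
To make "an ensemble of pipelines" concrete, here is a minimal NumPy sketch of one common way to combine several models: a weighted average of their predicted class probabilities. It assumes that this is how the ensemble is combined, which this tutorial does not state. The probabilities and weights below are made up for illustration and do not come from AutoPrognosis.

```python
import numpy as np

# Hypothetical class-probability outputs of three pipelines for two samples
# (columns: P(class 0), P(class 1)). The numbers are made up for illustration.
probs = [
    np.array([[0.9, 0.1], [0.4, 0.6]]),
    np.array([[0.8, 0.2], [0.3, 0.7]]),
    np.array([[0.7, 0.3], [0.6, 0.4]]),
]
# Hypothetical ensemble weights, chosen so that they sum to 1.
weights = np.array([0.2, 0.6, 0.2])

# Weighted average over the pipeline axis: result has shape (n_samples, n_classes).
ensemble_probs = np.tensordot(weights, np.stack(probs), axes=1)

# The predicted class is the one with the highest averaged probability.
predicted = ensemble_probs.argmax(axis=1)
print(ensemble_probs)
print(predicted)
```

Because the weights sum to 1, each row of `ensemble_probs` is still a valid probability distribution over the classes.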

### Load the target dataset

AutoPrognosis expects its input as a `pandas.DataFrame` that holds both the features and the label column.

For this example, we will use the [Breast Cancer Wisconsin Dataset](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)).

In [None]:
# stdlib
from pathlib import Path

X, Y = load_breast_cancer(return_X_y=True, as_frame=True)

df = X.copy()
df["target"] = Y
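
The setup cell imports `train_test_split`, but the tutorial never calls it. One possible use, sketched below, is to hold out part of the data so that the selected model can later be checked on rows the search never saw. The names `df_all`, `df_train`, and `df_test` and the 80/20 split are assumptions for illustration; the tutorial itself passes the full `df` to the study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_all, y_all = load_breast_cancer(return_X_y=True, as_frame=True)
df_all = X_all.copy()
df_all["target"] = y_all

# Stratify on the label so both parts keep the original class balance.
df_train, df_test = train_test_split(
    df_all, test_size=0.2, stratify=df_all["target"], random_state=0
)

print(df_train.shape, df_test.shape)
print(df_train["target"].value_counts(normalize=True).round(3).to_dict())
```

With this split, `df_train` would be passed as `dataset=` when creating the study, and `df_test` would be kept aside for a final check.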

### Create the classifier

While AutoPrognosis provides default plugins, it allows the user to customize the plugins for the pipelines.

You can see the supported plugins below:

In [None]:
# List the available plugins

# autoprognosis absolute
from autoprognosis.plugins import Plugins

print(json.dumps(Plugins().list_available(), indent=2))

{
  "imputer": {
    "default": [
      "missforest",
      "sinkhorn",
      "gain",
      "softimpute",
      "hyperimpute",
      "mean",
      "ice",
      "EM",
      "median",
      "most_frequent",
      "nop",
      "mice"
    ]
  },
  "prediction": {
    "classifier": [
      "decision_trees",
      "lgbm",
      "gaussian_naive_bayes",
      "bernoulli_naive_bayes",
      "random_forest",
      "gradient_boosting",
      "lda",
      "ridge_classifier",
      "adaboost",
      "perceptron",
      "linear_svm",
      "hist_gradient_boosting",
      "multinomial_naive_bayes",
      "tabnet",
      "xgboost",
      "knn",
      "catboost",
      "logistic_regression",
      "gaussian_process",
      "qda",
      "bagging",
      "extra_tree_classifier",
      "neural_nets"
    ],
    "regression": [
      "neural_nets_regression",
      "random_forest_regressor",
      "kneighbors_regressor",
      "catboost_regressor",
      "bayesian_ridge",
      "linear_regression",
      "t

We will set a few custom plugins for the pipelines and create the classifier study.

In [None]:
workspace = Path("workspace")
workspace.mkdir(parents=True, exist_ok=True)

study_name = "classification_example"

study = ClassifierStudy(
    study_name=study_name,
    dataset=df,  # pandas DataFrame
    target="target",  # the label column in the dataset
    num_iter=2,  # trials per candidate (default: 50). DELETE THIS LINE FOR BETTER RESULTS.
    num_study_iter=1,  # outer iterations (default: 5). DELETE THIS LINE FOR BETTER RESULTS.
    classifiers=["logistic_regression", "lda", "xgboost"],  # restricts the search space. DELETE THIS LINE FOR BETTER RESULTS.
    workspace=workspace,
)
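
A misspelled plugin name in `classifiers` is easy to overlook, so it can help to check the requested names against the plugin listing before starting a search. The sketch below is plain Python: `available_classifiers` is a hand-copied excerpt of the classifier list printed above, and `unknown_plugins` is a hypothetical helper, not part of AutoPrognosis.

```python
# Excerpt of the "prediction" -> "classifier" list printed by the plugin listing above.
available_classifiers = {
    "decision_trees", "lgbm", "random_forest", "lda", "xgboost",
    "catboost", "logistic_regression", "neural_nets",
}


def unknown_plugins(requested, available):
    """Return the requested names that are not in the available set, in order."""
    return [name for name in requested if name not in available]


requested = ["logistic_regression", "lda", "xgboost"]
missing = unknown_plugins(requested, available_classifiers)
print("unknown plugin names:", missing)
```

In a live session the set could presumably be built from `Plugins().list_available()["prediction"]["classifier"]`, which matches the structure of the printed listing.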

### Search for the best ensemble


In [None]:
# study.run saves a good architecture in "model.p" - the model is not trained at this stage.
# That model can be later used/reused for benchmarks or training on the dataset.

study.run()

In [None]:
from pprint import pprint
# autoprognosis absolute
from autoprognosis.utils.serialization import load_model_from_file
from autoprognosis.utils.tester import evaluate_estimator

output = workspace / study_name / "model.p"

model = load_model_from_file(output)

metrics = evaluate_estimator(model, X, Y)

print(f"Model {model.name()}")
pprint(metrics['str'])

Model 0.1874999998828125 * (pca->minmax_scaler->data_cleanup->lda) + 0.624999999609375 * (data_cleanup->logistic_regression) + 0.1874999998828125 * (pca->maxabs_scaler->data_cleanup->xgboost)
{'accuracy': '0.947 +/- 0.005',
 'aucprc': '0.992 +/- 0.005',
 'aucroc': '0.988 +/- 0.005',
 'f1_score_macro': '0.943 +/- 0.006',
 'f1_score_micro': '0.947 +/- 0.005',
 'f1_score_weighted': '0.947 +/- 0.005',
 'kappa': '0.887 +/- 0.011',
 'kappa_quadratic': '0.887 +/- 0.011',
 'mcc': '0.888 +/- 0.01',
 'precision_macro': '0.946 +/- 0.003',
 'precision_micro': '0.947 +/- 0.005',
 'precision_weighted': '0.948 +/- 0.004',
 'recall_macro': '0.942 +/- 0.011',
 'recall_micro': '0.947 +/- 0.005',
 'recall_weighted': '0.947 +/- 0.005'}
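
The values in `metrics['str']` are formatted strings, which is convenient for reading but not for comparing runs. A small parser, sketched below, turns them back into numbers. `parse_metric` is a hypothetical helper, and the second number is simply called the spread, because the tutorial does not say whether it is a standard deviation or a confidence interval.

```python
def parse_metric(text):
    """Split a 'mean +/- spread' string into two floats."""
    mean, spread = text.split("+/-")
    return float(mean), float(spread)


# A few entries copied from the output above.
reported = {
    "accuracy": "0.947 +/- 0.005",
    "aucroc": "0.988 +/- 0.005",
    "mcc": "0.888 +/- 0.01",
}

parsed = {name: parse_metric(value) for name, value in reported.items()}

# Name of the metric with the highest mean among the parsed entries.
best = max(parsed, key=lambda name: parsed[name][0])
print(parsed)
print(best)
```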


## Serialization

In [None]:
# Train the model on the full dataset

model.fit(X, Y)

In [None]:
from autoprognosis.utils.serialization import save_to_file, load_from_file

out = workspace / "tmp.bkp"

# Save
save_to_file(out, model)

# Load from file
loaded_model = load_from_file(out)

print(loaded_model.name())

assert loaded_model.name() == model.name()

out.unlink()

0.1874999998828125 * (pca->minmax_scaler->data_cleanup->lda) + 0.624999999609375 * (data_cleanup->logistic_regression) + 0.1874999998828125 * (pca->maxabs_scaler->data_cleanup->xgboost)
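
Matching names show that the same architecture was restored, but not that the fitted parameters survived the round trip. A stricter check compares predictions before and after loading. The sketch below illustrates the idea with the standard-library `pickle` module and a scikit-learn pipeline as a stand-in; it does not use AutoPrognosis's `save_to_file` or `load_from_file`, and the names `clf` and `restored` are illustrative.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_rt, y_rt = load_breast_cancer(return_X_y=True, as_frame=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_rt, y_rt)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clf.pkl"
    path.write_bytes(pickle.dumps(clf))  # save
    restored = pickle.loads(path.read_bytes())  # load

# A restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    clf.predict_proba(X_rt), restored.predict_proba(X_rt)
)
print("identical predictions:", same_predictions)
```

For the tutorial's objects, the analogous check would compare the outputs of `model` and `loaded_model` on `X`, assuming the fitted ensemble exposes `predict_proba`.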


## Congratulations!

Congratulations on completing this notebook tutorial! If you enjoyed it and would like to join the movement towards machine learning and AI for medicine, you can do so in the following ways.

### Star AutoPrognosis on GitHub

The easiest way to help our community is to star the repositories. This helps raise awareness of the tools we're building.

- [Star AutoPrognosis](https://github.com/vanderschaarlab/autoprognosis)
- [Star HyperImpute](https://github.com/vanderschaarlab/hyperimpute)


### Check out other projects from vanderschaarlab
- [Synthcity](https://github.com/vanderschaarlab/synthcity)

