# PiML Toolbox for Model Development and Validation: Low-code Demo

PiML (Python Interpretable Machine Learning) is an integrated Python toolbox for interpretable machine learning (IML) model development and validation. Through a low-code interface and high-code APIs, PiML supports a range of machine learning models in the following two categories:

- **Inherently interpretable models**:

    - 1. GLM: Linear/logistic regression with L1 and/or L2 regularization (Hastie, Tibshirani and Wainwright, 2015)

    - 2. GAM: Generalized additive models using B-splines (Servén and Brummitt, 2018)

    - 3. Tree: Decision tree for classification and regression (Pedregosa et al., 2011)

    - 4. FIGS: Fast interpretable greedy-tree sums (Tan et al., 2022)

    - 5. XGB1: Extreme gradient boosted trees of depth 1 (Chen and He, 2015)

    - 6. XGB2: Extreme gradient boosted trees of depth 2 (Chen and He, 2015; Lengerich et al., 2020)

    - 7. EBM: Explainable boosting machine (Nori et al., 2019; Lou et al., 2013)

    - 8. GAMI-Net: Generalized additive model with structured interactions (Yang, Zhang and Sudjianto, 2021)

    - 9. ReLU-DNN: Deep ReLU networks using Aletheia unwrapper and sparsification (Sudjianto et al., 2020)


- **Arbitrary black-box models**, e.g.
  1. Tree-ensembles: RF, GBM, XGBoost, LightGBM, ...
  2. DNNs: MLP, ResNet, CNN, Attention, ...
  3. Kernel methods: SVM, Gaussian Process, ...


This example notebook demonstrates how to use PiML in its low-code mode to train machine learning models, then interpret/explain, diagnose and compare them. The toolbox has the following built-in datasets for demo purposes.

- **CoCircles** classification data: simulated by `sklearn.datasets.make_circles(n_samples=2000, noise=0.1)`; see [details](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_circles.html).
- **Friedman** regression data: simulated by `sklearn.datasets.make_friedman1(n_samples=2000, n_features=10, noise=0.1)`; see [details](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html).
- **BikeSharing** regression data from the UCI repository: consisting of 17,379 samples of hourly counts of rental bikes in the Capital Bikeshare system; see [details](https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset).
- **CaliforniaHousing** regression data: consisting of 20,640 samples and 9 variables (8 features plus the target), fetched by `sklearn.datasets`; see [details](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html). It comes in three versions: a raw version, a trim1 version (trimming only AveOccup) and a trim2 version (trimming AveRooms, AveBedrms, Population and AveOccup).
- **TaiwanCredit** classification data from the UCI repository: consisting of 30,000 credit card clients in Taiwan from April 2005 to September 2005; see [details](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients). This data has been lightly preprocessed.

- **SimuCredit**: A simulated credit dataset for fairness testing.
- **SolasSimu1**: A simulated dataset, modified from the 'Friedman #1' regression problem. The covariates used for modeling are 'Segment', 'x1', 'x2', ..., 'x5'; the response 'Label' is binary, so this is a classification problem. The remaining variables are demographic variables used for testing fairness. The data is contributed by Solas-AI (https://github.com/SolasAI/solas-ai-disparity).
- **SolasHMDA**: A preprocessed sample of the 2018 Home Mortgage Disclosure Act (HMDA) data. The HMDA dataset includes information about nearly every home mortgage application in the United States.
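
The Friedman data above (and the 'Friedman #1' problem that SolasSimu1 is derived from) follows a fixed formula documented for `sklearn.datasets.make_friedman1`: features are uniform on [0, 1], and only the first five enter the response. The following is a minimal standard-library sketch of that formula, not the code PiML or scikit-learn actually runs; the helper name `friedman1_sketch` and the `seed` parameter are illustrative assumptions, and the random stream will not match the built-in dataset.

```python
import math
import random


def friedman1_sketch(n_samples=2000, n_features=10, noise=0.1, seed=0):
    """Pure-Python sketch of the Friedman #1 generator (hypothetical helper)."""
    if n_features < 5:
        raise ValueError("Friedman #1 needs at least 5 features")
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        # Every feature is uniform on [0, 1]; only the first five drive y.
        row = [rng.random() for _ in range(n_features)]
        signal = (
            10 * math.sin(math.pi * row[0] * row[1])
            + 20 * (row[2] - 0.5) ** 2
            + 10 * row[3]
            + 5 * row[4]
        )
        X.append(row)
        # Additive Gaussian noise scaled by the `noise` standard deviation.
        y.append(signal + noise * rng.gauss(0.0, 1.0))
    return X, y


X, y = friedman1_sketch()
print(len(X), len(X[0]), len(y))
```

Because features six through ten are pure noise, this dataset is a convenient check on whether an interpretable model assigns them near-zero importance.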


# Stage 0: Install PiML package on Google Colab

1. Run `!pip install piml` to install the latest version of PiML.
2. In Colab, you'll need to restart the runtime in order to use the newly installed PiML version.

In [None]:
!pip install piml
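
To confirm which version pip placed in the environment, a small check with the standard library's `importlib.metadata` can help. This is a sketch: the helper name `installed_version` is an illustrative assumption, and it reports the version installed on disk, so a runtime restart is still needed before an already-running kernel picks it up.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("piml")
if version is None:
    print("piml is not installed in this environment")
else:
    print(f"piml {version} is installed")
```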

# Stage 1: Initialize an experiment, load and prepare data <a name="expdata"></a>

In [None]:
from piml import Experiment
exp = Experiment()

In [None]:
exp.data_loader()

In [None]:
exp.data_summary()

In [None]:
exp.eda()

In [None]:
exp.data_prepare()

In [None]:
exp.feature_select()

In [None]:
exp.data_quality()

# Stage 2. Train interpretable models <a name="modeltrain"></a>



In [None]:
exp.model_train()

# Stage 3. Explain and Interpret <a name="modelinterpret"></a>

In [None]:
exp.model_explain()

In [None]:
exp.model_interpret()

# Stage 4. Diagnose and compare

In [None]:
exp.model_diagnose()

In [None]:
exp.model_compare()

In [None]:
exp.segmented_diagnose()

# Stage 5. Register an arbitrary model ... 

In [None]:
from xgboost import XGBRegressor, XGBClassifier

exp.model_train(model=XGBRegressor(max_depth=7, n_estimators=500), name='XGB7')

## To register a pre-trained model, use the following: 
# pipeline = exp.make_pipeline(model=..., name='...', train_x=..., train_y=..., test_x=..., test_y=...)
# exp.register(pipeline=pipeline)

In [None]:
exp.model_explain()

In [None]:
exp.model_diagnose()

In [None]:
exp.model_compare()