[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/edikedik/eBoruta/blob/master/notebooks/demo.ipynb)

# `eBoruta` usage demo

## Setup

In [None]:
# ! pip install seaborn xgboost scikit-learn shap catboost

In [None]:
import logging
import typing as t

import numpy as np
import pandas as pd
import seaborn as sns
import shap
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

from eBoruta import eBoruta, TrialData, Features, Dataset, setup_logger

In [None]:
np.random.seed(666)

In [None]:
def plot_imp_history(df_history: pd.DataFrame):
    sns.lineplot(x='Step', y='Importance', hue='Feature', data=df_history)
    sns.lineplot(x='Step', y='Threshold', data=df_history, linestyle='--', linewidth=4)

## Basic usage

Single objective, `RandomForestClassifier`, default params.

In [None]:
x, y = make_classification(100, 10, n_informative=2)
boruta = eBoruta()
boruta.fit(x, y);

### Increase verbosity

Turn on logging to get a glimpse of what's going on.

In [None]:
LOGGER = setup_logger(stdout_level=logging.DEBUG, logger=logging.getLogger('eBoruta'))

In [None]:
boruta = eBoruta(verbose=2)
boruta.fit(x, y);

### Access features and history

In [None]:
features = boruta.features_
features.accepted, features.rejected, features.tentative

In [None]:
df = features.history
print(df.shape)
df.head()

Note that `n_rows = n_steps * n_features`. Calling `df.dropna()` cleans the table, so the last remaining row for each feature corresponds to the last step where that feature was used.

In [None]:
df.dropna().groupby('Feature').tail(1)
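
To see why `dropna()` followed by `groupby('Feature').tail(1)` yields each feature's last active step, here is a minimal sketch on a hand-made table. The column names mirror those used by `plot_imp_history` above; the values, and the assumption that a feature's importance is recorded as NaN once it is no longer used, are invented for illustration and are not taken from real eBoruta output.

```python
import numpy as np
import pandas as pd

# Toy history: 3 steps x 2 features (values are invented for illustration).
# Assumption: a feature dropped from the trials has NaN importance afterwards.
toy_history = pd.DataFrame({
    'Step': [1, 1, 2, 2, 3, 3],
    'Feature': ['0', '1', '0', '1', '0', '1'],
    'Importance': [0.30, 0.05, 0.35, 0.02, 0.40, np.nan],
})

# Drop the steps where a feature was not evaluated, then keep the
# last remaining row per feature.
last_used = toy_history.dropna().groupby('Feature').tail(1)
print(last_used)
```

Here feature `'1'` ends at step 2 while feature `'0'` runs through step 3.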

- Query history to inspect the selection process

In [None]:
df[df['Feature'] == '7']

- One can use history to produce plots

In [None]:
plot_imp_history(df)

### Explore params

In [None]:
?eBoruta

- Lower percentile threshold

In [None]:
boruta = eBoruta(percentile=70).fit(x, y)
plot_imp_history(boruta.features_.history)
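
As a rough illustration of what the `percentile` parameter controls, assume the usual Boruta scheme in which real features are compared against a percentile of the shadow (permuted) features' importances. The sketch below uses invented numbers and plain NumPy; `hits_at` is a hypothetical helper and not part of eBoruta.

```python
import numpy as np

# Invented importances for illustration only.
real_imp = np.array([0.30, 0.12, 0.08, 0.01])
shadow_imp = np.array([0.02, 0.04, 0.06, 0.10])


def hits_at(percentile: float) -> np.ndarray:
    # A real feature scores a "hit" when it beats the chosen
    # percentile of the shadow importances.
    threshold = np.percentile(shadow_imp, percentile)
    return real_imp > threshold


print(hits_at(100))  # threshold equals the best shadow feature
print(hits_at(70))   # lower threshold, so hits are easier to get
```

Lowering the percentile makes the selection less conservative: the third feature scores a hit at the 70th percentile but not at the 100th.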

- Lower p-value

In [None]:
boruta = eBoruta(pvalue=0.005).fit(x, y)
plot_imp_history(boruta.features_.history)
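
For intuition on the `pvalue` parameter, here is a minimal sketch of the binomial test used in the original Boruta procedure: under the null hypothesis a feature beats the shadow threshold in half of the trials, so its hit count is compared against a Binomial(n, 0.5) distribution. The helper name `hit_pvalue` is hypothetical, and eBoruta may apply multiple-testing corrections that are not shown here.

```python
from math import comb


def hit_pvalue(n_hits: int, n_trials: int) -> float:
    # One-sided tail probability P(X >= n_hits) for X ~ Binomial(n_trials, 0.5).
    tail = sum(comb(n_trials, k) for k in range(n_hits, n_trials + 1))
    return tail / 2 ** n_trials


# 9 hits out of 10 trials gives p = 11/1024.
p = hit_pvalue(9, 10)
print(p, p < 0.05, p < 0.005)
```

A stricter p-value therefore requires more trials (or a more consistent hit record) before a feature is accepted.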

- Apply rough fix

This does not overwrite the existing `boruta.features_`; it returns a new `Features` instance instead. In the new instance, the history is unchanged, while the `accepted`, `rejected`, and `tentative` attributes are updated accordingly.

In [None]:
fs = boruta.rough_fix(n_last_trials=10)

In [None]:
fs.accepted, fs.rejected, fs.tentative
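
One plausible form of such a rule, modelled on the rough fix in the original Boruta (comparing medians over recent trials), is sketched below. This is an assumption about the general idea, not a description of what `rough_fix` does internally; `rough_decision` is a hypothetical helper and the table values are invented.

```python
import pandas as pd

# Invented history of one tentative feature (column names follow the
# history table used above; the values are not real output).
toy = pd.DataFrame({
    'Step': [1, 2, 3, 4],
    'Feature': ['3'] * 4,
    'Importance': [0.02, 0.09, 0.08, 0.10],
    'Threshold': [0.05, 0.06, 0.07, 0.06],
})


def rough_decision(history: pd.DataFrame, n_last_trials: int) -> str:
    # Hypothetical rule: accept if the median importance over the last
    # trials exceeds the median threshold over the same trials.
    last = history.sort_values('Step').tail(n_last_trials)
    accept = last['Importance'].median() > last['Threshold'].median()
    return 'accepted' if accept else 'rejected'


print(rough_decision(toy, n_last_trials=3))
```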

- Use test set

In [None]:
boruta = eBoruta(test_size=0.3, test_stratify=True).fit(x, y)

## Advanced usage

### Different models

In principle, the model can be __any__ object defining a `fit` method -- classifier or regressor -- as long as the importance calculation is defined for it.
Note that the importance calculation can also be defined manually (see below).

For instance, we'll use the `XGBClassifier` and `CatBoostClassifier` below.

- `XGBClassifier`

In [None]:
boruta = eBoruta().fit(x, y, model=XGBClassifier(n_estimators=20, verbosity=0))
plot_imp_history(boruta.features_.history)

- `CatBoostClassifier`

In [None]:
# shap with `approximate` is not supported for catboost currently
boruta = eBoruta().fit(x, y, model=CatBoostClassifier(iterations=20, verbose=False))
plot_imp_history(boruta.features_.history)

### Custom importance measure

Any callable accepting either an estimator alone or an estimator together with a `TrialData` object, and returning a numpy array with shape `(n_test_features, )`, will work.

In [None]:
def get_imp(estimator):
    # equivalent to the builtin importance getter
    return estimator.feature_importances_

boruta = eBoruta(importance_getter=get_imp)
boruta.fit(x, y)
plot_imp_history(boruta.features_.history)

In [None]:
def get_permutation_imp(estimator: t.Any, trial_data: TrialData) -> np.ndarray:
    imp = permutation_importance(
        estimator, trial_data.x_test, trial_data.y_test, 
        scoring='accuracy', n_jobs=-1
    )
    return np.array(imp['importances_mean'])

# Let's also use a different estimator, just for the sake of it
boruta = eBoruta(
    importance_getter=get_permutation_imp
).fit(
    x, y, model=ExtraTreesClassifier(n_estimators=20)
)
plot_imp_history(boruta.features_.history)

### Non-ensemble classifier with custom importance evaluation

In [None]:
boruta = eBoruta(
    importance_getter=get_permutation_imp
).fit(
    x, y, model=LogisticRegression()
)
plot_imp_history(boruta.features_.history)
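
Permutation importance is model-agnostic but comparatively slow. For linear models, a cheaper alternative is to derive importances from the fitted coefficients. The sketch below defines a hypothetical `get_coef_imp` getter and checks its output shape on a plain scikit-learn model. It assumes the features are on comparable scales, since raw coefficient magnitudes are otherwise not comparable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def get_coef_imp(estimator) -> np.ndarray:
    # `coef_` has shape (1 or n_classes, n_features); average the absolute
    # values over rows to get one non-negative score per feature.
    return np.abs(estimator.coef_).mean(axis=0)


x_demo, y_demo = make_classification(100, 10, n_informative=2, random_state=0)
model = LogisticRegression().fit(x_demo, y_demo)
imp = get_coef_imp(model)
print(imp.shape)
```

Because it takes only the estimator, such a getter should be usable as `importance_getter` in the same way as `get_imp` above.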

### Multiple objectives

The built-in approach averages each feature's importance across objectives.
A different aggregation strategy can be defined via a custom importance getter.

In [None]:
y2 = np.array([[y_, y_] for y_ in y])

- Using built-in shap importance evaluation

In [None]:
boruta = eBoruta().fit(x, y2)
plot_imp_history(boruta.features_.history)

- Using `feature_importances_` attribute

In [None]:
# Using the `feature_importances_` attribute instead of shap importance
boruta = eBoruta(shap_tree=False).fit(x, y2)
plot_imp_history(boruta.features_.history)

- Using custom importance evaluation

Use-case: different aggregation strategy for multiple objectives. Below we'll use maximum of importances for a feature across objectives instead of the default mean.

In [None]:
# Using custom aggregation
def get_imp(estimator, trial_data: TrialData):
    # mean |SHAP value| per objective, then the maximum across objectives
    explainer = shap.explainers.Tree(estimator)
    imp = explainer.shap_values(trial_data.x_test, approximate=False)
    imp = np.max(np.vstack([np.abs(v).mean(0) for v in imp]), axis=0)
    return imp


boruta = eBoruta(importance_getter=get_imp).fit(x, y2)
plot_imp_history(boruta.features_.history)
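
To make the difference between the two aggregation strategies concrete, the sketch below applies mean and max aggregation to invented per-objective attribution arrays. The list-of-arrays layout, one `(n_samples, n_features)` array per objective, is an assumption that matches how `get_imp` above iterates over `imp`.

```python
import numpy as np

# Two objectives, 3 samples, 2 features (invented attribution values).
per_objective = [
    np.array([[0.1, -0.4], [0.3, 0.2], [-0.2, 0.0]]),
    np.array([[0.6, 0.0], [-0.6, 0.1], [0.3, -0.1]]),
]

# Mean absolute attribution per feature, one row per objective.
per_obj_imp = np.vstack([np.abs(v).mean(0) for v in per_objective])

mean_agg = per_obj_imp.mean(axis=0)  # average across objectives
max_agg = per_obj_imp.max(axis=0)    # strongest objective wins
print(mean_agg, max_agg)
```

Max aggregation keeps a feature that matters strongly for only one objective from being diluted by the others.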

### Using `Callback`s

A callback can be any callable (including instances of classes with mutable state) that accepts `(estimator, features, dataset, trial_data, **kwargs)` and returns them in the same order, with `kwargs` as the last element.
Check the `callbacks` module for additional examples.

- `CatBoostClassifier` with categorical features

In [None]:
def handle_catboost_categorical(
    estimator: CatBoostClassifier, features: Features, 
    dataset: Dataset, trial_data: TrialData, **kwargs
):
    params = estimator.get_params()
    params['cat_features'] = [c for c in trial_data.x_test.columns if 'cat' in c]
    estimator = estimator.__class__(**params)
    return estimator, features, dataset, trial_data, kwargs

In [None]:
x_cat = boruta.dataset_.x.copy()
x_cat['1_cat'] = pd.Series(x_cat['1'].round(0).astype(int).astype('category'))

boruta = eBoruta().fit(
    x_cat, y, model=CatBoostClassifier(iterations=20, verbose=False),
    callbacks_trial_start=[handle_catboost_categorical],
)
plot_imp_history(boruta.features_.history)

- `CatBoostClassifier` with adjusted number of iterations

In [None]:
class AdjustIterations:
    def __init__(self, min_iterations: int = 5):
        self.min_iterations = min_iterations

    def __call__(self, estimator: CatBoostClassifier, features: Features, dataset: Dataset, trial_data: TrialData, **kwargs):
        num_features = trial_data.x_test.shape[1]
        num_iterations = max(self.min_iterations, num_features // 2)
        params = estimator.get_params()
        params['iterations'] = num_iterations
        estimator = estimator.__class__(**params)
        print(f'Set the number of iterations to {estimator.get_param("iterations")} (num_features={num_features})')
        return estimator, features, dataset, trial_data, kwargs

In [None]:
boruta = eBoruta().fit(
    x, y, model=CatBoostClassifier(iterations=20, verbose=False), 
    callbacks_trial_start=[AdjustIterations()]
)
plot_imp_history(boruta.features_.history)
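
Since callbacks may carry mutable state, they can also be used for bookkeeping. Below is a minimal sketch of a hypothetical `TrialCounter` callback that follows the same signature and return convention as the two examples above. It is exercised here with placeholder arguments rather than inside a real `fit` call.

```python
import typing as t


class TrialCounter:
    """Counts how many times the callback was invoked."""

    def __init__(self) -> None:
        self.n_calls = 0

    def __call__(self, estimator: t.Any, features: t.Any, dataset: t.Any,
                 trial_data: t.Any, **kwargs):
        self.n_calls += 1
        # Pass everything through unchanged, as the examples above do.
        return estimator, features, dataset, trial_data, kwargs


counter = TrialCounter()
# Placeholder strings stand in for the objects eBoruta would pass.
out = counter('estimator', 'features', 'dataset', 'trial_data')
out = counter(*out[:4], **out[4])
print(counter.n_calls)
```

In a real run, an instance would be passed in the same way as `AdjustIterations()` above, via `callbacks_trial_start=[counter]`.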