# supervised clustering test
https://www.aidancooper.co.uk/supervised-clustering-shap-values/

In [1]:
import lightgbm as lgb
import shap
from sklearn.datasets import make_classification


In [2]:
# simulate raw data
X, y = make_classification(
    n_samples=1000,
    n_features=50,
    n_informative=5,
    n_classes=2,
    n_clusters_per_class=3,
    shuffle=False
)

In [3]:
X.shape

(1000, 50)

In [4]:
y.shape

(1000,)

In [5]:
# fit a GBT model to the data
m = lgb.LGBMClassifier()
m.fit(X, y)

In [6]:
# compute SHAP values
explainer = shap.Explainer(m)
shap_values = explainer(X)

In [8]:
shap_values.shape

(1000, 50, 2)

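Unlike a typical machine learning prediction workflow, there is no need to split the data into training and test sets: the objective is only to transform the 1,000x50 array of raw data into a new representation made of SHAP values. This means we can fit the model directly to all the available data, and overfitting is not a concern here. Note that the explainer output above has shape (1000, 50, 2), i.e. one 1,000x50 array of SHAP values per class, so the 1,000x50 representation corresponds to the slice for a single class.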
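
The output of `shap_values.shape` above has a trailing class axis, while the text describes a 1,000x50 array. A minimal sketch of how the two might be reconciled is shown here. The helper name `to_2d_shap` and its `class_index` parameter are hypothetical (not part of `shap` or the original notebook), and the choice of class 1 as the one to keep is an assumption. A zero-filled NumPy array stands in for the real SHAP values so the block runs on its own.

```python
import numpy as np


def to_2d_shap(values, class_index=1):
    """Return an (n_samples, n_features) array of SHAP values.

    Hypothetical helper: accepts either a 2D array (already one value per
    sample and feature) or a 3D array with a trailing class axis, as in the
    (1000, 50, 2) output above, and keeps the slice for `class_index`.
    """
    values = np.asarray(values)
    if values.ndim == 2:
        return values
    if values.ndim == 3:
        return values[:, :, class_index]
    raise ValueError(f"expected a 2D or 3D array, got {values.ndim}D")


# Stand-in array with the same shape as the notebook output (assumption:
# in the notebook this would be the array held in shap_values.values).
demo = np.zeros((1000, 50, 2))
print(to_2d_shap(demo).shape)  # (1000, 50)
```

In the notebook itself, the call would presumably be `to_2d_shap(shap_values.values)`, which leaves a 2D array unchanged if the installed `shap` version already returns one.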