# A guide to Explainable AI with SHapley Additive exPlanations 

## Introduction

This tutorial shows how to leverage SHapley Additive exPlanations (SHAP) to explain the output of machine learning models in Microsoft Fabric.

SHAP interprets a machine learning model by attributing the model's output for a specific data point to the contribution of each input feature. In this tutorial, you use Kernel SHAP to explain a tabular classification model built from the Adult Census Income dataset and then visualize the explanation in the ExplanationDashboard from [Responsible AI Widgets](https://github.com/microsoft/responsible-ai-widgets) in Microsoft Fabric.
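
To make the idea of per-feature attribution concrete, the following sketch computes exact Shapley values for a tiny, hypothetical three-feature function by averaging each feature's marginal contribution over all feature orderings. The toy function, the instance `x`, and the all-zero `baseline` are assumptions made for illustration; they are not part of the tutorial's model, and Kernel SHAP approximates this computation rather than enumerating orderings.

```python
import itertools


def exact_shapley(value_fn, num_features):
    """Average each feature's marginal contribution over all feature orderings."""
    contributions = [0.0] * num_features
    orderings = list(itertools.permutations(range(num_features)))
    for ordering in orderings:
        present = set()
        previous = value_fn(present)
        for feature in ordering:
            present.add(feature)
            current = value_fn(present)
            contributions[feature] += current - previous
            previous = current
    return [c / len(orderings) for c in contributions]


# Hypothetical toy "model": feature 0 acts alone, features 1 and 2 interact.
x = [1.0, 2.0, 3.0]          # instance to explain
baseline = [0.0, 0.0, 0.0]   # reference values used for "absent" features


def toy_value(present):
    # Features in `present` keep the instance's value; the rest use the baseline.
    z = [x[i] if i in present else baseline[i] for i in range(3)]
    return 2.0 * z[0] + z[1] * z[2]


base_value = toy_value(set())
phi = exact_shapley(toy_value, 3)
print(base_value, phi)
```

The base value plus the sum of the attributions equals the toy function's output for `x`. This additive property is what lets the SHAP values in this tutorial be read as a decomposition of a predicted probability.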

This tutorial covers these topics:

1. Install the `raiwidgets` library
2. Load and process the data, and train a binary classification model
3. Create a TabularSHAP explainer and extract SHAP values
4. Visualize the explanation using the RAI ExplanationDashboard


## Step 1: Install custom library

Before you process the data and train a model, you need to install a custom library. This tutorial uses the notebook's in-line installation capabilities (e.g., `pip`, `conda`, etc.) to get started quickly. This process installs the custom library only in your notebook environment, not in the workspace.

The PySpark kernel automatically restarts after the `%pip install` command runs, so install the library before you run any other cells in your notebook.

Use `%pip install` to install the `raiwidgets` library. For further information about how to install the ["raiwidgets"](https://pypi.org/project/raiwidgets/) and ["interpret-community"](https://pypi.org/project/interpret-community/) packages, see [Package management - Azure Synapse Analytics | Microsoft Docs](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-azure-portal-add-libraries).

In [1]:
%pip install raiwidgets itsdangerous==2.0.1 interpret-community

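
After the kernel restarts, a quick check can confirm which versions ended up in the notebook environment. The sketch below uses only the Python standard library; the helper name `installed_version` is hypothetical, and the package names are the ones passed to `%pip install` above.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report what is currently installed in this notebook environment.
for name in ["raiwidgets", "interpret-community", "itsdangerous"]:
    print(name, installed_version(name))
```

A `None` result for any of the three names suggests that the install cell did not complete in the current session.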

You also need to import the required libraries from [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) and [SynapseML](https://microsoft.github.io/SynapseML/) and define some User Defined Functions (UDFs) that you will need later.

In [2]:
from IPython.terminal.interactiveshell import TerminalInteractiveShell
from synapse.ml.explainers import *
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.sql.types import *
from pyspark.sql.functions import *
import pandas as pd

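# UDF that returns element i of a Spark ML vector as a float
vec_access = udf(lambda v, i: float(v[i]), FloatType())
# UDF that converts a Spark ML vector into an array of floats
vec2array = udf(lambda vec: vec.toArray().tolist(), ArrayType(FloatType()))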


To disable Microsoft Fabric autologging in a notebook session, call `mlflow.autolog()` and set `disable=True`.

In [3]:
# Set up MLflow for experiment tracking
import mlflow

mlflow.autolog(disable=True)  # Disable MLflow autologging


## Step 2: Load the data and train the model

For this tutorial, you use the [Adult Census Income dataset](https://archive.ics.uci.edu/ml/datasets/Adult). The dataset contains 32,561 rows and 14 feature columns.

Download a publicly available version of the dataset from blob storage and load the data as a Spark DataFrame.

In [4]:
df = spark.read.parquet(
    "wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet"
).cache()

labelIndexer = StringIndexer(
    inputCol="income", outputCol="label", stringOrderType="alphabetAsc"
).fit(df)
print("Label index assignment: " + str(set(zip(labelIndexer.labels, [0, 1]))))

Label index assignment: {(' <=50K', 0), (' >50K', 1)}


The next step is to preprocess the data (index the categorical features and one-hot encode them) and train a logistic regression model to predict the `income` label (1 or 0) from the input features.

In [5]:
training = labelIndexer.transform(df)
display(training)
categorical_features = [
    "workclass",
    "education",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "native-country",
]
categorical_features_idx = [col + "_idx" for col in categorical_features]
categorical_features_enc = [col + "_enc" for col in categorical_features]
numeric_features = [
    "age",
    "education-num",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
]
# Convert the categorical features into numerical indices
strIndexer = StringIndexer(
    inputCols=categorical_features, outputCols=categorical_features_idx
)
# Perform one-hot encoding
onehotEnc = OneHotEncoder(
    inputCols=categorical_features_idx, outputCols=categorical_features_enc
)
# Create a VectorAssembler to assemble all the one-hot encoded categorical features and numerical features into a single feature vector
vectAssem = VectorAssembler(
    inputCols=categorical_features_enc + numeric_features, outputCol="features"
)
# Train a Logistic Regression model
lr = LogisticRegression(featuresCol="features", labelCol="label", weightCol="fnlwgt")
pipeline = Pipeline(stages=[strIndexer, onehotEnc, vectAssem, lr])
model = pipeline.fit(training)

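
The indexing and one-hot encoding stages can be pictured with a small plain-Python sketch. It assumes the behavior that, to my understanding, matches the Spark defaults (labels ordered by descending frequency with ties broken alphabetically, and the last category dropped so that it encodes as an all-zero vector). The helper names and the sample `workclass` values are hypothetical, and this is not Spark's implementation.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties: alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """One-hot encode an index; with drop_last, the last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector


# Hypothetical sample of one categorical column.
workclass = ["Private", "Private", "State-gov", "Private", "Self-emp-not-inc", "State-gov"]
mapping = fit_string_index(workclass)
encoded = [one_hot(mapping[value], len(mapping)) for value in workclass]
print(mapping)
print(encoded)
```

In the pipeline above, `VectorAssembler` then concatenates the encoded vectors of all categorical columns with the numeric columns into the single `features` vector that the logistic regression consumes.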

After the model is trained, you randomly select some observations to be explained.

In [6]:
explain_instances = (
    model.transform(training).orderBy(rand()).limit(5).repartition(200).cache()
)
display(explain_instances)


## Step 3: Create a TabularSHAP Explainer and extract SHAP Values

Create a TabularSHAP explainer with the following settings: set the input columns to all the features that the model uses, pass the model itself, and name the target output column that you want to explain.

In this scenario, you explain the `probability` output, which is a vector of length 2. The focus here is on the class 1 probability, so `targetClasses` is set to `[1]`. To explain the class 0 and class 1 probabilities at the same time, set `targetClasses` to `[0, 1]`.

As background data for the Kernel SHAP explanation method, randomly sample 100 rows from the training dataset. This sample is used to integrate out the effects of individual features when the SHAP values are calculated.

In [7]:
# Compute SHAP values for the trained model
shap = TabularSHAP(
    inputCols=categorical_features + numeric_features,
    outputCol="shapValues",
    numSamples=5000,
    model=model,
    targetCol="probability",
    targetClasses=[1],
    backgroundData=broadcast(training.orderBy(rand()).limit(100).cache()),
)

shap_df = shap.transform(explain_instances)


Note that `inputCols` lists the input features that you want to explain, which in this case are both the categorical and the numeric features. `outputCol` names the column of the resulting DataFrame in which the SHAP values are stored.

`targetCol` names the column that holds the model's output (here, the probability scores), and `targetClasses` lists the class indices whose output is explained (here, class 1 only).
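
The role of the background data can be sketched in a few lines of NumPy. For a given subset of "present" features, the instance's values are copied into every background row and the model's predictions are averaged, which integrates out the remaining features. The function `coalition_value`, the two-feature `toy_predict` model, and the random background sample are all hypothetical stand-ins; this illustrates the general idea and is not how `TabularSHAP` is implemented internally.

```python
import numpy as np


def coalition_value(predict_fn, instance, background, present):
    """Mean prediction when only the `present` features keep the instance's values."""
    idx = np.asarray(present, dtype=int)
    mixed = background.copy()
    mixed[:, idx] = instance[idx]  # absent features keep their background values
    return float(predict_fn(mixed).mean())


def toy_predict(rows):
    # Hypothetical two-feature logistic model returning a class 1 probability.
    return 1.0 / (1.0 + np.exp(-(0.8 * rows[:, 0] - 0.5 * rows[:, 1])))


rng = np.random.default_rng(0)
background = rng.normal(size=(100, 2))  # stands in for the 100 sampled rows
instance = np.array([2.0, -1.0])

base_value = coalition_value(toy_predict, instance, background, [])
only_first = coalition_value(toy_predict, instance, background, [0])
full_value = coalition_value(toy_predict, instance, background, [0, 1])
print(base_value, only_first, full_value)
```

With no features present, the result is the mean prediction over the background sample, which corresponds to the base value stored as the first element of each SHAP vector.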

Once you have the resulting DataFrame that contains the SHAP values, you can extract the class 1 probability of the model output, the SHAP values for the target class, the original features, and the true label. You then convert the result to a pandas DataFrame for visualization.

For each observation, the first element in the SHAP values vector is the base value (the mean output of the background dataset), and each of the following elements is the SHAP value for one feature.

In [8]:
# Choose following columns from the DataFrame
# "shapValues": The modified array of SHAP values
# "probability": The extracted class 1 probability
# "label": A column assumed to contain labels or target values
shaps = (
    shap_df.withColumn("probability", vec_access(col("probability"), lit(1)))
    .withColumn("shapValues", vec2array(col("shapValues").getItem(0)))
    .select(
        ["shapValues", "probability", "label"] + categorical_features + numeric_features
    )
)

shaps_local = shaps.toPandas()
shaps_local.sort_values("probability", ascending=False, inplace=True, ignore_index=True) # Arrange with the highest probabilities at the top
pd.set_option("display.max_colwidth", None)
shaps_local


| | shapValues | probability | label | workclass | education | marital-status | occupation | relationship | race | sex | native-country | age | education-num | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [0.22639637, -0.04680731, 0.071876764, 0.18150136, 0.047877356, -0.056228638, 0.006498608, 0.03429532, 0.01009472, 0.043627653, -0.13330539, -0.014468111, -0.003207495, 0.1284375] | 0.496589 | 0.0 | Self-emp-not-inc | HS-grad | Married-civ-spouse | Sales | Husband | White | Male | United-States | 53 | 9 | 0 | 0 | 70.0 |
| 1 | [0.22639884, -0.0009190406, -0.13165753, -0.19061367, 0.060929745, -0.06848006, -0.00061092, 0.016712157, 0.00018915116, -0.021545324, 0.18501957, -0.014528859, -0.007017388, -0.004015508] | 0.049859 | 0.0 | Private | Bachelors | Never-married | Exec-managerial | Own-child | White | Male | United-States | 34 | 13 | 0 | 0 | 40.0 |
| 2 | [0.22639705, -0.019952588, 0.02597009, -0.1478365, 0.04753728, -0.04471956, 0.0002848202, 0.013580291, 0.0013058935, -0.038417693, -0.031333774, -0.012864289, -0.0028575552, -0.00409028] | 0.013003 | 0.0 | State-gov | Some-college | Never-married | Exec-managerial | Own-child | White | Male | United-States | 21 | 10 | 0 | 0 | 40.0 |
| 3 | [0.22639602, -0.018113226, 0.021045336, -0.1119184, 0.02237297, 0.036934316, 0.0021088163, -0.03216041, -0.028895276, -0.020643266, -0.025586909, -0.007032772, -0.00077092095, -0.060675245] | 0.003062 | 0.0 | Self-emp-not-inc | Some-college | Never-married | Prof-specialty | Not-in-family | White | Female | Mexico | 26 | 10 | 0 | 0 | 4.0 |
| 4 | [0.22639537, 0.0013713314, 0.0687188, -0.0686547, -0.03228535, -0.02358418, -0.00030497543, 0.006463711, -0.036727395, -0.020135889, -0.11122784, -0.0077783973, -0.0011583443, -0.0007189848] | 0.000374 | 0.0 | Private | 11th | Never-married | Other-service | Own-child | White | Male | El-Salvador | 21 | 7 | 0 | 0 | 42.0 |


## Step 4: Visualize the explanation using the RAI ExplanationDashboard


You can visualize the explanation in [interpret-community format](https://github.com/interpretml/interpret-community) in the [ExplanationDashboard](https://github.com/microsoft/responsible-ai-widgets/).

In [9]:
import numpy as np

features = categorical_features + numeric_features
features_with_base = ["Base"] + features

rows = shaps_local.shape[0]

local_importance_values = shaps_local[["shapValues"]] # Extract the "shapValues" column from the "shaps_local" DataFrame
eval_data = shaps_local[features]
true_y = np.array(shaps_local[["label"]])


Process the stored SHAP values to separate the bias value (the base value described in Step 3) from the per-feature importance values for each data point.

In [10]:
list_local_importance_values = local_importance_values.values.tolist()
converted_importance_values = []
bias = []
for classarray in list_local_importance_values:
    for rowarray in classarray:
        converted_list = rowarray.tolist()
        # The bias values are stored in the bias list
        bias.append(converted_list[0])
        # Remove the bias from local importance values
        del converted_list[0]
        # Importance values are stored in the converted_importance_values list
        converted_importance_values.append(converted_list)

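
Because SHAP values are additive, the bias plus the per-feature importance values of a row should reproduce the class 1 probability that the model predicted for that row. The sketch below checks this for the first row of the Step 3 output, with the numbers copied from that table. The variable names are hypothetical, and agreement is only expected up to the rounding of the displayed values.

```python
# SHAP vector and class 1 probability of the first explained row (Step 3 output).
shap_row = [
    0.22639637, -0.04680731, 0.071876764, 0.18150136, 0.047877356,
    -0.056228638, 0.006498608, 0.03429532, 0.01009472, 0.043627653,
    -0.13330539, -0.014468111, -0.003207495, 0.1284375,
]
reported_probability = 0.496589

# The first element is the bias (base value); the rest are per-feature contributions.
base_value, feature_contributions = shap_row[0], shap_row[1:]
reconstructed = base_value + sum(feature_contributions)

print(reconstructed, reported_probability)
```

The same check can be applied to the `bias` and `converted_importance_values` lists built in the previous cell, one row at a time.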

Create a global explanation that is based on feature importance values (SHAP values), evaluation data, and expected values (bias terms).

In [11]:
from interpret_community.adapter import ExplanationAdapter

adapter = ExplanationAdapter(features, classification=True)  # features: names of the explained features
# eval_data holds the feature values of the explained instances
global_explanation = adapter.create_global(
    converted_importance_values, eval_data, expected_values=bias
)



View the global importance values.


In [12]:
global_explanation.global_importance_values


[0.01743269938742742,
 0.06385370306670665,
 0.14010492712259293,
 0.04220054037868977,
 0.04598935022950172,
 0.001961627951823175,
 0.020642377622425555,
 0.015442487041582353,
 0.02887396514415741,
 0.09729469530284404,
 0.01133448574692011,
 0.003002340684179217,
 0.03958750433521345]
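
The printed global values are consistent with taking the mean absolute SHAP value of each feature across the explained rows. The sketch below checks this for the first three features (`workclass`, `education`, `marital-status`), with the class 1 SHAP values copied from the Step 3 output. This is an observation about the numbers shown in this tutorial, not a statement about the library's internals.

```python
import numpy as np

# Class 1 SHAP values of the first three features for the five explained rows,
# copied from the Step 3 output (the base value in position 0 is left out).
local_values = np.array([
    [-0.04680731, 0.071876764, 0.18150136],
    [-0.0009190406, -0.13165753, -0.19061367],
    [-0.019952588, 0.02597009, -0.1478365],
    [-0.018113226, 0.021045336, -0.1119184],
    [0.0013713314, 0.0687188, -0.0686547],
])

# Mean absolute contribution per feature (column-wise).
mean_abs = np.abs(local_values).mean(axis=0)
print(mean_abs)
```

The three results agree with the first three entries of `global_importance_values` above to about six decimal places.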

View the local importance values.

In [13]:
global_explanation.local_importance_values


[[[0.04680731147527695,
   -0.07187676429748535,
   -0.1815013587474823,
   -0.04787735641002655,
   0.0562286376953125,
   -0.006498607806861401,
   -0.03429532051086426,
   -0.01009471993893385,
   -0.043627653270959854,
   0.1333053857088089,
   0.014468111097812653,
   0.0032074949704110622,
   -0.12843750417232513],
  [0.0009190405835397542,
   0.13165752589702606,
   0.19061367213726044,
   -0.06092974543571472,
   0.06848005950450897,
   0.0006109200185164809,
   -0.016712157055735588,
   -0.0001891511637950316,
   0.021545324474573135,
   -0.1850195676088333,
   0.014528859406709671,
   0.007017388008534908,
   0.004015508107841015],
  [0.01995258778333664,
   -0.025970090180635452,
   0.14783650636672974,
   -0.04753727838397026,
   0.04471955820918083,
   -0.00028482021298259497,
   -0.013580290600657463,
   -0.0013058935292065144,
   0.038417693227529526,
   0.03133377432823181,
   0.012864288873970509,
   0.0028575551696121693,
   0.004090279806405306],
  [0.018113225698471

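Define a wrapper class that exposes `predict` and `predict_proba` methods on top of the Spark pipeline model. The dashboard receives this wrapper instead of the Spark model itself.

In [14]:
class wrapper(object):
    def __init__(self, model):
        self.model = model

    def predict(self, data):
        # Score the input data with the wrapped Spark model and return the predicted labels
        sparkdata = spark.createDataFrame(data)
        return (
            self.model.transform(sparkdata)
            .select("prediction")
            .toPandas()
            .values.flatten()
            .tolist()
        )

    def predict_proba(self, data):
        # Return the class probabilities as one [p(class 0), p(class 1)] list per row
        sparkdata = spark.createDataFrame(data)
        prediction = (
            self.model.transform(sparkdata)
            .select("probability")
            .toPandas()
            .values.flatten()
            .tolist()
        )
        proba_list = [vector.toArray().tolist() for vector in prediction]
        return proba_list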

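
Before launching the dashboard, it can help to confirm that an object with these two methods returns outputs of the expected shape. The sketch below runs such a check against a hypothetical `ToyModel` stand-in so that it works without Spark. The helper `check_model_contract` and its specific checks are assumptions made for illustration, not requirements documented by `raiwidgets`.

```python
import numpy as np


def check_model_contract(model, rows):
    """Check that predict and predict_proba return one entry per input row."""
    labels = list(model.predict(rows))
    probabilities = np.asarray(model.predict_proba(rows), dtype=float)
    assert len(labels) == len(rows), "predict should return one label per row"
    assert probabilities.shape[0] == len(rows), "predict_proba should return one row per input row"
    assert np.allclose(probabilities.sum(axis=1), 1.0), "class probabilities should sum to 1"
    return probabilities.shape[1]  # number of classes


class ToyModel:
    """Hypothetical stand-in that offers the same two methods as the wrapper."""

    def predict(self, rows):
        return [float(p[1] >= 0.5) for p in self.predict_proba(rows)]

    def predict_proba(self, rows):
        scores = 1.0 / (1.0 + np.exp(-np.asarray(rows, dtype=float).sum(axis=1)))
        return [[1.0 - s, s] for s in scores]


num_classes = check_model_contract(ToyModel(), [[0.2, -1.0], [1.5, 0.3]])
print(num_classes)
```

The same helper could be called with `wrapper(model)` and a few rows of `eval_data` in the notebook, at the cost of running a Spark job for each call.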

The following cell launches the dashboard with the Kernel SHAP results. You can select a feature of interest, choose the chart type, and so on, to explore the impact of different features.


In [15]:
# View the explanation in the ExplanationDashboard
from raiwidgets import ExplanationDashboard

ExplanationDashboard(
    global_explanation, wrapper(model), dataset=eval_data, true_y=true_y
)


## Summary

In this tutorial, you learned how to use Kernel SHAP to quantify feature importance for a machine learning model, which supports model transparency, debugging, and improvement.

Kernel SHAP explains the predictions of complex models by attributing the model's output to each input feature. It estimates these attributions by fitting a weighted linear model over sampled feature coalitions, with background data standing in for the features that are left out. The resulting explanations show how different input variables influence the model's decisions, which helps you understand and debug the model.
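
The weighted-regression idea behind Kernel SHAP can be sketched with NumPy on a small, hypothetical three-feature value function. For simplicity, the sketch enumerates every coalition instead of sampling (the explainer in this tutorial is configured with `numSamples=5000`, presumably because enumeration is impractical for many features), and it uses a large finite weight as a stand-in for the infinite weight that the Shapley kernel assigns to the empty and full coalitions. All function names here are assumptions made for illustration.

```python
import itertools
import math

import numpy as np


def shapley_kernel_weight(num_features, coalition_size):
    """Shapley kernel weight for a coalition of the given size."""
    if coalition_size in (0, num_features):
        return 1e6  # large finite stand-in for the infinite weight
    return (num_features - 1) / (
        math.comb(num_features, coalition_size)
        * coalition_size
        * (num_features - coalition_size)
    )


def kernel_shap_all_coalitions(value_fn, num_features):
    """Fit a weighted linear model over every coalition; return (base value, attributions)."""
    masks = np.array(list(itertools.product([0.0, 1.0], repeat=num_features)))
    weights = np.array([shapley_kernel_weight(num_features, int(m.sum())) for m in masks])
    targets = np.array([value_fn(m) for m in masks])
    design = np.hstack([np.ones((len(masks), 1)), masks])  # intercept + one column per feature
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], targets * root_w, rcond=None)
    return coef[0], coef[1:]


def toy_value(mask):
    # Hypothetical value function: feature 0 acts alone, features 1 and 2 interact.
    return 2.0 * mask[0] + 6.0 * mask[1] * mask[2]


base_value, attributions = kernel_shap_all_coalitions(toy_value, 3)
print(base_value, attributions)
```

For this toy function, the fitted coefficients come out close to 2, 3, and 3, with a base value close to 0: the interaction between features 1 and 2 is split evenly between them, and the attributions sum to the value of the full coalition.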

With the steps illustrated above, you can apply Kernel SHAP to your own models and check whether their behavior aligns with their intended goals.