<a href="https://colab.research.google.com/github/chaitragopalappa/MIE590-690D/blob/main/6_XAI_DL_Specific.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# XAI for Deep Learning

Methods that are generally applicable to differentiable ML models including deep neural nets.

Sources: XAI is a fast-evolving field, and so are its taxonomy and reference sources. I use CAPTUM as the main source, supplemented by additional sources (listed below, plus the original manuscripts) as needed. I have also used GenAI chatbots (ChatGPT or Gemini) to generate short summaries in some cases. Some excerpts are copied directly from the sources below or from the original manuscripts, and are in italics.
* CAPTUM - a dedicated PyTorch library for XAI methods; it is the most comprehensive library and is actively maintained.
  * [PyTorch library - CAPTUM](https://captum.ai/docs/attribution_algorithms)  
  * [Journal article - CAPTUM](https://arxiv.org/pdf/2009.07896)

* Books/Manuscripts
  * [Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, by Christoph Molnar](https://christophm.github.io/interpretable-ml-book/)
    * good for more technical depth on general approaches
  * Rami Ibrahim and M. Omair Shafiq, Explainable Convolutional Neural Networks: A Taxonomy, Review, and Future Directions, ACM Computing Surveys, Vol. 55, No. 10, Article 206, 2023 https://dl.acm.org/doi/pdf/10.1145/3563691
    * specific to image data
  * [Adversarial Robustness: Theory and Practice by J. Zico Kolter · Aleksander Madry ](https://adversarial-ml-tutorial.org/)
* [Chronological collection of resources](https://github.com/pbiecek/xai_resources/blob/master/README.md#interesting-resources-related-to-xai-explainable-artificial-intelligence)
  * This is a good source for a collection of items up until 2021
* Other Python Packages  
  * [SHAP](https://github.com/shap/shap#deep-learning-example-with-gradientexplainer-tensorflowkeraspytorch-models)  

  * [TF-Explain](https://pypi.org/project/tf-explain/) [GitHub](https://github.com/sicara/tf-explain)  
  * [iNNvestigate](https://github.com/albermax/innvestigate?tab=readme-ov-file)


**Note**: I have used the CAPTUM algorithm descriptions along with IML by Christoph Molnar to prepare this lecture, to balance theory with computation. Some excerpts are directly copied and are in italics.

---
---



**Model types in Captum**

**Attribution methods**
* **Primary Attribution (or feature attribution)**: Evaluates contribution of each input feature to the output of a model.
* **Layer Attribution**: Evaluates contribution of each neuron in a given layer to the output of the model.
* **Neuron Attribution**: Evaluates contribution of each input feature on the activation of a particular hidden neuron.

Feature attribution works at the level of raw, human-interpretable data, i.e., the features themselves, while layer and neuron attribution delve into the model's internal, often more abstract, learned representations.

<table>
  <tr>
    <th style="text-align:center;">Primary Attribution</th>
    <th style="text-align:center;">Layer Attribution</th>
    <th style="text-align:center;">Neuron Attribution</th>
  </tr>
  <tr>
    <td style="vertical-align:top;">
      <ul>
        <li>Integrated Gradients</li>
        <li>Gradient SHAP</li>
        <li>DeepLIFT</li>
        <li>DeepLIFT SHAP</li>
        <li>Saliency</li>
        <li>Input × Gradient</li>
        <li>Guided Backpropagation and Deconvolution</li>
        <li>Guided GradCAM</li>
        <li>Feature Ablation</li>
        <li>Feature Permutation</li>
        <li>Occlusion</li>
        <li>Shapley Value Sampling</li>
        <li>LIME</li>
        <li>KernelSHAP</li>
      </ul>
    </td>
    <td style="vertical-align:top;">
      <ul>
        <li>Layer Conductance</li>
        <li>Internal Influence</li>
        <li>Layer Activation</li>
        <li>Layer Gradient × Activation</li>
        <li>GradCAM</li>
        <li>Layer Integrated Gradients</li>
        <li>Layer GradientSHAP</li>
        <li>Layer DeepLIFT</li>
        <li>Layer DeepLIFT SHAP</li>
        <li>Layer Feature Ablation</li>
      </ul>
    </td>
    <td style="vertical-align:top;">
      <ul>
        <li>Neuron Conductance</li>
        <li>Neuron Gradient</li>
        <li>Neuron Integrated Gradients</li>
        <li>Neuron Guided Backpropagation and Deconvolution</li>
        <li>Neuron GradientSHAP</li>
        <li>Neuron DeepLIFT</li>
        <li>Neuron DeepLIFT SHAP</li>
        <li>Neuron Feature Ablation</li>
      </ul>
    </td>
  </tr>
</table>

---

* **Noise Tunnel**: used to smooth the results of any attribution method
* Metrics:  Metrics available in CAPTUM to estimate the trustworthiness of model explanations.
  * **Infidelity**
  * **Sensitivity**

---

Attribution methods are generally either gradient-based or perturbation-based.

**Gradient-based methods**: Use the derivative of the model's output with respect to the input to measure feature importance, requiring the model to be differentiable and offering fast computation but sometimes producing noisy results.

**Perturbation-based methods**: Measure feature importance by observing how the model's output changes after modifying (or "perturbing") parts of the input, which makes them model-agnostic (they work on any black-box model) but slower to compute, since they need multiple forward passes.

---

  <img src="https://github.com/meta-pytorch/captum/blob/master/docs/Captum_Attribution_Algos.png?raw=true" height="400" width ="800">

  Source: [Captum: A unified and generic model interpretability library for PyTorch](https://github.com/meta-pytorch/captum/blob/master/docs/Captum_Attribution_Algos.png)

  ---

  **Emerging interpretable methods**
  * Detecting concepts
  * Influential examples

  **Robustness method**:
  * Adversarial robustness / adversarial examples
  

  ---
  ---

## ⬛ FEATURE ATTRIBUTION METHODS:
(Depending on whether they are applied to image, tabular, or text data, these are also called sensitivity maps, saliency maps, pixel attribution maps, gradient-based attribution methods, feature relevance, or feature contribution methods.)

Suppose the neural network outputs a prediction vector $F(x)$ of length $C$, i.e., $F(x)=[F_1(x),...,F_C(x)]$; for regression, $C=1$. Feature attribution methods take as input $x \in \mathbb{R}^p$ (image pixels, tabular data, words, …) with $p$ features, and output as explanation a relevance score for each of the $p$ input features. These are **post-hoc** methods, i.e., they are applied to a fully trained network to interpret model predictions.

Feature attribution methods can be categorized into:

**Gradient-only methods**: tell us whether a change in a feature/pixel would change the prediction. The larger the absolute value of the gradient, the stronger the effect of a change in this feature/pixel.

**Path-attribution methods**: The interpretation is always done with respect to a baseline: one instance (sample) is compared with a reference baseline sample (for images, this could be an artificial “zero” image such as a completely gray image; for tabular data, this could be zero, the mean/median, or a random sample). The difference between the model's scores for the instance and the baseline is the total attribution, which is distributed among the input features based on their contribution along the chosen path.

---
---






![](https://github.com/albermax/innvestigate/raw/master/examples/images/analysis_grid.png)
Source: [iNNvestigate](https://github.com/albermax/innvestigate?tab=readme-ov-file)
[Also see this for comparing different networks](https://github.com/albermax/innvestigate/blob/master/examples/imagenet_compare_networks.ipynb)
Note that all analyses have been performed based on the logit output (layer).

### **Feature attribution methods most suitable for image data using CNN**

**1. Saliency (uses vanilla gradients) (more common for image data in CNN though can be applied to  tabular or text data)(Simonyan et al., 2013)**


It computes the gradient of the output with respect to each input feature.
$$
S_i(x) = \frac{\partial F_c(x)}{\partial x_i}
$$

**How it Explains:**  
The gradient indicates how much the model output changes with small perturbations in each feature (e.g., what pixels most affect classification scores).  
Large gradient → model output is highly sensitive to that feature.  This approach can be understood as taking a first-order Taylor expansion of the network at the input, and the gradients are simply the coefficients of each feature in the linear representation of the model. The **absolute value** of these coefficients can be taken to represent **feature importance**.
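
A minimal NumPy sketch of the saliency formula, assuming a hypothetical two-layer ReLU network with random weights (`W1`, `W2`, `forward`, and `saliency` are illustrative names, not part of Captum). The gradient is written out by hand with the chain rule and checked against finite differences; in practice an autograd framework computes it in one backward pass.

```python
import numpy as np

# Hypothetical toy network: F(x) = W2 @ relu(W1 @ x), with C = 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # 4 hidden units, p = 3 input features
W2 = rng.normal(size=(2, 4))   # C = 2 class scores

def forward(x):
    """Return the vector of class scores F(x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def saliency(x, c):
    """Vanilla gradient dF_c/dx_i for every feature i (chain rule by hand)."""
    active = (W1 @ x > 0).astype(float)   # ReLU derivative: 1 where the unit is on
    return (W2[c] * active) @ W1          # shape (p,)

x = np.array([0.5, -1.0, 2.0])
c = int(np.argmax(forward(x)))            # explain the predicted class
S = saliency(x, c)
importance = np.abs(S)                    # absolute gradient as feature importance

# Sanity check: central finite differences should agree with the gradient.
eps = 1e-6
S_fd = np.array([
    (forward(x + eps * e)[c] - forward(x - eps * e)[c]) / (2 * eps)
    for e in np.eye(3)
])
print("saliency:", S)
print("finite differences:", S_fd)
```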


**Pros:**  
- Simple and efficient (one backward pass).  
- Works for any differentiable model.  

**Cons:**  
- Noisy and unstable.  
- Can fail when gradients vanish.

**Next Steps:**
- **Inspect sensitivity patterns:** Verify that the model is focusing on meaningful features or regions of an image (e.g., object areas, biological features).
- **Quantitative sanity check:** Apply input perturbations — if you remove high-saliency regions, prediction confidence should drop.
- **Action:** If heatmaps look random or irrelevant, consider:
  - Re-training with regularization or better preprocessing.
  - Using **smoother or more robust methods** like Integrated Gradients or SmoothGrad.

![](https://github.com/albermax/innvestigate/raw/master/examples/images/readme_example_input.png)
![](https://github.com/albermax/innvestigate/raw/master/examples/images/readme_example_analysis.png)

Source:[iNNvestigate](https://github.com/albermax/innvestigate?tab=readme-ov-file)

---

---

<b>2. DeconvNet (image data, CNN) (Zeiler & Fergus, 2014)</b>
 It is similar to saliency in that it computes the gradient of the target output with respect to the input through the backward pass (backpropagation), except that at each ReLU it applies the ReLU function to the gradients themselves, allowing only positive gradients to pass through (regardless of whether the corresponding activation in the forward pass was positive or negative).

 It can be applied to any neuron. It generates visualizations that show the features that positively contribute to a neuron's activation. In CNNs, early layers typically show basic features like edges and textures, while deeper layers visualize more complex, semantic patterns like object parts or specific shapes. Applying DeconvNet to neurons in different layers helps visualize these semantic hierarchies and understand which patterns in the input activate specific filters (layer-level receptive fields).

[See Figures 1 and 2 from original paper](https://arxiv.org/pdf/1311.2901)

---

**3. Guided Backpropagation (image data, CNN) (Springenberg et al., 2014)**

It is similar to DeconvNet in that it computes the gradient of the target output with respect to the input through the backward pass (backpropagation), applying the ReLU function to the gradients themselves to allow only positive gradients to pass through. Additionally, it blocks gradients wherever the input to the corresponding ReLU in the forward pass was negative. This provides crisper details.
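
The three methods seen so far differ only in how the backward signal is passed through each ReLU. The sketch below contrasts the three rules on a single ReLU with made-up numbers; `relu_backward` and its `rule` argument are hypothetical names used for illustration only.

```python
import numpy as np

def relu_backward(grad_out, pre_act, rule):
    """Backward signal through one ReLU under three different rules."""
    if rule == "vanilla":     # saliency: pass gradient where the forward input was positive
        return grad_out * (pre_act > 0)
    if rule == "deconvnet":   # DeconvNet: keep positive gradients, ignore the forward pass
        return np.maximum(grad_out, 0.0)
    if rule == "guided":      # Guided Backprop: both conditions must hold
        return np.maximum(grad_out, 0.0) * (pre_act > 0)
    raise ValueError(f"unknown rule: {rule}")

pre_act = np.array([1.0, -1.0, 2.0, -2.0])     # inputs to the ReLU in the forward pass
grad_out = np.array([0.5, 0.5, -0.5, -0.5])    # gradient arriving from the layer above

for rule in ("vanilla", "deconvnet", "guided"):
    print(rule, relu_backward(grad_out, pre_act, rule))
```

Only the first entry (positive forward input and positive incoming gradient) survives under the guided rule, which is why Guided Backpropagation maps tend to be sparser than the other two.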

As per [(Springenberg et al., 2014)](https://arxiv.org/pdf/1412.6806)
"*The deconvolutional network (’deconvnet’) approach to visualizing concepts learned by neurons in higher layers of a CNN can be summarized as follows. Given a high-level feature map, the ’deconvnet’ inverts the data flow of a CNN, going from neuron activations in the given layer down to an image. Typically, a single neuron is left non-zero in the high level feature map. Then the resulting reconstructed image shows the part of the input image that is most strongly activating this neuron (and hence the part that is most discriminative to it)*". On the other hand, GuidedBP does "*not condition on an input image. This way we get insight into what lower layers of the network learn.*"
See Figure 2: directly plots the filters


**How it Explains:**  
Highlights **fine-grained edges and textures** that trigger specific activations.  
*Used to inspect early-layer filters and texture sensitivity.*

**Pros:** High-resolution visualizations.  
**Cons:** Not class-discriminative; fails for non-ReLU networks.

**Next Steps:**
- **Visual verification:** Are highlighted regions consistent with domain knowledge (edges, features, textures)?
- **Action:** If the visualization shows low-level textures unrelated to the class:
  - Add **data augmentation** to reduce texture bias.
  - Consider using **Grad-CAM** or **Guided Grad-CAM** for class-level localization.

---

---

<b>4. Guided Grad-CAM (image data, CNN) (Selvaraju et al., 2017)</b>
It combines Grad-CAM’s class localization with Guided BP’s edge detail. Grad-CAM is a layer attribution method typically applied to the last convolutional layer, which has a much coarser resolution than the input image (see further details in the layer attribution section). Guided BP is a feature attribution method that propagates gradients all the way to the input and thus can show much more detail. Specifically, for an image we compute both the Grad-CAM explanation and the guided backpropagation map. The Grad-CAM attributions are upsampled to match the input size, and the two maps are then multiplied element-wise. Grad-CAM works like a lens that focuses on specific parts of the feature attribution map.

Creates high-res, class-specific maps.

**Explanation:**  
Helps explain **which pixels in an image** contributed most strongly to a particular class prediction.

**Pros:** Detailed and discriminative.  
**Cons:** Complex; not generalizable beyond CNNs.

**Next Steps:**
- **Visual localization:** Compare highlighted regions with ground-truth objects or segmentation masks.
- **Quantitative validation:** Compute localization accuracy metrics (IoU - (Intersection over Union)).
- **Action:** If attention is misplaced (e.g., analysis in one study revealed that a model classifying images into nurses and doctors had learned to look at the person’s face/hairstyle, indicating likely gender bias), consider fine-tuning.

![](https://camo.githubusercontent.com/ed92061e0f9cfb2841758c5ae05c7a4fc979b1dcbd01ea66eaff8516fbf6c846/687474703a2f2f692e696d6775722e636f6d2f4a614762645a352e706e67)
Guided Grad-CAM overview  
Source: [Original manuscript](https://arxiv.org/pdf/1610.02391);  [GITHUB](https://github.com/ramprs/grad-cam/?tab=readme-ov-file)

---
---

<b>5. Occlusion / Feature Ablation (mainly images) (Zeiler & Fergus, 2014)</b>

**Idea:** Mask parts of input and observe output change.

Occlusion: "*Occlusion is a perturbation based approach to compute attribution, involving replacing each contiguous rectangular region with a given baseline / reference, and computing the difference in output. For features located in multiple regions (hyperrectangles), the corresponding output differences are averaged to compute the attribution for that feature. Occlusion is most useful in cases such as images, where pixels in a contiguous rectangular region are likely to be highly correlated.*"  Source: [CAPTUM](https://captum.ai/docs/attribution_algorithms#feature-ablation)

*Feature ablation is a perturbation based approach to compute attribution, involving replacing each input feature with a given baseline / reference value (e.g. 0), and computing the difference in output. Input features can also be grouped and ablated together rather than individually. This can be used in a variety of applications. For example, for images, one can group an entire segment or region and ablate it together, measuring the importance of the segment (feature group).* Source: [CAPTUM](https://captum.ai/docs/attribution_algorithms#feature-ablation)
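
A rough sketch of sliding-window occlusion on a single-channel image, assuming a hypothetical scoring function `toy_model` that only looks at the top-left corner. The function `occlusion` and its parameters are illustrative and are not Captum's API; the averaging over overlapping windows follows the description quoted above.

```python
import numpy as np

def toy_model(img):
    """Hypothetical scorer: responds only to the top-left 2x2 corner of the image."""
    return float(img[:2, :2].sum())

def occlusion(model, img, window=2, stride=2, baseline=0.0):
    """Attribution = drop in score when a window is replaced by the baseline."""
    H, W = img.shape
    attr = np.zeros_like(img, dtype=float)
    counts = np.zeros_like(img, dtype=float)
    base_score = model(img)
    for r in range(0, H - window + 1, stride):
        for c in range(0, W - window + 1, stride):
            occluded = img.copy()
            occluded[r:r + window, c:c + window] = baseline
            diff = base_score - model(occluded)
            attr[r:r + window, c:c + window] += diff
            counts[r:r + window, c:c + window] += 1
    # Average over all windows that covered each pixel (matters when stride < window).
    return attr / np.maximum(counts, 1)

img = np.ones((4, 4))
attr = occlusion(toy_model, img)
print(attr)   # only the top-left block receives non-zero attribution
```

Each window costs one forward pass, so the run time grows with the number of window positions; this is the computational cost listed under the cons.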

[Code](https://captum.ai/tutorials/Resnet_TorchVision_Ablation)

**Explanation:**  
Directly measures **what parts of input are critical** for predictions by removing them.  
*Used for intuitive visual explanations.*

**Pros:** Simple, model-agnostic.  
**Cons:** Computationally heavy.

---
---



### **Feature attribution methods suitable for any type of data and deep learning (DL) model**
<b>6. Input × Gradient (any data type, any DL model)(Shrikumar et al., 2016)</b>

Combines input value with gradient to account for input magnitude.  

$$
\text{IxG}_i = x_i \cdot \frac{\partial F(x)}{\partial x_i}
$$

**How it Explains:**  
Reveals **how much each input feature contributes** to the prediction by weighting sensitivity with input strength.  
Useful for tabular data where features have scale.  
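
A small illustration, assuming a hypothetical linear model with made-up weights. For a linear model the gradient is just the weight vector, so Input × Gradient reduces to $x_i w_i$ and the attributions sum to the prediction minus the bias.

```python
import numpy as np

w = np.array([2.0, -1.0, 0.5])   # hypothetical weights of a linear model
b = 0.3

def F(x):
    return float(w @ x + b)

def grad_F(x):
    return w                      # the gradient of a linear model is constant

x = np.array([1.0, 3.0, 0.0])
saliency = np.abs(grad_F(x))      # ignores the input values
ixg = x * grad_F(x)               # Input x Gradient

print("saliency:", saliency)
print("input x gradient:", ixg)
print("sum of attributions:", ixg.sum(), " F(x) - b:", F(x) - b)
```

Note the third feature: its gradient is non-zero, but since its input value is 0 it receives zero attribution, which is the sense in which the method accounts for input magnitude.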

**Pros:** Simple and scale-aware.  
**Cons:** Fails in flat gradient regions.



---
<b>7. Integrated Gradients (any data type, any DL model) (Sundararajan et al., 2017)</b>

It represents the integral of gradients with respect to inputs along the path from a given baseline $x'$ to input $x$. The cornerstone of this approach is that it satisfies two fundamental axioms, namely sensitivity and implementation invariance. The baseline can be zeros, the mean/median, or a random sample for tabular data, and a zero image or a gray image for image data.

*An attribution method satisfies Sensitivity if for every input and baseline that differ in one feature but have different predictions then the differing feature should be given a non-zero attribution.* *Two networks are functionally equivalent if their outputs are equal for all inputs, despite having very different implementations. Attribution methods should satisfy Implementation Invariance, i.e., the attributions are always identical for two functionally equivalent networks.*
Source: Sundararajan et al., 2017 https://arxiv.org/pdf/1703.01365

$$
IG_i(x) = (x_i - x_i') \int_0^1 \frac{\partial F(x' + \alpha(x - x'))}{\partial x_i} d\alpha
$$

**Explanation:**  
Captures the *average influence* of a feature as the model transitions from an uninformative baseline to the actual input.  
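
In practice the integral is approximated numerically. The sketch below uses a midpoint Riemann sum on a hypothetical model $F(x)=\tanh(w \cdot x)$ whose gradient is known in closed form; the weights, the number of steps, and the function names are assumptions for illustration. It also checks that the attributions sum to $F(x) - F(x')$.

```python
import numpy as np

w = np.array([1.0, -2.0, 0.5])       # hypothetical model parameters

def F(x):
    return np.tanh(w @ x)

def grad_F(x):
    return (1.0 - np.tanh(w @ x) ** 2) * w

def integrated_gradients(x, baseline, steps=200):
    # Midpoint Riemann sum over alpha in (0, 1) along the straight-line path.
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_F(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

x = np.array([1.0, 0.5, 2.0])
baseline = np.zeros_like(x)
ig = integrated_gradients(x, baseline)

print("IG attributions:", ig)
print("sum:", ig.sum(), " F(x) - F(baseline):", F(x) - F(baseline))
```

Each step needs one gradient evaluation, which is the cost noted in the cons below; more steps give a closer match between the two printed numbers.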

**Pros:** Theoretically principled; robust to scaling.  
**Cons:** Requires multiple gradient computations; baseline choice matters.

**Next Steps:**
- **Path sanity check:** Verify baseline appropriateness (e.g., zero image, neutral text embedding).
- **Compare across data points:** Ensure consistent attribution patterns for similar inputs.
- **Action:** If attribution is unstable, average over multiple baselines or apply **Noise Tunnel** for smoothing.

---
<b>8. DeepLIFT (Deep Learning Important FeaTures)(any data type, any DL model) (Shrikumar et al., 2017)</b>

Instead of gradients, DeepLIFT tracks **how differences in activations from a baseline propagate** to the output.  
It helps explain how inputs *deviate from normal* to cause a prediction.  

Let
* $t$ represent some target output neuron of interest
* $x_1, x_2, ..., x_n$ represent some neurons in some intermediate layer or set of layers that are necessary and sufficient to compute $t$.
* $t^0$ represent the reference activation of $t$.
* $\Delta t$ represent the difference-from-reference, i.e., $\Delta t = t - t^0$.

DeepLIFT assigns contribution scores $C_{\Delta {x_i}\Delta t} \text{ to } \Delta {x_i} $
  * $\text{ s.t } \sum_{i=1}^n C_{\Delta {x_i}\Delta t} =\Delta t$, and
  * uses the concept of multipliers to "blame" specific neurons for the difference in output.

$$
 \quad m_{\Delta x \Delta t} = \frac{C_{\Delta x \Delta t}}{\Delta x}
$$

The multiplier is similar in spirit to a partial derivative. The fundamental advantage over gradient-based methods is that $C_{\Delta {x_i}\Delta t}$ can be non-zero even when $\frac {\partial t} {\partial x_i} = 0$. Another advantage over gradient-based methods is that the discontinuous nature of gradients causes sudden jumps in the importance score over infinitesimal changes in the input. By contrast, the difference-from-reference is continuous, allowing DeepLIFT to avoid discontinuities caused by bias terms.
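
To make the first advantage concrete, here is a hand-worked sketch on a small saturating network of the kind used to motivate DeepLIFT, $y = 1 - \text{ReLU}(1 - (i_1 + i_2))$, with input $(1, 1)$ and an assumed all-zero reference. It uses the linear rule and the Rescale rule for the ReLU; the variable names are illustrative.

```python
def relu(u):
    return max(u, 0.0)

def net(i1, i2):
    """Saturating toy network: y = 1 - relu(1 - (i1 + i2))."""
    u = 1.0 - (i1 + i2)      # pre-activation of the hidden unit
    h = relu(u)
    return u, h, 1.0 - h

x, ref = (1.0, 1.0), (0.0, 0.0)
u, h, y = net(*x)
u0, h0, y0 = net(*ref)

# Gradient at x: the ReLU is off (u < 0), so dy/di = 0 for both inputs.
grad = [0.0 if u <= 0 else 1.0 for _ in x]

# DeepLIFT multipliers, chained layer by layer.
m_i_u = -1.0                       # linear rule: u = 1 - i1 - i2
m_u_h = (h - h0) / (u - u0)        # Rescale rule for the ReLU: delta_h / delta_u
m_h_y = -1.0                       # linear rule: y = 1 - h
m_i_y = m_i_u * m_u_h * m_h_y      # chain rule for multipliers

contributions = [m_i_y * (xi - ri) for xi, ri in zip(x, ref)]

print("gradients:", grad)                     # both zero (saturated)
print("DeepLIFT contributions:", contributions)
print("delta_t:", y - y0)                     # contributions sum to this
```

The gradient assigns zero importance to both inputs, while DeepLIFT splits the output difference $\Delta t = 1$ equally between them.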


**Pros:** Avoids vanishing gradients.  
**Cons:** Sensitive to reference choice.

---
<b> DeepLIFT SHAP (Deep Learning Important FeaTures)(any data type, any DL model) </b>
DeepLIFT SHAP is a method extending DeepLIFT to approximate SHAP values.

---
---
<b>9. LIME (any data type, local, model-agnostic) (Ribeiro et al., 2016)</b>

**(See Lecture 4- below is brief overview)**
$$
\min_{g\in G} L(f,g,\pi_x) + \Omega(g)
$$

**Explanation:**  
Trains a simple surrogate (e.g., linear model) near an instance to approximate local model behavior.  
Shows **which features locally drive** the model’s decision.  
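
A bare-bones sketch of the idea for tabular data, assuming a hypothetical black-box function `f`, Gaussian perturbations, and an exponential proximity kernel; the kernel width and sample count are arbitrary choices. The complexity penalty $\Omega(g)$ is omitted here, so this is a plain weighted least-squares fit rather than the full LIME procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(X):
    """Hypothetical black-box model (nonlinear in feature 0, ignores feature 2)."""
    return np.sin(X[:, 0]) + 2.0 * X[:, 1]

x = np.array([0.0, 1.0, -1.0])               # instance to explain

# 1. Sample perturbations around x.
Z = x + rng.normal(scale=0.1, size=(500, 3))
# 2. Weight samples by proximity to x (the kernel pi_x).
dist2 = ((Z - x) ** 2).sum(axis=1)
pi = np.exp(-dist2 / (2 * 0.25 ** 2))
# 3. Fit a weighted linear surrogate g(z) = intercept + coef . (z - x).
A = np.hstack([np.ones((len(Z), 1)), Z - x])
sw = np.sqrt(pi)[:, None]                    # sqrt-weights turn WLS into ordinary LS
theta, *_ = np.linalg.lstsq(A * sw, f(Z) * sw[:, 0], rcond=None)
intercept, coef = theta[0], theta[1:]

print("local coefficients:", coef)           # close to the local slopes of f at x
```

The coefficients of the surrogate are the explanation: near this instance the local slopes of `f` are roughly 1, 2, and 0 for the three features.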

**Pros:** Model-agnostic; intuitive.  
**Cons:** Unstable to sampling; local scope only.


---
---

<b>10. Shapley Value Sampling(any data type, model-agnostic)  (Lundberg & Lee, 2017)</b>

**(See Lecture 4- below is brief overview)**
**Idea:** Fairly distributes model output among input features using cooperative game theory.  

**Explanation:**  
Shows how each feature’s presence or absence changes prediction.  
**Helps Explain:** Fair, theoretically grounded attributions.
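
A small sketch of the permutation view of Shapley values, assuming a hypothetical three-feature model and an all-zero baseline standing in for "feature absent". With three features all orderings can be enumerated exactly; the sampling variant simply averages over a random subset of orderings.

```python
import itertools
import numpy as np

def f(x):
    """Hypothetical model with an interaction between features 0 and 2."""
    return x[0] + 2.0 * x[1] + x[0] * x[2]

def shapley_from_permutations(f, x, baseline, perms):
    """Average marginal contribution of each feature over the given orderings."""
    phi = np.zeros(len(x))
    for perm in perms:
        z = baseline.copy()
        prev = f(z)
        for i in perm:
            z[i] = x[i]             # "switch on" feature i
            cur = f(z)
            phi[i] += cur - prev
            prev = cur
    return phi / len(perms)

x = np.array([1.0, 1.0, 1.0])
baseline = np.zeros(3)

# Exact values: enumerate all p! orderings (feasible only for tiny p).
all_perms = list(itertools.permutations(range(3)))
phi_exact = shapley_from_permutations(f, x, baseline, all_perms)

# Sampling approximation: use random orderings instead.
rng = np.random.default_rng(0)
sampled = [rng.permutation(3) for _ in range(200)]
phi_sampled = shapley_from_permutations(f, x, baseline, sampled)

print("exact:  ", phi_exact)
print("sampled:", phi_sampled)
print("f(x) - f(baseline):", f(x) - f(baseline))
```

The interaction term $x_0 x_2$ is split equally between features 0 and 2, and in both variants the attributions sum to $f(x) - f(\text{baseline})$.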

**Pros:** Satisfies fairness axioms.  
**Cons:** Exponential complexity; approximations needed.

---

<b>11. KernelSHAP (any data type, model-agnostic) (Lundberg & Lee, 2017)</b>

**(See Lecture 4- below is brief overview)**
Combines Shapley theory and LIME for efficient approximations.



**Explanation:**  
Helps explain both local and global feature importance through a weighted regression framework.

**Pros:** Balanced between theory and practicality.  
**Cons:** Slow for many features.


---
<b>12. GradientSHAP </b>

Gradient SHAP is a gradient-based method to compute SHAP values (it brings ideas from Integrated Gradients into SHAP). Gradient SHAP adds Gaussian noise to each input sample multiple times, selects a random point along the path between baseline and input, and computes the gradient of outputs with respect to those selected random points. The final SHAP values represent the expected value of gradients * (inputs - baselines).

The computed attributions approximate SHAP values under the assumptions that the input features are independent and that the explanation model is linear between the inputs and given baselines.

As it takes the expected value, it allows an entire dataset to be used as the background distribution (as opposed to a single reference value) and allows local smoothing.

Also see original [GradientExplainer in SHAP Implementation](https://github.com/shap/shap?tab=readme-ov-file#deep-learning-example-with-gradientexplainer-tensorflowkeraspytorch-models)

---
---

<b>13. Feature Permutation (Fisher et al., 2019)</b>

**Idea:** Randomly permute one feature and measure change in model output or performance.  

**Explanation:**  
Provides **global feature importance** — features whose shuffling causes large performance drop are most critical. A feature is “important” if shuffling its values increases the model error.
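
A minimal sketch on synthetic data, assuming a stand-in `model` function that happens to match the data-generating rule and mean squared error as the performance measure. All names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]            # feature 2 is irrelevant by construction

def model(X):
    """Stand-in for a trained model; here it matches the data-generating rule."""
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def permutation_importance(model, X, y, n_repeats=10):
    base_error = mse(y, model(X))
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the link between feature j and y
            importance[j] += mse(y, model(Xp)) - base_error
    return importance / n_repeats

imp = permutation_importance(model, X, y)
print("increase in MSE per feature:", imp)
```

Feature 0 should show the largest error increase and feature 2 none, since the model never uses it.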

**Pros:** Global measure; model-agnostic.  
**Cons:** Correlation can mislead importance.



---
---


##  ⬛  LAYER ATTRIBUTION METHODS

<b>1. Layer Integrated Gradients</b>

Layer integrated gradients represents the integral of gradients with respect to the layer inputs / outputs along the straight-line path from the layer activations at the given baseline to the layer activation at the input.


**Explanation:**  
Quantifies contribution of a layer’s internal activations along the path from baseline to input.

**Helps Explain:**  
Reveals how **information builds up** through internal layers.



---
---
<b>2. Grad-CAM (Gradient-weighted Class Activation Mapping) (image data, CNN) (Selvaraju et al., 2017)</b>

Adds *class-awareness* to visual attributions in CNNs.  
$$
L^c = \text{ReLU}\left(\sum_k \alpha_k^c A^k\right) \in \mathbb{R}^{U\times V}
$$
$$
\alpha_k^c = \frac{1}{Z}\sum_{u,v}\frac{\partial y^c}{\partial A_{uv}^k}
$$
Here, $U$ is the width and $V$ the height of the explanation, $c$ is the class of interest, $y^c$ is the raw (pre-softmax) score for class $c$, $A^k$ is the $k^{th}$ feature map output by the convolutional layer (typically the last convolutional layer), and $Z = U \cdot V$ is the number of pixels in a feature map. Grad-CAM finds the localization map $L^c$ for the selected class $c$ by weighting each feature map $A^k$ with $\alpha_k^c$, which is the average, over all feature-map pixels $A_{uv}^k$, of the gradient of the class score $y^c$ with respect to that pixel. The weighted sum is passed through ReLU to retain only positive values.  

1. Forward-propagate the input image through the convolutional neural network.
2. Obtain the raw score for the class of interest, meaning the activation of the neuron before the softmax layer.
3. Set all other class activations to zero.
4. Back-propagate the gradient of the class of interest to the last convolutional layer before the fully connected layers: $\frac{\partial y^c}{\partial A^k}$.
5. Calculate the feature map weights $\alpha_k^c$.
6. Calculate the weighted sum of the feature maps $\left(\sum_k \alpha_k^c A^k\right)$.
7. Apply ReLU to the weighted sum.
8. For visualization: Scale values to the interval $[0,1]$. Upscale the heatmap (to match the input size) and overlay it on the original image.
9. Additional step for Guided Grad-CAM: Multiply heatmap with guided backpropagation.
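
Steps 4 to 8 can be sketched in NumPy as follows. To avoid needing autograd, this assumes a hypothetical head consisting of global average pooling followed by a linear layer, for which the gradient $\partial y^c / \partial A^k_{uv}$ is known in closed form; the feature maps, class weights, and the 32×32 input size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, U, V = 3, 4, 4
A = np.maximum(rng.normal(size=(K, U, V)), 0.0)   # feature maps A^k (post-ReLU)
w_c = np.array([1.0, -0.5, 2.0])                  # hypothetical class-c weights

# Toy head: y^c = sum_k w_c[k] * mean_{u,v} A^k (global average pooling + linear).
# For this head the gradient is known exactly: dy^c/dA^k_{uv} = w_c[k] / (U*V).
dy_dA = np.broadcast_to(w_c[:, None, None] / (U * V), A.shape)

alpha = dy_dA.mean(axis=(1, 2))                   # alpha_k^c: average gradient per map
cam = np.maximum((alpha[:, None, None] * A).sum(axis=0), 0.0)   # ReLU(sum_k alpha_k A^k)

# Visualization: scale to [0, 1], then upsample to the input size
# (nearest-neighbour here; implementations typically interpolate).
cam_norm = cam / cam.max() if cam.max() > 0 else cam
scale = 8                                         # assumed input is 32x32
heatmap = np.kron(cam_norm, np.ones((scale, scale)))

print("alpha:", alpha)
print("coarse map shape:", cam.shape, " heatmap shape:", heatmap.shape)
```

The coarse 4×4 map being stretched to 32×32 is the limited resolution mentioned in the cons; multiplying the resulting heatmap element-wise with a guided backpropagation map would give Guided Grad-CAM.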

Shows **which spatial regions** of a CNN feature map most influence a given class output.  
**Pros:** Class-specific; intuitive for CNNs.  
**Cons:** Limited resolution; CNN-only.
![](https://camo.githubusercontent.com/ed92061e0f9cfb2841758c5ae05c7a4fc979b1dcbd01ea66eaff8516fbf6c846/687474703a2f2f692e696d6775722e636f6d2f4a614762645a352e706e67)
Guided Grad-CAM overview  
Source: [Original manuscript](https://arxiv.org/pdf/1610.02391);  [GITHUB](https://github.com/ramprs/grad-cam/?tab=readme-ov-file)

---

<b>3. Layer Conductance (Dhamdhere et al., 2018)</b>
Conductance combines the neuron activation with the partial derivatives of both the neuron with respect to the input and the output with respect to the neuron to build a more complete picture of neuron importance.

Conductance builds on Integrated Gradients (IG) by looking at the flow of IG attribution which occurs through the hidden neuron. The formal definition of total conductance of a hidden neuron y (from the original paper) is as follows:
$$
\text{Cond}^y(x)::=\sum_i (x_i-x'_i)\int_0^1 \frac{\partial F(x'+\alpha(x-x'))}{\partial y}  \frac{\partial y}{\partial x_i}d\alpha
$$
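
A numerical sketch of this formula for every neuron of one hidden layer, assuming a hypothetical network $F(x) = v \cdot \text{ReLU}(Wx)$ with random weights and a zero baseline. The integral is approximated with a midpoint Riemann sum, and the check at the end confirms that the conductances of all hidden neurons add up to $F(x) - F(x')$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))      # input -> hidden
v = rng.normal(size=4)           # hidden -> output

def F(x):
    return float(v @ np.maximum(W @ x, 0.0))

def conductance(x, baseline, steps=50):
    """Cond^y for every hidden neuron y_j = relu(W_j . x), via a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    cond = np.zeros(W.shape[0])
    for a in alphas:
        xa = baseline + a * (x - baseline)
        dy_dx = (W @ xa > 0)[:, None] * W          # dy_j/dx_i, shape (hidden, p)
        dF_dy = v                                  # dF/dy_j (output is linear in y)
        cond += dF_dy * (dy_dx @ (x - baseline))   # sum over input features i
    return cond / steps

x = np.array([0.5, -1.0, 2.0])
baseline = np.zeros(3)
cond = conductance(x, baseline)

print("conductance per hidden neuron:", cond)
print("sum:", cond.sum(), " F(x) - F(baseline):", F(x) - F(baseline))
```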

**Explanation:**  
Captures how a layer’s neurons conduct relevance to the output.  
Helps quantify **importance of entire feature maps or layers**.



---

<b>4. Layer DeepLIFT</b>

Applies DeepLIFT propagation rules to intermediate activations.  
Helps understand **activation differences** at specific depths.  



---

<b>5. Layer Gradient SHAP</b>

Combines **expected gradients** and **SHAP sampling** at layer level.  
Explains internal contribution distributions.

---

<b>6. Layer-Wise Relevance Propagation (LRP)</b>: It uses a backward propagation mechanism, applied sequentially to all layers of the model, to see which neurons contributed to the output.

---

**General Steps for Layer Attribution:**
- Analyze contributions per layer or feature map.  
- Identify which layer accumulates most relevance.
- Assess early vs. late hidden layers for meaningful transformations.  
- Detect bottleneck layers responsible for key decisions.
- Identify layers where class-specific patterns emerge (e.g., edges → textures → objects).  (images)
- Visualize feature maps to confirm hierarchical abstraction. (for images)

**Action:**  
Use to optimize architecture or guide pruning.


---
---



##   ⬛  NEURON ATTRIBUTION METHODS

<b>1. Neuron Gradient</b>

Measures gradient of neuron activation w.r.t. input.  
Reveals which inputs most strongly activate a given neuron.

---

<b>2. Neuron Input × Gradient</b>

Weights input sensitivity by its magnitude, restricted to a specific neuron.  
Explains how each feature contributes to activating that neuron.



---

<b>3. Neuron Integrated Gradients</b>

*Neuron Integrated Gradients approximates the integral of input gradients with respect to a particular neuron along the path from a baseline input to the given input. This method is equivalent to applying integrated gradients considering the output to be simply the output of the identified neuron.*


---

<b>4. Neuron Conductance</b>

Breaks layer conductance into individual neuron contributions.  
Shows **which neurons are responsible** for transmitting relevant signals.

$$
\text{cond}_i^y(x)::= (x_i-x'_i)\int_0^1 \frac{\partial F(x'+\alpha(x-x'))}{\partial y}  \frac{\partial y}{\partial x_i}d\alpha
$$

---

<b>5. Neuron DeepLIFT</b>

Applies DeepLIFT rules at neuron granularity.  
Useful for identifying **neurons encoding semantic concepts**.



---

---


#  ⬛  <b>Noise Tunnel (Smilkov et al., 2017)</b>

Typically applied to feature attribution methods to smooth out noisy gradients. It adds Gaussian noise to the input, computes attributions for each noisy sample, repeats this multiple times, and combines the resulting attributions according to the chosen type. CAPTUM supports three noise tunnel types:

* Smoothgrad: The mean of the sampled attributions is returned. This approximates smoothing the given attribution method with a Gaussian Kernel.
* Smoothgrad Squared: The mean of the squared sample attributions is returned.
* Vargrad: The variance of the sample attributions is returned.
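
The three types can be sketched as a thin wrapper around any attribution function. The sketch below assumes a hypothetical model $F(x)=\tanh(w \cdot x)$ with vanilla gradients as the base method; `noise_tunnel`, its arguments, and the type labels are illustrative and not Captum's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])

def attribution(x):
    """Base attribution method: vanilla gradient of F(x) = tanh(w . x)."""
    return (1.0 - np.tanh(w @ x) ** 2) * w

def noise_tunnel(attr_fn, x, nt_type="smoothgrad", n_samples=50, stdev=0.1):
    # Attribute n_samples noisy copies of x, then aggregate.
    noisy = x + rng.normal(scale=stdev, size=(n_samples, x.size))
    attrs = np.array([attr_fn(z) for z in noisy])
    if nt_type == "smoothgrad":
        return attrs.mean(axis=0)            # mean of sampled attributions
    if nt_type == "smoothgrad_sq":
        return (attrs ** 2).mean(axis=0)     # mean of squared attributions
    if nt_type == "vargrad":
        return attrs.var(axis=0)             # variance of sampled attributions
    raise ValueError(f"unknown nt_type: {nt_type}")

x = np.array([1.0, 0.5, 2.0])
results = {t: noise_tunnel(attribution, x, t)
           for t in ("smoothgrad", "smoothgrad_sq", "vargrad")}
for name, value in results.items():
    print(name, value)
```

Note that the squared and variance types discard the sign of the attribution, so they indicate where the model is sensitive rather than in which direction.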


**How it Helps:**  
Smooths out noisy gradients, making attributions more robust and interpretable.

Example from IML by Molnar:
A pretrained VGG-16 (Simonyan and Zisserman 2015), trained on ImageNet, was used to classify three images: the first image was classified as “Greyhound” with a probability of 35% (it missed the author's textbook in the image); the second (middle) image was correctly classified as “Soup Bowl” with a probability of 50%; the third image (an octopus on the ocean floor) was incorrectly classified as “Eel” with a high probability of 70%.
![](https://christophm.github.io/interpretable-ml-book/images/original-images-classification.jpg)

![](https://christophm.github.io/interpretable-ml-book/images/smoothgrad.jpg)

Source: IML by Molnar *Figure 28.3: Pixel attributions or saliency maps for the Vanilla Gradient method, SmoothGrad and Grad-CAM.*  

Vanilla saliency (Vanilla Gradient) and SmoothGrad both highlight the dog, which makes sense. But they also highlight some areas around the book. Grad-CAM highlights only the book area, which makes no sense (note that the image was classified as Greyhound).

---
---

#  ⬛   ATTRIBUTION METRICS

<b>1. Infidelity (Yeh et al., 2019)</b>

Infidelity measures the mean squared error between (i) the explanation multiplied by the magnitude of an input perturbation and (ii) the change in the predictor function's output under that perturbation.

**Explanation:**  
Measures how accurately an attribution explains output changes under perturbations.
- Low infidelity → faithful explanation.
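
As a rough NumPy sketch of this metric, the expectation $\mathbb{E}_I\big[(I^\top \Phi - (f(\mathbf{x}) - f(\mathbf{x}-I)))^2\big]$ can be estimated by sampling perturbations $I$. The function name `infidelity`, the Gaussian perturbation distribution, and the toy linear model are assumptions for illustration only, not CAPTUM's API.

```python
import numpy as np

def infidelity(f, attribution, x, n_perturb=200, stdev=0.1, seed=0):
    """Monte Carlo estimate of infidelity for one input x (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_perturb):
        perturbation = rng.normal(0.0, stdev, size=x.shape)
        predicted_change = perturbation @ attribution   # what the explanation predicts
        actual_change = f(x) - f(x - perturbation)      # what the model actually does
        errors.append((predicted_change - actual_change) ** 2)
    return float(np.mean(errors))

# Hypothetical linear model: its gradient w is an exact explanation of output changes
w = np.array([1.5, -2.0, 0.5])
linear_model = lambda z: float(w @ z) + 0.3
x = np.array([0.2, 0.4, -0.1])

faithful = infidelity(linear_model, w, x)              # gradient attribution
unfaithful = infidelity(linear_model, np.zeros(3), x)  # attribution that ignores the model
print(faithful, unfaithful)
```

For the linear model the gradient attribution reproduces every output change, so its infidelity is zero up to floating-point error, while the all-zero attribution scores strictly worse.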



---

<b>2. Sensitivity (Yeh et al., 2019)</b>

Sensitivity measures how much an explanation changes under subtle input perturbations, estimated with a Monte Carlo sampling-based approximation.

**Explanation:**  
Assesses how stable the attributions are to small input noise.
- Low sensitivity → the attributions change little, which implies a more stable and more reliable explanation.
- High sensitivity → the attributions are unstable, which can indicate overfitting or gradient instability.
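
A minimal NumPy sketch of a Monte Carlo sensitivity estimate, assuming the max-sensitivity form: the largest change in the attribution vector over random perturbations drawn from a small $L_\infty$ ball around the input. `max_sensitivity`, the radius, and the two toy attribution functions are illustrative assumptions, not CAPTUM's API.

```python
import numpy as np

def max_sensitivity(attr_fn, x, radius=0.02, n_samples=20, seed=0):
    """Largest attribution change over sampled perturbations with |delta_i| <= radius."""
    rng = np.random.default_rng(seed)
    base_attr = attr_fn(x)
    worst = 0.0
    for _ in range(n_samples):
        delta = rng.uniform(-radius, radius, size=x.shape)
        change = np.linalg.norm(attr_fn(x + delta) - base_attr)
        worst = max(worst, float(change))
    return worst

# Gradient of a hypothetical linear model: constant, so perfectly stable
stable_attr = lambda z: np.array([1.5, -2.0, 0.5])
# Gradient of the hypothetical model f(x) = sum(sin(50 * x)): changes quickly with x
unstable_attr = lambda z: 50.0 * np.cos(50.0 * z)

x = np.array([0.2, 0.4, -0.1])
print(max_sensitivity(stable_attr, x))    # 0.0
print(max_sensitivity(unstable_attr, x))  # strictly positive
```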

---
---




## **Limitations of attribution methods and other options**

The importance of a single feature (e.g., one pixel in an image) rarely conveys a meaningful interpretation on its own, and the expressiveness of a feature-based explanation is constrained by the number of features.

Concept-based approaches address these limitations.

# ⬛ Detecting concepts: Testing with Concept Activation Vectors (TCAV)
A concept is a human-understandable, high-level abstraction, such as a color, an object (e.g., "stripes" in an image of a zebra), or even an idea. *Given any user-defined concept, although a neural network might not be explicitly trained with the given concept, the concept-based approach detects that concept embedded within the latent space learned by the network. In other words, the concept-based approach can generate explanations that are not limited by the feature space of a neural network.*

**Testing with Concept Activation Vectors (TCAV)** was developed to generate global explanations for neural networks (but could work for any model where taking a directional derivative is possible). For any given concept, TCAV measures the extent of that concept’s influence on the model’s prediction for a certain class, e.g., how the concept of “striped” influences a model classifying an image as a “zebra.” Concepts are incorporated into the importance score computations using Concept Activation Vectors (CAVs).

**Concept Activation Vector (CAV)**: A **CAV** represents a concept $C$ as a direction in the activation space of a neural network layer $l$ ($l$ is also called a bottleneck of the model).  
It is the basis for measuring how strongly a concept influences a model’s prediction.

**To compute a CAV**:
1. Prepare two datasets:
   - **Concept dataset:** examples representing concept $C$
   - **Random dataset:** arbitrary examples (no concept)
    
    Example: to define the concept of “striped,” we can collect images of striped objects as the concept dataset, while the random dataset is a group of random images without stripes.
2. For the bottleneck layer $l$ of the network, train a binary classifier (e.g., SVM or logistic regression) to separate activations of concept vs. random data at layer $l$.
3. The binary classifier’s coefficient vector becomes the **CAV**, denoted $\mathbf{v}_l^C$.
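
The three steps above can be sketched with NumPy on synthetic activations. Everything here is an illustrative assumption: the activations are random vectors rather than outputs of a real layer, the "concept" is a shift along the first activation dimension, and `train_cav` is a plain gradient-descent logistic regression written for this example.

```python
import numpy as np

def train_cav(concept_acts, random_acts, lr=0.1, n_steps=500):
    """Fit logistic regression (concept = 1, random = 0) and return the unit-norm weights."""
    X = np.vstack([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probability of "concept"
        w -= lr * (X.T @ (p - y)) / len(y)       # gradient step on the mean log-loss
        b -= lr * np.mean(p - y)
    return w / np.linalg.norm(w)                 # CAV: normal to the decision boundary

rng = np.random.default_rng(0)
concept_direction = np.array([1.0, 0.0, 0.0, 0.0])   # hypothetical "striped" direction
random_acts = rng.normal(size=(200, 4))                                   # step 1: random set
concept_acts = rng.normal(size=(200, 4)) + 3.0 * concept_direction        # step 1: concept set
cav = train_cav(concept_acts, random_acts)                                # steps 2 and 3
print(cav)
```

In practice the rows of `concept_acts` and `random_acts` would be the layer-$l$ activations of the concept and random images, flattened to vectors.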

Given an image input $\mathbf{x}$, its **conceptual sensitivity** $S_{C,k,l}(\mathbf{x})$ is measured by calculating the directional derivative of the prediction in the direction of the unit CAV ($\mathbf{v}_l^C$)
$$S_{C,k,l}(\mathbf{x}) = \nabla h_{l,k}(\hat{f}_l(\mathbf{x})) \cdot \mathbf{v}_l^C$$


- $\mathbf{x}$: input image  
- $\hat{f}_l(\mathbf{x})$: maps the input  $\mathbf{x}$ to the activation vector of the layer $l$  
- $h_{l,k}$: mapping from activation vector to the logit output of class $k$  
- $\mathbf{v}_l^C$: CAV for concept $C$ at layer $l$  
- $\nabla h_{l,k}$: gradient of class $k$ logit w.r.t. activations  

**Interpretation:**  
Since the gradient ($\nabla h_{l,k}$) points in the direction in which the output increases most rapidly, the conceptual sensitivity $S_{C,k,l}(\mathbf{x})$ intuitively indicates whether $\mathbf{v}_l^C$ points in a similar direction, i.e., whether moving the activations toward the concept increases $h_{l,k}$. Thus:
- $S_{C,k,l}(\mathbf{x}) > 0$: concept $C$ encourages classification as class $k$.  
- $S_{C,k,l}(\mathbf{x}) < 0$: concept $C$ discourages it.

*The vector that is orthogonal to learnt decision boundary and is pointing towards the direction of a concept is the CAV of that concept.*

---

**Testing with CAVs (TCAV)**

To measure a concept’s overall influence on a class:

$$TCAV_{Q,C,k,l} = \frac{|\{\mathbf{x} \in X_k : S_{C,k,l}(\mathbf{x}) > 0\}|}{|X_k|}$$

**Notations:**
- $X_k$: set of samples predicted as class $k$  
- Numerator: number of samples with **positive conceptual sensitivity**  
- Denominator: total samples in class $k$
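
A small NumPy sketch of the conceptual sensitivity and the TCAV score, under stated assumptions: the class-$k$ head is a hypothetical $h_{l,k}(\mathbf{a}) = \mathbf{w}_k \cdot \tanh(\mathbf{a})$ so that its gradient can be written by hand, the activations are random vectors, and the CAVs are simple unit vectors rather than trained ones.

```python
import numpy as np

def conceptual_sensitivity(acts, w_k, cav):
    """Directional derivative of h(a) = w_k . tanh(a) along the CAV, one value per sample."""
    grads = w_k * (1.0 - np.tanh(acts) ** 2)   # gradient of the class-k logit w.r.t. activations
    return grads @ cav

def tcav_score(sensitivities):
    """Fraction of samples with positive conceptual sensitivity."""
    return float(np.mean(sensitivities > 0))

rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 3))       # stand-in for layer-l activations of the samples in X_k
w_k = np.array([2.0, -1.0, 0.5])       # hypothetical class-k head weights

cav_helpful = np.array([1.0, 0.0, 0.0])    # aligned with a positively weighted dimension
cav_harmful = np.array([0.0, 1.0, 0.0])    # aligned with a negatively weighted dimension
cav_mixed = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)

print(tcav_score(conceptual_sensitivity(acts, w_k, cav_helpful)))  # 1.0
print(tcav_score(conceptual_sensitivity(acts, w_k, cav_harmful)))  # 0.0
print(tcav_score(conceptual_sensitivity(acts, w_k, cav_mixed)))    # between 0 and 1
```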

**Example:**  
If $TCAV_{striped,zebra} = 0.8$, then **80% of zebra predictions** are positively influenced by the concept *“striped”*.

**Statistical significance test** for TCAV: instead of training only one CAV, train multiple CAVs using different random datasets while keeping the concept dataset the same. A meaningful concept should generate CAVs with consistent TCAV scores.

1. *Collect $N$ random datasets, where it is recommended that $N$ is at least 10.*
2. *Fix the concept dataset and calculate the TCAV score using each of the $N$ random datasets.*
3. *Apply a two-sided t-test to the $N$ TCAV scores against $N$ other TCAV scores generated by a random CAV. A random CAV can be obtained by choosing a random dataset as the concept dataset.*
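
The test in step 3 can be sketched as follows. The TCAV scores below are made-up numbers used purely for illustration, not results from any model. The sketch assumes an equal-variance (pooled) Student's t-test with 10 scores per group, so the two-sided 5% critical value for 18 degrees of freedom (2.101, from a standard t-table) is hard-coded; a statistics library would normally supply the p-value instead.

```python
import numpy as np

T_CRIT_DF18 = 2.101   # two-sided 5% critical value of Student's t with 18 degrees of freedom

def pooled_t_statistic(a, b):
    """Equal-variance two-sample t statistic."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var * (1.0 / na + 1.0 / nb)))

# Hypothetical TCAV scores: one per random dataset (N = 10 each)
concept_cav_scores = [0.82, 0.78, 0.85, 0.80, 0.79, 0.83, 0.81, 0.77, 0.84, 0.80]
random_cav_scores = [0.52, 0.47, 0.55, 0.49, 0.51, 0.46, 0.53, 0.50, 0.48, 0.54]

t_stat = pooled_t_statistic(concept_cav_scores, random_cav_scores)
significant = abs(t_stat) > T_CRIT_DF18   # reject "concept scores look like random-CAV scores"
print(t_stat, significant)
```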

![](https://christophm.github.io/interpretable-ml-book/images/tcav.jpg)
*Figure 29.1: Measuring TCAV scores of three concepts for the model predicting “zebra.” The targeted bottleneck is a layer called “mixed4c.” A star sign above “dotted” indicates that “dotted” has not passed the statistical significance test, i.e., having the p-value larger than 0.05. Both “striped” and “zigzagged” have passed the test, and both concepts are useful for the model to identify “zebra” images according to TCAV. Figure originally from the TCAV GitHub. Figure description from [IML by Molnar](https://christophm.github.io/interpretable-ml-book/detecting-concepts.html)*

[Tensorflow TCAV library](https://pypi.org/project/tcav/)  
TCAV example: [GITHUB](https://github.com/tensorflow/tcav/blob/master/Run_TCAV.ipynb)

----

[Network Dissection](https://netdissect.csail.mit.edu/): While TCAV allows users to test hypotheses about user-defined concepts, network dissection automatically identifies and labels the semantic meaning of individual network units in a deep CNN. That is, it links highly activated areas of CNN channels with human concepts (objects, parts, textures, colors, …).

Network dissection works by measuring the alignment between unit response and a set of concepts drawn from a broad and dense **segmentation data set called Broden**. *Provides the distribution of concepts represented in each unit of network (limited to concepts in the predefined dictionary)*

Algorithm (source: [IML by Molnar](https://christophm.github.io/interpretable-ml-book/cnn-features.html)):
1. Get images with human-labeled visual concepts: Network Dissection requires pixel-wise labeled images with concepts of different abstraction levels (from colors to street scenes). Bau & Zhou et al. combined a couple of datasets with pixel-wise concepts. They called this new dataset **‘Broden’, which stands for broadly and densely labeled data**.
2. Measure the CNN channel activations for these images.
3. Quantify the alignment of activations and labeled concepts.
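
Step 3 can be illustrated with an intersection-over-union (IoU) score between a thresholded activation map and a pixel-wise concept mask. This is a toy NumPy sketch: the 8×8 activation map, the masks, and the fixed threshold are assumptions for the example, and it skips details of the full method such as upsampling the activation map to image resolution and choosing the threshold from the unit's activation distribution over the whole dataset.

```python
import numpy as np

def iou_score(activation_map, concept_mask, threshold):
    """IoU between the unit's highly activated region and the labeled concept region."""
    unit_mask = activation_map > threshold
    intersection = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return float(intersection / union) if union > 0 else 0.0

# Toy channel activation: strongly active in the top-left 4x4 block only
activation_map = np.zeros((8, 8))
activation_map[:4, :4] = 1.0

same_region = np.zeros((8, 8), dtype=bool)
same_region[:4, :4] = True          # concept covers exactly the active block
top_half = np.zeros((8, 8), dtype=bool)
top_half[:4, :] = True              # concept covers the active block plus 16 more pixels
right_half = np.zeros((8, 8), dtype=bool)
right_half[:, 4:] = True            # concept does not overlap the active block

print(iou_score(activation_map, same_region, threshold=0.5))  # 1.0
print(iou_score(activation_map, top_half, threshold=0.5))     # 0.5
print(iou_score(activation_map, right_half, threshold=0.5))   # 0.0
```

A unit is then labeled with the concept whose mask gives the highest IoU, provided that score is large enough to count as a detection.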


  ![!](https://netdissect.csail.mit.edu/image/alexnet-places365-conv5-unique.svg)
  *"By measuring the concept that best matches each unit, Net Dissection can break down the types of concepts represented in a layer: here the 256 units of AlexNet conv5 trained on Places represent many objects and textures, as well as some scenes, parts, materials, and a color."* Source: Embedded from https://netdissect.csail.mit.edu/

---
---


# ⬛ INFLUENTIAL EXAMPLES

*TracInCP calculates the influence score of a given training example on a given test example, which roughly speaking, represents how much higher the loss for the given test example would be if the given training example were removed from the training dataset, and the model re-trained. This functionality can be leveraged towards the following two use cases*:

1. *For a single test example, identifying its influential examples. These are the training examples with the most positive influence scores (proponents) and the training examples with the most negative influence scores (opponents).*
2. *Identifying mislabelled data, i.e. examples in the training data whose "ground-truth" label is actually incorrect. The influence score of a mislabelled example on itself (i.e. self-influence score) will tend to be high. Thus to find mislabelled examples, one can examine examples in order of decreasing self-influence score.*

*TracInCP can be used for any trained Pytorch model for which several model checkpoints are available.*

Source: CAPTUM tutorials

Journal article: [Pruthi et al., Estimating Training Data Influence by Tracing Gradient Descent, NeurIPS 2020](https://arxiv.org/pdf/2002.08484)


Code example: [GITHUB](https://github.com/meta-pytorch/captum/blob/master/tutorials/TracInCP_Tutorial.ipynb)
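
To make the checkpoint-based influence score concrete, here is a minimal NumPy sketch of the idea: sum, over saved checkpoints, the learning rate times the dot product of the training-example and test-example loss gradients. The linear model with squared-error loss, the two checkpoints, and the learning rates are hypothetical values chosen for illustration; this is not CAPTUM's `TracInCP` API.

```python
import numpy as np

def loss_grad(w, x, y):
    """Gradient w.r.t. w of the squared-error loss 0.5 * (w . x - y)^2."""
    return (w @ x - y) * x

def tracin_cp(checkpoints, learning_rates, train_example, test_example):
    """Checkpoint-based influence of one training example on one test example."""
    score = 0.0
    for w, lr in zip(checkpoints, learning_rates):
        score += lr * (loss_grad(w, *train_example) @ loss_grad(w, *test_example))
    return float(score)

# Hypothetical weights saved at two points during training, with their learning rates
checkpoints = [np.array([0.0, 0.0]), np.array([0.5, 0.5])]
learning_rates = [0.1, 0.1]

test_example = (np.array([1.0, 0.0]), 1.0)
proponent = (np.array([1.0, 0.0]), 1.0)    # same input and label as the test example
opponent = (np.array([1.0, 0.0]), -1.0)    # same input, conflicting label

print(tracin_cp(checkpoints, learning_rates, proponent, test_example))  # positive
print(tracin_cp(checkpoints, learning_rates, opponent, test_example))   # negative
print(tracin_cp(checkpoints, learning_rates, opponent, opponent))       # self-influence
```

The self-influence score is a sum of squared gradient norms, so it is never negative, and it tends to be large for examples the model keeps fitting poorly, which is why it is useful for flagging mislabelled data.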


# ⬛ ADVERSARIAL ROBUSTNESS / ADVERSARIAL EXAMPLES
Source:  [Adversarial Robustness: Theory and Practice by J. Zico Kolter · Aleksander Madry ](https://adversarial-ml-tutorial.org/)

This is more of a robustness method than an interpretability method. An adversarial example is an instance with small, intentional, often imperceptible feature perturbations (e.g., to the pixels of an image) that cause a machine learning model to make a false prediction. Generating such examples ("adversarial attacks") exposes vulnerabilities in deep learning models, which can then be used to test and improve their robustness.

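FGSM (Fast Gradient Sign Method) and PGD (Projected Gradient Descent) are two common methods for generating adversarial examples. FGSM takes a single step in the direction of the sign of the loss gradient with respect to the input, which makes it simple and fast. PGD takes several smaller steps and projects the result back onto the allowed perturbation set after each one, which produces stronger attacks and is commonly used for robustness evaluation.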
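
Both attacks can be sketched in NumPy on a toy binary logistic-regression classifier, where the input gradient of the cross-entropy loss has the closed form $(\sigma(\mathbf{w}\cdot\mathbf{x}+b) - y)\,\mathbf{w}$. The weights, input, and perturbation budget below are hypothetical values picked for illustration, and the perturbation set is assumed to be an $L_\infty$ ball of radius `eps`.

```python
import numpy as np

def input_grad(x, y, w, b):
    """Gradient of the binary cross-entropy loss w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return (p - y) * w

def fgsm(x, y, w, b, eps):
    """One signed-gradient step of size eps."""
    return x + eps * np.sign(input_grad(x, y, w, b))

def pgd(x, y, w, b, eps, step, n_steps):
    """Repeated signed-gradient steps, each projected back into the L-infinity ball."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(input_grad(x_adv, y, w, b))
        x_adv = np.clip(x_adv, x - eps, x + eps)   # projection step
    return x_adv

w, b = np.array([2.0, -3.0, 1.0]), 0.0    # hypothetical trained classifier
x, y = np.array([0.5, -0.2, 0.1]), 1.0    # correctly classified input (logit > 0)

x_fgsm = fgsm(x, y, w, b, eps=0.4)
x_pgd = pgd(x, y, w, b, eps=0.4, step=0.1, n_steps=10)
print(w @ x + b, w @ x_fgsm + b, w @ x_pgd + b)   # clean logit is positive, adversarial logits negative
```

Because this toy model is linear in the input, the gradient sign never changes and PGD ends at the same point as FGSM; the extra iterations pay off for nonlinear networks, where the gradient direction changes along the path.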