If you find an overlooked paper, please open an issue or pull request and provide it in this format:
- **[]** Paper Name [[pdf]]() [[code]]()
- Visualizing and Understanding Convolutional Networks [pdf]
- Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps [pdf] [saliency code]
- Striving for Simplicity: The All Convolutional Net [pdf]
- Understanding Neural Networks Through Deep Visualization [pdf]
- Synthesizing the preferred inputs for neurons in neural networks via deep generator networks [pdf]
- Multifaceted Feature Visualization: Uncovering the Different Types of Features Learned By Each Neuron in Deep Neural Networks [pdf]
- Understanding Deep Image Representations by Inverting Them [pdf]
- Visualizing deep convolutional neural networks using natural pre-images [pdf]
- Understanding Neural Networks via Feature Visualization: A survey [pdf]
- Conditional iterative generation of images in latent space [pdf]
- Interpretable Explanations of Black Boxes by Meaningful Perturbation [pdf] [code] [code] [code]
- Gradient-Based Attribution Methods [pdf]
- Top-down Neural Attention by Excitation Backprop [pdf] [code]
- Salient Deconvolutional Networks [pdf]
- Explaining and Interpreting LSTMs [pdf]
- Explaining and Harnessing Adversarial Examples [pdf] [code] [code] [code]
- Adversarial Training for Free! [pdf] [code] [video]
- Fast Adversarial Training with Smooth Convergence [pdf] [code]
- Intriguing properties of neural networks [pdf]
- High Confidence Predictions for Unrecognizable Images [pdf]
- Contrastive Explanations in Neural Networks [pdf] [code] [slides]
- Towards better understanding of gradient-based attribution methods for Deep Neural Networks [pdf]
- On the (In)fidelity and Sensitivity of Explanations [pdf] [code]
- Unsupervised learning of object semantic parts from internal states of CNNs by population encoding [pdf]
- Diverse feature visualizations reveal invariances in early layers of deep neural networks [pdf]
- Interpretation of Neural Networks is Fragile [pdf]
- Towards Better Analysis of Deep Convolutional Neural Networks [pdf]
- Do semantic parts emerge in Convolutional Neural Networks? [pdf]
- Do Convolutional Neural Networks Learn Class Hierarchy? [pdf]
- A Benchmark for Interpretability Methods in Deep Neural Networks [pdf]
- On the Robustness of Interpretability Methods [pdf]
- Sanity Checks for Saliency Maps [pdf]
- Sanity Checks for Saliency Metrics [pdf]
- Relative Attributing Propagation: Interpreting the Comparative Contributions of Individual Units in Deep Neural Networks [pdf]
- Transformer Interpretability Beyond Attention Visualization [pdf] [code] [video]
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [pdf] [code]
- Optimizing Relevance Maps of Vision Transformers Improves Robustness [pdf] [code]
- Investigating the influence of noise and distractors on the interpretation of neural networks [pdf]
- Do Explanations Explain? Model Knows Best [pdf] [code]
- Visualizing Deep Neural Network Decisions: Prediction Difference Analysis [pdf] [code]
- Visualizing and Understanding Generative Adversarial Networks [pdf] [code] [website]
- ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness [pdf] [code]
- Deep Image Prior [pdf] [code] [code] [code] [website]
- How Do Vision Transformers Work? [pdf]
- Breaking Batch Normalization for better explainability of Deep Neural Networks through Layer-wise Relevance Propagation [pdf]
- Layer-wise Relevance Propagation for Neural Networks with Local Renormalization Layers [pdf]
- Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis [pdf]
- Explaining image classifiers by removing input features using generative models [pdf] [code]
- Do Vision Transformers See Like Convolutional Neural Networks? [pdf]
- Explaining Classifiers using Adversarial Perturbations on the Perceptual Ball [pdf]
- Explaining Knowledge Distillation by Quantifying the Knowledge [pdf]
- Interpreting Super-Resolution Networks with Local Attribution Maps [pdf]
- Is the deconvolution layer the same as a convolutional layer? [pdf]
- Towards Human-Understandable Visual Explanations: Imperceptible High-frequency Cues Can Better Be Removed [pdf]
- Gradient Inversion with Generative Image Prior [pdf] [code]
- Explaining Local, Global, And Higher-Order Interactions In Deep Learning [pdf]
- Pitfalls of Explainable ML: An Industry Perspective [pdf]
- Do Feature Attribution Methods Correctly Attribute Features? [pdf] [code]
- Look at the Variance! Efficient Black-box Explanations with Sobol-based Sensitivity Analysis [pdf] [code]
- What do neural networks learn in image classification? A frequency shortcut perspective [pdf]
- The effectiveness of feature attribution methods and its correlation with automatic evaluation scores [pdf]
- Interpreting Deep Neural Networks with Relative Sectional Propagation by Analyzing Comparative Gradients and Hostile Activations [pdf]
- The (Un)reliability of saliency methods [pdf]
- Explaining Convolutional Neural Networks through Attribution-Based Input Sampling and Block-Wise Feature Aggregation [pdf]
- Explainable Models with Consistent Interpretations [pdf] [code]
- Interpreting Multivariate Shapley Interactions in DNNs [pdf]
- Finding and Fixing Spurious Patterns with Explanations [pdf]
- Monitoring Shortcut Learning using Mutual Information [pdf]
- Dissecting Deep Learning Networks - Visualizing Mutual Information [pdf]
- Revisiting Backpropagation Saliency Methods [pdf]
- Towards Visually Explaining Variational Autoencoders [pdf] [code] [code] [video] [video]
- Understanding Integrated Gradients with SmoothTaylor for Deep Neural Network Attribution [pdf]
- Understanding Deep Networks via Extremal Perturbations and Smooth Masks [pdf] [code]
- Interpretable and Fine-Grained Visual Explanations for Convolutional Neural Networks [pdf]
- Towards Robust Interpretability with Self-Explaining Neural Networks [pdf]
- Influence-Directed Explanations for Deep Convolutional Networks [pdf]
- Interpretable Basis Decomposition for Visual Explanation [pdf] [code]
- Real Time Image Saliency for Black Box Classifiers [pdf]
- Bias Also Matters: Bias Attribution for Deep Neural Network Explanation [pdf]
- Understanding Impacts of High-Order Loss Approximations and Features in Deep Learning Interpretation [pdf]
- Distilling Critical Paths in Convolutional Neural Networks [pdf]
- Understanding intermediate layers using linear classifier probes [pdf]
- Neural Response Interpretation through the Lens of Critical Pathways [pdf] [code] [code]
- Interpret Neural Networks by Identifying Critical Data Routing Paths [pdf]
- Reconstructing Training Data from Trained Neural Networks [pdf] [website]
- Visualizing Deep Similarity Networks [pdf] [code]
- Improving Deep Learning Interpretability by Saliency Guided Training [pdf] [code]
- Understanding Prediction Discrepancies in Machine Learning Classifiers [pdf]
- Intriguing Properties of Vision Transformers [pdf] [code]
- From Clustering to Cluster Explanations via Neural Networks [pdf]
- Compositional Explanations of Neurons [pdf]
- What Does CNN Shift Invariance Look Like? A Visualization Study [pdf] [code] [project]
- Explainability Methods for Graph Convolutional Neural Networks [pdf] [code]
- What do Vision Transformers Learn? A Visual Exploration [pdf]
- Learning Accurate and Interpretable Decision Rule Sets from Neural Networks [pdf]
- Visual Explanation for Deep Metric Learning [pdf] [code]
- Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations [pdf]
- Understanding Black-box Predictions via Influence Functions [pdf] [code]
- Unmasking Clever Hans predictors and assessing what machines really learn [pdf]
- Explaining Deep Neural Networks with a Polynomial Time Algorithm for Shapley Values Approximation [pdf]
- Quantitative Evaluations on Saliency Methods: An Experimental Study [pdf]
- Metrics for saliency map evaluation of deep learning explanation methods [pdf]
- Neural Networks are Decision Trees [pdf]
- Towards Generating Human-Centered Saliency Maps without Sacrificing Accuracy [blog]
- Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data [pdf] [code] [code] [pip] [powerlaw]
- Exploring Explainability for Vision Transformers [blog] [code]
- Disentangled Explanations of Neural Network Predictions by Finding Relevant Subspaces [pdf]
- Are Transformers More Robust Than CNNs? [pdf] [code]
- Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers [pdf] [code]
- Explanatory Interactive Machine Learning [pdf]
- Toward Faithful Explanatory Active Learning with Self-explainable Neural Nets [pdf]
- Studying How to Efficiently and Effectively Guide Models with Explanations [pdf] [supp]
- Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision [pdf]
- Fixing Localization Errors to Improve Image Classification [pdf]
- Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias [pdf]
- On Guiding Visual Attention with Language Specification [pdf]
- Improving Interpretability via Regularization of Neural Activation Sensitivity [pdf]
- L1-Norm Gradient Penalty for Noise Reduction of Attribution Maps [pdf]
- Identifying Spurious Correlations and Correcting them with an Explanation-based Learning [pdf]
- Visual Attention Consistency under Image Transforms for Multi-Label Image Classification [pdf]
- Improving performance of deep learning models with axiomatic attribution priors and expected gradients [pdf]
- Fast Axiomatic Attribution for Neural Networks [pdf] [code]
- Detecting Statistical Interactions from Neural Network Weights [pdf]
- What I Cannot Predict, I Do Not Understand: A Human-Centered Evaluation Framework for Explainability Methods [pdf] [code] [blog]
- The Hidden Language of Diffusion Models [pdf] [code] [website]
- Investigating Vision Transformer representations [blog]
- Mean Attention Distance in Vision Transformers [pdf] [code]
- Interpreting Vision and Language Generative Models with Semantic Visual Priors [pdf]
- Learning Concise and Descriptive Attributes for Visual Recognition [pdf]
- Visual Classification via Description from Large Language Models [pdf] [code] [website]
- Representation Engineering: A Top-Down Approach to AI Transparency [pdf] [code] [website]
- Multimodal Neurons in Pretrained Text-Only Transformers [pdf] [website]
- Are Vision Language Models Texture or Shape Biased and Can We Steer Them? [pdf]
- Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation [pdf] [website]
- [OpenXAI] Towards a Transparent Evaluation of Model Explanations [pdf] [code] [website]
- [TracIn] Estimating Training Data Influence by Tracing Gradient Descent [pdf] [code] [code]
- [VoG] Estimating Example Difficulty using Variance of Gradients [pdf] [code] [project]
- [D-RISE] Black-box Explanation of Object Detectors via Saliency Maps [pdf]
- [SmoothGrad] Removing noise by adding noise [pdf]
- [Integrated Gradients] Axiomatic Attribution for Deep Networks [pdf] [code] [code] (see the IG/SmoothGrad sketch after this list)
- [BlurIG] Attribution in Scale and Space [pdf] [code]
- [IDGI] A Framework to Eliminate Explanation Noise from Integrated Gradients [pdf] [code]
- [GIG] Guided Integrated Gradients: an Adaptive Path Method for Removing Noise [pdf] [code]
- [SPI] Beyond Single Path Integrated Gradients for Reliable Input Attribution via Randomized Path Sampling [pdf] [supp]
- [IIA] Visual Explanations via Iterated Integrated Attributions [pdf] [supp] [code]
- [Integrated Hessians] Explaining Explanations: Axiomatic Feature Interactions for Deep Networks [pdf] [code]
- [Archipelago] How does this interaction affect me? Interpretable attribution for feature interactions [pdf] [code]
- [I-GOS] Visualizing Deep Networks by Optimizing with Integrated Gradients [pdf]
- [MoreauGrad] Sparse and Robust Interpretation of Neural Networks via Moreau Envelope [pdf] [code]
- [SAGs] One Explanation is Not Enough: Structured Attention Graphs for Image Classification [pdf]
- [LRP] On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation [pdf] [pdf] [pdf] [tutorial] [code] [code] [blog]
- [DeepDream] Inceptionism: Going Deeper into Neural Networks [blog] [code] [code] [code]
- [RISE] Randomized Input Sampling for Explanation of Black-box Models [pdf] [code] [website]
- [DeepLIFT] Learning Important Features Through Propagating Activation Differences [pdf] [video] [code]
- [ROAD] A Consistent and Efficient Evaluation Strategy for Attribution Methods [pdf] [code]
- [Layer Masking] Towards Improved Input Masking for Convolutional Neural Networks [pdf] [code]
- [Summit] Scaling Deep Learning Interpretability by Visualizing Activation and Attribution Summarizations [pdf]
- [SHAP] A Unified Approach to Interpreting Model Predictions [pdf] [code]
- [MM-SHAP] A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks [pdf] [code] [video]
- [Anchors] High-Precision Model-Agnostic Explanations [pdf] [code]
- [Layer Conductance] How Important Is a Neuron? [pdf] [pdf]
- [BiLRP] Building and Interpreting Deep Similarity Models [pdf] [code]
- [CGC] Consistent Explanations by Contrastive Learning [pdf] [code]
- [DeepInversion] Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion [pdf] [code]
- [GradInversion] See through Gradients: Image Batch Recovery via GradInversion [pdf]
- [GradViT] Gradient Inversion of Vision Transformers [pdf]
- [Plug-In Inversion] Model-Agnostic Inversion for Vision with Data Augmentations [pdf]
- [GIFD] A Generative Gradient Inversion Method with Feature Domain Optimization [pdf]
- [X-OIA] Explainable Object-induced Action Decision for Autonomous Vehicles [pdf] [code] [website]
- [CAT-XPLAIN] Causality for Inherently Explainable Transformers [pdf] [code]
- [CLRP] Understanding Individual Decisions of CNNs via Contrastive Backpropagation [pdf] [code]
- [HINT] Leveraging Explanations to Make Vision and Language Models More Grounded [pdf]
- [BagNet] Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet [pdf] [code] [blog]
- [SMERF] Sanity Simulations for Saliency Methods [pdf]
- [ELUDE] Generating interpretable explanations via a decomposition into labelled and unlabelled features [pdf]
- [C3LT] Cycle-Consistent Counterfactuals by Latent Transformations [pdf]
- [B-cos] Alignment is All We Need for Interpretability [pdf] [code]
- [ShapNets] Shapley Explanation Networks [pdf] [code]
- [CALM] Keep CALM and Improve Visual Feature Attribution [pdf] [code]
- [SGLRP] Explaining Convolutional Neural Networks using Softmax Gradient Layer-wise Relevance Propagation [pdf]
- [DTD] Explaining NonLinear Classification Decisions with Deep Taylor Decomposition [pdf] [code]
- [GradCAT] Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding [pdf]
- [FastSHAP] Real-Time Shapley Value Estimation [pdf] [code]
- [VisualBackProp] Efficient visualization of CNNs [pdf]
- [NBDT] Neural-Backed Decision Trees [pdf] [code]
- [XRAI] Better Attributions Through Regions [pdf]
- [MeGe, ReCo] How Good is your Explanation? Algorithmic Stability Measures to Assess the Quality of Explanations for Deep Neural Networks [pdf]
- [FCDD] Explainable Deep One-Class Classification [pdf] [code]
- [DiCE] Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations [pdf] [code] [blog]
- [ARM] Blending Anti-Aliasing into Vision Transformer [pdf] [code]
- [RelEx] Building Reliable Explanations of Unreliable Neural Networks: Locally Smoothing Perspective of Model Interpretation [pdf] [code]
- [X-Pruner] eXplainable Pruning for Vision Transformers [pdf] [code]
- [ShearletX] Explaining Image Classifiers with Multiscale Directional Image Representation [pdf]
- [MACO] Unlocking Feature Visualization for Deeper Networks with MAgnitude Constrained Optimization [pdf] [website]
- [Guided Zoom] Questioning Network Evidence for Fine-Grained Classification [pdf] [pdf] [code]
- [DAAM] Interpreting Stable Diffusion Using Cross Attention [pdf] [code] [demo]
- [Diffusion Explainer] Visual Explanation for Text-to-image Stable Diffusion [pdf] [website] [video]
- [ECLIP] Exploring Visual Explanations for Contrastive Language-Image Pre-training [pdf]
- [CNC] Correct-N-Contrast: A Contrastive Approach for Improving Robustness to Spurious Correlations [pdf]
- [AMC] Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations [pdf]
- [ClickMe] Learning what and where to attend [pdf]
- [MaskTune] Mitigating Spurious Correlations by Forcing to Explore [pdf]
- [CoDA-Nets] Convolutional Dynamic Alignment Networks for Interpretable Classifications [pdf]
- [ABN] Attention Branch Network: Learning of Attention Mechanism for Visual Explanation [pdf] [pdf]
- [RES] A Robust Framework for Guiding Visual Explanation [pdf]
- [IAA] Aligning Eyes between Humans and Deep Neural Network through Interactive Attention Alignment [pdf]
- [DiFull] Towards Better Understanding Attribution Methods [pdf] [code]
- [AttentionViz] A Global View of Transformer Attention [pdf]
- [Rosetta Neurons] Mining the Common Units in a Model Zoo [pdf] [code] [website]
- [SAFARI] Versatile and Efficient Evaluations for Robustness of Interpretability [pdf]
- [LANCE] Stress-testing Visual Models by Generating Language-guided Counterfactual Images [pdf] [code] [website]
- [FunnyBirds] A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods [pdf] [code]
- [MAGI] Multi-Annotated Explanation-Guided Learning [pdf]
- [CCE] Towards Visual Contrastive Explanations for Neural Networks [pdf]
- [CNN Filter DB] An Empirical Investigation of Trained Convolutional Filters [pdf] [code]
- [VLSlice] Interactive Vision-and-Language Slice Discovery [pdf] [code] [website] [demo] [video] [video]
- [Feature Sieve] Overcoming Simplicity Bias in Deep Networks using a Feature Sieve [pdf] [blog]
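
Many of the attribution entries above reduce to a few lines of autograd. As a reference point, here is a minimal sketch of [Integrated Gradients] and [SmoothGrad] in plain PyTorch; the all-zero baseline, step count, and noise level are illustrative assumptions, not the papers' official settings:

```python
import torch

def integrated_gradients(model, x, target, baseline=None, steps=50):
    # Average gradients along the straight path from baseline to input,
    # then scale by (input - baseline). x is a single-example batch (1, ...).
    if baseline is None:
        baseline = torch.zeros_like(x)  # assumed all-zero baseline
    grads = torch.zeros_like(x)
    for alpha in torch.linspace(0.0, 1.0, steps):
        point = (baseline + alpha * (x - baseline)).requires_grad_(True)
        model(point)[0, target].backward()  # scalar logit of the target class
        grads += point.grad
    return (x - baseline) * grads / steps

def smoothgrad(model, x, target, noise_std=0.15, samples=25):
    # Average vanilla gradients over noisy copies of the input.
    grads = torch.zeros_like(x)
    for _ in range(samples):
        noisy = (x + noise_std * torch.randn_like(x)).requires_grad_(True)
        model(noisy)[0, target].backward()
        grads += noisy.grad
    return grads / samples
```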
- [CAM] Learning Deep Features for Discriminative Localization [pdf]
- [Grad-CAM] Visual Explanations from Deep Networks via Gradient-based Localization [pdf] [code] [code] [website] (see the sketch after this list)
- [Grad-CAM++] Improved Visual Explanations for Deep Convolutional Networks [pdf] [code]
- [Score-CAM] Score-Weighted Visual Explanations for Convolutional Neural Networks [pdf] [code] [code]
- [LayerCAM] Exploring Hierarchical Class Activation Maps for Localization [pdf] [code]
- [Eigen-CAM] Class Activation Map using Principal Components [pdf]
- [XGrad-CAM] Axiom-based Grad-CAM: Towards Accurate Visualization and Explanation of CNNs [pdf] [code]
- [Ablation-CAM] Visual Explanations for Deep Convolutional Network via Gradient-free Localization [pdf]
- [Group-CAM] Group Score-Weighted Visual Explanations for Deep Convolutional Networks [pdf] [code]
- [FullGrad] Full-Gradient Representation for Neural Network Visualization [pdf]
- [Relevance-CAM] Your Model Already Knows Where to Look [pdf] [code]
- [Poly-CAM] High resolution class activation map for convolutional neural networks [pdf] [code]
- [Smooth Grad-CAM++] An Enhanced Inference Level Visualization Technique for Deep Convolutional Neural Network Models [pdf] [code]
- [Zoom-CAM] Generating Fine-grained Pixel Annotations from Image Labels [pdf]
- [FD-CAM] Improving Faithfulness and Discriminability of Visual Explanation for CNNs [pdf] [code]
- [LIFT-CAM] Towards Better Explanations of Class Activation Mapping [pdf]
- [Shap-CAM] Visual Explanations for Convolutional Neural Networks based on Shapley Value [pdf]
- [HiResCAM] Use HiResCAM instead of Grad-CAM for faithful explanations of convolutional neural networks [pdf]
- [FAM] Visual Explanations for the Feature Representations from Deep Convolutional Networks [pdf]
- [MinMaxCAM] Improving object coverage for CAM-based Weakly Supervised Object Localization [pdf]
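
The CAM variants above share one mechanism: weight the feature maps of a late convolutional layer by class-specific importance scores and sum them. A minimal Grad-CAM sketch using hooks; the ResNet-50 backbone and the `layer4` target layer are illustrative assumptions (for a maintained implementation, see the PyTorch Grad-CAM library in the tools section):

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT).eval()
acts, grads = {}, {}

# Record the target layer's activations (forward) and gradients (backward).
model.layer4.register_forward_hook(lambda m, i, o: acts.update(a=o))
model.layer4.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

def grad_cam(x, target_class):
    model.zero_grad()
    model(x)[0, target_class].backward()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted sum of maps
    cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0, 0]
    return (cam / (cam.max() + 1e-8)).detach()           # heatmap in [0, 1]
```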
- [LIME] "Why Should I Trust You?": Explaining the Predictions of Any Classifier [pdf] [code] (usage sketch after this group)
- [InteractionLIME] Model-Agnostic Visual Explanations via Approximate Bilinear Models [pdf]
- [NormLime] A New Feature Importance Metric for Explaining Deep Neural Networks [pdf]
- [GALE] Global Aggregations of Local Explanations for Black Box models [pdf]
- [D-LIME] A Deterministic Local Interpretable Model-Agnostic Explanations Approach for Computer-Aided Diagnosis Systems [pdf]
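
For orientation, typical usage of the `lime` package for the [LIME] entry above looks roughly like this; the dummy classifier, random image, and sample counts are placeholders to keep the sketch self-contained:

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def classifier_fn(images):
    # Batch of HxWx3 arrays -> class probabilities; wrap your model here.
    # This stand-in returns uniform probabilities over 10 classes.
    return np.full((len(images), 10), 0.1)

image = np.random.rand(224, 224, 3)
explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(image, classifier_fn,
                                         top_labels=3, num_samples=200)
img, mask = explanation.get_image_and_mask(explanation.top_labels[0],
                                           positive_only=True, num_features=5)
overlay = mark_boundaries(img, mask)  # superpixel outline for display
```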
- [CBM] Concept Bottleneck Models [pdf] [code] (see the sketch after this group)
- [Label-free CBM] Label-Free Concept Bottleneck Models [pdf] [code]
- [PCBMs] Post-hoc Concept Bottleneck Models [pdf] [code]
- [CDM] Sparse Linear Concept Discovery Models [pdf] [code]
- [BotCL] Learning Bottleneck Concepts in Image Classification [pdf] [code]
- [LaBo] Language Model Guided Concept Bottlenecks for Interpretable Image Classification [pdf] [code]
- [CompMap] Do Vision-Language Pretrained Models Learn Composable Primitive Concepts? [pdf] [code] [website]
- [FVLC] Faithful Vision-Language Interpretation via Concept Bottleneck Models [pdf]
- [DN-CBM] Discover-then-Name: Task-Agnostic Concept Bottlenecks via Automated Concept Discovery [pdf] [code]
- [ECBMs] Energy-Based Concept Bottleneck Models: Unifying Prediction, Concept Intervention, and Probabilistic Interpretations [pdf] [code]
- Promises and Pitfalls of Black-Box Concept Learning Models [pdf]
- Do Concept Bottleneck Models Learn as Intended? [pdf]
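
The concept-bottleneck entries above share a two-stage structure: the input is first mapped to human-interpretable concept scores, and the label is predicted from those scores alone, which also enables test-time concept intervention. A minimal sketch; the backbone, dimensions, and the NaN-masked intervention interface are assumptions of this sketch, not any paper's API:

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    # x -> concepts -> y; the label head sees only concept scores, so each
    # prediction can be inspected and edited concept-by-concept.
    def __init__(self, backbone, feat_dim=512, n_concepts=112, n_classes=200):
        super().__init__()
        self.backbone = backbone                      # any feature extractor
        self.concept_head = nn.Linear(feat_dim, n_concepts)
        self.label_head = nn.Linear(n_concepts, n_classes)

    def forward(self, x, intervention=None):
        c = torch.sigmoid(self.concept_head(self.backbone(x)))
        if intervention is not None:  # NaN = keep prediction, else override
            c = torch.where(intervention.isnan(), c, intervention)
        return c, self.label_head(c)
```

Training then combines a concept loss (e.g. BCE against concept annotations) with the usual label loss.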
- [Network Dissection] Quantifying Interpretability of Deep Visual Representations [pdf] [code] [website]
- [CLIP-Dissect] Automatic Description of Neuron Representations in Deep Vision Networks [pdf] [code]
- [Net2Vec] Quantifying and Explaining how Concepts are Encoded by Filters in Deep Neural Networks [pdf]
- [MILAN] Natural Language Descriptions of Deep Visual Features [pdf] [code] [website]
- [INViTE] INterpret and Control Vision-Language Models with Text Explanations [pdf] [code]
- [CLIP-Decomposition] Interpreting CLIP's Image Representation via Text-Based Decomposition [pdf] [code] [website]
- [Second-Order CLIP-Decomposition] Interpreting the Second-Order Effects of Neurons in CLIP [pdf] [code] [website]
- [ZS-A2T] Zero-shot Translation of Attention Patterns in VQA Models to Natural Language [pdf] [code]
- [FALCON] Identifying Interpretable Subspaces in Image Representations [pdf] [code]
- [STAIR] Learning Sparse Text and Image Representation in Grounded Tokens [pdf]
- [DISCOVER] Making Vision Networks Interpretable via Competition and Dissection [pdf]
- [DeViL] Decoding Vision features into Language [pdf] [code]
- [LaViSE] Explaining Deep Convolutional Neural Networks via Latent Visual-Semantic Filter Attention [pdf] [code]
- [ProtoTrees] Neural Prototype Trees for Interpretable Fine-grained Image Recognition [pdf] [code]
- [ProtoPNet] This Looks Like That: Deep Learning for Interpretable Image Recognition [pdf] [code]
- [ST-ProtoPNet] Learning Support and Trivial Prototypes for Interpretable Image Classification [pdf]
- [Deformable ProtoPNet] An Interpretable Image Classifier Using Deformable Prototypes [pdf]
- [SPARROW] Semantically Coherent Prototypes for Image Classification [pdf]
- [Proto2Proto] Can you recognize the car, the way I do? [pdf] [code]
- [PDiscoNet] Semantically consistent part discovery for fine-grained recognition [pdf] [code]
- [ProtoPool] Interpretable Image Classification with Differentiable Prototypes Assignment [pdf] [code]
- [ProtoPShare] Prototype Sharing for Interpretable Image Classification and Similarity Discovery [pdf]
- [PW-Net] Towards Interpretable Deep Reinforcement Learning with Human-Friendly Prototypes [pdf] [code]
- [ProtoPDebug] Concept-level Debugging of Part-Prototype Networks [pdf]
- [DSX] Describe, Spot and Explain: Interpretable Representation Learning for Discriminative Visual Reasoning [pdf]
- [HINT] Hierarchical Neuron Concept Explainer [pdf] [code]
- [ConceptSHAP] On Completeness-aware Concept-Based Explanations in Deep Neural Networks [pdf] [code]
- [CW] Concept Whitening for Interpretable Image Recognition [pdf]
- [VRX] Interpreting with Structural Visual Concepts [pdf]
- [MOCE] Extracting Model-Oriented Concepts for Explaining Deep Neural Networks [pdf] [code]
- [ConceptExplainer] Interactive Explanation for Deep Neural Networks from a Concept Perspective [pdf]
- [ProtoSim] Prototype-based Dataset Comparison [pdf] [code] [website]
- [TCAV] Quantitative Testing with Concept Activation Vectors [pdf] [code] [book chapter] (see the sketch below)
- [SACV] Hidden Layer Interpretation with Spatial Activation Concept Vector [pdf] [code]
- [ACE] Towards Automatic Concept-based Explanations [pdf] [code]
- [DFF] Deep Feature Factorization For Concept Discovery [pdf] [code] [code] [blog and code]
- [CRP] From “Where” to “What”: Towards Human-Understandable Explanations through Concept Relevance Propagation [pdf] [code]
- [FeatUp] A Model-Agnostic Framework for Features at Any Resolution [pdf] [code] [colab] [website] [demo]
- [LENS] A Holistic Approach to Unifying Automatic Concept Extraction and Concept Importance Estimation [pdf] [website]
- [CRAFT] Concept Recursive Activation FacTorization for Explainability [pdf] [code] [website]
- Deep ViT Features as Dense Visual Descriptors [pdf] [supp] [code] [website]
- Evaluation and Improvement of Interpretability for Self-Explainable Part-Prototype Networks [pdf] [code]
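
Most concept-probing methods above start from the same primitive as [TCAV]: fit a linear direction in a layer's activation space that separates concept examples from random ones, then test how the class logit changes along that direction. A hedged sketch (activation extraction is omitted; the sign-counting score follows the TCAV paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def concept_activation_vector(acts_concept, acts_random):
    # Fit a linear probe; its (normalized) normal vector is the CAV.
    X = np.concatenate([acts_concept, acts_random])
    y = np.concatenate([np.ones(len(acts_concept)), np.zeros(len(acts_random))])
    v = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
    return v / np.linalg.norm(v)

def tcav_score(logit_grads, cav):
    # logit_grads: (N, D) gradients of the class logit w.r.t. the layer's
    # activations; score = fraction with a positive directional derivative.
    return float((logit_grads @ cav > 0).mean())
```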
- Distill
- Multimodal Neurons in Artificial Neural Networks [paper] [blog] [code]
- The Building Blocks of Interpretability [paper]
- Visualizing the Impact of Feature Attribution Baselines [paper]
- An Overview of Early Vision in InceptionV1 [paper]
- Feature Visualization [paper] (see the sketch after this group)
- Differentiable Image Parameterizations [paper]
- Deconvolution and Checkerboard Artifacts [paper]
- Visualizing memorization in RNNs [paper]
- Exploring Neural Networks with Activation Atlases [paper]
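
The Distill articles above revolve around feature visualization by optimization: start from noise and gradient-ascend the image toward high activation of a chosen unit. A bare-bones sketch that deliberately omits the regularizers (transformation robustness, frequency-space parameterization) the articles show are essential for clean images; the layer, channel, and step count are arbitrary choices:

```python
import torch
from torchvision.models import googlenet, GoogLeNet_Weights

model = googlenet(weights=GoogLeNet_Weights.DEFAULT).eval()
acts = {}
model.inception4a.register_forward_hook(lambda m, i, o: acts.update(a=o))

img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)
for _ in range(256):
    opt.zero_grad()
    model(img)
    (-acts["a"][0, 42].mean()).backward()  # maximize channel 42's activation
    opt.step()
```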
- High Fidelity Visualization of What Your Self-Supervised Representation Knows About [pdf]
- How Well Do Self-Supervised Models Transfer? [pdf] [code]
- A critical analysis of self-supervision, or what we can learn from a single image [pdf] [code] [video]
- How transferable are features in deep neural networks? [pdf]
- Understanding the Role of Self-Supervised Learning in Out-of-Distribution Detection Task [pdf]
- Beyond Supervised vs. Unsupervised: Representative Benchmarking and Analysis of Image Representation Learning [pdf] [code] [website]
- Revealing the Dark Secrets of Masked Image Modeling [pdf]
- Visualization of Supervised and Self-Supervised Neural Networks via Attribution Guided Factorization [pdf] [code]
- Understanding Failure Modes of Self-Supervised Learning [pdf]
- Explaining Self-Supervised Image Representations with Visual Probing [pdf] [pdf] [code]
- Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations [pdf]
- What Happens to the Source Domain in Transfer Learning? [pdf]
- Overwriting Pretrained Bias with Finetuning Data [pdf]
- Exploring Model Transferability through the Lens of Potential Energy [pdf] [code]
- How Far Pre-trained Models Are from Neural Collapse on the Target Dataset Informs their Transferability [pdf] [supp]
- What Contrastive Learning Learns Beyond Class-wise Features? [pdf]
- Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [pdf]
- What makes instance discrimination good for transfer learning? [pdf] [website]
- Revisiting the Transferability of Supervised Pretraining: an MLP Perspective [pdf]
- Intriguing Properties of Contrastive Losses [pdf] [code]
- When Does Contrastive Visual Representation Learning Work? [pdf]
- What Makes for Good Views for Contrastive Learning? [pdf] [code]
- What Should Not Be Contrastive in Contrastive Learning [pdf]
- Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases [pdf]
- Are all negatives created equal in contrastive instance discrimination? [pdf]
- Improving Pixel-based MIM by Reducing Wasted Modeling Capability [pdf] [code]
- On Pretraining Data Diversity for Self-Supervised Learning [pdf] [code]
- Circuits [series]
- Transformer Circuits [series]
- Progress measures for grokking via mechanistic interpretability [pdf]
- Circuit Component Reuse Across Tasks in Transformer Language Models [pdf]
- Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks [pdf]
- TransformerLens
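
TransformerLens (above) is the common tooling behind much of this circuits-style work. A typical cache-and-inspect pattern, assuming the current API (check the library's docs):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in")
logits, cache = model.run_with_cache(tokens)

# Layer-0 attention patterns: (batch, head, query_pos, key_pos)
attn = cache["pattern", 0]
print(attn.shape, model.to_str_tokens(tokens))
```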
- [GVE] Generating visual explanations [pdf]
- [PJ-X] Multimodal Explanations: Justifying Decisions and Pointing to the Evidence [pdf] [code]
- [FME] Faithful Multimodal Explanation for Visual Question Answering [pdf]
- [RVT] Natural Language Rationales with Full-Stack Visual Reasoning [pdf] [code]
- [e-UG] e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks [pdf] [code]
- [NLX-GPT] A Model for Natural Language Explanations in Vision and Vision-Language Tasks [pdf] [code]
- [Uni-NLX] Unifying Textual Explanations for Vision and Vision-Language Tasks [pdf] [code]
- [Explain Yourself] Leveraging Language Models for Commonsense Reasoning [pdf]
- [e-SNLI] Natural Language Inference with Natural Language Explanations [pdf]
- [CLEVR-X] A Visual Reasoning Dataset for Natural Language Explanations [pdf] [code] [website]
- [VQA-E] Explaining, Elaborating, and Enhancing Your Answers for Visual Questions [pdf]
- [PtE] Are Training Resources Insufficient? Predict First Then Explain! [pdf]
- [WT5] Training Text-to-Text Models to Explain their Predictions [pdf]
- [RExC] Knowledge-Grounded Self-Rationalization via Extractive and Natural Language Explanations [pdf] [code]
- [ELV] Towards Interpretable Natural Language Understanding with Explanations as Latent Variables [pdf] [code]
- [FEB] Few-Shot Self-Rationalization with Natural Language Prompts [pdf]
- [CALeC] Chunk-aware Alignment and Lexical Constraint for Visual Entailment with Natural Language Explanations [pdf]
- [OFA-X] Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations [pdf] [code]
- [S3C] Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning [pdf]
- [ReVisE] A Recursive Approach Towards Vision-Language Explanation [pdf] [code]
- [Multimodal-CoT] Multimodal Chain-of-Thought Reasoning in Language Models [pdf] [code]
- [CCoT] Compositional Chain-of-Thought Prompting for Large Multimodal Models [pdf]
- Grounding Visual Explanations [pdf]
- Textual Explanations for Self-Driving Vehicles [pdf] [code]
- Measuring Association Between Labels and Free-Text Rationales [pdf] [code]
- Reframing Human-AI Collaboration for Generating Free-Text Explanations [pdf]
- Few-Shot Out-of-Domain Transfer Learning of Natural Language Explanations [pdf]
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned [pdf] [code]
- Quantifying Attention Flow in Transformers [pdf] (rollout sketch at the end of this section)
- Locating and Editing Factual Associations in GPT [pdf] [code] [colab] [colab] [video] [website]
- Visualizing and Understanding Neural Machine Translation [pdf]
- Transformer Feed-Forward Layers Are Key-Value Memories [pdf]
- A Diagnostic Study of Explainability Techniques for Text Classification [pdf] [code]
- A Survey of the State of Explainable AI for Natural Language Processing [pdf]
- How do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking [pdf] [code]
- Why use attention as explanation when we have saliency methods? [pdf]
- Attention is Not Only a Weight: Analyzing Transformers with Vector Norms [pdf]
- Attention is not Explanation [pdf]
- Attention is not not Explanation [pdf]
- Analyzing Individual Neurons in Pre-trained Language Models [pdf]
- Identifying and Controlling Important Neurons in Neural Machine Translation [pdf]
- “Will You Find These Shortcuts?” A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification [pdf] [blog]
- Interpreting Language Models with Contrastive Explanations [pdf] [code]
- Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers [pdf] [pdf] [code]
- Discretized Integrated Gradients for Explaining Language Models [pdf] [code]
- Did the Model Understand the Question? [pdf]
- Explaining Compositional Semantics for Neural Sequence Models [pdf] [code]
- Fooling Explanations in Text Classifiers [pdf]
- Interpreting GPT: The Logit Lens [blog]
- A Circuit for Indirect Object Identification in GPT-2 small [pdf]
- Inside BERT, from the BERT-related-papers GitHub repo [link]
- Massive Activations in Large Language Models [pdf] [code] [website]
- Language Models Represent Space and Time [pdf] [code]
- Are self-explanations from Large Language Models faithful? [pdf]
- Awesome LLM Interpretability
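
Several entries above analyze raw attention weights; Quantifying Attention Flow in Transformers motivates attention rollout instead: average heads, mix in the residual identity, renormalize, and chain the matrices across layers. A minimal sketch; the 0.5/0.5 residual weighting is one common convention, assumed here, and the input is the per-layer attention list that `transformers` models return with `output_attentions=True`:

```python
import torch

def attention_rollout(attentions):
    # attentions: list of (batch, heads, seq, seq) tensors, one per layer.
    # Returns (batch, seq, seq) attributions from output to input positions.
    rollout = None
    for layer_attn in attentions:
        a = layer_attn.mean(dim=1)                       # average over heads
        a = 0.5 * a + 0.5 * torch.eye(a.size(-1))        # residual connection
        a = a / a.sum(dim=-1, keepdim=True)              # renormalize rows
        rollout = a if rollout is None else a @ rollout  # chain layers
    return rollout
```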
- Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications [pdf]
- Benchmarking and Survey of Explanation Methods for Black Box Models [pdf]
- An Empirical Study of Deep Neural Network Explanation Methods [pdf] [code]
- Methods for Interpreting and Understanding Deep Neural Networks [pdf]
- From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI [pdf]
- Leveraging Explanations in Interactive Machine Learning: An Overview [pdf]
- [SLOT-Attention] Object-Centric Learning with Slot Attention [pdf] [code] [code]
- [SCOUTER] Slot Attention-based Classifier for Explainable Image Recognition [pdf] [code]
- [SPOT] Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers [pdf] [code]
- Captum (usage sketch after this list)
- PyTorch Grad-CAM [github] [docs]
- Lucid [tensorflow] [pytorch]
- Zennit [github] [docs] [paper]
- TorchCAM [github] [docs] [demo]
- pytorch-cnn-visualizations
- VL-InterpreT [pdf] [github] [demo] [video]
- DeepExplain
- TorchRay [github] [docs]
- grad-cam-pytorch
- ViT-Prisma
- CLIP Explainability
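
Captum (top of this list) exposes many of the attribution methods from the papers section behind a single attribute() interface. A short, self-contained usage sketch; the untrained ResNet and random input are stand-ins:

```python
import torch
from torchvision.models import resnet18
from captum.attr import IntegratedGradients, NoiseTunnel

model = resnet18(weights=None).eval()      # stand-in classifier
inputs = torch.randn(1, 3, 224, 224)       # stand-in image batch
ig = IntegratedGradients(model)
attr = ig.attribute(inputs, target=0, n_steps=50)

# SmoothGrad-style averaging wrapped around any Captum attribution method
nt = NoiseTunnel(ig)
smooth = nt.attribute(inputs, target=0, nt_type="smoothgrad", nt_samples=25)
print(attr.shape, smooth.shape)            # both match the input shape
```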
- Explainable AI: Interpreting, Explaining and Visualizing Deep Learning
- Interpretable Machine Learning
- Transformer Circuits
- OpenAI Microscope
- Summary - Captum
- Alibi Docs
- jacobgil blogs
- Stanford CS231n slides
- TU Berlin Notes
- Tutorial Notebooks
- NPTEL-NOC IITM Videos [Early Methods] [Visualization Methods] [CAM Methods] [Recent Methods] [Beyond Explaining]
- AI Explained Video Series by Fiddler AI
- XAI Explained Video Series by DeepFindr
- Visualizing and Understanding lecture video (Stanford CS231n)
- CVPR 2021 Tutorial
- CVPR 2023 Tutorial
- CS231n Assignments Solutions
- Filter and Feature Maps Visualization [blog] [blog] [blog] [pytorch discuss]
- Hooks in PyTorch [tutorial] [tutorial] [tutorial] [tutorial] (see the sketch below)
- Feature Extraction using Torch FX [tutorial]
- Feature extraction for model inspection [tutorial]
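
The hook and feature-extraction tutorials above boil down to two patterns: register a forward hook yourself, or let torchvision's FX-based create_feature_extractor rewrite the model to return named intermediate nodes. A combined sketch; the layer3 node choice is arbitrary (use get_graph_node_names to list valid names):

```python
import torch
from torchvision.models import resnet18
from torchvision.models.feature_extraction import (
    create_feature_extractor, get_graph_node_names)

model = resnet18(weights=None).eval()

# Pattern 1: a forward hook stores a module's output as the model runs.
feats = {}
handle = model.layer3.register_forward_hook(
    lambda mod, inp, out: feats.update(layer3=out.detach()))
model(torch.randn(1, 3, 224, 224))
handle.remove()

# Pattern 2: torch FX rewrites the model to return named intermediate nodes.
print(get_graph_node_names(model)[1][:5])        # eval-mode node names
extractor = create_feature_extractor(model, return_nodes={"layer3": "feat"})
out = extractor(torch.randn(1, 3, 224, 224))
print(feats["layer3"].shape, out["feat"].shape)  # both (1, 256, 14, 14)
```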