The following peer review was solicited as part of the Distill review process.
The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer for taking the time to review this article.
Conflicts of Interest: Reviewer disclosed a minor potential conflict of interest that the editors did not view as a substantive concern.
Section Summaries and Comments
This submission uses the core interpretability technique proposed in Feature Visualization (Olah et al.) to address core concerns in interpretability (visualization, attribution, and summarizing information to human-absorbable amounts).
The article first motivates the need to understand the network through canonical examples (in this case, visualizations of the concepts that maximally activate neurons).
One comment on this: while using visual examples definitely gives a finer degree of granularity than lumping several abstractions together under one description (e.g. “floppy ear detectors”), it seems likely that any visual concept the network reliably learns but that is unfamiliar to humans will still be overlooked, even with canonical examples. (This would be an instance of the human scale issue.) One way to study this further (probably in future work) might be to extract semantic concepts that are repeatedly learned, and thus identify important concepts that aren’t immediately human-recognizable.
The next section introduces a way to perform feature visualization at every spatial location across all channels. The main contribution of this section is the layer-by-layer visualization of different concept detectors (along with the nice follow-on image of concepts scaled by activation magnitude). It might help slightly to have a sentence introducing this at the start of the section. The visualizations are fantastic, and really give a sense of how the model comes to its decision. I’m very excited to see such visualizations applied to other image datasets (particularly those where we might use ML to make scientific discoveries), as well as to other domains (e.g. language).
Perhaps this is a personal preference, but when the authors describe the optimization procedure, I would have really liked to see a simple equation or two (maybe even as an aside), so that the mathematically minded readers can have a (simplified) mathematical description. Several parts were unclear to me: e.g. do you initialize with the image and then optimize to maximize the sum of the activations over all channels? Or something else?
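To make the request concrete, even a single equation of the following form would help. This is my reading of the procedure, not necessarily the authors' exact setup:

```latex
x^{\star} \;=\; \arg\max_{x} \sum_{h,w,c} A^{(l)}_{h,w,c}(x)
```

where \(A^{(l)}_{h,w,c}(x)\) denotes the activation of channel \(c\) at spatial position \((h,w)\) of layer \(l\) on input \(x\), and the optimization starts from some initialization (the original image, or noise). Confirming or correcting an objective of this form would resolve the ambiguity.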
Some mathematical description is also something I would have liked to see in the Human Scale section when describing the matrix factorization problem. (There are links to colab notebooks, which are an excellent resource, but it would be nice to have an equation or two instead of having to look through the code to work this out.)
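For instance, the factorization step could plausibly be sketched as follows. This is a toy stand-in I wrote while reading the notebooks; the shapes and the plain multiplicative-update NMF are illustrative assumptions, not the authors' exact code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a layer's activations, flattened to
# (height * width) spatial positions x channels. We construct it as an
# exact nonnegative rank-4 product so the factorization can recover it.
A = rng.random((64, 4)) @ rng.random((4, 32))

k = 4  # number of factors ("neuron groups")

# Nonnegative matrix factorization A ~= W @ H via classic multiplicative
# updates (Lee & Seung). W: spatial mixing weights, H: channel directions.
W = rng.random((A.shape[0], k))
H = rng.random((k, A.shape[1]))
eps = 1e-9  # guards against division by zero

for _ in range(200):
    H *= (W.T @ A) / (W.T @ W @ H + eps)
    W *= (A @ H.T) / (W @ H @ H.T + eps)

err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
print(f"relative reconstruction error: {err:.3f}")
```

An equation stating which matrix is factored, and what W and H are taken to mean, would make the section self-contained.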
The next section presents what is arguably the most important visualization of the article: seeing how concepts at different layers influence, and are influenced by, concepts in other layers. This is the first visualization of its kind that I’ve seen, and it could be extremely valuable in helping determine failure cases of neural networks, or in assessing how robust they are. It would be fascinating to see this diagram for adversarial images, for example. One follow-up question: there has been work suggesting that saliency maps are not entirely reliable (and this is indeed discussed later in the article). Is this kind of influence visualization robust to the choice of attribution method? (Perhaps other attribution methods could even be built into the model, e.g. attention.)
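To make the “swap the attribution method” question concrete, here is a minimal toy comparison of two attribution rules on a tiny piecewise-linear model. This sketch is entirely my own, not the article's method:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy two-layer piecewise-linear model: f(x) = v . relu(W x), no biases.
W = rng.normal(size=(4, 3))
v = rng.normal(size=4)
x = rng.normal(size=3)

h = np.maximum(W @ x, 0.0)  # hidden activations
f = v @ h                   # scalar output

# Attribution method 1: gradient * input. For a bias-free piecewise-linear
# model this decomposes f exactly, so the attributions sum to f.
grad = W.T @ (v * (h > 0))  # relu gate selects the active units
attr_grad = grad * x

# Attribution method 2: occlusion -- zero out each input feature and
# measure the drop in the output. This generally does NOT sum to f.
attr_occl = np.array([
    f - v @ np.maximum(W @ np.where(np.arange(3) == i, 0.0, x), 0.0)
    for i in range(3)
])
```

Even on this toy model the two rules can disagree, which is exactly why it would be interesting to know how stable the influence diagrams are under such swaps.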
The authors then overview an attempt to categorize and summarize the interpretability conclusions being presented using matrix factorization. As mentioned before, it would have been nice to see some mathematical descriptions of the setup. It is good to see the human scale problem being addressed but the results here appear to be preliminary, and there is likely much more work to be done in the future. One related direction is the following: in this article, it appears that the “semantic dictionary” is built up from the output classes and humans looking at some of the feature detectors and labelling them, e.g. “floppy ears”. Have the authors considered methods to automatically cluster together similar concepts? (E.g. first collecting a set of canonical examples corresponding to a certain class, and then applying clustering to see what natural groups they fall into?)
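The clustering suggestion could start from something as simple as k-means over whatever feature vectors the canonical examples produce. A deliberately tiny sketch — the synthetic data and plain k-means are my stand-ins, not anything from the article:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for feature vectors of canonical examples drawn from
# two loose "concepts" (in practice these might be hidden-layer activations).
X = np.vstack([
    rng.normal(0.0, 0.3, size=(50, 8)),
    rng.normal(2.0, 0.3, size=(50, 8)),
])

def kmeans(X, k, iters=25):
    """Plain k-means; initialized deterministically from spread-out points."""
    centroids = X[:: len(X) // k][:k].copy()
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(X, k=2)
```

The interesting question is then whether the recovered groups line up with human-nameable concepts (“floppy ears”) or reveal ones we lack names for.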
This is an excellent submission studying how to develop tools to better interpret networks. To me, the two most exciting contributions were the layer-by-layer canonical example visualizations and the influence diagrams showing how different features influence each other across layers. I’m very excited to see how these can be applied across different domains to interpret and learn from the neural networks we train.
Thank you for the in-depth and thoughtful review. We're glad that you found our work exciting! We've responded to points inline below.
We agree! The factorization approach we used is only one from a larger family of techniques, and we’re excited about future research that explores the impact of using these other techniques. We’ve made this clearer with 790f4c9.
We want to keep the focus of the article on how the various interpretability building blocks come together to form rich user interfaces. The challenge is that digging into the details of each of the individual building blocks involves many moving parts (e.g., with feature visualization, we would need to unpack initialization, parameterization, optimization objective, procedure, etc.; with matrix factorization, it’s unclear to us how a more formal mathematical description would amount to anything other than restating the definition of NMF).
With that said, we’ve improved our prose description of our approaches in 3c4312b, and added a link to our previous feature visualization article (which discusses our specific approach at length) in 90979c4.
Our interfaces reify attribution methods, so we believe that they will be as robust as the attribution methods they rely on. We can use the same interface but swap out the underlying attribution method used, and we think doing so (and evaluating the resultant interfaces) is a promising direction for future work.
We agree, and have clarified this as of 790f4c9.
Yes! This is one of the things we are actively researching, but unfortunately don’t have conclusive results yet to include in the article :)
We’ve improved our annotations and controls for all diagrams (e.g., with 5f9cebb, we’ve added new zooming controls for this particular interface).