Review #2
The following peer review was solicited as part of the Distill review process.
The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to Jonathan Uesato for taking the time to review this article.
This article describes research into interpretability techniques applied to vision systems of RL agents. Specifically, the article uses attribution and dimensionality reduction techniques to analyze a CoinRun agent. The authors demonstrate that these tools can produce many useful insights into the agent, allowing agent designers to understand causes of failures (such as obscured or hallucinated features) and to hand-edit model weights to create precise, predictable changes to agent behavior.
This article presents one of the most in-depth applications of interpretability to RL agents to date. The results are compelling, and support the generality of interpretability techniques such as attribution and dimensionality reduction by producing useful insights in the RL domain. The experiments are very well executed.
The model editing results are particularly striking; many deep RL researchers would be surprised that it is possible to modify individual model weights with precise and predictable results.
Overall, this article presents important contributions, and communicates them clearly and correctly.
More detailed comments:
The diversity hypothesis is an interesting insight - in particular, the only-if direction helps contextualize many difficulties in previous work applying interpretability techniques to non-procedural games, such as Atari.
While the diversity hypothesis is clearly labeled as a "hypothesis," the "if" direction still seems rather speculative to me. In particular, while the hypothesis offers that "diverse distributions are sufficient to produce interpretable models," the experiments show only that a diverse distribution produces an interpretable model in a single case. I believe the article would generally be better served by weakening the claim, since it is very difficult to support the claim that diverse distributions are always sufficient.
The discussion could be strengthened by relating the diversity hypothesis to previous work. For instance, how do previous results on what works and what doesn't in vision and language relate to the diversity hypothesis? Given that the hypothesis has such a broad scope ("diverse distributions <-> interpretable models"), discussing it only within the narrow scope of a single agent misses the opportunity to engage with most of the existing evidence at hand.
Is it possible to provide quantitative support for this? Even if categorizing why each failure occurs involves some subjectivity, this would be a valuable summary. For example, can you provide a categorization into common failure reasons? The "Failure rate by cause" figure is a good step in this direction, but it could be useful to consider different categorizations, such as "hallucination" vs. "obscured object," or other high-level categories.
The article overall does a good job flagging caveats, such as when interpretability techniques are effective or not, and noting when examples are cherry-picked vs. summary statistics. It would be good to note the current focus on RL vision systems more explicitly (I think the only mentions of "vision" are in the title and once in the intro); perhaps the future work section would be a good place for this.
Regarding reproducibility, the authors provide an expanded version of the interface. The rl-clarity library and Colab notebook links have yet to be updated, so I cannot comment on the code. Sharing the model weights would also be very valuable for allowing these investigations to be reproduced externally (I assume these will also be provided in the library/notebook).
Distill employs a reviewer worksheet to help reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better," please read the explanations of our expectations for each score: we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
We are extremely grateful to Jonathan for these thoughtful comments. We have made a number of changes thanks to his feedback.
We agree that the "if" direction of the diversity hypothesis is speculative. We have edited our explanation of the hypothesis to make it clearer that we do not expect the "if" direction to hold in a strict sense, and have edited our discussion to make it clearer that we consider the hypothesis to be highly unproven. Nevertheless, we did not change the wording of the hypothesis itself, since we wanted to keep the statement concise, and we hope that it is now sufficiently caveated.
We estimate that around 80-90% of failures have a relatively clear explanation (similar in quality to the three in the article), and that 80-90% of those are due to a combination of the model's lack of memory and bad luck. However, carefully analyzing failures is unfortunately a time-consuming process, and as a result we have only done it a handful of times (probably fewer than 20). Consequently, we do not consider these figures robust enough to include in the article, and have instead modified the text to state that we only explored a few failures.
We have updated the future work section to suggest future research on non-visual systems.
We can confirm that we plan to share the model weights if the paper is accepted for publication.