Review #3
The following peer review was solicited as part of the Distill review process.
The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to Joel Lehman for taking the time to review this article.
"Here are detailed notes I took while writing the review:
""However, only the layout of each level is randomized, and correspondingly, we were only able to find interpretable features [at] the level of abstraction of objects"" [typo]
""With a diverse environment"" in header -- with diverse environments? It seems like a diverse environment is ambiguous.
""Dissecting failures"" vs ""Hallucinations"" -- Are hallucinations just a subclass of failures you dissect? Or do hallucinations not cause failures?
""Here are some examples of the objects used, along with walls and floors, to generate CoinRun levels."" -- Does the agent sprite change procedurally as well? Slightly confusing why the agent looks different when in mid air vs jump. Is there a wide range of agent / enemy sprites, or just the two version shown?
""The velocity info painted into the top left of each observation"" -- to make the task more Markovian? Should mention why CoinRun paints velocity, a reader could miss that it actually is important (I imagine this is a standard feature of CoinRun?).
""This incorporates attribution from a hidden layer, which serves to highlight objects that positively or negatively influence a particular network output"" -> At this point in the article, attribution hasn't been defined. If this serves as a definition, the reader can mostly muddle through, but it would be nice if the connection between hidden layer / attribution / objects could be clarified a little -- e.g. ""This incorporates attribution, from a hidden layer that recognizes objects, which serves to highlight objects ...""
""By applying dimensionality reduction, attribution is sliced according to the type of object detected"" -> Confusing, implies that dimensionality reduction identifies objects, which had me scratching my head -- reads too ambiguously.
(On the first interface) Note that the interface didn't fit entirely on my laptop screen (Google Pixelbook), and was initially confusing -- at first I thought the timeline was separate from the rest of the interface (probably because the whole thing didn't fit on my screen). After some playing around I could understand it, but it was inconvenient to have to scroll up and down. If I zoom out to 80% it fits, but that is not something I often think about doing (much better experience when the interface completely fits). Keyboard shortcuts for forward/back frame would be nice -- mainly because it is hard to coordinate between playing and rewinding and focusing on particular features. It is very nice / intriguing how effective dimensionality reduction is here.
""Stepping down to avoid jumping"" -> Confusing -- it appears as if the agent starts in mid-air, how could it jump from mid-air in the first place? (is double jumping allowed by the dynamics of CoinRun? -- if so, a reader would need to know that in order to appreciate what is going on). Hard to see what is going on, perhaps original resolution would be helpful here. I couldn't appreciate the story for this example, which was a shame. In general, original resolution would be helpful just to understand what actually is going on, although of course the observation is useful to understand the point of view of the model. In general, there is a lot going on (UI a bit cluttered w/ options to attribute to actions instead of value, etc.); some choices not clear to me or explained (I could infer what attribution channel totals were, but perhaps should be described -- is the idea to introduce UI complexity gradually (attribution channel totals not included in first interface, but then are included in later ones?)
Not sure how I feel about forward-referencing the technique used for "model editing" -- makes it hard to peer review. Not sure what Distill's review policy is on that, but I have to take the authors' word that what they are doing behind the scenes is sane, which I do trust, but not necessarily good in general for science. "Our results show that our edits were successful and targeted, with no statistically measurable effects on the agent's other abilities." -> One confusing thing is the scale of the plot -- is it true that even with buzzsaw blindness, the agent still beats ~80% of levels? Is that reasonable (or are buzzsaws sort of avoidable by chance 80% of the time by an otherwise skilled policy)?
""An interface for all convolutional layers can be viewed here"" -> nice!
""Our results illustrate how diversity may lead to interpretable features via generalization, lending support to the diversity hypothesis."" -> very interesting, good empirical evidence.
""If our analysis here is valid, it provides further evidence for the diversity hypothesis, as an example of features that are hard to interpret in the absence of diversity. However, it seems to be a lack of diversity at a low level of abstraction that harms our ability to interpret features at all levels of abstraction, which could be an artefact of the fact that gradient-based feature visualization needs to back-propagate through earlier layers."" -> But you tried adding low-level visual diversity, right? Isn't that conflicting evidence?
""Spatially-aware feature visualization"" -- clever/interesting, but is the coin-detecting channel the only thing that does something interesting? Is there anything else this helped you to find?
Caption to the figure on integrated gradients: "Viewing F as the height of a surface, the integrated gradient of F measures the elevation gained while traveling in each direction, and sums to the total elevation gain" -- I found neither this caption nor the figure itself intuitive -- some ambiguity about what "each direction" is. If it is a plot of X vs elevation, isn't the only "direction" positive-negative X? Not sure what I am supposed to take away from this plot. Also -- another forward reference to unpublished research there (there are several in this paper); almost subconsciously it seems to suggest the appearance of Distill as more an arm of the OpenAI research group than as a research journal. I don't doubt the integrity of Distill, but it is sort of an implicit thought that comes to mind for me, and may for others as well. Ahh, I understand now the figure -- the gradient on the background is supposed to be "F" -- needs explanation in the caption.
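(Editor's aside: the "sums to the total elevation gain" property the caption describes is the completeness axiom of integrated gradients, and it can be checked numerically. The sketch below is purely illustrative -- the function, baseline, and helper names are invented, not taken from the article.)

```python
import numpy as np

def integrated_gradients(grad_f, baseline, x, steps=100):
    """Approximate integrated gradients of a function from baseline to x.

    Returns one attribution per input dimension; by the completeness
    axiom these attributions sum to f(x) - f(baseline).
    """
    # Midpoint-rule sample points along the straight path baseline -> x.
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    grads = np.array([grad_f(p) for p in path])
    # Average gradient along the path, scaled by the displacement.
    return (x - baseline) * grads.mean(axis=0)

# Toy "elevation surface" F(x, y) = x^2 + 3y with its analytic gradient.
f = lambda p: p[0] ** 2 + 3 * p[1]
grad_f = lambda p: np.array([2 * p[0], 3.0])

baseline = np.zeros(2)
x = np.array([2.0, 1.0])
attr = integrated_gradients(grad_f, baseline, x)
print(attr)        # elevation gained along each input direction
print(attr.sum())  # matches the total elevation gain f(x) - f(baseline)
```

Here each entry of `attr` is the "elevation gained while traveling in each direction," and their sum recovers the total gain, which is the point the caption is trying to make.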
""Both methods tend to produce NMF directions that are close to one-hot, and so can be thought of as picking out the most relevant channels. However, when reducing to a small number of dimensions, using attributions usually picks out more salient features, because attribution takes into account not just what neurons respond to but also whether their response matters."" -> I found this confusing to read. NMF directions are close to one-hot, so then dimensionality reduction is just removing a lot of dead channels (or at least dead with respect to affecting the value of a state)? I would suggest clarifying this paragraph -- maybe adding a figure, or anything that could help give a reader better intuition of what you're suggesting, which sounds interesting and important, but was hard for me to discern.
""Do the ""non-diverse features"" that appear in the absence of diversity remain when diversity is present?"" -- what do you mean by ""non-diverse features"" -- non-interpretable, or is it more tautological -- that ""non-diverse features"" are features resulting when not training with diversity that tend not to lead to generalizing policies? Clarify this list item (""Pervasiveness of non-diverse features"")
""such as by using a misaligned approval-based objective"" -> I think I get what you're saying, but the point is not clear -- don't think you can expect average reader to infer what you mean by this."
Distill employs a reviewer worksheet as a help for reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
We are extremely grateful to Joel for these thoughtful comments. We have made a number of changes thanks to his suggestions, and only discuss the more substantive points here.
The agent cannot jump when it is in mid-air, but jumping is only triggered when up is released, and it is allowed to start pressing up before landing. We have modified the explanation of Timestep 1 to include this information.
We agree that a forward reference is not proper here. It was not our intention to make it seem as though the article relied on unpublished work, and a full explanation of the model editing method was included, but it was buried in a footnote. We have therefore converted this footnote into an appendix, and linked to it more prominently. We have kept the citation to the Circuits thread (which has now been partially published) in order to give proper credit.
This is an excellent point. In fact, buzzsaws are avoidable by chance in around 68% of levels by an otherwise skilled policy, which we measured by modifying the game to make the buzzsaws invisible. This implies that the blindness we induce by model editing is significant but not complete. This information was also buried in a footnote, but we now better appreciate its significance and so have promoted it to the main text.
This is another excellent point, and we have edited this paragraph to highlight the conflicting evidence.