Review #3

@distillpub-reviewers

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer, Ruth Fong, for taking the time to write such a thorough review.


Most of my review will focus on the ways the article can be improved. So, before I jump to those potential improvements, I wanted to emphasize that I thoroughly enjoyed the article and the authors’ visualization paradigm (autocomplete + connectivity visualization) and its contribution to the space of explanations; their key contribution (the visualization) is solid and a great fit for Distill. The authors should assume that the parts of the article I don’t comment on were, in my view, great as is. In particular, I wanted to highlight that the Connectivity Visualization, Autocomplete (my favorite), and Connectivity diagrams were fantastic. Lastly, apologies for the length and my own disorganization.

The primary area that this article can be improved on is its writing. I’d recommend this article be revised and then reviewed as a second revision (if that’s a possibility for Distill).

Writing:

The low writing style rating is primarily due to the article’s grammatical errors and a lack of brief but clear explanations in a few places. The word choice in the article could also be improved.

Despite the short form of the Distill article, more context (which can be brief) about the problem the article is trying to explain with its visualization would greatly improve the article:
  • What are the original and modified tasks the models are trained on (next character prediction vs. the auto-completion problem)? In particular, the authors should more clearly emphasize that, for visualization purposes, they are changing the underlying task (this is key to the article but not very clear / can be missed in its current state).
  • Why are metrics bad explanations? (Intro, p3 -- see below for more details)
  • What do the authors mean by “memorization” (and what are other kinds of “memorization” issues with RNNs)? (Intro, p1 -- see below for more details)

1st Paragraph: The first paragraph could be greatly improved by clearly describing the desired model behavior (prediction based on long-term context) and the problem with the current set-up before jumping into (or without explaining) the specifics, e.g., what is memorization in the context of this article? What is the vanishing gradient problem?

This is a small comment, but I’m not sure “memorization” is the right word to describe the article’s focus, and I personally prefer “context”, as “long-term context” makes more sense to me than “long-term memorization”. To me, “memorization” seems to refer more to “short-term memorization”.

Here are more details on a few explanations that I think could be improved in the article:
  • Intro, p2: For readers unfamiliar with the specifics of NLP, it would help to briefly explain the training task (i.e., next character prediction as opposed to word prediction).
  • Intro, paragraph 1: It’d help to clarify what the authors mean by “the practical problem of memorization” (this is implied later on to mean how much models use earlier context for predicting more downstream words -- moving this up to the intro using simplified language would help readers precisely understand the kind of “memorization” the authors are interrogating with their visualization).
  • Intro, p3: It’d help the readers’ understanding, and strengthen the article’s argument, to expand a bit more and with more clarity on why quantitative metrics only provide “partial insight”. One way to do this would be to explain that boiling a model’s behavior down to one or a few numbers allows many different behaviors to map onto the same number, i.e., there are many ways to achieve high accuracy; for instance, the model can “cheat” and only use short-term context as opposed to long-term context. There are other reasons why performance on metrics serves as a poor “explanation” that the authors might want to briefly mention before highlighting that their work focuses on elucidating the context problem. Furthermore, clearly explaining how a short-term-context, “cheating” model can perform well on metrics but not actually learn context (what we’re interested in) would aid the unfamiliar reader’s understanding. This explanation seems to appear most clearly in the first paragraph of the conclusion (“It is only for the first couple of characters, that long-term memorization”), but it’d be great to see it expanded a bit more and given earlier.

Here are a few grammar issues I caught in my read-through, along with suggestions for changes:

  • Intro, paragraph 2: “relies” => “rely”
  • Intro, paragraph 3: “are useful. They” => “are useful, they”
  • Recurrent Units, p1: “considered, uses” => “considered use”
  • Recurrent Units, p1: “vanishing gradient problem, that” -- either remove comma or replace “that” with “which”
  • Recurrent Units, p3: “Theoretical” => “Theoretically,”
  • Recurrent Units, p3: “problem but” => “problem, but”
  • Autocomplete diagram caption: “humanly interpretable” => “human interpretable” or “human-interpretable”
  • Comparing Recurrent Units, p2: “easy reason about” => “easy to reason about”
  • Connectivity in the Autocomplete Problem, “Here are two interesting observations”: Reword “The X observation is when…” => “The X observation is how the model predicts the word ‘Y’” (for the first observation, => “when only given data up to and including the first character.”)
  • Connectivity in the Autocomplete Problem, last p: “long-term memorization That” => “long-term memorization; these observations”
  • Conclusion, p1: “characters, that” => “characters for which”
  • Conclusion, p1: “really matters” => “really matter”

Regarding word choice, the authors used the pattern “very X” and the word “good” a few times; I’d encourage the authors to use stronger words and have provided suggestions:
  • Intro, p3: “good accuracy … very good at predictions” => “achieve high accuracy and low cross entropy loss by only leveraging short-term memorization to make highly accurate predictions” (this suggestion introduces some redundancy in “high accuracy” / “highly accurate”)
  • Recurrent Units, p4: “very difficult” => “a difficult and opaque”

Diagrams

For the reader interested in experimental details, the Autocomplete and Connectivity diagrams (and more generally the main text) could be improved by adding links / pop-ups to the appendix with more details explaining each diagram (i.e., link to the Autocomplete Problem from the caption and main text and/or a hover pop-up with a minimal model and training description).

While it serves its purpose in demonstrating the vanishing gradient problem, “The Vanishing Gradient” diagram could be greatly improved by visualizing the gradient when more complex RNN units (GRU, LSTM) are used. Without giving this too much thought, this could be done by setting up simple, randomly initialized models.
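To make that suggestion concrete, here is a minimal sketch of how such a comparison could be set up, assuming PyTorch and its built-in, randomly initialized recurrent cells (the framework, layer sizes, and sequence length are my own assumptions, not drawn from the article): compute the gradient of the final time step's output with respect to every input step and compare its norm across positions for each cell type.

```python
# Hedged sketch: how far do gradients propagate back through time for
# randomly initialized recurrent cells? (vanilla RNN vs. GRU vs. LSTM)
import torch
import torch.nn as nn

seq_len, hidden = 100, 64
cells = {
    "RNN":  nn.RNN(hidden, hidden),
    "GRU":  nn.GRU(hidden, hidden),
    "LSTM": nn.LSTM(hidden, hidden),
}

# A random input sequence, shape (time, batch, features).
x = torch.randn(seq_len, 1, hidden, requires_grad=True)

for name, cell in cells.items():
    out, _ = cell(x)
    # Gradient of the last time step's output w.r.t. every input step;
    # the norm at step t shows how strongly the final output still depends on step t.
    (grad,) = torch.autograd.grad(out[-1].sum(), x)
    per_step = grad.norm(dim=-1).squeeze(1)          # shape: (seq_len,)
    print(name, per_step[0].item(), per_step[-1].item())
```

Plotting these per-step norms for each cell type would give a version of the vanishing gradient diagram that covers the gated units as well.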

Recurrent Unit, Nested LSTM: For completeness, it would be nice to also add the vanilla recurrent unit.

Connectivity Diagram: It would be nice to have a few more examples besides the one current example (as well as a link to code / notebook for others to play around with). It would also be nice to have the percentages for each predicted word.

Important Observations: Bold or underline “Clicking on the links above will change what is viewed in connectivity figure.”

Future work:

The authors could consider the following, either as a brief mention in this article or for their future work more generally: a quantitative explanatory metric can be constructed by asking how quickly the current word is predicted correctly. How many characters does it take on average? How does this metric vary based on the position of the word in the larger text (i.e., first word vs. middle word vs. last word)?
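As a rough sketch of how that metric could be computed, assuming the autocomplete model is wrapped as a `suggest(prefix)` function returning its single top word suggestion (that function, the vocabulary, and the data format below are hypothetical placeholders, not the article's code):

```python
# Hedged sketch of the proposed metric: for each target word, count how many
# of its characters the model must see before its top suggestion is correct.
from statistics import mean

def chars_until_correct(words, suggest):
    """Per word, the number of characters needed before suggest(prefix)
    equals the word (len(word) + 1 if it never does)."""
    counts = []
    for word in words:
        for k in range(1, len(word) + 1):
            if suggest(word[:k]) == word:
                counts.append(k)
                break
        else:
            counts.append(len(word) + 1)  # never predicted correctly
    return counts

# Example usage with a trivial stand-in "model":
vocabulary = ["learning", "learner", "gradient"]
suggest = lambda prefix: next((w for w in vocabulary if w.startswith(prefix)), "")
print(mean(chars_until_correct(["gradient", "learner"], suggest)))  # -> 3.5
```

Averaging these counts (optionally normalized by word length, or bucketed by the word's position in the text) would give the kind of single-number summary of long-term context use suggested above.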

Scientific Correctness & Integrity

Claims: The main claim the authors make about the current body of work is that their visualization suggests that the Nested LSTM paper’s claim that “more internal memory leads to more long-term memorization” may not be true. More examples (and possibly more analysis -- see the future work suggestion) would be needed to substantiate that claim.

Limitations

The authors should highlight the limitations of visualizing the gradient (as well as justify why they think it’s a decent explanation). I’m more familiar with these limitations as they relate to Computer Vision (i.e., gradients can be noisy and may not be a salient “explanation” the greater the distance between input and output / when propagated back through many layers -- see Simonyan et al., 2013, and subsequent works that attempt to deal with the noise), and I am assuming that they transfer somewhat to NLP (there might even be limitations unique to NLP vs. CV). If this is not true, the authors should briefly mention that.
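For readers less familiar with this line of work, here is a minimal sketch of the kind of gradient-based connectivity/saliency being discussed (in the spirit of Simonyan et al.), using a randomly initialized PyTorch character model as a stand-in; none of the model details below are taken from the article:

```python
# Hedged sketch: one "connectivity" score per input character, computed as the
# norm of the gradient of a chosen output logit w.r.t. that character's embedding.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden = 128, 32, 64
embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
head = nn.Linear(hidden, vocab_size)

text = "the grass is green"
ids = torch.tensor([[min(ord(c), vocab_size - 1) for c in text]])

emb = embed(ids)
emb.retain_grad()              # keep the gradient on the (non-leaf) embedded input
out, _ = lstm(emb)
logits = head(out[:, -1])      # prediction for the character after the last input

top = logits[0].argmax()
logits[0, top].backward()

connectivity = emb.grad.norm(dim=-1).squeeze(0)   # one score per input character
for ch, score in zip(text, connectivity.tolist()):
    print(repr(ch), round(score, 4))
```

The noise concern above is that these per-character norms can become erratic as the distance between the scored output and the input grows, which is worth acknowledging when presenting them as an explanation.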
The authors should also highlight the limitations of their autocomplete problem (this can be done in the appendix / as a hover). One limitation is that the power of the visualization as an explanation depends on the model’s performance on the autocomplete problem (which can be seen as an upper bound on the quality of the visualization). For instance, if the model performs poorly on autocomplete, the visualization might not be as human-interpretable, and/or we shouldn’t “trust” the autocomplete answers in the visualization as much.
A current limitation is that only one example text is given -- if the authors don’t include more examples, this should also be noted in the caveat on generalizing to other datasets / hyper-parameters.

Replicability

It’d be quite helpful if the authors released a Python notebook for readers to play around more with.

Citations

The authors should cite more for the unfamiliar reader’s benefit, particularly on work related to autocomplete and connectivity (datasets should also be properly cited). I’m not familiar with the NLP literature, so I’m not fluent in work related to autocomplete (if there is none, the authors should highlight that this is novel), but on connectivity there’s quite a bit of explanatory / visualization research on visualizing gradients, particularly in Computer Vision (a primary one being Simonyan et al., 2014).

This is a small point, but I’d prefer if the authors made the following change: “another paper” => “Karpathy et al.”, for readers who print out Distill articles (as I did).


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanations of recurrent vanishing gradients, a new visualization technique to aid in qualitative comparisons of different approaches.

Advancing the Dialogue Score
How significant are these contributions? 3/5
Outstanding Communication Score
Article Structure 2/5
Writing Style 2/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 3/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 3/5
How easy would it be to replicate (or falsify) the results? 2/5
Does the article cite relevant work? 2/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 2/5
