
Review #3 #9

Closed
distillpub-reviewers opened this Issue Oct 19, 2018 · 2 comments

distillpub-reviewers commented Oct 19, 2018

The following peer review was solicited as part of the Distill review process.

The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer, Ruth Fong, for taking the time to write such a thorough review.


Most of my review will focus on ways the article can be improved. So, before I jump to those potential improvements, I wanted to emphasize that I thoroughly enjoyed the article, the authors’ visualization paradigm (autocomplete + connectivity visualization), and their contribution to the space of explanations; their key contribution (the visualization) is solid and a great fit for Distill. The authors should expect that the parts of the article I don’t comment on I thought were really great as is. In particular, I wanted to highlight that the Connectivity Visualization, Autocomplete (my favorite), and Connectivity diagrams were fantastic. Lastly, apologies for the length and my own disorganization.

The primary area that this article can be improved on is its writing. I’d recommend this article be revised and then reviewed as a second revision (if that’s a possibility for Distill).

Writing:

The low writing-style rating is primarily due to the article’s grammatical errors and its lack of brief but clear explanations in a few places. The word choice in the article could also be improved.

Despite the short form of the Distill article, more context (which can be brief) about the problem the article is trying to explain with its visualization would greatly improve the article:

  • What are the original and modified tasks the models are trained on (next-character prediction vs. the autocomplete problem)? In particular, the authors should more clearly emphasize that, for visualization purposes, they are changing the underlying task (this is key to the article but not very clear / can be missed in its current state).
  • Why are metrics bad explanations? (Intro, p3 -- see below for more details)
  • What do the authors mean by “memorization” (and what are other kinds of “memorization” issues with RNNs)? (Intro, p1 -- see below for more details)

1st Paragraph: The first paragraph could be greatly improved by clearly describing the desired model behavior (prediction based on long-term context) and the problem with the current set-up before jumping into (or without explaining) the specifics, i.e., what is memorization in the context of this article? What is the vanishing gradient problem?

This is a small comment, but I’m not sure “memorization” is the right word to describe the article’s focus; I personally prefer “context”, as “long-term context” makes more sense to me than “long-term memorization”. To me, “memorization” refers more to “short-term memorization”.

Here are more details on a few explanations that I thought could be improved in the article:

  • Intro, p2: For readers unfamiliar with the specifics of NLP, it would help to briefly explain the training task (i.e., next-character prediction as opposed to word prediction).
  • Intro, p1: It’d help to clarify what the authors mean by “the practical problem of memorization” (this is implied later on to mean how much models use earlier context for predicting more downstream words -- moving this up to the intro, using simplified language, would help readers precisely understand the kind of “memorization” the authors are interrogating with their visualization).
  • Intro, p3: It’d help readers’ understanding, and strengthen the article’s argument, to expand a bit more, and with more clarity, on why quantitative metrics only provide “partial insight”. One way to do this would be to explain that boiling a model’s behavior down to one or a few numbers allows many different behaviors to map onto the same number, i.e., there are many ways to achieve high accuracy; for instance, the model can “cheat” and only use short-term context as opposed to long-term context. There are other reasons why performance on metrics serves as a poor “explanation” that the authors might want to consider briefly mentioning, before highlighting that their work focuses on elucidating the context problem. Furthermore, clearly explaining how a short-term-context, “cheating” model can perform well on metrics but not actually learn context (what we’re interested in) would be useful to the unfamiliar reader’s understanding. This explanation seems to appear most clearly in the first paragraph of the conclusion (“It is only for the first couple of characters, that long-term memorization”), but it’d be great to see it expanded a bit more and earlier.

Here are a few grammar issues I caught in my read-through, with suggestions for changes:

  • Intro, paragraph 2: “relies” => “rely”
  • Intro, paragraph 3: “are useful. They” => “are useful, they”
  • Recurrent Units, p1: “considered, uses” => “considered use”
  • Recurrent Units, p1: “vanishing gradient problem, that” -- either remove comma or replace “that” with “which”
  • Recurrent Units, p3: “Theoretical” => “Theoretically,”
  • Recurrent Units, p3: “problem but” => “problem, but”
  • Autocomplete diagram caption: “humanly interpretable” => “human interpretable” or “human-interpretable”
  • Comparing Recurrent Units, p2: “easy reason about” => “easy to reason about”
  • Connectivity in the Autocomplete Problem, “Here are two interesting observations”: Reword “The X observation is when…” => “The X observation is how the model predicts the word ‘Y’” (for the first observation, => “when only given data up to and including the first character.”)
  • Connectivity in the Autocomplete Problem, last p: “long-term memorization That” => “long-term memorization; these observations”
  • Conclusion, p1: “characters, that” => “characters for which”
  • Conclusion, p1: “really matters” => “really matter”

Regarding word choice, the authors used the pattern “very X” and the word “good” a few times; I’d encourage the authors to use stronger words and have provided suggestions:

  • Intro, p3: “good accuracy … very good at predictions” => “achieve high accuracy and low cross entropy loss by only leveraging short-term memorization to make highly accurate predictions” (this suggestion introduces some redundancy in “high accuracy” / “highly accurate”)
  • Recurrent Units, p4: “very difficult” => “a difficult and opaque”

Diagrams

For the reader interested in experimental details, the Autocomplete and Connectivity diagrams (and, more generally, the main text) could be improved by creating links / pop-ups to the appendix with more details explaining each diagram (i.e., link to the Autocomplete Problem from the caption and main text, and/or a hover pop-up with a minimal model and training description).

While it serves its purpose in demonstrating the vanishing gradient problem, “The Vanishing Gradient” diagram could be greatly improved by visualizing the gradient when more complex RNN units (GRU, LSTM) are used. Without giving this too much thought, this could be done by setting up simple, randomly initialized models.
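
To sketch what I mean (a minimal, hedged example in PyTorch, using untrained, randomly initialized units; the sizes are arbitrary and the article itself may well use a different framework):

```python
import torch

seq_len, n_input, n_hidden = 100, 10, 64

for name, Unit in [("RNN", torch.nn.RNN), ("GRU", torch.nn.GRU), ("LSTM", torch.nn.LSTM)]:
    unit = Unit(n_input, n_hidden)
    x = torch.randn(seq_len, 1, n_input, requires_grad=True)
    output, _ = unit(x)                  # output: (seq_len, batch=1, n_hidden)
    # Backpropagate from the final time step only, so x.grad[t] measures how
    # much input step t can still influence the last output.
    output[-1].sum().backward()
    grad_norm = x.grad.norm(dim=(1, 2))  # one gradient magnitude per time step
    print(name, grad_norm[:5])           # near-zero early values indicate vanishing
```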

Recurrent Unit, Nested LSTM: For completeness, it is a nice-to-have to also add the vanilla recurrent unit.

Connectivity Diagram: It would be nice to have a few more examples besides the one current example (as well as a link to code / notebook for others to play around with). It would also be nice to have the percentages for each predicted word.

Important Observations: Bold or underline “Clicking on the links above will change what is viewed in connectivity figure.”

Future work:

The authors can consider -- either mentioning or doing, for this article or for their future work more generally -- the following: a quantitative explanatory metric can be constructed as follows: How quickly is the current word predicted correctly? How many characters does it take on average? How does this metric vary based on the position of the word in the larger text (i.e., first word vs. middle word vs. last word)?
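
A rough sketch of the metric I have in mind (the `predict_word` hook here is hypothetical and stands in for however one extracts the model’s current guess for the word being typed):

```python
def chars_until_correct(text, word_start, word_end, predict_word):
    """Count how many characters of a word the model must see before it
    first predicts the full word correctly; None if it never does."""
    word = text[word_start:word_end]
    for i in range(word_start, word_end):
        # predict_word(text, i): the model's guess for the current word,
        # given the text up to and including character i (hypothetical hook).
        if predict_word(text, i) == word:
            return i - word_start + 1
    return None
```

Averaging this count over all words, and bucketing by word position, would then give a single comparable number per model.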

Scientific Correctness & Integrity

Claims: The main claim the authors make about the current body of work is that their visualization suggests that the Nested LSTM paper’s claim that “more internal memory leads to more long-term memorization” may not be true. More examples (and possibly more analysis -- see the future work suggestion) would be needed to substantiate that claim.

Limitations

The authors should highlight the limitations of visualizing the gradient (as well as justify why they think it’s a decent explanation). I’m more familiar with these limitations as they relate to Computer Vision (i.e., gradients can be noisy and may not actually be a salient “explanation” the greater the distance between the input and output / when propagated back through many layers -- see Simonyan et al., 2013 [and subsequent works that attempt to deal with the noise]) and am assuming that they transfer somewhat to NLP (there might even be limitations unique to NLP vs. CV). If this is not true, the authors should briefly mention that.
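
For concreteness, the kind of gradient saliency I’m referring to is roughly the following (a minimal PyTorch sketch; the `model` call and its output shape are my assumptions, not the authors’ actual implementation):

```python
import torch

def gradient_saliency(model, embedded, t_out, c_out):
    # embedded: (seq_len, emb_dim) leaf tensor of input embeddings
    embedded.requires_grad_(True)
    logits = model(embedded)           # assumed output: (seq_len, n_classes)
    logits[t_out, c_out].backward()    # gradient of a single output score
    return embedded.grad.norm(dim=1)   # one saliency value per input position
```

The noise concern above is about how faithful these per-position norms remain once the gradient has been propagated back through many time steps.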
The authors should also highlight the limitations of their autocomplete problem (this can be done in the appendix / as a hover). One limitation is that the power of the visualization explanation depends on the model’s performance on the autocomplete problem (which can be seen as an upper bound on the quality of the visualization). For instance, if the model performs poorly on autocomplete, the visualization might not be as human-interpretable, and/or we shouldn’t “trust” the autocomplete answers in the visualization as much.
A current limitation is that only one example text is given -- if the authors don’t include more examples, this should also be noted in the caveat on generalizing to other datasets / hyper-parameters.

Replicability

It’d be quite helpful if the authors released a Python notebook for readers to play around more with.

Citations

The authors should cite more for the unfamiliar reader’s benefit, particularly related work on autocomplete and connectivity (datasets should also be properly cited). I’m not familiar with the NLP literature, so I’m not fluent in works related to autocomplete (if there are none, the authors should highlight that this is novel), but on connectivity there’s quite a bit of explanatory / visualization research on visualizing gradients, particularly in Computer Vision (the primary one being Simonyan et al., 2014).

This is a small point, but I’d prefer if the authors made the following change: “another paper” => “Karpathy et al.”, for readers who print out Distill articles (as I did).


Distill employs a reviewer worksheet as a help for reviewers.

The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.

Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
What type of contributions does this article make?: Explanations of recurrent vanishing gradients, a new visualization technique to aid in qualitative comparisons of different approaches.

Advancing the Dialogue Score
How significant are these contributions? 3/5
Outstanding Communication Score
Article Structure 2/5
Writing Style 2/5
Diagram & Interface Style 4/5
Impact of diagrams / interfaces / tools for thought? 3/5
Readability 3/5
Scientific Correctness & Integrity Score
Are claims in the article well supported? 3/5
Does the article critically evaluate its limitations? How easily would a lay person understand them? 3/5
How easy would it be to replicate (or falsify) the results? 2/5
Does the article cite relevant work? 2/5
Does the article exhibit strong intellectual honesty and scientific hygiene? 2/5

AndreasMadsen commented Nov 9, 2018

Writing ...

Thanks, it should all be fixed.

For the reader interested in experimental details, the Autocomplete and Connectivity diagrams (and, more generally, the main text) could be improved by creating links / pop-ups to the appendix with more details explaining each diagram (i.e., link to the Autocomplete Problem from the caption and main text, and/or a hover pop-up with a minimal model and training description).

I added such a link, as well as details on which model is being used.

While it serves its purpose in demonstrating the vanishing gradient problem, “The Vanishing Gradient” diagram could be greatly improved by visualizing the gradient when more complex RNN units (GRU, LSTM) are used. Without giving this too much thought, this could be done by setting up simple, randomly initialized models.

The primary concern here is that doing so either adds a lot of heavy computation to the article or makes the article much larger to download.

Connectivity Diagram: It would be nice to have a few more examples besides the one current example.

Each example takes up about 8 MB, so I’m not sure that would be best.

(as well as a link to code / notebook for others to play around with).

Link added.

would also be nice to have the percentages for each predicted word.
Important Observations: Bold or underline “Clicking on the links above will change what is viewed in connectivity figure.”

I made it italic and bold, as only bold or only underline would have a different meaning.

The authors can consider -- either mentioning or doing, for this article or for their future work more generally -- the following: a quantitative explanatory metric can be constructed as follows: How quickly is the current word predicted correctly? How many characters does it take on average? How does this metric vary based on the position of the word in the larger text (i.e., first word vs. middle word vs. last word)?

Mentioned for now. I will probably end up doing this. Thanks.

Claims: The main claim the authors make about the current body of work is that their visualization suggests that the Nested LSTM paper’s claim that “more internal memory leads to more long-term memorization” may not be true. More examples (and possibly more analysis -- see the future work suggestion) would be needed to substantiate that claim.

It is really just a suggestion; the intention here isn’t to claim that Nested LSTM is worse. It is simply to provide a better visualization for showing long-term memorization / contextual understanding than what they did.

The authors should highlight the limitations of visualizing the gradient (as well as justify why they think it’s a decent explanation). I’m more familiar with these limitations as they relate to Computer Vision (i.e., gradients can be noisy and may not actually be a salient “explanation” the greater the distance between the input and output / when propagated back through many layers -- see Simonyan et al., 2013 [and subsequent works that attempt to deal with the noise]) and am assuming that they transfer somewhat to NLP (there might even be limitations unique to NLP vs. CV). If this is not true, the authors should briefly mention that.

Thanks. I have read the article now. It is rather challenging to draw parallels, and I actually don’t think the article discusses noise itself that much. My intuition is that it is less of a concern, as noise in text has very different properties from noise in images. Images have a high degree of white noise, while text has none. There are some misspellings, and maybe redundancy could be considered a source of noise, but generally it is fairly minor, and thus the gradient is less noisy. I think that is also apparent from the visualizations. Hence I don’t feel a need to mention this further in the article.

The authors should also highlight the limitations of their autocomplete problem (this can be done in the appendix / as a hover). One limitation is that the power of the visualization explanation depends on the model’s performance on the autocomplete problem (which can be seen as an upper bound on the quality of the visualization). For instance, if the model performs poorly on autocomplete, the visualization might not be as human-interpretable, and/or we shouldn’t “trust” the autocomplete answers in the visualization as much.


That is not really a limitation. If the model is that poorly trained, then yes, the connectivity doesn’t make sense, but that will actually show that the model is poorly trained. I would consider that the desired behavior.

A current limitation is that only one example text is given -- if the authors don’t include more examples, this should also be noted in the caveat on generalizing to other datasets / hyper-parameters.

Most likely I will do the metrics as you suggested.

It’d be quite helpful if the authors released a Python notebook for readers to play around more with.

The code will be released.

The authors should cite more for the unfamiliar reader’s benefit, particularly related work on autocomplete and connectivity (datasets should also be properly cited). I’m not familiar with the NLP literature, so I’m not fluent in works related to autocomplete (if there are none, the authors should highlight that this is novel), but on connectivity there’s quite a bit of explanatory / visualization research on visualizing gradients, particularly in Computer Vision (the primary one being Simonyan et al., 2014).

Hmm. It is definitely not novel; it exists on most phones. But I also can’t find any relevant literature on it.

This is a small point, but I’d prefer if the authors made the following change: “another paper” => “Karpathy et al.”, for readers who print out Distill articles (as I did).

Thanks. I changed that.


Thanks a lot for the feedback. I will look into adding the metrics, which could improve the "Scientific Correctness & Integrity" of the article. However, I want to stress (this is done in the article as well) that the purpose of the article is not to discredit Nested LSTM.


AndreasMadsen commented Jan 7, 2019

Thanks a lot for the feedback. I will look into adding the metrics, which could improve the "Scientific Correctness & Integrity" of the article. However, I want to stress (this is done in the article as well) that the purpose of the article is not to discredit Nested LSTM.

I now added the extra metric section.
