The following peer review was solicited as part of the Distill review process.
The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer, Ruth Fong, for taking the time to write such a thorough review.
Most of my review will focus on ways the article can be improved. So, before I jump into those potential improvements, I want to emphasize that I thoroughly enjoyed the article and the authors’ visualization paradigm (autocomplete + connectivity visualization) and contribution to the space of explanations, and that their key contribution (their visualization) is solid and a great fit for Distill. The authors should assume that the parts of the article I don’t comment on I thought were great as is. In particular, I wanted to highlight that the Connectivity Visualization, Autocomplete (my favorite), and Connectivity diagrams were fantastic. Lastly, apologies for the length and my own disorganization.
The primary area in which this article can be improved is its writing. I’d recommend that this article be revised and then reviewed as a second revision (if that’s a possibility for Distill).
The reason for the low writing-style rating is primarily the article’s grammatical errors and its lack of brief but clear explanations in a few places. The word choice in the article could also be improved.
Despite the short form of the Distill article, more context (which can be brief) about the problem the article is trying to explain with its visualization would greatly improve the article.
1st Paragraph: The first paragraph could be greatly improved by clearly describing the desired model behavior (prediction based on long-term context) and the problem with the current set-up before jumping into (or without explaining) the specifics. For example: what is memorization in the context of this article? What is the vanishing gradient problem?
This is a small comment, but I’m not sure “memorization” is the right word to describe the article’s focus; I personally prefer “context”, as “long-term context” makes more sense to me than “long-term memorization”. To me, “memorization” refers more to “short-term memorization”.
Here are more details on a few explanations that I thought can be improved in the article:
Here’s a few grammar issues I caught in my read through and suggestions for changes:
Regarding word choice, the authors used the pattern “very X” and the word “good” a few times; I’d encourage the authors to use stronger words and have provided suggestions.
For the reader interested in experimental details, the Autocomplete and Connectivity diagrams (and more generally the main text) could be improved by creating links / pop ups to the appendix with more details explaining each diagram (i.e., link to the Autocomplete Problem from caption and main text and/or a hover pop-up with minimal model and training description).
While it serves its purpose in demonstrating the vanishing gradient problem, “The Vanishing Gradient” diagram could be greatly improved by visualizing the gradient when more complex RNN units (GRU, LSTM) are used. Without giving this too much thought, this could be done by setting up simple, randomly initialized models.
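The random-initialization idea can be sketched directly for the vanilla RNN case (the GRU/LSTM variants would be computed the same way with a framework’s autograd). This is a hypothetical illustration, not the article’s actual model or code:

```python
import numpy as np

# Hypothetical sketch: measure how the gradient of the final hidden state
# with respect to earlier hidden states decays in a randomly initialized
# vanilla RNN. For h_t = tanh(W_h h_{t-1} + W_x x_t), the Jacobian
# dh_T/dh_t is a product of diag(1 - h_k^2) @ W_h factors, whose norm
# typically shrinks -- the vanishing gradient the diagram illustrates.

rng = np.random.default_rng(0)
seq_len, hidden = 50, 32
W_h = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, hidden))
W_x = rng.normal(scale=1.0 / np.sqrt(hidden), size=(hidden, hidden))

# Forward pass over a random input sequence.
hs, h = [], np.zeros(hidden)
for _ in range(seq_len):
    h = np.tanh(W_h @ h + W_x @ rng.normal(size=hidden))
    hs.append(h)

# Backward pass: accumulate the Jacobian and record its norm per step.
J = np.eye(hidden)
norms = []
for k in range(seq_len - 1, 0, -1):
    J = J @ (np.diag(1.0 - hs[k] ** 2) @ W_h)
    norms.append(np.linalg.norm(J))
norms.reverse()  # norms[t] ~ ||dh_T / dh_t|| for t = 0 .. seq_len - 2

print(norms[0], norms[-1])  # earliest time step vs. most recent one
```

Plotting `norms` per time step for a vanilla RNN, GRU, and LSTM side by side would make the comparison the review asks for.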
Recurrent Unit, Nested LSTM: For completeness, it would be nice to also add the vanilla recurrent unit.
Connectivity Diagram: It would be nice to have a few more examples besides the one current example (as well as a link to code / notebook for others to play around with). It would also be nice to have the percentages for each predicted word.
Important Observations: Bold or underline “Clicking on the links above will change what is viewed in connectivity figure.”
The authors could consider the following -- either mentioning it or doing it, for this article or for their future work more generally. A quantitative explanatory metric can be constructed as follows: How quickly is the current word predicted correctly? How many characters does it take on average? How does this metric vary based on the position of the word in the larger text (i.e., first word vs. middle word vs. last word)?
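The suggested metric could be sketched as follows; `predict_word` is a hypothetical stand-in for the article’s autocomplete model, and none of these names come from the article:

```python
# Sketch of the proposed metric: for each word, how many characters must
# the model see before it first predicts the full word correctly?
# `predict_word(context)` is a hypothetical stand-in for the article's
# autocomplete model: given the text so far, it returns a completed word.

def chars_until_correct(words, predict_word):
    counts, context = [], ""
    for word in words:
        needed = len(word)  # worst case: the word is never completed early
        for i in range(1, len(word) + 1):
            if predict_word(context + word[:i]) == word:
                needed = i
                break
        counts.append(needed)
        context += word + " "
    return counts

# Toy usage with a trivial first-letter lookup "model":
vocab = {"t": "the", "q": "quick", "b": "brown"}
toy_model = lambda ctx: vocab.get(ctx.split()[-1][0], "")
print(chars_until_correct(["the", "quick", "brown"], toy_model))  # [1, 1, 1]
```

Averaging the counts, grouped by each word’s position in the larger text, would give the position-dependent variant of the metric.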
Scientific Correctness & Integrity
Claims: The main claim the authors make about the current body of work is that their visualization suggests that the Nested LSTM paper’s claim that “more internal memory leads to more long-term memorization” may not be true. More examples (and possibly more analysis -- see the future-work suggestion) would be needed to substantiate that claim.
The authors should highlight the limitations of visualizing the gradient (as well as justify why they think it’s a decent explanation). I’m more familiar with these limitations as they relate to computer vision (i.e., gradients can be noisy and may not actually be a salient “explanation” the greater the distance between the input and output / when propagated back through many layers -- see Simonyan et al., 2013, and subsequent works that attempt to deal with the noise), and I am assuming that they transfer somewhat to NLP (there might even be limitations unique to NLP vs. CV). If this is not true, the authors should briefly mention that.
It’d be quite helpful if the authors released a Python notebook for readers to play around more with.
The authors should cite more for the unfamiliar reader’s benefit, particularly related work on autocomplete and connectivity (datasets should also be properly cited). I’m not familiar with the NLP literature, so I don’t know the works related to autocomplete (if there are none, the authors should highlight that this is novel), but for connectivity there’s quite a bit of explanatory / visualization research on visualizing gradients, particularly in computer vision (the primary one being Simonyan et al., 2014).
This is a small point, but I’d prefer if the authors made the following change: “another paper” => “Karpathy et al.”, for readers who print out Distill articles (as I did).
Distill employs a reviewer worksheet as a help for reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
Thanks, it should all be fixed.
I added such a link, as well as details on which model is being used.
The primary concern here is that doing so would either add a lot of heavy computation to the article or make it much larger to download.
Each example takes up about 8 MB, so I’m not sure that would be best.
I made it italic and bold, as bold alone or underline alone would have a different meaning.
Mentioned for now. I will probably end up doing this. Thanks.
It is really just a suggestion; the intention here isn’t to claim that Nested LSTM is worse. It is simply to provide a better visualization for showing long-term memorization/contextual understanding than what they did.
Thanks. I have read the article now. It is rather challenging to draw parallels, and I actually don’t think the article discusses noise itself that much. My intuition is that it is less of a concern here, as noise in text has very different properties compared to images. Images have a high degree of white noise, while text has none. There are some misspellings, and perhaps redundancy could be considered a source of noise, but generally it is fairly minor, and thus the gradient is less noisy. I think that is also apparent from the visualizations. Hence I don’t feel a need to mention this further in the article.
That is not really a limitation. If the model is that poorly trained, then yes the connectivity doesn't make sense, but that will actually show that the model is poorly trained. I would consider that the desired behavior.
Most likely I will do the metrics as you suggested.
The code will be released.
Hmm. It is definitely not novel; it exists on most phones. But I also can’t find any relevant literature on it.
Thanks. I changed that.
Thanks a lot for the feedback. I will look into adding the metrics, which could improve the “Scientific Correctness & Integrity” of the article. However, I want to stress (this is stated in the article as well) that the purpose of the article is not to discredit Nested LSTM.
I now added the extra metric section.