The following peer review was solicited as part of the Distill review process.
The reviewer chose to waive anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer, Dylan Cashman, for taking the time to write such a thorough review.
In this work, the authors explore three different kinds of recurrent models for autocompletion primarily using two interactive visualizations. It isn't clear what the contribution of the work is, however, because the claims made in the title and introduction are not adequately matched by evidence found in the visualizations. The introduction suggests that the visualizations provided show how gradient magnitudes can be helpful in understanding the differences between short-term and long-term memories, but the two visualizations prominently display the inferences of the models. The second visualization does show gradient magnitude, or connectivity, but the text provides little guidance on how to interpret it, and the two outlined examples only depend on the inference, or expected predictions, of the models. Inspecting the inferences of models to compare them is not a novel idea, although the visualizations are very clear and seem like very useful tools to inspect inference. In summary, the visualizations seemed very useful, but it was unclear what the goal of the work was.
This submission is promising in two distinct ways. To begin with, it features some very compelling visualizations that are well-designed, responsive, and inviting to the reader. It also attempts to explain the difference between memories of various recurrent models, which is not well understood empirically. However, its findings are not convincing, both because the examples given only analyze the inference probabilities of the models, which is not a novel technique, and also because the models being inspected do not replicate the performance found in past studies. I recommend that the authors attempt to replicate the findings of the NLSTM on the auto-correct data, and then use the visualizations to interpret those models. Otherwise, it is impossible to attribute any artifact found in the visualization as being the result of a poorly trained model, or an issue with the visualization. If it is the case that the NLSTM results cannot be replicated on that dataset, the authors need to offer some suggestions of why that's the case. In fact, this might offer a compelling story about their visualization—the qualitative insights they gain from their visualization lead them to conduct more quantitative experiments to understand why their NLSTM model is not learning long-term dependencies.
Some minor comments:
The figures should have more descriptive captions. For example, in the first figure, the reader doesn't know what the green highlights mean. It isn't necessary to describe in detail what connectivity is at this point. It is only necessary to provide a high-level description of how the reader should interpret what's going on in the figure.
In general, the figures are excellent, and the use of hypertext to set the state of the figures is very convenient for the reader.
The intro is fairly generic. It isn't compelling to the reader to describe memorization as "a challenge": Why is it challenging? Can you provide examples of why this is a critical issue? Towards the end of the introduction, there should be a high-level description of what the visualization does, and why it would offer any more insight than cross-entropy or accuracy. Many users of recurrent algorithms might also need to be convinced why an interactive visualization would even be useful; why can't this problem be solved by just printing out a metric? The answer has to do with grounding the user in their data, allowing them to rapidly test hypotheses on how the various models perform at sections of their training set, using a pleasing and inviting visualization.
It would be good to link to some literature on the vanishing gradient: Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. "On the difficulty of training recurrent neural networks." International Conference on Machine Learning. 2013.
When the authors mention connectivity, it is unclear if this is a term that they have defined, or if it has previous usage in the field. This should be better explained. The authors should also explain, at a high level, why the gradient might be an interesting quantity to show the reader, and explain how to interpret it. In particular, the reader should understand what it means for a model to have some gradient propagating back to much earlier words; that the current prediction is a function of the prediction made at that time step.
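To make the suggested interpretation concrete (a minimal sketch of the general idea, not the authors' implementation): "connectivity" can be read as the magnitude of the gradient of the current prediction with respect to each earlier input. In a toy linear recurrence h_t = w·h_{t-1} + x_t, that gradient is w^(T-1-t), so for |w| < 1 it shrinks geometrically toward earlier time steps, which is exactly the vanishing-gradient effect the visualization is meant to expose.

```python
# Toy illustration of "connectivity" (an assumption about the intended
# meaning, not the article's code): for the linear recurrence
# h_t = w * h_{t-1} + x_t, the gradient d h_T / d x_t equals w**(T-1-t),
# so its magnitude decays geometrically for |w| < 1.

def connectivity(w: float, seq_len: int) -> list[float]:
    """|d h_T / d x_t| for each time step t of the linear recurrence."""
    return [abs(w) ** (seq_len - 1 - t) for t in range(seq_len)]

conn = connectivity(w=0.5, seq_len=5)
print(conn)  # [0.0625, 0.125, 0.25, 0.5, 1.0] — weaker for earlier inputs
```

A model whose connectivity stays non-negligible for much earlier words is one whose current prediction is still a function of those time steps, which is the property the reader should be taught to look for.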
In both the figure describing the recurrent units and the appendix, the simpler architectures should go first, with the NLSTM last, to allow the reader to understand the context in which the NLSTM was invented. The description of the LSTM and NLSTM given in the appendix seem mostly superfluous because this information is readily available in the listed references.
Grammar/spelling and word usage:
In the introduction, there is an incomplete sentence: "While quantitative comparisons are useful."
Some word misspellings: "gramma", "attemps", "enogth"
When describing how the models were trained, the authors note they trained for 7139 epochs for one complete run of the data. This actually means they trained for 7139 batches and a single epoch. This may actually be the source of some of the issues they had: it is likely that their models should train for many more epochs.
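The terminology mix-up is easy to check with a bit of arithmetic (the dataset size and batch size below are illustrative assumptions, not numbers from the paper): one epoch is a full pass over the data, and the 7139 figure counts the mini-batch updates within that single pass.

```python
import math

def batches_per_epoch(dataset_size: int, batch_size: int) -> int:
    """Number of mini-batch updates in one full pass (epoch) over the data."""
    return math.ceil(dataset_size / batch_size)

# Hypothetical example: a dataset of 228_448 sequences with batch size 32
# yields 7139 mini-batches, i.e. 7139 updates but only one epoch.
print(batches_per_epoch(228_448, 32))  # 7139
```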
Distill employs a reviewer worksheet to help guide reviewers.
The first three parts of this worksheet ask reviewers to rate a submission along certain dimensions on a scale from 1 to 5. While the scale meaning is consistently "higher is better", please read the explanations for our expectations for each score—we do not expect even exceptionally good papers to receive a perfect score in every category, and expect most papers to be around a 3 in most categories.
Any concerns or conflicts of interest that you are aware of?: No known conflicts of interest
I don't know what you mean by the introduction and title not matching. This is too unspecific to be actionable. Others have suggested something more specific here; I hope that matches your expectations.
As you said, the second visualization shows the "gradient magnitude, or connectivity", which is not inference. And there are two descriptions of how to interpret that.
I have not made any claims that anything here is novel. I think the idea is very simple and I would be surprised if I'm the first to visualize gradient magnitude; the novelty is in how it has been visualized.
The Nested LSTM paper found the cross entropy to be close to, but slightly better than, the LSTM and GRU. I also found it to be close, although slightly worse. As we don't know the exact setup of the Nested LSTM paper, I think getting close is reasonable. If the purpose of this article were to discredit the Nested LSTM this might be worth the effort, but that is not the purpose. I think the variations are likely explained by a random seed or some hyperparameter.
Thanks. A few words about this have been added.
Hmm, I think this is a symptom of only articles that produce results being published. That makes it hard to cite concrete issues.
Some text has been added. Others have suggested something similar.
Thanks for the suggestion.
There are some new drawings. Besides that, it only serves to make the notation explicit.
Thanks. The word "gramma" is from the dataset. I don't intend to fix that.
I changed it to the word "mini-batch". I think that is the most explicit.