The following peer review was solicited as part of the Distill review process. Some points in this review were clarified by an editor after consulting the reviewer.
The reviewer chose to keep anonymity. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.
Distill is grateful to the reviewer for taking the time to review this article.
Conflicts of Interest: Reviewer disclosed no conflicts of interest.
[The reviewer did not include this in the written review, but verbally gave positive overall feedback and encouraged Distill to publish the article]
“This lets us use”
Does beam search necessarily give a close to optimal result? Often it does, but this seems like too strong a statement. Maybe “approximate solution that performs quite well in practice and isn’t too expensive to find”?
It’s a problem for training as well, isn’t it, just more subtly? If you’re summing over alignments that form “cat” during training, you don’t want to sum over ones that are ambiguous or that really mean “caat”.
Worth noting that the reason this isn’t a huge problem is that we can split X into pieces that are as small as we want, so that even when the relative speed of X and Y varies a lot, we can choose a chunk size for X small enough that this constraint is rarely violated.
Worth saying that just as we have a formula for the sum, we can also have a formula for its derivative in terms of the derivatives of p_t(z|X), and we can use this analytic formula directly in the gradient computation.
maybe a little more explanation, e.g. “doesn’t depend on the size of the output because we never have to sum over all possible letters”.
Worth saying that the language model requires beam search, and maybe saying a bit more about how it’s incorporated into beam search.
A bit mysterious and not totally clear.
Also note that if you have an auxiliary language model to go with the CTC model, then the system as a whole isn’t making the conditional independence assumption, so long as the language model isn’t character-wise conditionally independent (which e.g. a simple n-gram language model isn’t).
I know what this section is saying but it sounds a bit confusing to the uninitiated and I’m not sure this detail really needs to be included.
Thanks so much for taking the time to review the article and for the feedback. I've addressed all the points below and the corresponding changes to the article can be found in PR #16 (Report 2).
The beam search is not guaranteed to be close to optimal. I've clarified / made the statement more accurate in the text.
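To make this point concrete, here is a small sketch (the probabilities are made-up toy values, not from the article) showing why decoding is subtle: the single most probable path can disagree with the most probable labeling once paths are collapsed, which is why beam search over labelings helps in practice yet still carries no optimality guarantee.

```python
import itertools
import numpy as np

# Toy distribution. Alphabet: 0 = blank, 1 = 'a'.
# `probs` is (timesteps, vocab); the numbers are hypothetical.
probs = np.array([[0.6, 0.4],   # t = 0
                  [0.6, 0.4]])  # t = 1

def collapse(path, blank=0):
    """Collapse a CTC path: merge repeats, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return tuple(s for s in merged if s != blank)

# Exact marginal over all 2^2 paths for each collapsed labeling.
scores = {}
for path in itertools.product(range(2), repeat=2):
    p = probs[0, path[0]] * probs[1, path[1]]
    lab = collapse(path)
    scores[lab] = scores.get(lab, 0.0) + p

# Greedy best-path decoding picks blank at both steps: the empty
# labeling with probability 0.36. But the labeling ('a',) sums
# three paths to probability 0.64 and is the true argmax.
best_path = tuple(int(np.argmax(probs[t])) for t in range(2))
```

Beam search that merges paths mapping to the same prefix approximates this marginalization, but with a finite beam it can still miss the best labeling.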
Yes, good point: if some of the alignments for "cat" and "caat" overlap, then we could end up optimizing alignments that are genuinely ambiguous. Prior to this review, I had also changed this paragraph so that it frames the problem as not being able to produce outputs with multiple characters in a row:
There is no longer a mention of the problem being specific to inference only and in general I hope the example points out how the naive collapsing results in a poorly defined model for common use cases.
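The collapsing rule with a blank token can be sketched in a few lines (the function name and the use of `-` as the blank symbol are illustrative choices, not the article's code). Repeats are merged first, then blanks are removed, which is exactly what lets the model distinguish "cat" from "caat":

```python
def collapse(path, blank="-"):
    """Collapse a CTC alignment: merge repeats, then drop blanks."""
    merged = [s for i, s in enumerate(path) if i == 0 or s != path[i - 1]]
    return "".join(s for s in merged if s != blank)

# Adjacent repeats merge into one character:
collapse(list("ccaat"))   # -> "cat"
# A blank between the two a's preserves the double letter:
collapse(list("ca-at"))   # -> "caat"
```

Without the blank, every alignment with two adjacent a's would collapse to a single "a", so "caat" would be unreachable and the naive model would be poorly defined for such outputs.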
This is a great point for problems like ASR or OCR, but I think it may not hold in general. For example, if we wanted to use CTC to transcribe a sequence of letters to another sequence of letters, there may not be a finer-grained encoding of the input. I added a note that this isn't usually an issue for ASR and OCR in the "Properties of CTC" section.
It's possible I'm misunderstanding the point being made here. My interpretation is that it's worth noting we can analytically compute the gradient of the CTC loss function with respect to p_t(z | x) and perform backpropagation as usual. I changed the paragraph on computing the gradient to make this clear.
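The loss being discussed is a closed-form dynamic program over the per-timestep probabilities, so its gradient with respect to each p_t(z | x) is available analytically (via the forward-backward recursions), or equivalently by differentiating through the recursion with autograd. A minimal sketch of the forward (alpha) recursion, with assumed conventions (blank index 0, `probs` of shape (T, vocab)):

```python
import numpy as np

def ctc_loss(probs, target, blank=0):
    """Negative log of the summed probability of all alignments.

    probs:  (T, vocab) per-timestep probabilities, index `blank` = blank
    target: list of label indices, e.g. [1] for the single character 'a'
    """
    # Interleave blanks around the target labels.
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), len(probs)
    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skip over a blank only when the labels differ.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]
    p = alpha[-1, -1] + (alpha[-1, -2] if S > 1 else 0.0)
    return -np.log(p)
```

Because `ctc_loss` is built from sums and products of the p_t values, each partial derivative with respect to p_t(z | x) has a formula of the same dynamic-programming shape, which is what makes exact gradient computation cheap.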
I ended up removing this paragraph as it was not critical and somewhat out of place.
I mentioned that the LM should be incorporated into the beam search. I did not go into much detail on exactly how, though, as the implementation can be fairly involved and I think it is beyond the scope of this tutorial.
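For a rough sense of how this is commonly done (this is a generic shallow-fusion sketch, not the article's implementation; `lm_logprob`, `alpha`, and `beta` are assumed names): whenever beam search extends a candidate prefix with a new character, a weighted language-model term and an insertion bonus are added to the CTC score.

```python
import math

def extend_score(ctc_logprob, prefix, new_char, lm_logprob,
                 alpha=0.5, beta=1.0):
    """Combined score for extending `prefix` with `new_char`.

    ctc_logprob: log P_ctc of the extended prefix under the CTC model
    lm_logprob:  callable giving log P_lm(new_char | prefix)
    alpha:       language-model weight (hypothetical value)
    beta:        insertion bonus countering the LM's length penalty
    """
    return ctc_logprob + alpha * lm_logprob(prefix, new_char) + beta

# Toy uniform LM over 3 characters, just for illustration.
uniform_lm = lambda prefix, c: math.log(1.0 / 3.0)
score = extend_score(math.log(0.5), "ca", "t", uniform_lm)
```

Beam candidates are then ranked by this combined score at each step, which is why the LM has to live inside the beam search rather than being applied only to the final output.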
I've added a (hopefully) clarifying example:
Yes, I've added a short sentence clarifying that an external language model can be used with CTC to model the dependencies between the outputs.
I've removed this section. I agree it's not critical and I don't think it's worth spending more time elaborating on input vs output synchronous decoding.