
Peer Review Report 2 -- Anonymous #14

Closed
colah opened this Issue Nov 13, 2017 · 1 comment


colah commented Nov 13, 2017

The following peer review was solicited as part of the Distill review process. Some points in this review were clarified by an editor after consulting the reviewer.

The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer for taking the time to review this article.

Conflicts of Interest: Reviewer disclosed no conflicts of interest.


[The reviewer did not include this in the written review, but verbally gave positive overall feedback and encouraged Distill to publish the article]

This let’s us use any gradient-based optimizer.

“This lets us use”

Ideally Y∗ can be found efficiently. With CTC we’ll settle for a close to optimal solution that’s not too expensive to find.

Does beam search necessarily give a close to optimal result? Often it does, but this seems like too strong a statement. Maybe “approximate solution that performs quite well in practice and isn’t too expensive to find”?

Distinguishing between these cases makes it easier to use the model for inference.

It’s a problem for training as well, isn’t it, just more subtly? If you’re summing over alignments that form “cat” during training, you don’t want to sum over ones that are ambiguous or that really mean “caat”.

This implies a third property: the length of Y cannot be greater than the length of X

Worth noting that the reason this isn’t a huge problem is that we can split X into pieces that are as small as we want, so that even when the relative speed of X and Y varies a lot, we can choose a chunk size for X small enough that this constraint is rarely violated.

As long as p_t(z | X) is differentiable, the entire loss function will be differentiable.

Worth saying that just as we have a formula for the sum, we can also have a formula for its derivative in terms of the derivatives of p_t(z|X), and we can use this analytic formula directly in the gradient computation.

Notice also that this part of the computation doesn’t depend on the size of the output alphabet.

maybe a little more explanation, e.g. “doesn’t depend on the size of the output because we never have to sum over all possible letters”.

In some problems, such as speech recognition, incorporating a language model over the outputs significantly improves accuracy.

Worth saying that the language model requires beam search, and maybe saying a bit more about how it’s incorporated into beam search.

The function L(⋅) computes the length of Y in terms of the language model tokens and serves as a word insertion bonus

A bit mysterious and not totally clear.

In fact speech recognizers using CTC are not able to learn a language model over the output nearly as well as models which do not make this assumption.

Also note that if you have an auxiliary language model to go with the CTC model, then the system as a whole isn’t making the conditional independence assumption, so long as the language model isn’t character-wise conditionally independent (which e.g. a simple n-gram language model isn’t).
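
To spell the assumption out (my notation, not the article's): the per-step outputs of an alignment factorize given the whole input, while an external language model generally does not, so the combined system no longer factorizes character-wise.

```latex
% CTC's conditional independence assumption, written out: given the
% whole input X, the per-step outputs of an alignment A = (a_1, ..., a_T)
% factorize as
p(A \mid X) \;=\; \prod_{t=1}^{T} p_t(a_t \mid X).
% An external language model p(Y) does not factorize character-wise
% (an n-gram model conditions on previous tokens), so a system that
% rescores CTC hypotheses with it is not making this assumption as a whole.
```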

Section on “Input Synchronous Inference”

I know what this section is saying but it sounds a bit confusing to the uninitiated and I’m not sure this detail really needs to be included.

awni referenced this issue Nov 15, 2017 in Report 2 #16 (merged)


awni commented Nov 15, 2017

Thanks so much for taking the time to review the article and for the feedback. I've addressed all the points below and the corresponding changes to the article can be found in PR #16 (Report 2).

Does beam search necessarily give a close to optimal result? Often it does, but this seems like too strong a statement. Maybe "approximate solution that performs quite well in practice and isn’t too expensive to find"?

The beam search is not guaranteed to be close to optimal. I've clarified / made the statement more accurate in the text.
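
For concreteness, here is a minimal sketch of the cheapest alternative, greedy best-path decoding, which the beam search improves on (the names and toy array are mine; `probs` is assumed to be a T × V array of per-step output probabilities with the blank at index 0):

```python
import numpy as np

def greedy_ctc_decode(probs, blank=0):
    """Best-path decoding: take the argmax symbol at each time step,
    then collapse repeats and drop blanks. This scores only a single
    alignment, so it is a cheap approximation; beam search does better
    by (approximately) summing over the alignments of each candidate output."""
    best_path = np.argmax(probs, axis=1)   # most likely symbol index per step
    output, prev = [], blank
    for s in best_path:
        if s != prev and s != blank:       # collapse repeats, skip blanks
            output.append(int(s))
        prev = s
    return output

# Toy example: 4 time steps, alphabet {blank, 'a', 'b'} as indices {0, 1, 2}.
probs = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
print(greedy_ctc_decode(probs))  # [1, 2] -> "ab"
```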

It’s a problem for training as well, isn't it, just more subtly? If you’re summing over alignments that form "cat" during training, you don’t want to sum over ones that are ambiguous or that really mean "caat".

Yes, good point: if some of the alignments overlap between "cat" and "caat", then we could end up optimizing alignments that are totally ambiguous. Prior to this review, I had also changed this paragraph so that it frames the problem as being unable to produce outputs with repeated characters in a row:

We have no way to produce outputs with multiple characters in a row. Consider the alignment [h, h, e, l, l, l, o]. The naive collapsing will produce "helo" instead of "hello".

There is no longer any mention of the problem being specific to inference, and I hope the example points out how the naive collapsing results in a poorly defined model for common use cases.
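
To make the failure mode concrete, a small sketch (the blank symbol `~` is just a stand-in of mine) contrasting the naive collapse with the blank-aware collapse CTC actually uses:

```python
EPS = "~"  # stand-in for the CTC blank token (symbol choice is mine)

def naive_collapse(alignment):
    """Merge every run of repeated symbols into a single symbol."""
    out = []
    for c in alignment:
        if not out or c != out[-1]:
            out.append(c)
    return "".join(out)

def ctc_collapse(alignment):
    """CTC's rule: merge repeats first, then remove blanks, so a blank
    placed between two identical characters keeps them distinct."""
    return naive_collapse(alignment).replace(EPS, "")

print(naive_collapse("hhellllo"))        # "helo"  -- the double l is lost
print(ctc_collapse("hel" + EPS + "lo"))  # "hello" -- the blank keeps both l's
```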

Worth noting that the reason this isn't a huge problem is that we can split X into pieces that are as small as we want, so that even when the relative speed of X and Y varies a lot, we can choose a chunk size for X small enough that this constraint is rarely violated.

This is a great point for problems like ASR or OCR, but I think it may not hold easily in general. For example, if we wanted to use CTC to transcribe a sequence of letters to another sequence of letters, there may not be a finer-grained encoding of the input. I added a note that this isn't usually an issue for ASR and OCR in the "Properties of CTC" section.
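
As a rough worked example of why this tends to be fine for ASR (the 10 ms frame rate and character counts below are typical values I am assuming, not something stated in the article):

```latex
% One input step per 10 ms of audio gives, for a one-second utterance,
T \;=\; \frac{1000\ \mathrm{ms}}{10\ \mathrm{ms}} \;=\; 100\ \text{input steps},
% while one second of speech rarely yields more than 15--20 characters,
% so the constraint |Y| <= |X| is almost never binding in practice.
```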

Worth saying that just as we have a formula for the sum, we can also have a formula for its derivative in terms of the derivatives of p_t(z|X), and we can use this analytic formula directly in the gradient computation.

It's possible I'm misunderstanding the point being made here. My interpretation is that it's worth noting that we can analytically compute the gradient of the CTC loss function with respect to p_t(z | X) and perform backpropagation as usual. I changed the paragraph on computing the gradient to make this clear.
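
Roughly, in my notation (with 𝒜_{X,Y} denoting the set of valid alignments for the pair (X, Y)), the loss is a differentiable function of the per-step probabilities:

```latex
% The CTC loss for a single (X, Y) pair:
\mathcal{L}(X, Y) \;=\; -\log \sum_{A \in \mathcal{A}_{X,Y}} \prod_{t=1}^{T} p_t(a_t \mid X).
% Its gradient with respect to each p_t(z | X) therefore has a closed form,
% computable with the same dynamic program used for the forward score,
% and backpropagation through the network proceeds as usual.
```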

maybe a little more explanation, e.g. "doesn’t depend on the size of the output because we never have to sum over all possible letters".

I ended up removing this paragraph as it was not critical and somewhat out of place.

Worth saying that the language model requires beam search, and maybe saying a bit more about how it's incorporated into beam search.

I mentioned that the LM should be incorporated into the beam search. I did not go into much detail on exactly how, though, as the implementation can be fairly involved and I think it is beyond the scope of this tutorial.

A bit mysterious and not totally clear.

I've added a (hopefully) clarifying example:

With a word-based language model, L(⋅) counts the number of words in Y. If we use a character-based language model, then L(⋅) counts the number of characters in Y.
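
For context in this thread, the combined objective has roughly this shape (my paraphrase with tunable weights α and β, not a verbatim quote from the article):

```latex
% CTC score, external language model, and length bonus combined:
Y^{*} \;=\; \operatorname*{argmax}_{Y}\; p(Y \mid X)\,\cdot\, p(Y)^{\alpha}\,\cdot\, L(Y)^{\beta}
```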

Also note that if you have an auxiliary language model to go with the CTC model, then the system as a whole isn't making the conditional independence assumption, so long as the language model isn't character-wise conditionally independent (which e.g. a simple n-gram language model isn't).

Yes, I've added a short sentence clarifying that an external language model can be used with CTC to model the dependencies between the outputs.

I know what this section is saying but it sounds a bit confusing to the uninitiated and I’m not sure this detail really needs to be included.

I've removed this section. I agree it's not critical and I don't think it's worth spending more time elaborating on input vs output synchronous decoding.
