
Peer Review Report 2 -- Anonymous #14

@colah


The following peer review was solicited as part of the Distill review process. Some points in this review were clarified by an editor after consulting the reviewer.

The reviewer chose to remain anonymous. Distill offers reviewers a choice between anonymous review and offering reviews under their name. Non-anonymous review allows reviewers to get credit for the service they offer to the community.

Distill is grateful to the reviewer for taking the time to review this article.

Conflicts of Interest: Reviewer disclosed no conflicts of interest.


[The reviewer did not include this in the written review, but verbally gave positive overall feedback and encouraged Distill to publish the article]

This let’s us use any gradient-based optimizer.

“This lets us use”

Ideally Y∗ can be found efficiently. With CTC we’ll settle for a close to optimal solution that’s not too expensive to find.

Does beam search necessarily give a close to optimal result? Often it does, but this seems like too strong a statement. Maybe “approximate solution that performs quite well in practice and isn’t too expensive to find”?
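[Editor's note: the gap between the single best path and the most probable collapsed output can be seen in a toy example. The per-frame probabilities below are illustrative, not from a real model.]

```python
from itertools import product

# Two time steps, alphabet {'a', '-'} where '-' is the CTC blank.
probs = [{'a': 0.4, '-': 0.6},
         {'a': 0.4, '-': 0.6}]

def collapse(path):
    # CTC collapsing rule: merge repeats, then drop blanks.
    out = []
    for s in path:
        if not out or s != out[-1]:
            out.append(s)
    return ''.join(c for c in out if c != '-')

# Sum path probabilities per collapsed output.
totals = {}
for path in product('a-', repeat=2):
    p = probs[0][path[0]] * probs[1][path[1]]
    totals[collapse(path)] = totals.get(collapse(path), 0.0) + p

best_path = max(product('a-', repeat=2),
                key=lambda pth: probs[0][pth[0]] * probs[1][pth[1]])
# The single best path is '--' (probability 0.36), which collapses to
# the empty string, yet the output 'a' has total probability 0.64.
# Greedy best-path decoding misses this; beam search over prefixes
# recovers it here but is still only approximate in general.
```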

Distinguishing between these cases makes it easier to use the model for inference.

It’s a problem for training as well, isn’t it, just more subtly? If you’re summing over alignments that form “cat” during training, you don’t want to sum over ones that are ambiguous or that really mean “caat”.

This implies a third property: the length of Y cannot be greater than the length of X

Worth noting that the reason this isn’t a huge problem is that we can split X into pieces that are as small as we want, so that even when the relative speed of X and Y varies a lot, we can choose a chunk size for X small enough that this constraint is rarely violated.

As long as p_t(z \mid X) is differentiable, the entire loss function will be differentiable.

Worth saying that just as we have a formula for the sum, we can also have a formula for its derivative in terms of the derivatives of p_t(z|X), and we can use this analytic formula directly in the gradient computation.
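[Editor's note: the sum the reviewer refers to is computed by the standard CTC forward recursion. A minimal sketch, written for this report; the function name and layout are illustrative, not the article's code.]

```python
import math

def ctc_loss(probs, target, blank=0):
    """Negative log of the summed alignment probability.

    probs: list of T rows, each a list of per-frame probabilities
           p_t(z | X) over the alphabet (index `blank` is the blank).
    target: list of label indices for Y.
    """
    # Extend the target with blanks between and around labels.
    ext = [blank]
    for c in target:
        ext += [c, blank]
    S, T = len(ext), len(probs)

    # alpha[t][s]: total probability of all alignments of the first
    # s+1 extended labels to the first t+1 frames.
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]
            if s > 0:
                a += alpha[t - 1][s - 1]
            # Skip a blank only between distinct labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]

    total = alpha[-1][-1] + (alpha[-1][-2] if S > 1 else 0.0)
    return -math.log(total)
```

Because the recursion is built from sums and products of p_t(z | X), its derivative has an equally explicit formula (or can be obtained by differentiating through the recursion), which is the reviewer's point.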

Notice also that this part of the computation doesn’t depend on the size of the output alphabet.

Maybe a little more explanation, e.g. “doesn’t depend on the size of the output because we never have to sum over all possible letters”.

In some problems, such as speech recognition, incorporating a language model over the outputs significantly improves accuracy.

Worth saying that the language model requires beam search, and maybe saying a bit more about how it’s incorporated into beam search.
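[Editor's note: a common way the language model enters beam search is as an extra term in the hypothesis score. A hedged sketch; the weights alpha and beta, the function names, and the toy numbers are illustrative, not from the article.]

```python
import math

def rescore(log_p_ctc, log_p_lm, num_words, alpha=0.5, beta=1.0):
    # Combined objective used to rank beam hypotheses:
    #   log p_ctc(Y|X) + alpha * log p_lm(Y) + beta * L(Y)
    # The beta * L(Y) word-insertion bonus offsets the language
    # model's bias toward shorter outputs.
    return log_p_ctc + alpha * log_p_lm + beta * num_words

# Rank candidate transcripts kept on the beam (toy numbers):
# the LM strongly prefers "the cat" even though the acoustic
# (CTC) score slightly prefers "the cab".
hyps = [("the cat", math.log(0.30), math.log(0.020), 2),
        ("the cab", math.log(0.32), math.log(0.001), 2)]
best = max(hyps, key=lambda h: rescore(*h[1:]))
```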

The function L(⋅) computes the length of Y in terms of the language model tokens and serves as a word insertion bonus

A bit mysterious and not totally clear.

In fact speech recognizers using CTC are not able to learn a language model over the output nearly as well as models which do not make this assumption.

Also note that if you have an auxiliary language model to go with the CTC model, then the system as a whole isn’t making the conditional independence assumption, so long as the language model isn’t character-wise conditionally independent (which e.g. a simple n-gram language model isn’t).

Section on “Input Synchronous Inference”

I know what this section is saying but it sounds a bit confusing to the uninitiated and I’m not sure this detail really needs to be included.
