Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store subword indices for quality scores in Response #357

Closed
wants to merge 4 commits into from

Conversation

jelmervdl
Copy link
Member

Fixes #355

This change stores the quality score per subword range instead of byte range in Response. When inserting HTML into the translation output, the byte offsets change but the subword offsets don't.

I added a type for SubwordRange (previously just an alias for ByteRange) to avoid any confusion between the two. Now you can't compare or interchange them, even though they're both just ranges.

The WASM bindings still produce a SentenceQualityScore that works exactly the same as now, i.e. with ByteRanges. In #355 I mused that Javascript could access the byteranges per word through response.target.wordAsByteRange() but neither response.target nor the type AnnotatedText are currently exposed through the bindings. So to not introduce much more new API at this point, I made getQualityScores() return the same thing it always did.

Downside of this approach is that you'll need to delete() that returned object in Javascript land. I've changed worker.js to do this.

The bindings still expose the byteranges as they previously did, because otherwise I would also have to expose AnnotatedText etc. which is apparently not something we're doing right now. Because the bindings create a new score vector on the fly, you'll have to delete it when you no longer use it, unfortunately.
Copy link
Contributor

@jerinphilip jerinphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving a few queries below. Mozilla and QE have negotiated some properties (meaningful units of QE = word, etc.). There was some threshold based operation at extension discussion as well. This was before HTML, the assumptions were plaintext QE - so we may revisit.

Requesting @mfomicheva, @abarbosa94 and @felipesantosk to take a look, since we're modifying the parts.

for (const auto &wordByteRange : sentenceQualityEstimate.wordByteRanges) {
for (const auto &wordRange : sentenceQualityEstimate.wordRanges) {
const ByteRange wordByteRange{response.target.wordAsByteRange(sentenceIdx, wordRange.begin).begin,
response.target.wordAsByteRange(sentenceIdx, wordRange.end).begin};
Copy link
Contributor

@jerinphilip jerinphilip Feb 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a possibility wordByteRange ends up including an HTML opening/closing tag if two subwords end up with an HTML something in between?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. They are completely unsynchronized and they have to be for better or worse. I imagine HTML rendering will use something like #358 by copying the QE score tags.

}

return words;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's happening here? There was a "words as meaningful QE units" notion before, are we losing this with the HTML integrated QE?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The QualityEstimator class still works with this assumption, subwords are grouped into words in mapWords:

std::vector<SubwordRange> mapWords(const std::vector<float>& logProbs, const AnnotatedText& target,
const size_t sentenceIdx) {
// Ignore empty target
if ((logProbs.size() < 2) || (target.numWords(sentenceIdx) == 0)) {
return {};
}
// It is expected that translated words will have at least one word
std::vector<SubwordRange> wordIndices(/*numWords=*/1);
/// The LogisticRegressorQualityEstimator model ignores the presence of the EOS token, and hence we only need to
/// iterate n-1 positions.
for (size_t subwordIdx = 0; subwordIdx < (logProbs.size() - 1); ++subwordIdx) {
ByteRange subword = target.wordAsByteRange(sentenceIdx, subwordIdx);
const char firstLetter = target.text.at(subword.begin);
// if the first character is whitespace, it's a beginning of a new word
if (isspace(firstLetter)) {
wordIndices.back().end = subwordIdx;
wordIndices.emplace_back();
wordIndices.back().begin = subwordIdx;
}
}
wordIndices.back().end = logProbs.size() - 1;
return wordIndices;
}

The bit of code I removed here turned those groupings into a single byterange. That has been moved to the WASM bindings bit since that's the last possible moment to do it.

I think most clients will find it useful to do this themselves by accessing the AnnotatedText's data. But since we don't expose those to Javascript, and I didn't want to start a half-assed attempt at reworking the Javascript API, I moved the functionality above to here:

std::vector<SentenceQualityScore> getQualityScores(const Response& response) {
std::vector<SentenceQualityScore> scores;
scores.reserve(response.qualityScores.size());
for (size_t sentenceIdx = 0; sentenceIdx < response.qualityScores.size(); ++sentenceIdx) {
std::vector<ByteRange> wordByteRanges;
wordByteRanges.reserve(response.qualityScores[sentenceIdx].wordRanges.size());
for (auto&& word : response.qualityScores[sentenceIdx].wordRanges) {
wordByteRanges.emplace_back();
wordByteRanges.back().begin = response.target.wordAsByteRange(sentenceIdx, word.begin).begin;
wordByteRanges.back().end = response.target.wordAsByteRange(sentenceIdx, word.end).begin;
}
scores.emplace_back(SentenceQualityScore{response.qualityScores[sentenceIdx].wordScores, std::move(wordByteRanges),
response.qualityScores[sentenceIdx].sentenceScore});
}
return scores;
}

@jelmervdl jelmervdl self-assigned this Feb 18, 2022
Note that this does not work for HTML as the closing tags of the previous word will be placed before the spaces, so the spaces won't be skipped unfortunately.
@jelmervdl
Copy link
Member Author

Closing this in favour of #358 (which also includes part of this pull request)

@jelmervdl jelmervdl closed this Feb 25, 2022
@jelmervdl jelmervdl deleted the qe-for-html branch March 2, 2022 12:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

QE for HTML input doesn't work
3 participants