Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

QE for HTML input doesn't work #355

Closed
abhi-agg opened this issue Feb 17, 2022 · 28 comments · Fixed by #358
Closed

QE for HTML input doesn't work #355

abhi-agg opened this issue Feb 17, 2022 · 28 comments · Fixed by #358
Assignees
Labels
enhancement New feature or request

Comments

@abhi-agg
Copy link
Contributor

abhi-agg commented Feb 17, 2022

A quick experiment with wasm test page for html text translation shows weird byte ranges for words.

An example:

Sentence: "How are  <b>you</b>  doing?"
Translated sentence: "Wie geht <b>es dir</b>?"

"words":["Wie", "geht", "<b", "es d"]
"wordByteRanges":[{"begin":0,"end":3}, {"begin":4,"end":8}, {"begin":9,"end":11}, {"begin":12,"end":16}]

Looks like, QE ignores tags completely (treating them as if they are non-existent) in the translated sentences and compute the byte ranges of the words in the translated sentences. Is it something that needs to be fixed or am I doing something wrong at my end?

Attaching the image for detailed results.

Screenshot 2022-02-17 at 14 32 35

cc @kpu @jerinphilip @abarbosa94 @felipesantosk @mfomicheva

@abhi-agg
Copy link
Contributor Author

abhi-agg commented Feb 17, 2022

Looks like, QE ignores tags completely (treating them as if they are non-existent) in the translated sentences and compute the byte ranges of the words in the translated sentences.

Because I translated How are you doing? after removing all the tags from html sentence and I get the same byte ranges for both html and non-html versions.

Attached image for reference.
Screenshot 2022-02-17 at 14 49 19

@jerinphilip jerinphilip added the bug Something isn't working label Feb 17, 2022
@jelmervdl
Copy link
Member

Ah, I totally missed that. I was under the impression that QualityEstimation scores uses those same offsets that are stored in the AnnotatedText class that HTML updates. It does… but it aggregates them into words and then stores its own byte offsets all before the HTML is inserted into the translation.

I'll see whether I can make the QualityEstimator class work on token offsets instead. Then it automatically stays up to date with the byte offset updates that HTML makes. (And it would also benefit from changes like #298.)

@jelmervdl jelmervdl self-assigned this Feb 17, 2022
@jerinphilip
Copy link
Contributor

@jelmervdl

I'll see whether I can make the QualityEstimator class work on token offsets instead.

QE uses words (not continuous subwords), I don't expect the above approach to fare well.

I have always wanted QE to use Annotation after relaxing the continuity constraints. Then we should be able to uniformly update QE "annotation" (based on words, no continuity) and plaintext vocab "annotation" (based on subwords, sentence piece guaranteed continuity).

This will break the negotiated API between QE and Mozilla, but I strongly believe is the proper way to go. Especially since #298 also points towards this. In effect, this would mean pushing down some of the "insert" tag (opening or closing) before token t as a method on to Annotation, which automatically fixes the rest of the Annotation. While I do not know the full detail, I understand that in the current state of HTML you are editing Annotation outside the class to accomplish what is currently happening in HTML - some of this would be moving in and contained within Annotation. This means you can fix up the QE "Annotation" the same way you fix up the plain-text annotations.

I do not expect the above to be trivial.

@jerinphilip
Copy link
Contributor

QE is slated to be implemented in W5 (March 7-11). HTML is expected to be implemented in W3 (February 21-25), which is next week. I guess it's reasonable to keep this soft at W4 here and prioritize based on how HTML completes or not by W3.

@jelmervdl
Copy link
Member

QE uses words (not continuous subwords), I don't expect the above approach to fare well.

My idea was to remove the subwordToWord call here:

return {wordScores, subwordToWords(wordIndices, target, sentenceIdx), sentenceScore};

That call just turns the token ranges that QualityEstimator already uses internally into byte ranges:

std::vector<ByteRange> subwordToWords(const std::vector<SubwordRange>& wordIndices, const AnnotatedText& target,
const size_t sentenceIdx) {
std::vector<ByteRange> words;
for (const SubwordRange& wordIndice : wordIndices) {
size_t wordBegin = target.wordAsByteRange(sentenceIdx, wordIndice.begin).begin;
size_t wordEnd = target.wordAsByteRange(sentenceIdx, wordIndice.end).begin;
if (isspace(target.text.at(wordBegin))) {
++wordBegin;
}
words.emplace_back(ByteRange{wordBegin, wordEnd});
}
return words;

That step is something a client can easily do itself, it has access to the AnnotatedText.

Hell, I could even integrate it in a way similar to 83e869c at that point, and have quality estimation scores be part of the output HTML. Then the client wouldn't need to do any work at all. But that all depends on the scenario in which we need both HTML and quality scores at the same time.

@abhi-agg
Copy link
Contributor Author

QE is slated to be implemented in W5 (March 7-11). HTML is expected to be implemented in W3 (February 21-25), which is next week. I guess it's reasonable to keep this soft at W4 here and prioritize based on how HTML completes or not by W3.

The whole idea of filing this issue was based on @kpu's suggestion (btw @kpu thanks a lot for this suggestion 👍🏾) in last plenary to start integrating and find bugs quickly so that technical support from involving partners could be solicited asap.

QE feature is definitely going to be a part of the final delivery and so is the HTML. In my opinion, this fix doesn't have to wait till W4. We need to test QE again after this bug is fixed and that might resurface some new issues which further might take some time to fix.

All of this goes in general for anything that is becoming a part of the final delivery.
cc @andrenatal @lonnen

@kpu
Copy link
Member

kpu commented Feb 18, 2022

@jelmervdl is working on it.

@abhi-agg
Copy link
Contributor Author

abhi-agg commented Feb 18, 2022

My idea was to remove the subwordToWord call here:

return {wordScores, subwordToWords(wordIndices, target, sentenceIdx), sentenceScore};

That call just turns the token ranges that QualityEstimator already uses internally into byte ranges:

That step is something a client can easily do itself, it has access to the AnnotatedText.

@jelmervdl Thanks for working on it. I would add one thing. If I remember correctly, the translation models (and perhaps supervised QE models as well) provide scores of the "subwords" and then some post-processing is done to convert the "subword" scores to "word" scores in the engine. We discussed in the beginning with the QE team and @jerinphilip that it is better for the client to receive the quality scores at "word" level so that client doesn't need to do this post-processing. It is more efficient to do it in the engine as it has greater control and access to more information, plus avoiding replicating the same logic for every client. The client was kind of made oblivious to the concept of "subword" when the html translation came inside the engine.

Please correct me if I am wrong somewhere. Open for discussion here if keeping it the same way for html text poses a technical challenge 👍🏾

@kpu
Copy link
Member

kpu commented Feb 18, 2022

@abhi-agg @andrenatal Please take a look at #357 and #358 to express an opinion about API.
#357: Byte spans expanded to cover HTML removal/insertion. (I note @abhi-agg is a reviewer!)
#358: explicit HTML tags inserted with QE scores as an attribute.

Of course QE words and HTML tags are unsynchronized i.e. a QE word can span over an HTML open or closing tag. So #358 repeats the QE score tag as necessary to cover the QE word.

@jelmervdl
Copy link
Member

We discussed in the beginning with the QE team and @jerinphilip that it is better for the client to receive the quality scores at "word" level so that client doesn't need to do this post-processing.

I agree with the reasoning here. In both #357 I effectively moved the subwordToWord call to the WASM bindings part. In Javascript you'll still receive scores per "word" (where the byterange for that word can include some HTML tags as well, but never more text)

In #358 you get this behaviour for free since I assign the same <font x-bergamot-word-score=""> tag to all subwords in a word, and the HTML insertion part is smart enough to then group all subwords into the same font tag.

@jelmervdl jelmervdl linked a pull request Feb 18, 2022 that will close this issue
@jerinphilip
Copy link
Contributor

Open for discussion here if keeping it the same way for html text poses a technical challenge

#358 looks promising. To carry forward, we need to reach synchronization on the following:

  1. Is QE intended by the extension devs to be operational for Outbound translation? This would mean we will have to look into Store subword indices for quality scores in Response #357 with plaintext, no HTML.
  2. Is there a mockup for the UI / Interactions with QE from the extension dev side? If Embed quality-scores as HTML tag attributes #358 is working with HTML render, it will need to be coordinated with how the JS or CSS at extension interacts for this purpose. There's no ask for anything concrete here, but a loose sketch of what is planned to fit the active pull request to the needs. @jelmervdl appears to have shown a demonstration of how 358 is intended to be used in the description.

@kpu
Copy link
Member

kpu commented Feb 18, 2022

  1. QE has only delivered three models and they're all out of English, so there is no bidirectional QE. Which is sad because QE might have been useful in that context. However, text/plain translation remains a requirement.

  2. There is a mockup. See https://github.com/browsermt/coordination/blob/master/docs/D1.3-Bergamot_User_interface_with_quality_estimation.pdf in the private repo.

@jerinphilip
Copy link
Contributor

jerinphilip commented Feb 18, 2022

There is a mockup. See https://github.com/browsermt/coordination/blob/master/docs/D1.3-Bergamot_User_interface_with_quality_estimation.pdf in the private repo.

I found the following underline based proposal, and a few variants in Section 5 relating to QE.

image

From meetings, I think it's important to discuss the following. I think @jelmervdl's render might have to do the thresholding (in C++) that was earlier meant for @abhi-agg to do to get the probability values to classes (Section 5: Major/Critical, Major/Minor/Critical, Poor/Ok). Right now we're exporting real values as data. I'm not sure if the decision rule can be applied in CSS (I've lost track of how capable it has become), @jelmervdl's currently using continuous colouring. There is the alternate option of picking the data from attributes and doing the attaching the class to the element in JS.

@jelmervdl
Copy link
Member

All mock-ups seem to assume plain text (e.g. a form input), not HTML. In that case, what is currently in main will be sufficient. #357 won't hurt, but the offsets weren't broken to begin with in this scenario.

To clarify a bit further: #357 just fixes the byte offsets in case the input is marked as HTML. If the input was text, there was no issue. So for form translation (which I assume are always plain text unless you're going to attempt auto translation in contenteditable elements…) it is not necessary.

#358 adds the quality scores as HTML to HTML output, but this only works if the input was HTML to begin with. Of course, it could also be used with plain text if you'd encode all entities in the text before sending it to the translator (and mark it as HTML). The output can easily be rendered like the mock-ups. Thresholding can easily be done in the extension using Javascript or CSS. The added tags that don't meet the threshold won't affect rendering. You might need to be a bit creative with horizontal padding to fill in the gaps between consecutive words that do meet the threshold though. The way I insert the tags, it will leave the spaces that are not part of the word outside any of the score tags. But that's not insurmountable.

@kpu
Copy link
Member

kpu commented Feb 18, 2022

I think we should get #357 in to close this issue as that addresses the original post and meets our original agreement. @abhi-agg can mull the HTML offer.

@kpu
Copy link
Member

kpu commented Feb 21, 2022

Written to @abhi-agg and @andrenatal seeking comment on #357 and #358 and am coming to the notion #358 is a better option. Blocked awaiting their feedback.

@abhi-agg
Copy link
Contributor Author

abhi-agg commented Feb 21, 2022

I have few queries here:

  1. Both Store subword indices for quality scores in Response #357 and Embed quality-scores as HTML tag attributes #358 solve the same issue. i.e. returning QE scores for HTML text translation and not the plain text translation.
  2. For plain text translation, QE scores will continue to be returned as before (i.e. byte ranges representing the "words" and their corresponding scores). This means JS side will format the text for QE by processing these byte ranges and scores.

Am I right?

@kpu
Copy link
Member

kpu commented Feb 21, 2022

@abhi-agg Plain text remains as it was already working. However, what I would recommend as the easiest thing for you is to follow the suggestion in: #355 (comment) specifically:

HTML: pick #358 and you get the tags back.
text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)

Then you don't have to handle byte ranges in either case.

@abhi-agg
Copy link
Contributor Author

abhi-agg commented Feb 21, 2022

HTML: pick #358 and you get the tags back.
text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)

Using this approach means always setting ResponseOptions::HTML flag to true. Right? And in that case wouldn't the engine crash for sentences like Hello &lt; world (e.g this issue).

@kpu
Copy link
Member

kpu commented Feb 21, 2022

HTML: pick #358 and you get the tags back.
text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)

Using this approach means always setting HTML flag to true. Right? And in that case wouldn't the engine crash for sentences like Hello &lt; world?

HTML: pick #358 and you get the tags back.
text/plain: you encode it then submit as HTML. Render the HTML (Note it will be a snippet of HTML, not a full page.)

Using this approach means always setting ResponseOptions::HTML flag to true. Right? And in that case wouldn't the engine crash for sentences like Hello &lt; world (e.g this issue).

Hello &lt; world is a valid HTML snippet. Hello < world is not. That's what encode refers to.

I suppose one could always have HTML mode on then implement text/plain translation by encode -> pass as HTML -> decode. That would be less efficient though.

For QE, to render the colors, you are presumably doing HTML rendering in all cases anyway, hence the suggestion of having us generate the HTML output even if the input is (escaped) text/plain.

@jelmervdl
Copy link
Member

Little Javascript example of how to encode plain text to HTML for when you want to submit plain text to the engine but want HTML as output:

function encodePlainTextAsHTML(text) {
  const div = document.createElement('div');
  div.appendChild(document.createTextNode(text));
  return div.innerHTML;
}

And then also set responseOptions.html = true.

@jerinphilip
Copy link
Contributor

All mock-ups seem to assume plain text (e.g. a form input), not HTML.

If plaintext (textarea, document) is being translated, I don't think there's a way to annotate it with reds or colors (https://stackoverflow.com/a/12831155/4565794) without converting to HTML first.

Suggest taking a look at Grammarly for an example. It's essentially the same problem, highlight poor confidence tokens and leave the others be. The textarea or input field is kept hidden, a contentEditable div takes over. The rich text annotations go on the div which support HTML. The textarea or form field can be populated with just the text content.

@jelmervdl
Copy link
Member

@jerinphilip I assume that people type in the translation input field, and the quality is shown in a separate translation output field (which isn't editable when the UI is in this state…). Only once the translation is accepted you'd need to convert the translation output field to plain text to be inserted in the actual form field.

@abhi-agg
Copy link
Contributor Author

Hello &lt; world is a valid HTML snippet. Hello < world is not. That's what encode refers to.

Oh. Thanks for correcting 👍🏾

Little Javascript example of how to encode plain text to HTML for when you want to submit plain text to the engine but want HTML as output:

function encodePlainTextAsHTML(text) {
  const div = document.createElement('div');
  div.appendChild(document.createTextNode(text));
  return div.innerHTML;
}

And then also set responseOptions.html = true.

@jelmervdl With responseOptions.html = true, the sentence Hello < world if passed as it is will break the engine but the encoded version of it will not. Right?

@jelmervdl
Copy link
Member

@abhi-agg Yep! With responseOptions.html = true:
Okay: Hello &lt; world
Not okay: Hello < world

@abhi-agg
Copy link
Contributor Author

@jelmervdl Thanks for clarifying. Please correct me if I am wrong but with the approach #358 (comment), extension will still have to parse the html translation result to show color-coded words based on thresholds.

@kpu
Copy link
Member

kpu commented Feb 21, 2022

@abhi-agg How exactly are you showing the color-coded words? Wouldn't the natural way to do that be rendering the HTML with some CSS?

@jelmervdl
Copy link
Member

I've added an example of how the output of #358 can be used to the demo page in that branch. That example treats the input field as plain text, escapes it, and renders the translation output as HTML. Then with a little bit of Javascript for thresholding and CSS I render sentence and word level quality indicators.

image

(Based on thresholds that have no meaning, I just wanted some thresholds that are met often enough for screenshot purposes.)

@jerinphilip jerinphilip added enhancement New feature or request and removed bug Something isn't working labels Feb 24, 2022
jerinphilip pushed a commit that referenced this issue Feb 25, 2022
Quality scores for HTML translation exposed as <font
x-bergamot-sentence-score=""> and <font x-bergamot-word-score=""> tags
in the HTML output. While this increases the size of the HTML returned,
the resulting rendered HTML can easily be styled to show the scores.
With Javascript or CSS, developers can easily have some interface based
on these extra attributes.

Also includes updates to the test page to show a proof-of-concept 
demonstration.

Fixes: #355
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
4 participants