Implementation of Inside-outside algorithm for tag translation #199
Conversation
If work on this PR or follow-up PRs eventually leads to a capability for bergamot-translator to translate tags, and in turn allows the extension to drop its need to maintain and finish the detagAndProject methods, then these test cases may be relevant: https://github.com/mozilla-extensions/firefox-translations/blob/7cb7faffd9847ff87f587092078234d71ac2b820/src/core/ts/content-scripts/dom-translation-content-script.js/dom-translators/detagAndProject.spec.ts#L36-L188
@motin many thanks for pointing out the relevant tests. That's exactly what I am looking for. I will try to integrate those test cases into this PR.
@abhi-agg let me explain the unclear points.
Exactly. The integration part with browser is always at char-level as the sentence splitter or tokeniser is inside Bergamot/Marian.
Yes. Like I said about the char-level thing, the browser doesn't have the sentence splitter, so the cross-sentence tag case is an internal matter for Bergamot. To be more specific, from the browser side, we would expect the entire text where all the tags are removed. Edit: just a reminder about the order of tag positions as we agreed before:
@qianqianzhu I am not sure how valid/helpful these test cases are, as we resorted to passing tag positions as byte offsets to the bergamot-translator and letting it tell us the positions of tags in the translated text. However, if you find them helpful then please feel free to incorporate them. I have tried to list some test cases as per my understanding in mozilla#61 (comment). Please let me know if the expectations of the API are different from what I have documented there. This will help us start the integration on the extension side and find the corner cases sooner 👍
Hi,
I'm leaving this review in a bit of haste given mention of a Friday deadline, excuse any misunderstandings.
- If I understood correctly, I recommend replacing your usage of ByteRange in TagTree to represent the index of tokens at begin and end with a strong typedef of an equivalent (`TokenIndexRange`?) as soon as possible, to avoid confusion. A ByteRange represents a range in a sequence of characters: `begin` and `end` are indices into the string, not into a sequence of token units.
- There are pass-by-values which can create high memory usage; would a `const &` work here?
- How about having a `vector<IndexRange> vertices` and a `vector<vector<size_t>> adjacencyList` (indexed by vertex), with TagTree owning them and recursive member functions traversing the tree using the adjacency list? The count of source vertices is known at construction, same for target vertices?
src/tests/apps.cpp
Outdated
TagProcessor tp = TagProcessor(alignments, ttOriginalTokenLevel, response.source.numWords(sentenceId),
                               response.target.numWords(sentenceId));
int exitStatus = tp.traverseAndQuery();
TagTree ttTranslatedTokenLevel = tp.getTargetRoot();
Can't 159 - 162 be the following?
TagTree targetTree = tagProcess(alignments,
                                ttOriginalTokenLevel,
                                response.source.numWords(sentenceId),
                                response.target.numWords(sentenceId));
Edit: This is a Maybe = Just / Nothing, which in modern C++ is a recommended use of `std::optional`.
If `exitStatus` is internal to the problem of `TagProcessor`, as mentioned in the other GitHub PR comment, please ensure it doesn't leak (by leak, I mean is referred to and used outside) into `ResponseBuilder` and here, at apps.cpp.
src/tests/apps.cpp
Outdated
std::vector<ByteRange> brvecTranslatedCharLevel;
for (ByteRange br : brvecTranslatedTokenLevel) {
  size_t charBegin = response.target.wordAsByteRange(sentenceId, br.begin).begin;
  size_t charEnd = response.target.wordAsByteRange(sentenceId, br.end).begin;
I'm not sure what is happening here.
response.target.wordAsByteRange(sentenceId,
                                br.end  // <- This is supposed to be an index,
                                        //    not a ByteRange constituent.
)
Are you using the `ByteRange` to represent something that is not a `ByteRange`?
So `TokenIndexRange` is introduced for less confusion?
Sort of, yes. The problem now is that `TokenIndexRange` is an alias for `ByteRange`. So if you pass a `vector<ByteRange>` instead of a `vector<TokenIndexRange>`, the compiler and type-system will let it slide. With a strong typedef, one is not interconvertible to the other.
A lot of guards come into place from the compiler and type-system if you bake the following into the design:
- `ByteRange` != `TokenIndexRange` (different struct definitions, a strong typedef).
- Nothing other than `TagProcessor` knows about this `TokenIndexRange` (defined private within `TagProcessor` or something).
src/tests/apps.cpp
Outdated
for (ByteRange br : brvecTranslatedCharLevel) {
  std::cout << br.begin << " " << br.end;
  for (size_t pos = br.begin; pos < br.end; pos++) {
    std::cout << response.target.text[pos];
Can you share the output of this somewhere, to help understand what's going on here? I'm trying to understand the need for a character-level something. Is this some mechanism to allow the client to break a Token in the middle and put ByteRanges there?
    currentFlat.insert(currentFlat.end(), childFlat.begin(), childFlat.end());
  }
  return currentFlat;
}
Can "child", "subTree", "i" converge into something consistent and easily comprehensible?
src/tests/apps.cpp
Outdated
brvecCharLevel.push_back(ByteRange{2, 12});

const Response response = translateForResponse(service, model, std::move(source), responseOptions);
ABORT_IF(response.source.numSentences() != 1, "Cross sentence byteranges are tricky at the moment.");
Are they still? They shouldn't be, for your algorithm to work properly. (In my case, I intended to give you examples for unit-tests, so they had to start from zero.)
Cross-sentence tag alignment is still under development.
Please keep in mind this PR cannot merge without working for a whole `Response`, which can contain multiple sentences and potential cross-sentence tags.
What's the issue with cross-sentence? We just need a for loop to cut it at sentence boundaries and call the tag alignment for each of them, then apply some offsets?
@kpu For cross-sentence tags, tag pairs are incomplete, so the tag nesting constraints cannot hold. One way to keep the nesting constraints is to manually complete/close the tag pairs at the end of the sentence and remove those additional tags once the tag alignment for the whole text is completed. Otherwise it will generate broken HTML code.
Yes, though phantom tag removal/insertion could be inlined into the loop over sentences rather than done once.
@jerinphilip Fair enough. I am not expecting this PR to be merged very soon. I was trying to make a preliminary version of the tag alignment and its API integration into Service (which will not change in the future), so the Mozilla side can start the JS bindings and test the performance. That is the main purpose of this Friday's deadline (from my perspective). I will keep improving and polishing the internal implementation.
Please be careful of creating copies by pass-by-value or copy-assignment where you don't need them. I have some comments on the placement of `tagPosition`.
I have also left a few queries on inside-outside and the construction of the data structure used.
-      qualityEstimator_(qualityEstimator) {}
+      qualityEstimator_(qualityEstimator) {
+    // Holds tag positions info for later alignment
+    source_.tagPositionSource_ = tagPositionSource;
(General guideline) Use the member initializer list, consistent with the rest. I think this vector is a heavy object; use `std::move(...)` to transfer ownership to `ResponseBuilder`. The existing state invokes copy-assignment, which can potentially be expensive. (There are other places; please try to avoid unnecessary copies ahead.)
(This is the action item here, I think) If you intend to place this with `AnnotatedText` as an additional annotation, do it some place earlier (`makeRequest`, or next to where the source Annotation is populated, or something).
@@ -122,6 +122,9 @@ struct AnnotatedText {
   std::string text;       ///< Blob of string elements in annotation refers to.
   Annotation annotation;  ///< sentence and (sub-) word annotations.

+  /// Tag positions at char-level in the source text
+  std::vector<ByteRange> tagPositionSource_;
`AnnotatedText` can be source or target, so simply call this `tagPosition`. `std::move(...)` the browser-provided one into the source's `AnnotatedText`.
Construct the target's `tagPosition` at `ResponseBuilder`.
std::vector<marian::data::SoftAlignment> alignments;

/// Tag positions at char-level in the translated text (in the same order as in the source text)
std::vector<ByteRange> tagPositionTarget;
See the comment on having `AnnotatedText source, target;` with `source.tagPosition` and `target.tagPosition`.
  }
}

std::vector<TokenIndexRange> flatten() {
Returning a vector from each node is inefficient. You can pass around an output vector here and accumulate over the recursive traversal, I think.
/// Building TagTree from a ByteRange vector (passing from the browser).
class TagTreeBuilder {
 public:
  TagTreeBuilder(std::vector<ByteRange> brv) {
Pass-by-value is a copy, which might be expensive. I think you may be able to `std::move(...)` this in here.
Edit: I think `const &`, plus using the traversal to build an adjacency list, would be more suitable.
}

// bottom-up construction
void addSubtree(const TagTree &st) { subtree_.emplace_back(st); }
Looks like an expensive copy.
    }
  }
  return tt;
}
I'll need to look more into inside-outside, but this construct (`growTagTree`) is confusing. If you process the traversal once and construct the vertices and adjacency list at construction of `TagTree` (which I'd expect you'll need anyway), shouldn't you be done with all the compute and storage required for this?
Do you take leaf nodes and iterate up the parents for inside-outside? Otherwise this is equivalent to an adjacency list, correct?
std::vector<bool> coverageMatrix_;
std::vector<size_t> parentVector_;
bool treeValid_;
std::vector<TokenIndexRange> tokenBoundVector_;
Please have some information (comments) on what these do. It's not obvious to me; perhaps some ideas from inside-outside are missing.
In today's Bergamot meeting, we discussed with Mozilla that it's critical to have a working API version even if it means returning stupid results. Even just copying the tag offsets within a sentence but clipping their ends to the sentence length would work. This is a blocker for getting the next extension out. Can you get something in quickly?
The current version is a working API version. Only multiple sentences are not working yet (they will simply abort). The API will not change. If no one minds the broken HTML tags, I can simply write a for loop and treat every unclosed tag pair as an empty tag (this is doable by tomorrow).
I need something Mozilla can integrate on the full-scale problem. If it provides mediocre output, that's fine. But it needs to not crash. Then you can fix the output quality in a later change. An example would be simply linearly scaling the tag offsets by the length ratio of the input and output text (even ignoring sentences).
I'm trying to sanity-check this PR with some black-box testing, given the shortage of time to completely inspect/review the code.
jerinphilip#25 attempts to create a DOM equivalent by recursively walking through `[l, r)` starting from `[0, size)`, choosing a random split point, set to terminate after placing some `n` HTML nodes.
I'm trying to follow: "The byte ranges should appear in the same order as the open tags in HTML." Could this be static vs implicitly dynamic declarations?
Source tag - traversal { [21, 56), [21, 55), [46, 55), [55, 56)}
Segmentation fault (core dumped)
On a 56-character input, the above appears to me to be valid ByteRanges as input.
GDB shows this to be within inside-outside:
Source tag - traversal { [21, 56), [21, 55), [46, 55), [55, 56)}
Thread 1 "bergamot-test" received signal SIGSEGV, Segmentation fault.
0x0000555555658033 in marian::bergamot::TagProcessor::maxProduct (this=0x7fffffff90f0, query=..., outer=..., inner=...) at /home/jphilip/code/bergamot-translator/src/translator/tag_processor.h:222
222 logProductDynamic += std::log(inside_[flattenOffset(query.begin, query.end - 1, s)]);
I expect something like this running on WNGT20 1M sentences, with a few random-walk HTML generations, where each line is configured to be a sentence, to run without crashing before this PR goes in. In the real world, this would mean the inside-outside tag-tree functionality does not crash for the wide range of inputs that can be thrown at it.
- response.alignments.push_back(std::move(unified_alignment));
+ response.alignments.push_back(std::move(softAlignment));
+ buildTagAlignment(response);
If there is no source tag-tree, this needs to be coded such that the process is a no-op, i.e., tag-alignment is never called. This way, other test-apps or functionality paths which do not require TagTree will work for certain.
@jerinphilip thanks for your sanity check. I am trying to debug here while trying to make cross-sentence tags work. Please bear with me. Or if you spot anything abnormal, feel free to tell me.
Marked as draft, since there are so many complaints about the current implementation (sorry, my bad). Another clean PR will be made with a
Abandoned in favor of @jelmervdl's implementation.
This PR implements an inside-outside algorithm for tag translation.
The basic idea of this feature is to preserve HTML tag positions in the translated sentence using the alignment and token information.
The current implementation follows the suggestions by @kpu:
- `TagProcessor` is built as a layer on top of translation, using the alignment and token information.
- We take tag information as `ByteRange`s in the source sentence, and we return tag information as `ByteRange`s in the target sentence in the same order.
There are still some things to be done:
(1) optimise the no-solution case (some randomisation in the greedy algorithm);
(2) some corner cases are not optimal;
(3) empty tags need to be optimised.