Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of Inside-outside algorithm for tag translation #199

Closed
wants to merge 38 commits into from

Conversation

qianqianzhu
Copy link
Contributor

@qianqianzhu qianqianzhu commented Jun 24, 2021

This PR implements an inside-outside algorithm for tag translation.
The basic idea of this feature is to reserve HTML tag positions in the translated sentence using the alignment and token information.

The current implementation following the suggestions by @kpu:

  • TagProcessor is built as a layer on top translate using the alignment and token information.
  • The browser provides tag information as ByteRange in the source sentence and we return tag information as ByteRange in the target sentence in the same order.

There are still some things to be done:
(1) optimise the no-solution case. (some randomisations in the greedy algorithm)
(2) some corner cases are not optimal.
(3) empty tags need to be optimised.

@motin
Copy link
Contributor

motin commented Jun 24, 2021

If work on this PR or follow-up PRs eventually leads to a capability for bergamot-translator to translate tags and in turn allow for the extension not having to maintain and finish the detagAndProject methods, then maybe these test cases are relevant: https://github.com/mozilla-extensions/firefox-translations/blob/7cb7faffd9847ff87f587092078234d71ac2b820/src/core/ts/content-scripts/dom-translation-content-script.js/dom-translators/detagAndProject.spec.ts#L36-L188
Note that these test cases passed in v0.2.0 of bergamot-translator, but are failing since v0.3.0 and @abhi-agg is maintaining the detagAndProject code nowadays.

@qianqianzhu
Copy link
Contributor Author

If work on this PR or follow-up PRs eventually leads to a capability for bergamot-translator to translate tags and in turn allow for the extension not having to maintain and finish the detagAndProject methods, then maybe these test cases are relevant: https://github.com/mozilla-extensions/firefox-translations/blob/7cb7faffd9847ff87f587092078234d71ac2b820/src/core/ts/content-scripts/dom-translation-content-script.js/dom-translators/detagAndProject.spec.ts#L36-L188
Note that these test cases passed in v0.2.0 of bergamot-translator, but are failing since v0.3.0 and @abhi-agg is maintaining the detagAndProject code nowadays.

@motin many thanks for pointing out the relevant tests. That's exactly what I am looking for. I will try to integrate those test cases into this PR.

@qianqianzhu
Copy link
Contributor Author

qianqianzhu commented Oct 21, 2021

@abhi-agg let me explain the unclear points.

As per my understanding, the browser will always send the byte ranges for the tags in the input text at character level and receive byte ranges for the tags in the translated text at character level.

Exactly. The integration part with browser is always at char-level as the sentence splitter or tokeniser is inside Bergamot/Marian.

For the entire text that comes your way (irrespective of whether it contains a phrase, single sentence or multiple sentences), you always expect just one std::vector<ByteRange>

Yes. Like I said about the char-level thing, the browser doesn’t have the sentence splitter, so the cross-sentence tag case is an internal matter for Bergamot.

To be more specific, from the browser side, we would expect the entire text where all the tags are removed and one std::vector<ByteRange> where contains tag positions at chat-level. Then from us, you would expect the translated text as whole (without tags) and one std::vector<ByteRange> where contains tag positions at char-level in the translated text (in the same order as in the original text so you render them back).

Edit: just a reminder about the order of tag positions as we agreed before:

The byte ranges should appear in the same order as the open tags in HTML.

@abhi-agg
Copy link
Contributor

abhi-agg commented Oct 21, 2021

4. migrate exiting mozilla test cases (https://github.com/mozilla-extensions/firefox-translations/blob/7cb7faffd9847ff87f587092078234d71ac2b820/src/core/ts/content-scripts/dom-translation-content-script.js/dom-translators/detagAndProject.spec.ts#L36-L188) to this repo.

@qianqianzhu I am not sure how much these test cases are valid/helpful as we resorted to passing tag positions as byte offsets to the bergamot-translator and let it tell us the positions of tags in translated text. However, if you find them helpful then please feel free to incorporate them.

I have tried to list down some test cases as per my understanding in mozilla#61 (comment). Please let me know if the expectations of the API are different to what I have documented there. This will help us in starting the integration in extension side and find the corner cases sooner 👍
cc @andrenatal

Copy link
Contributor

@jerinphilip jerinphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,

I'm leaving this review in a bit of haste given mention of a Friday deadline, excuse any misunderstandings.

  1. If I understood correctly, I recommend replacing your usage of ByteRange in TagTree to represent index of tokens at begin and end with a strong typedef of an equivalent as soon as possible (TokenIndexRange?) to avoid confusion. A ByteRange represents a range in a sequence of characters. begin and end are indices onto the string, not a sequence of token units.
  2. There are pass by values which can create high memory usage, would a const & work here?
  3. How about having a vector<IndexRange> vertices, vector<vector<size_t>> adjacencyList[vertex] and TagTree owning it, recursive member functions traversing the tree by using adj? Count of source vertices are known at construction, same for target vertices?

src/translator/tag_processor.h Outdated Show resolved Hide resolved
src/translator/tag_processor.h Outdated Show resolved Hide resolved
TagProcessor tp = TagProcessor(alignments, ttOriginalTokenLevel, response.source.numWords(sentenceId),
response.target.numWords(sentenceId));
int exitStatus = tp.traverseAndQuery();
TagTree ttTranslatedTokenLevel = tp.getTargetRoot();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't 159 - 162 be the following?

TagTree targetTree = tagProcess(alignments, 
				ttOriginalTokenLevel, 
				response.source.numWords(sentenceId),
                                response.target.numWords(sentenceId))

Edit: This is a Maybe = Just / Nothing, which in modern C++ is a recommended use of std::optional

Copy link
Contributor

@jerinphilip jerinphilip Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If exitStatus is internal to the problem of TagProcessor as mentioned in other GitHub PR comment, please ensure it doesn't leak (by leak, I mean is referred and used outside) into ResponseBuilder and here, at apps.cpp.

std::vector<ByteRange> brvecTranslatedCharLevel;
for (ByteRange br : brvecTranslatedTokenLevel) {
size_t charBegin = response.target.wordAsByteRange(sentenceId, br.begin).begin;
size_t charEnd = response.target.wordAsByteRange(sentenceId, br.end).begin;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what is happening here.

response.target.wordAsByteRange(sentenceId, 
				br.end // <- This is supposed to be an index. 
				       //    Not a ByteRange constituent.
)

Are you using the ByteRange to represent something that is not a ByteRange?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So TokenIndexRange is introduced for less confusion?

Copy link
Contributor

@jerinphilip jerinphilip Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort of yes. The problem now is TokenIndexRange is an alias for ByteRange. So if you pass a vector<ByteRange> instead of vector<TokenIndexRange> the compiler and type-system will let it slide. In a strong typedef one is not interconvertible to another.

A lot of guards come in place from compiler and type-system if you bake the following into design:

  1. ByteRange != TokenIndexRange (different struct definition, strong typedef).
  2. Nothing other than TagProcessor knows about this TokenIndexRange. (defined private within TagProcessor or something).

for (ByteRange br : brvecTranslatedCharLevel) {
std::cout << br.begin << " " << br.end;
for (size_t pos = br.begin; pos < br.end; pos++) {
std::cout << response.target.text[pos];
Copy link
Contributor

@jerinphilip jerinphilip Oct 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share the output of this somewhere to understand what's going on here? I'm trying to understand the need for a character-level something. Is this some mechanism to allow the client to break tokens in the middle of a Token and put ByteRanges there?

currentFlat.insert(currentFlat.end(), childFlat.begin(), childFlat.end());
}
return currentFlat;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can "child", "subTree", "i" converge into something consistent and easily comprehensible?

src/translator/tag_processor.h Show resolved Hide resolved
src/tests/apps.cpp Outdated Show resolved Hide resolved
brvecCharLevel.push_back(ByteRange{2, 12});

const Response response = translateForResponse(service, model, std::move(source), responseOptions);
ABORT_IF(response.source.numSentences() != 1, "Cross sentence byteranges are tricky at the moment.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they still? Shouldn't be for your algorithm to work properly (In my case, I intended to give you examples for unit-tests, so they had to start from zero).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cross-sentence tag alignment is still under development.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep in mind this PR cannot merge without working for a whole Response, which can contain multiple sentences and potential cross sentence tags.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the issue with cross sentence? We just need a for loop to cut it at sentence boundaries and call the tag alignment thing for each of them, then do some offsets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kpu For cross-sentence tags, tag pairs are incomplete so the tag nesting constrains cannot hold. One way to hold the nesting constraints is to manually complete/close the tag pairs at the end of the sentence and remove those additional tags once the tag alignment for the whole text is completed. Otherwise it will generate broken HTML code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes though phantom tag removal / insertion could be inlined into the loop over sentences rather than done once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jerinphilip Fair enough. I am not expecting this PR is going to be merged very soon. I was trying to make a preliminary version of the tag alignment and API integration to Service (which will not be changed in the future). So Mozilla side can start JS bindings and test the performance. That is the main purpose of the deadline of this Friday (from my perspective). I will keep improving and polishing the internal implementation.

src/tests/units/tag_tests.cpp Show resolved Hide resolved
Copy link
Contributor

@jerinphilip jerinphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please be careful of creating copies by pass-by-value or copy-assignment where you don't need it. I have some comments on the placement of tagPosition.

I have also left a few queries on inside-outside and the construction of the data-structure used.

qualityEstimator_(qualityEstimator) {}
qualityEstimator_(qualityEstimator) {
// Holds tag positions info for later alignment
source_.tagPositionSource_ = tagPositionSource;
Copy link
Contributor

@jerinphilip jerinphilip Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(General guideline) Use member list to initialize, consistent with the rest. I think this vector is a heavy object. use std::move(...) to slowly transfer ownership to ResponseBuilder. The existing state invokes copy-assignment, which can potentially be expensive. (There are other places, please try to avoid unnecessary copies ahead).

(This is the action item here, I think) If you intend to place this with AnnotatedText as an additional annotation, do it at some-place earlier (makeRequest or next to where source Annotation is populated or something).

@@ -122,6 +122,9 @@ struct AnnotatedText {
std::string text; ///< Blob of string elements in annotation refers to.
Annotation annotation; ///< sentence and (sub-) word annotations.

/// Tag positions at char-level in the source text
std::vector<ByteRange> tagPositionSource_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnnotatedText can be source or target, so simply call this tagPosition. std::move(...) browser provided one into source's AnnotatedText.

Construct target's tagPosition at ResponseBuilder.

std::vector<marian::data::SoftAlignment> alignments;

/// Tag positions at char-level in the translated text (in the same order as in the source text)
std::vector<ByteRange> tagPositionTarget;
Copy link
Contributor

@jerinphilip jerinphilip Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment on having

AnnotatedText source, target;
source.tagPosition
target.tagPosition

}
}

std::vector<TokenIndexRange> flatten() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning vector on each node is inefficient? You can pass around an output vector here and accumulate over recursive traversal I think.

/// Building TagTree from a ByteRange vector (passing from the browser).
class TagTreeBuilder {
public:
TagTreeBuilder(std::vector<ByteRange> brv) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pass by value is copy, which might be expensive. I think you may be able to std::move(...) this in here.

Edit: I think const &, use the traversal to build adjacency list will be more suitable.

}

// bottom-up construction
void addSubtree(const TagTree &st) { subtree_.emplace_back(st); }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like an expensive copy.

}
}
return tt;
}
Copy link
Contributor

@jerinphilip jerinphilip Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll need to look more into inside-outside, but this construct (growTagTree) is confusing. If you process the traversal once and construct the vertices, adjacency list at construction of TagTree (which I'd expect you'll need anyway) shouldn't you be done with all the compute and storage required for this?

Do you take leaf nodes and iterate up the parent for inside-outside? Otherwise this is equivalent to an adjacency list, correct?

std::vector<bool> coverageMatrix_;
std::vector<size_t> parentVector_;
bool treeValid_;
std::vector<TokenIndexRange> tokenBoundVector_;
Copy link
Contributor

@jerinphilip jerinphilip Oct 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please have some information (comments) on what these do. Not obvious to me, perhaps some missing ideas in inside-outside.

@kpu
Copy link
Member

kpu commented Oct 28, 2021

In today's Bergamot meeting, we discussed with Mozilla that it's critical to have a working API version even if it means returning stupid results. Even just copying the tag offsets within sentence but clipping their ends to sentence length would work. This is a blocker for getting the next extension. Can you get something in quickly?

@qianqianzhu
Copy link
Contributor Author

it's critical to have a working API version even if it means returning stupid results.

The current version is a working API version. Only the multiple sentences are not working yet (which will simply abort). The API will not change. If no one mind the broken HTML tags, I can simply write a for loop and treat every unclosed tag pairs as an empty tag (This is doable by tomorrow).

@kpu
Copy link
Member

kpu commented Oct 28, 2021

Only the multiple sentences are not working yet (which will simply abort). The API will not change. If no one mind the broken HTML tags, I can simply write a for loop and treat every unclosed tag pairs as an empty tag (This is doable by tomorrow).

I need something Mozilla can integrate on the full-scale problem. If it provides mediocre output that's fine. But it needs to not crash. Then you can fix the output quality in a later change.

An example would be simply linearly scaling the tag offsets by the length ratio of input and output text (even ignoring sentences).

Copy link
Contributor

@jerinphilip jerinphilip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to sanity check this PR with some black-box testing given the shortage of time to completely inspect/review code.

jerinphilip#25 attempts to create a DOM equivalent by recursively walking through [l, r), starting from [0, size) choosing a random split point, set to terminate after placing some n HTML nodes.

I'm trying to follow:

The byte ranges should appear in the same order as the open tags in HTML.

Could this be static vs implicitly dynamic declarations?
Source tag - traversal { [21, 56), [21, 55), [46, 55), [55, 56)}
Segmentation fault (core dumped)

On a 56 character input, the above appears to me to be valid ByteRange as input.

GDB shows this to be within inside-outside:

Source tag - traversal { [21, 56), [21, 55), [46, 55), [55, 56)}

Thread 1 "bergamot-test" received signal SIGSEGV, Segmentation fault.
0x0000555555658033 in marian::bergamot::TagProcessor::maxProduct (this=0x7fffffff90f0, query=..., outer=..., inner=...) at /home/jphilip/code/bergamot-translator/src/translator/tag_processor.h:222
222               logProductDynamic += std::log(inside_[flattenOffset(query.begin, query.end - 1, s)]);

I expect something like this running on WNGT20 1M sentences, with a few random-walk HTML generations, where each line is configured to be a sentence, without crashing for this PR to go in. This, in the real-world, would mean that the inside-outside tag-tree functionality does not crash for a wide range of inputs that's possible to be thrown at it.


response.alignments.push_back(std::move(unified_alignment));
response.alignments.push_back(std::move(softAlignment));
buildTagAlignment(response);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is no source-tagtree, this need to be coded such that the process is a no-op, ie, tag-alignment is never called. This means other test-app or functionality path which does not require TagTree will work for certain.

@qianqianzhu
Copy link
Contributor Author

qianqianzhu commented Oct 29, 2021

@jerinphilip thanks for your sanity check. I am trying to debug here while trying to make cross-sentence tags work. Please bear with me. Or if you target anything abnormal, feel free to tell me.
BTW, GitHub CI checks keeps complaining about the builds on the minimal systems. It happens since I merged the main branch (which I haven’t started integration yet).
I am not sure what’s going on there. Can you give a look? (It’s not merely for my PR but for other ongoing works too).

@qianqianzhu qianqianzhu marked this pull request as draft October 29, 2021 13:25
@qianqianzhu
Copy link
Contributor Author

qianqianzhu commented Oct 29, 2021

Marked as draft since so many complaints about the current implementation (sorry my bad). Another clean PR will be made with a FakeTagProcessor (using linear scaling as requested) for not being a blocker by tonight (23:59 uk time).

@kpu
Copy link
Member

kpu commented Jan 26, 2022

Abandoned in favor of @jelmervdl implementation.

@kpu kpu closed this Jan 26, 2022
@jerinphilip jerinphilip deleted the tag-translation branch February 10, 2022 23:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants