Implementation of Inside-outside algorithm for tag translation #199
Conversation
If work on this PR or follow-up PRs eventually leads to a capability for bergamot-translator to translate tags, and in turn allows the extension to drop its need to maintain and finish the detagAndProject methods, then these test cases may be relevant: https://github.com/mozilla-extensions/firefox-translations/blob/7cb7faffd9847ff87f587092078234d71ac2b820/src/core/ts/content-scripts/dom-translation-content-script.js/dom-translators/detagAndProject.spec.ts#L36-L188
@motin many thanks for pointing out the relevant tests. That's exactly what I am looking for. I will try to integrate those test cases into this PR.
@abhi-agg let me explain the unclear points.
Exactly. The integration part with browser is always at char-level as the sentence splitter or tokeniser is inside Bergamot/Marian.
Yes. Like I said about the char-level thing, the browser doesn't have the sentence splitter, so the cross-sentence tag case is an internal matter for Bergamot. To be more specific, from the browser side, we would expect the entire text where all the tags are removed. Edit: just a reminder about the order of tag positions as we agreed before:
@qianqianzhu I am not sure how valid/helpful these test cases are, as we resorted to passing tag positions as byte offsets to the bergamot-translator and letting it tell us the positions of tags in the translated text. However, if you find them helpful then please feel free to incorporate them. I have tried to list some test cases as per my understanding in mozilla#61 (comment). Please let me know if the expectations of the API are different from what I have documented there. This will help us start the integration on the extension side and find the corner cases sooner 👍
Hi,
I'm leaving this review in a bit of haste given mention of a Friday deadline, excuse any misunderstandings.
- If I understood correctly, I recommend replacing your usage of ByteRange in TagTree to represent the index of tokens at begin and end with a strong typedef of an equivalent (`TokenIndexRange`?) as soon as possible, to avoid confusion. A ByteRange represents a range in a sequence of characters: `begin` and `end` are indices into the string, not into a sequence of token units.
- There are pass-by-values which can create high memory usage; would a `const &` work here?
- How about having a `vector<IndexRange> vertices` and a `vector<vector<size_t>> adjacencyList` (indexed by vertex), with TagTree owning them and recursive member functions traversing the tree using the adjacency list? The count of source vertices is known at construction, same for target vertices?
src/tests/apps.cpp
Outdated
TagProcessor tp = TagProcessor(alignments, ttOriginalTokenLevel, response.source.numWords(sentenceId),
                               response.target.numWords(sentenceId));
int exitStatus = tp.traverseAndQuery();
TagTree ttTranslatedTokenLevel = tp.getTargetRoot();
Can't 159 - 162 be the following?
TagTree targetTree = tagProcess(alignments,
                                ttOriginalTokenLevel,
                                response.source.numWords(sentenceId),
                                response.target.numWords(sentenceId));
Edit: This is a Maybe = Just / Nothing, which in modern C++ is a recommended use of `std::optional`.
If `exitStatus` is internal to the problem of `TagProcessor`, as mentioned in the other GitHub PR comment, please ensure it doesn't leak (by leak, I mean is referred to and used outside) into `ResponseBuilder` and here, at apps.cpp.
src/tests/apps.cpp
Outdated
std::vector<ByteRange> brvecTranslatedCharLevel;
for (ByteRange br : brvecTranslatedTokenLevel) {
  size_t charBegin = response.target.wordAsByteRange(sentenceId, br.begin).begin;
  size_t charEnd = response.target.wordAsByteRange(sentenceId, br.end).begin;
I'm not sure what is happening here.
response.target.wordAsByteRange(sentenceId,
                                br.end  // <- This is supposed to be an index,
                                        //    not a ByteRange constituent.
)
Are you using the `ByteRange` to represent something that is not a `ByteRange`?
So `TokenIndexRange` is introduced for less confusion?
Sort of, yes. The problem now is that `TokenIndexRange` is an alias for `ByteRange`. So if you pass a `vector<ByteRange>` instead of a `vector<TokenIndexRange>`, the compiler and type-system will let it slide. With a strong typedef, one is not interconvertible to the other.
A lot of guards come into place from the compiler and type-system if you bake the following into the design:
- `ByteRange` != `TokenIndexRange` (different struct definitions, a strong typedef).
- Nothing other than `TagProcessor` knows about this `TokenIndexRange` (defined private within `TagProcessor` or something).
src/tests/apps.cpp
Outdated
for (ByteRange br : brvecTranslatedCharLevel) {
  std::cout << br.begin << " " << br.end;
  for (size_t pos = br.begin; pos < br.end; pos++) {
    std::cout << response.target.text[pos];
Can you share the output of this somewhere, to help understand what's going on here? I'm trying to understand the need for a character-level something. Is this some mechanism to allow the client to break a Token in the middle and put ByteRanges there?
    currentFlat.insert(currentFlat.end(), childFlat.begin(), childFlat.end());
  }
  return currentFlat;
}
Can "child", "subTree", "i" converge into something consistent and easily comprehensible?
src/tests/apps.cpp
Outdated
brvecCharLevel.push_back(ByteRange{2, 12});

const Response response = translateForResponse(service, model, std::move(source), responseOptions);
ABORT_IF(response.source.numSentences() != 1, "Cross sentence byteranges are tricky at the moment.");
Are they still? They shouldn't be, for your algorithm to work properly. (In my case, I intended to give you examples for unit-tests, so they had to start from zero.)
Cross-sentence tag alignment is still under development.
Please keep in mind this PR cannot merge without working for a whole `Response`, which can contain multiple sentences and potential cross-sentence tags.
What's the issue with cross-sentence? We just need a for loop to cut it at sentence boundaries and call the tag alignment for each of them, then apply some offsets?
@kpu For cross-sentence tags, tag pairs are incomplete, so the tag nesting constraints cannot hold. One way to keep the nesting constraints is to manually complete/close the tag pairs at the end of the sentence and remove those additional tags once the tag alignment for the whole text is completed. Otherwise it will generate broken HTML code.
Yes, though phantom tag removal/insertion could be inlined into the loop over sentences rather than done once.
@jerinphilip Fair enough. I am not expecting this PR to be merged very soon. I was trying to make a preliminary version of the tag alignment and its API integration into Service (which will not change in the future), so the Mozilla side can start the JS bindings and test the performance. That is the main purpose of this Friday's deadline (from my perspective). I will keep improving and polishing the internal implementation.
Please be careful of creating copies by pass-by-value or copy-assignment where you don't need them. I have some comments on the placement of `tagPosition`.
I have also left a few queries on inside-outside and the construction of the data structure used.
-      qualityEstimator_(qualityEstimator) {}
+      qualityEstimator_(qualityEstimator) {
+    // Holds tag positions info for later alignment
+    source_.tagPositionSource_ = tagPositionSource;
(General guideline) Use the member initializer list, consistent with the rest. I think this vector is a heavy object; use `std::move(...)` to transfer ownership to `ResponseBuilder`. The existing state invokes copy-assignment, which can potentially be expensive. (There are other places; please try to avoid unnecessary copies ahead.)
(This is the action item here, I think) If you intend to place this with `AnnotatedText` as an additional annotation, do it some place earlier (`makeRequest`, or next to where the source Annotation is populated, or something).
@@ -122,6 +122,9 @@ struct AnnotatedText {
   std::string text;       ///< Blob of string elements in annotation refers to.
   Annotation annotation;  ///< sentence and (sub-) word annotations.

+  /// Tag positions at char-level in the source text
+  std::vector<ByteRange> tagPositionSource_;
`AnnotatedText` can be source or target, so simply call this `tagPosition`. `std::move(...)` the browser-provided one into the source's `AnnotatedText`.
Construct the target's `tagPosition` at `ResponseBuilder`.
std::vector<marian::data::SoftAlignment> alignments;

/// Tag positions at char-level in the translated text (in the same order as in the source text)
std::vector<ByteRange> tagPositionTarget;
See the comment on having `AnnotatedText source, target;` with `source.tagPosition` and `target.tagPosition`.
  }
}

std::vector<TokenIndexRange> flatten() {
Returning a vector from each node is inefficient. You can pass around an output vector here and accumulate over the recursive traversal, I think.
/// Building TagTree from a ByteRange vector (passing from the browser).
class TagTreeBuilder {
 public:
  TagTreeBuilder(std::vector<ByteRange> brv) {
Pass-by-value is a copy, which might be expensive. I think you may be able to `std::move(...)` this in here.
Edit: I think `const &`, plus using the traversal to build an adjacency list, would be more suitable.
}

// bottom-up construction
void addSubtree(const TagTree &st) { subtree_.emplace_back(st); }
Looks like an expensive copy.
    }
  }
  return tt;
}
I'll need to look more into inside-outside, but this construct (`growTagTree`) is confusing. If you process the traversal once and construct the vertices and adjacency list at construction of `TagTree` (which I'd expect you'll need anyway), shouldn't you be done with all the compute and storage required for this?
Do you take leaf nodes and iterate up the parents for inside-outside? Otherwise this is equivalent to an adjacency list, correct?
std::vector<bool> coverageMatrix_;
std::vector<size_t> parentVector_;
bool treeValid_;
std::vector<TokenIndexRange> tokenBoundVector_;
Please have some information (comments) on what these do. It's not obvious to me; perhaps some ideas from inside-outside are missing.
In today's Bergamot meeting, we discussed with Mozilla that it's critical to have a working API version even if it means returning stupid results. Even just copying the tag offsets within a sentence but clipping their ends to the sentence length would work. This is a blocker for getting the next extension out. Can you get something in quickly?
The current version is a working API version. Only multiple sentences are not working yet (they will simply abort). The API will not change. If no one minds the broken HTML tags, I can simply write a for loop and treat every unclosed tag pair as an empty tag (this is doable by tomorrow).
I need something Mozilla can integrate on the full-scale problem. If it provides mediocre output, that's fine. But it needs to not crash. Then you can fix the output quality in a later change. An example would be simply linearly scaling the tag offsets by the length ratio of the input and output text (even ignoring sentences).
I'm trying to sanity-check this PR with some black-box testing, given the shortage of time to completely inspect/review the code.
jerinphilip#25 attempts to create a DOM equivalent by recursively walking through `[l, r)` starting from `[0, size)`, choosing a random split point, set to terminate after placing some `n` HTML nodes.
I'm trying to follow: "The byte ranges should appear in the same order as the open tags in HTML." Could this be static vs implicitly dynamic declarations?
Source tag - traversal { [21, 56), [21, 55), [46, 55), [55, 56)}
Segmentation fault (core dumped)
On a 56-character input, the above appears to me to be valid ByteRanges as input.
GDB shows this to be within inside-outside:
Source tag - traversal { [21, 56), [21, 55), [46, 55), [55, 56)}
Thread 1 "bergamot-test" received signal SIGSEGV, Segmentation fault.
0x0000555555658033 in marian::bergamot::TagProcessor::maxProduct (this=0x7fffffff90f0, query=..., outer=..., inner=...) at /home/jphilip/code/bergamot-translator/src/translator/tag_processor.h:222
222 logProductDynamic += std::log(inside_[flattenOffset(query.begin, query.end - 1, s)]);
I expect something like this running on WNGT20 1M sentences, with a few random-walk HTML generations, where each line is configured to be a sentence, to run without crashing before this PR goes in. In the real world, this would mean the inside-outside tag-tree functionality does not crash for the wide range of inputs that can be thrown at it.
- response.alignments.push_back(std::move(unified_alignment));
+ response.alignments.push_back(std::move(softAlignment));
+ buildTagAlignment(response);
If there is no source tag-tree, this needs to be coded such that the process is a no-op, i.e., tag-alignment is never called. This way, other test-apps or functionality paths which do not require TagTree will work for certain.
@jerinphilip thanks for your sanity check. I am trying to debug here while trying to make cross-sentence tags work. Please bear with me. Or if you spot anything abnormal, feel free to tell me.
Marked as draft, since there are so many complaints about the current implementation (sorry, my bad). Another clean PR will be made with a
Abandoned in favor of @jelmervdl's implementation.
This PR implements an inside-outside algorithm for tag translation.
The basic idea of this feature is to preserve HTML tag positions in the translated sentence using the alignment and token information.
The current implementation follows the suggestions by @kpu:
- `TagProcessor` is built as a layer on top of translation, using the alignment and token information.
- We take tag information as `ByteRange`s in the source sentence, and we return tag information as `ByteRange`s in the target sentence in the same order.
There are still some things to be done:
(1) optimise the no-solution case (some randomisation in the greedy algorithm);
(2) some corner cases are not optimal;
(3) empty tags need to be optimised.