Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removes vocabs and propogates fixes for breaks #79

Merged
merged 22 commits into from
Apr 7, 2021

Conversation

jerinphilip
Copy link
Contributor

@jerinphilip jerinphilip commented Mar 31, 2021

This PR removes sourceVocab() and targetVocab() exposed by Service as requested by Mozilla for interface towards a cleaner API.

sourceVocab() is unused and harmless to remove. Removing targetVocab() breaks existing marian-decoder-new and consequently regression-tests. Further changes in the PR propogates the vocab change to provide a Response based output printing in the regression-test-app. We have lost capabilities to completely benchmark with marian-decoder for time being as the new OutputCollector equivalent is a dumbed down version which only couts lines.

Keeping the changes minimal for reducing review-load (edit: oops, sorry, accidentally removed a bit of marian::Histories in the process). Marking as ready for review.

Regression tests (2/4) are failing for bytearray loads for bergamot-translator-app and service, which is already known and will fix with submodule update.

Regression-tests for service-cli which now tests the preliminary alignment and annotations are succeeding if PR merges. bergamot-translator-app stays intact, so no harm done.

Checklist

  • Prettify diff to not run over unnecessary lines.
  • Regression-tests - basic: [1/4]
  • Regression-tests - speed: [1/1]

Jerin Philip added 5 commits March 31, 2021 14:45
Conflicts:
   src/translator/response.cpp -> Bringing to sync with ByteArray updates
   src/translator/response.h -> Bringing to sync with ByteArray updates
We however have Histories in constructor, which we will remove out of the way
soon.
@jerinphilip
Copy link
Contributor Author

There is a potential bug in main. If a sentence is attempted to be inserted corresponding to an empty std::vector<string_view>, we have an invalid sentence in the flat container for string_views, where I additionally mark sentence boundaries. This breaks on WNGT20 sources.shuf in line 1695, where an invalid character sentence is provided and the source vocab eats it. The resulting target is empty with no string_view and causes issues with word-counts and also sentence boundary.

For now, I have pushed the eos corresponding string_view in to get WNGT20 to complete, which seems the simplest. Looks harmless for the time-being. I have strengthened the unit-tests in the debugging process. (These are the few additional commits).

This is handling invalid user-input - not something strictly wrong with the container (Annotation).

@jerinphilip jerinphilip added the cleanup Something that can be refactored or better organized label Apr 1, 2021
@jerinphilip jerinphilip marked this pull request as draft April 1, 2021 20:31
@jerinphilip jerinphilip marked this pull request as ready for review April 2, 2021 20:24
@jerinphilip jerinphilip changed the base branch from main to jp/strengthen-annotation April 3, 2021 20:29
@jerinphilip jerinphilip changed the base branch from jp/strengthen-annotation to main April 3, 2021 20:31
@jerinphilip jerinphilip linked an issue Apr 4, 2021 that may be closed by this pull request
@jerinphilip jerinphilip added the mod: marian Changes affecting marian-dev component label Apr 4, 2021
@kpu
Copy link
Member

kpu commented Apr 6, 2021

Focused PRs please. This appears to be both fixing annotations and removing vocabs. Separate them.

@jerinphilip
Copy link
Contributor Author

jerinphilip commented Apr 6, 2021

@kpu #85. I can only test the changes here, which is why they're in source. can you review that PR first?

(I discovered bug here, cherry-picked commits separating them to a different PR at #85; When they're in source, this diff will simplify).

@kpu
Copy link
Member

kpu commented Apr 6, 2021

By ordering these you've make it much harder to get in. You can simplify the diff here now.

Jerin Philip added 2 commits April 6, 2021 19:27
@jerinphilip
Copy link
Contributor Author

By ordering these you've make it much harder to get in. You can simplify the diff here now.

I have decoupled changes to simplify diff, this source breaks marian-decoder-new and the speed tests bound to it.

@abhi-agg
Copy link
Contributor

abhi-agg commented Apr 7, 2021

@jerinphilip Does it make sense to merge main to this PR? It seems to be out of date with main. Although, I don't know whether it will impact the review or not 👍

@kpu
Copy link
Member

kpu commented Apr 7, 2021

@abhi-agg I hit the magic merge button, please review

@jerinphilip
Copy link
Contributor Author

Does it make sense to merge main to this PR?

Well, there are no more bugs in main than what exists already with this in, so... Just breaks marian-decoder-new on invalid inputs thanks to sentencepiece that's all. Fix is in #85 which after hours of struggle I believe cannot improve much beyond.

@jerinphilip jerinphilip merged commit b71b3a1 into main Apr 7, 2021
@jerinphilip
Copy link
Contributor Author

I'll take advantage of this situation and work the other way. I pull this source into annotation-bugfix branch through main.

@jerinphilip jerinphilip deleted the jp/marian-decoder-new-output-collector branch April 7, 2021 23:11
@jerinphilip jerinphilip mentioned this pull request Apr 29, 2021
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cleanup Something that can be refactored or better organized mod: marian Changes affecting marian-dev component
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make marian-decoder-new consume Response instead of Histories
3 participants