-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integration: marian new on-the-fly-decoder for bergamot #8
Conversation
This first commit imports files from mts which was repurposed for bergamot translator from https://github.com/browsermt/mts/tree/nuke.
Modifications to SentencePiece are necessary to provide token level string_views. This commit changes marian to an alternate branch which has the feature incorporated.
CMakeLists have been modified with the necessary includes to add browsermt/mts@nuke files to the bergamot-translator library. In addition, adds the ssplit dependency, corresponding includes. Intel MKL fails on compilation, unable to find libraries. To solve this 3rd_party/CMakeLists.txt is modified with @UG's fixes to propogate variables (EXT_LIBS, etc) at a library level.
A faster linesplitter added for benchmarks is removed in favour of @UG's ssplit-cpp. NOTE: ssplit-cpp's regex based implementation is slow for one-line parses, which ideally needs to be improved in upstream ssplit-cpp to trivially reduce to a faster newline character based split.
Commit modifies the example test-code main-mts into the app folder, updating CMakeLists accordingly.
Removed Alignments, too many questions and no concrete answers. Better off removing unused code. History is kept for now, for internal use.
Vocabs was earlier loaded in each thread and copied several times. Modified this to be loaded only once in Service and reference used consistently later on. This change makes Tokenizer as a class rather moot, as there's only one private member and a function. Moved this into TextProcessor. SentenceSplitter, however remains a separate class. utils.{h,cpp} had only a single loadVocabularies function, which is at the moment required only in Service. Making loadVocabularies a function inside Service and getting rid of utils.*.
- Truncating long sentences into those of a specified length for faster processing is now a separate function, for improved readability. - Changes doing push_back -> emplace_back at places to avoid copy. - query_to_segments is renamed as process. - Comments are added in an attempt to bring some sanity.
@jerinphilip This is great work. Congratts. 👍 The PR mostly looks good to me. I added some comments that are mostly related to cmake files and 1 general comment regarding the namespace hierarchy. Would be great if you can fix them :) PS: I am not reviewing the implementation files as I trust Kenneth and Uli on that part of the code. |
Only the bergamot-translator library should be linked to main target Any other library (marian ${MARIAN_CUDA_LIB} ${EXT_LIBS} ssplit pcrecpp.a pcre.a) should be linked to bergamot-translator target inside src/translator folder.
Added instructions to the I'm not sure if I'm adding more dependencies to the merge, but we should eliminate both
|
@jerinphilip Both of your issues should be resolved by last 2 commits. Please check at your end 👍 |
@abhi-agg: I will confim browsermt/marian-dev@wasm's issues looks to be fixed with the recent commits. Until the time browsermt/marian-dev@wasm gets reviewed and merged to master (ensuring the disabled switches only disable them for wasm) by @kpu or someone else who's qualified (I'm probably not, at least for now) I will configure my jenkins builds with the following conf: Click to expand
. /etc/environment
rm -rf build
mkdir -p build
cd build
#CMAKE=/usr/bin/cmake
CMAKE=/opt/modules/cmake-3.20-rc1/bin/cmake
CC=/usr/bin/gcc-8 CXX=/usr/bin/g++-8 \
$CMAKE \
-DCOMPILE_CUDA=off -DCOMPILE_TESTS=OFF -DCOMPILE_EXAMPLES=OFF \
-DUSE_SENTENCEPIECE=ON -DCOMPILE_SERVER=OFF -DUSE_WASM_COMPATIBLE_MARIAN=off \
..
# Build time patch: browsermt/marian-dev@wasm -> browsermt/marian-dev@master
git -C $WORKSPACE/3rd_party/marian-dev fetch -a
git -C $WORKSPACE/3rd_party/marian-dev checkout origin/master
git -C $WORKSPACE/3rd_party/marian-dev rev-parse --short HEAD
git -C $WORKSPACE add 3rd_party/marian-dev
( git config --global user.name 'Marian Minion';
git config --global user.email 'jenkins@vali.inf.ed.ac.uk';
git -C $WORKSPACE commit \
-m "Jenkins: Local submodule update"
)
# Display CMake cached variables for debugging
$CMAKE -L -N
make -j8
# No tests for bergamot-translator
# CUDA_VISIBLE_DEVICES=0 env CTEST_OUTPUT_ON_FAILURE=1 make test
cd ..
# Create the artifact with files needed for regression tests
tar zcvf bergamot-translator.tgz \
build/CMakeCache.txt \
build/libmarian.a build/libssplit.a build/src/translator/libbergamot-translator.a \
build/marian* build/spm_* \
build/app/* This should let us know if something in wasm branch breaks stuff here again. GitHub CI workflows file (the original ubuntu, macos) for vanilla configuration should only require the addition of There also seems to be a segfault at full WNGT20 decode (bergamot-translator-regression-tests) are failing. Hold off on merge to main until I isolate the cause. |
@jerinphilip You mean, this config change will happen only for
And NOT for these 2 builds:
Right? |
@abhi-agg GitHub CI will have the 3, no modifications. It's not a change, the 3 builds will work with the source here. I'll have a version (separately) compiling with master branch from marian-dev, that's what the comment means. If something in browsermt/marian-dev@wasm breaks (like the quantizer), we'll know. |
- Custom marian means only those marian features that are required for wasm - Added workflow for native builds - Added workflow for wasm builds
Last 2 are done. Once you push the change for the 1st one, all 3 should be good to go. |
- Removed 'wasm-integration' branch from wasm build instructions - Improved native build instructions
* Updating vanilla workflows with -DUSE_WASM_COMPATIBLE_MARIAN=off * Boost: Replacing with OS Boost for Ubuntu Builds
CI Fixes: vanilla bergamot-translator
Through inheritance, a non-threaded and multithreaded Service are created, both derived of the same ServiceBase class which holds the common elements. In preparation to solve SIGSEGV in #41. First inspections gave aborts in thread part, and repeated SIGSEGV's in lock-policy's of shared_pointers even in non-threaded paths. Solving this first, to avoid ifdef or tricky paths. The non-threaded implementation is not included in WASM builds at all, by separating out the single-threaded logic. DRY is achieved through inheritance and operator overloading.
To fix WASM Mac builds on CI. loadVocabularies function is now inlines and available through service_base.h, from where it seems to propogate to all places of use.
- Renamed USE_WASM_COMPATIBLE_MARIAN to USE_WASM_COMPATIBLE_SOURCES - Removed COMPILE_THREAD_VARIANT cmake option and removed corresponding compile definition - Updated workflows and READMEs accordingly
- main-mts and marian-decoder-new can't be used because it uses multi-threaded variant of Service class
Improve threaded/non-threaded paths
bergamot-translator is building:
No issues at bergamot-translator-tests: Automation is setup for both to open issues at respective repositories from jenkins@vali builds. |
Looks covered. Probably a GitHub UI bug. Points me to resolved conversation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bergamot-translator is building:
There also seems to be a segfault at full WNGT20 decode (bergamot-translator-regression-tests) are failing. Hold off on merge to main until I isolate the cause.
No issues at bergamot-translator-tests:
Automation is setup for both to open issues at respective repositories from jenkins@vali builds.
Lgtm. As we agreed, we will add tests for the artifacts built with wasm compatible sources in bergamot-translator-tests repo.
This has become one gigantic PR. Mozilla has branched this (integration) into wasm-integration and tinkered with threads, meanwhile I have made more refactors and adjustments in an attempt to fit to WASM development in a cleaner way. Everyone seems to be interested in having a 'main' as well. Here's a summary of changes since @kpu's last review, for doc:
Note for @kpu: Concurrent queuing (#41) is not included in this PR. I'm not requesting a further review (which is cumbersome given how big this has become) in this PR. I request to carry forward any concerns in respective issues. With the current state of source, this PR will close #41, #10, #27, #11. @abhi-agg please merge into a 'main'. I understand risks and take responsibility for edits/ potential issues from the unreviewed code in my edits. |
Cool. Merging it. |
Integration: marian new on-the-fly-decoder for bergamot
WIP repurpose of MTS to translate on the fly and provide a consumable API for bergamot-translator.
from browsermt/mts.