Skip to content

Commit

Permalink
Alignments + weak quality scores capability in Service (#46)
Browse files Browse the repository at this point in the history
* Draft adjustments to API

* Adjustments to docs

* Let's call the word + sentence ranges annotations

* Editing confusing comment on size()

* Fixing compilation for template adjustments for SentenceRanges

* string_view template hacks

This commit shifts AnnotatedBlob into a templated type and gets the
troubled part to compile. All to manage absl::string_view and
std::string_view.

Objective: marian::bergamot stays C++ 11 to pluck and put in marian
code, bergamot-translator somehow flexes C++17. Simplify development in
one place.

* Fixing the wiring: Gets source to build

Runtime errors exist, but AnnotatedBlobs are consistent.

* Bugfix: Matching old-state after factoring AnnotatedBlob in

* Removing vocabs_ from Response.

(For the umpteenth time).

* Alignment API ready in marian::bergamot::Response

* Wiring alignments upto TranslationResult

* Adjustment to get alignments; bergamot-translator-app has alignments available

* Accessing words instead of Ids

This code sets up access of word string_views from annotations instead
of printing Ids. However, we have segfault. This is likely due to
targetRanges not being set, pending from
#25.

Could also be a rogue EOS token which we're filtering for in string_view
annotations, but not so in alignments.

* Switching to browsermt/marian-dev@jp/decode-string-view for targetTokenRanges

* Target word byte range annotations available

Issues corresponding to #25 should be resolved. There is still a
segfault. Could be due to EOS. Pending investigation.

* Bugfix: Tokens for alignments are now through.

Was not EOS.

* browsermt/marian-dev@master

ByteRange changes work downstream and has been merged to master.
Updating submodule to point to master.

* Style and documentation enhancements: response.cpp

* Style and documentation enhancements: TranslationResult.h

* Descriptions for SentenceRanges templating

* Switching to marian-dev@wasm-sync

* AnnotatedBlob can be copy-ctord/copy-assigned

* TranslationResult: Empty ctor + WASM Bindings

Allows empty construction of TranslationResult. Using this empty
constructor, WASM bindings are adjusted. Unsure of the results, maybe
@abhi-agg can test.

* Cosmetic: SentenceRangesT -> Annotation

- SentenceRangesT is renamed to AnnotationT;
- Further comments to explain heavily templated files.

* Response: Cleaning up unused members and adding docs

* Adding quality scores - attempt

* Stub QualityScores

This adjustment adds capability to get "scores", which should
potentially indicate how confident (at least relative in a
target-sentence) should be. This enables writing the code forward for
TranslationResult, and an example quality-score people can be pointed
at.

- These are not between [0,1] yet.
- In addition, guards to check out-of-bounds access have been placed so
  illegal accesses are caught early on during development.

* Removing token debug statements

* Reworking Annotation without templates

mozilla#8 provides
ByteRanges.

- This ByteRange data-type is used in Annotation and converted
  to marian::string_view(=absl::string-view) on demand.
- Since Annotation[using ByteRange] is not bound to anything else, it
  can be unit tested. A unit test is added (originally to test
  independently for integration after).
- Annotation with ByteRange is now propogated across marian::bergamot
  and functionality matched to how it was previously working.

This eliminates the string-view conversion and template code.

* Nit: Removing std::endl flushes

* Bring TranslationResult and Response closer

Helps #53.

In preparation , the data-export types for Quality and Alignment are
pushed down to Response from TranslationResult and computed during
construction. This brings TranslationResult closer to Response, paving
way to avoid having two TranslationResults.

histories_ only remain for marian-decoder replacement usage, which can
be removed in a separate PR.

* Clean up hacks originally added for a unit-test to compile

* Moving Annotation functions to cpp and documenting header file

* Shifting alignments, qualityScore testing capability into main-mts

* Restore Unified API files to previous state

* Adaptations to fix Response with Quality, Alignments to connect to old Unified API

* Missing reset on TranslationResultBindings

* Cleaning up Response documentation to reflect newer code

* Minor adjustments to get build back after main sync

* Marian seems to make available Catch somehow

* Disable COMPILE_BERGAMOT_TESTS for WASM

* Add COMPILE_BERGAMOT_TESTS as a CMakeDependent option

* Use the COMPILE_TESTS flag instead to skip macos.yml

* Trigger unit-tests on GitHub runners for Annotation

* Reordering enable_testing() to before inclusion of test directory

* doc constructs required to operate with alignments

Documents with doxygen compatible documentation for Response,
AnnotatedBlob, Annotation, ByteRange.

Incorporates doxygen compatible documentation for

* Updates ByteRange consistent with general C++

Also little documentation enhancements in the process.

* Updating marian-dev@9337105

* Copy-paste documentation because lazy

* Turn off autoformat and manually edit to fix style changes

* AnnotatedBlob -> AnnotatedText; blob -> text

* text.text in test app renamed

* text of text -> blob of text in places of documentation
  • Loading branch information
Jerin Philip committed Mar 31, 2021
1 parent e0dca1b commit bfb5e78
Show file tree
Hide file tree
Showing 20 changed files with 509 additions and 222 deletions.
7 changes: 3 additions & 4 deletions .github/workflows/native-full_marian-mac.yml
Original file line number Diff line number Diff line change
Expand Up @@ -45,10 +45,9 @@ jobs:
working-directory: build
run: make -j2

# Removing unit-tests, taken care of in browsermt/marian-dev
# - name: Run unit tests
# - working-directory: build
# - run: make test
- name: Run unit tests
working-directory: build
run: make test

- name: Print versions
working-directory: build
Expand Down
12 changes: 5 additions & 7 deletions .github/workflows/native-full_marian-ubuntu.yml
Original file line number Diff line number Diff line change
Expand Up @@ -103,13 +103,11 @@ jobs:
working-directory: build
run: make -j2

# Removing unit-tests, taken care of in browsermt/marian-dev
# TODO: add a flag to CMake to compile unit tests only on CPU
# - name: Run unit tests
# working-directory: build
# run: make test
# # GitHub-hosted VMs do not have GPUs, so can not be run in CUDA builds
# if: matrix.gpu == false
- name: Run unit tests
working-directory: build
run: make test
# GitHub-hosted VMs do not have GPUs, so can not be run in CUDA builds
if: matrix.gpu == false

- name: Print versions
working-directory: build
Expand Down
8 changes: 8 additions & 0 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,8 @@ include(CMakeDependentOption)
# Project specific cmake options
option(COMPILE_WASM "Compile for WASM" OFF)
option(USE_WASM_COMPATIBLE_SOURCES "Use wasm compatible sources" ON)
option(COMPILE_TESTS "Compile bergamot-tests" OFF)

SET(PACKAGE_DIR "" CACHE STRING "Directory including all the files to be packaged (pre-loaded) in wasm builds")

# Set 3rd party submodule specific cmake options for this project
Expand Down Expand Up @@ -56,6 +58,11 @@ if(COMPILE_WASM)
list(APPEND WASM_COMPILE_FLAGS -Wno-error=pthreads-mem-growth)
endif(COMPILE_WASM)

# Needs to be enabled before including the folder containing tests (src/tests)
if(COMPILE_TESTS)
enable_testing()
endif(COMPILE_TESTS)

add_subdirectory(3rd_party)
add_subdirectory(src)

Expand All @@ -64,3 +71,4 @@ if(COMPILE_WASM)
else()
add_subdirectory(app)
endif(COMPILE_WASM)

2 changes: 1 addition & 1 deletion app/service-cli-bytearray.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ int main(int argc, char *argv[]) {
std::future<Response> responseFuture = service.translate(std::move(input));
responseFuture.wait();
Response response = responseFuture.get();
std::cout << response.translation() << std::endl;
std::cout << response.target.text << std::endl;

// Clear the memory used for the byte array
free(model_bytes); // Ideally, this should be done after the translation model has been gracefully shut down.
Expand Down
51 changes: 50 additions & 1 deletion app/service-cli.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,56 @@ int main(int argc, char *argv[]) {
std::future<Response> responseFuture = service.translate(std::move(input));
responseFuture.wait();
Response response = responseFuture.get();
std::cout << response.translation() << std::endl;

std::cout << "[original]: " << response.source.text << '\n';
std::cout << "[translated]: " << response.target.text << '\n';
for (int sentenceIdx = 0; sentenceIdx < response.size(); sentenceIdx++) {
std::cout << " [src Sentence]: " << response.source.sentence(sentenceIdx)
<< '\n';
std::cout << " [tgt Sentence]: " << response.target.sentence(sentenceIdx)
<< '\n';
std::cout << "Alignments" << '\n';
typedef std::pair<size_t, float> Point;

// Initialize a point vector.
std::vector<std::vector<Point>> aggregate(
response.source.numWords(sentenceIdx));

// Handle alignments
auto &alignments = response.alignments[sentenceIdx];
for (auto &p : alignments) {
aggregate[p.src].emplace_back(p.tgt, p.prob);
}

for (size_t src = 0; src < aggregate.size(); src++) {
std::cout << response.source.word(sentenceIdx, src) << ": ";
for (auto &p : aggregate[src]) {
std::cout << response.target.word(sentenceIdx, p.first) << "("
<< p.second << ") ";
}
std::cout << '\n';
}

// Handle quality.
auto &quality = response.qualityScores[sentenceIdx];
std::cout << "Quality: whole(" << quality.sequence
<< "), tokens below:" << '\n';
size_t wordIdx = 0;
bool first = true;
for (auto &p : quality.word) {
if (first) {
first = false;
} else {
std::cout << " ";
}
std::cout << response.target.word(sentenceIdx, wordIdx) << "(" << p
<< ")";
wordIdx++;
}
std::cout << '\n';
}
std::cout << "--------------------------\n";
std::cout << '\n';

return 0;
}
8 changes: 7 additions & 1 deletion src/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -1 +1,7 @@
add_subdirectory(translator)
add_subdirectory(translator)

if(COMPILE_TESTS)
# Catch currently comes from marian sources.
add_subdirectory(tests)
endif(COMPILE_TESTS)

22 changes: 22 additions & 0 deletions src/tests/CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# Unit tests
set(UNIT_TESTS
annotation_tests
)

foreach(test ${UNIT_TESTS})
add_executable("run_${test}" run_tests.cpp "${test}.cpp")
target_include_directories("run_${test}" PRIVATE ${CATCH_INCLUDE_DIR} "${CMAKE_SOURCE_DIR}/src")

if(CUDA_FOUND)
target_link_libraries("run_${test}" ${EXT_LIBS} marian ${EXT_LIBS} marian_cuda ${EXT_LIBS} Catch bergamot-translator)
else(CUDA_FOUND)
target_link_libraries("run_${test}" marian ${EXT_LIBS} Catch bergamot-translator)
endif(CUDA_FOUND)

if(msvc)
# disable c4305: truncation from 'double' to '_ty'
target_compile_options("run_${test}" public /wd4305)
endif(msvc)

add_test(NAME ${test} COMMAND "run_${test}")
endforeach(test)
73 changes: 73 additions & 0 deletions src/tests/annotation_tests.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
#include "catch.hpp"
#include "translator/sentence_ranges.h"
#include <random>
#include <vector>

using namespace marian::bergamot;

TEST_CASE("Test Annotation API with random sentences") {
/// Objective here is to test insertion for sentences, and that whatever comes
/// out adheres to the way it was inserted. Towards this, we keep externally
/// which sentence went in where and try to use accessor methods on
/// AnnotatedText to check if what we have as ground-truth by construction is
/// consistent with what is returned.
size_t sentences = 20;
size_t maxWords = 40;

std::mt19937 randomIntGen_;
randomIntGen_.seed(42);

AnnotatedText testAnnotation;
std::vector<std::vector<ByteRange>> sentenceWords;
std::vector<ByteRange> Words;

for (size_t idx = 0; idx < sentences; idx++) {
if (idx != 0)
testAnnotation.text += "\n";

Words.clear();
size_t words = randomIntGen_() % maxWords + 1;
Words.reserve(words);
for (size_t idw = 0; idw < words; idw++) {
size_t before = testAnnotation.text.size();
std::string word = std::to_string(idx) + "-" + std::to_string(idw);
testAnnotation.text += word;
if (idw != 0)
testAnnotation.text += " ";
Words.push_back((ByteRange){before, before + word.size() - 1});
}
// std::cout << std::endl;

sentenceWords.push_back(Words);
}

// std::cout << "Inserting words:" << std::endl;
std::vector<std::vector<marian::string_view>> byteRanges;
for (auto &sentence : sentenceWords) {
std::vector<marian::string_view> wordByteRanges;
for (auto &word : sentence) {
marian::string_view wordView(&testAnnotation.text[word.begin],
word.end - word.begin);
wordByteRanges.push_back(wordView);
// std::cout << std::string(wordView) << " ";
}
testAnnotation.addSentence(wordByteRanges);
byteRanges.push_back(wordByteRanges);
// std::cout << std::endl;
}

// std::cout << "From container: " << std::endl;
for (int idx = 0; idx < sentenceWords.size(); idx++) {
for (int idw = 0; idw < sentenceWords[idx].size(); idw++) {
ByteRange expected = sentenceWords[idx][idw];
ByteRange obtained = testAnnotation.wordAsByteRange(idx, idw);
// std::cout << std::string(testAnnotation.word(idx, idw)) << " ";
CHECK(expected.begin == obtained.begin);
CHECK(expected.end == obtained.end);

std::string expected_string = std::string(byteRanges[idx][idw]);
CHECK(expected_string == std::string(testAnnotation.word(idx, idw)));
}
// std::cout << std::endl;
}
}
2 changes: 2 additions & 0 deletions src/tests/run_tests.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
#define CATCH_CONFIG_MAIN
#include "catch.hpp"
21 changes: 8 additions & 13 deletions src/translator/TranslationModel.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -32,24 +32,19 @@ TranslationModel::translate(std::vector<std::string> &&texts,
intermediate.wait();
auto marianResponse(std::move(intermediate.get()));

// This mess because marian::string_view != std::string_view
std::string source, translation;
marian::bergamot::Response::SentenceMappings mSentenceMappings;
marianResponse.move(source, translation, mSentenceMappings);

// Convert to UnifiedAPI::TranslationResult
TranslationResult::SentenceMappings sentenceMappings;
for (auto &p : mSentenceMappings) {
std::string_view src(p.first.data(), p.first.size()),
tgt(p.second.data(), p.second.size());
sentenceMappings.emplace_back(src, tgt);
for (size_t idx = 0; idx < marianResponse.size(); idx++) {
marian::string_view src = marianResponse.source.sentence(idx);
marian::string_view tgt = marianResponse.target.sentence(idx);
sentenceMappings.emplace_back(std::string_view(src.data(), src.size()),
std::string_view(tgt.data(), tgt.size()));
}

// In place construction.
translationResults.emplace_back(
std::move(source), // &&marianResponse.source_
std::move(translation), // &&marianResponse.translation_
std::move(sentenceMappings) // &&sentenceMappings
std::move(marianResponse.source.text), // &&marianResponse.source_
std::move(marianResponse.target.text), // &&marianResponse.translation_
std::move(sentenceMappings) // &&sentenceMappings
);
}

Expand Down
9 changes: 3 additions & 6 deletions src/translator/request.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -12,12 +12,10 @@ namespace bergamot {

// -----------------------------------------------------------------
Request::Request(size_t Id, size_t lineNumberBegin,
std::vector<Ptr<Vocab const>> &vocabs, std::string &&source,
Segments &&segments, SentenceRanges &&sourceRanges,
std::promise<Response> responsePromise)
std::vector<Ptr<Vocab const>> &vocabs, AnnotatedText &&source,
Segments &&segments, std::promise<Response> responsePromise)
: Id_(Id), lineNumberBegin_(lineNumberBegin), vocabs_(&vocabs),
source_(std::move(source)), segments_(std::move(segments)),
sourceRanges_(std::move(sourceRanges)),
response_(std::move(responsePromise)) {

counter_ = segments_.size();
Expand Down Expand Up @@ -48,8 +46,7 @@ void Request::processHistory(size_t index, Ptr<History> history) {
void Request::completeRequest() {
// Request no longer needs to hold the content, can transfer it to
// Response.
Response response(std::move(source_), std::move(sourceRanges_),
std::move(histories_), *vocabs_);
Response response(std::move(source_), std::move(histories_), *vocabs_);
response_.set_value(std::move(response));
}

Expand Down
12 changes: 5 additions & 7 deletions src/translator/request.h
Original file line number Diff line number Diff line change
@@ -1,9 +1,9 @@
//
// Defines:
//
// Request: holds the input blob of a text, Segments (vector<Words>) which are
// Request: holds the input text of a text, Segments (vector<Words>) which are
// to go to the batching mechanism and alignments between the processed
// segments and the input blob (sourceTokenRanges). In addition, Request takes
// segments and the input text (sourceTokenRanges). In addition, Request takes
// care of the barrier which fires when all the Segments in a request are done
// translating by the workers (BatchTranslator).
// TODO(jerinphilip): Extend Request with notions of Priority (sequence,
Expand Down Expand Up @@ -36,9 +36,8 @@ namespace bergamot {
class Request {
public:
Request(size_t Id, size_t lineNumberBegin,
std::vector<Ptr<Vocab const>> &vocabs_, std::string &&source,
Segments &&segments, SentenceRanges &&sourceTokenRanges,
std::promise<Response> responsePromise);
std::vector<Ptr<Vocab const>> &vocabs_, AnnotatedText &&source,
Segments &&segments, std::promise<Response> responsePromise);

// Obtain the count of tokens in the segment correponding to index. Used to
// insert sentence from multiple requests into the corresponding size bucket.
Expand Down Expand Up @@ -77,9 +76,8 @@ class Request {
// string_views of the text corresponding to these words, pointing to
// sequences in source_. histories_ is a buffer which eventually stores the
// translations of each segment in the corresponding index.
std::string source_;
AnnotatedText source_;
Segments segments_;
SentenceRanges sourceRanges_;
std::vector<Ptr<History>> histories_;

// Members above are moved into newly constructed Response on completion
Expand Down
Loading

0 comments on commit bfb5e78

Please sign in to comment.