QualityEstimation Implementation [RFC] #56

Closed
jerinphilip opened this issue Mar 18, 2021 · 11 comments

@jerinphilip
Contributor

jerinphilip commented Mar 18, 2021

Input: translated text, source text, model scores for tokens, and the tokenization information needed to make sense of the model scores.

For each sentence, the output is expected to contain:

  • sentence-level quality score: float
  • word-level quality scores: vector<float>, one entry per Word

where a Word is a space-separated word of the sentence (Mozilla prefers word-level, not subword-level, scores). Continuous values are preferred, since they leave more room for experimentation.

Let the output be a struct called QualityEstimate. An implementation starting from the skeleton below is tentatively going to be used by Service to make QualityEstimate a member of Response. (This has to happen in Service because the layer above it, in UnifiedAPI, doesn't have access to logprobs.)

class QualityEstimator {
public:
  QualityEstimator(Args…) {
    // Use the constructor to load and initialize any trained models.
    // This is where I expect you to load any neural nets into a graph,
    // or prepare the model parameters (e.g. logistic regression) if you're using something simpler.
  }
  QualityEstimate quality(Histories &histories, Response &response) {
    // AnnotatedText has the blob of text and the sentence/word-token information, which should be extracted.
    // modelScores are logprobs; they are already accessible and ready to use.
    // … your calculation code here
  }
};

This is to be built native first, and when readied exported to WASM.

@ugermann @abhi-agg @fredblain @mfomicheva /cc @kpu

@jerinphilip
Contributor Author

jerinphilip commented Mar 18, 2021

Now for the implementation details. To conveniently prototype the QualityEstimator class (as @ugermann mentioned in the meeting), note that Response already contains most of the information you suggested as inputs to quality estimation. The logprobs and target SentencePiece annotations required by the input specification come from the alignments PR, which, after much struggle, is now in.

There is a command-line test app which already accesses Response and prints the model scores (currently exposed as qualityScores). They are printed by the test app at:

// Handle quality.
auto &quality = response.qualityScores[sentenceIdx];
std::cout << "Quality: whole(" << quality.sequence
          << "), tokens below:" << '\n';
size_t wordIdx = 0;
bool first = true;
for (auto &p : quality.word) {
  if (first) {
    first = false;
  } else {
    std::cout << " ";
  }
  std::cout << response.target.word(sentenceIdx, wordIdx) << "(" << p
            << ")";
  wordIdx++;
}
std::cout << '\n';
}
std::cout << "--------------------------\n";

The output looks like this, for a sample text:

Quality: whole(547.219), tokens below:
Das(20.1062)  Bergamo(20.6754) t(26.2467) -(18.5193) Projekt(29.8694)  wird(18.8979)  die(18.8123)  maschine(16.202) lle(23.2749)  Übersetzung(20.4364)  von(15.9851)  Client(16.8951) -(15.9545) S(17.2791) ide(23.0952) -(14.7887) Maschine(20.0458) n(20.454)  in(16.5911)  einem(18.3833)  Web(20.3965) browser(25.2303)  hinzufügen(17.8319)  und(24.5277)  verbessern(22.4654) .(23.9528)

These values are believed to be on a negative log-probability (-logprob) scale, which is the requested input to QualityEstimator. You can pick up from here and implement the QualityEstimator class.

In service-cli.cpp (formerly main-mts.cpp), you can pick up Response and, below it, create an instance of QualityEstimator to prototype and test the quality-estimation functionality (with or without Translator). I'd expect you to supply a quality_estimator.h and quality_estimator.cpp implementing QualityEstimator.

When it is ready and complete, it will (hopefully) not be much work to induct QualityEstimator into Service (which is what interacts with the browser extension in some capacity), replacing the logprobs now returned as qualityScores with your version of the quality scores. There is currently no constraint on what QualityEstimate needs to look like.

Instructions to build the command-line app are here.

@mfomicheva

@jerinphilip has this alignment PR been merged? If so, I don't see bergamot-translator/app/main-mts.cpp in main. Where can I find it?

@jerinphilip
Contributor Author

jerinphilip commented Apr 2, 2021

The alignment PR is merged. The file was renamed to service-cli.cpp in a parallel attempt to sanitize filenames (7fd5d0f). Docs to build it are at doc/marian-integration.md.

There will be some more shuffling and moving around to clean up and better structure the source. However, if you stick to providing the QualityEstimator class with a defined input (source and target text with subword information, modelScores) and output, it should be okay. I'm working towards giving you access to Histories later on, which is a richer marian object (it holds more than logprobs, so it will allow you to experiment one level deeper).

@abarbosa94
Contributor

Hi there, I managed to build the project locally by following the provided docs.

However, when I tried to run: ./app/service-cli "${ARGS[@]}" < path-to-input-file, where path-to-input-file == input_file.en with the following content:

A Republican strategy to counter the re-election of Obama

I am getting the following error: Error: Incorrect magic in binary shortlist

Here is the full stack trace of the error:

./app/service-cli "${ARGS[@]}" < input_file.en
[2021-05-16 14:55:43] [marian] Marian v1.9.56 03db505f 2021-05-09 18:41:30 +0100
[2021-05-16 14:55:43] [marian] Running on brsppn-hwxvfv2.ldap.quintoandar.com.br as process 27636 with command line:
[2021-05-16 14:55:43] [marian] ./app/service-cli -m /home/andrebarbosa/Downloads/enes.student.tiny11/model.intgemm.alphas.bin --beam-size 1 --skip-cost --shortlist /home/andrebarbosa/Downloads/enes.student.tiny11/lex.s2t.gz false --vocabs /home/andrebarbosa/Downloads/enes.student.tiny11/vocab.esen.spm /home/andrebarbosa/Downloads/enes.student.tiny11/vocab.esen.spm --cpu-threads 4 --max-length-break 1024 --mini-batch-words 1024 --ssplit-mode paragraph
[2021-05-16 14:55:43] [config] alignment: ""
[2021-05-16 14:55:43] [config] allow-unk: false
[2021-05-16 14:55:43] [config] authors: false
[2021-05-16 14:55:43] [config] beam-size: 1
[2021-05-16 14:55:43] [config] bert-class-symbol: "[CLS]"
[2021-05-16 14:55:43] [config] bert-mask-symbol: "[MASK]"
[2021-05-16 14:55:43] [config] bert-masking-fraction: 0.15
[2021-05-16 14:55:43] [config] bert-sep-symbol: "[SEP]"
[2021-05-16 14:55:43] [config] bert-train-type-embeddings: true
[2021-05-16 14:55:43] [config] bert-type-vocab-size: 2
[2021-05-16 14:55:43] [config] best-deep: false
[2021-05-16 14:55:43] [config] build-info: ""
[2021-05-16 14:55:43] [config] check-bytearray: true
[2021-05-16 14:55:43] [config] cite: false
[2021-05-16 14:55:43] [config] clip-gemm: 0
[2021-05-16 14:55:43] [config] cpu-threads: 4
[2021-05-16 14:55:43] [config] dec-cell: ssru
[2021-05-16 14:55:43] [config] dec-cell-base-depth: 2
[2021-05-16 14:55:43] [config] dec-cell-high-depth: 1
[2021-05-16 14:55:43] [config] dec-depth: 2
[2021-05-16 14:55:43] [config] devices:
[2021-05-16 14:55:43] [config]   - 0
[2021-05-16 14:55:43] [config] dim-emb: 256
[2021-05-16 14:55:43] [config] dim-rnn: 1024
[2021-05-16 14:55:43] [config] dim-vocabs:
[2021-05-16 14:55:43] [config]   - 32000
[2021-05-16 14:55:43] [config]   - 32000
[2021-05-16 14:55:43] [config] dump-config: ""
[2021-05-16 14:55:43] [config] dump-quantmult: false
[2021-05-16 14:55:43] [config] enc-cell: gru
[2021-05-16 14:55:43] [config] enc-cell-depth: 1
[2021-05-16 14:55:43] [config] enc-depth: 6
[2021-05-16 14:55:43] [config] enc-type: bidirectional
[2021-05-16 14:55:43] [config] gemm-precision: float32
[2021-05-16 14:55:43] [config] ignore-model-config: false
[2021-05-16 14:55:43] [config] input:
[2021-05-16 14:55:43] [config]   - stdin
[2021-05-16 14:55:43] [config] input-types:
[2021-05-16 14:55:43] [config]   []
[2021-05-16 14:55:43] [config] interpolate-env-vars: false
[2021-05-16 14:55:43] [config] layer-normalization: false
[2021-05-16 14:55:43] [config] lemma-dim-emb: 0
[2021-05-16 14:55:43] [config] log: ""
[2021-05-16 14:55:43] [config] log-level: info
[2021-05-16 14:55:43] [config] log-time-zone: ""
[2021-05-16 14:55:43] [config] max-length: 1000
[2021-05-16 14:55:43] [config] max-length-break: 1024
[2021-05-16 14:55:43] [config] max-length-crop: false
[2021-05-16 14:55:43] [config] max-length-factor: 3
[2021-05-16 14:55:43] [config] maxi-batch: 1
[2021-05-16 14:55:43] [config] maxi-batch-sort: none
[2021-05-16 14:55:43] [config] mini-batch: 1
[2021-05-16 14:55:43] [config] mini-batch-words: 1024
[2021-05-16 14:55:43] [config] models:
[2021-05-16 14:55:43] [config]   - /home/andrebarbosa/Downloads/enes.student.tiny11/model.intgemm.alphas.bin
[2021-05-16 14:55:43] [config] n-best: false
[2021-05-16 14:55:43] [config] no-spm-decode: false
[2021-05-16 14:55:43] [config] normalize: 0
[2021-05-16 14:55:43] [config] num-devices: 0
[2021-05-16 14:55:43] [config] output: stdout
[2021-05-16 14:55:43] [config] output-approx-knn:
[2021-05-16 14:55:43] [config]   []
[2021-05-16 14:55:43] [config] output-omit-bias: false
[2021-05-16 14:55:43] [config] output-sampling: false
[2021-05-16 14:55:43] [config] precision:
[2021-05-16 14:55:43] [config]   - float32
[2021-05-16 14:55:43] [config] quiet: false
[2021-05-16 14:55:43] [config] quiet-translation: false
[2021-05-16 14:55:43] [config] relative-paths: false
[2021-05-16 14:55:43] [config] right-left: false
[2021-05-16 14:55:43] [config] seed: 0
[2021-05-16 14:55:43] [config] shortlist:
[2021-05-16 14:55:43] [config]   - /home/andrebarbosa/Downloads/enes.student.tiny11/lex.s2t.gz
[2021-05-16 14:55:43] [config]   - false
[2021-05-16 14:55:43] [config] skip: false
[2021-05-16 14:55:43] [config] skip-cost: true
[2021-05-16 14:55:43] [config] ssplit-mode: paragraph
[2021-05-16 14:55:43] [config] ssplit-prefix-file: ""
[2021-05-16 14:55:43] [config] tied-embeddings: false
[2021-05-16 14:55:43] [config] tied-embeddings-all: true
[2021-05-16 14:55:43] [config] tied-embeddings-src: false
[2021-05-16 14:55:43] [config] transformer-aan-activation: swish
[2021-05-16 14:55:43] [config] transformer-aan-depth: 2
[2021-05-16 14:55:43] [config] transformer-aan-nogate: false
[2021-05-16 14:55:43] [config] transformer-decoder-autoreg: rnn
[2021-05-16 14:55:43] [config] transformer-depth-scaling: false
[2021-05-16 14:55:43] [config] transformer-dim-aan: 1536
[2021-05-16 14:55:43] [config] transformer-dim-ffn: 1536
[2021-05-16 14:55:43] [config] transformer-ffn-activation: relu
[2021-05-16 14:55:43] [config] transformer-ffn-depth: 2
[2021-05-16 14:55:43] [config] transformer-guided-alignment-layer: last
[2021-05-16 14:55:43] [config] transformer-heads: 8
[2021-05-16 14:55:43] [config] transformer-no-projection: false
[2021-05-16 14:55:43] [config] transformer-pool: false
[2021-05-16 14:55:43] [config] transformer-postprocess: dan
[2021-05-16 14:55:43] [config] transformer-postprocess-emb: d
[2021-05-16 14:55:43] [config] transformer-postprocess-top: ""
[2021-05-16 14:55:43] [config] transformer-preprocess: ""
[2021-05-16 14:55:43] [config] transformer-tied-layers:
[2021-05-16 14:55:43] [config]   []
[2021-05-16 14:55:43] [config] transformer-train-position-embeddings: false
[2021-05-16 14:55:43] [config] tsv: false
[2021-05-16 14:55:43] [config] tsv-fields: 0
[2021-05-16 14:55:43] [config] type: transformer
[2021-05-16 14:55:43] [config] ulr: false
[2021-05-16 14:55:43] [config] ulr-dim-emb: 0
[2021-05-16 14:55:43] [config] ulr-trainable-transformation: false
[2021-05-16 14:55:43] [config] use-legacy-batching: false
[2021-05-16 14:55:43] [config] version: v1.8.40 b3a2310 2020-01-17 21:52:33 +0000
[2021-05-16 14:55:43] [config] vocabs:
[2021-05-16 14:55:43] [config]   - /home/andrebarbosa/Downloads/enes.student.tiny11/vocab.esen.spm
[2021-05-16 14:55:43] [config]   - /home/andrebarbosa/Downloads/enes.student.tiny11/vocab.esen.spm
[2021-05-16 14:55:43] [config] weights:
[2021-05-16 14:55:43] [config]   []
[2021-05-16 14:55:43] [config] word-penalty: 0
[2021-05-16 14:55:43] [config] word-scores: false
[2021-05-16 14:55:43] [config] workspace: 512
[2021-05-16 14:55:43] [config] Loaded model has been created with Marian v1.8.40 b3a2310 2020-01-17 21:52:33 +0000
[2021-05-16 14:55:43] [data] Loading SentencePiece vocabulary from file /home/andrebarbosa/Downloads/enes.student.tiny11/vocab.esen.spm
[2021-05-16 14:55:43] Missing list of protected prefixes for sentence splitting. Set with --ssplit-prefix-file.
[2021-05-16 14:55:43] [data] Loading binary shortlist from buffer with check=true
[2021-05-16 14:55:43] [data] Loading binary shortlist from buffer with check=true
[2021-05-16 14:55:43] Error: Incorrect magic in binary shortlist
[2021-05-16 14:55:43] Error: Aborted from void marian::data::BinaryShortlistGenerator::load(const void*, size_t, bool) in /home/andrebarbosa/bergamot-translator/3rd_party/marian-dev/src/data/shortlist.cpp:175
[2021-05-16 14:55:43] Error: Incorrect magic in binary shortlist
Aborted from void marian::data::BinaryShortlistGenerator::load(const void*, size_t, bool) in /home/andrebarbosa/bergamot-translator/3rd_party/marian-dev/src/data/shortlist.cpp:175
[data] Loading binary shortlist from buffer with check=true
[data] Loading binary shortlist from buffer with check=true
[2021-05-16 14:55:43] Error: Incorrect magic in binary shortlist
[2021-05-16 14:55:43] Error: Incorrect magic in binary shortlist
[2021-05-16 14:55:43] Error: Aborted from void marian::data::BinaryShortlistGenerator::load(const void*, size_t, bool) in /home/andrebarbosa/bergamot-translator/3rd_party/marian-dev/src/data/shortlist.cpp:175
[2021-05-16 14:55:43] Error: Aborted from void marian::data::BinaryShortlistGenerator::load(const void*, size_t, bool) in /home/andrebarbosa/bergamot-translator/3rd_party/marian-dev/src/data/shortlist.cpp:175

[CALL STACK]
[0x558788e01fa2]                                                       + 0x215fa2
[0x558788e02e5d]                                                       + 0x216e5d
[0x558788cfa371]                                                       + 0x10e371
[0x558788ceb6de]                                                       + 0xff6de
[0x7f076edc8d84]                                                       + 0xd6d84
[0x7f076eb6e609]                                                       + 0x9609
[0x7f076ea95293]    clone                                              + 0x43

Aborted (core dumped)

I investigated a little and found that the shortlist feature was merged from here, but I don't know whether the two are related.

@jerinphilip do you have a clue about what I might be doing wrong?

Thanks!

@abarbosa94
Contributor

I have tried both

[2021-05-16 15:14:48] [config] shortlist:
[2021-05-16 15:14:48] [config]   - /home/andrebarbosa/Downloads/enes.student.tiny11/lex.s2t.gz
[2021-05-16 15:14:48] [config]   - false

And

[2021-05-16 15:15:40] [config] shortlist:
[2021-05-16 15:15:40] [config]   - /home/andrebarbosa/Downloads/enes.student.tiny11/lex.s2t.gz
[2021-05-16 15:15:40] [config]   - 50
[2021-05-16 15:15:40] [config]   - 50

@jerinphilip
Contributor Author

@abarbosa94 We know for sure that the tests pass from a clean install (CI confirms it), so can you try the script over there? It should be similar enough to your setup that you can tell which variables to update.

https://github.com/browsermt/bergamot-translator-tests/blob/main/tests/basic/test_service-cli_intgemm_8bit.cpu-threads.4.sh

I will update the documentation shortly. This is my bad, sorry.

@kpu
Member

kpu commented May 16, 2021

There are a few issues here for @qianqianzhu to fix.

  1. Apparently not all of the config files were changed to use .bin: see "Config files not using binary lexical shortlist", students#33.
  2. To save on download bandwidth, we shouldn't be shipping lex.s2t.gz.
  3. The binary shortlist shouldn't break loading from text. To be clear, if this breaks:
[2021-05-16 15:15:40] [config] shortlist:
[2021-05-16 15:15:40] [config]   - /home/andrebarbosa/Downloads/enes.student.tiny11/lex.s2t.gz
[2021-05-16 15:15:40] [config]   - 50
[2021-05-16 15:15:40] [config]   - 50

then file an issue against https://github.com/browsermt/marian-dev.

It should work if you patch the config file to set shortlist to /home/andrebarbosa/Downloads/enes.student.tiny11/lex.s2t.bin and false.
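
The "Incorrect magic" error above is consistent with a gzip-compressed text shortlist (lex.s2t.gz) being handed to the binary loader. As a quick diagnostic, a file can be checked for the standard gzip magic bytes (0x1f 0x8b) before it is passed as a binary shortlist. This is a standalone sketch: `looksLikeGzip` is a hypothetical helper, not a marian or bergamot-translator function, and it says nothing about which magic value the binary shortlist format itself expects.

```cpp
#include <fstream>
#include <string>

// Returns true if the file at `path` starts with the gzip magic bytes
// 0x1f 0x8b. Returns false if the file cannot be opened or is shorter
// than two bytes.
bool looksLikeGzip(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  char header[2] = {0, 0};
  if (!in.read(header, 2)) {
    return false;
  }
  return static_cast<unsigned char>(header[0]) == 0x1f &&
         static_cast<unsigned char>(header[1]) == 0x8b;
}
```

If this returns true for the path given to --shortlist, the file is a gzip archive rather than a binary shortlist, and switching to the .bin file as described above is the likely fix.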

@abarbosa94
Contributor

abarbosa94 commented May 16, 2021

Hi guys, thanks for the quick feedback. Indeed, the instructions provided by @kpu worked :)

Since I'm using a CPU-only machine, I needed a small tweak to make this "hello world" run smoothly. For future reference:

  • Install MKL:
wget -qO- 'https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS-2019.PUB' | sudo apt-key add -
sudo sh -c 'echo deb https://apt.repos.intel.com/mkl all main > /etc/apt/sources.list.d/intel-mkl.list'
sudo apt-get update
sudo apt-get install intel-mkl-64bit-2020.0-088
cd build
cmake .. -DUSE_WASM_COMPATIBLE_SOURCE=off -DCMAKE_BUILD_TYPE=Release -DCOMPILE_CPU=on
make -j4

ARGS=(
    -m $MODEL_DIR/model.intgemm.alphas.bin 
    --vocabs 
        $MODEL_DIR/vocab.esen.spm #change for a different file
        $MODEL_DIR/vocab.esen.spm
    --ssplit-mode paragraph
    --beam-size 1
    --skip-cost
    --shortlist $MODEL_DIR/lex.s2t.bin false
    --int8shiftAlphaAll
    --cpu-threads 4
    --max-length-break 1024
    --mini-batch-words 1024
)

Sample output:

[original]: A Republican strategy to counter the re-election of Obama
[translated]: Una estrategia republicana para contrarrestar la reelección de Obama
 [src Sentence]: A Republican strategy to counter the re-election of Obama
 [tgt Sentence]: Una estrategia republicana para contrarrestar la reelección de Obama
Alignments
A: 
 Republic: 
an: 
 strategy: 
 to: 
 counter: 
 the: 
 re: 
-: 
election: 
 of: 
 Obama: 
Quality: whole(331.195), tokens below:
Una(24.1778)  estrategia(26.3681)  republican(25.3441) a(32.2952)  para(21.0713)  contrarrestar(22.4898)  la(20.8506)  re(25.0341) ele(32.8157) cción(35.9558)  de(21.5138)  Obama(21.0547)

Again, I really appreciate the quick assistance :)

@XapaJIaMnu
Collaborator

I think Mozilla was keeping its own fork of some models, which needs to be updated/synced with master. @abhi-agg?

@kpu
Member

kpu commented May 16, 2021

The instructions point to http://data.statmt.org/bergamot/models/deen/ende.student.tiny11.tar.gz ; don't poke Mozilla to fix things until we've fixed upstream.

@abarbosa94
Contributor

Hey guys, there is a first attempt at this solution in #173.

Feel free to analyze and criticize it. I decided to base the implementation on ONNX because I think it treats models agnostically and already has a lot of good features implemented that are aimed at inference performance.
