
This repository lets you build models for machine translation (MT) quality estimation (QE). It is clearly a Quest++ rip-off that I made in order to experiment with 'before BERT' QE.

The data:

  • English-German WMT18 sentences from the IT domain, translated by an in-house encoder-decoder attention-based NMT system (13,442 training and 1,000 development sentences).
  • After running ./scripts/, the data will be downloaded to data/sentence-level/features/en_de.
  • The usual 17 features used in WMT12-17 are used for the baseline system.
  • The WMT18 QE baseline model was SVM regression with an RBF kernel, with grid search used to optimise the relevant parameters. I tried to reproduce this in config/svc.cfg.
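As a rough illustration of that baseline setup, here is a minimal sklearn sketch of an RBF-kernel SVR tuned with grid search. The parameter grid, the feature scaling step, the scoring metric and the synthetic data are all assumptions made for the example; they are not taken from config/svc.cfg or from the WMT18 baseline.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def fit_svr_baseline(X, y, cv=3):
    """Grid-search an RBF-kernel SVR over a small, illustrative grid."""
    # Scaling is an assumption of this sketch; RBF kernels are scale-sensitive.
    pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
    # Hypothetical grid; the values are placeholders, not tuned settings.
    param_grid = {
        "svr__C": [1.0, 10.0],
        "svr__gamma": [0.01, 0.1],
        "svr__epsilon": [0.1, 0.2],
    }
    search = GridSearchCV(
        pipeline, param_grid, cv=cv, scoring="neg_mean_absolute_error"
    )
    search.fit(X, y)
    return search


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 17))  # 17 synthetic stand-in features
    y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=60)  # synthetic scores
    search = fit_svr_baseline(X, y)
    print(search.best_params_)
```

The fitted `search` object can be used directly for prediction with `search.predict(X_new)`, since GridSearchCV refits the best pipeline on the full training data by default.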

Train model

The program takes a method, a config file and additional parameters as input.

For example, to train a model:

./ --train --config config/svc.yaml
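For readers who want to see how such an interface can be wired up, below is a minimal argparse sketch of an entry point with the modes used in this README (--train, --extract_features, --inference). It is not the repository's actual code: the function name `build_parser`, the help texts and the choice to make the modes mutually exclusive are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical QE command-line entry point."""
    parser = argparse.ArgumentParser(description="QE runner (illustrative sketch)")
    # Assumption: exactly one mode is chosen per invocation.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--train", action="store_true", help="train a model")
    mode.add_argument("--inference", action="store_true", help="score an input file")
    mode.add_argument(
        "--extract_features", action="store_true", help="extract features from a TSV"
    )
    parser.add_argument("--config", help="path to the configuration file")
    parser.add_argument("--input", help="path to the input TSV file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--train", "--config", "config/svc.yaml"])
    print(args.train, args.config)
```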

Preparing training corpora

To extract features from a TSV file (required columns: src and trg):

./ --extract_features \
                       --src_lm_path data/lm.tok.en \
                       --trg_lm_path data/ \
                       --trg_ncount_path data/ \
                       -i input.tsv -i output.tsv

Also remember to provide the SRILM path, either with export SRILM_PATH or with --srilm_path.
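Before running feature extraction, it can be handy to check that the input file really has the src and trg columns. The sketch below does this with the standard library only. It assumes the TSV has a header row and no quoting; the helper name `read_sentence_pairs` is made up for this example and is not part of the repository.

```python
import csv
import io

REQUIRED_COLUMNS = ("src", "trg")


def read_sentence_pairs(handle):
    """Return (src, trg) pairs from a TSV stream that has a header row."""
    # QUOTE_NONE keeps quote characters in sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        raise ValueError(
            f"input is missing required column(s): {', '.join(missing)}"
        )
    return [(row["src"], row["trg"]) for row in reader]


if __name__ == "__main__":
    sample = io.StringIO("src\ttrg\nOpen the file.\tDatei öffnen.\n")
    for src, trg in read_sentence_pairs(sample):
        print(src, "|||", trg)
```

For a file on disk, pass an open handle instead, for example `open("input.tsv", encoding="utf-8", newline="")`.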

Available learning methods

All available methods are taken from sklearn, so it is fairly easy to add others as well, but currently these are "supported":

Feature selection

To set up a feature selection algorithm add the "feature_selection" section to the configuration file. This section is independent of the "learning" section:

    method: LinearSVC
        cv: 10


Currently, the following feature selection algorithms are available:

  • Linear Support Vector Classification. The exposed parameters are:
    • penalty (default='l2')
    • loss (default='squared_hinge')
    • dual (default=True)
    • tol (default=1e-4)
    • C (default=1.0)
    • fit_intercept (default=True)
    • intercept_scaling (default=1)
    • max_iter (default=1000)

These parameters and the method are documented in the scikit-learn reference for sklearn.svm.LinearSVC.
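To give an idea of what LinearSVC-based feature selection looks like in sklearn, here is a minimal sketch that uses SelectFromModel with an L1-penalised LinearSVC, so that features with zero weight are dropped. This is one common sklearn pattern and not necessarily what this repository does internally. Because LinearSVC is a classifier, the sketch assumes discrete labels (here, synthetic binary ones); how continuous QE scores would be mapped to labels is not covered, and neither is the cv setting from the config.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC


def select_features(X, labels, C=1.0):
    """Fit a selector that keeps features with non-zero L1-SVC weights."""
    # penalty="l1" requires dual=False in LinearSVC.
    svc = LinearSVC(
        penalty="l1", loss="squared_hinge", dual=False, C=C, tol=1e-4, max_iter=1000
    )
    selector = SelectFromModel(svc)
    selector.fit(X, labels)
    return selector


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 17))  # 17 synthetic stand-in features
    labels = (X[:, 0] > 0).astype(int)  # assumption: binary labels
    selector = select_features(X, labels, C=0.05)
    print(selector.get_support())  # boolean mask of kept features
    print(selector.transform(X).shape)
```

A smaller C makes the L1 penalty relatively stronger, so fewer features survive.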


To run inference with a trained model on a given input:

./ --inference --config config/svc.yaml --input test.tsv