# Task 1: Sentence-Level Direct Assessment - WMT20

I am using this OpenKiwi tutorial as a starting point for WMT20 Shared Task 1: Sentence-Level Direct Assessment.

I have downloaded the training and development data, which consists of Wikipedia datasets, each with 7K sentences for training and 1K sentences for development. The data also includes information from the NMT model used to generate the translations (the model score for each sentence and the log probabilities for its words), as well as the title of the Wikipedia article each source sentence came from.

A predictor-estimator approach is already implemented in OpenKiwi; its predictor model is trained on the parallel data used to train the NMT model.

Since we already have a trained model, we only need to run the `train_estimator` step.

## Setup 

OpenKiwi requires a small amount of setup before it can run. If you have already completed this step, keep reading. Otherwise, please refer to the [setup instructions](https://github.com/Unbabel/KiwiCutter/blob/master/setup.md).

First, we load all the libraries needed to run this notebook. Note that most of these are for demonstration purposes and to make working in a notebook easier. They are not necessary for using OpenKiwi in a normal setting (kiwi itself should be enough).

In [1]:
import utils
import yaml
from ipywidgets import interact, fixed, Textarea
from functools import partial
%load_ext yamlmagic

[nltk_data] Downloading package punkt to /home/daniel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Install Kiwi

Installing Kiwi to use it as a package is a fairly simple procedure. The only thing you need to do is `pip install openkiwi`! In this case it should already be installed on your machine, so all that's left is to import it.

In [2]:
#!pip install openkiwi
import kiwi

### Downloading pre-trained models
We will begin by using a pre-trained OpenKiwi model to evaluate the quality of an existing translation. The pre-trained models made available with OpenKiwi focus mainly on En-De, which has been the primary language pair for the WMT20 shared task on quality estimation.

The baseline system is a neural predictor-estimator approach implemented in OpenKiwi (Kepler et al., 2019), where the predictor model is trained on the parallel data used to train the NMT model (see data below). To foster improvements over this baseline, the trained predictor models are provided for all language pairs (they can be used for both Task 1 and Task 2).

The following cell runs a method that downloads and extracts the archive containing OpenKiwi's pre-trained models, unless you have already done so.

In [6]:
# Download and extract pre-trained kiwi models

OK_url = 'https://www.quest.dcs.shef.ac.uk/wmt20_files_qe/en-de.openkiwi-predictor.tar.gz'

utils.download_kiwi(OK_url)

# The .tar.gz archive is not extracted automatically (see the output below),
# so you will need to go to the trained_models folder and extract it yourself.

Getting filename
Checking if file already downloaded
Extracting trained_models/en-de.openkiwi-predictor.tar.gz
File type not supported
Done extracting


PosixPath('trained_models/en-de.openkiwi-predictor.tar')
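
Since the output above reports `File type not supported`, the archive has to be unpacked by hand. The following is a minimal sketch using Python's standard `tarfile` module, assuming the archive sits at the path shown in the output. The `extract_archive` helper is hypothetical; it is not part of `utils` or OpenKiwi.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    archive_path = Path(archive_path)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # "r:gz" reads a gzip-compressed tar archive.
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Newer Python versions support extraction filters that reject unsafe
        # members (e.g. absolute paths); use one when it is available.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, **kwargs)
    return names


# Path taken from the download output above; adjust it if yours differs.
archive = Path("trained_models/en-de.openkiwi-predictor.tar.gz")
if archive.exists():
    print(extract_archive(archive, archive.parent))
```

The returned member names show which files were unpacked, which helps when choosing the path to pass to the model-loading step.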

# Loading and Predicting

The model we are going to use is a Predictor-Estimator with an RNN-based architecture. You can find more details about it [here](https://www.aclweb.org/anthology/W17-4763). This model was trained on the [WMT20 Quality Estimation data](http://www.statmt.org/wmt20/quality-estimation-task.html). 


Using OpenKiwi's API is fairly straightforward. We start by loading the model:

In [29]:
model = kiwi.load_model('trained_models/estimator_en_de.torch/estimator_en_de.torch')

We then create the sample that we would like to test and make it into a dictionary of lists. In other words, we are creating a batch of examples that Kiwi should use for inference.

In [30]:
source = ['the part of the regular expression within the forward slashes defines the pattern .']
target = ['der Teil des regulären Ausdrucks innerhalb der umgekehrten Schrägstrich definiert das Muste .']
examples = {'source': source,'target': target}

Then you can simply call `model.predict`!

In [31]:
predictions = model.predict(examples)

Or you could just as easily use our CLI by passing a config with the location of the models and data:

In [32]:
#!kiwi predict --config {path_to_config}

The CLI approach is normally used when you want to produce a file of predictions. The library approach, on the other hand, is used when Kiwi runs inside another application.

In this case, Kiwi returns the scores attributed to each token in the default output format, which is a Python dictionary.

In [33]:
predictions

{'tags': [[0.03265794739127159,
   0.012598986737430096,
   0.012346097268164158,
   0.014910775236785412,
   0.02257019653916359,
   0.11328258365392685,
   0.8787198066711426,
   0.7055758237838745,
   0.9504916667938232,
   0.2618913948535919,
   0.6289752721786499,
   0.8422331213951111,
   0.017515210434794426]],
 'gap_tags': [[0.0053154416382312775,
   0.001961634261533618,
   0.0005985352909192443,
   0.0004387694352772087,
   0.0005252885748632252,
   0.02461223490536213,
   0.019498424604535103,
   0.006651211529970169,
   0.019716748967766762,
   0.015120620839297771,
   0.36977633833885193,
   0.29399019479751587,
   0.1221683993935585,
   0.011256835423409939]],
 'sentence_scores': [0.1559453010559082]}

You'll notice that this Kiwi model returns three different types of predictions: tags, gap tags, and sentence scores.

- Tags: Tags are the scores attributed to each word token. This means that if you have a sentence of length `x`, Kiwi will return a list with `x` scores.

- Gap Tags: These represent the scores of the gaps between words. A gap tag should be marked as bad if there is a word missing between two other words. This also includes the beginning and end of the sentence. As such, for our sequence of length `x`, there will be `x + 1` gap tags.

- Sentence Score: Finally, the sentence score is a prediction of the sentence's HTER (Human-targeted Translation Error Rate). In other words, it is the percentage of the sentence that you would need to change to obtain a correct translation.
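
To make the relationship between tokens, tags, and gap tags concrete, here is a small sketch that pairs each target word with its tag score and applies a threshold. It uses the scores printed above (rounded to four decimals) and assumes that a higher score means "more likely bad" and that tokens are separated by whitespace. `label_tokens` is a hypothetical helper, not an OpenKiwi function.

```python
def label_tokens(target_sentence, tag_scores, threshold=0.5):
    """Pair each whitespace-separated token with an OK/BAD label."""
    tokens = target_sentence.split()
    if len(tokens) != len(tag_scores):
        raise ValueError("expected one tag score per target token")
    return [
        (token, "BAD" if score > threshold else "OK")
        for token, score in zip(tokens, tag_scores)
    ]


target = ("der Teil des regulären Ausdrucks innerhalb der umgekehrten "
          "Schrägstrich definiert das Muste .")
# Scores copied from the prediction output above, rounded to four decimals.
tags = [0.0327, 0.0126, 0.0123, 0.0149, 0.0226, 0.1133, 0.8787,
        0.7056, 0.9505, 0.2619, 0.6290, 0.8422, 0.0175]
gap_tags = [0.0053, 0.0020, 0.0006, 0.0004, 0.0005, 0.0246, 0.0195,
            0.0067, 0.0197, 0.0151, 0.3698, 0.2940, 0.1222, 0.0113]

labels = label_tokens(target, tags)
bad_words = [token for token, label in labels if label == "BAD"]
print(bad_words)

# One gap before each word plus one after the last word.
print(len(gap_tags) == len(tags) + 1)
```

Raising `threshold` flags fewer words and lowering it flags more, which is the trade-off behind a threshold-based display of the scores.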
    
    
However, looking at a bunch of scores with no context is not terribly informative. So, below you can find a small utility to visualize the Kiwi scores in the context of a translation.

You'll notice that you can move the threshold for marking a word as bad. This is useful in real-world scenarios as you can calibrate the conservativeness of the models and the severity of the errors you want to highlight.

In [34]:
SOURCE = Textarea(value=source[0])
MT = Textarea(value=target[0])
_interact = interact(utils.KiwiViz, model=fixed(model), source=SOURCE, mt=MT, threshold=(0.0, 1.0))

HTER: 0.290527880191803


 <span style='color:green'>der</span> <span style='color:red'>*Teil*</span> <span style='color:green'>des</span> <span style='color:green'>regulären</span> <span style='color:red'>*Ausdrucks*</span> <span style='color:green'>innerhalb</span> <span style='color:green'>der</span> <span style='color:red'>*umgekehrten*</span> <span style='color:red'>*Schrägstrich*</span> <span style='color:green'>definiert</span> <span style='color:red'>*das*</span> <span style='color:red'>*Muste*</span> <span style='color:green'>.</span>

# Training a model from scratch

OpenKiwi supports training four different architectures:

- Linear Model
- Quetch
- NuQE
- Predictor-Estimator

Training can be run either through our API or from the command line. Unlike inference, however, training has a host of different options, so we rely on YAML config files to pass these parameters to the framework.

Below, you'll find an example config file for training a NuQE model. NuQE is a simple, end-to-end neural model often used as the baseline for WMT's quality estimation shared task. You can see more details about it [here](https://www.aclweb.org/anthology/W16-2387).

In [None]:
%%yaml yaml_config
#### MODEL SPECIFIC OPTIONS ####
#
model: nuqe

seed: 42

output-dir: runs/nuqe

window-size: 3
max-aligned: 5

# embeddings
source-embeddings-size: 50
source-pos-embeddings-size: 20
target-embeddings-size: 50
target-pos-embeddings-size: 20

# network
hidden-sizes: [400, 200, 100, 50]
dropout: 0.0
embeddings-dropout: 0.5
freeze-embeddings: false
bad-weight: 3.0

# initialization
init-support: 0.1
init-type: uniform

### Pretrained Embedding Options ###
# pip-install the polyglot package to use these
#embeddings-format: polyglot
#    source: path/to/source/embeddings_pkl.tar.bz2
#    target: path/to/target/embeddings_pkl.tar.bz2

#
# TRAINING OPTIONS
#
epochs: 3
train-batch-size: 64
valid-batch-size: 64

log-interval: 100
checkpoint-save: true
checkpoint-keep-only-best: 1
checkpoint-early-stop-patience: 10

optimizer: adam
learning-rate: 0.001

gpu-id: -1

predict-target: true

#
# DATA OPTIONS
#
wmt18-format: true
train-source: WMT19/train.src
train-target: WMT19/train.mt
train-target-tags: WMT19/train.tags
train-alignments: WMT19/train.src-mt.alignments

valid-source: WMT19/dev.src
valid-target: WMT19/dev.mt
valid-target-tags: WMT19/dev.tags
valid-alignments: WMT19/dev.src-mt.alignments

# vocabulary
source-vocab-min-frequency: 2
target-vocab-min-frequency: 2
keep-rare-words-with-embeddings: true
add-embeddings-vocab: false


We save this config to a file so it can be loaded later into kiwi.

In [None]:
utils.save_config(yaml_config, 'nuqe_config.yml')
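
Before starting a training run, it can save time to confirm that every data file named in the config actually exists. The sketch below is one way to do that, assuming the config is available as a plain dictionary (as the parsed `yaml_config` is expected to be). `missing_data_files` is a hypothetical helper and not part of OpenKiwi or `utils`.

```python
from pathlib import Path


def missing_data_files(config, root="."):
    """Return (option, path) pairs whose data file does not exist under root."""
    missing = []
    for key, value in config.items():
        is_data_option = key.startswith(("train-", "valid-", "test-"))
        # Batch sizes share these prefixes but are integers, so they are skipped.
        if is_data_option and isinstance(value, str):
            if not (Path(root) / value).is_file():
                missing.append((key, value))
    return missing


# A hand-copied subset of the config above; in the notebook you could pass
# the parsed `yaml_config` dictionary instead.
config = {
    "train-source": "WMT19/train.src",
    "train-target": "WMT19/train.mt",
    "train-batch-size": 64,
    "valid-source": "WMT19/dev.src",
}
for option, path in missing_data_files(config):
    print(f"{option}: {path} not found")
```

An empty result means every listed path resolved to a file relative to `root`.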

Then, you can use either the API or the command line to call kiwi and begin training with this configuration!

In [None]:
config = 'nuqe_config.yml'

# Run one of the following lines; they are virtually identical

run_info = kiwi.train(config)
#!kiwi train --config nuqe_config.yml

# Evaluating and Finetuning an existing model

Finally, OpenKiwi also provides an easy way to evaluate existing models against a QE dataset. Here, we will evaluate one of our pre-trained models against the WMT20 dev set (as the test sets are unfortunately not available). 

Then, we will try to continue fine-tuning this model in an attempt to increase its performance.

### Evaluation

As with training, we define the evaluation options through a YAML config file. Here, we will use our pre-trained model to predict the tags for the WMT20 dev set.

In [35]:
%%yaml predest_predict
output-dir: predictions/predest

#
# GENERAL OPTIONS
#
# random
seed: 42

# gpu
gpu-id: -1

model: estimator

# save and load
load-model: trained_models/estimator_en_de.torch/best_model.torch

#
# DATA OPTIONS
#
wmt18-format: False
test-source: WMT19/dev.src
test-target: WMT19/dev.mt
valid-batch-size: 64


Again, we save the config and call the CLI. This will create predictions in the `output-dir`.

In [36]:
utils.save_config(predest_predict, 'predest_predict.yml')
!kiwi predict --config predest_predict.yml

2020-05-01 17:40:22.880 [kiwi.lib.predict setup:159] {'batch_size': 64,
 'config': 'predest_predict.yml',
 'debug': False,
 'experiment_name': None,
 'gpu_id': -1,
 'load_data': None,
 'load_model': 'trained_models/estimator_en_de.torch/best_model.torch',
 'load_vocab': None,
 'log_interval': 100,
 'mlflow_always_log_artifacts': False,
 'mlflow_tracking_uri': 'mlruns/',
 'model': 'estimator',
 'output_dir': 'predictions/predest',
 'quiet': False,
 'run_uuid': None,
 'save_config': None,
 'save_data': None,
 'seed': 42}
2020-05-01 17:40:22.896 [kiwi.lib.predict setup:160] Local output directory is: predictions/predest
2020-05-01 17:40:22.897 [kiwi.lib.predict run:100] Predict with the PredEst (Predictor-Estimator) model
2020-05-01 17:40:48.398 [kiwi.data.utils load_vocabularies_to_fields:126] Loaded vocabularies from trained_models/estimator_en_de.torch/best_model.torch
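
To inspect the predictions from Python, the output file can be read back in. The sketch below assumes each line of the file holds the whitespace-separated scores for one sentence; that layout is an assumption about the output format, so check it against your own file. `read_probabilities` is a hypothetical helper.

```python
from pathlib import Path


def read_probabilities(path):
    """Read one list of floats per line (one line per sentence)."""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            rows.append([float(value) for value in line.split()])
    return rows


# Path built from `output-dir` in the config above; the file name `tags` is
# the one referenced by `pred-target` in the evaluation config.
predictions_file = Path("predictions/predest/tags")
if predictions_file.exists():
    rows = read_probabilities(predictions_file)
    print(len(rows), "sentences")
    print(rows[0][:5] if rows else [])
```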


Finally, we define a yaml for the evaluation pipeline. 

In [37]:
%%yaml predest_evaluate
# Example file for configuring the evaluation pipeline
#
# The input type for prediction files (Probabilities[probs] or tags)
type: probs
 
# The format of gold files (wmt17/wmt18)
format: wmt18

# Format of predictions (wmt17/wmt18). Either they predict gaps or not.
pred-format: wmt17

# File path for the reference files
gold-target: WMT19/dev.tags

# File path for the prediction files
pred-target: predictions/predest/tags


In [38]:
utils.save_config(predest_evaluate, 'predest_evaluate.yml')
!kiwi evaluate --config predest_evaluate.yml

---------------------------------------------------------------
Word-level scores for tags:
File                        F1_mult      F1_OK        F1_BAD   
predictions/predest/tags    0.38713      0.90609      0.42725  


These results are on par with what we expect for a single model on the WMT19 dev set.
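
The `F1_mult` column is the product of `F1_OK` and `F1_BAD`; for the row above, 0.90609 × 0.42725 ≈ 0.38713. The sketch below computes the three values for a toy pair of tag sequences. It is an illustration of the metric under that definition, not OpenKiwi's own evaluation code, and the toy `gold` and `predicted` lists are made up for the example.

```python
def f1_for_label(gold, predicted, label):
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only.
gold = ["OK", "OK", "BAD", "OK", "BAD", "OK"]
predicted = ["OK", "BAD", "BAD", "OK", "OK", "OK"]

f1_ok = f1_for_label(gold, predicted, "OK")
f1_bad = f1_for_label(gold, predicted, "BAD")
f1_mult = f1_ok * f1_bad
print(f1_ok, f1_bad, f1_mult)

# Cross-check against the scores reported in the table above.
print(round(0.90609 * 0.42725, 5))
```

Because the two scores are multiplied, a model cannot obtain a high `F1_mult` by predicting the majority `OK` class everywhere: `F1_BAD` would drop to zero.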

### Finetuning

Finally, we can also load the pre-trained model and continue finetuning it. 
Can we further increase its performance?

Here, we will use the Predictor-Estimator from the previous example and continue training it on a randomly selected subset of the WMT19 data. This subset is located in `WMT19/small`.

In [None]:
%%yaml predest_finetune
### Train Predictor Estimator ###

model: estimator

#### MODEL SPECIFIC OPTS ####

## ESTIMATOR ##

# If load-model points to a pretrained Estimator,
# These settings are ignored.

# LSTM Settings
hidden-est: 125
rnn-layers-est: 1
dropout-est: 0.0
# Use linear layer to reduce dimension prior to LSTM
mlp-est: True

# Multitask Learning Settings #

# Continue training the predictor on the postedited text.
# If set, will do an additional forward pass through the predictor
# Using the SRC, PE pair and add the `Predictor` loss for the tokens in the
# postedited text PE. Recommended if you have access to PE
# Requires setting train-pe, valid-pe
token-level: True
# Predict Sentence Level Scores
# Requires setting train-sentence-scores, valid-sentence-scores
sentence-level: True
# Use probabilistic Loss for sentence scores instead of squared error.
# If set, the model will output mean and variance of a truncated Gaussian
# distribution over the interval [0, 1], and use log-likelihood loss instead
# of mean squared error.
# Seems to improve performance
sentence-ll: False
# Predict Binary Label for each sentence, indicating hter == 0.0
# Requires setting train-sentence-scores, valid-sentence-scores
binary-level: False

# WMT 18 Format Settings #

# Predict target tags. Requires train-target-tags, valid-target-tags to be set.
#predict-target: true
target-bad-weight: 2.5
# Predict source tags. Requires train-source-tags, valid-source-tags to be set.
#predict-source: false
source-bad-weight: 2.5
# Predict gap tags. Requires train-target-tags, valid-target-tags to be set.
# and wmt18-format set to true
#predict-gaps: true
gaps-bad-weight: 2.5

### GENERAL OPTS ###

# Do not set or set to negative number for CPU
gpu-id: -1

### TRAIN OPTS ###
epochs: 1
# Additionally Eval and checkpoint every n training steps
# Explicitly disable by setting to zero (default)
checkpoint-validation-steps: 300000
# If False, never save the Models
checkpoint-save: true
# Keep Only the n best models according to the main metric (F1Mult by default)
# Useful to avoid filling the hard drive during a long run
checkpoint-keep-only-best: 3
# If greater than zero, Early Stop after n evaluation cycles without improvement
checkpoint-early-stop-patience: 0


# Print Train Stats Every n batches
log-interval: 100
# Learning rate. Currently Adam is the only optimizer supported.
# 1e-3 * (batch_size / 32) seems to work well
learning-rate: 2e-3

train-batch-size: 64
valid-batch-size: 64



### LOADING ###

# Load pretrained (sub-)model.
# If set, the model architecture params are ignored.
# As the vocabulary of the pretrained model will be used,
# all vocab-params will also be ignored.

# (i) load-pred-source or load-pred-target: Predictor instance
#     -> a new Estimator is initialized with the given predictor(s).
# (ii) load-model: Estimator instance.
#                  As the Predictor is a submodule of the Estimator,
#                  load-pred-{source,target} will be ignored if this is set.

load-model: trained_models/estimator_en_de.torch/estimator_en_de.torch
# load-pred-source: path_to_predictor_source_target
# load-pred-target: runs/model.torch


###  DATA ###

# Set to True to use target_tags in WMT18 format
wmt18-format: true

train-source: WMT19/small/train.src
train-target: WMT19/small/train.mt
train-pe: WMT19/small/train.pe
train-target-tags: WMT19/small/train.tags
train-sentence-scores: WMT19/small/train.hter


valid-source: WMT19/dev.src
valid-target: WMT19/dev.mt
valid-pe: WMT19/dev.pe
valid-target-tags: WMT19/dev.tags
valid-sentence-scores: WMT19/dev.hter

In [None]:
utils.save_config(predest_finetune, 'predest_config.yml')
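
The config comment suggests scaling the learning rate as `1e-3 * (batch_size / 32)`. A small sketch of that rule of thumb follows; `scaled_learning_rate` is a hypothetical helper name, and the rule itself is only the heuristic quoted in the config, not a guarantee.

```python
def scaled_learning_rate(batch_size, base_lr=1e-3, base_batch_size=32):
    """Scale the base learning rate linearly with the batch size."""
    return base_lr * (batch_size / base_batch_size)


# The config above uses train-batch-size: 64.
print(scaled_learning_rate(64))
```

With a batch size of 64 this reproduces the `learning-rate: 2e-3` used in the config, so if you change `train-batch-size`, the same rule gives a starting point for the new learning rate.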

In [None]:
kiwi.train('predest_config.yml')
#!kiwi train --config predest_config.yml

How do these scores compare to your previous evaluation? Can you improve them? :)

Note: Improving the performance of these models is actually very difficult, as they had already been trained on this dataset. The goal is simply to learn how to continue fine-tuning a model on a different dataset.


If you're done, go back to the repo and check the `exercises` folder!