# Connecting Colab to Google Cloud

The setup follows this guide, with the changes to Steps 3 and 4 noted below:

https://medium.com/@senthilnathangautham/colab-gcp-compute-how-to-link-them-together-98747e8d940e

Machine type: `n1-standard-8` (8 vCPUs, 30 GB memory)

Image: `c2-deeplearning-pytorch-1-3-cu100-20191219`

Change to Step 3 (connect to your server and forward the port):

`gcloud compute ssh --zone us-central1-a instance-4 -- -L 8081:localhost:8081`

Change to Step 4 (run a Jupyter Notebook server on your instance):

`jupyter notebook --NotebookApp.allow_origin="https://colab.research.google.com" --port=8081 --NotebookApp.port_retries=0 --no-browser`
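
Before pointing Colab at the local runtime, it can help to confirm that the forwarded port actually accepts connections. The sketch below is a minimal check using only the standard library. The helper name `port_is_open` is hypothetical; port 8081 is the one used in the two commands above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # 8081 is the port forwarded by the gcloud ssh command above.
    print("Port 8081 reachable:", port_is_open("localhost", 8081))
```

If this prints `False` on the local machine, the SSH tunnel or the Jupyter server is probably not running yet.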

# Install OpenKiwi

We want to install OpenKiwi as a local package. Follow these steps:

https://unbabel.github.io/OpenKiwi/installation.html#as-local-package
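
After installing, a quick way to confirm that the package is visible to the notebook kernel is to look it up without importing it. This is only a sketch; `is_installed` is a hypothetical helper, and `kiwi` is the import name used in the training cells of this notebook.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


# `kiwi` is the module imported later in this notebook.
print("OpenKiwi importable:", is_installed("kiwi"))
```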

# Task 1: Sentence-Level Direct Assessment - WMT20

In [2]:
import utils
import yaml
from ipywidgets import interact, fixed, Textarea
from functools import partial
%load_ext yamlmagic

[nltk_data] Downloading package punkt to
[nltk_data]     /home/daniel_paramo_v_gmail_com/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Train Predictor

In [0]:
# Download and extract data

OK_url = 'https://www.quest.dcs.shef.ac.uk/wmt20_files_qe/training_en-de.tar.gz'

utils.download_kiwi(OK_url)

In [0]:
import tarfile
# Extract the archive into ./data/training; the context manager closes the file afterwards
with tarfile.open('./data/training/training_en-de.tar.gz') as my_tar:
    my_tar.extractall('./data/training')

In [0]:
# The data is too big, so keep only the first 2,000,000 rows of the training data
import pandas as pd
# The separator is a pattern that should never occur in the text, so each line is read as a single column
tinytrainen = pd.read_csv('./data/training/train.ende.en',chunksize=2000000, sep='None, /n', engine='python')
readme = tinytrainen.get_chunk(2000000)
readme.to_csv(r'./data/training/tinytrainen', index=False, header=False)

In [0]:
# Same 2,000,000-row subset for the German side
tinytrainde = pd.read_csv('./data/training/train.ende.de',chunksize=2000000, sep='None, /n', engine='python')
reader = tinytrainde.get_chunk(2000000)
reader.to_csv(r'./data/training/tinytrainde', index=False, header=False)
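
An alternative sketch for taking the first N lines of each file without going through a CSV parser. Because it copies lines verbatim, sentences that contain commas or quotes are not re-quoted on output and the first line is not consumed as a header, so the English and German files stay aligned line by line. The helper `head_lines` is hypothetical; the paths in the usage comment are the ones from the cells above.

```python
from itertools import islice


def head_lines(src_path: str, dst_path: str, n_lines: int) -> int:
    """Copy the first `n_lines` lines of `src_path` to `dst_path` verbatim.

    Returns the number of lines actually written.
    """
    written = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written


# Usage with the paths from the cells above (uncomment to run on the real data):
# head_lines('./data/training/train.ende.en', './data/training/tinytrainen', 2_000_000)
# head_lines('./data/training/train.ende.de', './data/training/tinytrainde', 2_000_000)
```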

In [3]:
import os
os.path.getsize('./data/training/tinytrainde')

297527931

In [8]:
%%yaml train_predictor
#### Train Predictor  ####

model: predictor

# Model Files will be saved here
output-dir: ./OpenKiwi/runs/predictor

#### MODEL SPECIFIC OPTS ####

## PREDICTOR ##

# LSTM Settings (Both SRC and TGT)
hidden-pred: 400
rnn-layers-pred: 2
# If set, takes precedence over other embedding params
embedding-sizes: 200
# Source, Target, and Target Softmax Embedding
source-embeddings-size: 200
target-embeddings-size: 200
out-embeddings-size: 200
# Dropout
dropout-pred: 0.5
# Set to true to predict from target to source
# (To create a source predictor for source tag prediction)
predict-inverse: false

### TRAIN OPTS ###
epochs: 6
# Eval and checkpoint every n samples
# Disable by setting to zero (default)
checkpoint-validation-steps: 5000
# If False, never save the Models
checkpoint-save: true
# Keep Only the n best models according to the main metric (Perplexity by default)
# Useful to avoid filling the hard drive during a long run
checkpoint-keep-only-best: 1
# If greater than zero, Early Stop after n evaluation cycles without improvement
checkpoint-early-stop-patience: 0

optimizer: adam
# Print Train Stats Every n batches
log-interval: 100
# Learning Rate
# 1e-3 * (batch_size / 32) seems to work well
learning-rate: 2e-3
learning-rate-decay: 0.6
learning-rate-decay-start: 2
train-batch-size: 64
valid-batch-size: 64

### DATA OPTS ###

# Source and Target Files
train-source: ./OpenKiwi/data/training/tinytrainen
train-target: ./OpenKiwi/data/training/tinytrainde
# Optionally load more data which is used only for vocabulary creation.
# This is useful to reduce OOV words if the parallel data
# and QE data are from different domains.
# extend-source-vocab: data/WMT17/word_level/en_de/train.src
# extend-target-vocab: data/WMT17/word_level/en_de/train.pe
# Optionally Specify Validation Sets
# valid-source: data/WMT17/word_level/en_de/dev.src
# valid-target: data/WMT17/word_level/en_de/dev.pe
# If No valid is specified, randomly split the train corpus
split: 0.99


## VOCAB ##

# Load Vocabulary from a previous run.
# This is needed e.g. for training a source predictor via the flag
# predict-inverse: True
# If set, the other vocab options are ignored.
# load-vocab: /mnt/data/datasets/kiwi/trained_models/predest/en_de/vocab.torch

source-vocab-size: 45000
target-vocab-size: 45000
# Remove Sentences not in the specified Length Range
source-max-length: 50
source-min-length: 1
target-max-length: 50
target-min-length: 1
# Require Minimum Frequency of words
source-vocab-min-frequency: 1
target-vocab-min-frequency: 1


### GENERAL OPTS ###

# Experiment Name for MLFlow
# experiment-name: EN-DE Pretrain Predictor
# Do not set or set to negative number for CPU
# gpu-id: 0


In [0]:
utils.save_config(train_predictor, './OpenKiwi/runs/predictor/train_predictor.yml')
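
The predictor config carries the comment "1e-3 * (batch_size / 32) seems to work well" next to `learning-rate`. The arithmetic behind that rule of thumb can be sketched as follows. The function `scaled_learning_rate` is hypothetical and is not part of OpenKiwi; it only restates the heuristic from the comment.

```python
def scaled_learning_rate(batch_size: int, base_lr: float = 1e-3, base_batch: int = 32) -> float:
    """Linear scaling heuristic from the config comment: base_lr * (batch_size / base_batch)."""
    return base_lr * (batch_size / base_batch)


# With train-batch-size: 64 this reproduces the learning-rate set in the predictor config.
print(scaled_learning_rate(64))  # 0.002
```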

In [0]:
import kiwi

predictor_config = './OpenKiwi/runs/predictor/train_predictor.yml'
kiwi.train(predictor_config)

2020-06-02 13:13:22.411 [root setup:380] This is run ID: ab6f199c883d446a8b8aa05cdb596d85
2020-06-02 13:13:22.412 [root setup:383] Inside experiment ID: 0 (None)
2020-06-02 13:13:22.413 [root setup:386] Local output directory is: ./OpenKiwi/runs/predictor
2020-06-02 13:13:22.414 [root setup:389] Logging execution to MLflow at: None
2020-06-02 13:13:22.414 [root setup:397] Using CPU
2020-06-02 13:13:22.415 [root setup:400] Artifacts location: None
2020-06-02 13:13:22.420 [kiwi.lib.train run:154] Training the PredEst Predictor model (an embedder model) model
2020-06-02 13:15:16.993 [kiwi.lib.train run:187] Predictor(
  (attention): Attention(
    (scorer): MLPScorer(
      (layers): ModuleList(
        (0): Sequential(
          (0): Linear(in_features=1600, out_features=800, bias=True)
          (1): Tanh()
        )
        (1): Sequential(
          (0): Linear(in_features=800, out_features=1, bias=True)
          (1): Tanh()
        )
      )
    )
  )
  (embedding_source): Embedding



Batches:   1%|▏                    | 199/28508 [31:44<70:43:55,  8.99s/ batches]2020-06-02 13:47:13.525 [kiwi.metrics.stats log:60] target_PERP: 564.5462, target_CORRECT: 0.1682, target_ExpErr: 0.9439
Batches:   1%|▏                    | 200/28508 [31:56<77:11:45,  9.82s/ batches]



Batches:   1%|▏                    | 299/28508 [47:32<76:18:10,  9.74s/ batches]2020-06-02 14:03:01.448 [kiwi.metrics.stats log:60] target_PERP: 344.1209, target_CORRECT: 0.2199, target_ExpErr: 0.9006
Batches:   1%|▏                    | 300/28508 [47:44<82:05:50, 10.48s/ batches]



Batches:   1%|▎                  | 399/28508 [1:02:57<56:58:27,  7.30s/ batches]2020-06-02 14:18:25.774 [kiwi.metrics.stats log:60] target_PERP: 254.1533, target_CORRECT: 0.2444, target_ExpErr: 0.8786
Batches:   1%|▎                  | 400/28508 [1:03:08<65:29:12,  8.39s/ batches]



Batches:   2%|▎                  | 499/28508 [1:18:27<68:39:28,  8.82s/ batches]2020-06-02 14:33:55.270 [kiwi.metrics.stats log:60] target_PERP: 194.6490, target_CORRECT: 0.2676, target_ExpErr: 0.8599
Batches:   2%|▎                  | 500/28508 [1:18:38<72:02:34,  9.26s/ batches]



Batches:   2%|▍                  | 599/28508 [1:34:19<81:52:01, 10.56s/ batches]2020-06-02 14:49:46.166 [kiwi.metrics.stats log:60] target_PERP: 152.2536, target_CORRECT: 0.2910, target_ExpErr: 0.8393
Batches:   2%|▍                  | 600/28508 [1:34:29<80:18:19, 10.36s/ batches]



Batches:   2%|▍                  | 699/28508 [1:50:03<69:28:35,  8.99s/ batches]2020-06-02 15:05:27.300 [kiwi.metrics.stats log:60] target_PERP: 128.2772, target_CORRECT: 0.3084, target_ExpErr: 0.8261
Batches:   2%|▍                  | 700/28508 [1:50:10<65:23:40,  8.47s/ batches]



Batches:   3%|▌                  | 799/28508 [2:05:32<68:42:44,  8.93s/ batches]2020-06-02 15:20:57.490 [kiwi.metrics.stats log:60] target_PERP: 108.8060, target_CORRECT: 0.3213, target_ExpErr: 0.8118
Batches:   3%|▌                  | 800/28508 [2:05:40<65:56:35,  8.57s/ batches]



Batches:   3%|▌                  | 899/28508 [2:21:06<65:29:01,  8.54s/ batches]2020-06-02 15:36:31.004 [kiwi.metrics.stats log:60] target_PERP: 100.2458, target_CORRECT: 0.3295, target_ExpErr: 0.8054
Batches:   3%|▌                  | 900/28508 [2:21:14<63:46:05,  8.32s/ batches]



Batches:   4%|▋                  | 999/28508 [2:37:24<86:30:41, 11.32s/ batches]2020-06-02 15:52:51.253 [kiwi.metrics.stats log:60] target_PERP: 89.0870, target_CORRECT: 0.3394, target_ExpErr: 0.7939
Batches:   4%|▋                 | 1000/28508 [2:37:34<83:32:59, 10.93s/ batches]



Batches:   4%|▋                 | 1099/28508 [2:53:10<83:49:36, 11.01s/ batches]2020-06-02 16:08:37.322 [kiwi.metrics.stats log:60] target_PERP: 82.4437, target_CORRECT: 0.3463, target_ExpErr: 0.7864
Batches:   4%|▋                 | 1100/28508 [2:53:20<81:29:45, 10.70s/ batches]



Batches:   4%|▊                 | 1199/28508 [3:09:03<69:53:45,  9.21s/ batches]2020-06-02 16:24:35.497 [kiwi.metrics.stats log:60] target_PERP: 75.1103, target_CORRECT: 0.3530, target_ExpErr: 0.7780
Batches:   4%|▊                 | 1200/28508 [3:09:18<82:33:10, 10.88s/ batches]



Batches:   5%|▊                 | 1299/28508 [3:24:59<79:39:37, 10.54s/ batches]2020-06-02 16:40:30.005 [kiwi.metrics.stats log:60] target_PERP: 68.8188, target_CORRECT: 0.3631, target_ExpErr: 0.7679
Batches:   5%|▊                 | 1300/28508 [3:25:13<86:37:44, 11.46s/ batches]



Batches:   5%|▉                 | 1399/28508 [3:41:07<57:05:58,  7.58s/ batches]2020-06-02 16:56:34.624 [kiwi.metrics.stats log:60] target_PERP: 63.2428, target_CORRECT: 0.3675, target_ExpErr: 0.7615
Batches:   5%|▉                 | 1400/28508 [3:41:17<62:59:31,  8.37s/ batches]



Batches:   5%|▉                 | 1499/28508 [3:57:08<74:11:28,  9.89s/ batches]2020-06-02 17:12:34.993 [kiwi.metrics.stats log:60] target_PERP: 60.4493, target_CORRECT: 0.3746, target_ExpErr: 0.7556
Batches:   5%|▉                 | 1500/28508 [3:57:17<72:31:42,  9.67s/ batches]



Batches:   6%|█                 | 1588/28508 [4:11:04<69:38:46,  9.31s/ batches]
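
The log above interleaves progress bars with metric lines, which makes it hard to see how `target_PERP` evolves. A small sketch for pulling the metrics out of such a log with regular expressions follows. The helper `parse_metrics` is hypothetical, and the patterns are based only on the log lines shown in this notebook.

```python
import re

# Matches "name: value" pairs such as "target_PERP: 564.5462" or "EVAL_PEARSON: nan".
METRIC_RE = re.compile(r"(\w+): (-?\d+(?:\.\d+)?|nan)")
# Marks the start of a metrics message in the kiwi log output.
MARKER_RE = re.compile(r"\[kiwi\.metrics\.stats log:\d+\] ")


def parse_metrics(line: str) -> dict:
    """Return the metrics found after the kiwi.metrics.stats marker, or {} if absent."""
    marker = MARKER_RE.search(line)
    if marker is None:
        return {}
    tail = line[marker.end():]
    return {name: float(value) for name, value in METRIC_RE.findall(tail)}


sample = ("2020-06-02 13:47:13.525 [kiwi.metrics.stats log:60] "
          "target_PERP: 564.5462, target_CORRECT: 0.1682, target_ExpErr: 0.9439")
print(parse_metrics(sample))
# {'target_PERP': 564.5462, 'target_CORRECT': 0.1682, 'target_ExpErr': 0.9439}
```

Applied to every line of a saved log, this gives one dictionary per logging step that can be collected into a list and plotted.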

# Train Estimator

In [0]:
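#development file
# Assumed columns of the WMT20 DA TSV (the header row is skipped below):
# index, original, translation, scores, mean, z_scores, z_mean, model_scores.
# For en-de, `original` is the English source and `translation` is the German MT output.

with open('./OpenKiwi/data/traindev/dev.ende.df.short.tsv', encoding='utf-8') as file:
    data = file.readlines()[1:]

with open('./OpenKiwi/data/traindev/wmt20_dev.en', 'w', encoding='utf-8') as en, \
     open('./OpenKiwi/data/traindev/wmt20_dev.de', 'w', encoding='utf-8') as de, \
     open('./OpenKiwi/data/traindev/wmt20_dev.hter_avg', 'w', encoding='utf-8') as hter:
    for d in data:
        d = d.rstrip('\n').split('\t')
        en.write(d[1] + "\n")    # source sentence
        de.write(d[2] + "\n")    # machine translation
        hter.write(d[4] + "\n")  # sentence score (assumed to be the `mean` column)

In [0]:
#train file
# Same assumed column layout as the development file.

with open('./OpenKiwi/data/traindev/train.ende.df.short.tsv', encoding='utf-8') as file:
    data = file.readlines()[1:]

with open('./OpenKiwi/data/traindev/wmt20_train.en', 'w', encoding='utf-8') as en, \
     open('./OpenKiwi/data/traindev/wmt20_train.de', 'w', encoding='utf-8') as de, \
     open('./OpenKiwi/data/traindev/wmt20_train.hter_avg', 'w', encoding='utf-8') as hter:
    for d in data:
        d = d.rstrip('\n').split('\t')
        en.write(d[1] + "\n")    # source sentence
        de.write(d[2] + "\n")    # machine translation
        hter.write(d[4] + "\n")  # sentence score (assumed to be the `mean` column)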
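
The source, target and score files written above are meant to be parallel: line i of each file belongs to the same sentence pair. A quick sanity check before training is to confirm that all three have the same number of lines. This is a sketch with hypothetical helper names (`count_lines`, `check_aligned`); the paths in the usage comment are the ones written by the cells above.

```python
def count_lines(path: str) -> int:
    """Count the lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_aligned(paths: list) -> bool:
    """Return True if all files in `paths` have the same number of lines."""
    counts = {path: count_lines(path) for path in paths}
    for path, n in counts.items():
        print(f"{n:>8} lines  {path}")
    return len(set(counts.values())) == 1


# Usage with the training files from the cell above (uncomment to run):
# check_aligned([
#     './OpenKiwi/data/traindev/wmt20_train.en',
#     './OpenKiwi/data/traindev/wmt20_train.de',
#     './OpenKiwi/data/traindev/wmt20_train.hter_avg',
# ])
```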

In [10]:
%%yaml train_estimator
### Train Predictor Estimator ###

model: estimator

# Model Files will be saved here
output-dir: ./OpenKiwi/runs/estimator

#### MODEL SPECIFIC OPTS ####

## ESTIMATOR ##

# If load-model points to a pretrained Estimator,
# These settings are ignored.

# LSTM Settings
hidden-est: 125
rnn-layers-est: 1
dropout-est: 0.0
# Use linear layer to reduce dimension prior to LSTM
mlp-est: True

# Multitask Learning Settings #

# Continue training the predictor on the postedited text.
# If set, will do an additional forward pass through the predictor
# Using the SRC, PE pair and add the `Predictor` loss for the tokens in the
# postedited text PE. Recommended if you have access to PE
# Requires setting train-pe, valid-pe
token-level: False
# Predict Sentence Level Scores
# Requires setting train-sentence-scores, valid-sentence-scores
sentence-level: True
# Use probabilistic Loss for sentence scores instead of squared error.
# If set, the model will output mean and variance of a truncated Gaussian
# distribution over the interval [0, 1], and use log-likelihood loss instead
# of mean squared error.
# Seems to improve performance
sentence-ll: False
# Predict Binary Label for each sentence, indicating hter == 0.0
# Requires setting train-sentence-scores, valid-sentence-scores
binary-level: False

# WMT 20 Format Settings #

# Predict target tags. Requires train-target-tags, valid-target-tags to be set.
predict-target: false
target-bad-weight: 2.5
# Predict source tags. Requires train-source-tags, valid-source-tags to be set.
predict-source: false
source-bad-weight: 2.5
# Predict gap tags. Requires train-target-tags, valid-target-tags to be set.
# and wmt18-format set to true
predict-gaps: false
target-bad-weight: 2.5


### TRAIN OPTS ###
epochs: 10
# Additionally Eval and checkpoint every n training steps
# Explicitly disable by setting to zero (default)
checkpoint-validation-steps: 0
# If False, never save the Models
checkpoint-save: true
# Keep Only the n best models according to the main metric (F1Mult by default)
# Useful to avoid filling the hard drive during a long run
checkpoint-keep-only-best: 3
# If greater than zero, Early Stop after n evaluation cycles without improvement
checkpoint-early-stop-patience: 0


# Print Train Stats Every n batches
log-interval: 100
# LR. Currently Adam is the only optimizer supported.
# 1e-3 * (batch_size / 32) seems to work well
learning-rate: 1e-3

train-batch-size: 8
valid-batch-size: 8



### LOADING ###

# Load pretrained (sub-)model.
# If set, the model architecture params are ignored.
# As the vocabulary of the pretrained model will be used,
# all vocab-params will also be ignored.

# (i) load-pred-source or load-pred-target: Predictor instance
#     -> a new Estimator is initialized with the given predictor(s).
# (ii) load-model: Estimator instance.
#                  As the Predictor is a submodule of the Estimator,
#                  load-pred-{source,target} will be ignored if this is set.

# load-model: path_to_estimator
# load-pred-source: path_to_predictor_source_target
load-pred-target: ./OpenKiwi/runs/predictor/best_model.torch


###  DATA ###

# Set to True to use target_tags in WMT18 format
wmt18-format: false

train-source: ./OpenKiwi/data/traindev/wmt20_train.en
train-target: ./OpenKiwi/data/traindev/wmt20_train.de
# train-pe: /content/drive/My Drive/Proyectos/Machine Learning/Colab Notebooks/data/train.pe
# train-target-tags: /content/drive/My Drive/Proyectos/Machine Learning/Colab Notebooks/data/train.tags
train-sentence-scores: ./OpenKiwi/data/traindev/wmt20_train.hter_avg


valid-source: ./OpenKiwi/data/traindev/wmt20_dev.en
valid-target: ./OpenKiwi/data/traindev/wmt20_dev.de
# valid-pe: /content/drive/My Drive/Proyectos/Machine Learning/Colab Notebooks/WMT20/data/dev.pe
# valid-target-tags: /content/drive/My Drive/Proyectos/Machine Learning/Colab Notebooks/WMT20/data/dev.tags
valid-sentence-scores: ./OpenKiwi/data/traindev/wmt20_dev.hter_avg

### GENERAL OPTS ###

# Experiment Name for MLFlow
experiment-name: EN-DE Train Estimator
# Do not set or set to negative number for CPU
# gpu-id: 0


In [0]:
utils.save_config(train_estimator, './OpenKiwi/runs/estimator/train_estimator.yml')
utils.save_config(train_estimator, './OpenKiwi/experiments/train_estimator.yml')

In [25]:
import kiwi

estimator_config = './OpenKiwi/runs/estimator/train_estimator.yml'
kiwi.train(estimator_config)

2020-05-22 02:29:04.479 [root setup:380] This is run ID: 358737f214564365ac497d5d40aaeaa8
2020-05-22 02:29:04.479 [root setup:383] Inside experiment ID: 0 (EN-DE Train Estimator)
2020-05-22 02:29:04.480 [root setup:386] Local output directory is: runs/estimator
2020-05-22 02:29:04.481 [root setup:389] Logging execution to MLflow at: None
2020-05-22 02:29:04.482 [root setup:397] Using CPU
2020-05-22 02:29:04.482 [root setup:400] Artifacts location: None
2020-05-22 02:29:04.490 [kiwi.lib.train run:154] Training the PredEst (Predictor-Estimator) model
2020-05-22 02:29:06.056 [kiwi.data.utils load_vocabularies_to_fields:126] Loaded vocabularies from runs/predictor/best_model.torch
  "num_layers={}".format(dropout, num_layers))
2020-05-22 02:29:07.059 [kiwi.lib.train run:187] Estimator(
  (predictor_tgt): Predictor(
    (attention): Attention(
      (scorer): MLPScorer(
        (layers): ModuleList(
          (0): Sequential(
            (0): Linear(in_features=1600, out_features=800, bias=



Batches: 100%|██████████████████████████| 110/110 [02:58<00:00,  1.62s/ batches]
2020-05-22 02:32:05.086 [kiwi.metrics.stats log:60] RMSE: 76.8967, PEARSON: 0.0132, SPEARMAN: 0.0000, UNKS: 0.5449




2020-05-22 02:32:17.862 [kiwi.metrics.stats log:60] EVAL_RMSE: 71.7530, EVAL_PEARSON: nan, EVAL_SPEARMAN: nan, EVAL_UNKS: 0.5383
2020-05-22 02:32:17.863 [root save:183] Saving training state to runs/estimator/epoch_1




2020-05-22 02:32:18.320 [root save_latest:241] Saving training state to runs/estimator/temp_latest_epoch
2020-05-22 02:32:18.322 [kiwi.trainers.callbacks save_latest:252] Moving runs/estimator/temp_latest_epoch to runs/estimator/latest_epoch
2020-05-22 02:32:30.878 [kiwi.data.utils save_predicted_probabilities:265] Saving sentence_scores predictions to runs/estimator/epoch_1/sentence_scores
2020-05-22 02:32:30.880 [kiwi.trainers.trainer run:75] Epoch 2 of 3
Batches:  81%|█████████████████████▊     | 89/110 [02:26<00:29,  1.43s/ batches]2020-05-22 02:34:58.805 [kiwi.metrics.stats log:60] RMSE: 72.8000, PEARSON: -0.0020, SPEARMAN: -0.0159, UNKS: 0.5424
Batches:  82%|██████████████████████     | 90/110 [02:27<00:28,  1.42s/ batches]



Batches: 100%|██████████████████████████| 110/110 [03:01<00:00,  1.65s/ batches]
2020-05-22 02:35:31.978 [kiwi.metrics.stats log:60] RMSE: 68.8780, PEARSON: 0.0272, SPEARMAN: 0.0122, UNKS: 0.5332




2020-05-22 02:35:45.000 [kiwi.metrics.stats log:60] EVAL_RMSE: 63.3713, EVAL_PEARSON: nan, EVAL_SPEARMAN: nan, EVAL_UNKS: 0.5383
2020-05-22 02:35:45.002 [root save:183] Saving training state to runs/estimator/epoch_2




2020-05-22 02:35:45.454 [root save_latest:241] Saving training state to runs/estimator/temp_latest_epoch
2020-05-22 02:35:45.456 [kiwi.trainers.callbacks _remove_snapshot:178] Removing previous snapshot: runs/estimator/latest_epoch
2020-05-22 02:35:45.457 [kiwi.trainers.callbacks save_latest:252] Moving runs/estimator/temp_latest_epoch to runs/estimator/latest_epoch
2020-05-22 02:35:58.407 [kiwi.data.utils save_predicted_probabilities:265] Saving sentence_scores predictions to runs/estimator/epoch_2/sentence_scores
2020-05-22 02:35:58.410 [kiwi.trainers.trainer run:75] Epoch 3 of 3
Batches:  72%|███████████████████▍       | 79/110 [02:10<00:54,  1.76s/ batches]2020-05-22 02:38:11.466 [kiwi.metrics.stats log:60] RMSE: 64.8512, PEARSON: 0.0007, SPEARMAN: 0.0062, UNKS: 0.5402
Batches:  73%|███████████████████▋       | 80/110 [02:13<01:00,  2.01s/ batches]



Batches: 100%|██████████████████████████| 110/110 [03:01<00:00,  1.65s/ batches]
2020-05-22 02:38:59.906 [kiwi.metrics.stats log:60] RMSE: 61.2775, PEARSON: -0.0062, SPEARMAN: -0.0063, UNKS: 0.5423




2020-05-22 02:39:13.076 [kiwi.metrics.stats log:60] EVAL_RMSE: 55.6197, EVAL_PEARSON: nan, EVAL_SPEARMAN: nan, EVAL_UNKS: 0.5383
2020-05-22 02:39:13.078 [root save:183] Saving training state to runs/estimator/epoch_3




2020-05-22 02:39:13.529 [root save_latest:241] Saving training state to runs/estimator/temp_latest_epoch
2020-05-22 02:39:13.531 [kiwi.trainers.callbacks _remove_snapshot:178] Removing previous snapshot: runs/estimator/latest_epoch
2020-05-22 02:39:13.532 [kiwi.trainers.callbacks save_latest:252] Moving runs/estimator/temp_latest_epoch to runs/estimator/latest_epoch
2020-05-22 02:39:26.452 [kiwi.data.utils save_predicted_probabilities:265] Saving sentence_scores predictions to runs/estimator/epoch_3/sentence_scores
2020-05-22 02:39:26.455 [root copy_best_model:266] Copying best model to runs/estimator/best_model.torch


<kiwi.lib.train.TrainRunInfo at 0x7fdcaf0da990>
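
The evaluation lines in the log report `EVAL_PEARSON: nan`. Pearson correlation is undefined when either input has zero variance, for example when a model predicts the same score for every sentence; whether that is what happened in this run is not confirmed by the log. The pure-Python sketch below shows where such a nan can come from. The function `pearson` is illustrative only and is not OpenKiwi's implementation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns nan if either sequence has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Division by a zero standard deviation: the correlation is undefined.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))              # 1.0
print(math.isnan(pearson([5.0, 5.0, 5.0], [1.0, 2.0, 3.0])))  # True
```

Checking the variance of the saved `sentence_scores` predictions is one way to tell whether the model has collapsed to a constant output.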