# Human or Neural Translation

## Feature-based

Monolingual features:
  - n-gram
      - 2-7 range, top 30k
  - KenLM features 
      - ratios of min and max logprob over the (target) sentence per model
      - the number of tokens with a logprob less than {mean, max, −6} (three features per
      - the logprob of the full sentence given by the left-to-right model
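
The language-model features listed above can be sketched in plain Python once per-token log probabilities are available. This is a minimal sketch under stated assumptions: the `logprob_features` helper and its feature names are hypothetical, "ratio of min and max" is read here as `min / max`, and the scores are made-up numbers. In practice they would come from a trained language model (the `kenlm` Python bindings, for instance, expose per-token scores through `full_scores`).

```python
import math
from statistics import mean


def logprob_features(token_logprobs, threshold=-6.0):
    """Summarise the per-token log probabilities of one sentence.

    The feature names are illustrative, not taken from the paper.
    """
    lowest, highest = min(token_logprobs), max(token_logprobs)
    average = mean(token_logprobs)
    return {
        # Assumed reading of "ratio of min and max logprob"
        "min_max_ratio": lowest / highest if highest != 0 else math.inf,
        # Counts of tokens scoring below {mean, max, fixed threshold}
        "n_below_mean": sum(lp < average for lp in token_logprobs),
        "n_below_max": sum(lp < highest for lp in token_logprobs),
        "n_below_threshold": sum(lp < threshold for lp in token_logprobs),
        # Log probability of the full sentence is the sum over tokens
        "sentence_logprob": sum(token_logprobs),
    }


# Hypothetical scores for a five-token sentence
feats = logprob_features([-1.0, -2.0, -7.5, -0.5, -4.0])
print(feats)
```

With several language models, the same function would be applied once per model and the resulting dictionaries concatenated into one feature vector.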

Bilingual features:
  - "Unsupervised feature" aggregation for detecting spurious alignment

## Neural

Monolingual features:
  - BiLSTM from scratch
  - LASER representations
  - Pretrained transformers
  
Bilingual features:
  - BiLSTM
  - LASER representations (diff concat with dot)
  - Pretrained transformers

In [1]:
import pandas as pd

from sacremoses import MosesTokenizer
from sklearn.feature_extraction.text import CountVectorizer

from functools import partial

train_df = pd.read_csv('train.tsv', sep='\t', encoding='utf-8')
valid_df = pd.read_csv('valid.tsv', sep='\t', encoding='utf-8')

mt = MosesTokenizer(lang='en')

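# NOTE: return_str=True makes tokenize() return one whitespace-joined string,
# not a list of tokens. CountVectorizer expects its tokenizer to return a list,
# so it iterates over that string character by character; drop return_str
# to get word-level features.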
tokenizer = partial(mt.tokenize, return_str=True)

vectorizer = CountVectorizer(tokenizer=tokenizer, max_features=30_000)
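
The outline lists n-grams in the 2-7 range, capped at the 30k most frequent. As a library-free illustration of that feature set, the sketch below counts word n-grams and keeps the top `k`. It assumes word n-grams over lowercased whitespace tokens (the outline does not say whether word or character n-grams are meant), and the helper names `word_ngrams` and `top_ngrams` are hypothetical. In scikit-learn, the corresponding `CountVectorizer` settings would be `ngram_range=(2, 7)` and `max_features=30_000`.

```python
from collections import Counter


def word_ngrams(tokens, n_min=2, n_max=7):
    """Yield every word n-gram with n_min <= n <= n_max as a string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def top_ngrams(sentences, k=30_000):
    """Return the k most frequent n-grams across all sentences."""
    counts = Counter()
    for sent in sentences:
        counts.update(word_ngrams(sent.lower().split()))
    return [ngram for ngram, _ in counts.most_common(k)]


vocab = top_ngrams(["thank you very much", "thank you"], k=3)
print(vocab)
```

Each sentence would then be represented by its counts over this fixed vocabulary.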

The data comes from the Europarl corpus, Danish to English. The `en_mt` column was populated by keeping only rows whose source and target are both non-empty and translating the Danish side with the `Helsinki-NLP/opus-mt-da-en` model via Hugging Face.

In [2]:
train_df.head()

| Unnamed: 0 | en | da | en_mt |
|---|---|---|---|
| 0 | My final point is that animals should not be s... | Afslutningsvis vil jeg sige, at dyr ikke bør u... | In conclusion, animals should not be subjected... |
| 1 | A clear agreement on this item would have comp... | En klar aftale om dette spørgsmål havde afslut... | A clear agreement on this issue had ended this... |
| 2 | Thank you very much. | Mange tak. | Thank you very much. |
| 3 | As a result of this debate, I would like to co... | Som resultat af denne forhandling vil jeg gern... | As a result of this debate, I would like to co... |
| 4 | I have little doubt that the report will event... | Jeg tvivler ikke på, at betænkningen med tiden... | I have no doubt that the report will eventuall... |


In [3]:
import numpy as np

def reorganize_data(df, ht_col="en", mt_col="en_mt"):
    """Combines HT and MT column and assigns 1 to HTs and 0 to MTs.
    X and y are then shuffled.
    """
    X_ht = df[ht_col].values
    y_ht = np.ones_like(X_ht, dtype=np.int32)
    X_mt = df[mt_col].values
    y_mt = np.zeros_like(X_mt, dtype=np.int32)
    X = np.hstack([X_ht, X_mt])
    y = np.hstack([y_ht, y_mt])
    assert X.shape == y.shape
    # Shuffle the X and y the same way by shuffling indices and indexing
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    return X[indices], y[indices]
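
Because `np.random.shuffle` draws from the global random state, the split order changes from run to run. A reproducible variant can be sketched with a seeded generator; the `shuffle_together` helper and the toy arrays are hypothetical and only illustrate the index-permutation idea used above.

```python
import numpy as np


def shuffle_together(X, y, seed=0):
    """Apply one seeded permutation to X and y so pairs stay aligned."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    return X[order], y[order]


# Toy data: 1 = human translation, 0 = machine translation
X = np.array(["ht a", "ht b", "mt a", "mt b"], dtype=object)
y = np.array([1, 1, 0, 0], dtype=np.int32)

X_s, y_s = shuffle_together(X, y, seed=13)
X_again, y_again = shuffle_together(X, y, seed=13)
print(list(zip(X_s, y_s)))
```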

In [4]:
X_train, y_train = reorganize_data(train_df)
X_valid, y_valid = reorganize_data(valid_df)

In [5]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("feat", vectorizer),
    ("model", RandomForestClassifier(n_estimators=1000, max_depth=40))
])

In [6]:
pipe.fit(X_train, y_train)

Pipeline(steps=[('feat',
                 CountVectorizer(max_features=30000,
                                 tokenizer=functools.partial(<bound method MosesTokenizer.tokenize of <sacremoses.tokenize.MosesTokenizer object at 0x000001BC58479430>>, return_str=True))),
                ('model',
                 RandomForestClassifier(max_depth=40, n_estimators=1000))])

In [7]:
pipe.score(X_valid, y_valid)

0.5509

In [8]:
y_pred = pipe.predict(X_valid)

In [9]:
from sklearn.metrics import classification_report

print(classification_report(y_valid, y_pred))

              precision    recall  f1-score   support

           0       0.55      0.59      0.57      5000
           1       0.56      0.51      0.53      5000

    accuracy                           0.55     10000
   macro avg       0.55      0.55      0.55     10000
weighted avg       0.55      0.55      0.55     10000



# LASER-based model

> For the bilingual detection task, we extract the representation of the source and target sentences and
tie them into one vector by taking their absolute difference and dot product, and adding them. This
tied representation is then passed through **3 hidden layers of size 512, 150 and 75 respectively with
dropout (Srivastava et al., 2014) of 50%, and then fed into a relu (Nair and Hinton, 2010) activation
function, whose output is finally passed to the sigmoid function**. For the monolingual task, we just use
the LASER French (target) representation of the sentence and pass it through the very same architecture.
We train the classifiers with the **Adadelta optimizer with gradient clipping (clip value 3)**.
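
For the bilingual variant, the quoted description leaves some room for interpretation in how the two sentence vectors are tied together. The sketch below shows one reading, matching the "diff concat with dot" note in the outline: the element-wise absolute difference concatenated with the element-wise product. That reading is an assumption, the `tie_embeddings` helper is hypothetical, and random vectors stand in for real LASER embeddings.

```python
import numpy as np


def tie_embeddings(src, tgt):
    """Combine source and target embeddings of shape (n_sentences, dim).

    Assumed reading: concatenate |src - tgt| with the element-wise product.
    """
    abs_diff = np.abs(src - tgt)
    product = src * tgt
    return np.concatenate([abs_diff, product], axis=1)


# Random stand-ins for 1024-dimensional LASER sentence embeddings
rng = np.random.default_rng(0)
src = rng.standard_normal((4, 1024)).astype(np.float32)
tgt = rng.standard_normal((4, 1024)).astype(np.float32)

tied = tie_embeddings(src, tgt)
print(tied.shape)  # (4, 2048)
```

Under this reading, the first linear layer of the classifier would need 2048 inputs rather than the 1024 used for the monolingual task.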

In [10]:
from laserembeddings import Laser

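# Load the LASER encoder once instead of on every call
laser = Laser()

def embed_with_laser(sents, lang="en"):
    """Return one LASER embedding per sentence."""
    return laser.embed_sentences(sents, lang=lang)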

In [11]:
X_train_laser = embed_with_laser(X_train)
X_valid_laser = embed_with_laser(X_valid)

In [12]:
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 512),
    nn.Dropout(0.5),
    nn.ReLU(),
    nn.Linear(512, 150),
    nn.Dropout(0.5),
    nn.ReLU(),
    nn.Linear(150, 75),
    nn.Dropout(0.5),
    nn.ReLU(),
    nn.Linear(75, 1),
    nn.Sigmoid()
).to("cuda:0")

In [13]:
from torch.utils.data import TensorDataset, DataLoader

train_ds = TensorDataset(torch.from_numpy(X_train_laser).to("cuda:0"), torch.from_numpy(y_train).to("cuda:0"))
valid_ds = TensorDataset(torch.from_numpy(X_valid_laser).to("cuda:0"), torch.from_numpy(y_valid).to("cuda:0"))

In [14]:
def evaluate(model, valid_loader):
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for (X, y) in valid_loader:
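            # Round the sigmoid outputs to hard 0/1 predictions
            outputs = model(X).round()
            # .item() keeps the running count a plain Python int
            batch_correct = (outputs == y.unsqueeze(-1)).sum().item()
            batch_total = X.size(0)
            correct += batch_correct
            total += batch_total
    print("Accuracy: {0:.4f}".format(correct / total))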
    model.train()
            
        
def train(model, train_loader, valid_loader, opt, num_epochs, log_every=5):
    criterion = nn.BCELoss()
    for epoch in range(num_epochs):

        running_loss = 0.0
        for i, (inputs, labels) in enumerate(train_loader, 1):

            opt.zero_grad()

            outputs = model(inputs)
            loss = criterion(outputs, labels.unsqueeze(-1).float())
            loss.backward()
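            # Clip the global gradient norm at 3.0. The paper's "clip value 3"
            # may instead mean element-wise clipping (nn.utils.clip_grad_value_).
            nn.utils.clip_grad_norm_(model.parameters(), 3.0)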
            opt.step()

            # Accumulate the batch-mean loss; the logged value is a sum over batches
            running_loss += loss.item()
        if epoch % log_every == log_every - 1:
            print("Epoch {0} loss: {1:.4f}".format(epoch, running_loss))
            evaluate(model, valid_loader)

In [15]:
opt = torch.optim.Adadelta(model.parameters())
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
valid_loader = DataLoader(valid_ds, batch_size=256)

In [16]:
train(model, train_loader, valid_loader, opt, 75)

Epoch 4 loss: 271.0502
Accuracy: 0.5000
Epoch 9 loss: 271.0195
Accuracy: 0.5000
Epoch 14 loss: 270.9592
Accuracy: 0.5248
Epoch 19 loss: 269.2780
Accuracy: 0.5106
Epoch 24 loss: 266.7996
Accuracy: 0.5718
Epoch 29 loss: 264.8305
Accuracy: 0.5790
Epoch 34 loss: 263.6776
Accuracy: 0.5560
Epoch 39 loss: 263.3049
Accuracy: 0.5815
Epoch 44 loss: 262.0356
Accuracy: 0.5731
Epoch 49 loss: 261.2726
Accuracy: 0.5828
Epoch 54 loss: 260.4928
Accuracy: 0.5737
Epoch 59 loss: 259.7228
Accuracy: 0.5826
Epoch 64 loss: 259.2035
Accuracy: 0.5878
Epoch 69 loss: 258.0294
Accuracy: 0.5824
Epoch 74 loss: 257.0403
Accuracy: 0.5842
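
The logged loss is the sum of batch-mean `BCELoss` values over an epoch, so it is easier to interpret after dividing by the number of batches. The sketch below does that arithmetic for the first and last logged epochs, using the 50,000 training examples and batch size of 128 from this notebook. A classifier that always outputs 0.5 has a binary cross-entropy of ln 2, which is the reference point used here.

```python
import math

# Summed losses copied from the training log above
summed_loss_epoch_4 = 271.0502
summed_loss_epoch_74 = 257.0403

# 50,000 training examples in batches of 128
n_batches = math.ceil(50_000 / 128)

mean_loss_epoch_4 = summed_loss_epoch_4 / n_batches
mean_loss_epoch_74 = summed_loss_epoch_74 / n_batches
chance_level = math.log(2)

print(n_batches)
print(round(mean_loss_epoch_4, 4), round(mean_loss_epoch_74, 4), round(chance_level, 4))
```

The early per-batch loss sits almost exactly at ln 2, consistent with the 0.5000 accuracy logged for those epochs, and it decreases only slightly by the end of training.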


# Transformer-based experiment

The authors focus primarily on translating _out of_ English rather than _into_ English, so their choice of pretrained transformers depends on which target-language models are available. This demonstration instead translates Danish into English, so any English-language pretrained model can be used.

Here we show a proof of concept: fine-tuning the RoBERTa base model for 1 epoch.

In [17]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs

In [26]:
from sklearn.metrics import accuracy_score

# Note: this rebinds train_df, which previously held the raw Europarl frame
train_df = pd.DataFrame({"text": X_train, "labels": y_train})
eval_df = pd.DataFrame({"text": X_valid, "labels": y_valid})


model_args = ClassificationArgs(num_train_epochs=1, train_batch_size=128, eval_batch_size=256, overwrite_output_dir=True)
model = ClassificationModel("roberta", "roberta-base", args=model_args)

model.train_model(train_df, acc=accuracy_score)

# Extra keyword arguments are reported as additional metrics (here "acc")
result, model_outputs, wrong_predictions = model.eval_model(eval_df, acc=accuracy_score)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'lm_head.dense.bias', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.weight', 'roberta.pooler.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.weight', 'classifie

  0%|          | 0/50000 [00:00<?, ?it/s]



Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/391 [00:00<?, ?it/s]

  0%|          | 0/10000 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/40 [00:00<?, ?it/s]

In [27]:
from pprint import pprint

pprint(result)

{'acc': 0.6594,
 'auprc': 0.7502836292673939,
 'auroc': 0.72908448,
 'eval_loss': 0.6080136880278587,
 'fn': 2619,
 'fp': 787,
 'mcc': 0.3426271720808088,
 'tn': 4213,
 'tp': 2381}
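
The confusion counts in this result are enough to recompute the headline metrics by hand, which also shows where the errors fall. The sketch below uses only the four counts printed above and the standard formulas. Label 1 is the human translation, as assigned in `reorganize_data`.

```python
import math

# Confusion counts copied from the result above (positive class = human translation)
tp, fp, tn, fn = 2381, 787, 4213, 2619

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} mcc={mcc:.4f}")
```

The recomputed accuracy and MCC agree with the reported `acc` and `mcc`. Precision on human translations is about 0.75 while recall is below 0.5, so the model labels a sentence as human-translated fairly reliably when it does so but misses more than half of the human translations.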
