In [1]:
from baseline_utils import process_baseline

In [2]:
sentences = process_baseline("/home/adaamko/data/1984.sen-aligned.np-aligned.gold")
len(sentences)

6567

# Better together: modern methods plus traditional thinking in NP alignment

## Overview
+ We study a typical intermediate task in machine translation: the alignment of noun phrases (NPs) in a bitext.
+ We present simple dictionary- and word-vector-based baselines and a BERT-based system.
+ We combine BERT-based methods with simple baselines such as stopword removal, lemmatization, and dictionaries.

## Data

+ The dataset is a manually translated and word-aligned corpus of Orwell’s 1984.
+ It was created as part of the MULTEXT-East project (Erjavec, 2004).
+ A phrase-level alignment between English and Hungarian noun phrases (NPs) was presented by Recski et al. (2010).
* 6,567 sentence pairs
 * 25,561 English NPs
 * 22,408 Hungarian NPs
 * Only top-level NPs

## Data

+ In this context, we reduce the task to deciding, for a pair of NPs, whether they should be aligned.
+ We extract all NP candidates from the data.
+ The dataset contains 121,783 NP pairs
 + 18.5 per sentence
 + 18,789 labeled as aligned (2.9 per sentence)

**[ It ]** 0 was **[ a bright cold day in April ]** 1 , and **[ the clocks ]** 2 were striking **[ thirteen ]** 3 .

**[ Derült , hideg áprilisi nap ]** 0 volt , **[ az órák ]** 1 éppen **[ tizenhármat ]** 2 ütöttek .

1-0

2-1

3-2

**[ It ]** 0 depicted simply **[ an enormous face , more than a metre wide ]** 1 : **[ the face of a man of about forty -five , with a heavy black moustache and ruggedly handsome features ]** 2 .

Csak **[ egy hatalmas arc ]** 0 volt **[ látható ]** 1 **[ rajta ]** 2 , **[ méternél is szélesebb arc ]** 3 : **[ egy negyvenöt év körüli , sűrű fekete bajuszos , durva vonású férfi arca ]** 4 .
   
0-2 

1-0 

1-3b 

2-4

In [4]:
len(sentences)
sentences[0]

{'id': 0,
 'en_sen': [(0, ['It']),
  'was',
  (1, ['a', 'bright', 'cold', 'day', 'in', 'April']),
  ',',
  'and',
  (2, ['the', 'clocks']),
  'were',
  'striking',
  (3, ['thirteen']),
  '.'],
 'hu_sen': [(0, ['Derült', ',', 'hideg', 'áprilisi', 'nap']),
  'volt',
  ',',
  (1, ['az', 'órák']),
  'éppen',
  (2, ['tizenhármat']),
  'ütöttek',
  '.'],
 'sentence_hun': None,
 'aligns': [('1', '0'), ('2', '1'), ('3', '2')]}
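
The candidate extraction described above can be sketched on this structure. This is a minimal illustration, not the project's own code: `extract_nps` and `candidate_pairs` are hypothetical helper names, and the sentence is copied from the cell output above.

```python
from itertools import product

# First sentence pair, copied from the cell output above.
sentence = {
    "id": 0,
    "en_sen": [(0, ["It"]), "was",
               (1, ["a", "bright", "cold", "day", "in", "April"]),
               ",", "and", (2, ["the", "clocks"]), "were", "striking",
               (3, ["thirteen"]), "."],
    "hu_sen": [(0, ["Derült", ",", "hideg", "áprilisi", "nap"]), "volt", ",",
               (1, ["az", "órák"]), "éppen", (2, ["tizenhármat"]),
               "ütöttek", "."],
    "aligns": [("1", "0"), ("2", "1"), ("3", "2")],
}


def extract_nps(tokens):
    """Return {np_id: words} for the (np_id, words) tuples in a token list."""
    return {item[0]: item[1] for item in tokens if isinstance(item, tuple)}


def candidate_pairs(sen):
    """Yield (en_id, hu_id, label) for every English-Hungarian NP pair."""
    en_nps = extract_nps(sen["en_sen"])
    hu_nps = extract_nps(sen["hu_sen"])
    # Gold ids are stored as strings, so compare on the string form.
    gold = set(sen["aligns"])
    for en_id, hu_id in product(en_nps, hu_nps):
        yield en_id, hu_id, (str(en_id), str(hu_id)) in gold


pairs = list(candidate_pairs(sentence))
print(len(pairs), sum(label for _, _, label in pairs))
```

For this sentence the sketch yields 12 candidates (4 English NPs times 3 Hungarian NPs), 3 of which are labeled as aligned.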

## Methods

+ Our simplest method relies on MUSE embeddings.
+ We obtain a bag-of-words representation of each NP:
 + remove stopwords using NLTK (Bird et al., 2009)
 + lemmatize using spaCy (Honnibal and Montani, 2017) and emMorph (Novak et al., 2016)
 + leave NPs that contain only stopwords unchanged
+ We align two NPs if the maximum cosine similarity between any two of their words (one from each NP) is above a threshold.
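
A minimal sketch of this decision rule, under loud assumptions: `STOPWORDS` and `VECTORS` are tiny hypothetical stand-ins for the NLTK stopword lists and the MUSE embeddings, the words are assumed to be lemmatized already, and all function names are illustrative.

```python
import math

# Hypothetical stopword list and 2-d vectors; the slides use NLTK stopwords
# and MUSE embeddings, neither of which is loaded here.
STOPWORDS = {"the", "a", "az", "egy"}
VECTORS = {
    "clock": (0.9, 0.1),
    "óra": (0.8, 0.3),
    "day": (0.0, 1.0),
}


def bag_of_words(np_words):
    """Drop stopwords, but keep the NP unchanged if nothing would remain."""
    content = [w for w in np_words if w.lower() not in STOPWORDS]
    return content or list(np_words)


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_similarity(en_np, hu_np):
    """Maximum cosine similarity over all in-vocabulary word pairs."""
    sims = [cosine(VECTORS[e], VECTORS[h])
            for e in bag_of_words(en_np)
            for h in bag_of_words(hu_np)
            if e in VECTORS and h in VECTORS]
    return max(sims, default=None)


def align(en_np, hu_np, threshold):
    """Add an edge if the best word-pair similarity exceeds the threshold."""
    sim = max_similarity(en_np, hu_np)
    return sim is not None and sim > threshold


print(align(["the", "clock"], ["az", "óra"], threshold=0.46))
print(align(["a", "day"], ["az", "óra"], threshold=0.46))
```

With these toy vectors the first pair is aligned and the second is not. Real MUSE vectors have 300 dimensions, so a vectorized implementation would be preferable in practice.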

## Methods

+ Based on the training dataset, we set this threshold to 0.46.
+ If all the words in an NP are out-of-vocabulary (OOV), we add an edge based on Levenshtein distance.
 + this way we find proper nouns such as Oceania and Óceánia

![threshold](https://github.com/adaamko/np_alignment/blob/master/docs/threshold.JPG?raw=true)
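
The OOV fallback can be sketched as follows. The slides do not give the distance cut-off or say which of the two NPs has to be OOV, so `MAX_DISTANCE`, the shared toy `VOCAB`, and the "either NP is fully OOV" trigger are all assumptions made for illustration.

```python
VOCAB = {"clock", "day", "óra", "nap"}  # hypothetical embedding vocabulary
MAX_DISTANCE = 2                        # hypothetical cut-off, not from the slides


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def all_oov(np_words):
    """True if no word of the NP is in the embedding vocabulary."""
    return all(w.lower() not in VOCAB for w in np_words)


def surface_edge(en_np, hu_np):
    """Fall back to string similarity when embeddings cannot be used."""
    if not (all_oov(en_np) or all_oov(hu_np)):
        return False
    return any(levenshtein(e.lower(), h.lower()) <= MAX_DISTANCE
               for e in en_np for h in hu_np)


print(levenshtein("Oceania", "Óceánia"))
print(surface_edge(["Oceania"], ["Óceánia"]))
```

Oceania and Óceánia differ by two substitutions, so they are linked under this cut-off.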

## BERT

+ We use the multilingual BERT model.
+ For each sentence pair
 + we obtain BERT word embeddings by concatenating the two sentences
 + and using the result as input to the pretrained model
 + we use the outputs of its last 4 hidden layers
 + we keep only the embeddings of the words
+ We feed the word embeddings into an LSTM layer followed by a linear layer to predict the alignment probability.

_It was no use trying the lift . [SEP] A felvonóval nem volt érdemes próbálkozni ._
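
A sketch of how such an input could be assembled from the `en_sen` / `hu_sen` structure while recording where each NP ends up. The helper names are hypothetical, and the NP bracketing of the example sentence is invented for illustration. The positions are word-level; mapping them to BERT's subword tokens is a separate step not shown here.

```python
def flatten(tokens, offset=0):
    """Flatten a sentence; return its words and {np_id: (start, end)} spans."""
    words, spans = [], {}
    for item in tokens:
        if isinstance(item, tuple):      # an (np_id, words) pair
            np_id, np_words = item
            start = offset + len(words)
            words.extend(np_words)
            spans[np_id] = (start, offset + len(words))
        else:                            # a plain token outside any NP
            words.append(item)
    return words, spans


def build_input(en_sen, hu_sen, sep="[SEP]"):
    """Concatenate both sentences and keep NP spans in the joint sequence."""
    en_words, en_spans = flatten(en_sen)
    # Hungarian positions start after the English words and the separator.
    hu_words, hu_spans = flatten(hu_sen, offset=len(en_words) + 1)
    return en_words + [sep] + hu_words, en_spans, hu_spans


# Illustrative bracketing only; the gold NP annotation may differ.
en_sen = [(0, ["It"]), "was", "no", "use", "trying", (1, ["the", "lift"]), "."]
hu_sen = [(0, ["A", "felvonóval"]), "nem", "volt", "érdemes", "próbálkozni", "."]

tokens, en_spans, hu_spans = build_input(en_sen, hu_sen)
print(" ".join(tokens))
print(tokens[slice(*en_spans[1])], tokens[slice(*hu_spans[0])])
```

The recorded spans make it possible to pick out the embeddings of the words in a given NP pair before they are passed on to the classifier.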

## BERT

+ There are approximately 6 times more negative samples than true edges
+ We experimented with:
 + weighted loss functions
 + over- and under-sampling
+ The best results were achieved by oversampling positive examples
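
Random oversampling of the positive class can be sketched as below. The slides do not state the target ratio, so balancing to 1:1 is an assumption, as are the function name and the `(pair, label)` sample format.

```python
import random


def oversample_positives(samples, seed=0):
    """Duplicate random positive samples until both classes are equally large.

    `samples` is a list of (pair, label) tuples; the 1:1 target ratio is an
    assumption, not a value taken from the slides.
    """
    rng = random.Random(seed)
    positives = [s for s in samples if s[1]]
    negatives = [s for s in samples if not s[1]]
    if not positives or len(positives) >= len(negatives):
        return list(samples)
    extra = rng.choices(positives, k=len(negatives) - len(positives))
    balanced = positives + extra + negatives
    rng.shuffle(balanced)
    return balanced


# Toy training set: one true edge and six negative pairs.
train = [(("en0", "hu0"), True)] + [(("en0", f"hu{i}"), False) for i in range(1, 7)]
balanced = oversample_positives(train)
print(len(balanced), sum(label for _, label in balanced))
```

Oversampling should be applied to the training portion only, so that the test set keeps its original class distribution.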

## Dictionary-based alignment

+ Our baseline uses English-Hungarian translation pairs from Wikt2dict and Hokoto.
+ After performing stopword removal and lemmatization we retrieve translations from the dictionaries.
+ If there is a match we add an alignment edge.
+ For words that are at least 5 characters long, a Levenshtein distance not greater than 3 is enough for the words to be considered a match.
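
The matching rule can be sketched as follows. `DICTIONARY` is a tiny hypothetical stand-in for the real translation pairs, and the sketch assumes that the fuzzy comparison is made between a dictionary translation and a word of the Hungarian NP.

```python
# Hypothetical English-to-Hungarian entries standing in for the dictionaries.
DICTIONARY = {
    "clock": {"óra"},
    "face": {"arc"},
    "moustache": {"bajusz"},
}


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def words_match(translation, hu_word):
    """Exact match, or a fuzzy match for words of at least 5 characters."""
    if translation == hu_word:
        return True
    long_enough = len(translation) >= 5 and len(hu_word) >= 5
    return long_enough and levenshtein(translation, hu_word) <= 3


def dict_edge(en_np, hu_np):
    """Add an edge if any translation of an English word matches a Hungarian word."""
    return any(words_match(t, h)
               for e in en_np
               for t in DICTIONARY.get(e, ())
               for h in hu_np)


print(dict_edge(["moustache"], ["bajuszos"]))  # fuzzy match: distance 2
print(dict_edge(["clock"], ["órák"]))          # too short for a fuzzy match
```

The length condition keeps short words such as óra from matching unrelated short words, which is why lemmatization is still needed for them.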


## Results

+ We split the data into train and test portions.
+ The test dataset contains 24,357 NP pairs, of which 3,758 (15.43%) are connected by a gold alignment edge.

<div id="table:baselines">

| **Method**   | **Precision** | **Recall** | **F-score** |
| :----------- | ------------: | ---------: | ----------: |
| always yes   |         15.43 |     100.00 |       26.73 |
| surface      |         22.30 |      38.27 |       28.18 |
| MUSE         |         63.51 |      66.29 |       64.87 |
| MUSE+surface |         63.52 |      67.96 |       65.66 |
| BERT         |         67.06 |  **77.20** |       71.77 |
| Dict         |         77.49 |      72.01 |       74.65 |
| Dict+surface |     **78.08** |      76.66 |   **77.36** |

Maximum precision, recall and F-score of the systems.

</div>
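
As a sanity check, the "always yes" row can be recomputed from the test set counts given above. The helper name `prf` is illustrative, and the F-score is assumed to be the balanced F1.

```python
def prf(true_positives, predicted, gold):
    """Precision, recall and balanced F-score from raw counts."""
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# "always yes" predicts an edge for all 24,357 test pairs,
# so it recovers all 3,758 gold edges.
p, r, f = prf(true_positives=3758, predicted=24357, gold=3758)
print(f"{100 * p:.2f} {100 * r:.2f} {100 * f:.2f}")
```

This prints 15.43 100.00 26.73, matching the first row of the table.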

## Results

+ We also experimented with voting schemes that combine the systems.

<div id="table:hybrid">

| **Method**                 | **Precision** | **Recall** | **F-score** |
| :------------------------- | ------------: | ---------: | ----------: |
| BERT $\vee$ Dict+surface   |         62.61 |      90.77 |       74.10 |
| BERT $\wedge$ Dict+surface |         92.33 |      63.09 |       74.96 |
| 3-way vote                 |         82.30 |      78.79 |       80.51 |

Performance of hybrid systems.

</div>
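
The combination schemes can be sketched on per-pair boolean predictions. The slides do not say which third system takes part in the 3-way vote or how it is decided, so the simple majority rule and the toy prediction lists below are assumptions.

```python
def vote_or(a, b):
    """Edge if either system predicts it (favors recall)."""
    return [x or y for x, y in zip(a, b)]


def vote_and(a, b):
    """Edge only if both systems predict it (favors precision)."""
    return [x and y for x, y in zip(a, b)]


def majority(*systems):
    """Edge if more than half of the systems predict it."""
    return [2 * sum(votes) > len(votes) for votes in zip(*systems)]


# Toy predictions for four NP pairs; not real system output.
bert = [True, True, False, False]
dict_surface = [True, False, True, False]
third = [False, True, False, False]  # hypothetical third system

print(vote_or(bert, dict_surface))
print(vote_and(bert, dict_surface))
print(majority(bert, dict_surface, third))
```

The union raises recall at the cost of precision and the intersection does the opposite, which is the pattern visible in the table above.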