# Moses IS-EN EN-IS phrase-based translation system
See `README.md` for how to run this notebook.

In this notebook the data is preprocessed and the Moses translation system is used to build two translation systems, IS-EN and EN-IS.
All data is assumed to be available under `/work/data`. See the instructions in `README.md` on how to set this up with `docker` or `singularity`.

In short, the notebook is divided into the following parts:
1. Parallel and monolingual data are prepared.
1. Language models are built for EN and IS (KenLM).
1. The text is split into three parts, train/val/test; val and test contain 3000 and 2000 sentences respectively.
1. The Moses system is trained on the train split.
1. The Moses system is tuned on the val split.
1. The Moses system is evaluated with BLEU on the test split.
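
The train/val/test split in the list above can be sketched as follows. This is only an illustration: `split_corpus`, the fixed seed and the shuffling step are assumptions and not the code in `corpus.py`; only the val/test sizes (3000/2000) come from this notebook.

```python
import random

def split_corpus(pairs, val_size=3000, test_size=2000, seed=42):
    """Shuffle sentence pairs and cut off fixed-size test and val parts.

    Hypothetical helper: the real split is done by the corpus module.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:test_size]
    val = shuffled[test_size:test_size + val_size]
    train = shuffled[test_size + val_size:]
    return train, val, test

# Toy corpus of (is, en) pairs, purely for illustration.
toy_pairs = [(f'is-{i}', f'en-{i}') for i in range(10000)]
train, val, test = split_corpus(toy_pairs)
print(len(train), len(val), len(test))
```

Everything that is not held out for val or test ends up in train, so the three parts never overlap.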

All files and models are placed in the directory "WORKING_DIR" (see `README.md`).

The module `corpus.py` defines functions and data types that are used extensively here.

In [3]:
from collections import defaultdict, Counter, OrderedDict
import os
import pathlib
from pathlib import Path
import re
from pprint import pprint
import importlib
from typing import List

import matplotlib.pyplot as plt
import numpy as np

import corpus.corpus as c

importlib.reload(c)

%matplotlib notebook

print(os.getenv('MOSESDECODER'))
print(os.getenv('MOSESDECODER_TOOLS'))
print(int(os.getenv('THREADS')))
print(int(os.getenv('MEMORY')))

working_dir = pathlib.Path('/work')
data_dir = working_dir.joinpath('data')
processing_dir = working_dir.joinpath('process')
p = processing_dir
parice_dir = data_dir.joinpath('parice')
rmh_dir = data_dir.joinpath('risamalheild')

IS = c.Lang.IS
EN = c.Lang.EN

RMH, PARICE = 'rmh', 'parice'
TRAIN, VAL, TEST = 'train', 'val', 'test'

langs = [IS, EN]
splits = [TRAIN, VAL, TEST]

# Processing-stage names passed to c.read / c.write below.
# TRAIN, VAL and TEST are already defined above.
CAT = 'cat'
SHUFFLE = 'shuffle'
REGEXP = 'regexp'
SENT_FIX = 'sent_fix'
LOWER = 'lower'
TOKENIZE = 'tok'
PLACEHOLDERS = 'placeholders'
LENGTH = 'length'
DROP = 'drop'
LM = 'lm-blm'
KVISTUR = 'kvistur'
BPE = 'bpe'
FINAL = 'final'

/opt/moses
/opt/moses_tools
15
32768


[nltk_data] Downloading package punkt to
[nltk_data]     /home/staff/haukurpj/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Shorten training sentences
Moses has difficulty aligning long sentences. We filter the training sentences so that only sentences that are at least one word long, and at most as long as the limit defined below, are kept. We have observed that using the maximum length (100) does not give good results.

Since we use a script defined in Moses that takes two files at a time, the naming convention goes slightly astray.
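
To make the filter concrete, here is a rough Python approximation of the rule that `clean-corpus-n.perl` applies to each sentence pair. The function name `keep_pair` is hypothetical, and the real script performs further checks not reproduced here; the cutoff 1-70 and ratio 9 are taken from the script output in this notebook.

```python
def keep_pair(src, tgt, min_length=1, max_length=70, max_ratio=9):
    """Return True if a sentence pair passes the length and ratio limits.

    Approximation of clean-corpus-n.perl, for illustration only.
    """
    n_src, n_tgt = len(src.split()), len(tgt.split())
    # Both sides must be within [min_length, max_length] tokens.
    if not (min_length <= n_src <= max_length):
        return False
    if not (min_length <= n_tgt <= max_length):
        return False
    # Neither side may be more than max_ratio times longer than the other.
    return n_src <= max_ratio * n_tgt and n_tgt <= max_ratio * n_src

print(keep_pair('this is fine', 'þetta er fínt'))
print(keep_pair('word ' * 71, 'orð ' * 71))
```

Filtering is done per pair, so a sentence is dropped from both languages whenever either side violates a limit.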

In [34]:
def corpus_shorten(path, path_out, lang_id_1, lang_id_2, min_length, max_length):
    !{os.getenv('MOSESDECODER')}/scripts/training/clean-corpus-n.perl {path} {lang_id_1} {lang_id_2} {path_out} {min_length} {max_length}
    return True

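# The language argument (IS) does not matter here: with_suffix('') strips the language
# extension, because clean-corpus-n.perl takes the corpus stem plus both language suffixes.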
IN = c.read(p, IS, PARICE, TRAIN, FINAL).with_suffix('')
OUT = c.write(p, IS, PARICE, TRAIN, LENGTH).with_suffix('')

corpus_shorten(IN, OUT, 'en', 'is', 1, 70)

clean-corpus.perl: processing /work/process/parice-train-final.en & .is to /work/process/parice-train-length, cutoff 1-70, ratio 9
..........(100000)..........(200000)..........(300000)..........(400000)..........(500000)..........(600000)..........(700000)..........(800000)..........(900000)..........(1000000)..........(1100000)..........(1200000)..........(1300000)..........(1400000)..........(1500000)..........(1600000)..........(1700000)..........(1800000)..........(1900000)..........(2000000)..........(2100000)..........(2200000)..........(2300000)..........(2400000)..........(2500000)..........(2600000)..........(2700000)..........(2800000)..........(2900000)..........(3000000)..........(3100000)..........(3200000)..........(3300000).....
Input sentences: 3351141  Output sentences:  3311201


True

### Language model
We build a KenLM language model to give us sentence probabilities. To speed up lookups, the language model is converted to a binary format at the same time.
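
As a simplified illustration of the first step of building such a model, the sketch below counts n-grams up to order 3, with sentence boundary markers. It is not what `lmplz` does internally: `lmplz` additionally estimates smoothed probabilities (modified Kneser-Ney) from these counts. The helper name `count_ngrams` is an assumption made for this example.

```python
from collections import Counter

def count_ngrams(sentences, order=3):
    """Count all n-grams up to `order`, padding with <s> and </s>."""
    counts = {n: Counter() for n in range(1, order + 1)}
    for sentence in sentences:
        tokens = ['<s>'] + sentence.split() + ['</s>']
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts

counts = count_ngrams(['this is a test', 'this is nice'])
print(len(counts[1]), 'unigram types')
print(counts[2][('this', 'is')], 'occurrences of the bigram "this is"')
```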

In [35]:
def create_lm(path, out_path, order):
    tmp_arpa = out_path.with_suffix('.arpa')
    !{os.getenv('MOSESDECODER')}/bin/lmplz --order {order} --temp_prefix {data_dir}/ --memory 50% --discount_fallback < {path} > {tmp_arpa}
    !{os.getenv('MOSESDECODER')}/bin/build_binary -S 50% {tmp_arpa} {out_path}
    return True

#### EN language model

In [36]:
create_lm(c.read(p, EN, PARICE, TRAIN, LENGTH), c.write(p, EN, PARICE, TRAIN, LM), order=3)

=== 1/5 Counting and sorting n-grams ===
Reading /work/process/parice-train-length.en
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 45972494 types 252581
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:3030972 2:23444791296 3:43958984704
Statistics:
1 252581 D1=0.662731 D2=1.03135 D3+=1.3416
2 3344252 D1=0.712012 D2=1.08004 D3+=1.39646
3 10877598 D1=0.680056 D2=1.14824 D3+=1.4491
Memory estimate for binary LM:
type     MB
probing 269 assuming -p 1.5
probing 289 assuming -r models -p 1.5
trie    111 without quantization
trie     62 assuming -q 8 -b 8 quantization 
trie    105 assuming -a 22 array pointer compression
trie     56 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:3030972 2:53508032 3:217551960
---

True

#### IS language model (RMH + TRAIN)

In [37]:
c.combine((c.read(p, IS, PARICE, TRAIN, FINAL), 
           c.read(p, IS, RMH, FINAL)), 
          c.write(p, IS, RMH, TRAIN, CAT))

True

In [None]:
create_lm(c.read(p, IS, RMH, TRAIN, CAT), c.write(p, IS, RMH, TRAIN, CAT, LM), order=3)

=== 1/5 Counting and sorting n-grams ===
Reading /work/process/rmh-train-cat.is
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
************

Test the language models; there should be no unknown (OOV) words.

In [12]:
def eval_sentence(lm_model, sentence):
    !echo "{sentence}" | {os.getenv('MOSESDECODER')}/bin/query {lm_model}

eval_sentence(c.read(p, IS, RMH, TRAIN, CAT, LM, 'moses'), "þetta er flott íslensk setning , er það ekki ?")
eval_sentence(c.read(p, EN, PARICE, TRAIN, LM, 'moses'), "this is a nice english sentence , right ?")

þetta=416 2 -1.7512153	er=105 3 -0.45226997	flott=7093 3 -3.175889	íslensk=8308 2 -4.092872	setning=39567 2 -5.111964	,=25 2 -1.514426	er=105 3 -2.4005053	það=264 3 -1.3744158	ekki=189 3 -1.170756	?=94 3 -1.5291206	</s>=2 3 -0.05531629	Total: -22.628752 OOV: 0
Perplexity including OOVs:	114.06679797702375
Perplexity excluding OOVs:	114.06679797702375
OOVs:	0
Tokens:	11
Name:query	VmPeak:8210512 kB	VmRSS:4860 kB	RSSMax:8195016 kB	user:0	sys:8.79275	CPU:8.79275	real:8.78055
this=208 2 -1.798729	is=200 3 -0.6794696	a=12 3 -1.0007749	nice=994 3 -2.8335419	english=6077 1 -4.5991697	sentence=2824 1 -5.0001507	,=6 2 -1.1415414	right=182 2 -3.7487242	?=93 3 -0.14266676	</s>=2 3 -0.034414142	Total: -20.979181 OOV: 0
Perplexity including OOVs:	125.2904961274026
Perplexity excluding OOVs:	125.2904961274026
OOVs:	0
Tokens:	10
Name:query	VmPeak:296704 kB	VmRSS:5000 kB	RSSMax:281348 kB	user:0	sys:0.428492	CPU:0.428492	real:0.431875
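
The perplexity values in the output above follow directly from the `Total` log10 probability and the token count (which includes the end-of-sentence marker `</s>`). A minimal sketch; the function name `perplexity` is only chosen for this example:

```python
def perplexity(total_log10_prob, n_tokens):
    """Perplexity from a summed log10 probability over n_tokens tokens."""
    return 10 ** (-total_log10_prob / n_tokens)

# Totals and token counts copied from the query output above.
print(perplexity(-22.628752, 11))  # Icelandic sentence
print(perplexity(-20.979181, 10))  # English sentence
```

Lower perplexity means the model finds the sentence less surprising; since there are no OOVs, the values including and excluding OOVs coincide.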


## Moses training functions
The following functions deal with training Moses and related concerns. Training takes about 12 hours.
To follow the training progress, run the `tail -f` command that is printed when the functions are called. The last step evaluates the Moses translations.

In [11]:
def train_moses(model_dir, corpus, lang_from, lang_to, lang_to_lm, lm_order):
    print(f'tail -f {model_dir}/training.out')
    result = !{os.getenv('MOSESDECODER')}/scripts/training/train-model.perl -root-dir {model_dir} \
        -corpus {corpus} \
        -f {lang_from} -e {lang_to} \
        -alignment grow-diag-final-and -reordering msd-bidirectional-fe \
        -lm 0:{lm_order}:{lang_to_lm}:8 \
        -mgiza -mgiza-cpus {os.getenv('THREADS')} \
        -parallel -sort-buffer-size {os.getenv('MEMORY')} -sort-batch-size 1021 \
        -sort-compress gzip -sort-parallel {os.getenv('THREADS')} \
        -cores {os.getenv('THREADS')} \
        -external-bin-dir {os.getenv('MOSESDECODER_TOOLS')} &> {model_dir}/training.out
    return model_dir
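
The `!` shell syntax only works inside IPython. Outside a notebook, the same training command could be assembled as an argument list, as sketched below. The helper `build_train_command` and its keyword defaults are assumptions for illustration (the defaults mirror the environment values printed at the top of this notebook); the flags are the ones used in `train_moses` above.

```python
import os

def build_train_command(model_dir, corpus, lang_from, lang_to, lm_path, lm_order,
                        moses_dir=None, tools_dir=None, threads=None, memory=None):
    """Build the train-model.perl call from train_moses as a list of arguments."""
    moses_dir = moses_dir or os.getenv('MOSESDECODER', '/opt/moses')
    tools_dir = tools_dir or os.getenv('MOSESDECODER_TOOLS', '/opt/moses_tools')
    threads = str(threads or os.getenv('THREADS', '15'))
    memory = str(memory or os.getenv('MEMORY', '32768'))
    return [
        f'{moses_dir}/scripts/training/train-model.perl',
        '-root-dir', str(model_dir),
        '-corpus', str(corpus),
        '-f', lang_from, '-e', lang_to,
        '-alignment', 'grow-diag-final-and',
        '-reordering', 'msd-bidirectional-fe',
        '-lm', f'0:{lm_order}:{lm_path}:8',
        '-mgiza', '-mgiza-cpus', threads,
        '-parallel', '-sort-buffer-size', memory, '-sort-batch-size', '1021',
        '-sort-compress', 'gzip', '-sort-parallel', threads,
        '-cores', threads,
        '-external-bin-dir', tools_dir,
    ]

cmd = build_train_command('/work/is-en-improved/base',
                          '/work/process/parice-train-length',
                          'is', 'en', '/work/process/lm.blm', 3)
print(' '.join(cmd))
# To actually run it, with the log going to training.out:
#   import subprocess
#   with open('/work/is-en-improved/base/training.out', 'w') as log:
#       subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
```

Passing a list avoids shell quoting issues with paths, at the cost of handling the log redirection explicitly. The LM path in the usage line is a placeholder.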

In [12]:
def tune_moses(model_dir, corpus_val_from, corpus_val_to, base_moses_ini):
    print(f'tail -f {model_dir}/tune.out')
    result = !{os.getenv('MOSESDECODER')}/scripts/training/mert-moses.pl \
        {corpus_val_from} \
        {corpus_val_to} \
        {os.getenv('MOSESDECODER')}/bin/moses {base_moses_ini} \
        --mertdir {os.getenv('MOSESDECODER')}/bin \
        --working-dir {model_dir} \
        --decoder-flags="-threads {os.getenv('THREADS')}" &> {model_dir}/tune.out
    return model_dir

In [13]:
def prepare_binarisation(tuned_moses_ini,
                         lm_path_in,
                         lm_path_out,
                         binarised_moses_ini,
                         binarised_phrase_table,
                         binarised_reordering_table):
    !cp {tuned_moses_ini} {binarised_moses_ini}
    !cp {lm_path_in} {lm_path_out}
    # Point the language model path in moses.ini to the copied LM.
    # Slashes are escaped because '/' is the sed delimiter.
    escaped_path_in = str(lm_path_in).replace(r'/', '\/')
    escaped_path_out = str(lm_path_out).replace(r'/', '\/')
    !sed -i 's/{escaped_path_in}/{escaped_path_out}/' {binarised_moses_ini}
    # Switch to the compact phrase table and point to the binarised file.
    escaped_path = str(binarised_phrase_table).replace(r'/', '\/')
    !sed -i 's/PhraseDictionaryMemory/PhraseDictionaryCompact/' {binarised_moses_ini}
    !sed -i 's/4 path=.*\.gz input-factor/4 path={escaped_path} input-factor/' {binarised_moses_ini}
    # Point the reordering model to the binarised reordering table.
    escaped_path = str(binarised_reordering_table).replace(r'/', '\/')
    !sed -i 's/0 path=.*\.gz$/0 path={escaped_path}/' {binarised_moses_ini}
    
def binarise_phrase_table(base_phrase_table, binarised_phrase_table):
    #Create the table
    !{os.getenv('MOSESDECODER')}/bin/processPhraseTableMin \
        -in {base_phrase_table} \
        -nscores 4 \
        -out {binarised_phrase_table}
    
def binarise_reordering_table(base_reordering_table, binarised_reordering_table):
    #Create the table
    !{os.getenv('MOSESDECODER')}/bin/processLexicalTableMin \
        -in {base_reordering_table} \
        -out {binarised_reordering_table}
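
The `sed` substitutions in `prepare_binarisation` can also be expressed in plain Python, which removes the need to escape slashes. The sketch below is an alternative, not the notebook's method: `rewrite_moses_ini` is a hypothetical helper, and the sample `moses.ini` lines are made up to resemble the feature lines shown in the decoder output further down. One difference: `sed` treats the LM path as a regular expression, whereas `str.replace` matches it literally.

```python
import re

def rewrite_moses_ini(text, lm_in, lm_out, phrase_table, reordering_table):
    """Apply the same substitutions as the sed calls, line by line."""
    out_lines = []
    for line in text.splitlines():
        # Language model path (literal match, first occurrence per line).
        line = line.replace(str(lm_in), str(lm_out), 1)
        # Compact phrase table and its new path.
        line = line.replace('PhraseDictionaryMemory', 'PhraseDictionaryCompact', 1)
        line = re.sub(r'4 path=.*\.gz input-factor',
                      lambda m: f'4 path={phrase_table} input-factor', line, count=1)
        # Reordering table path (the path is the last field on its line).
        line = re.sub(r'0 path=.*\.gz$',
                      lambda m: f'0 path={reordering_table}', line, count=1)
        out_lines.append(line)
    return '\n'.join(out_lines)

# Made-up sample lines, for illustration only.
sample = '\n'.join([
    'PhraseDictionaryMemory name=TranslationModel0 num-features=4 '
    'path=/m/base/model/phrase-table.gz input-factor=0 output-factor=0',
    'LexicalReordering name=LexicalReordering0 num-features=6 '
    'type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 '
    'path=/m/base/model/reordering-table.wbe-msd-bidirectional-fe.gz',
    'KENLM name=LM0 factor=0 path=/m/lm.blm order=3',
])
result = rewrite_moses_ini(sample, '/m/lm.blm', '/m/binarised/lm.blm',
                           '/m/binarised/phrase-table',
                           '/m/binarised/reordering-table')
print(result)
```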

In [14]:
# It only makes sense to filter the model when you know what text the system needs to translate.
def filter_model(out_dir, moses_ini, corpus):
    !{os.getenv('MOSESDECODER')}/scripts/training/filter-model-given-input.pl {out_dir} {moses_ini} {corpus}


In [15]:
def translate_corpus(moses_ini, corpus, corpus_translated):
    !{os.getenv('MOSESDECODER')}/bin/moses \
        -f {moses_ini} < {corpus} > {corpus_translated}
    
def eval_translation(corpus_gold, corpus_translated):
    result = !{os.getenv('MOSESDECODER')}/scripts/generic/multi-bleu.perl -lc {corpus_gold} < {corpus_translated}
    return result 
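
For readers unfamiliar with BLEU, the following is a minimal sketch of the corpus-level score with a single reference per sentence: clipped n-gram precisions for n = 1..4, combined by a geometric mean and multiplied by a brevity penalty. Lowercasing stands in for the `-lc` flag. It is an illustration only and not a replacement for `multi-bleu.perl`; the function names are chosen for this example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    """Case-insensitive corpus BLEU (0-100) with one reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.lower().split(), hyp.lower().split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalise hypotheses that are shorter than the reference.
    brevity_penalty = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)

refs = ['the cat sat on the mat']
print(corpus_bleu(refs, ['the cat sat on the mat']))
print(corpus_bleu(refs, ['the cat sat on a mat']))
```

Counts are accumulated over the whole corpus before the precisions are computed, which is why BLEU is reported per test set rather than averaged over sentences.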

### Start training

In [16]:
def train_tune_eval(LM,
                    LM_ORDER,
                    FROM,
                    TO,
                    MODIFIER,
                    TRAIN_IN,
                    VAL_IN,
                    VAL_OUT,
                    TEST_IN,
                    TEST_OUT):
    model_dir = working_dir.joinpath(f'{FROM}-{TO}-{MODIFIER}')
    base_model_dir = model_dir.joinpath('base')
    tuned_model_dir = model_dir.joinpath('tuned')
    binarised_model_dir = model_dir.joinpath('binarised')
    !mkdir -p {base_model_dir}
    !mkdir -p {tuned_model_dir}
    !mkdir -p {binarised_model_dir}

    base_moses_ini = base_model_dir.joinpath('model/moses.ini')
    base_phrase_table = base_model_dir.joinpath('model/phrase-table.gz')
    base_reordering_table = base_model_dir.joinpath('model/reordering-table.wbe-msd-bidirectional-fe.gz')

    tuned_moses_ini = tuned_model_dir.joinpath('moses.ini')

    binarised_moses_ini = binarised_model_dir.joinpath('moses.ini')
    binarised_phrase_table = binarised_model_dir.joinpath('phrase-table')
    binarised_reordering_table = binarised_model_dir.joinpath('reordering-table')

    # train
    train_moses(base_model_dir, TRAIN_IN, FROM, TO, LM, lm_order=LM_ORDER)

    # tune
    tune_moses(tuned_model_dir, VAL_IN, VAL_OUT, base_moses_ini)

    # binarise
    !mkdir -p {binarised_model_dir}

    lm_out = binarised_model_dir.joinpath('lm.blm')

    prepare_binarisation(tuned_moses_ini, 
                         LM,
                         lm_out, 
                         binarised_moses_ini, 
                         binarised_phrase_table, 
                         binarised_reordering_table)
    binarise_phrase_table(base_phrase_table, binarised_phrase_table)
    binarise_reordering_table(base_reordering_table, binarised_reordering_table)

    # translate; TEST_OUT is not used here, the translation is scored against it in a separate cell below
    translated = binarised_model_dir.joinpath(f'translated.{FROM}')

    translate_corpus(binarised_moses_ini, TEST_IN, translated)

is-en

In [None]:
train_tune_eval(LM = c.read(p, EN, PARICE, TRAIN, LM),
                LM_ORDER = 3,
                FROM = 'is',
                TO = 'en',
                MODIFIER = 'improved',
                TRAIN_IN = c.read(p, IS, PARICE, TRAIN, LENGTH).with_suffix(''),
                VAL_IN = c.read(p, IS, PARICE, VAL, FINAL),
                VAL_OUT = c.read(p, EN, PARICE, VAL, FINAL),
                TEST_IN = c.read(p, IS, PARICE, TEST, FINAL),
                TEST_OUT = c.read(p, EN, PARICE, TEST, FINAL))

tail -f /work/is-en-improved/base/training.out


en-is

In [None]:
train_tune_eval(LM = c.read(p, IS, RMH, TRAIN, CAT, LM),
                LM_ORDER = 3,
                FROM = 'en',
                TO = 'is',
                MODIFIER = 'improved',
                TRAIN_IN = c.read(p, IS, PARICE, TRAIN, LENGTH).with_suffix(''),
                VAL_IN = c.read(p, EN, PARICE, VAL, FINAL),
                VAL_OUT = c.read(p, IS, PARICE, VAL, FINAL),
                TEST_IN = c.read(p, EN, PARICE, TEST, FINAL),
                TEST_OUT = c.read(p, IS, PARICE, TEST, FINAL))

In [10]:
TEST_OUT = c.read(p, EN, PARICE, TEST, FINAL)
FROM = 'is'
TO = 'en'
MODIFIER = 'improved'
model_dir = working_dir.joinpath(f'{FROM}-{TO}-{MODIFIER}')
binarised_model_dir = model_dir.joinpath('binarised')
translated = binarised_model_dir.joinpath(f'translated.{FROM}')
print(eval_translation(TEST_OUT, translated))
print(*c.corpora_peek((TEST_OUT, translated)))

en: • 6 km for category 2 motorcycle ( engine capacity ≥ 150 cc , vmax @lt@ 130 km / h ) ,



For a fair comparison we need to map the translated BPE text back to plain text and compare it with `test/final.en`.

In [38]:
def sent_detokenize_bpe(sentence):
    pieces = sentence.split(" ")
    return ''.join(pieces).replace('▁', ' ').strip()

def corpus_detokenize_bpe(path, out_path):
    with path.open() as f_in, out_path.open('w+') as f_out:
        for line in f_in:
            f_out.write(sent_detokenize_bpe(line)+'\n')
    return True

In [39]:
translated_detokenized = c.corpus_create_path(translated, 'translated_detokenized')
corpus_detokenize_bpe(translated, translated_detokenized)

True
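
A small worked example of the detokenisation, repeating `sent_detokenize_bpe` so that the snippet runs on its own. The example sentence is made up, and it assumes subword pieces in which `▁` marks the start of a word, as the function above implies.

```python
def sent_detokenize_bpe(sentence):
    # Join all pieces, then turn the word-start marker back into a space.
    pieces = sentence.split(" ")
    return ''.join(pieces).replace('▁', ' ').strip()

# "set" + "ning" are two pieces of one word; the trailing newline is stripped.
example = "▁þetta ▁er ▁set ning ▁.\n"
print(sent_detokenize_bpe(example))
```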

In [40]:
print(eval_translation(TEST_OUT, translated_detokenized))
print(*c.corpora_peek((TEST_OUT, translated_detokenized)))

is: • 6 km fyrir bifhjól í flokki 2 ( slagrými hreyfils ≥ 150 cc , vmax @lt@ 130 km / klukkustund ) ,
 en: • 6 km fyrir bifhjól í flokki 2 ( slagrými hreyfils ≥ 150 cc , vmax @lt@ 130 km / klukkustund ) .
 is: enska aðgerð varðandi sameiginlega fræðslu e - liður 1. málsgrein 5. grein .
 en: enska aðgerð varðandi sameiginlega fræðslu e - lið 1. málsgrein 5. grein
 is: mælingar á reykþéttni útblásturslofts við hröðun ( frá hægagangi og upp í marksnúningshraða , án álags ) .
 en: mælingar á reykþéttni útblásturslofts við hröðun ( frá hægagangi án álags upp .
 is: aðrir tengivagnar
 en: aðrir eftirvagnar og festivagnar
 is: ég er með matareitrun .
 en: ég er með matareitrun .
 is: perlur á stærð við kókoshnetur .
 en: perlur á stærð við kókoshnetur .
 is: þetta markmið skal einkum mæla í fjölgun aðildarríkja sem fella inn samræmdar nálganir við gerð viðbúnaðaráætlana sinna .
 en: þetta markmið skal einkum mæla í fjölgun aðildarríkja samþætta samræmdar nálganir á viðbúnaðaráætlana þeirra .


### Demo
Translate some text.

In [44]:
def translate_en_is(moses_ini, sentence):
    sentence = c.sent_process_v1(sentence, c.Lang.EN)
    !echo "{sentence}" | {os.getenv('MOSESDECODER')}/bin/moses -f {moses_ini}

In [45]:
sentence = "This is a proper English sentence, and we can have learnt a better phrase model"
print(translate_en_is(binarised_model_dir.joinpath('moses.ini'), sentence))

Defined parameters (per moses.ini or switch):
	config: /work/en-is-rmh/binarised/moses.ini 
	distortion-limit: 6 
	feature: UnknownWordPenalty WordPenalty PhrasePenalty PhraseDictionaryCompact name=TranslationModel0 num-features=4 path=/work/en-is-rmh/binarised/phrase-table input-factor=0 output-factor=0 LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/work/en-is-rmh/binarised/reordering-table Distortion KENLM name=LM0 factor=0 path=/work/en-is-rmh/binarised/lm.blm order=3 
	input-factors: 0 
	mapping: 0 T 0 
	threads: 10 
	weight: LexicalReordering0= 0.0169919 0.229923 0.142056 0.00709025 -0.0357982 0.136943 Distortion0= -0.0411258 LM0= 0.0329771 WordPenalty0= -0.117491 PhrasePenalty0= -0.0976637 TranslationModel0= 0.0113354 0.00836338 0.1168 0.0054414 UnknownWordPenalty0= 1 
line=UnknownWordPenalty
FeatureFunction: UnknownWordPenalty0 start: 0 end: 0
line=WordPenalty
FeatureFunction: WordPenalty0 start: 

In [37]:
def translate_is_en(moses_ini, sentence):
    sentence = c.sent_process_v1(sentence, c.Lang.IS)
    !echo "{sentence}" | {os.getenv('MOSESDECODER')}/bin/moses -f {moses_ini}

In [38]:
sentence = "Ég man ekki eftir neinum góðum myndum nýlega "
print(translate_is_en(working_dir.joinpath('is-en/binarised').joinpath('moses.ini'), sentence))

Defined parameters (per moses.ini or switch):
	config: /work/is-en/binarised/moses.ini 
	distortion-limit: 6 
	feature: UnknownWordPenalty WordPenalty PhrasePenalty PhraseDictionaryCompact name=TranslationModel0 num-features=4 path=/work/is-en/binarised/phrase-table input-factor=0 output-factor=0 LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/work/is-en/binarised/reordering-table Distortion KENLM name=LM0 factor=0 path=/work/is-en/binarised/lm-en.blm order=3 
	input-factors: 0 
	mapping: 0 T 0 
	threads: 14 
	weight: LexicalReordering0= 0.114192 0.0158818 0.0202684 0.083186 0.0208785 0.197803 Distortion0= 0.0160226 LM0= 0.0632488 WordPenalty0= -0.204654 PhrasePenalty0= -0.0417258 TranslationModel0= 0.0177732 0.00823355 0.188931 0.00720186 UnknownWordPenalty0= 1 
line=UnknownWordPenalty
FeatureFunction: UnknownWordPenalty0 start: 0 end: 0
line=WordPenalty
FeatureFunction: WordPenalty0 start: 1 end: 1
line