# Sequence-to-sequence RNN for machine translation

This notebook shows how to implement a recurrent neural network for machine translation 
with the help of Knet and NNHelferlein.
The net uses a Tatoeba corpus to train a one-layer GRU network. 
The resulting network demonstrates the abilities of such an architecture. However, the 
training corpus is much too small for a professional
translator, and the network should have more layers and more units per layer.

In [1]:
using Random, StatsBase
using Knet, AutoGrad
using NNHelferlein

### The seq-2-seq-model

The sequence-to-sequence model is simple. We need
+ the type
+ a constructor
+ signatures for training (with 2 sequences as arguments) and for prediction (with only the 
  source sequence as argument).

#### Type and constructor:

In [2]:
mutable struct S2S
    embed_enc       # embed layer for encoder
    embed_dec       # embed layer for decoder
    encoder         # encoder rnn
    decoder         # decoder rnn
    predict         # predict layer (Linear w/o actf)
    drop            # dropout layer
    voc_in; voc_out # vocab sizes
    embed           # embedding depth
    units           # number of GRU units per layer

    function S2S(n_embed, n_units, n_vocab_in, n_vocab_out)
        embed_enc = Embed(n_vocab_in, n_embed)
        drop = Dropout(0.1)
        embed_dec = Embed(n_vocab_out, n_embed)
        encoder = Recurrent(n_embed, n_units, u_type=:gru)
        decoder = Recurrent(n_embed, n_units, u_type=:gru)
        predict = Linear(n_units, n_vocab_out)

        return new(embed_enc, embed_dec, encoder, decoder,
            predict, drop,
            n_vocab_in, n_vocab_out, n_embed, n_units)
    end
end


#### Training signature

includes the following steps:
+ run the source sequence through an RNN layer
+ transfer hidden states from encoder to decoder
+ start the decoder with the embedded target sequence (and return all states from all steps)
+ calculate and return loss.

In [3]:
function (s2s::S2S)(i, o)

    seqlen_i = size(i)[1]
    seqlen_o = size(o)[1]
    i = reshape(i, seqlen_i, :)
    o = reshape(o, seqlen_o, :)
    
    x = s2s.embed_enc(i)    # no <start>/<end> tags
    x = s2s.drop(x)
    h = s2s.encoder(x, h=0)
 
    y = s2s.embed_dec(o[1:end-1,:])
    h_dec = s2s.decoder(y, h=h, return_all=true)
    p = s2s.predict(h_dec)
    loss = nll(p, o[2:end,:])
    
    return loss
end
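
The shift between decoder input and target in the training signature can be illustrated without any model. The following is a minimal sketch in plain Julia; the token ids are taken from the `en_vocab` printed further below ("I love Julia" and "Peter loves Python"), and the variable names `dec_input` and `dec_target` are only illustrative:

```julia
# Toy target minibatch, one column per sequence (1 = <start>, 2 = <end>).
o = Int32[1  1;
          6  8;
          7 13;
          5 10;
          2  2]

dec_input  = o[1:end-1, :]   # what the decoder reads:      <start> w1 w2 w3
dec_target = o[2:end, :]     # what it is expected to emit: w1 w2 w3 <end>

# At every step t the decoder reads token t and is scored on token t+1.
for t in 1:size(dec_input, 1)
    println("step $t: input = ", dec_input[t, :], "  target = ", dec_target[t, :])
end
```

Both matrices have one row less than `o`, so the predictions for all steps line up with the targets passed to `nll`.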


#### Predict signature

is very similar to the training signature, except for the decoder part,
which now generates one step of the output sequence per iteration 
until the `<end>`-token is detected:


In [4]:
function (s2s::S2S)(i)

    seqlen_i = size(i)[1]
    i = reshape(i, seqlen_i, :)
    
    mb = size(i)[end]
    
    x = s2s.embed_enc(i)
    h = s2s.encoder(x, h=0)
    set_hidden_states!(s2s.decoder, h)

    output = blowup_array([TOKEN_START], mb)
    outstep = blowup_array([TOKEN_START], mb)

    MAX_LEN = 16
    step = 0
    while !all(outstep .== TOKEN_END) && step < MAX_LEN
        step += 1
        dec_in = s2s.embed_dec(outstep)
        h = s2s.decoder(dec_in, h=nothing)
        
        y = softmax(s2s.predict(h), dims=1)
        outstep = de_embed(y)
        output = vcat(output, outstep)
    end

    return output
end
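
The control flow of the prediction loop (feed the last output back in, stop at `<end>` or after a maximum number of steps) can be sketched in plain Julia. This is only an illustration under simplifying assumptions: the hypothetical `scores` table stands in for decoder and predict layer, there is no hidden state, and only a single sequence is decoded instead of a minibatch:

```julia
# Hypothetical stand-in for decoder + predict layer:
# scores[next, current] is the score of token `next` following token `current`.
scores = zeros(Float32, 7, 7)
scores[6, 1] = 1   # <start> -> I
scores[7, 6] = 1   # I       -> love
scores[5, 7] = 1   # love    -> Julia
scores[2, 5] = 1   # Julia   -> <end>

function greedy_decode(scores; start_tok=1, end_tok=2, max_len=16)
    output = [start_tok]
    current = start_tok
    steps = 0
    while current != end_tok && steps < max_len
        steps += 1
        current = argmax(scores[:, current])   # most likely next token
        push!(output, current)
    end
    return output
end

decoded = greedy_decode(scores)   # [1, 6, 7, 5, 2]
```

The `max_len` guard matters because an untrained model may never emit `<end>`.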


### Example data
Just to test the signatures, we will translate 4 (most?) important sentences from 
German to English:

In [5]:
de = AbstractString[]
push!(de, "Ich programmiere immer in Julia")
push!(de, "Peter liebt Python")
push!(de, "Wir alle lieben Julia")
push!(de, "Ich liebe Julia")

en = AbstractString[]
push!(en, "I always code Julia")
push!(en, "Peter loves Python")
push!(en, "We all love Julia")
push!(en, "I love Julia");

In [6]:
@show en
@show de;

en = AbstractString["I always code Julia", "Peter loves Python", "We all love Julia", "I love Julia"]
de = AbstractString["Ich programmiere immer in Julia", "Peter liebt Python", "Wir alle lieben Julia", "Ich liebe Julia"]


The minibatch is a tuple of 2 matrices x and y with one column per sequence.    
`prepare_corpus()` does some cleaning and calls the *NNHelferlein* function
`sequence_minibatch()`, which returns an iterator over the (x,y)-tuples. The vocabularies 
for source and target language are returned together with the iterator.

The argument combination `partial=true, x_padding=false` prevents x-sequences from being padded
and constructs smaller minibatches instead if necessary. Note that the cell below sets
`x_padding=true`, so shorter x-sequences are padded here; the trailing `2` (`<end>`) entries
in the x-matrices of the output show this.
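
What padding to one column per sequence means can be sketched in plain Julia. The helper `pad_to_matrix` is hypothetical and not part of *NNHelferlein*; the token ids are the ones that `de_vocab` assigns in the output of cell 8 ("Wir alle lieben Julia" and "Ich programmiere immer in Julia"):

```julia
# Hypothetical helper: pad token sequences to a common length
# and stack them as the columns of one matrix.
function pad_to_matrix(seqs, pad)
    maxlen = maximum(length.(seqs))
    m = fill(pad, maxlen, length(seqs))
    for (j, s) in enumerate(seqs)
        m[1:length(s), j] = s     # copy the sequence, the rest stays padded
    end
    return m
end

# pad with 2 (<end>), as in prepare_corpus(); x is a 5×2 matrix:
# [15 6; 11 12; 16 7; 5 14; 2 5]
x = pad_to_matrix([[15, 11, 16, 5], [6, 12, 7, 14, 5]], 2)
```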

In [7]:
function prepare_corpus(source, target; batchsize=128, 
                        vocab_size=nothing)
    source = clean_sentence.(source)
    target = clean_sentence.(target)
    
    src_vocab = WordTokenizer(source, len=vocab_size)
    trg_vocab = WordTokenizer(target, len=vocab_size)
    
    src = src_vocab(source, add_ctls=false)
    trg = trg_vocab(target, add_ctls=true)

    src = truncate_sequence.(src, 10, end_token=nothing)
    trg = truncate_sequence.(trg, 10, end_token=TOKEN_END)
    
    return sequence_minibatch(src, trg, batchsize, shuffle=true, seq2seq=true, 
                              pad=TOKEN_END, partial=true, x_padding=true), 
           src_vocab, trg_vocab
end 

prepare_corpus (generic function with 1 method)

In [8]:
dfun, de_vocab, en_vocab = prepare_corpus(de, en, batchsize=2)

(SequenceData(Any[(Int32[10 6; 9 8; 13 5], Int32[1 1; 8 6; … ; 10 5; 2 2]), (Int32[15 6; 11 12; … ; 5 14; 2 5], Int32[1 1; 9 6; … ; 5 5; 2 2])], 2, [1, 2], true), WordTokenizer(16, Dict{String, Int32}("immer" => 7, "liebe" => 8, "liebt" => 9, "<start>" => 1, "Peter" => 10, "alle" => 11, "programmiere" => 12, "Julia" => 5, "Python" => 13, "in" => 14…), ["<start>", "<end>", "<pad>", "<unknown>", "Julia", "Ich", "immer", "liebe", "liebt", "Peter", "alle", "programmiere", "Python", "in", "Wir", "lieben"]), WordTokenizer(14, Dict{String, Int32}("We" => 9, "code" => 11, "<start>" => 1, "Peter" => 8, "Julia" => 5, "love" => 7, "Python" => 10, "<unknown>" => 4, "<pad>" => 3, "loves" => 13…), ["<start>", "<end>", "<pad>", "<unknown>", "Julia", "I", "love", "Peter", "We", "Python", "code", "always", "loves", "all"]))

### Train:

For this simple toy problem, a tiny RNN may be sufficient:

In [9]:
N_EMBED = 6
N_UNITS = 16
s2s = S2S(N_EMBED, N_UNITS, length(de_vocab), length(en_vocab))

S2S(Embed(P(Knet.KnetArrays.KnetMatrix{Float32}(6,16)), identity), Embed(P(Knet.KnetArrays.KnetMatrix{Float32}(6,14)), identity), Recurrent(6, 16, :gru, GRU(input=6,hidden=16), true), Recurrent(6, 16, :gru, GRU(input=6,hidden=16), true), Linear(P(Knet.KnetArrays.KnetMatrix{Float32}(14,16)), P(Knet.KnetArrays.KnetVector{Float32}(14)), identity), Dropout(0.1), 16, 14, 6, 16)

In [10]:
tb_train!(s2s, Adam, dfun, split=nothing, epochs=200, tb_name="de-en-gru",
    acc_fun=hamming_acc,
    mb_loss_freq=100, checkpoints=nothing, eval_freq=10)

Training 200 epochs with 2 minibatches/epoch.
Evaluation is performed every 1 minibatches with 1 mbs.
Watch the progress with TensorBoard at:
/data/aNN/Helferlein/logs/de-en-gru/2022-02-17T18-14-04


Progress: 100%|█████████████████████████████████████████| Time: 0:00:16


Training finished with:
Training loss:       0.1967436671257019
Training accuracy:   1.0


S2S(Embed(P(Knet.KnetArrays.KnetMatrix{Float32}(6,16)), identity), Embed(P(Knet.KnetArrays.KnetMatrix{Float32}(6,14)), identity), Recurrent(6, 16, :gru, GRU(input=6,hidden=16), true), Recurrent(6, 16, :gru, GRU(input=6,hidden=16), true), Linear(P(Knet.KnetArrays.KnetMatrix{Float32}(14,16)), P(Knet.KnetArrays.KnetVector{Float32}(14)), identity), Dropout(0.1), 16, 14, 6, 16)

We train for a few seconds and define one last function that helps to translate directly and test the RNN.   
The function does the following:
+ transform a sentence in the source language into a list of word tokens, using
  the source vocab
+ run the sequence through the RNN
+ use the target vocab to transform the sequence of tokens back into a sentence
  in the target language:

In [11]:
function translate(inp::T; mdl=s2s, sv=de_vocab, tv=en_vocab) where {T <: AbstractString}
    
    in_seq = sv(inp, split_words=true, add_ctls=false)
    in_seq = reshape(in_seq, (:,1))
    out_seq = mdl(in_seq)
    return tv(out_seq)
end
    

translate (generic function with 1 method)
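
The two vocabulary steps of `translate()` amount to a lookup from words to ids and back. A minimal sketch in plain Julia, assuming a hand-written mini-vocabulary whose ids follow the `en_vocab` printed above; `encode` and `decode` are hypothetical helpers, not the `WordTokenizer` API:

```julia
# Hypothetical mini-vocabulary: the position in the list is the token id.
words = ["<start>", "<end>", "<pad>", "<unknown>", "Julia", "I", "love"]
word2id = Dict(w => i for (i, w) in enumerate(words))

# sentence -> ids; words missing from the vocabulary map to <unknown>
encode(sentence) = [get(word2id, String(w), word2id["<unknown>"]) for w in split(sentence)]

# ids -> sentence
decode(ids) = join(words[ids], " ")

ids  = encode("I love Julia")       # [6, 7, 5]
sent = decode([1, 6, 7, 5, 2])      # "<start> I love Julia <end>"
```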

In [12]:
translate("Ich liebe Julia")

"<start> I love Julia <end>"

In [13]:
translate("Ich programmiere immer in Julia")

"<start> I always code Julia <end>"

In [14]:
translate("Peter liebt Python")

"<start> Peter loves Python <end>"

In [15]:
translate("Wir alle lieben Julia")

"<start> I love Julia <end>"

### More realistic data from Tatoeba:

It is not at all surprising that our RNN is able to memorise 4 sentences; the example 
is just a check for the s2s network and the tools.

As *NNHelferlein* provides direct access to Tatoeba data, we can train an RNN on a larger
dataset. The Tatoeba German-English corpus includes about 250000 sentences and can be 
easily accessed as follows:

In [16]:
en, de = get_tatoeba_corpus("deu")
en = en[1000:end]; de = de[1000:end]
dtato, de_vocab, en_vocab = prepare_corpus(de, en, batchsize=128)

dir = normpath(joinpath(dirname(pathof(#= /root/.julia/packages/NNHelferlein/GEtSz/src/texts.jl:314 =# @__MODULE__())), "..", "data", "Tatoeba")) = "/root/.julia/packages/NNHelferlein/GEtSz/data/Tatoeba"
pathname = joinpath(dir, fname) = "/root/.julia/packages/NNHelferlein/GEtSz/data/Tatoeba/deu-eng.zip"
Corpus for language deu is already downloaded.
Reading Tatoeba corpus for languages en-deu

importing sentences: 1000
importing sentences: 2000
importing sentences: 3000
importing sentences: 4000
importing sentences: 5000
importing sentences: 6000
importing sentences: 7000
importing sentences: 8000
importing sentences: 9000
importing sentences: 10000
importing sentences: 11000
importing sentences: 12000
importing sentences: 13000
importing sentences: 14000
importing sentences: 15000
importing sentences: 16000
importing sentences: 17000
importing sentences: 18000
importing sentences: 19000
importing sentences: 20000
importing sentences: 21000
importing sentences: 2

(SequenceData(Any[(Int32[3611 14859 … 5 5; 2 2 … 4029 4310], Int32[1 1 … 1 1; 1933 1933 … 6 6; … ; 2 2 … 2 2; 2 2 … 2 2]), (Int32[5 5224 … 2269 487; 475 5 … 4790 3624], Int32[1 1 … 1 1; 6 3759 … 246 246; … ; 2 2 … 2 2; 2 2 … 2 2]), (Int32[2269 2269 … 15159 788; 246 246 … 7 1387], Int32[1 1 … 1 1; 246 246 … 90 90; … ; 2 2 … 5883 1041; 2 2 … 2 2]), (Int32[7702 1822 … 5070 7131; 7 3259 … 29 5], Int32[1 1 … 1 1; 90 90 … 7793 4531; … ; 2 2 … 2 2; 2 2 … 2 2]), (Int32[818 34 … 818 5988; 750 950 … 4046 112], Int32[1 1 … 1 1; 4266 29 … 7970 4412; … ; 2 2 … 2 2; 2 2 … 2 2]), (Int32[5020 4283 … 302 302; 112 29 … 3927 727], Int32[1 1 … 1 1; 4412 165 … 426 426; … ; 2 18 … 2 2; 2 2 … 2 2]), (Int32[1954 2459 … 5 5; 14886 26 … 1618 2319], Int32[1 1 … 1 1; 16382 5777 … 6 6; … ; 2 2 … 2 2; 2 2 … 2 2]), (Int32[5 2135 … 5 5; 4433 1430 … 480 8352], Int32[1 1 … 1 1; 6 1157 … 6 6; … ; 2 2 … 2 2; 2 2 … 2 2]), (Int32[5 5 … 757 10; 4800 7479 … 2257 2810], Int32[1 1 … 1 1; 6 6 … 383 55; … ; 2 2 … 2 2; 2 2 … 2 2]

For the more realistic training data, a single layer of 512 GRU units is still used:

In [17]:
N_EMBED = 1024
N_UNITS = 512
s2s = S2S(N_EMBED, N_UNITS, length(de_vocab), length(en_vocab))

S2S(Embed(P(Knet.KnetArrays.KnetMatrix{Float32}(1024,40982)), identity), Embed(P(Knet.KnetArrays.KnetMatrix{Float32}(1024,19288)), identity), Recurrent(1024, 512, :gru, GRU(input=1024,hidden=512), true), Recurrent(1024, 512, :gru, GRU(input=1024,hidden=512), true), Linear(P(Knet.KnetArrays.KnetMatrix{Float32}(19288,512)), P(Knet.KnetArrays.KnetVector{Float32}(19288)), identity), Dropout(0.1), 40982, 19288, 1024, 512)

In [18]:
tb_train!(s2s, Adam, dtato, epochs=20, tb_name="de-en-gru",
    split=0.9, eval_freq=5, eval_size=0.2, 
    acc_fun=hamming_acc, mb_loss_freq=1000, checkpoints=nothing)

Splitting dataset for training (90%) and validation (10%).
Training 20 epochs with 1746 minibatches/epoch and 194 validation mbs.
Evaluation is performed every 350 minibatches with 39 mbs.
Watch the progress with TensorBoard at:
/data/aNN/Helferlein/logs/de-en-gru/2022-02-17T18-20-34


Progress: 100%|█████████████████████████████████████████| Time: 0:22:41


Training finished with:
Training loss:       0.15276523946516668
Training accuracy:   0.9048155519348606
Validation loss:     0.1482027831208921
Validation accuracy: 0.9093793786307477


S2S(Embed(P(Knet.KnetArrays.KnetMatrix{Float32}(1024,40982)), identity), Embed(P(Knet.KnetArrays.KnetMatrix{Float32}(1024,19288)), identity), Recurrent(1024, 512, :gru, GRU(input=1024,hidden=512), true), Recurrent(1024, 512, :gru, GRU(input=1024,hidden=512), true), Linear(P(Knet.KnetArrays.KnetMatrix{Float32}(19288,512)), P(Knet.KnetArrays.KnetVector{Float32}(19288)), identity), Dropout(0.1), 40982, 19288, 1024, 512)

After a little more than 20 minutes of training we get:

In [19]:
translate("Tom hört gewöhnlich klassische Musik")

"<start> Tom usually listens to classical music <end>"

In [20]:
translate("Tom trägt fast immer dunkle Kleidung")

"<start> Tom almost always wears dark clothes <end>"

In [21]:
translate("Wie viel Bier soll ich kaufen?")

"<start> How much beer should I buy <end>"

In [22]:
translate("Ich brauche eine Mütze voll Schlaf")

"<start> I need to get some shut-eye <end>"

In [23]:
translate("Ich muss mehr Kaffee trinken")

"<start> I need to drink more coffee <end>"

In [24]:
translate("Tom muss mehr Kaffee trinken")

"<start> Tom needs to drink more coffee <end>"