# Neural Machine Translation

**Reference:** Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." In Advances in neural information processing systems, pp. 3104-3112. 2014. ([Paper](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks), [Sample code](https://github.com/tensorflow/nmt))

In [1]:
using Knet, Test, Base.Iterators, IterTools, Random # , LinearAlgebra, StatsBase
using AutoGrad: @gcheck  # to check gradients, use with Float64
Knet.atype() = KnetArray{Float32}  # determines what Knet.param() uses.
macro size(z, s); esc(:(@assert (size($z) == $s) string(summary($z),!=,$s))); end # for debugging

@size (macro with 1 method)

## Part -1. Types from the last project

Please copy the following types and related functions from the last project: `Vocab`,
`TextReader`, `Embed`, `Linear`, `mask!`.

In [2]:
# Your code here

struct Vocab
    w2i::Dict{String,Int}
    i2w::Vector{String}
    unk::Int
    eos::Int
    tokenizer
end

# ### Vocab constructor
#
# Implement a constructor for the `Vocab` type. The constructor should take a file path as
# an argument and create a `Vocab` object with the most frequent words from that file and
# optionally unk and eos tokens. The keyword arguments are:
#
# * tokenizer: The function used to tokenize sentence strings.
# * vocabsize: Maximum number of words in the vocabulary.
# * mincount: Minimum count of words in the vocabulary.
# * unk, eos: unk and eos strings, should be part of the vocabulary unless set to nothing.
#
# You may find the following Julia functions useful: `Dict`, `eachline`, `split`, `get`,
# `delete!`, `sort!`, `keys`, `collect`, `push!`, `pushfirst!`, `findfirst`. You can take
# a look at their documentation using e.g. `@doc eachline`.


#-
function Vocab(file::String; tokenizer=split, vocabsize=Inf, mincount=1, unk="<unk>", eos="<s>")
    # Count how often each token occurs in the file.
    counts = Dict{String,Int}()
    for line in eachline(file), word in tokenizer(line)
        counts[word] = get(counts, word, 0) + 1
    end

    # Drop words that occur fewer than mincount times.
    words = [w for (w, c) in counts if c >= mincount]

    # The special tokens get the first ids unless they are disabled with `nothing`.
    w2i = Dict{String,Int}()
    ins(x) = get!(w2i, x, 1 + length(w2i))
    unkid = unk === nothing ? 0 : ins(unk)
    eosid = eos === nothing ? 0 : ins(eos)

    # If there are too many words, keep only the most frequent ones. The limit is
    # compared with the number of distinct words, not the number of lines in the file.
    if length(w2i) + length(words) > vocabsize
        sort!(words, by=(w -> counts[w]), rev=true)
        words = words[1:max(0, Int(vocabsize) - length(w2i))]
    end
    ins.(words)

    i2w = Vector{String}(undef, length(w2i))
    for (str, id) in w2i; i2w[id] = str; end
    Vocab(w2i, i2w, unkid, eosid, tokenizer)
end


Vocab

In [3]:
#Text Reader


# ## Part 2. TextReader
#
# Next we will implement `TextReader`, an iterator that reads sentences from a file and
# returns them as integer arrays using a `Vocab`.  We want to implement `TextReader` as an
# iterator for scalability. Instead of reading the whole file at once, `TextReader` will
# give us one sentence at a time as needed (similar to how `eachline` works). This will help
# us handle very large files in the future.

struct TextReader
    file::String
    vocab::Vocab
end

# ### iterate
#
# The main function to implement for a new iterator is `iterate`. The `iterate` function
# takes an iterator and optionally a state, and returns a `(nextitem,state)` pair if the
# iterator has more items or `nothing` otherwise. A one argument call `iterate(x)` starts the
# iteration, and a two argument call `iterate(x,state)` continues from where it left off.
#
# Here are some sources you may find useful on iterators:
#
# * https://github.com/denizyuret/Knet.jl/blob/master/tutorial/25.iterators.ipynb
# * https://docs.julialang.org/en/v1/manual/interfaces
# * https://docs.julialang.org/en/v1/base/collections/#lib-collections-iteration-1
# * https://docs.julialang.org/en/v1/base/iterators
# * https://docs.julialang.org/en/v1/manual/arrays/#Generator-Expressions-1
# * https://juliacollections.github.io/IterTools.jl/stable
#
# For `TextReader` the state should be an `IOStream` object obtained by `open(file)` at the
# start of the iteration. When `eof(state)` indicates that end of file is reached, the
# stream should be closed by `close(state)` and `nothing` should be returned. Otherwise
# `TextReader` reads the next line from the file using `readline`, tokenizes it, maps each
# word to its integer id using the vocabulary and returns the resulting integer array
# (without any eos tokens) and the state.



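function Base.iterate(r::TextReader, s=nothing)
    # Open the file on the first call; afterwards the open stream is the state.
    if s === nothing
        s = open(r.file)
    end
    if eof(s)
        close(s)
        return nothing
    end
    # Map each token to its id, falling back to the vocabulary's unk id.
    words = r.vocab.tokenizer(readline(s))
    ids = Int[get(r.vocab.w2i, w, r.vocab.unk) for w in words]
    return ids, s
end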

# These are some optional functions that can be defined for iterators. They are required for
# `collect` to work, which converts an iterator to a regular array.

Base.IteratorSize(::Type{TextReader}) = Base.SizeUnknown()
Base.IteratorEltype(::Type{TextReader}) = Base.HasEltype()
Base.eltype(::Type{TextReader}) = Vector{Int}


In [4]:
#Embed

struct Embed; w; end

function Embed(vocabsize::Int, embedsize::Int)
    Embed(param(embedsize,vocabsize))
end

function (l::Embed)(x)
    ## Your code here
    l.w[:,x]
end


In [5]:
#Linear


struct Linear; w; b; end

function Linear(inputsize::Int, outputsize::Int)
    ## Your code here
    Linear(param(outputsize,inputsize), param0(outputsize))
end

function (l::Linear)(x)
    ## Your code here
    l.w * mat(x,dims=1) .+ l.b
end

In [20]:
@doc param

```
param(array; atype)
param(dims...; init, atype)
param0(dims...; atype)
```

The first form returns `Param(atype(array))` where `atype=identity` is the default.

The second form Returns a randomly initialized `Param(atype(init(dims...)))`. By default, `init` is `xavier` and `atype` is `KnetArray{Float32}` if `gpu() >= 0`, `Array{Float32}` otherwise. 

The third form `param0` is an alias for `param(dims...; init=zeros)`.


In [6]:
#Mask


function mask!(a, pad)
    # Replace the trailing pad tokens of each row with 0, keeping the first pad
    # of the run so that one end-of-sentence token still counts in the loss.
    for i in 1:size(a, 1)
        j = size(a, 2)
        while j > 1 && a[i, j] == pad && a[i, j-1] == pad
            a[i, j] = 0
            j -= 1
        end
    end
    return a
end


mask! (generic function with 1 method)

## Part 0. Load data

We will use the Turkish-English pair from the [TED Talks Dataset](https://github.com/neulab/word-embeddings-for-nmt) for our experiments.

In [7]:
datadir = "datasets/tr_to_en"

if !isdir(datadir)
    download("http://www.phontron.com/data/qi18naacl-dataset.tar.gz", "qi18naacl-dataset.tar.gz")
    run(`tar xzf qi18naacl-dataset.tar.gz`)
end

if !isdefined(Main, :tr_vocab)
    tr_vocab = Vocab("$datadir/tr.train", mincount=5)
    en_vocab = Vocab("$datadir/en.train", mincount=5)
    tr_train = TextReader("$datadir/tr.train", tr_vocab)
    en_train = TextReader("$datadir/en.train", en_vocab)
    tr_dev = TextReader("$datadir/tr.dev", tr_vocab)
    en_dev = TextReader("$datadir/en.dev", en_vocab)
    tr_test = TextReader("$datadir/tr.test", tr_vocab)
    en_test = TextReader("$datadir/en.test", en_vocab)
    @info "Testing data"
    @test length(tr_vocab.i2w) == 38126
    @test length(first(tr_test)) == 16
    @test length(collect(tr_test)) == 5029
end

┌ Info: Testing data
└ @ Main In[7]:17


[32m[1mTest Passed[22m[39m

## Part 1. Minibatching

For minibatching we are going to design a new iterator: `MTData`. This iterator is built
on top of two TextReaders `src` and `tgt` that produce parallel sentences for source and
target languages.

In [8]:
struct MTData
    src::TextReader        # reader for source language data
    tgt::TextReader        # reader for target language data
    batchsize::Int         # desired batch size
    maxlength::Int         # skip if source sentence above maxlength
    batchmajor::Bool       # batch dims (B,T) if batchmajor=false (default) or (T,B) if true.
    bucketwidth::Int       # batch sentences with length within bucketwidth of each other
    buckets::Vector        # sentences collected in separate arrays called buckets for each length range
    batchmaker::Function   # function that turns a bucket into a batch.
end

function MTData(src::TextReader, tgt::TextReader; batchmaker = arraybatch, batchsize = 128, maxlength = typemax(Int),
                batchmajor = false, bucketwidth = 10, numbuckets = min(128, maxlength ÷ bucketwidth))
    buckets = [ [] for i in 1:numbuckets ] # buckets[i] is an array of sentence pairs with similar length
    MTData(src, tgt, batchsize, maxlength, batchmajor, bucketwidth, buckets, batchmaker)
end

Base.IteratorSize(::Type{MTData}) = Base.SizeUnknown()
Base.IteratorEltype(::Type{MTData}) = Base.HasEltype()
Base.eltype(::Type{MTData}) = NTuple{2}

### iterate(::MTData)

Define the `iterate` function for the `MTData` iterator. `iterate` should return a
`(batch, state)` pair or `nothing` if there are no more batches.  The `batch` is a
`(x::Matrix{Int},y::Matrix{Int})` pair where `x` is a `(batchsize,srclength)` batch of
source language sentences and `y` is a `(batchsize,tgtlength)` batch of parallel target
language translations. The `state` is a pair of `(src_state,tgt_state)` which can be used
to iterate `d.src` and `d.tgt` to get more sentences.  `iterate(d)` without a second
argument should initialize `d` by emptying its buckets and calling `iterate` on the inner
iterators `d.src` and `d.tgt` without a state. Please review the documentation on
iterators from the last project.

To keep similar length sentences together, `MTData` uses arrays of similar length sentence
pairs called buckets.  Specifically, the `(src_sentence,tgt_sentence)` pairs coming from
`src` and `tgt` are pushed into `d.buckets[i]` when the length of the source sentence is
in the range `((i-1)*d.bucketwidth+1):(i*d.bucketwidth)`. When one of the buckets reaches
`d.batchsize`, `d.batchmaker` is called with the full bucket to produce a 2-D batch, the
bucket is emptied, and the batch is returned. If `src` and `tgt` are exhausted, the
remaining partially full buckets are turned into batches and returned in any order. If the
source sentence length is larger than `length(d.buckets)*d.bucketwidth`, the last bucket
is used.
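
As a quick illustration of the bucketing rule, the sketch below maps a few source lengths to bucket numbers. The helper name `bucketindex` and the values `bucketwidth=10`, `numbuckets=3` are made up for this example; only the formula follows the range given above.

```julia
# Hypothetical helper: bucket number for a source sentence of length `len`.
# Lengths beyond the last range fall into the last bucket.
bucketindex(len, bucketwidth, numbuckets) = min(1 + (len - 1) ÷ bucketwidth, numbuckets)

bucketwidth, numbuckets = 10, 3
for len in (1, 10, 11, 30, 31, 75)
    println("length ", len, " -> bucket ", bucketindex(len, bucketwidth, numbuckets))
end
```

Lengths 1 and 10 share bucket 1, length 11 starts bucket 2, and lengths 30, 31 and 75 all land in bucket 3 because it is the last one.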

Sentences above a certain length can be skipped using the `d.maxlength` field, and
transposed `x,y` arrays can be produced using the `d.batchmajor` field.

In [9]:
function Base.iterate(d::MTData, state=nothing)
    if state === nothing
        for b in d.buckets; empty!(b); end
        state_src, state_tgt = nothing, nothing
    else
        state_src, state_tgt = state
    end
    bucket = nothing

    while true
        # Advance both readers from their current states. The states are updated on every
        # pass, so the first call does not keep reopening the files and rereading line 1.
        iter_src = iterate(d.src, state_src)
        iter_tgt = iterate(d.tgt, state_tgt)
        if iter_src === nothing || iter_tgt === nothing
            # No more sentence pairs: flush the partially full buckets, one per call.
            ibucket = findfirst(x -> !isempty(x), d.buckets)
            bucket = (ibucket === nothing ? nothing : d.buckets[ibucket])
            break
        end
        sent_src, state_src = iter_src
        sent_tgt, state_tgt = iter_tgt
        if length(sent_src) > d.maxlength || length(sent_src) == 0; continue; end
        ibucket = min(1 + (length(sent_src)-1) ÷ d.bucketwidth, length(d.buckets))
        bucket = d.buckets[ibucket]
        push!(bucket, (sent_src, sent_tgt))
        if length(bucket) == d.batchsize; break; end
    end
    if bucket === nothing; return nothing; end

    batch = d.batchmaker(d, bucket)
    empty!(bucket)
    return batch, (state_src, state_tgt)
end

### arraybatch

Define `arraybatch(d, bucket)` to be used as the default `d.batchmaker`. `arraybatch`
takes an `MTData` object and an array of sentence pairs `bucket` and returns a
`(x::Matrix{Int},y::Matrix{Int})` pair where `x` is a `(batchsize,srclength)` batch of
source language sentences and `y` is a `(batchsize,tgtlength)` batch of parallel target
language translations. Note that the sentences in the bucket do not have any `eos` tokens
and they may have different lengths. `arraybatch` should copy the source sentences into
`x` padding shorter ones on the left with `eos` tokens. It should copy the target
sentences into `y` with an `eos` token in the beginning and end of each sentence and
shorter sentences padded on the right with extra `eos` tokens.
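
The padding layout can be checked on a tiny bucket. The following is a standalone sketch rather than the `arraybatch` used by `MTData`: the function name `padbatch` and the token ids are invented for the example, and `eos` is passed in directly instead of being read from a `Vocab`.

```julia
# Hypothetical standalone version of the padding rule; `eos` is an arbitrary token id.
function padbatch(pairs, eos)
    srcs, tgts = first.(pairs), last.(pairs)
    x = fill(eos, length(pairs), maximum(length.(srcs)))
    y = fill(eos, length(pairs), maximum(length.(tgts)) + 2)
    for (i, (s, t)) in enumerate(pairs)
        x[i, end-length(s)+1:end] = s   # source: right-aligned, padding on the left
        y[i, 2:length(t)+1] = t         # target: eos, words, eos, then padding
    end
    return x, y
end

x, y = padbatch([([5, 6, 7], [11, 12]), ([8], [13, 14, 15])], 2)
# x == [5 6 7; 2 2 8]
# y == [2 11 12 2 2; 2 13 14 15 2]
```

The second source sentence is padded on the left with two `eos` tokens, while the first target sentence gets one extra `eos` on the right after its closing `eos`.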

In [10]:
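function arraybatch(d::MTData, bucket)
    bucketx = map(x->x[1], bucket)
    buckety = map(x->x[2], bucket)
    # Source: fill with eos, then copy each sentence right-aligned (padding on the left).
    batch_x = fill(d.src.vocab.eos, length(bucketx), maximum(length.(bucketx)))
    for i in 1:length(bucket)
        batch_x[i, end-length(bucketx[i])+1:end] = bucketx[i]
    end
    # Target: eos at the start, then the sentence, then eos up to the end of the row.
    batch_y = fill(d.tgt.vocab.eos, length(buckety), maximum(length.(buckety)) + 2)
    for i in 1:length(bucket)
        batch_y[i, 2:length(buckety[i])+1] = buckety[i]
    end
    # Return (T,B) arrays when d.batchmajor is set, following the comment on the MTData field.
    return d.batchmajor ? (permutedims(batch_x), permutedims(batch_y)) : (batch_x, batch_y)
end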

arraybatch (generic function with 1 method)

In [14]:
@info "Testing MTData"
dtrn = MTData(tr_train, en_train)
ddev = MTData(tr_dev, en_dev)

dtst = MTData(tr_test, en_test)


┌ Info: Testing MTData
└ @ Main In[14]:1



In [12]:
x,y = first(dtst)
@test length(collect(dtst)) == 48
@test size.((x,y)) == ((128,10),(128,24))
@test x[1,1] == tr_vocab.eos
@test x[1,end] != tr_vocab.eos
@test y[1,1] == en_vocab.eos
@test y[1,2] != en_vocab.eos
@test y[1,end] == en_vocab.eos

## Part 2. Sequence to sequence model without attention

In this part we will define a simple sequence to sequence encoder-decoder model for
machine translation.

In [15]:
struct S2S_v1
    srcembed::Embed     # source language embedding
    encoder::RNN        # encoder RNN (can be bidirectional)
    tgtembed::Embed     # target language embedding
    decoder::RNN        # decoder RNN
    projection::Linear  # converts decoder output to vocab scores
    dropout::Real       # dropout probability to prevent overfitting
    srcvocab::Vocab     # source language vocabulary
    tgtvocab::Vocab     # target language vocabulary
end

### S2S_v1 constructor

Define the S2S_v1 constructor using your predefined layer types (Embed, Linear), and the
Knet RNN type. Please review the RNN documentation using `@doc RNN`, paying attention to
the following options in particular: `numLayers`, `bidirectional`, `dropout`, `dataType`,
`usegpu`. The last two are important if you experiment with array types other than the
default `KnetArray{Float32}`: make sure the RNNs use the same array type as the other
layers. Note that if the encoder is bidirectional, its `numLayers` should be half of the
decoder so that their hidden states match in size.
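
The size argument can be made concrete with the `(H,B,L/2L)` state shape from the RNN documentation. The helper `statesize` below is hypothetical and only does the arithmetic; it does not construct any Knet objects.

```julia
# Hypothetical helper: shape of the h/c state arrays, (H,B,L) or (H,B,2L) if bidirectional.
statesize(hidden, batch, layers; bidirectional=false) =
    (hidden, batch, bidirectional ? 2 * layers : layers)

enc = statesize(512, 128, 1; bidirectional=true)   # bidirectional encoder with 1 layer
dec = statesize(512, 128, 2)                       # unidirectional decoder with 2 layers
@assert enc == dec == (512, 128, 2)
```

With these layer counts the encoder states can be assigned to the decoder without any reshaping.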

In [21]:
function S2S_v1(hidden::Int,         # hidden size for both the encoder and decoder RNN
                srcembsz::Int,       # embedding size for source language
                tgtembsz::Int,       # embedding size for target language
                srcvocab::Vocab,     # vocabulary for source language
                tgtvocab::Vocab;     # vocabulary for target language
                layers=1,            # number of layers
                bidirectional=false, # whether encoder RNN is bidirectional
                dropout=0)           # dropout probability
    # The RNN input sizes are the embedding sizes. `usegpu` is left at its default so that
    # the RNN weights use the same array type as the Embed and Linear parameters.
    srcembed = Embed(length(srcvocab.i2w), srcembsz)
    encoder = RNN(srcembsz, hidden; numLayers=layers, bidirectional=bidirectional, dropout=dropout)

    tgtembed = Embed(length(tgtvocab.i2w), tgtembsz)
    # A bidirectional encoder has 2*layers states, so the decoder gets twice as many layers.
    declayers = bidirectional ? 2*layers : layers
    decoder = RNN(tgtembsz, hidden; numLayers=declayers, dropout=dropout)

    # The decoder output has `hidden` rows; project it to target vocabulary scores.
    projection = Linear(hidden, length(tgtvocab.i2w))

    S2S_v1(srcembed,encoder,tgtembed,decoder,projection,dropout,srcvocab,tgtvocab)
end

S2S_v1

### S2S_v1 loss function

Define the S2S_v1 loss function that takes `src`, a source language minibatch, and `tgt`,
a target language minibatch and returns either a `(total_loss, num_words)` pair if
`average=false`, or `(total_loss/num_words)` average if `average=true`.

Assume that `src` and `tgt` are integer arrays of size `(B,Tx)` and `(B,Ty)` respectively,
where `B` is the batch size, `Tx` is the length of the longest source sequence, `Ty` is
the length of the longest target sequence. The `src` sequences only contain words, the
`tgt` sequences surround the words with `eos` tokens at the start and end. This allows
columns `tgt[:,1:end-1]` to be used as the decoder input and `tgt[:,2:end]` as the desired
decoder output.

Assume any shorter sentences in the batches have been padded with extra `eos` tokens on
the left for `src` and on the right for `tgt`. Don't worry about masking `src` for the
encoder; it doesn't have a significant effect on the loss. However, do mask `tgt` before
`nll`: you do not want the padding tokens to be counted in the loss calculation.
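
The shift between decoder input and desired output, and the effect of masking, can be seen on a small target batch. This is a standalone sketch with made-up token ids (`eos = 2`); the masking loop is written inline for illustration and is not the `mask!` from the previous project.

```julia
eos = 2
tgt = [2 11 12 2 2;      # "11 12" followed by one padding eos
       2 13 14 15 2]     # "13 14 15"

input  = tgt[:, 1:end-1]   # what the decoder reads
output = tgt[:, 2:end]     # what the decoder should predict (a copy, tgt is untouched)

# Zero out padding: every eos in a trailing run except the first one.
for i in 1:size(output, 1)
    for j in size(output, 2):-1:2
        (output[i, j] == eos && output[i, j-1] == eos) || break
        output[i, j] = 0
    end
end

nwords = count(!iszero, output)   # positions that would contribute to the loss
```

Here `output` ends up as `[11 12 2 0; 13 14 15 2]`, so 7 of the 8 positions count: each sentence keeps its closing `eos`, and only the extra padding is ignored.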

Please review `@doc RNN`: in particular the `r.c` and `r.h` fields can be used to get/set
the cell and hidden arrays of an RNN (note that `0` and `nothing` act as special values).

RNNs take a dropout value at construction and apply dropout to the input of every layer if
it is non-zero. You need to handle dropout for other layers in the loss function or in
layer definitions as necessary.

In [73]:
@doc RNN

```
rnn = RNN(inputSize, hiddenSize; opts...)
rnn(x; batchSizes) => y
rnn.h, rnn.c  # hidden and cell states
```

`RNN` returns a callable RNN object `rnn`. Given a minibatch of sequences `x`, `rnn(x)` returns `y`, the hidden states of the final layer for each time step. `rnn.h` and `rnn.c` fields can be used to set the initial hidden states and read the final hidden states of all layers.  Note that the final time step of `y` always contains the final hidden state of the last layer, equivalent to `rnn.h` for a single layer network.

**Dimensions:** The input `x` can be 1, 2, or 3 dimensional and `y` will have the same number of dimensions as `x`. size(x)=(X,[B,T]) and size(y)=(H/2H,[B,T]) where X is inputSize, B is batchSize, T is seqLength, H is hiddenSize, 2H is for bidirectional RNNs. By default a 1-D `x` represents a single instance for a single time step, a 2-D `x` represents a single minibatch for a single time step, and a 3-D `x` represents a sequence of identically sized minibatches for multiple time steps. The output `y` gives the hidden state (of the final layer for multi-layer RNNs) for each time step. The fields `rnn.h` and `rnn.c` represent the hidden states of all layers in a single time step and have size (H,B,L/2L) where L is numLayers and 2L is for bidirectional RNNs.

**batchSizes:** If `batchSizes=nothing` (default), all sequences in a minibatch are assumed to be the same length. If `batchSizes` is an array of (non-increasing) integers, it gives us the batch size for each time step (allowing different sequences in the minibatch to have different lengths). In this case `x` will typically be 2-D with the second dimension representing variable size batches for time steps. If `batchSizes` is used, `sum(batchSizes)` should equal `length(x) ÷ size(x,1)`. When the batch size is different in every time step, hidden states will have size (H,B,L/2L) where B is always the size of the first (largest) minibatch.

**Hidden states:** The hidden and cell states are kept in `rnn.h` and `rnn.c` fields (the cell state is only used by LSTM). They can be initialized during construction using the `h` and `c` keyword arguments, or modified later by direct assignment. Valid values are `nothing` (default), `0`, or an array of the right type and size possibly wrapped in a `Param`. If the value is `nothing` the initial state is assumed to be zero and the final state is discarded keeping the value `nothing`. If the value is `0` the initial state is assumed to be zero and `0` is replaced by the final state on return. If the value is a valid state, it is used as the initial state and is replaced by the final state on return.

In a differentiation context the returned final hidden states will be wrapped in `Result` types. This is necessary if the same RNN object is to be called multiple times in a single iteration. Between iterations (i.e. after diff/update) the hidden states need to be unboxed with e.g. `rnn.h = value(rnn.h)` to prevent spurious dependencies. This happens automatically during the backward pass for GPU RNNs but needs to be done manually for CPU RNNs. See the [CharLM Tutorial](https://github.com/denizyuret/Knet.jl/blob/master/tutorial/80.charlm.ipynb) for an example.

**Keyword arguments for RNN:**

  * `h=nothing`: Initial hidden state.
  * `c=nothing`: Initial cell state.
  * `rnnType=:lstm` Type of RNN: One of :relu, :tanh, :lstm, :gru.
  * `numLayers=1`: Number of RNN layers.
  * `bidirectional=false`: Create a bidirectional RNN if `true`.
  * `dropout=0`: Dropout probability. Applied to input and between layers.
  * `skipInput=false`: Do not multiply the input with a matrix if `true`.
  * `dataType=Float32`: Data type to use for weights.
  * `algo=0`: Algorithm to use, see CUDNN docs for details.
  * `seed=0`: Random number seed for dropout. Uses `time()` if 0.
  * `winit=xavier`: Weight initialization method for matrices.
  * `binit=zeros`: Weight initialization method for bias vectors.
  * `finit=ones`: Weight initialization method for the bias of forget gates.
  * `usegpu=(gpu()>=0)`: GPU used by default if one exists.

**Formulas:** RNNs compute the output h[t] for a given iteration from the recurrent input h[t-1] and the previous layer input x[t] given matrices W, R and biases bW, bR from the following equations:

`:relu` and `:tanh`: Single gate RNN with activation function f:

```
h[t] = f(W * x[t] .+ R * h[t-1] .+ bW .+ bR)
```

`:gru`: Gated recurrent unit:

```
i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
r[t] = sigm(Wr * x[t] .+ Rr * h[t-1] .+ bWr .+ bRr) # reset gate
n[t] = tanh(Wn * x[t] .+ r[t] .* (Rn * h[t-1] .+ bRn) .+ bWn) # new gate
h[t] = (1 - i[t]) .* n[t] .+ i[t] .* h[t-1]
```

`:lstm`: Long short term memory unit with no peephole connections:

```
i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
f[t] = sigm(Wf * x[t] .+ Rf * h[t-1] .+ bWf .+ bRf) # forget gate
o[t] = sigm(Wo * x[t] .+ Ro * h[t-1] .+ bWo .+ bRo) # output gate
n[t] = tanh(Wn * x[t] .+ Rn * h[t-1] .+ bWn .+ bRn) # new gate
c[t] = f[t] .* c[t-1] .+ i[t] .* n[t]               # cell output
h[t] = o[t] .* tanh(c[t])
```


In [22]:
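function (s::S2S_v1)(src, tgt; average=true)
    # Encode the source. Setting h and c to 0 starts from zero states and asks the
    # RNN to store its final states in the same fields.
    s.encoder.h, s.encoder.c = 0, 0
    s.encoder(s.srcembed(src))                    # input is (Ex,B,Tx)

    # Start the decoder from the final encoder states and feed it the target words
    # shifted right: columns 1:end-1 are the input, columns 2:end the desired output.
    s.decoder.h, s.decoder.c = s.encoder.h, s.encoder.c
    y = s.decoder(s.tgtembed(tgt[:,1:end-1]))     # (H,B,Ty-1)
    scores = s.projection(dropout(y, s.dropout))  # (V,B*(Ty-1))

    # Mask the padding in the desired output so that nll does not count it.
    nll(scores, mask!(tgt[:,2:end], s.tgtvocab.eos); average=average)
end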

In [24]:
@info "Testing S2S_v1"
#Knet.seed!(1)
model = S2S_v1(512, 512, 512, tr_vocab, en_vocab; layers=2, bidirectional=true, dropout=0.2)

┌ Info: Testing S2S_v1
└ @ Main In[24]:1


TypeError: TypeError: in Type{...} expression, expected UnionAll, got typeof(Knet.CuArray)

In [18]:
(x,y) = first(dtst)
# Your loss can be slightly different due to different ordering of words in the vocabulary.
# The reference vocabulary starts with eos, unk, followed by words in decreasing frequency.
@test model(x,y; average=false) == (14097.471f0, 1432)

┌ Info: Testing S2S_v1
└ @ Main In[18]:1


TypeError: TypeError: in Type{...} expression, expected UnionAll, got typeof(Knet.CuArray)

### Loss for a whole dataset

Define a `loss(model, data)` which returns a `(Σloss, Nloss)` pair if `average=false` and
a `Σloss/Nloss` average if `average=true` for a whole dataset. Assume that `data` is an
iterator of `(x,y)` pairs such as `MTData` and `model(x,y;average)` is a model like
`S2S_v1` that computes loss on a single `(x,y)` pair.
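
One way to read this specification: sum the per-batch totals and word counts, and divide only at the end, so that the result is a per-word average rather than an average of batch averages. The sketch below uses a hypothetical `toymodel` in place of `S2S_v1` and the name `datasetloss` to avoid clashing with the `loss` you are asked to write.

```julia
# Hypothetical stand-in for a model: the "loss" of a batch is just the sum of y.
toymodel(x, y; average=true) =
    (l = Float64(sum(y)); n = length(y); average ? l / n : (l, n))

function datasetloss(model, data; average=true)
    total, nwords = 0.0, 0
    for (x, y) in data
        l, n = model(x, y; average=false)   # per-batch (total_loss, num_words)
        total += l
        nwords += n
    end
    return average ? total / nwords : (total, nwords)
end

data = [([1 2], [1 2 3]), ([3 4], [4 5])]
datasetloss(toymodel, data; average=false)   # (15.0, 5)
datasetloss(toymodel, data)                  # 3.0
```

Averaging the two batch averages (2.0 and 4.5) would give 3.25 instead of 3.0, which is why the totals are accumulated separately.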

In [None]:
function loss(model, data; average=true)
    # Your code here
end

In [None]:
@info "Testing loss"
@test loss(model, dtst, average=false) == (1.0429117f6, 105937)
# Your loss can be slightly different due to different ordering of words in the vocabulary.
# The reference vocabulary starts with eos, unk, followed by words in decreasing frequency.
# Also, because we do not mask src, different batch sizes may lead to slightly different
# losses. The test above gives (1.0429178f6, 105937) with batchsize==1.

### Training S2S_v1

The following function can be used to train our model. `trn` is the training data, `dev`
is used to determine the best model, `tst...` can be zero or more small test datasets for
loss reporting. It returns the model that does best on `dev`.

In [None]:
function train!(model, trn, dev, tst...)
    bestmodel, bestloss = deepcopy(model), loss(model, dev)
    progress!(adam(model, trn), steps=100) do y
        losses = [ loss(model, d) for d in (dev,tst...) ]
        if losses[1] < bestloss
            bestmodel, bestloss = deepcopy(model), losses[1]
        end
        return (losses...,)
    end
    return bestmodel
end

You should be able to get under 3.40 dev loss with the following settings in 10
epochs. The training speed on a V100 is about 3 mins/epoch or 40K words/sec, K80 is about
6 times slower. Using settings closer to the Luong paper (per-sentence loss rather than
per-word loss, SGD with lr=1, gclip=1 instead of Adam), you can get to 3.17 dev loss in
about 25 epochs. Using dropout and shuffling batches before each epoch significantly
improve the dev loss. You can play around with hyperparameters but I doubt results will
get much better without attention. To verify your training, here is the dev loss I
observed at the beginning of each epoch in one training session:
`[9.83, 4.60, 3.98, 3.69, 3.52, 3.41, 3.35, 3.32, 3.30, 3.31, 3.33]`

In [None]:
@info "Training S2S_v1"
epochs = 10
ctrn = collect(dtrn)
trnx10 = collect(flatten(shuffle!(ctrn) for i in 1:epochs))
trn20 = ctrn[1:20]
dev38 = collect(ddev)
# Uncomment this to train the model (This takes about 30 mins on a V100):
# model = train!(model, trnx10, dev38, trn20)
# Uncomment this to save the model:
# Knet.save("s2s_v1.jld2","model",model)
# Uncomment this to load the model:
# model = Knet.load("s2s_v1.jld2","model")

### Generating translations

With a single argument, a `S2S_v1` object should take it as a batch of source sentences
and generate translations for them. After passing `src` through the encoder and copying
its hidden states to the decoder, the decoder is run starting with an initial input of all
`eos` tokens. Highest scoring tokens are appended to the output and used as input for the
subsequent decoder steps.  The decoder should stop generating when all sequences in the
batch have generated `eos` or when `stopfactor * size(src,2)` decoder steps are reached. A
correctly shaped target language batch should be returned.
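
The control flow of this greedy search can be sketched without any RNN. In the example below, `toystep` is a made-up scoring function standing in for "embed the previous tokens, run the decoder one step, project to vocabulary scores", and `greedydecode`, `lens`, `V` and the token ids are all assumptions for illustration.

```julia
# Greedy decoding loop: `step(prev, t)` must return a (V,B) score matrix.
function greedydecode(step, B, eos, maxsteps)
    prev = fill(eos, B)                  # initial decoder input: all eos
    done = falses(B)
    output = Matrix{Int}(undef, B, 0)
    for t in 1:maxsteps
        scores = step(prev, t)
        prev = vec(map(i -> i[1], argmax(scores, dims=1)))   # best token per column
        output = hcat(output, prev)
        done .|= (prev .== eos)          # remember which sequences have produced eos
        all(done) && break
    end
    return output
end

# Hypothetical scorer: sequence b emits token 3 for lens[b] steps, then eos.
lens, eos, V = [1, 3], 2, 4
toystep(prev, t) = [(t <= lens[b] ? v == 3 : v == eos) ? 1f0 : 0f0
                    for v in 1:V, b in 1:length(prev)]

stopfactor, srclength = 3, 2
out = greedydecode(toystep, 2, eos, stopfactor * srclength)
# out == [3 2 2 2; 3 3 3 2]
```

Decoding stops after four steps, once both rows contain an `eos`, even though up to six steps were allowed; a sequence that finishes early simply keeps receiving tokens until the rest of the batch is done.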

In [None]:
function (s::S2S_v1)(src::Matrix{Int}; stopfactor = 3)
    # Your code here
end

In [None]:
# Utility to convert int arrays to sentence strings
function int2str(y,vocab)
    y = vec(y)
    ysos = findnext(w->!isequal(w,vocab.eos), y, 1)
    ysos === nothing && return ""
    yeos = something(findnext(isequal(vocab.eos), y, ysos), 1+length(y))
    join(vocab.i2w[y[ysos:yeos-1]], " ")
end

In [None]:
@info "Generating some translations"
d = MTData(tr_dev, en_dev, batchsize=1) |> collect
(src,tgt) = rand(d)
out = model(src)
println("SRC: ", int2str(src,model.srcvocab))
println("REF: ", int2str(tgt,model.tgtvocab))
println("OUT: ", int2str(out,model.tgtvocab))
# Here is a sample output:
# SRC: çin'e 15 şubat 2006'da ulaştım .
# REF: i made it to china on february 15 , 2006 .
# OUT: i got to china , china , at the last 15 years .

### Calculating BLEU

BLEU is the most commonly used metric to measure translation quality. The following should
take a model and some data, generate translations and calculate BLEU.

In [None]:
function bleu(s2s,d::MTData)
    d = MTData(d.src,d.tgt,batchsize=1)
    reffile = d.tgt.file
    hypfile,hyp = mktemp()
    for (x,y) in progress(collect(d))
        g = s2s(x)
        for i in 1:size(y,1)
            println(hyp, int2str(g[i,:], d.tgt.vocab))
        end
    end
    close(hyp)
    isfile("multi-bleu.perl") || download("https://github.com/moses-smt/mosesdecoder/raw/master/scripts/generic/multi-bleu.perl", "multi-bleu.perl")
    run(pipeline(`cat $hypfile`,`perl multi-bleu.perl $reffile`))
    return hypfile
end

Calculating dev BLEU takes about 45 secs on a V100. We get about 8.0 BLEU which is pretty
low. As can be seen from the sample translations a loss of ~3+ (perplexity ~20+) or a BLEU
of ~8 is not sufficient to generate meaningful translations.
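
The correspondence between loss and perplexity quoted above is plain exponentiation, assuming the loss is the average negative log likelihood per word measured with natural logarithms. The helper name `perplexity` is introduced only for this example.

```julia
# Perplexity of a per-word average negative log likelihood (natural log assumed).
perplexity(loss) = exp(loss)

perplexity(3.0)    # about 20.1
perplexity(3.31)   # about 27.4
```

A perplexity in this range means the model is, on average, about as uncertain as if it were choosing uniformly among 20 to 30 words at each position.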

In [None]:
@info "Calculating BLEU"
bleu(model, ddev)

To improve the quality of translations we can use more training data, different training
and model parameters, or preprocess the input/output: e.g. splitting Turkish words to make
suffixes look more like English function words may help. Other architectures,
e.g. attention and transformer, perform significantly better than this simple S2S model.

*This notebook was generated using [Literate.jl](https://github.com/fredrikekre/Literate.jl).*