# RNN language model
Loosely based on [Zaremba et al. 2014](https://arxiv.org/abs/1409.2329), this example trains a word-based RNN language model on Mikolov's PTB data with a 10K vocabulary. It uses the `batchSizes` feature of `rnnforw` to process batches whose sentences have different lengths. The `mb` minibatching function sorts the sentences in a corpus by length so that sentences of similar length end up in the same batch. For an example that uses fixed-length batches and crosses sentence boundaries, see the [charlm](https://github.com/denizyuret/Knet.jl/blob/master/examples/charlm/charlm.ipynb) notebook.

In [12]:
using Knet
EPOCHS=10
RNNTYPE=:lstm
BATCHSIZE=64
EMBEDSIZE=128
HIDDENSIZE=256
VOCABSIZE=10000
NUMLAYERS=1
DROPOUT=0.5
LR=0.001
BETA_1=0.9
BETA_2=0.999
EPS=1e-08;
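
`LR`, `BETA_1`, `BETA_2` and `EPS` are passed to the Adam optimizer in the training cell further down. As a rough sketch of what they control, here is one textbook Adam update written in plain Julia. `adam_step!` is a hypothetical helper for illustration only, and Knet's own implementation may differ in detail.

```julia
# One textbook Adam update for a single parameter array.
# adam_step! is a hypothetical name, not a Knet function.
function adam_step!(w, g, m, v, t; lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8)
    for k in 1:length(w)
        m[k] = beta1*m[k] + (1-beta1)*g[k]      # running mean of gradients
        v[k] = beta2*v[k] + (1-beta2)*g[k]^2    # running mean of squared gradients
        mhat = m[k] / (1 - beta1^t)             # bias correction for step t
        vhat = v[k] / (1 - beta2^t)
        w[k] -= lr * mhat / (sqrt(vhat) + eps)  # eps guards against division by zero
    end
    return w
end

w = [1.0, -2.0]; g = [0.5, -4.0]   # made-up weights and gradients
m = zeros(2); v = zeros(2)
adam_step!(w, g, m, v, 1)
println(w)   # on the first step each weight moves by about lr, against the sign of its gradient
```

On the first step the size of the move is close to `lr` regardless of the gradient magnitude, which is why `LR` is the main knob to tune.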

In [13]:
# Load data
include(Knet.dir("data","mikolovptb.jl"))
(trn,val,tst,vocab) = mikolovptb()
@assert VOCABSIZE == length(vocab)+1 # +1 for the EOS token
for x in (trn,val,tst,vocab); println(summary(x)); end

42068-element Array{Array{UInt16,1},1}
3370-element Array{Array{UInt16,1},1}
3761-element Array{Array{UInt16,1},1}
9999-element Array{String,1}


In [14]:
# Print a sample
println(tst[1])
println(vocab[tst[1]])

UInt16[0x008e, 0x004e, 0x0036, 0x00fb, 0x0938, 0x0195]
String["no", "it", "was", "n't", "black", "monday"]


In [24]:
@doc mikolovptb

```
mikolovptb()
```

Read [PTB](https://catalog.ldc.upenn.edu/ldc99t42) text from Mikolov's [RNNLM](http://www.fit.vutbr.cz/~imikolov/rnnlm) toolkit which has been lowercased and reduced to a 10K vocabulary size.  Return a tuple (trn,dev,tst,vocab) where

```
trn::Vector{Vector{UInt16}}: 42068 sentences, 887521 words
dev::Vector{Vector{UInt16}}: 3370 sentences, 70390 words
tst::Vector{Vector{UInt16}}: 3761 sentences, 78669 words
vocab::Vector{String}: 9999 unique words
```


In [15]:
# Minibatch data into (x,y,b) triples. This is the most complicated part of the code.
# For language models x and y contain the same words shifted by one position:
# x has an EOS at the beginning, y has an EOS at the end.
# x,y = [s11,s21,s31,...,s12,s22,...], i.e. all the first words followed by all the second words etc.
# b = [b1,b2,...,bT], i.e. how many sentences have a first word, how many have a second word etc.
# length(x)==length(y)==sum(b) and length(b)==length(s1)+1 (+1 because of EOS).
# Sentences in a batch are sorted from longest to shortest, i.e. s1 is the longest sentence.
function mb(sentences,batchsize)
    sentences = sort(sentences,by=length,rev=true)
    data = []; eos = VOCABSIZE
    for i = 1:batchsize:length(sentences)
        j = min(i+batchsize-1,length(sentences))
        sij = view(sentences,i:j)
        T = 1+length(sij[1])
        x = UInt16[]; y = UInt16[]; b = UInt16[]
        for t=1:T
            bt = 0
            for s in sij
                if t == 1
                    push!(x,eos)
                    push!(y,s[1])
                elseif t <= length(s)
                    push!(x,s[t-1])
                    push!(y,s[t])
                elseif t == 1+length(s)
                    push!(x,s[t-1])
                    push!(y,eos)
                else
                    break
                end
                bt += 1
            end
            push!(b,bt)
        end
        push!(data,(x,y,b))
    end
    return data
end

mbtrn = mb(trn,BATCHSIZE)
mbval = mb(val,BATCHSIZE)
mbtst = mb(tst,BATCHSIZE)
map(length,(mbtrn,mbval,mbtst))

(658, 53, 59)
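
To make the `(x,y,b)` layout concrete, the sketch below packs a toy batch of three sentences using the same rules as `mb`. `packbatch` is a hypothetical name used only here, and the word ids and the EOS id 9 are made up. The last line checks the batch counts above against the corpus sizes.

```julia
# Pack one batch of sentences (already sorted longest first) into (x,y,b).
# packbatch is a hypothetical helper that mirrors the inner loop of mb.
function packbatch(sentences, eos)
    T = 1 + length(sentences[1])          # longest sentence plus one EOS step
    x = Int[]; y = Int[]; b = Int[]
    for t in 1:T
        bt = 0
        for s in sentences
            t > 1 + length(s) && break    # the remaining sentences are shorter still
            push!(x, t == 1 ? eos : s[t-1])
            push!(y, t <= length(s) ? s[t] : eos)
            bt += 1
        end
        push!(b, bt)
    end
    return x, y, b
end

# Three toy sentences of lengths 3, 2 and 1.
x, y, b = packbatch([[1,2,3], [4,5], [6]], 9)
println(x)   # [9, 9, 9, 1, 4, 6, 2, 5, 3]
println(y)   # [1, 4, 6, 2, 5, 9, 3, 9, 9]
println(b)   # [3, 3, 2, 1]

# Number of batches is the sentence count divided by BATCHSIZE, rounded up.
println(map(n -> cld(n, 64), (42068, 3370, 3761)))   # (658, 53, 59)
```

Note that no padding is needed: shorter sentences simply drop out of later time steps, which is what the shrinking entries of `b` record.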

In [16]:
# Define model
function initmodel()
    w(d...)=KnetArray(xavier(Float32,d...))
    b(d...)=KnetArray(zeros(Float32,d...))
    r,wr = rnninit(EMBEDSIZE,HIDDENSIZE,rnnType=RNNTYPE,numLayers=NUMLAYERS,dropout=DROPOUT)
    wx = w(EMBEDSIZE,VOCABSIZE)
    wy = w(VOCABSIZE,HIDDENSIZE)
    by = b(VOCABSIZE,1)
    return r,wr,wx,wy,by
end;
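
As a sanity check on model size, the arithmetic below counts the parameters implied by these shapes. The LSTM count is an assumption: it takes a single unidirectional layer with four gates, each with an input matrix, a recurrent matrix and two bias vectors, as in the LSTM equations of the `rnninit` documentation. The actual length of `wr` is not shown in this notebook.

```julia
# Parameter counts implied by the hyperparameters (plain arithmetic, no GPU needed).
EMBEDSIZE, HIDDENSIZE, VOCABSIZE = 128, 256, 10000

nwx = EMBEDSIZE * VOCABSIZE     # embedding matrix wx: (X,V)
nwy = VOCABSIZE * HIDDENSIZE    # output projection wy: (V,H)
nby = VOCABSIZE                 # output bias by: (V,1)

# Assumed LSTM layout: 4 gates, each with W (H,X), R (H,H), bW (H) and bR (H).
nwr = 4 * (HIDDENSIZE*EMBEDSIZE + HIDDENSIZE*HIDDENSIZE + 2*HIDDENSIZE)

println((nwx, nwy, nby, nwr))   # (1280000, 2560000, 10000, 395264)
println(nwx + nwy + nby + nwr)  # 4245264
```

Under these assumptions most of the parameters sit in the embedding and output matrices rather than in the recurrent layer.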

In [25]:
@doc rnninit

```
rnninit(inputSize, hiddenSize; opts...)
```

Return an `(r,w)` pair where `r` is a RNN struct and `w` is a single weight array that includes all matrices and biases for the RNN. Keyword arguments:

  * `rnnType=:lstm`: Type of RNN, one of :relu, :tanh, :lstm, :gru.
  * `numLayers=1`: Number of RNN layers.
  * `bidirectional=false`: Create a bidirectional RNN if `true`.
  * `dropout=0.0`: Dropout probability. Ignored if `numLayers==1`.
  * `skipInput=false`: Do not multiply the input with a matrix if `true`.
  * `dataType=Float32`: Data type to use for weights.
  * `algo=0`: Algorithm to use, see CUDNN docs for details.
  * `seed=0`: Random number seed. Uses `time()` if 0.
  * `winit=xavier`: Weight initialization method for matrices.
  * `binit=zeros`: Weight initialization method for bias vectors.

RNNs compute the output h[t] for a given iteration from the recurrent input h[t-1] and the previous layer input x[t] given matrices W, R and biases bW, bR from the following equations:

`:relu` and `:tanh`: Single gate RNN with activation function f:

```
h[t] = f(W * x[t] .+ R * h[t-1] .+ bW .+ bR)
```

`:gru`: Gated recurrent unit:

```
i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
r[t] = sigm(Wr * x[t] .+ Rr * h[t-1] .+ bWr .+ bRr) # reset gate
n[t] = tanh(Wn * x[t] .+ r[t] .* (Rn * h[t-1] .+ bRn) .+ bWn) # new gate
h[t] = (1 - i[t]) .* n[t] .+ i[t] .* h[t-1]
```

`:lstm`: Long short term memory unit with no peephole connections:

```
i[t] = sigm(Wi * x[t] .+ Ri * h[t-1] .+ bWi .+ bRi) # input gate
f[t] = sigm(Wf * x[t] .+ Rf * h[t-1] .+ bWf .+ bRf) # forget gate
o[t] = sigm(Wo * x[t] .+ Ro * h[t-1] .+ bWo .+ bRo) # output gate
n[t] = tanh(Wn * x[t] .+ Rn * h[t-1] .+ bWn .+ bRn) # new gate
c[t] = f[t] .* c[t-1] .+ i[t] .* n[t]               # cell output
h[t] = o[t] .* tanh(c[t])
```
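
The LSTM equations above can be written out directly in plain Julia for a single instance. This is only a readability sketch: `lstmstep` and the per-gate dictionaries are hypothetical, whereas `rnninit` packs all matrices and biases into one weight array and `rnnforw` runs the computation on the GPU.

```julia
sigm(x) = 1 / (1 + exp(-x))

# One LSTM time step following the :lstm equations.
# W, R map gate names to matrices; bW, bR map gate names to bias vectors.
function lstmstep(W, R, bW, bR, x, hprev, cprev)
    gate(k) = W[k]*x .+ R[k]*hprev .+ bW[k] .+ bR[k]
    i = sigm.(gate(:i))            # input gate
    f = sigm.(gate(:f))            # forget gate
    o = sigm.(gate(:o))            # output gate
    n = tanh.(gate(:n))            # new gate
    c = f .* cprev .+ i .* n       # cell output
    h = o .* tanh.(c)
    return h, c
end

X, H = 3, 2
gates = (:i, :f, :o, :n)
W  = Dict(k => zeros(H, X) for k in gates)
R  = Dict(k => zeros(H, H) for k in gates)
bW = Dict(k => zeros(H) for k in gates)
bR = Dict(k => zeros(H) for k in gates)

# With all-zero weights every sigmoid gate is 0.5 and the new gate is 0,
# so the cell state is halved: c = 0.5 and h = 0.5*tanh(0.5).
h, c = lstmstep(W, R, bW, bR, ones(X), zeros(H), ones(H))
println((h, c))
```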


In [17]:
# Define loss and its gradient
function predict(ws,xs,bs)
    r,wr,wx,wy,by = ws
    x = wx[:,xs] # xs=(ΣBt) x=(X,ΣBt)
    x = dropout(x,DROPOUT)
    (y,_) = rnnforw(r,wr,x,batchSizes=bs) # y=(H,ΣBt)
    y = dropout(y,DROPOUT)
    return wy * y .+ by  # return=(V,ΣBt)
end

loss(w,x,y,b) = nll(predict(w,x,b), y)

lossgradient = gradloss(loss);
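
`predict` returns a `(V,ΣBt)` matrix of unnormalized scores, one column per token, and `nll` turns it into a negative log likelihood. The plain Julia sketch below shows the computation, assuming the result is averaged over tokens (the `n*loss1` weighting in the training loop suggests this). `nll_sketch` is a hypothetical stand-in, not Knet's implementation.

```julia
# Average negative log likelihood from unnormalized scores.
# scores is (V,N): one column of V scores per token; answers holds the N correct word ids.
function nll_sketch(scores, answers)
    total = 0.0
    for j in 1:size(scores, 2)
        col = scores[:, j]
        mx = maximum(col)                        # subtract the max for numerical stability
        logz = mx + log(sum(exp.(col .- mx)))    # log of the softmax normalizer
        total -= col[answers[j]] - logz          # minus log probability of the correct word
    end
    return total / size(scores, 2)
end

V = 10000
j = nll_sketch(zeros(V, 3), [1, 42, V])
println(j)        # equals log(V) up to rounding, since every word is equally likely
println(exp(j))   # so the perplexity is approximately V
```

An untrained model that spreads probability evenly over the vocabulary therefore starts at a perplexity of about 10000, which gives a baseline for the numbers printed during training.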

In [26]:
@doc rnnforw

```
rnnforw(r, w, x[, hx, cx]; batchSizes, hy, cy)
```

Returns a tuple (y,hyout,cyout,rs) given rnn `r`, weights `w`, input `x` and optionally the initial hidden and cell states `hx` and `cx` (`cx` is only used in LSTMs).  `r` and `w` should come from a previous call to `rnninit`.  Both `hx` and `cx` are optional, they are treated as zero arrays if not provided.  The output `y` contains the hidden states of the final layer for each time step, `hyout` and `cyout` give the final hidden and cell states for all layers, `rs` is a buffer the RNN needs for its gradient calculation.

The boolean keyword arguments `hy` and `cy` control whether `hyout` and `cyout` will be output.  By default `hy = (hx!=nothing)` and `cy = (cx!=nothing && r.mode==2)`, i.e. a hidden state will be output if one is provided as input and for cell state we also require an LSTM.  If `hy`/`cy` is `false`, `hyout`/`cyout` will be `nothing`. `batchSizes` can be an integer array that specifies non-uniform batch sizes as explained below. By default `batchSizes=nothing` and the same batch size, `size(x,2)`, is used for all time steps.

The input and output dimensions are:

  * `x`: (X,[B,T])
  * `y`: (H/2H,[B,T])
  * `hx`,`cx`,`hyout`,`cyout`: (H,B,L/2L)
  * `batchSizes`: `nothing` or `Vector{Int}(T)`

where X is inputSize, H is hiddenSize, B is batchSize, T is seqLength, L is numLayers.  `x` can be 1, 2, or 3 dimensional.  If `batchSizes==nothing`, a 1-D `x` represents a single instance, a 2-D `x` represents a single minibatch, and a 3-D `x` represents a sequence of identically sized minibatches.  If `batchSizes` is an array of (non-increasing) integers, it gives us the batch size for each time step in the sequence, in which case `sum(batchSizes)` should equal `div(length(x),size(x,1))`. `y` has the same dimensionality as `x`, differing only in its first dimension, which is H if the RNN is unidirectional, 2H if bidirectional.  Hidden vectors `hx`, `cx`, `hyout`, `cyout` all have size (H,B1,L) for unidirectional RNNs, and (H,B1,2L) for bidirectional RNNs where B1 is the size of the first minibatch.
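
When `batchSizes` is used, the columns of `x` and `y` are packed by time step, so pulling out the hidden states of one sequence takes a little index arithmetic. The sketch below assumes the layout described above (all step-1 columns first, then step 2, and so on). `seqcolumns` is a hypothetical helper, not part of Knet.

```julia
# For each sequence, list its column indices in a packed (H,ΣBt) array.
function seqcolumns(batchSizes)
    @assert issorted(batchSizes, rev=true) "batchSizes must be non-increasing"
    offsets = cumsum(vcat(0, batchSizes[1:end-1]))   # number of columns before each time step
    cols = [Int[] for _ in 1:batchSizes[1]]
    for t in 1:length(batchSizes), k in 1:batchSizes[t]
        push!(cols[k], offsets[t] + k)               # sequence k is the k-th column at step t
    end
    return cols
end

bs = [3, 3, 2, 1]
cols = seqcolumns(bs)
println(cols)   # [[1, 4, 7, 9], [2, 5, 8], [3, 6]]

# Stand-in for an rnnforw output with H=2; the values are made up.
y = reshape(collect(1.0:18.0), 2, sum(bs))
@assert sum(bs) == div(length(y), size(y, 1))   # the consistency condition stated above
println(y[:, cols[1]])                          # hidden states of the longest sequence, one column per step
```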


In [18]:
# Train and test loops
function train(model,data,optim)
    Σ,N=0,0
    for (x,y,b) in data
        grads,loss1 = lossgradient(model,x,y,b)
        update!(model, grads, optim)
        n = length(y)
        Σ,N = Σ+n*loss1, N+n
    end
    return Σ/N
end

function test(model,data)
    Σ,N=0,0
    for (x,y,b) in data
        loss1 = loss(model,x,y,b)
        n = length(y)
        Σ,N = Σ+n*loss1, N+n
    end
    return Σ/N
end;
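
Both loops weight each batch loss by its token count `n`, because batches hold different numbers of tokens and a plain mean over batches would over-weight the small ones. A minimal sketch with made-up numbers (`tokenaverage` is a hypothetical name):

```julia
# Combine per-batch average losses into a per-token average.
function tokenaverage(batches)
    Σ, N = 0.0, 0
    for (loss1, n) in batches
        Σ, N = Σ + n*loss1, N + n
    end
    return Σ / N
end

batches = [(5.0, 100), (7.0, 300)]   # (average loss, token count) pairs, illustrative only
j = tokenaverage(batches)
println(j)        # 6.5, whereas the unweighted mean of 5.0 and 7.0 would be 6.0
println(exp(j))   # perplexity is the exponential of the per-token loss
```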

In [19]:
# Initialize and train model
model = optim = nothing; knetgc() # free gpu memory
model = initmodel()
optim = optimizers(model,Adam,lr=LR,beta1=BETA_1,beta2=BETA_2,eps=EPS)

for epoch=1:EPOCHS
    @time j1 = train(model,mbtrn,optim)  # ~39 seconds
    @time j2 = test(model,mbval)         # ~1 second
    @time j3 = test(model,mbtst)         # ~1 second
    println((epoch,exp(j1),exp(j2),exp(j3))); flush(STDOUT)  # prints perplexity = exp(negative_log_likelihood)
end

 38.946507 seconds (1.95 M allocations: 91.074 MiB, 29.31% gc time)
  0.946368 seconds (71.28 k allocations: 3.762 MiB, 33.05% gc time)
  0.827207 seconds (61.01 k allocations: 3.418 MiB, 27.84% gc time)
(1, 738.83484f0, 1.7030005f6, 1.5832602f6)
 38.646436 seconds (1.94 M allocations: 90.676 MiB, 29.44% gc time)
  0.913802 seconds (59.03 k allocations: 3.143 MiB, 33.56% gc time)
  0.832516 seconds (61.01 k allocations: 3.418 MiB, 28.32% gc time)
(2, 587.5651f0, 92741.34f0, 87532.92f0)
 38.806340 seconds (1.94 M allocations: 90.676 MiB, 29.27% gc time)
  0.919686 seconds (59.03 k allocations: 3.143 MiB, 33.59% gc time)
  0.826459 seconds (61.01 k allocations: 3.418 MiB, 27.82% gc time)
(3, 426.00394f0, 28219.928f0, 27050.39f0)
 38.764006 seconds (1.94 M allocations: 90.676 MiB, 29.38% gc time)
  0.919739 seconds (59.03 k allocations: 3.143 MiB, 33.92% gc time)
  0.832677 seconds (61.01 k allocations: 3.418 MiB, 28.34% gc time)
(4, 355.2729f0, 11222.971f0, 10688.494f0)
 38.931251 second