# ML Papers

[Weight Normalization](#weight_norm)

[Highway Networks](#highway)

[DenseNet](#densenet)

[Bootstrap](#bootstrap)

[LSTM/RHN on Natural Language Modeling](#1707.05589)

[Clustering Financial Time Series](#clustering_ts)

[Temporal Convolutional Networks](#tcn)

[Layer Normalization](#layer_norm)

[DropConnect](#dropconn)

[awd-lstm-lm](#awd-lstm-lm)

<a id='weight_norm'></a>
## Weight Normalization

[paper](https://arxiv.org/abs/1602.07868)

Speeds up SGD convergence by reparameterizing the weight vectors so that their norm is decoupled from their direction. For a given unit of a network:

$$ y = \phi(w \cdot x + b) $$

where $w \in \mathbb{R}^k$, $b$ is a scalar bias, and $x \in \mathbb{R}^k$, we reparameterize $w$ as:

$$ w = \frac{g}{\| v \|} v $$

where $g$ is a scalar and $\| v \|$ denotes the Euclidean norm of the vector $v$. We now have $\| w \| = g$, independent of $v$.

Furthermore, we can reparameterize $g$ as $g = e^s$, where $s$ is a log-scale parameter to learn by SGD. However, empirically, the authors **did not** find this to be an advantage, and optimization was slightly slower.
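
A minimal NumPy sketch of the reparameterization above, to make the property $\| w \| = g$ concrete. The helper name `weight_norm` and the example values are illustrative assumptions, not code from the paper.

```python
import numpy as np

def weight_norm(v, g):
    """Return w = g * v / ||v||, so that ||w|| equals g."""
    return g * v / np.linalg.norm(v)

rng = np.random.default_rng(0)
v = rng.normal(loc=0.0, scale=0.05, size=4)  # direction parameter
g = 2.0                                      # scale parameter

w = weight_norm(v, g)
w_norm = np.linalg.norm(w)

# Rescaling v leaves w unchanged: only the direction of v matters.
w_rescaled = weight_norm(10.0 * v, g)
```

Because only the direction of `v` enters `w`, the length of the weight vector is controlled entirely by `g`.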

### Comparison to Batch Norm

In special cases weight norm is the same as batch norm, see paper section 2.2 for detail.

For CNNs, weight normalization is often much cheaper computationally. It is also non-stochastic and not affected by batch size, so it can be viewed as a cheaper and less noisy approximation to batch norm. The equivalence does not hold for deeper architectures.

### Data-Dependent Initialization of Parameters

It is important to initialize the parameters properly. The authors propose sampling the elements of $v$ from a simple distribution with fixed scale, such as a normal distribution with zero mean and standard deviation of 0.05. $g$ and $b$ are then set from the statistics of an initial minibatch so that the pre-activations start with zero mean and unit variance.

This only works where batch norm is applicable; for RNNs and LSTMs, one needs to resort to standard initialization methods.

<a id='highway'></a>
## Highway Networks

[Summary Paper](https://arxiv.org/abs/1505.00387)

[Full Paper](https://arxiv.org/abs/1507.06228)


Highway networks enable the optimization of networks with virtually arbitrary depth. This is accomplished through a **learned gating mechanism** for regulating information flow, which is inspired by LSTM.

The paper shows that the optimization of highway networks is virtually independent of depth; the approach was used to train a 900-layer network.

A plain feedforward network of $L$ layers with $H$ as a non-linear transform function, ignoring layer index:

$$ y = H(x, W_H) $$

For a highway network, the paper adds two non-linear transforms: 
* Transform gate, $T(x, W_T)$
* Carry gate, $C(x, W_C)$

$$ y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C) $$

The paper sets $C = 1 - T$

The **dimensionality** of $x$, $y$, $H(x, W_H)$ and $T(x, W_T)$ must be the same for the equation above to hold.

If size of the representation needs to be changed, two ways:

1. replace $x$ with $\tilde{x}$ obtained by suitably sub-sampling or zero-padding $x$
2. Use a plain layer without highway to change dimensionality and then continue with stacking highway layers. This is what the paper used.



**Transform gate** is defined as:

$$
\begin{aligned}
T(x) &= \sigma(W_T^T x + b_T) \\
\sigma(x) &= \frac{1}{1 + e^{-x}}, x \in \mathbb{R}
\end{aligned}
$$

$b_T$ can be initialized with a negative number (e.g. -1, -3, etc.) such that the network is initially biased towards **carry** behaviour. The paper found that during training $b_T$ actually became more negative. This behaviour suggests that the strong negative biases at low depths are not used to shut down the gates, but to make them more selective. 

$W_H$ can be initialized with various zero-mean distributions.
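
A minimal NumPy sketch of a single highway layer with $C = 1 - T$, showing how the sign of $b_T$ moves the layer between carry and transform behaviour. Using `tanh` for $H$, square weight matrices and biases of $\pm 20$ are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x)), i.e. the carry gate is C = 1 - T."""
    h = np.tanh(W_H @ x + b_H)   # H: tanh chosen for illustration
    t = sigmoid(W_T @ x + b_T)   # transform gate T
    return h * t + x * (1.0 - t)

rng = np.random.default_rng(0)
dim = 5  # x, y, H(x) and T(x) must all share this dimensionality
x = rng.normal(size=dim)
W_H = rng.normal(scale=0.1, size=(dim, dim))
W_T = rng.normal(scale=0.1, size=(dim, dim))
b_H = np.zeros(dim)

# Strongly negative gate bias: the layer carries x through almost unchanged.
y_carry = highway_layer(x, W_H, b_H, W_T, b_T=np.full(dim, -20.0))
# Strongly positive gate bias: the layer behaves like a plain layer H(x).
y_transform = highway_layer(x, W_H, b_H, W_T, b_T=np.full(dim, 20.0))
```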

TODO
<a id='densenet'></a>
## Densely Connected Convolutional Networks

<a id='bootstrap'></a>
# Bootstrap

**My current thinking** is that perhaps it's best to look at multiple measures; large discrepancies would point to something odd. Otherwise, the differences between the methods are mostly minor for practical use, particularly compared with other market uncertainties.


Efron: **Bias-corrected and accelerated bootstrap intervals (BCa)** give better estimates of confidence intervals under certain assumptions, see below.
 
[Paper](https://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ss/1032280214)

[Slides](https://faculty.washington.edu/heagerty/Courses/b572/public/GregImholte-1.pdf)

[Stack Exchange Answer](https://stats.stackexchange.com/questions/19340/bootstrap-based-confidence-interval)

## `arch`
Python package `arch` has studentized/bias-corrected intervals for bootstrap results. See [doc](http://arch.readthedocs.io/en/latest/bootstrap/confidence-intervals.html#bias-corrected-and-accelerated-bca)

Discovered a minor bug in `conf_int(method='bca')`, see [issue](https://github.com/bashtage/arch/issues/193)


## `scikits-bootstrap`

Another implementation, see comments in [code](https://github.com/cgevans/scikits-bootstrap/blob/master/scikits/bootstrap/bootstrap.py)


## Block Bootstrap

**Politis, White, 2004. Automatic Block-Length Selection for the Dependent Bootstrap**

Code in R and Matlab can be found on this [page](http://public.econ.duke.edu/~ap172/) by Prof. Andrew Patton, as well as a correction in 2009 to the original paper.

Main points from the paper:

**Stationary Bootstrap (SB)** is less accurate than **Circular Block Bootstrap (CB)** for estimating $\sigma^2_{\infty}$. Although they have similar bias the SB has higher variance due to the additional randomization involved in drawing the random block size.

SB is **less sensitive** to block size mis-specification compared to CB and/or the moving block bootstrap. 
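
To make the difference between the two schemes concrete, here is a rough NumPy sketch of how resampling indices could be drawn: the circular block bootstrap uses fixed-length blocks, while the stationary bootstrap draws block lengths from a geometric distribution. The function names and parameters are hypothetical and are not taken from the paper's R/Matlab code or from `arch`.

```python
import numpy as np

def circular_block_indices(n, block_size, rng):
    """Circular block bootstrap: fixed-length blocks, wrapping around the end."""
    n_blocks = -(-n // block_size)  # ceil(n / block_size)
    starts = rng.integers(0, n, size=n_blocks)
    idx = (starts[:, None] + np.arange(block_size)[None, :]) % n
    return idx.ravel()[:n]

def stationary_indices(n, mean_block_size, rng):
    """Stationary bootstrap: block lengths are geometric with the given mean."""
    p = 1.0 / mean_block_size  # probability of starting a new block
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(0, n)
    for t in range(1, n):
        if rng.random() < p:
            idx[t] = rng.integers(0, n)    # start a new block
        else:
            idx[t] = (idx[t - 1] + 1) % n  # continue the current block
    return idx

rng = np.random.default_rng(0)
series = np.arange(20)
cb_sample = series[circular_block_indices(len(series), block_size=4, rng=rng)]
sb_sample = series[stationary_indices(len(series), mean_block_size=4, rng=rng)]
```

The extra randomness in the stationary version (the random block lengths) is the source of the higher variance noted above.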

<a id='1707.05589'></a>

# On the State of the Art of Evaluation in Neural Language Models

[paper](https://arxiv.org/abs/1707.05589)

This paper tuned LSTM and Recurrent Highway Networks to beat many more complex models in natural language modeling. A lot of dropout techniques are used here. 

Instead of comparing models based on the number of hidden units, comparison is done based on **total number of trainable parameters.**

Dropout:
* input dropout
* intra-layer dropout
* down-projected outputs / output dropout

Things to learn about:

**Variational Dropout**: Gal and Ghahramani 2016, [link](https://arxiv.org/abs/1512.05287)

**Recurrent Dropout**: Semeniuta et al. 2016. [link](https://arxiv.org/abs/1603.05118)

**Using mean-field approximation for dropout at test time**. 

**Truncated backpropagation**: Training used Adam/batch size 64/truncated backprop performed with 50 time steps.

Hyperparameter tuning was performed using a black-box tuner based on **batched GP bandits** (Desautels et al. 2014)

<a id='clustering_ts'></a>
## Clustering Financial Time Series: how long is enough?

[paper](https://arxiv.org/abs/1603.04017)

Looked at Hierarchical Correlation Block Model (HCBM) with single-linkage, complete-linkage, average-linkage, Ward, McQuitty, Median, Centroid algos.

Spearman rank correlation is more robust than Pearson correlation when there is:
* noise
* variables with infinite second moments.

Their conclusion was that the **Ward** method converges faster than others such as single/average-linkage: it converges at around 250 observations for 256 assets with a correlation matrix similar to Figure 2 in the paper.

<a id='tcn'></a>

## Temporal Convolutional Networks (TCN)

[paper](https://arxiv.org/abs/1803.01271), code available on [github](https://github.com/locuslab/TCN). ICLR 2018 reviews [here](https://openreview.net/forum?id=rk8wKk-R-). 


Here is a description of its basic architecture. 

Two principles: 

* the network produces an output of the same length as the input
* there can be no leakage from the future into the past

TCN uses a **1-D fully-convolutional network + causal convolutions**, meaning an output at time `t` is convolved only with elements from time `t` and earlier in the previous layer. 

* Each hidden layer is the **same length as the input layer**.
* Zero padding of length `k-1`, where `k` is the kernel size, keeps subsequent layers the same size as the previous ones.

In `tcn.py`, `Chomp1d()` is essentially a layer that chops off the extra padding added to the end of the inputs to ensure **causal convolution**. 

**Dilated convolution** is used here, starting at the input layer with dilation $d = 1$ (i.e. no dilation) and increasing the dilation **exponentially** with the depth of the network (i.e. $d = \mathcal{O}(2^i)$ where $i$ is the layer index).

* **Same padding** size for layer `i`: `padding = (k - 1) * dilation_size`.
* **Larger dilation** enables an output at the top level to represent a **wider** range of inputs, thus increasing the **receptive field** of the conv net. 
* 3 ways to **increase** the receptive field: 
    1. larger filter size `k`
    2. larger dilation, set `d=2**i` for layer `i`
    3. more layers; in the code the number of layers is set to the length of the `num_channels` list.

Ensures:

1. some filter hits each input within the effective history,
2. allowing for an extremely large effective history using deep network. 
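
The effect of kernel size, dilation and depth on the receptive field can be checked with a small calculation. This is a sketch that assumes dilation `d = 2**i` at level `i` and two causal convolutions per level (as in the `TemporalBlock` flow described in these notes); the function name is hypothetical.

```python
def tcn_receptive_field(kernel_size, num_levels, convs_per_level=2):
    """Receptive field of stacked dilated causal convolutions with d = 2**i."""
    field = 1
    for i in range(num_levels):
        dilation = 2 ** i
        # each causal convolution looks (k - 1) * d steps further back
        field += convs_per_level * (kernel_size - 1) * dilation
    return field

rf_single_conv = tcn_receptive_field(kernel_size=3, num_levels=1, convs_per_level=1)
rf_shallow = tcn_receptive_field(kernel_size=3, num_levels=1)
rf_deep = tcn_receptive_field(kernel_size=3, num_levels=8)
```

With `k = 3`, one level covers 5 time steps while 8 levels cover 1021, which is why depth with exponential dilation gives a very long effective history.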

<img src='./img/tcn.png' width='800'/>

### `TemporalBlock` in code

**Residual connection** like ResNet, with weight normalization. Unlike ResNet, the input and output of a TCN block can have different widths (numbers of channels), therefore the authors use a 1x1 convolution to ensure that the **element-wise** addition receives tensors of the same shape (figure b, c above).

**Weight normalization** [above](#weight_norm) from [Salimans & Kingma 2016](https://arxiv.org/abs/1602.07868). Pytorch: `torch.nn.utils.weight_norm()` [docs](http://pytorch.org/docs/master/nn.html#weight-norm)

**Spatial dropout** from [Srivastava et al 2014](http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf). Pytorch: `torch.nn.Dropout2d()` [docs](http://pytorch.org/docs/master/nn.html#torch.nn.Dropout2d)

The authors demonstrated that TCN has longer memory than LSTM and GRU in 2 tasks. 

#### Construction

Input shape is `(N, C, L)`, where:
* `N` is number of samples
* `C` is number of channels, or number of feature dimensions
* `L` is sequence length of all channels/features.

For the `Conv1d` layers, **input** and **output** sizes are set to the **number of channels** for input and output. 

In the paper, padding is done such that input and output sequence lengths are matched.

`Chomp1d` applied after `Conv1d` to ensure causal convolution, `chomp_size = padding`.

#### Architecture

Forward flow is as follows: 

```
z = conv1d -> chomp1d(padding) -> relu -> dropout -> conv1d -> chomp1d(padding) -> relu -> dropout

if input_channels != output_channels:
    # map the block input x to output_channels with a 1x1 convolution
    residual = conv1d(x, input_channels, output_channels, kernel_size=1)
else:
    residual = x

# the residual is always added before the final relu
out = relu(z + residual)

return out
```

In [1]:
import torch
import torch.nn as nn
from torch.autograd import Variable

In [3]:
# shape is (N, in_channels, L)
# N: number of samples
# in_channels: or number of features
# L: feature length
x = torch.randn(2, 1, 5)
x.shape

torch.Size([2, 1, 5])

In [4]:
x


(0 ,.,.) = 
  0.2089 -0.2274  0.0779 -0.8448 -0.0247

(1 ,.,.) = 
  0.3559 -0.5786 -0.2676 -1.2133  2.0995
[torch.FloatTensor of size 2x1x5]

In [14]:
padding = 2

In [5]:
m = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3, padding=padding)

In [6]:
y = m(Variable(x, requires_grad=False))

In [7]:
y.shape

torch.Size([2, 4, 7])

In [8]:
# y is padded by 2
y

Variable containing:
(0 ,.,.) = 
  0.3220  0.2502  0.4168  0.3306  0.6797  0.3565  0.3297
  0.2908  0.3632  0.1106  0.3824 -0.0922  0.6123  0.2945
  0.0077 -0.2472  0.0814 -0.5939  0.1302 -0.4114 -0.1000
 -0.1645 -0.4203 -0.2044 -0.4503  0.1377  0.2141 -0.2240

(1 ,.,.) = 
  0.3170  0.2012  0.5718  0.4940  0.7666 -0.5190  0.2869
  0.2952  0.4129 -0.1018  0.3707 -0.0417  1.6270 -0.5545
  0.0768 -0.4474  0.0599 -0.8202  1.0825 -1.0621  0.7218
 -0.1136 -0.6156 -0.2242 -0.2223  1.2371 -0.6594 -1.3268
[torch.FloatTensor of size 2x4x7]

In [9]:
m.weight

Parameter containing:
(0 ,.,.) = 
 -0.0202 -0.4157 -0.0345

(1 ,.,.) = 
 -0.3997  0.4084  0.0296

(2 ,.,.) = 
  0.3869 -0.2392  0.4698

(3 ,.,.) = 
 -0.5192 -0.5013  0.3462
[torch.FloatTensor of size 4x1x3]

In [10]:
m.weight.data[0,0,2]

-0.034489214420318604

In [11]:
m.bias

Parameter containing:
 0.3292
 0.2847
-0.0904
-0.2368
[torch.FloatTensor of size 4]

In [12]:
torch.sum(torch.mul(x[0, 0, :1], m.weight.data[0, 0, 2])) + m.bias[0]

Variable containing:
 0.3220
[torch.FloatTensor of size 1]

In [13]:
y.data[0, 0, 0]

0.3220287561416626

`Chomp1d()` in the TCN code chops off the padding on the right to ensure causal convolution. 

See example below. 

In [2]:
class Chomp1d(nn.Module):

    def __init__(self, chomp_size):
        super(Chomp1d, self).__init__()
        self.chomp_size = chomp_size

    def forward(self, x):
        return x[:, :, :-self.chomp_size].contiguous()

In [15]:
chomp = Chomp1d(padding)

In [16]:
# last two columns of y chopped off.
chomp(y)

Variable containing:
(0 ,.,.) = 
  0.3220  0.2502  0.4168  0.3306  0.6797
  0.2908  0.3632  0.1106  0.3824 -0.0922
  0.0077 -0.2472  0.0814 -0.5939  0.1302
 -0.1645 -0.4203 -0.2044 -0.4503  0.1377

(1 ,.,.) = 
  0.3170  0.2012  0.5718  0.4940  0.7666
  0.2952  0.4129 -0.1018  0.3707 -0.0417
  0.0768 -0.4474  0.0599 -0.8202  1.0825
 -0.1136 -0.6156 -0.2242 -0.2223  1.2371
[torch.FloatTensor of size 2x4x5]

<a id='layer_norm'></a>

# Layer Normalization

[2016 Paper](https://arxiv.org/abs/1607.06450)

## Notations

Given the $l^{th}$ hidden layer: 

* $w^l$ holds the weights, where $w^l_i$ is the incoming weight vector of the $i^{th}$ hidden unit,
* $h^l$ is the bottom-up input, 
* $b^l$ is the bias,
* $a^l$ is the vector of the summed inputs to the neurons in that layer:

$$ a^l_i = {w^l_i}^\top h^l $$
$$ h^{l+1}_i = f(a^l_i + b^l_i) $$

Where $f()$ is an element-wise non-linear function. 

For [Batch Norm](./andrew_ng/DeepLearning.ai.ipynb#batchnorm), the normalization statistics are estimated over the current mini-batch, because computing them over the whole training set is impractical. This puts constraints on the size of a minibatch, and it is hard to apply to RNNs.

## Layer Norm

Because changes in the output of one layer tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot, **layer norm** statistics are computed over all the hidden units in the same layer as follows:

* $H$ number of hidden units in a layer

$$
\begin{aligned}
\mu^l &= \frac{1}{H} \sum^H_{i=1} a^l_i \\
\sigma^l &= \sqrt{\frac{1}{H}\sum^H_{i=1}\big( a^l_i - \mu^l\big)^2}
\end{aligned}
$$

Normalization happens within the hidden layer: all hidden units in the same layer share the same $\mu$ and $\sigma$, but different training cases have different normalization terms.

**Unlike** batch norm, layer norm:

* does not impose any constraint on the size of minibatch
* can be used in a pure online regime with batch size 1

For RNN, layer norm works as follows

$$
\begin{aligned}
a^t &= W_{hh} h^{t-1} + W_{xh} x^t \\
h^t &= f \bigg[ \frac{g}{\sigma^t} \odot \big( a^t - \mu^t \big) + b \bigg] \\
\mu^t &= \frac{1}{H} \sum^H_{i=1} a^t_i \\
\sigma^t &= \sqrt{\frac{1}{H}\sum^H_{i=1}\big( a^t_i - \mu^t\big)^2}
\end{aligned}
$$

Where:

* $b$ is bias parameter, same dimension as $h^t$
* $g$ is gain parameter, same dimension as $h^t$

Layer norm for RNN results in much more stable hidden-to-hidden gradient dynamics (vs. exploding/vanishing gradients).
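
A minimal NumPy sketch of the layer norm statistics above, applied to a small batch of summed inputs. The `eps` term is an assumption added for numerical stability (it does not appear in the formulas in these notes), and the function name is hypothetical.

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-5):
    """Normalize each row of `a` over its H hidden units, then apply gain and bias."""
    mu = a.mean(axis=-1, keepdims=True)
    sigma = a.std(axis=-1, keepdims=True)  # population std, matching the 1/H factor
    return g * (a - mu) / (sigma + eps) + b

rng = np.random.default_rng(0)
H = 8
a = rng.normal(loc=3.0, scale=2.0, size=(4, H))  # 4 training cases, H hidden units
out = layer_norm(a, g=np.ones(H), b=np.zeros(H))

# Each training case is normalized on its own, so batch size 1 gives the same result.
single = layer_norm(a[:1], g=np.ones(H), b=np.zeros(H))
```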

Section 4 of the paper discusses related work including weight norm.

<img src='./img/layer_norm_comp.png' width=800/>

<a id='dropconn'></a>
# DropConnect

[Li Wan, et al. 2013](https://cs.nyu.edu/~wanli/dropc/dropc.pdf) and a nice web [page](https://cs.nyu.edu/~wanli/dropc/) that explains it.

## Summary 

Unlike **dropout** which drops output of a neural cell with probability `1-p`, DropConnect drops the **weights of a layer** with probability `1-p`. 

A Bernoulli mask matrix of the same dimension as the weight matrix is generated for **each** training sample, **not** for a minibatch.

Therefore, memory requirement increases as the size of minibatch increases.

Inference is based on drawing samples from a Gaussian distribution with mean and variance estimated with the dropout rate. 

## Notation

For a network layer:

* Input: $v = [ v_1, v_2, \dots, v_n ]^T$
* Weight matrix: $W \in R^{d \times n}$, bias terms are included here.
* Activation function: `a()`
* Output: $r = [r_1, r_2, \dots, r_d]^T$

A simple fully connected layer is:

$$ r = a(Wv) $$

Dimension here:

`(Wv).shape = (d, n) * (n, 1) = (d, 1)`, therefore `r.shape` is `(d, 1)`


## Details

### Dropout

A binary mask **vector** is drawn, $m \in R^d$, where each element of $m$ is drawn from $m_j \sim Bernoulli(p)$.

In [44]:
import scipy.stats as ss

# d is the dimension of the output vector r
d = 10
m = ss.bernoulli.rvs(.8, size=d)
m

array([1, 0, 1, 1, 1, 1, 1, 1, 0, 0])

A **forward layer** is then defined as:

$$ r = m \odot a(Wv) $$

For many activation functions such as **tanh**, **centered sigmoid**, and **relu**, which have the property $a(0) = 0$, dropout can equivalently be applied at the inputs to the activation function, i.e.

$$ r = a(m \odot Wv) $$

### DropConnect

Similar to dropout as it introduces dynamic sparsity within the model, but the difference is that the sparsity is on the **weights** $W$, rather than on the output vectors.

A binary mask **matrix** is drawn, $M \in R^{d\times n}$ (same dimension as the weight matrix $W$), where each element of $M$ is drawn from $M_{ij} \sim Bernoulli(p)$. The output of this layer is given by:

$$ r = a ((M \odot W) v) $$

Note that the biases are also included in the masking here.

In [51]:
import numpy as np

# W is the weight matrix
W = np.random.randn(3, 5)
M = ss.bernoulli.rvs(.8, size=W.shape)

print('W.shape: ', W.shape, 'M.shape: ', M.shape)

# masking is done by element-wise product
M * W

W.shape:  (3, 5) M.shape:  (3, 5)


array([[-0.        , -0.67000094, -0.        ,  2.15240859,  0.78178666],
       [ 0.80938994,  0.73258852,  0.        , -1.07328596, -0.11110874],
       [ 0.48647332, -0.        ,  0.        , -0.8445224 ,  0.        ]])

### DropConnect Training

This mask is generated **for each training example independently**. Therefore, a different connection is used for each example seen. This is a **key component of successful training with DropConnect**. The paper states that "Selecting a single mask for a subset of training examples, such as a mini-batch of 128 examples, **does not** regularize the model enough in practice."

This means that the memory requirement for $M$ grows as the size of the mini-batch grows. 

The mask $M$ is applied to the gradient to update only those elements that were active in the forward pass. When passing gradient down, the **masked weight matrix** ($M \odot W$) is used.

<img src='img/dropconnect_training.png' width=400/>

### DropConnect Inference

For inference, computing $2^{\mid M \mid}$ different masks is **infeasible**.

Dropout makes the approximation $\sum_M a((M \odot W)v) \approx a(\sum_M (M \odot W)v)$, which works in practice but is not justified mathematically, particularly for **relu**. 

For DropConnect, a Gaussian approximation via moment matching is used. The mean and variance are defined **before activation** as 

$$u = (M \odot W) v$$

$$
\begin{aligned}
E_M [u] &= p W v \\
V_M [u] &= p(1 - p)(W \odot W) (v \odot v) 
\end{aligned}
$$

Samples are drawn from this distribution and passed through the activation function; the results are then averaged and passed to the next layer.
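
The moment-matching formulas can be verified on a tiny layer by enumerating every possible mask. This is a NumPy sketch under stated assumptions: the function names are hypothetical, `relu` is used as the activation, and the brute-force check is only feasible because $W$ has 6 entries.

```python
import itertools
import numpy as np

def dropconnect_moments(W, v, p):
    """Mean and variance of u = (M * W) v for M_ij ~ Bernoulli(p)."""
    mean = p * (W @ v)
    var = p * (1.0 - p) * ((W * W) @ (v * v))
    return mean, var

def exact_moments(W, v, p):
    """Brute-force check: enumerate every mask (only feasible for tiny W)."""
    mean = np.zeros(W.shape[0])
    second = np.zeros(W.shape[0])
    for bits in itertools.product([0, 1], repeat=W.size):
        M = np.array(bits).reshape(W.shape)
        prob = p ** M.sum() * (1.0 - p) ** (W.size - M.sum())
        u = (M * W) @ v
        mean += prob * u
        second += prob * u ** 2
    return mean, second - mean ** 2

def dropconnect_inference(W, v, p, n_samples, rng):
    """Sample pre-activations from the Gaussian, apply relu, then average."""
    mean, var = dropconnect_moments(W, v, p)
    u = rng.normal(mean, np.sqrt(var), size=(n_samples, W.shape[0]))
    return np.maximum(u, 0.0).mean(axis=0)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))
v = rng.normal(size=3)
mean, var = dropconnect_moments(W, v, p=0.8)
mean_exact, var_exact = exact_moments(W, v, p=0.8)
r = dropconnect_inference(W, v, p=0.8, n_samples=1000, rng=rng)
```

The closed-form mean and variance agree with the exhaustive enumeration, while the sampling step only needs `n_samples` Gaussian draws instead of $2^{\mid M \mid}$ masks.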

<img src='img/dropconnect_inference.png' width=400/>

# Language Modeling

<a id='awd-lstm-lm'></a>

## Regularizing and Optimizing LSTM Language Models

AvSGD Weight-Dropped LSTM (AWD-LSTM)

S. Merity, et al. ICLR Feb 2018, [paper](https://openreview.net/forum?id=SyyGPP0TZ)

Perplexity on the experiment data:
* Penn Treebank, 52.8
* WikiText-2, 52.0

GitHub [pull request](https://github.com/salesforce/awd-lstm-lm/pull/43) to make the code work in `pytorch 0.4`.

### RNN Regularization

Some techniques cited:

* Dropout by Gal & Ghahramani, 2016
* Limiting updates to RNN's hidden state, Semeniuta et al. 2016 / Zone-out Krueger et al. 2016
* Restrictions on the recurrent matrices
* Batch norm / Layer norm - paper argues both introduce additional training parameters and can complicate the training process while increasing the sensitivity of the model.
* This paper introduces **Weight-Dropped LSTM**, applying recurrent regularization through a [DropConnect](#dropconn) mask on the **hidden-to-hidden** recurrent weights only. Code: `weight_drop.py`, class `WeightDrop`.

Specifically, given the following notation for LSTM, weight-drop with DropConnect is applied only on $[U^i, U^f, U^o, U^c]$.

<img src='img/lstm_merity_2017.png' width=200/>

Question: how is this done in code? 

### Extended Regularization Techniques

* **Randomized-length BPTT**. A fixed BPTT window results in inefficient use of data. (Where is this in the code?) Solution: 
    * randomly select the sequence length for the forward and backward pass in two steps. 
    * The learning rate is also rescaled depending on the length of the resulting sequence compared to the originally specified sequence length. (How?)


* **Variational dropout** [(Gal & Ghahramani, 2016)](https://arxiv.org/abs/1512.05287) samples a binary dropout mask **only once** upon the first call and then repeatedly uses that **locked** dropout mask for all repeated connections within the forward and backward pass.
    * Used here for **all dropout operations other than hidden-to-hidden**. 
    * specifically, applied to all inputs and outputs of the LSTM.
    * A different mask is used for each example within the minibatch. Similar to dropconnect here.


* **Embedding dropout** (Gal & Ghahramani, 2016), equivalent to performing dropout on the embedding matrix at word level. 
    * Since dropout is at the embedding level, it implies that **all** occurrences of a specific word will disappear within that pass.
    * Equivalent to performing variational dropout on the connection between the one-hot embedding and the embedding lookup.


* **Weight tying** - shares the weights between the embedding and softmax layer, substantially reducing the total number of parameters in the model.



* **Independent embedding size and hidden size**
    * first and last LSTM layers are modified such that the **input of the first layer** and the **output of the last layer** have dimensionality **equal** to the reduced embedding size.


* **Activation regularization (AR)**, applied only to the output of the **final RNN layer**.
    * $L_2$ decay used on individual unit activations and on the difference in outputs of an RNN at different time steps.
    * AR penalizes activations that are significantly larger than 0 as a means of regularizing the network.
    * **Defined as**: $\alpha \mathrm{L}_2 (m \odot h_t)$, $m$ is the dropout mask, $\mathrm{L}_2(\cdot) = \|\cdot\|_2$, $h_t$ is the output of the RNN at timestep $t$, $\alpha$ is a scaling coefficient.
    * See [here](#ar_tar)


* **Temporal activation regularization (TAR)**, applied only to the output of the final RNN layer
    * slowness regularizer
    * penalizes the model for producing large changes in the hidden state.
    * **Defined as**: $\beta \mathrm{L}_2(h_t - h_{t+1})$, same notation as above, $\beta$ is a scaling coefficient.
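
The two penalties defined above can be sketched in NumPy as follows. Averaging the per-timestep norms over the sequence is an assumption made here for illustration (the definitions in these notes give only the per-timestep terms), and the function names are hypothetical.

```python
import numpy as np

def ar_penalty(h, mask, alpha):
    """alpha * L2(m * h_t), averaged over timesteps (averaging is an assumption)."""
    return alpha * np.linalg.norm(mask * h, axis=-1).mean()

def tar_penalty(h, beta):
    """beta * L2(h_t - h_{t+1}), averaged over consecutive timestep pairs."""
    return beta * np.linalg.norm(h[:-1] - h[1:], axis=-1).mean()

# h has shape (timesteps, hidden units) for a single sequence
h_constant = np.ones((5, 4))
h_jumpy = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
no_dropout = np.ones(4)

ar_value = ar_penalty(h_constant, no_dropout, alpha=2.0)
tar_constant = tar_penalty(h_constant, beta=1.0)  # no change between steps
tar_jumpy = tar_penalty(h_jumpy, beta=1.0)        # large change between steps
```

A hidden state that never changes incurs no TAR penalty, while AR still penalizes it for being far from zero.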

### Optimization

Introduces **NT-AvSGD**. A few things that the paper mentioned:

* SGD shown to outperform other adaptive algos in word-level language modeling (see paper for citations)
* NT-AvSGD based on AvSGD 

### Notes on Code

#### `model.py`

Model is defined in `model.py` as `RNNModel`. 

##### Initialization

Lots of dropout and other initialization parameters, comments here.

```
ntoken:
    number of embeddings
ninp:
    embedding dimension
nhid:
    hidden layer size
wdrop: 
    weight dropout for hidden-to-hidden, passed to WeightDrop()
dropout:
    torch.nn.Dropout() layer, stored in self.drop (torch.nn.Dropout), 
    for last rnn layer output
dropouth:
    torch.nn.Dropout() layer, stored in self.hdrop (torch.nn.Dropout), 
    for rnn layers except the last layer
dropouti:
    torch.nn.Dropout() layer, stored in self.idrop (torch.nn.Dropout), 
    for embedding dropout?
dropoute:
    used in embedded_dropout() for model input dropout.
```

Summary of dropout layers used, and their associated dropout rate.

```
WeightDrop:
    w/ wdrop, applied to all RNN layers, (LSTM, GRU)

self.lockdrop:
    LockedDropout(), used to drop:
    1. embeddings, w/ dropouti (after embedded_dropout() w/ dropoute if training, else 0)
    2. output of RNNs for all layers except the last one, w/ dropouth
    3. output of last RNN layer, w/ dropout
    
self.idrop:
    w/ dropouti. Not used, self.lockdrop is used instead.
    
self.hdrop:
    w/ dropouth. Not used, self.lockdrop is used instead.
    
self.drop:
    w/ dropout. Not used
```

### Model Forward Pass

Flow of forward pass:

1. embedded dropout on input, `embedded_dropout()` with `dropoute` if in training mode, otherwise 0.
2. locked dropout on embedding, `self.lockdrop` with `dropouti`
3. pass to rnn layers (**weight drop** also happens here)
4. For rnn layers except the last one,  apply `self.lockdrop` with `dropouth` (variational dropout)
5. for **last** rnn layer output, apply `self.lockdrop` with `dropout` (AR / TAR, see above)

A few `nn.Dropout()` layers were created but never used: `idrop`, `hdrop`, `drop`.

##### Tied-Weights

This is done in `RNNModel.__init__()`.

`tie_weights` boolean parameter in `RNNModel` class. If set to `True`, the `hidden_size` (i.e. output size) for **last** RNN (LSTM, GRU) layer is set to `ninp` (here it is the **embedding dimension**), otherwise set to `nhid`.

The goal of this is to allow the embedding layer and the final softmax layer to **share weights**. See code snippet below. 

```
self.encoder = torch.nn.Embedding(num_embeddings=ntoken, embedding_dim=ninp)
self.decoder = torch.nn.Linear(nhid, ntoken)
if tie_weights:
    self.decoder.weight = self.encoder.weight
```

`nhid` is the RNN hidden layer size.

[This GitHub issue](https://github.com/salesforce/awd-lstm-lm/issues/48) flagged that `self.decoder` was **never used**. Update: the decoder is used in `main.py`.

##### Independent embedding size and hidden size

This is done when initializing the RNN layers: if `tie_weights==True` then the **first layer's input** and the **last layer's output** are set to `ninp`, which is the same as the embedding dimension.

The paper mentions that most previous LSTM language models tie the dimensionality of the **word vectors** to the dimensionality of the LSTM's **hidden state**. The **easiest reduction in total parameters** is to reduce the word vector size (for preventing overfitting).

#### `embedded_dropout()` in `embed_regularize.py`

The dropout mask is generated similarly to `LockedDropout` and applied to the embedding weights (`embed.weight`).

#### `LockedDropout` class. 

A blog [post](https://becominghuman.ai/learning-note-dropout-in-recurrent-networks-part-1-57a9c19a2307) mentioned the implementation here.

Defined in `locked_dropout.py`. This is the layer that implements RNN **variational dropout**. 

I changed the code to adapt to `pytorch 0.4`. In this version the `torch.Tensor.new()` method is gone from the docs, so I used a different way to create the mask.

This layer essentially generates a Bernoulli mask with probability `(1 - dropout)`, normalizes it by dividing the mask by `(1 - dropout)`, then applies it to the inputs: `mask * x`. This is the **inverted dropout** technique mentioned in Andrew Ng's course.

Note that when generating the mask, we create a mask with first dimension of 1, then the same dimensions as the input. Then the mask is expanded with `expand_as()` to have the same shape as the input.

This is **different** to drawing Bernoulli masks the same shape as the input: it means the **same mask** is applied to all elements along the first dimension of $x$, which is the full sequence output of a LSTM layer, which means the same mask is applied to all timesteps for this layer, achieving variational dropout.
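
The broadcasting trick described above can be illustrated without PyTorch. This is a NumPy sketch, not the repository's `LockedDropout` code: it assumes the input has shape `(seq_len, batch, hidden)` and uses NumPy broadcasting in place of `expand_as()`.

```python
import numpy as np

def locked_dropout(x, dropout, rng):
    """x has shape (seq_len, batch, hidden); one mask per example, shared over time."""
    if dropout == 0.0:
        return x
    keep = 1.0 - dropout
    # first dimension of 1, so broadcasting repeats the mask at every timestep
    mask = rng.binomial(1, keep, size=(1,) + x.shape[1:]) / keep
    return mask * x

rng = np.random.default_rng(0)
x = np.ones((6, 3, 8))  # 6 timesteps, batch of 3, 8 hidden units
out = locked_dropout(x, dropout=0.5, rng=rng)
```

Since `x` is all ones here, every timestep of `out` shows the same pattern of zeros and rescaled values (`1 / keep`), which is exactly the locked mask.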

#### `WeightDrop` class

Takes in an RNN layer, extracts `weight_hh_l0` from both `LSTM` and `GRU` for applying dropout.

See this [issue](https://github.com/salesforce/awd-lstm-lm/issues/51) I filed on `_setup()`. 

At the moment there is a hack to disable `RNNBase.flatten_parameters()`, see code comment for this, and [this github issue](https://github.com/salesforce/awd-lstm-lm/issues/7)

Weight dropout is applied in `self._setweights()`.

If using variational dropout, a new mask is generated each time `self.forward()` is called and applied to `weight_hh_l0`. The mask is normalized by dividing by `(1 - dropout)` (inverted dropout). In this branch the `training` flag passed to dropout is **always** set to `True`. **Why?**

Otherwise, `DropConnect` is used, by simply dropping some weights in `weight_hh_l0`. 

In `_setweights()`, `self.training` is referenced and passed to `torch.nn.functional.dropout()`. This variable is inherited from the parent `nn.Module` class and is set to `True` by default. 

When `RNNModel` is set to `train()` or `eval()` mode, the `WeightDrop` layers change accordingly. I.e. if `RNNModel` is set to `eval()` mode, its `WeightDrop` layers will also be set to `eval()` mode with `self.training` set to `False`.

##### Forward Pass

In the forward pass, dropout is applied first to the hidden-to-hidden layer weights of the actual RNN module by calling `self._setweights()`, then the RNN module's `forward()` function is called.

#### `main.py`

Note how model total parameters are counted here.

##### Training

1. calls `model.init_hidden(args.batch_size)`, which returns the hidden states of the model.


2. iterate through training data


3. BPTT length is drawn from one of two normal distributions $\mathcal{N}(bptt, 5)$ and $\mathcal{N}(bptt/2, 5)$. Min `seq_len=5`. To efficiently use all of the data. Also here is a [github issue](https://github.com/salesforce/awd-lstm-lm/issues/33) that talked about this, in short, the use of 2 possible normal distributions, is to **ensure different starting point for batches and still maintain efficient use of GPU** (smaller `seq_len` doesn't come too often, e.g. 5% of the time).

    ```
    # 95% chance that mean of normal distribution is the provided bptt in 
    # command line args
    bptt = args.bptt if np.random.random() < .95 else args.bptt / 2.
    
    # draw from normal distribution (mean=bptt, stdev=5)
    seq_len = max(5, int(np.random.normal(bptt, 5)))
    
    # risk of having a very high seq_len is ignored in production code
    # further looking into this, this is mitigated in utils.get_batch()
    # where the seq_len is maxed out at len(train_data) - 1 - batch_count
    # the line essentially bounds bptt with 2x stdev
    # seq_len = min(seq_len, args.bptt + 10)
    ```
    
    
4. Adjust learning rate `lr = lr * seq_len / args.bptt`. Scale `lr` based on the ratio of the drawn `seq_len` and input `args.bptt`. Paper states that this is necessary as fixed learning rate **favours shorter sequences** over longer ones. 


5. batch data returned by `get_batch(train_data, i, args, seq_len=seq_len)`. From the training data, it returns batches of size around `min(seq_len, len(train_data) - 1 - i)`.


6. detach hidden state from previous iterations, by calling `repackage_hidden()`, then `optimizer.zero_grad()`


7. Computes loss, apply Activation Regularization (AR) and Temporal Activation Regularization (TAR)


8. backprop


9. `clip_grad_norm_` for all parameters


10. `optimizer.step()`


11. `batch += 1`, `i += seq_len`
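
Steps 3 and 4 above (drawing a randomized BPTT length and rescaling the learning rate) can be sketched with the standard library only. The function names are hypothetical, and the values `base_bptt = 70` and `base_lr = 30.0` are illustrative assumptions rather than settings quoted in these notes.

```python
import random

def sample_seq_len(base_bptt, rng, short_prob=0.05, stdev=5, min_len=5):
    """Draw a BPTT length: usually around base_bptt, occasionally around half."""
    mean = base_bptt if rng.random() < 1.0 - short_prob else base_bptt / 2.0
    return max(min_len, int(rng.gauss(mean, stdev)))

def scaled_lr(base_lr, seq_len, base_bptt):
    """Rescale the learning rate by the drawn length relative to base_bptt."""
    return base_lr * seq_len / base_bptt

rng = random.Random(0)
base_bptt, base_lr = 70, 30.0  # illustrative values
lengths = [sample_seq_len(base_bptt, rng) for _ in range(1000)]
rates = [scaled_lr(base_lr, n, base_bptt) for n in lengths]
```

A batch drawn at half the base length gets half the learning rate, so the update per token stays roughly comparable across batch lengths.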


#### `SplitCrossEntropyLoss`

Calculates an approximate softmax.

A few code changes for `pytorch 0.4`

* `clip_grad_norm()` to `clip_grad_norm_()` in `main.py` and `finetune.py`
* 0-tensor access should use `tensor.item()` in `evaluate()` and `train()`
* `torch.nn.functional.log_softmax()` and `torch.nn.functional.softmax()` calls should be used with an explicit `dim=-1` (i.e. for 2D tensors this is `dim=1`, the dimension that represents the classes). See [issue fix here](https://github.com/pytorch/pytorch/issues/1020)

### TODO 

Run this model with multiple GPUs. see examples [here](https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html)

## Further developments on Language Models

As of May 2018, a few things claim to have surpassed `awd-lstm-lm`. See claimed results below.

Techniques worth looking into:

* Dynamic evaluation for sequence models, [paper](https://arxiv.org/abs/1709.07432)
* Mixtures of Softmax, [paper](https://arxiv.org/abs/1711.03953)
* Noisin, [paper](https://arxiv.org/abs/1805.01500). Improvement here seems very small as the MoS paper perplexity is 47.69...


<img src='img/penn_treebank_noisin.png' width=600/>

<a id='ar_tar'></a>
## Activation Regularization & Temporal Activation Regularization

[paper](https://arxiv.org/abs/1708.01009)

The paper finds it is more effective to apply AR to $m \odot h_t$ than applying it to neurons not updated during the current optimization step ($h_t$).

When using AR and TAR together, the authors found that the best results were achieved by decreasing $\alpha$ and $\beta$, likely because the model was over-regularized otherwise.

The authors also tested a few different values of $\alpha$ and $\beta$, from 0 to 9 each, showing that the validation set results were relatively insensitive for values larger than or equal to 3. 

Two dropout layers are used here:

* `dp` - dropout rate used on the word vectors and the final RNN output.
* `dp_h` - dropout rate used on the connections between RNN layers.
