# ML Papers

[Weight Normalization](#weight_norm)

[Highway Networks](#highway)

[DenseNet](#densenet)

[Bootstrap](#bootstrap)

[LSTM/RHN on Natural Language Modeling](#1707.05589)

[Clustering Financial Time Series](#clustering_ts)

[Temporal Convolutional Networks](#tcn)

[Layer Normalization](#layer_norm)

<a id='weight_norm'></a>
## Weight Normalization

[paper](https://arxiv.org/abs/1602.07868)

Speeds up SGD convergence by reparameterizing each weight vector so that its norm is decoupled from its direction. For a given unit:

$$ y = \phi(w \cdot x + b) $$

where $w \in \mathbb{R}^k$, $b$ is a scalar bias, $x \in \mathbb{R}^k$, we reparameterize $w$ as:

$$ w = \frac{g}{\| v \|} v $$

where $g$ is a scalar and $\| v \|$ denotes the Euclidean norm of vector $v$. We now have $\| w \| = g$, independent of $v$.

Furthermore, we can reparameterize $g$ as $g = e^s$, where $s$ is a log-scale parameter learned by SGD. However, empirically, the authors **did not** find this to be an advantage, and optimization was slightly slower.
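The reparameterization above can be checked numerically. The following is a minimal sketch in plain Python, not the paper's implementation; the helper name `weight_norm` and the example values are made up for illustration.

```python
import math

def weight_norm(v, g):
    """Return w = (g / ||v||) * v for a direction vector v and scalar gain g."""
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm_v for vi in v]

v = [3.0, 4.0]   # unnormalized direction, ||v|| = 5
g = 2.0          # desired norm of w
w = weight_norm(v, g)
norm_w = math.sqrt(sum(wi * wi for wi in w))

# Exponential parameterization: learn s and set g = exp(s).
s = math.log(2.0)
w_exp = weight_norm(v, math.exp(s))
norm_w_exp = math.sqrt(sum(wi * wi for wi in w_exp))

print(norm_w, norm_w_exp)  # both equal g up to floating-point error
```

Rescaling `v` leaves `w` unchanged, which is the sense in which the norm and the direction are decoupled.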

### Comparison to Batch Norm

In special cases weight norm is the same as batch norm, see paper section 2.2 for detail.

For CNNs, weight normalization is often much faster computationally; it is also non-stochastic and not affected by batch size. It can be viewed as a cheaper and less noisy approximation to batch norm. The equivalence does not hold for deeper architectures.

### Data-Dependent Initialization of Parameters

Proper initialization of the parameters is important. The authors propose sampling the elements of $v$ from a simple distribution with fixed scale, such as a normal distribution with zero mean and standard deviation 0.05, and then using an initial forward pass over a single minibatch to set $g$ and $b$ so that the pre-activations start out normalized.

This only works where batch norm is applicable; for RNNs and LSTMs, we need to resort to standard initialization methods.
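As a rough sketch of the data-dependent step, as I read the paper: compute $t = v \cdot x / \| v \|$ over one minibatch, then set $g = 1/\sigma[t]$ and $b = -\mu[t]/\sigma[t]$. The function name, the toy minibatch and the single-unit setting below are assumptions for illustration only.

```python
import math
import random

def data_dependent_init(batch, rng, v_std=0.05):
    """Initialize (v, g, b) for one unit from a single minibatch of inputs."""
    k = len(batch[0])
    v = [rng.gauss(0.0, v_std) for _ in range(k)]          # v ~ N(0, 0.05^2)
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    # Pre-activations with g = 1 and b = 0.
    t = [sum(vi * xi for vi, xi in zip(v, x)) / norm_v for x in batch]
    mu = sum(t) / len(t)
    sigma = math.sqrt(sum((ti - mu) ** 2 for ti in t) / len(t))
    return v, 1.0 / sigma, -mu / sigma                     # v, g, b

rng = random.Random(0)
batch = [[rng.gauss(3.0, 2.0) for _ in range(4)] for _ in range(64)]  # toy data
v, g, b = data_dependent_init(batch, rng)

norm_v = math.sqrt(sum(vi * vi for vi in v))
pre = [g * sum(vi * xi for vi, xi in zip(v, x)) / norm_v + b for x in batch]
pre_mean = sum(pre) / len(pre)
pre_std = math.sqrt(sum((p - pre_mean) ** 2 for p in pre) / len(pre))
print(pre_mean, pre_std)  # approximately 0 and 1 on this minibatch
```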

<a id='highway'></a>
## Highway Networks

[Summary Paper](https://arxiv.org/abs/1505.00387)

[Full Paper](https://arxiv.org/abs/1507.06228)


Highway networks enable the optimization of networks with virtually arbitrary depth. This is accomplished through a **learned gating mechanism** for regulating information flow, which is inspired by LSTM.

The paper shows that the optimization of highway networks is virtually independent of depth. The architecture was used to train a 900-layer network.

A plain feedforward network of $L$ layers with $H$ as a non-linear transform function, ignoring layer index:

$$ y = H(x, W_H) $$

For a highway network, the paper adds two non-linear transforms:
* Transform gate, $T(x, W_T)$
* Carry gate, $C(x, W_C)$

$$ y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C) $$

The paper sets $C = 1 - T$

The **dimensionality** of $x$, $y$, $H(x, W_H)$ and $T(x, W_T)$ must be the same for the equation above to hold.

If the size of the representation needs to change, there are two options:

1. Replace $x$ with $\tilde{x}$ obtained by suitably sub-sampling or zero-padding $x$.
2. Use a plain layer without highway to change dimensionality and then continue with stacking highway layers. This is what the paper used.



**Transform gate** is defined as:

$$
\begin{aligned}
T(x) &= \sigma(W_T^T x + b_T) \\
\sigma(x) &= \frac{1}{1 + e^{-x}}, x \in \mathbb{R}
\end{aligned}
$$

$b_T$ can be initialized with a negative number (e.g. -1, -3, etc.) so that the network is initially biased towards **carry** behaviour. The paper found that during training $b_T$ actually became more negative. This behaviour suggests that the strong negative biases at low depths are not used to shut down the gates, but to make them more selective.

$W_H$ can be initialized with various zero-mean distributions.
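A single highway layer with $C = 1 - T$ can be sketched in plain Python as below. This is an illustration under assumptions, not the paper's code: `tanh` is an arbitrary choice for $H$, the weight matrices are stored row-wise (one row per output unit), and the extreme gate biases are chosen only to make the two regimes obvious.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a row-wise weight matrix by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H(x) * T(x) + x * (1 - T(x)), element-wise, with C = 1 - T."""
    h = [math.tanh(a + b) for a, b in zip(matvec(W_H, x), b_H)]  # transform H
    t = [sigmoid(a + b) for a, b in zip(matvec(W_T, x), b_T)]    # transform gate T
    return [hi * ti + xi * (1.0 - ti) for hi, ti, xi in zip(h, t, x)]

x = [0.5, -1.0, 2.0]
W = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]
zeros = [0.0, 0.0, 0.0]

y_carry = highway_layer(x, W, zeros, W, [-20.0] * 3)     # T ~ 0: y ~ x
y_transform = highway_layer(x, W, zeros, W, [20.0] * 3)  # T ~ 1: y ~ H(x)
print(y_carry, y_transform)
```

With a strongly negative $b_T$ the layer passes its input through almost unchanged, which is why stacking many such layers is still easy to optimize at initialization.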

TODO
<a id='densenet'></a>
## Densely Connected Convolutional Networks

<a id='bootstrap'></a>
# Bootstrap

**My current thinking** is that perhaps it’s best to look at multiple measures; large discrepancies would point to something odd. Otherwise, the differences between the methods are mostly minor for practical use, particularly compared to other market uncertainties.

Efron: **bias-corrected and accelerated bootstrap intervals (BCa)** give better estimates of confidence intervals under certain assumptions; see the references below.
 
[Paper](https://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ss/1032280214)

[Slides](https://faculty.washington.edu/heagerty/Courses/b572/public/GregImholte-1.pdf)

[Stack Exchange Answer](https://stats.stackexchange.com/questions/19340/bootstrap-based-confidence-interval)

## `arch`
Python package `arch` has studentized/bias-corrected intervals for bootstrap results. See [doc](http://arch.readthedocs.io/en/latest/bootstrap/confidence-intervals.html#bias-corrected-and-accelerated-bca)

Discovered a minor bug in `conf_int(method='bca')`, see [issue](https://github.com/bashtage/arch/issues/193)


## `scikits-bootstrap`

Another implementation, see comments in [code](https://github.com/cgevans/scikits-bootstrap/blob/master/scikits/bootstrap/bootstrap.py)


## Block Bootstrap

**Politis, White, 2004. Automatic Block-Length Selection for the Dependent Bootstrap**

Code in R and Matlab can be found on this [page](http://public.econ.duke.edu/~ap172/) by Prof. Andrew Patton, as well as a correction in 2009 to the original paper.

Main points from the paper:

**Stationary Bootstrap (SB)** is less accurate than **Circular Block Bootstrap (CB)** for estimating $\sigma^2_{\infty}$. Although they have similar bias the SB has higher variance due to the additional randomization involved in drawing the random block size.

SB is **less sensitive** to block size mis-specification compared to CB and/or the moving block bootstrap. 
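To make the difference between the two schemes concrete, here is a minimal sketch of how each one draws resampling indices. It is not taken from the paper's R/Matlab code or from `arch`; the function names are made up, and block-length selection (the subject of the paper) is left out.

```python
import random

def circular_block_indices(n, block_size, rng):
    """One circular block bootstrap resample: fixed-length blocks, wrapping at the end."""
    idx = []
    while len(idx) < n:
        start = rng.randrange(n)                                  # uniform block start
        idx.extend((start + j) % n for j in range(block_size))    # wrap around
    return idx[:n]

def stationary_indices(n, mean_block_size, rng):
    """One stationary bootstrap resample: block lengths are geometric with the given mean."""
    p = 1.0 / mean_block_size          # probability of starting a new block
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < p:
            idx.append(rng.randrange(n))      # start a new block
        else:
            idx.append((idx[-1] + 1) % n)     # continue the current block
    return idx

rng = random.Random(0)
cb = circular_block_indices(10, 3, rng)
sb = stationary_indices(10, 3, rng)
print(cb, sb)
```

The extra coin flip at every step of `stationary_indices` is the additional randomization that the paper links to the higher variance of SB.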

<a id='1707.05589'></a>

# On the State of the Art of Evaluation in Neural Language Models

[paper](https://arxiv.org/abs/1707.05589)

This paper tuned LSTM and Recurrent Highway Networks to beat many more complex models in natural language modeling. A lot of dropout techniques are used here. 

Instead of comparing models based on the number of hidden units, comparison is done based on **total number of trainable parameters.**

Dropout:
* input dropout
* intra-layer dropout
* down-projected outputs / output dropout

Things to learn about:

**Variational Dropout**: Gal and Ghahramani 2016, [link](https://arxiv.org/abs/1512.05287)

**Recurrent Dropout**: Semeniuta et al. 2016. [link](https://arxiv.org/abs/1603.05118)

**Using mean-field approximation for dropout at test time**. 

**Truncated backpropagation**: Training used Adam/batch size 64/truncated backprop performed with 50 time steps.

Hyperparameter tuning was performed using a black-box tuner based on **batched GP bandits** (Desautels et al. 2014)

<a id='clustering_ts'></a>
## Clustering Financial Time Series: how long is enough?

[paper](https://arxiv.org/abs/1603.04017)

Looked at Hierarchical Correlation Block Model (HCBM) with single-linkage, complete-linkage, average-linkage, Ward, McQuitty, Median, Centroid algos.

Spearman rank correlation is more robust than Pearson correlation when:
* there is noise
* variables have infinite second moment.
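A toy illustration of the robustness claim, using hand-rolled correlation functions and made-up data (a single extreme value standing in for a heavy tail); this is not the paper's experiment.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n; assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [float(i) for i in range(1, 11)]
y = list(x)
y[-1] = 1000.0   # one extreme observation; the relationship stays monotone

print(pearson(x, y), spearman(x, y))
```

The extreme value drags the Pearson estimate well below 1, while the rank-based estimate is unaffected because the ordering has not changed.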

Their conclusion was that the **Ward** method converges faster than others such as single/average-linkage: it converges at around 250 observations for 256 assets with a correlation matrix similar to Figure 2 in the paper.

<a id='tcn'></a>

## Temporal Convolutional Networks (TCN)

[paper](https://arxiv.org/abs/1803.01271), code available on [github](https://github.com/locuslab/TCN). Here is a description of its basic architecture.

Two principles: 

* the network produces an output of the same length as the input
* there can be no leakage from the future into the past

TCN uses a **1-D fully-convolutional network (FCN) + causal convolutions**, meaning an output at time `t` is convolved only with elements from time `t` and earlier in the previous layer.

* Each hidden layer is the **same length as the input layer**.
* Zero padding of length `k-1`, where `k` is the kernel size, keeps subsequent layers the same length as the previous ones.

In `tcn.py`, `Chomp1d()` is essentially a layer that chops off the trailing outputs produced by the extra padding on the right-hand side, which ensures **causal convolution**.
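The causal property can be illustrated without PyTorch. The sketch below pads only on the left, which gives the same result as padding both sides and chopping the right end; `causal_conv1d` is a made-up helper operating on a single channel, not a function from the TCN repository.

```python
def causal_conv1d(x, w, dilation=1):
    """y[t] = sum_j w[j] * x[t - (k - 1 - j) * dilation], with zeros before the start."""
    k = len(w)
    pad = (k - 1) * dilation
    xp = [0.0] * pad + list(x)    # left padding only
    return [sum(w[j] * xp[t + j * dilation] for j in range(k))
            for t in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 1.0]                    # kernel size k = 2

y1 = causal_conv1d(x, w, dilation=1)   # y[t] = x[t-1] + x[t]
y2 = causal_conv1d(x, w, dilation=2)   # y[t] = x[t-2] + x[t]

# Changing the last input must not affect any earlier output.
x_future = x[:-1] + [100.0]
y1_future = causal_conv1d(x_future, w, dilation=1)
print(y1, y2, y1_future)
```

The output has the same length as the input, and only the final output changes when the final input is perturbed.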

**Dilated convolution** is used here: the first layer has dilation $d = 1$ (i.e. no dilation), and the dilation increases **exponentially** with the depth of the network (i.e. $d = \mathcal{O}(2^i)$ where $i$ is the layer index).

* **Same padding** size for layer `i`: `padding = (k - 1) * dilation_size`.
* **Larger dilation** enables an output at the top level to represent a **wider** range of inputs, thus increasing the **receptive field** of the conv net.
* 3 ways to **increase** the receptive field:
    1. larger filter size `k`
    2. larger dilation, set `d=2**i` for layer `i`
    3. more layers; in the code the number of layers is set to the length of the `num_channels` list.

Exponentially increasing dilation ensures that:

1. some filter hits each input within the effective history,
2. an extremely large effective history is possible with a deep network.
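The effect of `k`, the dilation and the depth on the receptive field can be worked out with a few lines of arithmetic. This is a sketch assuming dilation `2**i` at layer `i` and that each causal convolution extends the history by `(k - 1) * d` steps; the helper names are made up and `convs_per_layer` is a hypothetical knob for blocks that stack more than one convolution.

```python
def causal_padding(kernel_size, dilation):
    """Left padding that keeps a causal convolution's output the same length as its input."""
    return (kernel_size - 1) * dilation

def receptive_field(kernel_size, num_layers, convs_per_layer=1):
    """Number of input steps seen by one output, with dilation 2**i at layer i."""
    dilations = [2 ** i for i in range(num_layers)]
    return 1 + convs_per_layer * sum(causal_padding(kernel_size, d) for d in dilations)

print(receptive_field(kernel_size=2, num_layers=3))                     # 8
print(receptive_field(kernel_size=3, num_layers=4))                     # 31
print(receptive_field(kernel_size=3, num_layers=4, convs_per_layer=2))  # 61
```

The receptive field grows linearly in `k` but exponentially in the number of layers, which is why depth with exponential dilation is the cheapest way to get a long effective history.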

<img src='./img/tcn.png' width='800'/>

### `TemporalBlock` in code

**Residual connection** like ResNet, with weight normalization. Unlike ResNet, the input and output of a TCN block can have different widths (numbers of channels), so the authors use a 1x1 convolution to ensure that the **element-wise** addition receives tensors of the same shape (figure b, c above).

**Weight normalization** [above](#weight_norm) from [Salimans & Kingma 2016](https://arxiv.org/abs/1602.07868). Pytorch: `torch.nn.utils.weight_norm()` [docs](http://pytorch.org/docs/master/nn.html#weight-norm)

**Spatial dropout** from [Srivastava et al 2014](http://jmlr.org/papers/volume15/srivastava14a.old/srivastava14a.pdf). Pytorch: `torch.nn.Dropout2d()` [docs](http://pytorch.org/docs/master/nn.html#torch.nn.Dropout2d)

Authors demonstrated TCN has longer memory than LSTM and GRU in 2 tasks.

#### Construction

Input shape is `(N, C, L)`, where:
* `N` is number of samples
* `C` is number of channels, or number of feature dimensions
* `L` is sequence length of all channels/features.

For the `Conv1d` layers, `in_channels` and `out_channels` are set to the **number of channels** of the input and output, respectively.

In the paper, padding is done such that input and output sequence lengths match.

`Chomp1d` applied after `Conv1d` to ensure causal convolution, `chomp_size = padding`.

#### Architecture

Forward flow is as follows: 

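```
# x: block input of shape (N, input_channels, L)
z = conv1d -> chomp1d(padding) -> relu -> dropout -> conv1d -> chomp1d(padding) -> relu -> dropout

if input_channels != output_channels:
    # 1x1 convolution maps x to output_channels so the shapes match z
    downsample = conv1d(input_channels, output_channels, kernel_size=1)
    residual = downsample(x)
else:
    residual = x

# the residual connection is applied in both cases
out = relu(z + residual)

return out
```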

In [1]:
import torch
import torch.nn as nn
from torch.autograd import Variable

In [3]:
# shape is (N, in_channels, L)
# N: number of samples
# in_channels: or number of features
# L: sequence length
x = torch.randn(2, 1, 5)
x.shape

torch.Size([2, 1, 5])

In [4]:
x


(0 ,.,.) = 
  0.2089 -0.2274  0.0779 -0.8448 -0.0247

(1 ,.,.) = 
  0.3559 -0.5786 -0.2676 -1.2133  2.0995
[torch.FloatTensor of size 2x1x5]

In [14]:
padding = 2

In [5]:
m = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3, padding=padding)

In [6]:
y = m(Variable(x, requires_grad=False))

In [7]:
y.shape

torch.Size([2, 4, 7])

In [8]:
# Conv1d pads both ends of x by 2, so the output length is 5 + 2*2 - (3 - 1) = 7
y

Variable containing:
(0 ,.,.) = 
  0.3220  0.2502  0.4168  0.3306  0.6797  0.3565  0.3297
  0.2908  0.3632  0.1106  0.3824 -0.0922  0.6123  0.2945
  0.0077 -0.2472  0.0814 -0.5939  0.1302 -0.4114 -0.1000
 -0.1645 -0.4203 -0.2044 -0.4503  0.1377  0.2141 -0.2240

(1 ,.,.) = 
  0.3170  0.2012  0.5718  0.4940  0.7666 -0.5190  0.2869
  0.2952  0.4129 -0.1018  0.3707 -0.0417  1.6270 -0.5545
  0.0768 -0.4474  0.0599 -0.8202  1.0825 -1.0621  0.7218
 -0.1136 -0.6156 -0.2242 -0.2223  1.2371 -0.6594 -1.3268
[torch.FloatTensor of size 2x4x7]

In [9]:
m.weight

Parameter containing:
(0 ,.,.) = 
 -0.0202 -0.4157 -0.0345

(1 ,.,.) = 
 -0.3997  0.4084  0.0296

(2 ,.,.) = 
  0.3869 -0.2392  0.4698

(3 ,.,.) = 
 -0.5192 -0.5013  0.3462
[torch.FloatTensor of size 4x1x3]

In [10]:
m.weight.data[0,0,2]

-0.034489214420318604

In [11]:
m.bias

Parameter containing:
 0.3292
 0.2847
-0.0904
-0.2368
[torch.FloatTensor of size 4]

In [12]:
torch.sum(torch.mul(x[0, 0, :1], m.weight.data[0, 0, 2])) + m.bias[0]

Variable containing:
 0.3220
[torch.FloatTensor of size 1]

In [13]:
y.data[0, 0, 0]

0.3220287561416626

`Chomp1d()` in the TCN code chops off the padding on the right to ensure causal convolution.

See example below. 

In [2]:
class Chomp1d(nn.Module):

    def __init__(self, chomp_size):
        super(Chomp1d, self).__init__()
        self.chomp_size = chomp_size

    def forward(self, x):
        return x[:, :, :-self.chomp_size].contiguous()

In [15]:
chomp = Chomp1d(padding)

In [16]:
# last two columns of y chopped off.
chomp(y)

Variable containing:
(0 ,.,.) = 
  0.3220  0.2502  0.4168  0.3306  0.6797
  0.2908  0.3632  0.1106  0.3824 -0.0922
  0.0077 -0.2472  0.0814 -0.5939  0.1302
 -0.1645 -0.4203 -0.2044 -0.4503  0.1377

(1 ,.,.) = 
  0.3170  0.2012  0.5718  0.4940  0.7666
  0.2952  0.4129 -0.1018  0.3707 -0.0417
  0.0768 -0.4474  0.0599 -0.8202  1.0825
 -0.1136 -0.6156 -0.2242 -0.2223  1.2371
[torch.FloatTensor of size 2x4x5]

<a id='layer_norm'></a>

# Layer Normalization

[2016 Paper](https://arxiv.org/abs/1607.06450)

## Notations

Given the $l^{th}$ hidden layer: 

* $w^l$ are the weights, and $w^l_i$ is the incoming weight vector for the $i^{th}$ hidden unit,
* $h^l$ is the bottom-up input,
* $b^l$ is the bias,
* $a^l$ is the vector of summed inputs to the neurons in that layer:

$$ a^l_i = {w^l_i}^\top h^l $$
$$ h^{l+1}_i = f(a^l_i + b^l_i) $$

Where $f()$ is an element-wise non-linear function. 

For [Batch Norm](./andrew_ng/DeepLearning.ai.ipynb#batchnorm), the normalization statistics would ideally be computed over the **entire** training set, but for computational reasons they are estimated from the current minibatch. This puts constraints on the size of a minibatch, and it is hard to apply to RNNs.

## Layer Norm

Because changes in the output of one layer tend to cause highly correlated changes in the summed inputs to the next layer, especially with ReLU units whose outputs can change by a lot, the paper fixes the mean and variance of the summed inputs within each layer. **Layer norm** statistics are computed over all the hidden units in the same layer as follows:

* $H$ number of hidden units in a layer

$$
\begin{aligned}
\mu^l &= \frac{1}{H} \sum^H_{i=1} a^l_i \\
\sigma^l &= \sqrt{\frac{1}{H}\sum^H_{i=1}\big( a^l_i - \mu^l\big)^2}
\end{aligned}
$$

Normalization happens within the hidden layer: all hidden units in the same layer share the same $\mu$ and $\sigma$, but different training cases have different normalization terms.

**Unlike** batch norm, layer norm:

* does not impose any constraint on the size of minibatch
* can be used in a pure online regime with batch size 1
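The per-example statistics above can be sketched in plain Python. This is an illustration rather than the paper's implementation: the small `eps` added to $\sigma$ is an assumption for numerical safety (it does not appear in the formulas above), and the default gain of 1 and bias of 0 are placeholders for learned parameters.

```python
import math

def layer_norm(a, gain=None, bias=None, eps=1e-5):
    """Normalize one example's summed inputs over its H hidden units."""
    H = len(a)
    gain = gain if gain is not None else [1.0] * H
    bias = bias if bias is not None else [0.0] * H
    mu = sum(a) / H
    sigma = math.sqrt(sum((ai - mu) ** 2 for ai in a) / H)
    # eps is an added assumption to avoid division by zero
    return [g * (ai - mu) / (sigma + eps) + b for ai, g, b in zip(a, gain, bias)]

# Two training cases; each one gets its own mu and sigma.
batch = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
normalized = [layer_norm(row) for row in batch]
print(normalized)
```

Each row is normalized independently of the other, so the result does not depend on the batch size and the same function works with a batch of one.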

For RNN, layer norm works as follows:

$$
\begin{aligned}
a^t &= W_{hh} h^{t-1} + W_{xh} x^t \\
h^t &= f \bigg[ \frac{g}{\sigma^t} \odot \big( a^t - \mu^t \big) + b \bigg] \\
\mu^t &= \frac{1}{H} \sum^H_{i=1} a^t_i \\
\sigma^t &= \sqrt{\frac{1}{H}\sum^H_{i=1}\big( a^t_i - \mu^t\big)^2}
\end{aligned}
$$

Where:

* $b$ is bias parameter, same dimension as $h^t$
* $g$ is gain parameter, same dimension as $h^t$

Layer norm for RNN results in much more stable hidden-to-hidden gradient dynamics (vs. exploding/vanishing gradients).

Section 4 of the paper discusses related work including weight norm.

<img src='./img/layer_norm_comp.png' width=800/>