# ML Papers

[Highway Networks](#highway)

[DenseNet](#densenet)

[Bootstrap](#bootstrap)

[LSTM/RHN on Natural Language Modeling](#1707.05589)

[Clustering Financial Time Series](#clustering_ts)

## Weight Normalization

[paper](https://arxiv.org/abs/1602.07868)

Speeds up SGD convergence by reparameterizing each weight vector so that its norm is decoupled from its direction. For a single neuron:

$$ y = \phi(w \cdot x + b) $$

where $w \in \mathbb{R}^k$, $b$ is a scalar bias, and $x \in \mathbb{R}^k$, we reparameterize $w$ as:

$$ w = \frac{g}{\| v \|} v $$

where $g$ is a scalar and $\| v \|$ denotes the Euclidean norm of the vector $v$. We now have $\| w \| = g$, independent of $v$.

Furthermore, we can reparameterize $g$ as $g = e^s$, where $s$ is a log-scale parameter learned by SGD. However, empirically, the authors **did not** find this to be an advantage, and optimization was slightly slower.
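A minimal sketch of the reparameterization above, in plain Python. The function names `weight_norm` and `neuron` and the choice of `tanh` for $\phi$ are assumptions for illustration, not taken from the paper.

```python
import math

def weight_norm(v, g):
    """Return w = (g / ||v||) * v for a direction vector v and scalar gain g."""
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm_v for vi in v]

def neuron(x, v, g, b, phi=math.tanh):
    """y = phi(w . x + b), with w built from (v, g) as above."""
    w = weight_norm(v, g)
    return phi(sum(wi * xi for wi, xi in zip(w, x)) + b)

v = [3.0, 4.0]                      # ||v|| = 5
w = weight_norm(v, g=2.0)           # direction of v, length g
norm_w = math.sqrt(sum(wi * wi for wi in w))
```

Here `norm_w` equals `g` up to floating point, and rescaling `v` (e.g. to `[30.0, 40.0]`) leaves `w` unchanged, which is the decoupling of norm and direction.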

### Comparison to Batch Norm

In special cases weight norm is the same as batch norm; see section 2.2 of the paper for details.

For CNNs, weight normalization is often much cheaper computationally. It is also non-stochastic and not affected by batch size. It can be viewed as a cheaper and less noisy approximation to batch norm. The equivalence does not hold for deeper architectures.

### Data-Dependent Initialization of Parameters

It is important to initialize the parameters properly. The authors propose sampling the elements of $v$ from a simple distribution with fixed scale, such as a normal distribution with zero mean and standard deviation of 0.05.

This only works where batch norm is applicable; for RNNs and LSTMs, one needs to resort to standard initialization methods.
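As I read the paper, after sampling $v$ the scheme runs a single minibatch through the layer and sets $g$ and $b$ so that the pre-activations have zero mean and unit variance over that batch: with $t = v \cdot x / \| v \|$, set $g = 1 / \sigma[t]$ and $b = -\mu[t] / \sigma[t]$. A rough single-neuron sketch under that reading (the helper names are hypothetical):

```python
import math
import random

def data_dependent_init(batch, k, rng, std=0.05):
    """Sample v ~ N(0, std^2), then pick g and b from one minibatch of inputs."""
    v = [rng.gauss(0.0, std) for _ in range(k)]
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    # t is the pre-activation with g = 1 and b = 0.
    t = [sum(vi * xi for vi, xi in zip(v, x)) / norm_v for x in batch]
    mu = sum(t) / len(t)
    sigma = math.sqrt(sum((ti - mu) ** 2 for ti in t) / len(t))
    return v, 1.0 / sigma, -mu / sigma

def pre_activations(batch, v, g, b):
    """w . x + b for each x in the batch, with w = (g / ||v||) * v."""
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    w = [g * vi / norm_v for vi in v]
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in batch]

rng = random.Random(0)
batch = [[rng.gauss(1.0, 2.0) for _ in range(8)] for _ in range(64)]
v, g, b = data_dependent_init(batch, k=8, rng=rng)
pre = pre_activations(batch, v, g, b)
```

After this initialization the values in `pre` are standardized over the batch, which is the same effect batch norm would have on its first minibatch.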

<a id='highway'></a>
## Highway Networks

[Summary Paper](https://arxiv.org/abs/1505.00387)

[Full Paper](https://arxiv.org/abs/1507.06228)


Highway networks enable the optimization of networks with virtually arbitrary depth. This is accomplished through a **learned gating mechanism** for regulating information flow, which is inspired by LSTM.

The paper shows that the optimization of highway networks is virtually independent of depth. The approach was used to train a 900-layer network.

A plain feedforward network of $L$ layers with $H$ as a non-linear transform function, ignoring layer index:

$$ y = H(x, W_H) $$

For a highway network, the paper adds two non-linear transforms:
* Transform gate, $T(x, W_T)$
* Carry gate, $C(x, W_C)$

$$ y = H(x, W_H) \cdot T(x, W_T) + x \cdot C(x, W_C) $$

The paper sets $C = 1 - T$

The **dimensionality** of $x$, $y$, $H(x, W_H)$ and $T(x, W_T)$ must be the same for the equation above to hold.

If the size of the representation needs to be changed, there are two ways:

1. Replace $x$ with $\tilde{x}$, obtained by suitably sub-sampling or zero-padding $x$.
2. Use a plain layer without highway connections to change the dimensionality, and then continue stacking highway layers. This is what the paper used.



**Transform gate** is defined as:

$$
\begin{aligned}
T(x) &= \sigma(W_T^T x + b_T) \\
\sigma(x) &= \frac{1}{1 + e^{-x}}, x \in \mathbb{R}
\end{aligned}
$$

$b_T$ can be initialized with a negative number (e.g. -1, -3, etc.) so that the network is initially biased towards **carry** behaviour. The paper found that during training $b_T$ actually became more negative. This behaviour suggests that the strong negative biases at low depths are not used to shut down the gates, but to make them more selective.

$W_H$ can be initialized with various zero-mean distributions.
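The highway equation with $C = 1 - T$ and the sigmoid transform gate can be sketched as a single layer in plain Python. The use of `tanh` for $H$, the helper names, and the extreme gate biases of $\pm 20$ are assumptions chosen only to make the two limiting behaviours visible; the notes above suggest much milder values such as -1 or -3 for actual initialization.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, b):
    """W x + b for a matrix W stored as a list of rows."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = H * T + x * (1 - T), with H = tanh(W_H x + b_H), T = sigmoid(W_T x + b_T)."""
    h = [math.tanh(a) for a in affine(W_H, x, b_H)]
    t = [sigmoid(a) for a in affine(W_T, x, b_T)]
    return [hi * ti + xi * (1.0 - ti) for hi, ti, xi in zip(h, t, x)]

rng = random.Random(0)
n = 4  # x, y, H and T must all share this dimensionality
W_H = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(n)]
W_T = [[rng.gauss(0.0, 0.1) for _ in range(n)] for _ in range(n)]
b_H = [0.0] * n
x = [0.5, -1.0, 2.0, 0.0]

y_carry = highway_layer(x, W_H, b_H, W_T, [-20.0] * n)      # T near 0: y is close to x
y_transform = highway_layer(x, W_H, b_H, W_T, [20.0] * n)   # T near 1: y is close to H(x)
```

With a strongly negative $b_T$ the layer behaves almost like the identity, which is why a negative initial bias lets gradients pass through many stacked layers early in training.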

TODO
<a id='densenet'></a>
## Densely Connected Convolutional Networks

<a id='bootstrap'></a>
# Bootstrap

**My current thinking** is that perhaps it’s best to look at multiple measures, since large discrepancies would point to something odd. Otherwise, the differences between the methods are mostly minor for practical use, particularly compared to other market uncertainties.


Efron: **Accelerated Bootstrap intervals (BCa)** give better estimates of confidence intervals under certain assumptions, see below.
 
[Paper](https://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ss/1032280214)

[Slides](https://faculty.washington.edu/heagerty/Courses/b572/public/GregImholte-1.pdf)

[Stack Exchange Answer](https://stats.stackexchange.com/questions/19340/bootstrap-based-confidence-interval)
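A minimal standard-library sketch of how a BCa interval can be computed, following the usual textbook recipe (bias correction $z_0$ from the bootstrap distribution, acceleration $a$ from the jackknife). The function name `bca_interval`, the default resample count, and the way quantile indices are rounded are assumptions of this sketch; the packages discussed in these notes handle edge cases more carefully.

```python
import random
from statistics import NormalDist, mean

def bca_interval(data, stat, n_boot=2000, alpha=0.05, seed=0):
    """Return (lower, upper) BCa confidence bounds for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = stat(data)
    boot = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    nd = NormalDist()

    # Bias correction z0: from the fraction of bootstrap statistics below theta_hat.
    frac_below = sum(b < theta_hat for b in boot) / n_boot
    frac_below = min(max(frac_below, 1.0 / n_boot), 1.0 - 1.0 / n_boot)  # keep inv_cdf finite
    z0 = nd.inv_cdf(frac_below)

    # Acceleration a: skewness of the leave-one-out (jackknife) estimates.
    jack = [stat(data[:i] + data[i + 1:]) for i in range(n)]
    jack_mean = mean(jack)
    num = sum((jack_mean - j) ** 3 for j in jack)
    den = 6.0 * sum((jack_mean - j) ** 2 for j in jack) ** 1.5
    a = num / den if den > 0 else 0.0

    def adjusted_quantile(level):
        z = nd.inv_cdf(level)
        p = nd.cdf(z0 + (z0 + z) / (1.0 - a * (z0 + z)))
        idx = min(max(int(p * n_boot), 0), n_boot - 1)
        return boot[idx]

    return adjusted_quantile(alpha / 2), adjusted_quantile(1 - alpha / 2)

rng = random.Random(1)
data = [rng.expovariate(1.0) for _ in range(50)]   # skewed sample
lo, hi = bca_interval(data, mean)
```

With $z_0 = 0$ and $a = 0$ the adjusted levels reduce to $\alpha/2$ and $1 - \alpha/2$, i.e. the plain percentile interval, which is one way to see what the two corrections add.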

## `arch`
Python package `arch` has studentized/bias-corrected intervals for bootstrap results. See [doc](http://arch.readthedocs.io/en/latest/bootstrap/confidence-intervals.html#bias-corrected-and-accelerated-bca)

Discovered a minor bug in `conf_int(method='bca')`, see [issue](https://github.com/bashtage/arch/issues/193)


## `scikits-bootstrap`

Another implementation, see comments in [code](https://github.com/cgevans/scikits-bootstrap/blob/master/scikits/bootstrap/bootstrap.py)


## Block Bootstrap

**Politis, White, 2004. Automatic Block-Length Selection for the Dependent Bootstrap**

Code in R and Matlab can be found on this [page](http://public.econ.duke.edu/~ap172/) by Prof. Andrew Patton, as well as a correction in 2009 to the original paper.

Main points from the paper:

**Stationary Bootstrap (SB)** is less accurate than **Circular Block Bootstrap (CB)** for estimating $\sigma^2_{\infty}$. Although they have similar bias, the SB has higher variance due to the additional randomization involved in drawing the random block size.

SB is **less sensitive** to block size mis-specification compared to CB and/or the moving block bootstrap. 
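The difference between the two schemes is easiest to see in how the resampled indices are drawn. A rough sketch (function names are hypothetical, and block-length selection itself, the subject of the paper, is not shown): CB uses fixed-length blocks that wrap around the end of the series, while SB draws geometric block lengths with a given mean.

```python
import random

def circular_block_indices(n, block_size, rng):
    """Circular block bootstrap: fixed-length blocks, wrapping around the end."""
    idx = []
    while len(idx) < n:
        start = rng.randrange(n)
        idx.extend((start + j) % n for j in range(block_size))
    return idx[:n]

def stationary_bootstrap_indices(n, mean_block_size, rng):
    """Stationary bootstrap: block lengths are geometric with the given mean."""
    p = 1.0 / mean_block_size
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < p:
            idx.append(rng.randrange(n))       # start a new block at a random position
        else:
            idx.append((idx[-1] + 1) % n)      # continue the current block
    return idx

rng = random.Random(0)
cb = circular_block_indices(20, block_size=5, rng=rng)
sb = stationary_bootstrap_indices(20, mean_block_size=5, rng=rng)
```

A resampled series is then `[series[i] for i in cb]` (or `sb`). The extra coin flip per step in SB is the "additional randomization" that raises its variance.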

<a id='1707.05589'></a>

# On the State of the Art of Evaluation in Neural Language Models

[paper](https://arxiv.org/abs/1707.05589)

This paper tuned LSTM and Recurrent Highway Networks to beat many more complex models in natural language modeling. A lot of dropout techniques are used here. 

Instead of comparing models based on the number of hidden units, comparison is done based on **total number of trainable parameters.**
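To illustrate what comparing on a parameter budget means, here is a small sketch that counts the parameters of one standard LSTM layer and finds the hidden size that fits a budget. It deliberately ignores embeddings, the softmax layer, and the down-projection used in the paper, and the function names are made up for this example.

```python
def lstm_param_count(input_size, hidden_size):
    """One standard LSTM layer: 4 gates, each with input weights,
    recurrent weights, and a bias vector."""
    return 4 * (hidden_size * (input_size + hidden_size) + hidden_size)

def hidden_size_for_budget(input_size, budget):
    """Largest hidden size whose single LSTM layer fits within `budget` parameters."""
    h = 0
    while lstm_param_count(input_size, h + 1) <= budget:
        h += 1
    return h

params = lstm_param_count(100, 200)
fitted = hidden_size_for_budget(100, params)
```

Because the count grows quadratically in the hidden size, two architectures with the same number of hidden units can differ a lot in capacity, which is the motivation for fixing the budget instead.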

Dropout:
* input dropout
* intra-layer dropout
* down-projected outputs / output dropout

Things to learn about:

**Variational Dropout**: Gal and Ghahramani 2016, [link](https://arxiv.org/abs/1512.05287)

**Recurrent Dropout**: Semeniuta et al. 2016. [link](https://arxiv.org/abs/1603.05118)

**Using mean-field approximation for dropout at test time**. 

**Truncated backpropagation**: Training used Adam/batch size 64/truncated backprop performed with 50 time steps.

Hyperparameter tuning was performed using a black-box tuner based on **batched GP bandits** (Desautels et al. 2014)

<a id='clustering_ts'></a>
## Clustering Financial Time Series: how long is enough?

[paper](https://arxiv.org/abs/1603.04017)

Looked at Hierarchical Correlation Block Model (HCBM) with single-linkage, complete-linkage, average-linkage, Ward, McQuitty, Median, Centroid algos.

Spearman rank correlation is more robust than Pearson correlation when:
* the data are noisy
* the variables have infinite second moments.
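A toy illustration of the robustness claim: Spearman correlation is Pearson correlation applied to ranks, so a single extreme value cannot dominate it. The data and helper names below are invented for the example, and the rank helper assumes no ties.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]   # monotone in x, with one heavy-tailed outlier

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

The relationship is perfectly monotone, so `r_spearman` is 1 up to floating point, while the outlier drags `r_pearson` well below 1.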

Their conclusion was that the **Ward** method converges faster than others such as single/average-linkage. It converges at around 250 observations for 256 assets with a correlation matrix similar to Figure 2 in the paper.

## Temporal Convolutional Networks (TCN)

[paper](https://arxiv.org/abs/1803.01271), code available on [github](https://github.com/locuslab/TCN). Here is a description of its basic architecture.

Two principles: 

* network produces the same length as the input
* there can be no leakage from the future into the past

TCN uses a **1-D fully-convolutional network** architecture:

* hidden layer is the same length as the input layer.
* zero padding of length $(k-1)$, where $k$ is the kernel size, keeps subsequent layers the same length as the previous ones.

TCN uses **causal convolution**, meaning an output at time t is only convolved with elements from time t and earlier in the previous layers. 

**Dilated convolution** is used here. The first layer uses dilation $d = 1$ (i.e. no dilation), and the dilation increases **exponentially** with the depth of the network (i.e. $d = \mathcal{O}(2^i)$, where $i$ is the layer index).

This ensures that:

1. some filter hits each input within the effective history,
2. a deep network can have an extremely large effective history.
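The causal and dilated convolutions described above can be sketched for a single channel in plain Python. This is an illustration, not the code from the linked repository: it uses one convolution per level (the actual residual block stacks two), and with dilation $d$ the left padding becomes $(k-1) \cdot d$.

```python
def causal_dilated_conv1d(x, kernel, dilation=1):
    """y[t] = sum_i kernel[i] * x[t - i * dilation], with x treated as zero before t = 0."""
    k = len(kernel)
    pad = (k - 1) * dilation               # left-only zero padding keeps len(y) == len(x)
    padded = [0.0] * pad + list(x)
    return [
        sum(kernel[i] * padded[pad + t - i * dilation] for i in range(k))
        for t in range(len(x))
    ]

def receptive_field(kernel_size, dilations):
    """History length seen by the top layer, assuming one convolution per level."""
    return 1 + (kernel_size - 1) * sum(dilations)

x = [float(t) for t in range(16)]
dilations = [1, 2, 4, 8]                   # d = 2^i for layer index i
h = x
for d in dilations:
    h = causal_dilated_conv1d(h, [1.0, 1.0], dilation=d)
```

With kernel size 2 and these four dilations, the last output `h[15]` sums all 16 inputs exactly once, so every input in the effective history is hit and the history doubles with each added level.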

**Residual connections** are used as in ResNet, together with weight normalization. Unlike ResNet, the input and output of a TCN residual block can have different widths (numbers of channels), so the authors use a 1x1 convolution to ensure that the **element-wise** addition receives tensors of the same shape (panels b and c in the figure below).

The authors demonstrated that TCN has longer memory than LSTM and GRU in 2 tasks.

<img src='./img/tcn.png' width='800'/>