# Paper
* **Title**: [Backpropagation Applied to Handwritten Zip Code Recognition](https://www.academia.edu/8932889/Backpropagation_Applied_to_Handwritten_Zip_Code_Recognition)
* **Date**: 1989
* **Authors**: Yann LeCun et al.

This is one of the **earliest** real-world applications of a neural net trained with **backpropagation**.

- - -
# Summary
In this paper, we apply the **backpropagation algorithm** to a real-world problem in recognizing handwritten digits taken from the U.S. Mail. The learning network is **directly fed with images**, rather than feature vectors, thus demonstrating the ability of backpropagation networks to deal with large amounts of low-level information. 


### Problem

The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network. 

### Solution

The "constraints" LeCun is talking about are simply the reused kernel weights:
- *In other words, each of the 64 units in H1.1 uses the same set of 25 weights. Each unit performs the same operation on corresponding parts of the image. The function performed by a feature map can thus be interpreted as a nonlinear subsampled convolution with a 5 by 5 kernel.*

- *The basic design principle is to reduce the number of free parameters in the network as much as possible without overly reducing its computational power. Application of this principle increases the probability of correct generalization because it results in a specialized network architecture that has a reduced entropy.*

At that time, an "unconstrained" network simply meant a **fully-connected** (i.e. "dense") one.

In short, this paper is effectively an early introduction to Convolutional Neural Networks trained with backpropagation.
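To make the weight-sharing idea concrete, here is a minimal pure-Python sketch of a single H1 feature map: one 5x5 kernel slides over the 16x16 input with stride 2, so all 64 units reuse the same 25 weights. The padding width of 2 is my assumption (it is what turns a 16x16 input into an 8x8 map), plain `tanh` stands in for the paper's scaled tanh, and the function name, the random input and the kernel values are purely illustrative.

```python
import math
import random

def feature_map(image, kernel, biases, stride=2, pad=2, fill=-1.0):
    """Apply one shared kernel across a square image (list of lists)."""
    n = len(image)
    k = len(kernel)
    # Pad the border with a constant background value (assumed width: 2).
    size = n + 2 * pad
    padded = [[fill] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            padded[r + pad][c + pad] = image[r][c]
    out_size = (size - k) // stride + 1
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Each unit has its own bias but reuses the same k*k weights.
            total = biases[i][j]
            for u in range(k):
                for v in range(k):
                    total += kernel[u][v] * padded[i * stride + u][j * stride + v]
            row.append(math.tanh(total))
        out.append(row)
    return out

rng = random.Random(0)
image = [[rng.uniform(-1.0, 1.0) for _ in range(16)] for _ in range(16)]
kernel = [[rng.uniform(-0.1, 0.1) for _ in range(5)] for _ in range(5)]
biases = [[0.0] * 8 for _ in range(8)]

h1_1 = feature_map(image, kernel, biases)
print(len(h1_1), len(h1_1[0]))  # 8 8
```

A dense layer producing the same 64 outputs from 256 inputs would need 64 * 256 weights; the shared kernel needs 25.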

### How it works

* The input of the network is a 16x16 normalized grayscale (not binary) image. 
* The output is composed of 10 units (one per class) and uses place coding.
* The network has three hidden layers: **H1**, **H2**, **H3**
  * **H1** is a convolutional layer with 12 kernels (5x5 each), with stride=$2$ and padding fillvalue=$-1$.
  * **H1** has 768 units (8 * 8 * 12), 19,968 connections (768 * 26), 1,068 parameters (768 biases + 25 * 12 weights)
  * **H2** is a convolutional layer with 12 kernels (5x5 each, spanning 8 of the 12 **H1** maps, hence 200 weights per kernel), producing 4x4 feature maps. Stride=$2$ and padding fillvalue=$-1$.
  * [NB]: each **H2** feature map draws its input from only 8 of the 12 **H1** feature maps (8x8 each), and different **H2** maps use different subsets of 8
  * **H2** contains 192 units (4 * 4 * 12), 38,592 connections (192 units * 201 input lines), 2,592 parameters (12 * 200 weights + 192 biases)
  * **H3** has 30 units fully connected to **H2**, so 5,790 connections (30 * 192 + 30)
  * output layer has 10 units fully connected to **H3**, so 310 weights (30 * 10 + 10)
  * total: 1,256 units, 64,660 connections, 9,760 parameters
* scaled **tanh** activations on all units (including output units!)
* cost function: mean squared error
* weight init: random values in U[-2.4/F, 2.4/F] where F is the fan-in. "tends to keep total inputs in operating range of sigmoid"
* training
  * SGD on single example at a time
  * 23 epochs
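The unit, connection and parameter counts in the list above can be re-derived with a few lines of arithmetic. This is only a sanity check of the numbers as listed; the variable names are mine, and the 200 weights per H2 kernel are read as a 5x5 window over 8 H1 maps.

```python
# Input image: 16x16 pixels, counted as units in the total.
input_units = 16 * 16

# H1: 12 feature maps of 8x8, each unit sees a 5x5 patch plus a bias.
h1_units = 8 * 8 * 12
h1_conns = h1_units * (5 * 5 + 1)
h1_params = 5 * 5 * 12 + h1_units          # shared weights + per-unit biases

# H2: 12 feature maps of 4x4, each unit has 200 weight inputs plus a bias.
h2_units = 4 * 4 * 12
h2_fan_in = 5 * 5 * 8                      # 200
h2_conns = h2_units * (h2_fan_in + 1)
h2_params = 12 * h2_fan_in + h2_units      # shared weights + per-unit biases

# H3 and output are fully connected, so parameters == connections.
h3_units = 30
h3_conns = h3_units * h2_units + h3_units
out_units = 10
out_conns = out_units * h3_units + out_units

total_units = input_units + h1_units + h2_units + h3_units + out_units
total_conns = h1_conns + h2_conns + h3_conns + out_conns
total_params = h1_params + h2_params + h3_conns + out_conns

print(total_units, total_conns, total_params)  # 1256 64660 9760
```

The gap between 64,660 connections and 9,760 parameters comes entirely from weight sharing in H1 and H2.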

<center><img src="img/lecun_zip_code_nn.png" alt="Neural Network Architecture" width="921" height="468" /></center>
<p style="text-align: center; font-size: small;"><i><b>Figure 1.</b> 1989 LeCun ConvNet per description in the paper</i></p>
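The weight initialization rule from the list above (uniform in $[-2.4/F, 2.4/F]$, with $F$ the fan-in) can be sketched as follows. The helper names are hypothetical, and whether the bias counts towards the fan-in is an assumption on my side (here it does not). The second helper shows why this scaling keeps the total input bounded: if every input lies in $[-1, 1]$, the weighted sum can never exceed 2.4 in magnitude.

```python
import random

def init_weights(fan_in, rng):
    """Draw one unit's incoming weights uniformly from [-2.4/F, 2.4/F]."""
    bound = 2.4 / fan_in
    return [rng.uniform(-bound, bound) for _ in range(fan_in)]

def max_total_input(weights):
    """Worst-case |weighted sum| when every input lies in [-1, 1]."""
    return sum(abs(w) for w in weights)

rng = random.Random(0)
h1_kernel = init_weights(fan_in=25, rng=rng)    # a 5x5 kernel in H1
h3_unit = init_weights(fan_in=192, rng=rng)     # one unit in H3

print(max_total_input(h1_kernel), max_total_input(h3_unit))
```

Both printed values stay below 2.4 regardless of the fan-in, which matches the quoted remark about keeping total inputs in the operating range of the nonlinearity.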



Open questions:
* The paper does not say which 8 of the 12 **H1** maps each **H2** map connects to. I will assume each **H2** map picks its 8 inputs uniformly at random.
* What is the learning rate? I will run a manual sweep to find the best one.
* Was any learning rate decay used? Not mentioned, so I am assuming no.
* Was any weight decay used? Not mentioned, so I am assuming no.
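The first assumption above (each H2 map picks 8 of the 12 H1 maps uniformly at random) could be implemented along these lines. This is my guess at a reasonable scheme, not the connection table from the paper; the function name and the seed are illustrative.

```python
import random

def sample_connections(n_out=12, n_in=12, k=8, seed=0):
    """For each output map, pick k distinct input maps uniformly at random."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_in), k)) for _ in range(n_out)]

table = sample_connections()
for h2_map, h1_maps in enumerate(table):
    print(f"H2.{h2_map + 1} <- H1 maps {[m + 1 for m in h1_maps]}")
```

Fixing the seed keeps the connectivity identical across runs, which matters when comparing learning rates in a sweep.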

### Results

* final misclassification: 0.14% on train (10 mistakes), 5.0% on test (102 mistakes).

- - - 
# What I learned?

* **Constrained layers**: Layers which are not fully-connected (dense). In other words, layers with reduced number of parameters.


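* **LiSHT**: An activation function, the "linearly scaled hyperbolic tangent": $\text{lisht}(x) = x \cdot \tanh(x)$. Note that this is not the scaled tanh used in the paper: LiSHT is an even, non-negative function, while tanh is odd.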


* **Tanh > Sigmoid**: As a general rule of thumb, `tanh` is a better activation function than `sigmoid`.
  * Why Tanh?

In this paper LeCun's team applied a **scaled hyperbolic tangent** (i.e. a scaled `tanh`) to the output of each layer in the neural net. So why this function, and not a plain `sigmoid`? The two have the same shape, since $\tanh(x) = 2\sigma(2x) - 1$; the difference is the output range, which is $(0, 1)$ for `sigmoid` and $(-1, 1)$ for `tanh`.

The only reason I can come up with is that `tanh` has steeper gradients than `sigmoid` (its slope at $0$ is $1$ versus $0.25$, a consequence of the wider range), which means faster learning. Of course, we could also get faster learning from a higher learning rate, but that would increase the risk of divergence.

Read more [HERE](https://stats.stackexchange.com/questions/330559/why-is-tanh-almost-always-better-than-sigmoid-as-an-activation-function). LeCun himself wrote the following in another paper:
  * *Symmetric functions of that kind are believed to yield **faster convergence**, although the learning can be extremely slow if some weights are too small (LeCun 1987).*
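The comparison between the activations discussed in this section can be checked numerically. The sketch below uses plain `tanh` (without the paper's scaling constants, which I do not reproduce here) and verifies the identity $\tanh(x) = 2\sigma(2x) - 1$, the slopes at zero, and the fact that LiSHT is symmetric while `tanh` is antisymmetric. The helper names are mine.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def lisht(x):
    return x * math.tanh(x)

# tanh is a shifted and rescaled sigmoid.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    assert math.isclose(math.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0, abs_tol=1e-12)

# Slopes at the origin: tanh is four times steeper than sigmoid.
print(d_tanh(0.0), d_sigmoid(0.0))  # 1.0 0.25

# tanh is odd (zero-centered output), LiSHT is even and non-negative.
for x in (0.5, 1.0, 2.0):
    assert math.isclose(math.tanh(-x), -math.tanh(x))
    assert math.isclose(lisht(-x), lisht(x))
    assert lisht(x) > 0.0
```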