### Definition

A residual is the error left over when approximating a function.
* Define $\mathcal{H}(x)$ to be the underlying mapping we want to learn, such that $\mathcal{H}: X \rightarrow Y$.
* At any given training step $i$ we have some approximation $h_{i}$, which we hope converges to the target: $\lim_{i\to\infty} h_{i} = \mathcal{H}$.
* The residual (or error) of the approximation is given as $r = h(x) - y$, where $y = \mathcal{H}(x)$.

However, a ResNet takes the identity as the reference mapping, so the residual is measured against the input rather than against a target.

* Define $\mathcal{H}(x) = x$.
* We can then define the residual, $r$, as a function of $x$, giving $\mathcal{F}(x) := r = h(x) - x$.


In practice this is rearranged and we learn the residual function instead of the underlying mapping.

* $h(x) = \mathcal{F}(x) + x$.

So, $\mathcal{F}(x)$ is a measure of how far the mapping is from the identity (rather than from zero, as usual).
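
To make the "distance from the identity" reading concrete, here is a minimal sketch of one block computing $\mathcal{F}(x) + x$. The two-layer ReLU form of $\mathcal{F}$, the weight names `W1` and `W2`, and the helper names are all assumptions for illustration, not something these notes fix.

```python
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, vi) for vi in v]

def residual_fn(x, W1, W2):
    """F(x): a small two-layer network (hypothetical choice of architecture)."""
    return matvec(W2, relu(matvec(W1, x)))

def residual_block(x, W1, W2):
    """Block output F(x) + x: the learned residual plus the skip connection."""
    return [f + xi for f, xi in zip(residual_fn(x, W1, W2), x)]

dim = 3
x = [1.0, -2.0, 0.5]

# With all weights at zero, F(x) = 0 and the block is exactly the identity.
zeros = [[0.0] * dim for _ in range(dim)]
assert residual_block(x, zeros, zeros) == x

# With small random weights, the output is a small perturbation of x.
rng = random.Random(0)
W1 = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(dim)]
W2 = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(dim)]
y = residual_block(x, W1, W2)
distance_from_identity = max(abs(yi - xi) for yi, xi in zip(y, x))
```

So zero weights give the identity for free, whereas a plain layer with zero weights would output zeros.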


Next, note that $\mathcal{H}(x) = x$ is the special case of $\mathcal{H}(x) = Ax$ with $A = I$. So we can make this formulation more general by defining a matrix $A$ and initialising it as $I$.
* $\mathcal{H}(x) = Ax$
* Thus we get $\mathcal{F}(x) := h(x) - Ax$.
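
A rough sketch of this generalisation, assuming the block output is $\mathcal{F}(x) + Ax$ with $A$ stored as a plain matrix. The function names and the zero residual used for the check are hypothetical.

```python
def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def generalised_block(x, A, residual_fn):
    """Return F(x) + A x, where A replaces the fixed identity skip."""
    return [f + s for f, s in zip(residual_fn(x), matvec(A, x))]

def zero_residual(x):
    """Stand-in for an F that has been pushed all the way to zero."""
    return [0.0 for _ in x]

x = [3.0, -1.0]
A = identity(2)

# Initialised as A = I, the block starts out as the ordinary identity skip.
assert generalised_block(x, A, zero_residual) == x

# Once A moves away from I, the skip path itself transforms the input.
A[0][1] = 0.5
y = generalised_block(x, A, zero_residual)
```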

### Backprop derivation

$$x_{l+1} = \mathcal{F}_l(x_l|\mathcal{W}_l) + x_l$$

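Applying the chain rule through both the residual branch and the skip connection:

$$
\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_{l+1}} \frac{\partial x_{l+1}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_{l+1}} \left( \frac{\partial \mathcal{F}_l}{\partial x_l} + I \right)
$$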
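
A quick numerical sanity check of the gradient through one block $x_{l+1} = \mathcal{F}_l(x_l) + x_l$, in the scalar case. By the chain rule the skip connection contributes a $+1$ next to $\partial \mathcal{F}_l / \partial x_l$. The choices $\mathcal{F}(x) = \tanh(wx)$ and $\mathcal{L} = \tfrac{1}{2} x_{l+1}^2$ are assumptions made only so there is something to differentiate.

```python
import math

def residual_fn(x, w):
    """Hypothetical scalar residual function F(x) = tanh(w * x)."""
    return math.tanh(w * x)

def loss(x, w):
    """Toy loss on the block output: L = 0.5 * (F(x) + x)^2."""
    x_next = residual_fn(x, w) + x
    return 0.5 * x_next ** 2

def analytic_grad(x, w):
    """dL/dx = dL/dx_next * (dF/dx + 1); the +1 comes from the skip path."""
    x_next = residual_fn(x, w) + x
    dL_dx_next = x_next
    dF_dx = w * (1.0 - math.tanh(w * x) ** 2)
    return dL_dx_next * (dF_dx + 1.0)

def numeric_grad(x, w, eps=1e-6):
    """Central finite difference of the loss with respect to x."""
    return (loss(x + eps, w) - loss(x - eps, w)) / (2 * eps)

x, w = 0.7, 0.3
g_analytic = analytic_grad(x, w)
g_numeric = numeric_grad(x, w)
assert abs(g_analytic - g_numeric) < 1e-6
```

Even if $\partial \mathcal{F}_l / \partial x_l$ were zero, the gradient with respect to $x_l$ would still equal the gradient with respect to $x_{l+1}$, because of the $+1$.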


### Facilitation

Oh... this is why it is easy to learn the identity: the block only has to push $\mathcal{F}(x)$ to zero.
(So is it easier to learn zeros than ones?) Hmm, I don't know about that. This assumes that it is easy for the mapping to be pushed to zero?

 

> If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one.

I like this. It is like facilitating learning. I guess this is about what biases/heuristics/priors we want to incorporate into our network. How can we reason about this at a higher level? So that a network could learn to change its structure, with the goal of facilitating learning? This is the sort of thing we would hope to learn from unsupervised learning? (!?!)

So this is saying that we expect the mapping to be some linear function of x. (what if we use others?) Add

### Types

So why would passing information/representing features with residuals be better? (Types of information? Which is best?) How does it make sense to combine different types of information? Residual + x = y.



### Interpretation

Can we interpret these residual functions as errors? How about as uncertainty (probabilistic interpretation)?

Probability uncertainty - residual error ... http://www.probabilistic-numerics.org/general/2015/01/14/UQ/

As RNNs

##### Ensembles

I still don't get why it doesn't affect the next layer when you drop a layer...

Because they are learning residuals: error w.r.t. their input/output. If we drop upstream layers, this shouldn't affect the residual that this layer was correcting, as the residuals are disjoint?? So they each specialise?

Could they compensate for others??? Why would a layer learn to correct another layer? It wouldn't/shouldn't?


If a layer is giving good/accurate output, then downstream layers will only receive error signals for the difference between their ... ?? So they will fit themselves to errors that have not been removed upstream in the network.
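
A toy sketch of what "dropping a layer" means mechanically: the dropped block's residual is skipped and only the identity path is kept. The scalar blocks and their hand-picked small scales are assumptions. This only shows that small residuals make a dropped block a small perturbation; it says nothing about what trained networks actually learn.

```python
import math

def make_block(scale):
    """Hypothetical scalar residual function F(x) = scale * tanh(x)."""
    return lambda x: scale * math.tanh(x)

def forward(x, blocks, dropped=None):
    """Run x through a stack of residual blocks, optionally dropping one."""
    for i, f in enumerate(blocks):
        if i == dropped:
            continue  # dropped block: only the skip connection remains
        x = f(x) + x
    return x

blocks = [make_block(s) for s in (0.05, -0.03, 0.04, 0.02)]
x0 = 1.0
full = forward(x0, blocks)

# Change in the final output when each block is dropped in turn.
changes = [abs(forward(x0, blocks, dropped=i) - full) for i in range(len(blocks))]
```

Because every block still receives something close to its usual input through the skip path, downstream blocks are not fed garbage when an upstream one disappears.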



### Random ideas

Try with ELUs instead of ReLUs, as normally f(x) + x = (piecewise linear with a kink) + linear.

##### Gating
Does the residual passed forward include other residuals? $y = \mathcal{F}(x, \{W_i\}) + W_S x$ OR $y = \mathcal{F}(x, \{W_i\}) + W_{S_i} x_i + W_{S_{i-1}} x_{i-1}$

Could we use multiplicative residuals to decide which information is sent forward through the residuals? $y = \mathcal{F}(x, \{W_i\}) + W_{S_i} x_i \times \sigma(W_{S_{i-1}} x_{i-1})$. Kind of like gating the residuals, or could we use the residuals to gate?
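
A scalar sketch of the gated idea, assuming $\sigma$ is the logistic sigmoid and $\times$ is an elementwise product (so it is just a product of scalars here). The parameter names `w_s` and `w_gate` are hypothetical stand-ins for $W_{S_i}$ and $W_{S_{i-1}}$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_skip(f_out, x_i, x_prev, w_s, w_gate):
    """y = F(x) + (w_s * x_i) * sigmoid(w_gate * x_prev), scalar case."""
    gate = sigmoid(w_gate * x_prev)  # in (0, 1), computed from the earlier input
    return f_out + w_s * x_i * gate

# Gate driven far positive: the skip path passes through almost unchanged.
y_open = gated_skip(f_out=0.1, x_i=2.0, x_prev=50.0, w_s=1.0, w_gate=1.0)

# Gate driven far negative: the skip path is almost shut off, leaving F(x).
y_closed = gated_skip(f_out=0.1, x_i=2.0, x_prev=-50.0, w_s=1.0, w_gate=1.0)
```

Note that a closed gate removes the identity path, so this trades away the property that made the block easy to set to the identity.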

##### Mappings

Aka, a way to add a prior?

* Inverse? 
* Rotation? 
* $\mathcal{H}(x) = e^x$, so that large values have greater weight.

##### Regularisation
What if you tried to minimise/regularise the distance from the ideal mapping? A way of biasing toward linear identities? Kind of like Occam's razor?
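
One way this could look, assuming the "ideal mapping" is the identity so that the distance is just the size of the residual $\mathcal{F}(x)$. The penalty form, the weight `lam`, and both function names are assumptions, not an established recipe.

```python
def identity_penalty(residuals):
    """Mean over a batch of the squared norm of F(x), i.e. ||h(x) - x||^2."""
    return sum(sum(r * r for r in res) for res in residuals) / len(residuals)

def regularised_loss(task_loss, residuals, lam=0.01):
    """Task loss plus a penalty pulling each block toward the identity."""
    return task_loss + lam * identity_penalty(residuals)

# Two examples in the batch: one with a nonzero residual, one exactly identity.
batch_residuals = [[1.0, 2.0], [0.0, 0.0]]
penalty = identity_penalty(batch_residuals)            # (5.0 + 0.0) / 2
total = regularised_loss(1.0, batch_residuals, lam=0.1)
```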

### Questions

* Does this change anything related to being convex in inputs???
* How does L1/L2 regularisation work with this? Probably not very well...?
* Allows you to forget more!!! (like my screen recordings?)
* How are they initialised? With identities?
* What happens if you use AdaBoost with a residual net? Is it like double boosting?