## THE LAUNCHPAD QUESTION: `Is learning better networks as easy as stacking more layers?`
### ~In simple terms: does stacking more and more layers in a network improve learning?

In [1]:
import torch
import torch.nn as nn 
import torch.nn.functional as F 

### As the depth of a network increases, the gradients can start to vanish or explode during backpropagation
### `When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated and then degrades rapidly. Such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error` ~from paper
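The gradient statement above can be checked on a toy model. The sketch below is only an illustration under stated assumptions: the depth (50), width (16), `tanh` activation, default `nn.Linear` initialization, and the helper names `Residual`, `make_layer`, and `input_grad_norm` are all hypothetical choices, not taken from the paper. It measures how much gradient reaches the input of a plain stack versus the same stack with `+ x` added around every layer. It does not reproduce the degradation experiment quoted above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class Residual(nn.Module):
    """Wraps a layer f and returns f(x) + x (hypothetical helper)."""
    def __init__(self, layer):
        super().__init__()
        self.layer = layer

    def forward(self, x):
        return self.layer(x) + x

def make_layer(dim):
    # one small transformation: Linear followed by tanh
    return nn.Sequential(nn.Linear(dim, dim), nn.Tanh())

def input_grad_norm(net, dim):
    # norm of d(sum of outputs) / d(input): how much gradient reaches the input
    x = torch.randn(4, dim, requires_grad=True)
    net(x).sum().backward()
    return x.grad.norm().item()

depth, dim = 50, 16  # assumed toy sizes
plain = nn.Sequential(*[make_layer(dim) for _ in range(depth)])
resid = nn.Sequential(*[Residual(make_layer(dim)) for _ in range(depth)])

plain_norm = input_grad_norm(plain, dim)
resid_norm = input_grad_norm(resid, dim)
print(f'plain stack    : {plain_norm:.3e}')
print(f'residual stack : {resid_norm:.3e}')
```

With these settings the plain stack's input gradient should come out many orders of magnitude smaller than the residual stack's. The exact numbers depend on the seed and the initialization.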

In [2]:
B, T = 4, 8
x = torch.randn((B, T))

In [3]:
# simple transformation function == f(x, W)
f = nn.Linear(in_features=T, out_features=T)

# simple identity operation in residual network
out = f(x) + x # identity connection (x is added directly)


In [4]:
print(f'input shape: {x.shape}')
print(f'f(x) shape: {f(x).shape}')
print(f'output shape: {out.shape}')

input shape: torch.Size([4, 8])
f(x) shape: torch.Size([4, 8])
output shape: torch.Size([4, 8])
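The identity connection above can be packaged as a reusable module. This is a minimal sketch, assuming a single `nn.Linear` as the transformation `f`; the class name `IdentityResidualBlock` and the `_demo` variables are hypothetical. Zeroing the weights of `f` shows a useful property: when `f` outputs zeros, the block is exactly the identity mapping.

```python
import torch
import torch.nn as nn

class IdentityResidualBlock(nn.Module):
    """Computes f(x) + x, where f keeps the feature size unchanged."""
    def __init__(self, features):
        super().__init__()
        self.f = nn.Linear(features, features)

    def forward(self, x):
        return self.f(x) + x  # identity connection

x_demo = torch.randn(4, 8)
block = IdentityResidualBlock(8)

# Force f(x) = 0 so that only the identity path contributes.
nn.init.zeros_(block.f.weight)
nn.init.zeros_(block.f.bias)

out_demo = block(x_demo)
print(out_demo.shape)                  # torch.Size([4, 8])
print(torch.equal(out_demo, x_demo))   # True
```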


## Identity connection **fails** when dimensions don't match

In [5]:
B, T, C = 4, 8, 32
# input features
x = torch.randn((B, T))

In [6]:
f = nn.Linear(in_features=T, out_features=C)

out = f(x) # x(4, 8) @ f(8, 32) --> (4, 32)
out.shape

torch.Size([4, 32])

In [7]:
out + x # out(4, 32) + (4, 8) --//--> out and x are not broadcastable
# 1st trailing dimension: 32 != 8
# (4, 32)
# (4, 8) 

RuntimeError: The size of tensor a (32) must match the size of tensor b (8) at non-singleton dimension 1

In [8]:
# SOL: simple projection operation in residual network (PROJ CONNECTION)
B, T, C = 4, 8, 32
# input features
x = torch.randn((B, T))


In [14]:
f = nn.Linear(in_features=T, out_features=C)

# Projection matrix Ws to match dimension
Ws = nn.Linear(in_features=T, out_features=C)

# f(x, {W}) + Ws.x

# Part 1: f(x, {W})
out1 = f(x) # x(4, 8) @ f(8, 32) --> (4, 32)
print('f(x, {W}): ', out1.shape)

# Part 2: Ws.x
out2 = Ws(x) # x(4, 8) @ Ws(8, 32) --> (4, 32)
print('Ws.x: ', out2.shape)


f(x, {W}):  torch.Size([4, 32])
Ws.x:  torch.Size([4, 32])


In [15]:
proj = out1 + out2 # (4, 32) + (4, 32) --> (4, 32)
print('output shape: ', (proj.shape))

output shape:  torch.Size([4, 32])
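Both cases, identity and projection, can be combined in one block that picks the shortcut from the feature sizes. This is a sketch under assumptions: the class name `ResidualBlock` is hypothetical, `f` is a single `nn.Linear`, and the projection uses `bias=False` so that the shortcut is a pure matrix product `Ws.x` (the cells above keep the default bias).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """f(x) + shortcut(x); the shortcut is the identity or a projection Ws."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.f = nn.Linear(in_features, out_features)
        if in_features == out_features:
            # dimensions match: add x directly
            self.shortcut = nn.Identity()
        else:
            # dimensions differ: project x to out_features first
            self.shortcut = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        return self.f(x) + self.shortcut(x)

x_in = torch.randn(4, 8)
same = ResidualBlock(8, 8)    # identity shortcut
wider = ResidualBlock(8, 32)  # projection shortcut

print(same(x_in).shape)   # torch.Size([4, 8])
print(wider(x_in).shape)  # torch.Size([4, 32])
```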


## Broadcasting semantics from [PyTorch](https://docs.pytorch.org/docs/stable/notes/broadcasting.html)

Two tensors are “broadcastable” if the following rules hold:

1. Each tensor has at least one dimension.
2. When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.
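The rules above translate almost line by line into a small shape check. `is_broadcastable` is a hypothetical helper written for illustration in plain Python, not a PyTorch function. It shows why `(4, 32) + (4, 8)` failed earlier.

```python
def is_broadcastable(shape_a, shape_b):
    """Apply the two broadcasting rules quoted above to a pair of shapes."""
    # Rule 1: each tensor has at least one dimension.
    if len(shape_a) == 0 or len(shape_b) == 0:
        return False
    # Rule 2: walk the sizes from the trailing dimension backwards.
    # zip stops at the shorter shape, which covers "one of them does not exist".
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True

print(is_broadcastable((4, 32), (4, 8)))   # False: 32 != 8 at the trailing dimension
print(is_broadcastable((4, 32), (4, 1)))   # True: one of the sizes is 1
print(is_broadcastable((4, 32), (32,)))    # True: the leading dimension does not exist
```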