## Libraries

In [1]:
import torch
from torch import tensor

## Broadcasting

### Example broadcast

In [2]:
c = tensor([10, 20, 30])
m = tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])

In [3]:
c.shape, m.shape

(torch.Size([3]), torch.Size([3, 3]))

In [4]:
c + m

tensor([[11., 22., 33.],
        [14., 25., 36.],
        [17., 28., 39.]])
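The addition above works because PyTorch aligns shapes from the right and stretches any missing or size-1 dimension. A minimal sketch of that rule, re-creating `c` and `m` and using `torch.broadcast_shapes` and `expand_as` to make the stretching explicit (the variable names `result_shape`, `c_expanded` and `same` are only illustrative):

```python
import torch

c = torch.tensor([10, 20, 30])                         # shape (3,)
m = torch.tensor([[1., 2, 3], [4, 5, 6], [7, 8, 9]])   # shape (3, 3)

# Shapes are compared from the trailing dimension backwards;
# (3,) is treated as (1, 3) and then stretched to (3, 3).
result_shape = torch.broadcast_shapes(c.shape, m.shape)

# expand_as shows the stretched view; no data is copied (stride 0 on the new axis).
c_expanded = c.expand_as(m)

# Adding the explicit view gives the same result as relying on broadcasting.
same = torch.equal(c + m, c_expanded + m)
print(result_shape, c_expanded.stride(), same)
```

Since the expanded view has stride 0 along the first axis, every row reads the same three values of `c`.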

### Simple exercise

Suppose this is a batch of RGB images

In [5]:
image_batch = torch.rand(64, 3, 256, 256)
image_batch.shape

torch.Size([64, 3, 256, 256])

Normalize it per channel using two 3-element vectors: one with the channel means and one with the channel standard deviations.

In [6]:
mean = torch.mean(image_batch, dim=(0, 2, 3))
mean, mean.shape

(tensor([0.4998, 0.5001, 0.4998]), torch.Size([3]))

In [7]:
std = torch.std(image_batch, dim=(0, 2, 3))
std, std.shape

(tensor([0.2887, 0.2886, 0.2887]), torch.Size([3]))

In [8]:
broadcasted_mean = mean.unsqueeze(0).unsqueeze(2).unsqueeze(3)
broadcasted_mean.shape

torch.Size([1, 3, 1, 1])

In [9]:
broadcasted_mean

tensor([[[[0.4998]],

         [[0.5001]],

         [[0.4998]]]])

In [10]:
broadcasted_std = std.unsqueeze(1).unsqueeze(1)
broadcasted_std.shape

torch.Size([3, 1, 1])

In [11]:
broadcasted_std

tensor([[[0.2887]],

        [[0.2886]],

        [[0.2887]]])

In [12]:
normalized_images = (image_batch - broadcasted_mean)/broadcasted_std
normalized_images.shape

torch.Size([64, 3, 256, 256])
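As a possible shortcut to the `unsqueeze` chain, `keepdim=True` leaves the reduced axes in place with size 1, so the statistics already have a broadcastable shape. The sketch below uses a smaller random batch than the one above (an assumption made only to keep it quick) and checks that each channel ends up with mean close to 0 and std close to 1:

```python
import torch

torch.manual_seed(0)
image_batch = torch.rand(8, 3, 32, 32)  # smaller stand-in for the batch above

# keepdim=True keeps the reduced axes as size 1, giving shape (1, 3, 1, 1)
mean = image_batch.mean(dim=(0, 2, 3), keepdim=True)
std = image_batch.std(dim=(0, 2, 3), keepdim=True)

normalized = (image_batch - mean) / std

# Per-channel statistics after normalization
channel_mean = normalized.mean(dim=(0, 2, 3))
channel_std = normalized.std(dim=(0, 2, 3))
print(mean.shape, channel_mean, channel_std)
```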

## Einstein Summation

Einstein summation is a compact notation for combining products and sums in a general way. For example:  
ik,kj->ij

The left-hand side names the dimensions of each operand, while the right-hand side names the dimensions of the result. The rules of einsum are:
1. Repeated indices on the left side are implicitly summed over if they are not on the right side
2. In the classical convention, each index can appear at most twice on the left side (`torch.einsum` does not enforce this)
3. In the classical convention, unrepeated indices on the left side must appear on the right side (`torch.einsum` instead sums over any index missing from the right side)
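To make the rules concrete, here is a small sketch that spells out `ik,kj->ij` as explicit loops and compares it with `torch.einsum`. The tensors `a` and `b` are illustrative stand-ins, not the ones used in the cells below:

```python
import torch

torch.manual_seed(0)
a = torch.randn(2, 3)
b = torch.randn(3, 2)

# "ik,kj->ij": i and j appear on the right, so they index the output;
# k is repeated on the left and absent on the right, so it is summed over.
out = torch.zeros(2, 2)
for i in range(2):
    for j in range(2):
        for k in range(3):
            out[i, j] += a[i, k] * b[k, j]

matches = torch.allclose(out, torch.einsum("ik,kj->ij", a, b), atol=1e-6)

# torch.einsum also sums an unrepeated index that is missing from the output.
row_sums = torch.einsum("ij->i", a)
print(matches, row_sums)
```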

In [13]:
x = torch.randn(2,3)
y = torch.randn(3,2)
x, y

(tensor([[ 0.1973,  0.2416, -0.6079],
         [-0.2403, -0.3139, -0.1490]]),
 tensor([[-0.8592,  2.0951],
         [-0.8479, -0.7700],
         [ 1.3391, -0.5981]]))

### Transpose

In [14]:
torch.einsum("ij->ji", x)

tensor([[ 0.1973, -0.2403],
        [ 0.2416, -0.3139],
        [-0.6079, -0.1490]])

### Matrix product

In [15]:
torch.einsum('ik, kj -> ij', x, y)

tensor([[-1.1885,  0.5909],
        [ 0.2732, -0.1727]])

### Gram matrix

In [16]:
torch.einsum('ij, kj -> ik', x, x)

tensor([[ 0.4669, -0.0327],
        [-0.0327,  0.1785]])
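The three einsum expressions above should agree with the usual tensor operations. A quick check, using freshly sampled tensors so the numbers will differ from the outputs shown:

```python
import torch

torch.manual_seed(0)
x = torch.randn(2, 3)
y = torch.randn(3, 2)

# Transpose: swap the two indices
transpose_ok = torch.equal(torch.einsum("ij->ji", x), x.T)

# Matrix product: sum over the shared index k
matmul_ok = torch.allclose(torch.einsum("ik,kj->ij", x, y), x @ y, atol=1e-6)

# Gram matrix: dot product between every pair of rows, i.e. x @ x.T
gram = torch.einsum("ij,kj->ik", x, x)
gram_ok = torch.allclose(gram, x @ x.T, atol=1e-6)
symmetric = torch.allclose(gram, gram.T, atol=1e-6)
print(transpose_ok, matmul_ok, gram_ok, symmetric)
```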

## Forward and backward passes

### Define and init a layer

We could stack two linear layers on top of each other, but the composition of linear operations is just another linear operation. To make the stack more expressive we need a nonlinearity in between, such as the most commonly used activation function, [ReLU](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)).

In [17]:
def lin(x, w, b): return x @ w + b 

We won't actually train here, so the inputs, targets and weights are just random matrices.

In [18]:
x = torch.randn(200, 100)
y = torch.randn(200)

In [19]:
w1 = torch.randn(100, 50)
b1 = torch.zeros(50)
w2 = torch.randn(50, 1)
b2 = torch.zeros(1)

The result of our first linear layer would be:

In [20]:
l1 = lin(x, w1, b1)
l1.shape

torch.Size([200, 50])

But check how the mean and std change **AFTER** the layer. Take into account how torch.randn works:
> torch.randn: Returns a tensor filled with random numbers from a normal distribution with mean `0` and variance `1` (also called the standard normal distribution).

In [21]:
w1.mean(), w1.std(), w2.mean(), w1.std()  # note: the last entry repeats w1.std(); w2.std() was probably intended

(tensor(0.0158), tensor(1.0073), tensor(0.0874), tensor(1.0073))

In [22]:
l1.mean(), l1.std()

(tensor(0.2552), tensor(10.1200))

Notice the std is about 10 times the std of 1 that torch.randn produces. This comes from the sum inside the matrix multiplication: each output value is the sum of 100 products of independent values with mean 0 and variance 1, so its variance is about 100 and its std is about sqrt(100) = 10. The mean stays close to 0, but the spread of the values grows considerably after a single layer. A more exaggerated version would be as follows:
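The growth of the spread can be checked for a few input sizes. This is a rough sketch, assuming standard normal inputs and weights; the sizes 25, 100 and 400 are arbitrary choices, and the measured std should land near sqrt(n) in each case:

```python
import torch
from math import sqrt

torch.manual_seed(0)

stds = {}
for n in (25, 100, 400):
    x = torch.randn(2000, n)   # inputs with mean 0, std 1
    w = torch.randn(n, 50)     # unscaled weights with mean 0, std 1
    # Each output is a sum of n products, so its std should be close to sqrt(n)
    stds[n] = (x @ w).std().item()
    print(n, round(stds[n], 2), sqrt(n))
```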

In [23]:
test = torch.randn(10000,10000)
test = test @ test
test.mean(), test.std()

(tensor(0.0056), tensor(100.0168))

In [24]:
x = torch.randn(200,100)
for _ in range(50): x = x @ torch.randn(100,100)
x[0:5, 0:5], x.mean(), x.std()

(tensor([[nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan],
         [nan, nan, nan, nan, nan]]),
 tensor(nan),
 tensor(nan))

Since we are dealing with floats, the values keep growing until they overflow to inf, and subsequent operations on them produce nan.
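A small sketch of how float32 ends up at nan: the std grows by roughly a factor of 10 per layer, so after 50 layers it would be on the order of 10^50, far beyond the largest float32 value. The names `big`, `overflowed` and `undefined` are illustrative:

```python
import torch

# Largest finite float32 value, roughly 3.4e38
max32 = torch.finfo(torch.float32).max

big = torch.tensor(3e38)             # still representable in float32
overflowed = big * 10                # exceeds the float32 range, becomes inf
undefined = overflowed - overflowed  # inf - inf has no defined value, becomes nan

print(max32, overflowed, undefined)
```

Inside a matrix product, positive and negative infinities get added together, which is one way the whole tensor turns into nan.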

### Fixing initialization problems

#### [Xavier Glorot and Yoshua Bengio's weights init scaling](https://oreil.ly/9tiTC)
The idea is to scale the weights of a linear layer so that the std of its output stays close to 1. The scale factor is:
$$
\sqrt{\frac{1}{n}}
$$
With *n* being the number of inputs. In our case, with n = 100, we'd have 0.1

In [25]:
x = torch.randn(200,100)
for _ in range(50): x = x @ (0.1 * torch.randn(100,100))
x[0:5, 0:5], x.mean(), x.std()

(tensor([[ 1.3581,  0.4751, -1.5982,  0.8653, -0.9027],
         [-0.6477,  0.9074,  0.3204, -0.3126,  0.0870],
         [ 0.9161, -0.0672, -1.3581,  1.2238, -1.2131],
         [ 0.1557, -0.3300,  0.1554, -0.0133, -0.2455],
         [ 0.2525,  0.3681, -0.5425,  0.3638, -0.2616]]),
 tensor(0.0025),
 tensor(0.7898))

Here the std, mean and values all stay in a reasonable range. **Notice that just adding 0.01 to the scale factor (0.11 instead of 0.1) makes the std grow to roughly 100 after the 50 layers! Even with the correct scale, the final std can end up around 30% away from 1.0, depending on the random draw.**
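The sensitivity to the scale factor follows from simple arithmetic: each layer multiplies the std by about scale * sqrt(n), and that factor is compounded over 50 layers. The helper `std_after` below is a hypothetical, idealized estimate that assumes a unit-std input and ignores sampling noise:

```python
from math import sqrt

n, depth = 100, 50

def std_after(scale):
    # Idealized estimate: every layer multiplies the std by scale * sqrt(n)
    return (scale * sqrt(n)) ** depth

growth = {scale: std_after(scale) for scale in (0.09, 0.10, 0.11)}
for scale, value in growth.items():
    print(scale, value)  # 0.09 shrinks to about 0.005, 0.10 stays at 1, 0.11 grows to about 117
```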
> Now test second layer with this initialization

In [26]:
from math import sqrt
x = torch.randn(200,100)
w1 = torch.randn(100, 50) / sqrt(100)
b1 = torch.zeros(50)
w2 = torch.randn(50, 1) / sqrt(50)
b2 = torch.zeros(1)

In [27]:
l1 = lin(x, w1, b1)
l1.mean(), l1.std()

(tensor(-0.0106), tensor(0.9913))

In [28]:
def relu(x): return x.clamp_min(0.)

In [29]:
l2 = relu(l1)
l2.mean(), l2.std()

(tensor(0.3896), tensor(0.5751))
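These values are close to what theory predicts for ReLU applied to a standard normal variable: a mean of 1/sqrt(2π) and a variance of 1/2 - 1/(2π). A sketch comparing those closed-form values with a large random sample (the sample size is an arbitrary choice):

```python
import torch
from math import sqrt, pi

torch.manual_seed(0)
z = torch.randn(1_000_000)   # stand-in for a unit-variance pre-activation
r = z.clamp_min(0.)          # ReLU

expected_mean = 1 / sqrt(2 * pi)          # about 0.399
expected_std = sqrt(0.5 - 1 / (2 * pi))   # about 0.584
second_moment = r.pow(2).mean().item()    # ReLU keeps about half of E[z^2]

print(r.mean().item(), expected_mean)
print(r.std().item(), expected_std)
print(second_moment)
```

The second moment dropping to about one half is the reason for the factor of 2 in the ReLU-aware initialization.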

#### [Corrected init taking ReLU into account](https://oreil.ly/-_quA)
We can see that even with this scaling, both the mean and the std have moved after ReLU. The linked paper derives the proper initialization scale taking ReLU into account:
$$
\sqrt{\frac{2}{n}}
$$

In [30]:
x = torch.randn(200,100)
for _ in range(50): x = relu(x @ (torch.randn(100,100) * sqrt(2/100)))
x[0:5, 0:5], x.mean(), x.std()

(tensor([[3.3958, 0.0000, 0.0000, 0.0000, 4.7809],
         [2.0073, 0.1009, 0.0000, 0.0000, 2.5939],
         [2.6231, 0.4898, 0.0000, 0.0000, 3.7671],
         [3.8565, 0.6812, 0.0000, 0.0000, 5.3409],
         [2.8088, 0.0611, 0.0000, 0.0000, 3.6463]]),
 tensor(1.1807),
 tensor(1.8094))
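PyTorch ships an initializer that should produce the same scale, `torch.nn.init.kaiming_normal_`. The sketch below compares it with the manual scaling used above. Note one assumption: the built-in computes the fan-in from the second dimension, matching the `(out_features, in_features)` layout of `nn.Linear`, which is the transpose of the `(inputs, outputs)` layout used in this notebook:

```python
import torch
from math import sqrt

torch.manual_seed(0)
n_in, n_out = 100, 50
target = sqrt(2 / n_in)   # about 0.141

# Manual scaling, in the (inputs, outputs) layout used in this notebook
manual = torch.randn(n_in, n_out) * sqrt(2 / n_in)

# Built-in initializer, in the (outputs, inputs) layout used by nn.Linear
builtin = torch.empty(n_out, n_in)
torch.nn.init.kaiming_normal_(builtin, mode="fan_in", nonlinearity="relu")

print(target, manual.std().item(), builtin.std().item())
```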

That is better than before, so the book continues by creating the model. For further normalization techniques, read [BatchNorm](https://arxiv.org/abs/1502.03167) or [the following article](https://medium.com/biased-algorithms/batch-normalization-alternatives-layernorm-and-instancenorm-52bdf43624b9).

In [31]:
x = torch.randn(200,100)

In [32]:
def model(x):
    l1 = lin(x, w1, b1)
    l2 = relu(l1)
    l3 = lin(l2, w2, b2)
    return l3

In [33]:
out = model(x)
out.shape

torch.Size([200, 1])

Remember MSE (Mean Squared Error):
$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

In [34]:
def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean() # squeeze drops the trailing size-1 dim so output matches targ's shape

In [35]:
loss = mse(out, y)
loss

tensor(2.0987)
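The `squeeze` matters because of broadcasting: subtracting a `(200,)` target from a `(200, 1)` output would silently produce a `(200, 200)` matrix. A sketch with random stand-in tensors that shows the pitfall and compares the hand-written loss with `torch.nn.functional.mse_loss`:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
out = torch.randn(200, 1)   # stand-in for the model output
y = torch.randn(200)        # stand-in for the targets

def mse(output, targ): return (output.squeeze(-1) - targ).pow(2).mean()

loss = mse(out, y)
reference = F.mse_loss(out.squeeze(-1), y)

# Without the squeeze, (200, 1) - (200,) broadcasts to (200, 200)
wrong_shape = (out - y).shape

print(loss, reference, wrong_shape)
```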