# Deep Learning Homework \#01
### Deep Learning Course $\in$ DSSC @ UniTS (Spring 2021)  

#### Submitted by [Emanuele Ballarin](mailto:emanuele@ballarin.cc)  

### Request 1:

Taking inspiration from the notebook [`01-intro-to-pt.ipynb`](https://github.com/ansuini/DSSC_DL_2021/blob/main/labs/01-intro-to-pt.ipynb), build a class for the Multilayer Perceptron (MLP) whose scheme is drawn in the last figure of the notebook. As written there, no layer should have bias units and the activation for each hidden layer should be the Rectified Linear Unit (ReLU) function, also called ramp function. The activation leading to the output layer, instead, should be the softmax function, which prof. Ansuini explained during the last lecture. You can find some notions on it also on the notebook.

#### Preliminaries:

To make things clear, we obtain the specifications of the desired model from the scheme drawn in the [notebook](https://github.com/ansuini/DSSC_DL_2021/blob/main/labs/01-intro-to-pt.ipynb) (via direct counting) and its accompanying text, and summarize them below.

The desired model should:
- Be a *MultiLayer Perceptron* (*MLP*, a.k.a. *Fully-Connected*, a.k.a. *Dense* block);
- Be composed of *biasless* units;
- Take as input $5$ scalars;
- Return as output $4$ scalars;
- Have *hidden layers* of sizes (in *input-to-output* order): $11$, $16$, $13$, $8$;
- Have *ReLU* *activation function* for *hidden layers*;
- Have the *SoftMax* function as *output function*.

#### The imports *of the day*:

In [1]:
# The usual stuff
import numpy as np  # Just to force-load MKL (if available)
import torch as th
import torch.nn as nn
import torch.nn.functional as F


In [2]:
# The extra stuff (I am not forcing you to do it; uncomment if willing!)

#!pip install --upgrade --no-deps --force --force-reinstall git+https://github.com/TylerYep/torchinfo.git
import torchinfo as thinfo


In [3]:
# The crazy stuff (I am not forcing you to do it; uncomment if willing!)

#!pip install --upgrade --no-deps --force --force-reinstall git+https://github.com/emaballarin/ebtorch.git
from ebtorch.nn import FCBlock

# No problem if it fails: it just won't run the crazy stuff... :)


#### The *standard* solution:

In [4]:
# Define:
class myMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Layers:
        self.layer1 = nn.Linear(in_features=5, out_features=11, bias=False)
        self.layer2 = nn.Linear(in_features=11, out_features=16, bias=False)
        self.layer3 = nn.Linear(in_features=16, out_features=13, bias=False)
        self.layer4 = nn.Linear(in_features=13, out_features=8, bias=False)
        self.layer5 = nn.Linear(in_features=8, out_features=4, bias=False)
        # Stateful functions:
        # Since ReLU and SoftMax are stateless, no cruft here!

    def forward(self, x):
        # A "layer with elementwise nonlinearity" block
        # <- from here...
        x = self.layer1(x)
        x = F.relu(x)
        # <- ...to here.
        x = self.layer2(x)
        x = F.relu(x)
        x = self.layer3(x)
        x = F.relu(x)
        x = self.layer4(x)
        x = F.relu(x)
        # The "pre-output layer": a linear layer with SoftMax afterwards
        x = self.layer5(x)
        x = F.softmax(x, dim=1)
        return x


# Instantiate:
mymodel_class = myMLP()
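
As a quick sanity check on the architecture above, the following sketch builds a stand-in network with the same layer sizes and verifies that a batch of inputs is mapped to rows of probabilities. The names `sizes`, `standin` and `batch` are illustrative and not part of the original notebook. Note that `dim=1` in the SoftMax assumes a batched input of shape `(batch, 5)`; an unbatched 1-D input would need `dim=0` or `dim=-1` instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random weights and inputs

# Stand-in with the same layer sizes as the model above (biasless, ReLU, SoftMax).
sizes = [5, 11, 16, 13, 8, 4]
layers = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    layers += [nn.Linear(n_in, n_out, bias=False), nn.ReLU()]
layers[-1] = nn.Softmax(dim=1)  # the last nonlinearity is SoftMax, not ReLU
standin = nn.Sequential(*layers)

batch = torch.randn(3, 5)  # 3 examples, 5 input scalars each
with torch.no_grad():
    out = standin(batch)

out_shape = tuple(out.shape)
row_sums = out.sum(dim=1)
print(out_shape)  # one row of 4 probabilities per example
print(row_sums)   # each entry should be numerically close to 1
```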


#### The *not a class, but I am lazy* solution (a.k.a. `nn.Sequential`)

In [5]:
# Define and Instantiate:
mymodel_seq = nn.Sequential(
    nn.Linear(in_features=5, out_features=11, bias=False),  # H1
    nn.ReLU(),
    nn.Linear(in_features=11, out_features=16, bias=False),  # H2
    nn.ReLU(),
    nn.Linear(in_features=16, out_features=13, bias=False),  # H3
    nn.ReLU(),
    nn.Linear(in_features=13, out_features=8, bias=False),  # H4
    nn.ReLU(),
    nn.Linear(in_features=8, out_features=4, bias=False),  # H5 (pre-output)
    nn.Softmax(dim=1),
)


#### The *crazy* solution (a.k.a. *parameterized FC block*)

In [6]:
# Define and Instantiate (with a proper class under the hood):
# (cfr.: https://github.com/emaballarin/ebtorch/blob/main/ebtorch/nn/architectures.py#L32)
mymodel_fpfcb = FCBlock(
    fin=5,
    hsizes=[11, 16, 13, 8],
    fout=4,
    hactiv=F.relu,
    oactiv=lambda x: F.softmax(x, dim=1),
    bias=False,
)

# WLOG, the instantiation call also automatically supports the use of lists for `hactiv` and `bias`
# (with integrated size-checking) to enable per-layer specifications
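
For readers who do not want to install `ebtorch`, the idea of a parameterized fully-connected block can be sketched as follows. This is a hypothetical, simplified stand-in written for illustration: the class name `TinyFCBlock` and its internals are assumptions, it only mirrors the call arguments used above, and it is not the `ebtorch` implementation (it does not support per-layer lists, for instance).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyFCBlock(nn.Module):
    """Hypothetical, simplified stand-in; NOT the ebtorch implementation."""

    def __init__(self, fin, hsizes, fout, hactiv, oactiv, bias=True):
        super().__init__()
        sizes = [fin, *hsizes, fout]
        # One Linear per consecutive (input size, output size) pair.
        self.layers = nn.ModuleList(
            [nn.Linear(n_in, n_out, bias=bias) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
        )
        self.hactiv = hactiv  # activation applied after every hidden layer
        self.oactiv = oactiv  # activation applied after the last layer

    def forward(self, x):
        for layer in self.layers[:-1]:
            x = self.hactiv(layer(x))
        return self.oactiv(self.layers[-1](x))


tiny = TinyFCBlock(
    fin=5,
    hsizes=[11, 16, 13, 8],
    fout=4,
    hactiv=F.relu,
    oactiv=lambda x: F.softmax(x, dim=1),
    bias=False,
)
n_params = sum(p.numel() for p in tiny.parameters())
out = tiny(torch.randn(2, 5))
print(n_params, tuple(out.shape))
```

Building the layers from a list of sizes keeps the architecture in one place: changing the depth or width only requires editing `hsizes`.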


### Request 2:

After having defined the class, create an instance of it and print a summary using a method of your choice.

#### Solution:

Classes have been instantiated right after their definition (or, as in some approaches of the proposed solution, just directly).

As far as *model summarization* is concerned, we will use [`torchinfo`](https://github.com/TylerYep/torchinfo), the *maintained*, *properly-coded* successor (originally a fork) of [`torchsummary`](https://github.com/sksq96/pytorch-summary).

In [7]:
thinfo.summary(mymodel_class)


Layer (type:depth-idx)                   Param #
├─Linear: 1-1                            55
├─Linear: 1-2                            176
├─Linear: 1-3                            208
├─Linear: 1-4                            104
├─Linear: 1-5                            32
Total params: 575
Trainable params: 575
Non-trainable params: 0

In [8]:
thinfo.summary(mymodel_seq)


Layer (type:depth-idx)                   Param #
├─Linear: 1-1                            55
├─ReLU: 1-2                              --
├─Linear: 1-3                            176
├─ReLU: 1-4                              --
├─Linear: 1-5                            208
├─ReLU: 1-6                              --
├─Linear: 1-7                            104
├─ReLU: 1-8                              --
├─Linear: 1-9                            32
├─Softmax: 1-10                          --
Total params: 575
Trainable params: 575
Non-trainable params: 0

In [9]:
thinfo.summary(mymodel_fpfcb)


Layer (type:depth-idx)                   Param #
├─ModuleList: 1-1                        --
|    └─Linear: 2-1                       55
|    └─Linear: 2-2                       176
|    └─Linear: 2-3                       208
|    └─Linear: 2-4                       104
|    └─Linear: 2-5                       32
Total params: 575
Trainable params: 575
Non-trainable params: 0

#### A brief comment:

As we can see from the output of `torchinfo`, all three *instantiated models* exhibit the same learnable structures. Exact differences in output can be explained as follows:

- The summary of the *class-instantiated* model just lists *`Linear` layers*, since they are the only portion of the model defined as *class members*. The nonlinear transformations are obtained as *pure function* calls in `forward(x)`;

- The summary of the `Sequential` model lists both *`Linear` layers* and *activation functions*, since - by definition - it uses *function-objects* to implement nonlinearities;

- As far as the summary of the model obtained via `ebtorch.nn.FCBlock` is concerned, the same as in the case of the *class-instantiated* model applies. Additionally, all *stateful* class members (e.g. *`Linear` layers*) are wrapped inside a `ModuleList`, whose elements are created programmatically from *call-arguments*.
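
The distinction between *class members* and *pure function* calls can be observed directly in PyTorch, independently of any summary tool. The sketch below uses two toy modules (`FunctionalAct` and `ModuleAct`, names made up for this example) and lists their registered submodules; it is consistent with the summaries above, assuming that a summary tool only sees registered submodules.

```python
import torch.nn as nn
import torch.nn.functional as F


class FunctionalAct(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(5, 4, bias=False)

    def forward(self, x):
        return F.relu(self.lin(x))  # plain function call: never registered


class ModuleAct(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(5, 4, bias=False)
        self.act = nn.ReLU()  # attribute assignment registers a submodule

    def forward(self, x):
        return self.act(self.lin(x))


functional_children = [type(m).__name__ for m in FunctionalAct().children()]
module_children = [type(m).__name__ for m in ModuleAct().children()]
print(functional_children)  # ['Linear']
print(module_children)      # ['Linear', 'ReLU']
```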

### Request 3:

- Provide detailed calculations (layer-by-layer) on the exact number of parameters in the network.
- Provide the same calculation in the case that the bias units are present in all layers (except input).

#### Preliminaries:

In order to concisely compute the number of learnable parameters *layer-by-layer* for our specific model, we need to notice first that:

- In the case of scalar-neurons *FC* layers, the number of learnable parameters is the number of scalars parameterizing the affine transformation of the output vector from the previous layer (i.e. that vector whose coordinates are the outputs of single scalar-neurons in the previous layer);

- This affine transformation is parameterized by an $n \times n'$ matrix ($\mathbf{W}$) and an $n'$-dimensional vector ($\mathbf{b}$), with $n$ and $n'$ respectively the layer-specific *input* and *desired output* sizes. Put another way: $n'$ is the number of units of the layer, whereas $n$ is the number of units of the previous one (again, in the scalar-neurons *FC* layer case);

- Of course, if the *layer* to be considered is *biasless*, the vector is fixed as $\mathbf{b} = \mathbf{0}$, and only the matrix $\mathbf{W}$ accounts for the number of learnable parameters. In the case of *layers with bias*, instead, the $\mathbf{b}$ vector has to be considered too;

- Risking overzealousness, we explicitly acknowledge that, in our specific case:
    - neither the *ReLU* nor the *SoftMax* activation function carries learnable parameters, unlike e.g. [PReLU](https://arxiv.org/abs/1502.01852)s;
    - the use of the term *scalar neuron* explicitly dispels ambiguity with recently proposed *ANN*s with [complex-valued](https://arxiv.org/abs/2101.12249) or even [quaternion-valued](https://sci-hub.do/10.1007/s10462-019-09752-1) neurons.
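
To make the counting concrete, the affine map described above can be written out explicitly, keeping the same convention ($\mathbf{W}$ of shape $n \times n'$):

$$
\mathbf{y} = \mathbf{W}^{\top}\mathbf{x} + \mathbf{b},
\qquad
\mathbf{x} \in \mathbb{R}^{n},\quad
\mathbf{W} \in \mathbb{R}^{n \times n'},\quad
\mathbf{b}, \mathbf{y} \in \mathbb{R}^{n'}
$$

For the first hidden layer ($n = 5$, $n' = 11$), $\mathbf{W}$ has $5 \times 11 = 55$ entries and $\mathbf{b}$, when present, has $11$. Note that `nn.Linear` stores the weight with shape `(out_features, in_features)`, i.e. $n' \times n$; this is the transpose of the matrix above and has the same number of entries.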

#### Abstract solution:

This directly leads to the number of *per-layer* learnable parameters being:
- $n \times n'$ (*biasless* case)
- $(n + 1) \times n'$ (*biased* case)

with the sole exception of the *input layer* (i.e. the one composed by *input units*), which carries no learnable parameters by definition.

It also immediately follows that the whole-network number of *learnable parameters* is:
- $\sum_{i=1}^K {{n'}_{i-1} \times {n'}_i}$ (*biasless* case)
- $\sum_{i=1}^K {({n'}_{i-1}+1) \times {n'}_i}$ (*biased* case)

where, as always, ${n'}_j$ denotes the number of units in the $j^{\text{th}}$ layer (being the $0^{\text{th}}$ the *input* and the $K^{\text{th}}$ the *output* layers).
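
The two sums above translate almost literally into code. The sketch below is a plain-Python illustration on a toy network with layer sizes $3$, $4$, $2$ (the function name `count_fc_params` and the toy sizes are chosen for this example only).

```python
def count_fc_params(sizes, bias):
    """Learnable parameters of an FC network with the given layer sizes.

    `sizes` lists the units per layer, input layer first, output layer last.
    """
    extra = 1 if bias else 0  # one additional bias scalar per unit, if any
    return sum(
        (n_prev + extra) * n_curr
        for n_prev, n_curr in zip(sizes[:-1], sizes[1:])
    )


toy_sizes = [3, 4, 2]
toy_biasless = count_fc_params(toy_sizes, bias=False)  # 3*4 + 4*2
toy_biased = count_fc_params(toy_sizes, bias=True)     # (3+1)*4 + (4+1)*2
print(toy_biasless, toy_biased)  # 20 26
```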


#### Putting the numbers in:

Legend:  
*layer #.* parameters in the *biasless* case | parameters in the *biased* case

0. $\text{        }$ None $\text{        }$ $|$ $\text{        }$ None $\text{        }$  (*input layer*)
1. $\text{        }5 \times 11 = 55 \text{        }$ $|$ $\text{        }(5+1) \times 11 = 66$
2. $\text{        }11 \times 16 = 176 \text{        }$ $|$ $\text{        }(11+1) \times 16 = 192$
3. $\text{        }16 \times 13 = 208 \text{        }$ $|$ $\text{        }(16+1) \times 13 = 221$
4. $\text{        }13 \times 8 = 104 \text{        }$ $|$ $\text{        }(13+1) \times 8 = 112$
5. $\text{        }8 \times 4 = 32 \text{        }$ $|$ $\text{        }(8+1) \times 4 = 36$

Which finally give the *whole-network* results:
- $55+176+208+104+32 = 575$ (*biasless* case)
- $66+192+221+112+36 = 627$ (*biased* case)
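
These totals can be cross-checked empirically by letting PyTorch count the parameters. The helper `build_mlp` below is an illustrative assumption, not part of the original notebook; it stacks only the `Linear` layers, since the activations carry no parameters.

```python
import torch.nn as nn


def build_mlp(sizes, bias):
    """Stack of Linear layers only; activations carry no parameters."""
    return nn.Sequential(
        *[nn.Linear(n_in, n_out, bias=bias) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    )


sizes = [5, 11, 16, 13, 8, 4]
biasless_total = sum(p.numel() for p in build_mlp(sizes, bias=False).parameters())
biased_total = sum(p.numel() for p in build_mlp(sizes, bias=True).parameters())
print(biasless_total, biased_total)  # 575 627
```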

### Request 4:

For each layer within the MLP, calculate the L2 norm and L1 norm of its parameters.
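
The solution cell below reports four quantities per layer: two *entrywise* norms (the weight matrix is treated as one long vector) and two *induced* matrix norms. As a small illustration of how they differ, the sketch below evaluates all four on a hand-picked $3 \times 2$ matrix (the matrix `W` is an arbitrary example, not a weight of the model) for which every value can be checked by hand.

```python
import torch

# Columns are orthogonal, so the singular values are the column norms: sqrt(8) and 3.
W = torch.tensor([[2.0, 0.0],
                  [2.0, 0.0],
                  [0.0, 3.0]])

# Entrywise norms: treat W as one long vector.
ew_l2 = torch.linalg.norm(W, ord="fro")        # sqrt(4 + 4 + 9) = sqrt(17)
ew_l1 = torch.linalg.norm(W.flatten(), ord=1)  # 2 + 2 + 3 = 7

# Induced (operator) norms: properties of W as a linear map.
mat_2 = torch.linalg.norm(W, ord=2)  # largest singular value = 3
mat_1 = torch.linalg.norm(W, ord=1)  # largest absolute column sum = max(4, 3) = 4

print(ew_l2.item(), ew_l1.item(), mat_2.item(), mat_1.item())
```

For a request phrased as "the L2 norm and L1 norm of its parameters", the entrywise versions are arguably the more natural reading; the matrix norms are reported alongside for completeness.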

In [10]:
with th.no_grad():

    for model in [mymodel_class, mymodel_seq, mymodel_fpfcb]:
        print("MODEL: ", model, "\n")
        for lay_n, lay_params in enumerate(model.parameters()):
            print(
                "Layer ",
                lay_n + 1,
                ": L2 vector-equivalent norm (Frobenius; EWL2): ",
                th.linalg.norm(lay_params, ord="fro").item(),
            )
            print(
                "      ",
                "  ",
                " L1 vector-equivalent norm (EWL1): ",
                th.linalg.norm(lay_params.flatten(), ord=1).item(),
            )
            print(
                "      ",
                "  ",
                " Matrix 2-norm: ",
                th.linalg.norm(lay_params, ord=2).item(),
            )
            print(
                "      ",
                "  ",
                " Matrix 1-norm: ",
                th.linalg.norm(lay_params, ord=1).item(),
            )
            print("\n")
        print("\n\n")


MODEL:  myMLP(
  (layer1): Linear(in_features=5, out_features=11, bias=False)
  (layer2): Linear(in_features=11, out_features=16, bias=False)
  (layer3): Linear(in_features=16, out_features=13, bias=False)
  (layer4): Linear(in_features=13, out_features=8, bias=False)
  (layer5): Linear(in_features=8, out_features=4, bias=False)
) 

Layer  1 : L2 vector-equivalent norm (Frobenius; EWL2):  1.854434609413147
           L1 vector-equivalent norm (EWL1):  11.770538330078125
           Matrix 2-norm:  1.0609813928604126
           Matrix 1-norm:  3.0345864295959473


Layer  2 : L2 vector-equivalent norm (Frobenius; EWL2):  2.3325161933898926
           L1 vector-equivalent norm (EWL1):  26.914155960083008
           Matrix 2-norm:  1.152938961982727
           Matrix 1-norm:  2.9411096572875977


Layer  3 : L2 vector-equivalent norm (Frobenius; EWL2):  2.1106956005096436
           L1 vector-equivalent norm (EWL1):  27.313369750976562
           Matrix 2-norm:  0.9185633659362793
          