# Deep Learning Course - LAB 1

## Intro to PyTorch

PyTorch (PT) is a Python (and C++) library for Machine Learning (ML) particularly suited for Neural Networks and their applications.

Its great selection of built-in modules, models, functions, CUDA capability, tensor arithmetic support and automatic differentiation functionality make it one of the most used scientific libraries for Deep Learning.

Note: for this series of labs, we advise installing Python >= 3.7.

### Installing PyTorch

We advise installing PyTorch by following the directions given on its [home page](https://pytorch.org/get-started/locally/). Just typing `pip install torch` may not be the correct action, as you have to take into account the compatibility with `cuda`. If you have `cuda` installed, you can find your version by typing `nvcc --version` in a terminal (Linux/macOS).

If you're using Windows, we suggest installing Anaconda first and then installing PyTorch from the `anaconda prompt` software via `conda` (preferably) or `pip`.

If you're using Google Colab, all the libraries needed to follow this lecture should be pre-installed there.

Let us now see how to work with Colab.

### For Colab users

Google Colab is a handy tool that we suggest you use for this course -- especially if your laptop does not support CUDA or has limited hardware capabilities. Anyway, note that **we'll try to avoid GPU code as much as possible**.

Essentially, Colab gives you access to a virtual machine with limited hardware and disk space, on which you can run your code within a given time window. You can even request a GPU (although, if you use it too much, you may have to wait a long time before one becomes available).

#### Your (maybe) first Colab commands

Colab offers a Jupyter-style notebook interface with a few tweaks.

For instance, you may run (some) bash commands from here by prepending `!` to your code.

This makes it very easy to operate your virtual machine without the need for a terminal. 

#### File transfer on Colab

One of the most intricate actions in Colab is file transfer. Since your files reside on the virtual machine, there are two main ways to transfer files on Colab: the `files.upload()` / `files.download()` helpers and mounting your Google Drive.

* `files.download()` / `.upload()`

In [None]:
from google.colab import files

files.upload()


In [None]:
files.download("sample_data/README.md")


It may be much more convenient, however, to connect your Google Drive to Colab. Here is a snippet that lets you do this.

In [None]:
from google.colab import drive

folder_mount = "/content/drive"  # Your Drive will be mounted on top of this path

drive.mount(folder_mount)


### Dive into PyTorch - connections with NumPy

Like NumPy, PyTorch provides its own multidimensional array class, called `Tensor`. `Tensor`s are essentially the equivalent of NumPy `ndarray`s.
If we wish to operate a very superficial comparison between `Tensor` and `ndarray`, we can say that:
* `Tensor` draws a lot of methods from NumPy, although it's missing some (see [this GitHub issue](https://github.com/pytorch/pytorch/issues/50344) if you're interested).
* `Tensor` is more OO than `ndarray` and solves some inconsistencies within NumPy
* `Tensor` has CUDA support

In [None]:
import torch
import numpy as np

# create custom Tensor and ndarray
x = torch.Tensor([[1, 5, 4], [3, 2, 1]])
y = np.array([[1, 5, 4], [3, 2, 1]])


def pretty_print(obj, title=None):
    if title is not None:
        print(title)
    print(obj)
    print("\n")


pretty_print(x, "x")
pretty_print(y, "y")


What are the data types of these objects?

In [None]:
x.dtype, y.dtype


`torch` already thinks with Machine Learning in mind as the `Tensor` is implicitly converted to `dtype float32`, while NumPy makes no such assumption.

For more info on `Tensor` data types, please check the beginning of [this page](https://pytorch.org/docs/stable/tensors.html).
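
As a small illustration of the point above (a sketch that is not part of the original lab): the lowercase factory function `torch.tensor` infers the data type from the data, much like `np.array`, while the `torch.Tensor` constructor always produces the default floating-point type. The variable names are chosen here purely for illustration.

```python
import torch

data = [[1, 5, 4], [3, 2, 1]]

as_default = torch.Tensor(data)   # class constructor: default dtype (float32)
as_inferred = torch.tensor(data)  # factory function: dtype inferred from the data
as_explicit = torch.tensor(data, dtype=torch.float64)  # dtype chosen by hand

print(as_default.dtype, as_inferred.dtype, as_explicit.dtype)
```

If you need integers (e.g. for class labels), `torch.tensor` or an explicit `dtype` is therefore the safer choice.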

As in NumPy, we can use the `.shape` attribute to get the shape of the structures. Moreover, `Tensor`s also have the `.size()` method, which is analogous to `.shape`.

In [None]:
x.shape, y.shape, x.size()


Notice how a `Tensor` shape is **not** a plain tuple: it is a `torch.Size` object (a subclass of `tuple`).

We can also create a random `Tensor` analogously to NumPy.

A `2 × 3 × 3` `Tensor` can be seen as two `3 × 3` matrices stacked together, or as a "cubic matrix".

![](img/tensors.jpg)

In [None]:
x = torch.rand([2, 3, 3])
x


In [None]:
y = np.random.rand(2, 3, 3)
y


We can get the total number of elements in a `Tensor` via the `numel()` method

In [None]:
x.numel()


We can get the memory occupied by each element of a `Tensor` via `element_size()`

In [None]:
x.element_size()


Hence, we can quickly calculate the size of the `Tensor` within the RAM

In [None]:
x.numel() * x.element_size()


#### Slicing a `Tensor`

You can slice a `Tensor` (*i.e.*, extract a substructure of a `Tensor`) as in NumPy using the square brackets:

In [None]:
# extract first element (i.e., matrix) of first dimension
pretty_print(x[0], "Slice first element (x[0])")

# extract a specific element
pretty_print(x[1, 2, 0], "Slice element at (1, 2, 0) (x[1, 2, 0])")

# extract first element of second dimension (":" means all the elements of the given dim)
pretty_print(x[:, 0], "Slice first element of second dim (x[:, 0])")

# note that it is equivalent to
pretty_print(x[:, 0, :], "As above (x[:, 0] equivalent to x[:, 0, :])")

# extract range of dimensions (first and second element of third dim)
pretty_print(x[:, :, 0:2], "Slice first and second el of third dim (x[:, :, 0:2])")

# note that it is equivalent to (i.e., you can also pass list for slicing, as opposed to Py vanilla lists/tuples)
pretty_print(x[:, :, (0, 1)], "As above (x[:, :, 0:2] equivalent to x[:, :, (0, 1)])")


In PT, you can also slice by interval via the "double colon" notation `from:to:step`, where the element at index `to` is excluded. Note that `::3` means "take all elements of the object by step of 3, starting from 0, until the object ends".
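
A minimal sketch of the notation on a small integer `Tensor` (built here with `torch.arange`, which is an assumption of this example rather than something the lab prescribes):

```python
import torch

t = torch.arange(11)  # tensor([0, 1, ..., 10])

print(t[0:7:3])  # indices 0, 3, 6 (index 7 is excluded)
print(t[::3])    # indices 0, 3, 6, 9
print(t[1::2])   # every second element, starting from index 1
```

Unlike Python lists and NumPy arrays, `Tensor`s do not accept a negative step; `torch.flip` can be used to reverse a dimension instead.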

In [None]:
# torch.range is deprecated: torch.arange excludes the end point, so we pass 11
torch.arange(0, 11)[0:7:3]


#### `Tensor` supports linear algebra

In [None]:
z1 = torch.rand([4, 5])
print("z1")
print("shape", z1.shape)
print(z1)

# transposition
z2 = z1.T

print("\nz2")
print("shape", z2.shape)
print(z2)


In [None]:
# matrix multiplication
pretty_print(z1 @ z2, "Matrix multiplication: with '@'")

# equivalent to
pretty_print(torch.matmul(z1, z2), "Matrix multiplication: with torch.matmul")

# and also
pretty_print(z1.matmul(z2), "Matrix multiplication: with Tensor.matmul")


Note that `@` identifies the matrix product.

Don't mistake `@` and `*` as the latter is the Hadamard (element-by-element) product!

In [None]:
z1 * z2  # this gives an Exception


In [None]:
z1 * z1


Generally, the "regular" arithmetic operators for Python act as element-wise operators in `Tensor`s (as in `ndarrays`)

In [None]:
z1 ** 2  # Equivalent to above


In [None]:
z3 = torch.Tensor(
    [[1, 2, 3, 4, 7], [0.2, 2, 4, 5, 3], [-1, 3, -4, 2, 2], [1, 1, 1, 1, 2]]
)
pretty_print(z1 % z3, "z1 % z3 (remainder of integer division)")
pretty_print(z3 // z1, "z3 // z1 (integer division)")  # integer division
z3 /= z1
pretty_print(z3, "in-place tensor division (z3 /= z1)")


As with `ndarray`s, arithmetic operations on `Tensor`s support **broadcasting**. Roughly speaking, when an operator is applied to two (or more) `Tensor`s with different shapes, PT will try to find a way to make these objects "compatible" for the operation.

Of course, broadcasting is not always possible. As a rule of thumb, the shapes are compared starting from the last dimension: broadcasting works if, for each pair of dimensions, the sizes are equal or one of them is 1 (a missing dimension counts as 1).
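
The rule of thumb can be sketched in a few lines of plain Python. The helper `broadcast_shape` below is hypothetical (written only for this illustration, it is not a PT function) and follows the usual NumPy-style convention of aligning the shapes from the last dimension.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the shape obtained by broadcasting, or raise ValueError."""
    result = []
    # Walk the two shapes from the last dimension backwards;
    # a missing dimension behaves like a dimension of size 1.
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        dim_a = shape_a[-i] if i <= len(shape_a) else 1
        dim_b = shape_b[-i] if i <= len(shape_b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            raise ValueError(f"shapes {shape_a} and {shape_b} cannot be broadcast")
        result.append(dim_a if dim_b == 1 else dim_b)
    return tuple(reversed(result))


print(broadcast_shape((4, 5), (5,)))    # matrix with a 5-element vector
print(broadcast_shape((4, 5), (4, 1)))  # matrix with a 4 x 1 column vector
```

Under this rule a `4 x 5` matrix cannot be combined with a plain 4-element vector (`broadcast_shape((4, 5), (4,))` raises an error), which is why such a vector must first be turned into a `4 x 1` column.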

In [None]:
small_vector_5 = torch.Tensor(
    [1, 2, 3, 5, 2]
)  # this is treated as a row vector (1 x 5 matrix)
print("small_vector_5:", small_vector_5, "; Shape:", small_vector_5.shape, "\n")

pretty_print(z1 / small_vector_5, "Broadcasting: dividing matrix by row vector")

small_vector_4 = torch.Tensor([4, 2, 3, 1])
small_vector_4 = small_vector_4.unsqueeze(
    -1
)  # this operation "transposes" the vector into a column vector (4 x 1 matrix)
print("small_vector_4:\n", small_vector_4, "\nShape:", small_vector_4.shape, "\n")

pretty_print(z1 / small_vector_4, "Broadcasting: dividing matrix by column vector")


In [None]:
torch.Tensor([1, 2, 3]) == torch.Tensor(
    [[1, 2, 3]]
)  # single-dim Tensors are also row vectors


#### Reshaping and permuting

Sometimes it may be necessary to reshape the tensors to apply some specific operators.

Take the example of RGB images: they can be seen as `3 x h x w` tensors, where `h` is the height and `w` the width.

![](img/image_tensor.png)

Sometimes, it may be necessary to "flatten" the three matrices into vectors, thus obtaining a `3 x hw` tensor.

In [None]:
image = torch.load("data/img.pt")
image.shape


This flattening may be achieved via the `reshape` method.

In [None]:
image_reshaped = image.reshape(3, 243 * 880)
pretty_print(image_reshaped.shape, "shape of image_reshaped")


We can alternatively use the `view` method...

In [None]:
image_view = image.view(3, 243 * 880)
pretty_print(image_view.shape, "shape of image_view")


**Q**: what is the difference between `reshape` and `view`?

Some libraries encode images as `h x w x 3` tensors instead of `3 x h x w`.

To convert between these two formats, one may be tempted to `reshape` or `view` the tensor: after all, they have the same number of elements.



In [None]:
from matplotlib import pyplot as plt

image2 = image.reshape(243, 880, 3)

plt.imshow(image2)
plt.show()


That does not seem to work though: `reshape` keeps the elements in the same flat order and merely regroups them, so the pixel values end up at the wrong coordinates.

To get the right result, we need to use `permute`, which reorders the dimensions themselves, so that each element keeps its position along every original dimension.
We need to pass the new order of the dimensions to it.
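
The difference is easier to see on a tiny tensor than on a real picture. The sketch below uses a made-up `2 x 2 x 3` "image" (2 channels, 2 rows, 3 columns) filled with consecutive integers; the names are illustrative only.

```python
import torch

# A tiny "image" with 2 channels, 2 rows and 3 columns (c x h x w).
img = torch.arange(12).reshape(2, 2, 3)

reshaped = img.reshape(2, 3, 2)  # same flat order, merely regrouped
permuted = img.permute(1, 2, 0)  # axes swapped: (c, h, w) -> (h, w, c)

# Both results have the same shape...
print(reshaped.shape, permuted.shape)
# ...but only permute keeps each value attached to its (channel, row, column).
print(permuted[1, 2, 0] == img[0, 1, 2])  # tensor(True)
print(torch.equal(reshaped, permuted))    # False
# permute returns a view with rearranged strides; no data is copied.
print(permuted.is_contiguous())           # False
```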

In [None]:
image3 = image.permute(
    1, 2, 0
)  # 1,2,0 --> old dim1 goes first, old dim2 goes second, dim0 goes last
# can also do image.permute(-2,-1,0) -- works with negative indices as well
plt.imshow(image3)
plt.show()


We already saw a case of incompatible `Tensor`s above. Which one is it?

Some more linear algebra...


In [None]:
z3_norm = z3.norm(2)
pretty_print(z3_norm, "Tensor norm")
pretty_print(np.linalg.norm(y), "ndarray norm")  # notice how torch is more OO


Notice how methods reducing `Tensor`s to scalars still return zero-dimensional `Tensor`s (be wary of this feature when scripting something in PT).

To "disentangle" the scalar from a `Tensor` use the `.item()` method.

In [None]:
z3_norm.item()


Note that, as in NumPy, PT supports applying `Tensor` operators on a subset of the dimensions.

For example, given a `3x4x4 Tensor`, we might want to calculate the norm of each of the three `4x4` matrices composing it. We must hence specify to the `norm` method the dimensions on which we want it to operate the reduction:

In [None]:
z4 = torch.rand((3, 4, 4))
pretty_print(
    z4.norm(dim=(1, 2)), "Norm of the three matrices composing z4 -- z4.norm(dim=(1,2))"
)


As expected, the result is a one-dimensional `Tensor` with 3 elements, holding the norm of each of the matrices.
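
To convince ourselves that the `dim` argument does what we expect, we can compare it with an explicit loop. This is only a sketch on a small, deterministic tensor chosen for this example.

```python
import torch

t = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)

# Reduce over the last two dimensions in one call...
norms_vectorised = t.norm(dim=(1, 2))
# ...or loop explicitly over the first dimension.
norms_looped = torch.stack([matrix.norm() for matrix in t])

print(norms_vectorised.shape)
print(torch.allclose(norms_vectorised, norms_looped))
```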

We can notice this behaviour in other `Tensor` operators applying a reduction:

In [None]:
print(z4)
print(z4[0, :, 0])
print(z4[0, :, 0].sum())


In [None]:
pretty_print(
    z4.sum(dim=0),
    "z4.sum(dim=0) -- Sum of the three matrices composing the tensor -- is a 4x4 matrix",
)

pretty_print(
    z4.prod(dim=1),
    "z4.prod(dim=1) -- Product of the columns of each matrix composing the tensor",
)


In [None]:
pretty_print(
    z4[0, :, 0],
    "This is how the first element of z4.prod(dim=1) is obtained. We fix the 1st and 3rd dim to `0` ...",
)
pretty_print(z4[0, :, 0].prod(), "... and we operate the prod on it")
print(
    "The other elements are obtained by looping through all of the other possible 1st and 3rd dimension indices. For instance, the element `2,3` of the reduction is obtained as\nz4[2, :, 3].prod().\n"
)
pretty_print(z4[2, :, 3], "z4[2, : ,3]")
pretty_print(z4[2, :, 3].prod(), "z4[2, : ,3].prod()")


#### Seamless conversion from NumPy to PT

In [None]:
y_torch = torch.from_numpy(y)
pretty_print(y_torch, "y converted to torch.Tensor")

x_numpy = x.numpy()
pretty_print(x_numpy, "x converted to numpy.ndarray")

# Note that NumPy implicitly converts Tensor to ndarray whenever it can; the same doesn't happen for PT
pretty_print(np.linalg.norm(x), "Example of implicit conversion Tensor → ndarray")


#### Stochastic functionalities

We can render the (pseudo)random number generator deterministic by calling `torch.manual_seed(integer)`.

This works for both CPU and CUDA RNG calls.
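
A minimal sketch of what "deterministic" means here: resetting the seed restarts the random sequence, so the same draws are produced again. The seed value `0` is arbitrary.

```python
import torch

torch.manual_seed(0)
first_draw = torch.rand(3)

torch.manual_seed(0)  # resetting the seed restarts the sequence
second_draw = torch.rand(3)

third_draw = torch.rand(3)  # no reseed: the sequence simply continues

print(torch.equal(first_draw, second_draw))
print(torch.equal(second_draw, third_draw))
```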

In [None]:
torch.manual_seed(123456)
print("...from now on our random tensor should be the same...")


In [None]:
pretty_print(torch.randperm(10), "(randperm) Random permutation of 0:9")

pretty_print(
    torch.rand_like(z1), "(rand_like) Create a random Tensor with the same shape as z1"
)

pretty_print(
    torch.randint(10, (3, 3)), "(randint) Like rand, but with integers from 0 to 9"
)

pretty_print(
    torch.normal(0, 1, (3, 3)), "(normal) Sampling 3x3 iid scalars from N(0,1)"
)

pretty_print(
    torch.normal(
        torch.Tensor([[1, 2, 3], [4, 5, 6], [0, 0, 0]]),
        torch.Tensor([[1, 0.5, 0.9], [0.5, 1, 0.1], [3, 4, 1]]),
    ),
    "Sampling from 9 normals with different means and std into a (3x3) Tensor",
)


### Using GPUs

Most `torch.Tensor` methods support GPU computation via built-in CUDA wrappers.

Just transfer the involved `Tensor`s to CUDA and let the magic happen :)

In [None]:
# check if cuda is available on this machine
torch.cuda.is_available()

has_cuda_gpu = torch.cuda.is_available()


In [None]:
if has_cuda_gpu:
    dim = 10000
    large_cpu_matrix = torch.rand((dim, dim))
    large_gpu_matrix = large_cpu_matrix.to(
        "cuda"
    )  # Can also specify "cuda:gpu_id" if multiple GPUs
    # alternatively, you may also call large_cpu_matrix.cuda() or large_cpu_matrix.cuda(0)
else:
    print(
        "Sorry, this part of the notebook is inaccessible since it seems you don't have a CUDA-capable GPU on your device :/"
    )


In [None]:
# the matrices only exist if the previous cell found a CUDA-capable GPU
if has_cuda_gpu:
    pretty_print(large_cpu_matrix.device, "Device of large_cpu_matrix")
    pretty_print(large_gpu_matrix.device, "Device of large_gpu_matrix")
    pretty_print(
        large_gpu_matrix,
        "If a tensor is not on CPU, the device will also be printed if you print the tensor itself",
    )


In [None]:
if has_cuda_gpu:
    import timeit

    # NOTE: please fix this number w.r.t. your GPU and CPU
    repetitions = 100

    print(
        "Norm of large cpu matrix. Time:",
        timeit.timeit("large_cpu_matrix.norm()", number=repetitions, globals=locals()),
    )
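    # CUDA kernels run asynchronously: synchronize so that the timing covers the whole computation
    print(
        "Norm of large gpu matrix. Time:",
        timeit.timeit(
            "large_gpu_matrix.norm(); torch.cuda.synchronize()",
            number=repetitions,
            globals=locals(),
        ),
    )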
else:
    print(
        "Sorry, this part of the notebook is inaccessible since it seems you don't have a CUDA-capable GPU on your device :/"
    )


Captain obvious: Use `tensor.cpu()` or `tensor.to("cpu")` to move a tensor to your CPU
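
A common way to write code that runs both with and without a GPU is to pick the device once and reuse it. The following is a sketch of that pattern (the variable names are illustrative), and it falls back to the CPU when CUDA is not available.

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

matrix = torch.rand((100, 100), device=device)  # created directly on the device
norm_on_cpu = matrix.norm().cpu()               # result moved back to the CPU

print(matrix.device, norm_on_cpu.device)
```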

### Building easy ML models

By using all the pieces we've seen till now, we can build our first ML model using PyTorch: a linear regressor, whose model is

`y = XW + b`

which can also be simplified as

`y = XW`

if we incorporate the bias `b` inside `W` and append a column of ones to the right of `X`.

We'll first create our data. The `X`s are the 0:9 range plus some iid random noise, while the `y` is just the 0:9 range.

In [None]:
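# torch.range is deprecated; torch.arange excludes the end point, so (0, 10) yields 0, 1, ..., 9
x1 = torch.arange(0, 10, dtype=torch.float32).unsqueeze(-1)
x2 = torch.arange(0, 10, dtype=torch.float32).unsqueeze(-1)
x3 = torch.arange(0, 10, dtype=torch.float32).unsqueeze(-1)
x0 = torch.ones([10]).unsqueeze(-1)
X = torch.cat((x1, x2, x3), dim=1)
eps = torch.normal(0, 0.3, (10, 3))
X += eps
X = torch.cat((X, x0), dim=1)

y = torch.arange(0, 10, dtype=torch.float32).unsqueeze(-1)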


pretty_print(X, "X (covariates)")
pretty_print(y, "y (response)")


For linear regression, we usually look for the set of weights minimizing the so-called mean squared error/loss (MSE), i.e. the squared difference between the ground truth and the model prediction, averaged over the data instances.
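
To make the definition concrete, here is a tiny worked example with made-up numbers. It also compares the hand-written formula with PT's built-in `torch.nn.functional.mse_loss`, which averages the squared errors by default.

```python
import torch

y_true = torch.tensor([1.0, 2.0, 3.0])
y_pred = torch.tensor([1.0, 2.0, 5.0])

# Squared errors are 0, 0 and 4, so their mean is 4 / 3.
mse_by_hand = ((y_true - y_pred) ** 2).mean()
mse_builtin = torch.nn.functional.mse_loss(y_pred, y_true)

print(mse_by_hand.item(), mse_builtin.item())
```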

We know that the OLS/Maximum Likelihood estimator yields the optimal set of weights in that regard.
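
As a sanity check of the closed-form estimator `W = (XᵀX)⁻¹ Xᵀ y`, the sketch below recovers a known weight vector from synthetic, noise-free data. It also shows `torch.linalg.lstsq`, a least-squares solver available in reasonably recent PT versions that avoids forming the inverse explicitly. The data and the "true" weights are invented for this example.

```python
import torch

torch.manual_seed(0)

# Synthetic, noise-free data with a known weight vector (last entry = bias).
features = torch.randn(50, 3, dtype=torch.float64)
design = torch.cat((features, torch.ones(50, 1, dtype=torch.float64)), dim=1)
true_weights = torch.tensor([[2.0], [-1.0], [0.5], [3.0]], dtype=torch.float64)
response = design @ true_weights

# Normal equations: W = (X^T X)^{-1} X^T y
w_normal_eq = torch.linalg.inv(design.T @ design) @ design.T @ response
# Dedicated least-squares solver
w_lstsq = torch.linalg.lstsq(design, response).solution

print(torch.allclose(w_normal_eq, true_weights))
print(torch.allclose(w_lstsq, true_weights))
```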

In [None]:
W_hat = ((X.T @ X).inverse()) @ X.T @ y  # OLS estimator

pretty_print(W_hat, "W (weights - coefficients and bias/intercept)")


We can evaluate our model on the mean square loss

In [None]:
def mean_square_loss(y, y_hat):
    return (((y - y_hat).norm()) ** 2).item() / y.shape[0]


Let's apply it to our data.

First we need to obtain the predictions, then we can evaluate the MSE.

In [None]:
y_hat = X @ W_hat
pretty_print(y_hat, "Predictions (y_hat)")

pretty_print(mean_square_loss(y, y_hat), "Loss (MSE)")


#### Using PT built-ins

We will now be exploring the second chunk of PT functionalities, namely the built-in structures and routines supporting the creation of ML models.

We can create the same model we have seen before using PT built-in structures, so we start to see them right away.

Usually, a PT model is a `class` inheriting from `torch.nn.Module`. Inside this class, we'll define two methods:
* the constructor (`__init__`), in which we define the building blocks of our model as instance attributes (later during our lectures we'll see more "elegant" methods to build model architectures)
* the `forward` method, which specifies how the data fed into the model needs to be processed in order to produce the output

Note for those who already know something about NNs: we don't need to define `backward` methods since we're constructing our model with built-in PT building blocks. PT automatically creates a `backward` routine based upon the `forward` method.
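
A minimal sketch of this automatic differentiation at work, using a single `Linear` layer and random inputs (both chosen only for illustration): we write the forward computation, call `backward()`, and PT fills in the gradients for us.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(in_features=3, out_features=1)
inputs = torch.rand(5, 3)

# Forward pass only: we never write a backward method ourselves.
output_sum = layer(inputs).sum()
output_sum.backward()  # PT derives and runs the backward pass

# d(sum_i (x_i . w + b)) / dw = sum_i x_i, and the derivative w.r.t. b is N = 5.
print(torch.allclose(layer.weight.grad, inputs.sum(dim=0, keepdim=True)))
print(torch.allclose(layer.bias.grad, torch.tensor([5.0])))
```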

Our model only has one building block (layer) which is a `Linear` layer.
We need to specify the size of the input (i.e. the number of coefficients in `W`) and the size of the output (i.e. how many scalars it produces) of the layer. We additionally request that our layer has a bias term `b` (which acts as the intercept of the line we saw before).

The `Linear` layer processes its input as `XW + b`, which is exactly the (first) equation of the linear regressor we saw before.
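
We can check this claim directly on a freshly initialized layer. Note that PT stores the `weight` parameter with shape `(out_features, in_features)`, so the `W` of the equation corresponds to its transpose. The input below is random and only serves the illustration.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(in_features=3, out_features=1, bias=True)
inputs = torch.rand(4, 3)

# `weight` is stored with shape (out_features, in_features), hence the transpose.
by_hand = inputs @ layer.weight.T + layer.bias

print(layer.weight.shape, layer.bias.shape)
print(torch.allclose(layer(inputs), by_hand))
```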



In [None]:
class LinearRegressor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.regressor = torch.nn.Linear(in_features=3, out_features=1, bias=True)

    def forward(self, X):
        return self.regressor(X)


We can create an instance of our model and inspect the current parameters by using the `state_dict` method, which returns the parameters of the building blocks of our model. Note that `state_dict` is essentially a dictionary indexed by the names of the building blocks which we defined inside the constructor (plus some additional identifiers if a layer has more than one set of parameters).

In [None]:
lin_reg = LinearRegressor()

for param_name, param in lin_reg.state_dict().items():
    print(param_name, param)


We can update the parameters via `state_dict`, re-using the same OLS estimates we obtained before.

Note that PT is designed with Deep Learning in mind: as far as I know, it does not ship ready-made routines for fitting other kinds of ML models.

Next time, we'll see how we can unleash PT's gradient-based iterative training routines and compare the results w.r.t. the OLS estimators.

In [None]:
state_dict = lin_reg.state_dict()
state_dict["regressor.weight"] = W_hat[:3].T
state_dict["regressor.bias"] = W_hat[3]
lin_reg.load_state_dict(state_dict)


In [None]:
# Check if it worked
for param_name, param in lin_reg.state_dict().items():
    print(param_name, param)


The `forward` method gets implicitly called by passing the data to our model's instance `lin_reg`:

In [None]:
X_lin_reg = X[:, :3]
predictions_lin_reg = lin_reg(X_lin_reg)
pretty_print(predictions_lin_reg, "Predictions of torch class")


The predictions are the same as before

In [None]:
pretty_print(y_hat, "Predictions of linear model")


### Adding non-linearity

One of the staples of DL is that the relationship between the `X`s and the predictions is **non-linear**.

The non-linearity is obtained by applying a non-linear function (called *activation function*) after each linear layer.

We can make our linear model slightly more complex to create a **logistic regressor**:

`y = logistic(XW + b)`,

where `logistic(z) = exp(z) / (1 + exp(z))`

The logistic function has different names:
* in statistics, it's usually called *inverse logit* as well
* in DL and mathematics, it's called *sigmoid function* due to its "S" shape
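
As a quick check of the definition, the sketch below evaluates the formula by hand on a few points and compares it with PT's built-in `torch.sigmoid`. The grid of `z` values is arbitrary.

```python
import torch

z = torch.linspace(-5, 5, steps=11)

by_formula = torch.exp(z) / (1 + torch.exp(z))
builtin = torch.sigmoid(z)

print(torch.allclose(by_formula, builtin))
print(builtin[5].item())  # the middle point of the grid is z = 0
```

All outputs lie strictly between 0 and 1, and `z = 0` is mapped to the midpoint.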

Historically, the sigmoid was among the first activation functions used in NNs.

![](img/sigmoid.png)

Logistic regression is usually used as a **binary classification model** instead of a regression model.
In this setting, we suppose we have two destination classes to which we assign values 0 and 1: `y ∈ {0, 1}`.
Since the sigmoid takes values in `(0,1)`, we can interpret its output `ŷ` as a probability value, and assign each data instance to the class 0 if `ŷ <= 0.5`, and to the class 1 otherwise.


In [None]:
y = torch.Tensor([0, 1, 0, 0, 1, 1, 1, 0, 1, 1])
pretty_print(y, "y for our classification problem")


Note that we may also want our y to be a vector of `int`s.
We can convert the `Tensor` type to `int` by calling the method `.long()` or `.int()` of `Tensor`.

As in NumPy, the type of the `Tensor` is stored in its `dtype` attribute.

In [None]:
y = y.int()
pretty_print(y, "y converted to int")
pretty_print(y.dtype, "Data type of y")


Let us now build our logistic regressor in PT.

We only need a single addition w.r.t. the linear regressor: in the `forward` method, we'll add the sigmoidal non-linearity by calling the `sigmoid` function from the `torch.nn.functional` module.

Note that there also exist some "mirror" counterparts of these functionals within `torch.nn` (e.g. `torch.nn.Sigmoid`): we'll learn in the following lecture why they are there and how to use them.

In [None]:
class LogisticRegressor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # no difference wrt linear regressor
        self.regressor = torch.nn.Linear(in_features=3, out_features=1)

    def forward(self, X):
        out = self.regressor(X)
        # here we apply the sigmoid fct to the output of regressor
        out = torch.nn.functional.sigmoid(out)
        return out


We can instantiate our logistic regressor and use it to calculate our predictions on the same X as before.

Note that **we're using the initial (random) weights which PT has assigned to the model parameters**.
For our linear regressor, we were able to analytically obtain the OLS value of the parameters.
In the case of logistic regression, there's no Maximum Likelihood estimator obtainable in closed form and we need to resort to numerical methods to obtain the parameters.
Since the part concerning numerical optimization will be discussed during the next Lab, we will not be training our model (hence results will obviously be sub-par).

In [None]:
log_reg = LogisticRegressor()
y_hat = log_reg(X_lin_reg)
pretty_print(y_hat, "logistic regressor predictions")


There exist many ways to evaluate the performance of the logistic regressor: one of them is **accuracy** (correctly identified units / total number of units).
We can define a function to evaluate accuracy and calculate it on our model and data

In [None]:
def accuracy(y, y_hat):
    # Assign each y_hat to its predicted class (class 0 if y_hat <= 0.5, class 1 otherwise)
    pred_classes = torch.where(y_hat <= 0.5, 0, 1).squeeze().int()
    correct = (pred_classes == y).sum()
    return (correct / y.shape[0]).item()


In [None]:
accuracy(y, y_hat)


#### Visualizing linear and logistic regression as a computational graph

We can now represent the equation of both the linear and the logistic regression as a computational graph:
* `y = σ(XW + b)`

where `σ` is a generic `ℝ → ℝ` function: sigmoid for logistic regression, identity for linear regression.

![](img/log_reg_graph.jpg)

We organize the input in *nodes* (on the left part) s.t. each node represents one dimension/covariate.
For each data instance, we replace each node with the corresponding numeric value.
The nodes undergo one or more operations, namely, from left to right:

1. Each node is multiplied by its corresponding weight (a value placed on the edge indicates that the node is multiplied by said value)
2. All the corresponding outputs are summed together

These two operations identify the dot product between vectors `X` and `W`

3. The bias term `b` is added
4. The non-linear function `σ` is applied to the result of this sum
5. Finally, we assign that value to the variable `ŷ`, which is also indicated as a node
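
The five steps above can be sketched in plain Python for a single data instance. The covariates, weights and bias below are made-up numbers, and `σ` is taken to be the sigmoid.

```python
import math

x = [0.5, -1.0, 2.0]  # one data instance (three covariates)
w = [0.2, 0.4, -0.1]  # hypothetical weights, one per edge
b = 0.3               # hypothetical bias

products = [x_i * w_i for x_i, w_i in zip(x, w)]  # step 1: node times weight
dot_product = sum(products)                        # step 2: sum of the products
pre_activation = dot_product + b                   # step 3: add the bias
# step 4: apply sigma; 1 / (1 + exp(-z)) is the same as exp(z) / (1 + exp(z))
y_hat = 1 / (1 + math.exp(-pre_activation))
print(y_hat)                                       # step 5: the output node
```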


### Our first MultiLayer Perceptron (MLP)

The MLP is a family of Artificial NNs in which the input is a vector in `ℝ^d` and the output is a vector in `ℝ^p`, where `p` is determined by the nature of the problem we wish to solve. Additionally, an MLP is characterized by multiple stages (*layers*) of sequential vector-matrix multiplication and non-linearity (steps 1 to 4 above), in which the output of the layer `l-1` acts as input to the layer `l`.

Taking inspiration from the graph of the logistic regression, we can translate all of this into an image to make sense of these words:

![](img/mlp_graph.jpg)

In NNs, each of the nodes within the graph is called a **neuron**

Neurons are organized in **layers**

In computational graphs, layers are shown from left to right (or bottom to top sometimes), which is the direction of the flow of information inside the NN.

The first layer is called **input layer** and represents the dimensions of our data.

The last layer is called **output layer** and represents the output of our NN.

All the intermediate layers are called **hidden layers**. For a model to be called an MLP, it must have at least one hidden layer.

If the NN is an MLP, each neuron in a given layer (except for the input) receives information from every neuron of the previous layer; moreover, each neuron in any layer (except for the output) sends information to every neuron of the next layer. There's no communication between neurons of the same layer (if it happens, we have a **Recurrent Neural Network**).

For the sake of brevity, usually in NN computational graphs we drop the blocks `+` and `σ`, the values of weights, and the reference to the bias terms, remaining with a scheme conveying info about
* the number of neurons per layer
* the connectivity of neurons

The graph above becomes:

![](img/mlp_graph_common.jpg)

We can then start programming our simple MLP in PT.

We will suppose that our MLP is for **binary classification**, hence the activation function `τ` is the sigmoid.

**Q**: what if we wanted our MLP to operate a regression?

In [None]:
class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = torch.nn.Linear(in_features=3, out_features=2)
        self.layer2 = torch.nn.Linear(in_features=2, out_features=1)

    def forward(self, X):
        out = self.layer1(X)
        out = torch.nn.functional.relu(out)
        out = self.layer2(out)  # feed the hidden activations (not X) to the second layer
        out = torch.nn.functional.sigmoid(out)
        return out


For the great majority of MLPs, it's very hard to get analytical solutions for our sets of weights and biases. We then resort to numerical methods for optimization.

In DL, we normally use gradient-based methods like Stochastic Gradient Descent with *backpropagation* to find approximate solutions.

We'll cover these topics in future lectures. For now, the focus is to build an MLP in PT and perform the *forward pass* (=evaluate the model on a set of data).

We can analyse the structure of our MLP by just printing the model

In [None]:
model = MLP()
model


although we might want additional information.

There's an additional package, called `torch-summary`, which helps us produce more informative and exhaustive model summaries.

In [None]:
import sys
!{sys.executable} -m pip install torch-summary


In [None]:
from torchsummary import summary  # module installed by the torch-summary package

summary(model)


Let us suppose we wish to build a larger model from the graph below.

![](img/mlp_graph_larger.jpg)

We suppose that

1. The layers have no bias units
2. The activation function for hidden layers is `ReLU`

Moreover, we suppose that this is a classification problem.

As you might recall, when the number of classes is > 2, we encode the problem in such a way that the output layer has a no. of neurons corresponding to the no. of classes. Doing so, we establish a correspondence between output units and classes. The value of the $j$-th neuron represents the **confidence** of the network in assigning a given data instance to the $j$-th class.

Classically, when the network is encoded in such way, the activation function for the final layer is the **softmax** function.
If $C$ is the total number of classes,

$\mathrm{softmax}(z_j) = \frac{\exp(z_j)}{\sum_{k=1}^C \exp(z_k)}$

where $j\in \{1,\cdots,C\}$ is one of the classes.

If we repeat this calculation for all $j$s, we end up with $C$ normalized values (i.e., each between 0 and 1 and summing to 1) which can be interpreted as the probabilities that the network assigns the instance to the corresponding classes.
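
A minimal sketch of the formula on a made-up vector of $C = 3$ output values, compared with PT's built-in `torch.softmax`:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])  # hypothetical outputs for C = 3 classes

by_formula = torch.exp(logits) / torch.exp(logits).sum()
builtin = torch.softmax(logits, dim=0)

print(builtin)
print(torch.allclose(by_formula, builtin))
print(builtin.sum().item())     # the C values add up to 1 (up to rounding)
print(builtin.argmax().item())  # index of the most likely class
```

The largest input receives the largest probability, so taking the `argmax` of the softmax output gives the predicted class.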

In [None]:
# Task: build this network from scratch during class


In [None]:
# instantiate and summarise
