# Deep Learning course - LAB 5

## Pruning in Artificial Neural Networks

### Recap from previous Labs

* Essentially, we now have a bag of _tricks_ to make our Multilayer Perceptrons (MLPs) train well while preserving good generalization
* We know how to construct a simple MLP with a custom number of layers and neurons
* We know how to train the MLP using popular variants of Stochastic Gradient Descent (SGD) such as SGD with momentum and Adam
* We learned some tricks to improve the generalization capability of our MLP, namely Weight Decay and Dropout
* We saw how to use proper learning rate schedules for better convergence of our network

### Agenda for today

* Today's lab session is more technical: we go deeper into a more "experimental" branch of Deep Learning (DL), **network pruning**
* We will see how to implement a specific pruning algorithm, Magnitude Pruning (MP), in an MLP

## What is pruning?

Pruning refers to the act of **removing parameters from a Machine Learning model**.

The term originated from tree-based models which, when grown too deep, tend to overfit markedly, so their branches need to be cut (*pruned*) according to proper criteria in order to ensure better generalization.

The main difference between pruning tree-based models and Artificial Neural Networks (ANNs) is that in tree-based models pruning starts from the _leaves_, whereas in ANNs we usually prune parameters, i.e. connections (_synapses_) between neurons, regardless of their position in the network.

![](https://www.cs.cmu.edu/~bhiksha/courses/10-601/decisiontrees/DTprune.png)

*In decision trees, models are pruned _bottom-up_ starting from the leaves. Picture from [Carnegie Mellon University](https://www.cs.cmu.edu/~bhiksha/courses/10-601/decisiontrees/).*

![](https://upload.wikimedia.org/wikipedia/commons/2/23/Before_after_pruning.png)

*In ANNs, in general, pruning is applied regardless of the position of the neuron/connection within the network. Picture from [Wikimedia Commons](https://commons.wikimedia.org/wiki/File:Before_after_pruning.png).*



## Why pruning?

Pruning (also called _Weight Sparsification_) belongs to the family of **Network Miniaturization/Compression** techniques, a body of practices aimed at reducing the size, memory footprint, or energy consumption of a well-performing ANN, while still allowing the _reduced_ model to perform on par with the original network.

Other examples of Network Compression include, but are not limited to:
* Knowledge Distillation
* Weight Quantization
* Weight Sharing

## How to prune?

Pruning in ANNs is a topic that has been around since the early works of Yann LeCun [2](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.7223&rep=rep1&type=pdf) in the late 80s. A huge number of techniques have been proposed since then [3](https://arxiv.org/abs/2102.00554).

The findings in these early papers are essentially different from those in the most recent ones (i.e., from 2015 on), mainly because the networks back then were not deep, and large ANNs tend to react differently to pruning.

Regarding how to prune, we can distinguish between two main categories: **Structured Pruning** and **Unstructured Pruning**.

### Structured pruning

Put simply, structured pruning acts on groups of synapses that are logically connected in some way. In the case of MLPs, we may think of _removing a neuron_ from a given layer as a structured pruning technique, as we **remove all the connections incoming towards that neuron**.

![](img/struct_prune.png)

In the image above, we prune the first neuron of the second layer. This equates to removing all of its incoming connections. The effect on the parameters of the ANN is a so-called *regular sparsity*: the original parameters $\Theta\in\mathbb{R}^{4\times 3}$ become $\tilde\Theta\in\mathbb{R}^{4\times 2}$.
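As a minimal sketch of the shape change described above, the snippet below uses a toy $4\times 3$ tensor in the text's convention (rows index the input neurons, columns index the neurons of the second layer). The values and the variable names are illustrative assumptions, not taken from the figure. Note that PyTorch's `nn.Linear` stores its weight as `(out_features, in_features)`, so there the same neuron would correspond to a row rather than a column.

```python
import torch

# Toy weight matrix in the text's convention (an assumption for illustration):
# rows = 4 input neurons, columns = 3 neurons of the second layer.
theta = torch.arange(12, dtype=torch.float32).reshape(4, 3)

# Structured pruning of the first neuron of the second layer:
# drop the whole column holding its incoming connections.
pruned_neuron = 0
keep = [j for j in range(theta.shape[1]) if j != pruned_neuron]
theta_pruned = theta[:, keep]

print(theta.shape)         # torch.Size([4, 3])
print(theta_pruned.shape)  # torch.Size([4, 2])
```

The result is a smaller dense matrix, which is why this kind of sparsity can be exploited without any special sparse kernels.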

### Unstructured pruning

In this case, instead, we utilize a pruning technique which has no regard for the geometry or the structure of the layers themselves.

![](img/unstr_prune.png)

In the image above, we removed four connections. This leads to an *irregular* form of sparsity in the parameters which, unlike the structured case, cannot be exploited directly: the parameter matrix keeps its original shape and merely contains zeros.
Irregular sparsity can still be exploited through sparse linear algebra algorithms (implemented, for example, in cuSPARSE) or through processor architectures designed to take advantage of sparsity in matrices (e.g., the NVIDIA A100 GPU).
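To illustrate irregular sparsity, here is a small sketch with a toy $4\times 3$ matrix in which four entries have been zeroed at arbitrary positions (the values are made up and do not reproduce the figure). The dense tensor keeps its shape, whereas a sparse representation, here PyTorch's COO format obtained with `Tensor.to_sparse()`, stores only the surviving entries and their coordinates.

```python
import torch

# Toy parameters after unstructured pruning: 4 of the 12 entries are zero,
# at positions that follow no row/column pattern (illustrative values).
theta = torch.tensor([[0.5,  0.0, -1.2],
                      [0.8,  0.3,  0.0],
                      [0.0, -0.6,  0.9],
                      [0.1,  0.0,  0.7]])

# The dense shape is unchanged by unstructured pruning.
print(theta.shape)

# COO sparse format: only non-zero values and their (row, col) indices are stored.
theta_sparse = theta.to_sparse().coalesce()
print(theta_sparse.indices())
print(theta_sparse.values())

n_stored = theta_sparse.values().numel()
print(n_stored, "of", theta.numel(), "entries stored")

# Converting back recovers the original dense matrix.
restored = theta_sparse.to_dense()
```

Whether a sparse format actually saves memory or time depends on the sparsity level and on the hardware and kernels available, so this should be read as an illustration of the data layout only.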

In this lab we will deal with unstructured sparsity, but an extension to structured sparsity can be easily derived.

## (Least) Magnitude pruning

MP is one of the most straightforward pruning techniques. Essentially, we remove a fixed fraction of the parameters, namely those with the lowest absolute value.

To get more technical:

* Let us call $\Theta$ the *structure* holding all of the parameters in the network.

* Let us also introduce $p\in(0,1)$, the **pruning rate**. This is the fixed fraction of parameters with small magnitude that we will prune.

1. We obtain $\vec\Theta = \text{vec}\vert\Theta\vert$, the **empirical distribution of the parameters in magnitude**

2. We sort $\vec\Theta$, obtaining $\vec\Theta_\text{sort}$

3. We obtain $\theta_p$, the $p$-quantile (i.e., the $100p$-th percentile) of $\vec\Theta_\text{sort}$, which will act as the threshold determining which parameters are pruned.

4. For each parameter $\theta$ in the original $\Theta$, we apply the pruning: $\tilde\theta = \theta \cdot \mathbb{1}[\vert\theta\vert\geq\theta_p]$ (NB: $\mathbb{1}$ is the *indicator function*, which evaluates to 1 if the condition inside the brackets holds, and to 0 otherwise).

5. We replace $\Theta$ with $\tilde\Theta$, the structure composed of the various $\tilde\theta$s.
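The five steps above can be sketched in PyTorch as follows. This is a minimal illustration on a single tensor, not the lab's reference implementation: the function name `magnitude_prune` is hypothetical, and the threshold is taken as the element at index $\lfloor p \cdot n \rfloor$ of the sorted magnitudes, which is one of several reasonable ways to define the $p$-quantile.

```python
import torch

def magnitude_prune(theta: torch.Tensor, p: float) -> torch.Tensor:
    """Zero out roughly a fraction p of the entries with smallest magnitude."""
    magnitudes = theta.abs().flatten()        # step 1: vec|Theta|
    sorted_mag, _ = torch.sort(magnitudes)    # step 2: sort the magnitudes
    k = int(p * sorted_mag.numel())           # number of entries below the threshold
    threshold = sorted_mag[k]                 # step 3: theta_p
    keep = (theta.abs() >= threshold).to(theta.dtype)  # step 4: indicator function
    return theta * keep                       # step 5: the pruned structure

theta = torch.tensor([[ 0.10, -0.80,  0.30],
                      [-0.05,  0.60, -0.20]])
theta_pruned = magnitude_prune(theta, p=0.5)

# With p = 0.5 the threshold is 0.3, so only -0.8, 0.3 and 0.6 survive.
print(theta_pruned)
```

If several parameters share exactly the threshold magnitude, all of them are kept, so the pruned fraction can be slightly smaller than $p$.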

#### Masking

The same procedure described above can be obtained with masking. Essentially, instead of directly building $\tilde\Theta$ as element-wise application of $\tilde\theta = \theta \cdot \mathbb{1}[\vert\theta\vert\geq\theta_p]$, we rather do this:

1. $m = \mathbb{1}[\vert\theta\vert\geq\theta_p]$. $m$ is a boolean telling us whether the corresponding parameter is kept ($m=1$) or pruned ($m=0$). We have as many $m$s as there are $\theta$s.

2. We compose the $m$s in a structure $M$, identical in shape to $\Theta$.

3. Now, $\tilde\Theta = M\odot \Theta$, where $\odot$ is the Hadamard product.

This is how we will implement pruning in PyTorch (analogously to Dropout).
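A possible sketch of the masking procedure on a small model is shown below. The two-layer `nn.Sequential` network, the choice of pruning only the weight matrices (leaving biases untouched), and the use of a single global threshold across layers are all assumptions made for illustration; they are not prescribed by the text above.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small stand-in model (hypothetical, not the lab's MLPCustom).
model = nn.Sequential(nn.Linear(8, 4), nn.ReLU(), nn.Linear(4, 2))
p = 0.25  # pruning rate

# Assumption: only weight matrices are pruned, biases are kept as they are.
weights = [m.weight for m in model if isinstance(m, nn.Linear)]

# Global threshold theta_p computed over the magnitudes of all weights.
all_magnitudes = torch.cat([w.detach().abs().flatten() for w in weights])
sorted_mag, _ = torch.sort(all_magnitudes)
threshold = sorted_mag[int(p * sorted_mag.numel())]

# One mask M per weight tensor, identical in shape to the tensor it masks.
masks = [(w.detach().abs() >= threshold).to(w.dtype) for w in weights]

# Theta_tilde = M (Hadamard product) Theta, applied in place.
with torch.no_grad():
    for w, m in zip(weights, masks):
        w.mul_(m)

n_total = sum(m.numel() for m in masks)
n_pruned = sum(int((m == 0).sum()) for m in masks)
print(f"pruned {n_pruned} of {n_total} weights")
```

Keeping the masks around is the main reason to prefer this formulation: if training continues after pruning, gradient updates can make pruned weights non-zero again, so the masks need to be re-applied (for instance after every optimizer step). PyTorch's `torch.nn.utils.prune` module follows a similar mask-based design.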

```python
import torch
from scripts import mnist, train_utils, architectures
```

```python
layers = [
    {"n_in": 784, "n_out": 16, "batchnorm": False},
    {"n_out": 32, "batchnorm": True},
    {"n_out": 64, "batchnorm": True},
    {"n_out": 10, "batchnorm": True}
]
net = architectures.MLPCustom(layers)
```

```python
net
```

```
MLPCustom(
  (layers): Sequential(
    (0): Linear(in_features=784, out_features=16, bias=True)
    (1): ReLU()
    (2): BatchNorm1d(784, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Linear(in_features=784, out_features=32, bias=True)
    (4): ReLU()
    (5): BatchNorm1d(784, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (6): Linear(in_features=784, out_features=64, bias=True)
    (7): ReLU()
    (8): BatchNorm1d(784, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (9): Linear(in_features=784, out_features=10, bias=True)
    (10): ReLU()
  )
)
```

### References

[1](https://arxiv.org/abs/2006.03669) O'Neill, James. "An Overview of Neural Network Compression."

[2](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.32.7223&rep=rep1&type=pdf) LeCun, Yann, John Denker, and Sara Solla. "Optimal Brain Damage."

[3](https://arxiv.org/abs/2102.00554) Hoefler, Torsten, et al. "Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks."