# Fully Connected Networks (FCNs) & activations

## What are they?

- Powerful function approximators
- Universal approximators (assuming infinite width and/or depth)
- Essentially stacked linear regressions with multiple outputs
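
To make the "stacked linear regressions" point concrete, here is a minimal sketch showing that a single `torch.nn.Linear` layer computes `x @ W.T + b`, which is one linear regression per output. The layer sizes and the random batch are illustrative assumptions, not values from this notebook.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 3 input features, 2 outputs
layer = torch.nn.Linear(in_features=3, out_features=2)
x = torch.rand(4, 3)  # batch of 4 samples with 3 features each

# A linear layer computes x @ W^T + b: one linear regression per output neuron
manual = x @ layer.weight.T + layer.bias

print(torch.allclose(layer(x), manual, atol=1e-6))
```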

## What do they do?

Previously we used polynomial features for non-linear datasets, but this comes with downsides:
- what degree of polynomial should we use?
- maybe other functions would work better (they usually do)?
- if so, which functions?

![](./images/complex-fn.png)

We might come to the conclusion that:

> it would be best to learn those functions directly from data

... and that's what neural networks do.

## Perceptron and its limitations

> A perceptron is binary logistic regression: a neural network with one layer and one output neuron

It can learn __linearly separable data__, but it fails on data that is not linearly separable.

> __Let's look at the famous XOR problem__

In [29]:
import torch


def xor_problem(model):

    inputs = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]]).float()
    targets = torch.tensor([0, 1, 1, 0]).float()

    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for _ in range(10000):
        outputs = model(inputs).squeeze()

        loss = torch.nn.functional.binary_cross_entropy_with_logits(outputs, targets)
        loss.backward()

        optimizer.step()
        optimizer.zero_grad()

    print("Predicted logits:\n", model(inputs))
    print("Predicted labels:\n", (model(inputs) > 0).float())
    print("Targets:\n", targets)
    print("Weights:\n")

    for parameter in model.parameters():
        print(parameter.data)


xor_problem(torch.nn.Linear(2, 1))

Predicted logits:
 tensor([[ 0.1166],
        [ 0.3109],
        [-0.2655],
        [-0.0712]], grad_fn=<AddmmBackward>)
Predicted labels:
 tensor([[1.],
        [1.],
        [0.],
        [0.]])
Targets:
 tensor([0., 1., 1., 0.])
Weights:

tensor([[-0.3821,  0.1943]])
tensor([0.1166])


This simple model cannot learn to classify the XOR data: no single straight line separates the two classes.

![](./images/shallow-vs-deep.png)

Let's try adding another layer with size equal to `2`:

In [30]:
xor_problem(torch.nn.Sequential(torch.nn.Linear(2, 2), torch.nn.Linear(2, 1)))

Predicted logits:
 tensor([[ 0.0873],
        [-0.0803],
        [ 0.0959],
        [-0.0716]], grad_fn=<AddmmBackward>)
Predicted labels:
 tensor([[1.],
        [0.],
        [1.],
        [0.]])
Targets:
 tensor([0., 1., 1., 0.])
Weights:

tensor([[-0.5845, -0.2504],
        [ 0.5905, -0.4012]])
tensor([-0.3885, -0.5605])
tensor([[0.2497, 0.2618]])
tensor([0.3310])


A single linear layer applies an affine transformation:
- multiplying by the weights stretches (and possibly rotates) the input space
- adding a constant (bias) shifts it

> If we stack more linear layers, the whole transformation is still linear!

![](./images/factor-proof.png)
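
The claim can also be checked numerically. The sketch below (an illustration under assumed layer sizes matching the XOR model above) collapses two stacked linear layers into one equivalent layer using `W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)`.

```python
import torch

torch.manual_seed(0)

first = torch.nn.Linear(2, 2)
second = torch.nn.Linear(2, 1)
stacked = torch.nn.Sequential(first, second)

# Build a single layer whose weights and bias reproduce the two-layer stack
collapsed = torch.nn.Linear(2, 1)
with torch.no_grad():
    collapsed.weight.copy_(second.weight @ first.weight)
    collapsed.bias.copy_(second.weight @ first.bias + second.bias)

inputs = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]]).float()
with torch.no_grad():
    same = torch.allclose(stacked(inputs), collapsed(inputs), atol=1e-6)

print(same)
```

Since one layer can reproduce the stack exactly, the two-layer model is no more expressive than the single `Linear(2, 1)` that already failed on XOR.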

## Activations

To get around this limitation, we need to apply __non-linear transformations__ after some (usually all) layers:

![](./images/activation.png)

> Composition of non-linear functions makes the whole transformation non-linear

There are multiple available activation functions, including (but not limited to):
- sigmoid
- tanh
- ReLU
- Leaky ReLU

![](./images/activ-fns.png)
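
As a quick reference, the sketch below evaluates the four listed activations on a few sample points. The sample values and the `0.01` slope for Leaky ReLU are assumptions chosen for illustration.

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])

# Each entry applies one of the activations listed above to the same inputs
activations = {
    "sigmoid": torch.sigmoid(x),
    "tanh": torch.tanh(x),
    "relu": torch.relu(x),
    "leaky_relu": torch.nn.functional.leaky_relu(x, negative_slope=0.01),
}

for name, values in activations.items():
    print(name, values)
```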

> Each newer activation function was introduced to address problems with the previously dominant one

## Sigmoid

- The original neural network activation function
- Squashes input to the `(0, 1)` range (neuron off or on)

But this activation function has the following drawbacks:
- Not zero-centered
- __Oversaturation__ (most severe drawback)

### Non-zero centered

> Neural networks train best on zero-centered data

If the inputs to a neuron are always positive (as is the case after a sigmoid), the gradients of all its weights share one sign: for a given example they are either all positive or all negative.

This leads to zig-zagging during training, especially for smaller batches.

> Larger batches mostly mitigate this issue, as the gradient will be averaged across many examples (some positive, some negative for different weights)
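
A minimal sketch of this effect, assuming a single neuron, a single example and a squared-error loss (all chosen for illustration): because the weight gradient is the upstream gradient times the input, all-positive inputs force every weight gradient to have the same sign.

```python
import torch

torch.manual_seed(0)

neuron = torch.nn.Linear(5, 1)
# One example whose features are all positive, like the outputs of a sigmoid
positive_inputs = torch.sigmoid(torch.randn(1, 5))

loss = torch.nn.functional.mse_loss(neuron(positive_inputs), torch.ones(1, 1))
loss.backward()

# d(loss)/d(w_i) = d(loss)/d(output) * x_i, and every x_i > 0
signs = torch.sign(neuron.weight.grad)
print(signs)
```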

## Tanh 

Hyperbolic tangent solves this issue, but:

> Tanh also has oversaturation drawbacks

## Oversaturation

> When a neuron's activation saturates at the tails (large positive or negative inputs), its local gradient becomes close to zero

> During backpropagation this local gradient is multiplied by the local gradients of the other layers and vanishes, a phenomenon known as the __dying gradient__ (more commonly called the vanishing gradient), which is especially severe for deeper networks
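
The local gradient of the sigmoid is `sigmoid(x) * (1 - sigmoid(x))`, which peaks at `0.25` for `x = 0`. The sketch below (sample inputs are illustrative) prints it at a few points and computes a rough best-case product of seven such factors, ignoring the weights.

```python
import torch

x = torch.tensor([0.0, 5.0, 10.0], requires_grad=True)
torch.sigmoid(x).sum().backward()

# x.grad holds the local gradient sigmoid(x) * (1 - sigmoid(x)) at each point
print(x.grad)

# Even in the best case (0.25 per layer), seven sigmoid layers shrink the signal a lot
best_case = 0.25 ** 7
print(best_case)
```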

## Exercise

- Analyze `gradient_print` given below and make sure you know what it does.
- Analyze the `Initializer` class and make sure you know what it does.
- Create a multi-layer neural network with `tanh` and `sigmoid` activations (try a few options, start with `7` layers and go lower).
- Use `Initializer` and `torch.nn.Module`'s `apply` function to manipulate weights in order to see the dying gradient after backpropagation.
- How many layers do we need in order __not to__ need `Initializer` to see this effect? Check multiple values (you might do that in a loop and adjust the code as needed).
- Finally, create a neural network which solves the XOR problem!

In [47]:
def gradient_print(model):

    inputs = torch.rand(32, 10)
    targets = torch.randint(low=0, high=5, size=(32,))

    outputs = model(inputs)

    loss = torch.nn.functional.cross_entropy(outputs, targets)
    loss.backward()

    for i, parameter in enumerate(model.parameters()):
        print(f"\n\n--------- PARAMETER {i} GRADIENT ---------\n\n")
        print(parameter.grad)


class Initializer:
    def __init__(self, target_layer, multiplier):
        self._counter = -1
        self.target_layer = target_layer
        self.multiplier = multiplier

    def __call__(self, submodule):
        self._counter += 1
        if self._counter == self.target_layer:
            submodule.weight.data = (
                torch.ones_like(submodule.weight.data) * self.multiplier
            )

In [48]:
# your model here



--------- PARAMETER 0 GRADIENT ---------


tensor([[3.9513e-08, 7.9359e-08, 6.6509e-08, 5.6311e-08, 7.1541e-08, 2.8979e-08,
         8.8074e-08, 4.1238e-08, 8.5469e-08, 3.5543e-08],
        [3.8386e-08, 6.8035e-08, 5.6585e-08, 4.7749e-08, 6.6983e-08, 2.8352e-08,
         7.5741e-08, 3.8296e-08, 7.5845e-08, 3.4881e-08],
        [3.8629e-08, 7.7392e-08, 6.4328e-08, 5.2117e-08, 7.0474e-08, 2.5492e-08,
         8.0053e-08, 3.7269e-08, 8.1557e-08, 3.0715e-08],
        [3.4803e-08, 7.3151e-08, 6.2568e-08, 5.2504e-08, 6.8592e-08, 2.6935e-08,
         8.1884e-08, 3.8729e-08, 7.9126e-08, 3.0244e-08],
        [3.9992e-08, 8.3039e-08, 6.5890e-08, 5.7668e-08, 7.0517e-08, 3.3804e-08,
         8.7699e-08, 3.9604e-08, 8.7403e-08, 3.7433e-08]])


--------- PARAMETER 1 GRADIENT ---------


tensor([1.4060e-07, 1.2718e-07, 1.3352e-07, 1.3168e-07, 1.4006e-07])


--------- PARAMETER 2 GRADIENT ---------


tensor([[-0.0253,  0.0593, -0.0144, -0.0412, -0.0458],
        [-0.0153,  0.0196, -0.0109, -0.0280, 

Let's take a moment to see the neural network as a whole:


![](./images/nn.png)

- __Depth:__ how many layers are in a neural network
- __Width:__ how many neurons are in a certain layer (`out_features` in PyTorch)

> In general, we create a bottleneck with `nn.Linear` layers, starting with `N` features and finishing with `M` outputs

> This is a rule of thumb, not a hard rule
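
As an illustration of these terms, the sketch below builds a small bottleneck-shaped network and reads off its depth and widths. The sizes (`10` features narrowing to `5` outputs) and the `tanh` activation are hypothetical choices, not a recommendation.

```python
import torch

# Hypothetical bottleneck: 10 input features narrowing down to 5 outputs
model = torch.nn.Sequential(
    torch.nn.Linear(10, 8),
    torch.nn.Tanh(),
    torch.nn.Linear(8, 6),
    torch.nn.Tanh(),
    torch.nn.Linear(6, 5),
)

linear_layers = [m for m in model if isinstance(m, torch.nn.Linear)]
depth = len(linear_layers)  # number of linear layers
widths = [layer.out_features for layer in linear_layers]  # neurons per layer

print(depth, widths)
```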

## ReLU

`ReLU` was designed to combat the oversaturation and dying gradient problems.

> `ReLU` is given by the equation `max(0, x)`

> The gradient is always `1` in the linear part (positive inputs) and `0` for negative and zero inputs

> A combination of piece-wise linear functions can approximate any non-linearity (especially with increasing depth)
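
The gradient behaviour is easy to verify with autograd. In this small sketch (sample inputs are illustrative), the gradient stays at `1` however large the positive input gets, so there is no saturation on that side.

```python
import torch

x = torch.tensor([-3.0, -0.5, 0.5, 3.0, 100.0], requires_grad=True)
torch.relu(x).sum().backward()

# Gradient is 0 for negative inputs and 1 for positive ones, regardless of magnitude
print(x.grad)
```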

### Advantages

- Reportedly much faster training (initially a `6x` improvement)
- Cheap to compute (thresholding values at zero)
- No oversaturation for positive inputs

### Drawbacks

- Dying ReLU: a large update may "knock" a neuron into the negative regime, where its gradient is zero and it stays forever (many neural networks contain such "dead" neurons, sometimes around 50%)
- A learning rate that is too high may be the cause
- In practice it seems not to be much of a problem (a large network can compensate with other neurons)

## Exercise

- Create `NegativeInitializer` (similar to the previous exercise).
- It should initialize the submodule specified by `target_layer` with negative values (any negative values).
- Check the initialization for a few layers and see what happens to the gradient. Why is some of the gradient non-zero anyway, and what should we change?
- If you know, change it in the `NegativeInitializer` class.

In [49]:
class NegativeInitializer:
    ...


model = torch.nn.Sequential(
    torch.nn.Linear(10, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 5),
)

# apply with negative initializer here
gradient_print(model)



--------- PARAMETER 0 GRADIENT ---------


tensor([[ 9.4033e-04,  3.4394e-03,  7.0130e-04,  3.0983e-03, -1.3231e-04,
          1.7180e-03,  3.7560e-03,  7.5737e-04,  1.8372e-03,  2.4804e-03],
        [ 8.5365e-03,  3.1757e-03,  4.6065e-03,  1.3585e-02,  7.0140e-03,
          6.2525e-03, -3.1818e-04,  4.8624e-03,  8.1392e-03,  1.3574e-02],
        [ 8.4892e-06,  1.4374e-05,  3.4557e-06,  5.8296e-05,  7.2065e-05,
          1.8505e-05,  6.0031e-05,  2.1434e-05,  1.5608e-06,  7.9357e-06],
        [-7.3002e-04, -1.4709e-03,  1.3133e-03, -2.6724e-03, -1.4767e-03,
         -5.8764e-04, -3.8302e-03, -8.0369e-04,  2.1004e-04, -6.0261e-04],
        [-2.1563e-03, -6.2478e-04,  1.6013e-04, -3.1905e-03, -3.7933e-04,
         -3.9026e-04,  1.3713e-03,  6.8202e-04, -2.1437e-03, -3.3584e-03]])


--------- PARAMETER 1 GRADIENT ---------


tensor([ 3.4787e-03,  1.2699e-02,  7.3402e-05, -2.8507e-03, -7.2081e-04])


--------- PARAMETER 2 GRADIENT ---------


tensor([[ 1.8009e-04, -2.0210e-03,  7.7468e-0

## Leaky ReLU

> Leaky ReLU solves the dying ReLU problem by using __a small slope__ for negative values, so the gradient there is never exactly zero

$$
\text{LeakyReLU}_s(x) = \max(0, x) + s \cdot \min(0, x)
$$

> `s` is usually around `0.01`
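
A minimal sketch of the formula above, compared against PyTorch's built-in `torch.nn.functional.leaky_relu`. The helper name `leaky_relu` and the sample inputs are assumptions made for illustration.

```python
import torch


def leaky_relu(x, s=0.01):
    # max(0, x) + s * min(0, x), written with clamp
    return torch.clamp(x, min=0) + s * torch.clamp(x, max=0)


x = torch.tensor([-10.0, -1.0, 1.0, 10.0], requires_grad=True)

manual = leaky_relu(x)
builtin = torch.nn.functional.leaky_relu(x, negative_slope=0.01)
manual.sum().backward()

print(torch.allclose(manual, builtin))
print(x.grad)  # slope s for negative inputs, 1 for positive inputs
```

Unlike `ReLU`, the gradient for negative inputs is small but non-zero, which is what lets a "dead" neuron keep receiving updates.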

### Disadvantages

- Slightly higher computational cost
- Solves a problem which is not a problem in many cases

### Advantages

- Neurons can recover from "dead" state and be useful for neural network

## What can neural networks do?

The motivation that led us to deriving neural networks was that we wanted to model more complex functions. But what functions can a neural network actually represent?

> Neural networks can approximate any continuous function (on a bounded domain, to arbitrary accuracy given enough neurons) and are **universal function approximators**.

![](./images/univ-approx.png)
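
As a small illustration (not a proof), the sketch below fits `sin(x)` with a one-hidden-layer network. The hidden width of `32`, the `tanh` activation, the `Adam` optimizer and the number of steps are all assumptions chosen to keep the example short.

```python
import torch

torch.manual_seed(0)

# Target: a smooth non-linear function on a bounded interval
x = torch.linspace(-3.14, 3.14, 200).unsqueeze(1)
y = torch.sin(x)

model = torch.nn.Sequential(
    torch.nn.Linear(1, 32),
    torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

initial_loss = torch.nn.functional.mse_loss(model(x), y).item()

for _ in range(500):
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final_loss = torch.nn.functional.mse_loss(model(x), y).item()
print(initial_loss, final_loss)
```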

## Quick look at backpropagation

When `backward` is called, the gradient is calculated layer by layer, starting from the loss, and each layer passes its gradient on to the previous one.

![](./images/backprop.png)
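
To see what "passed to the previous ones" means in practice, here is a sketch that backpropagates by hand through a tiny two-layer network and compares the result with autograd. The layer sizes, the `tanh` activation and the sum used as a loss are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)

x = torch.rand(4, 3)
first = torch.nn.Linear(3, 2)
second = torch.nn.Linear(2, 1)

# Forward pass, keeping the intermediate activation
hidden = torch.tanh(first(x))
loss = second(hidden).sum()
loss.backward()  # autograd result to compare against

with torch.no_grad():
    grad_output = torch.ones(4, 1)              # d(loss)/d(output) for a sum
    grad_hidden = grad_output @ second.weight   # passed back through the second layer
    grad_pre = grad_hidden * (1 - hidden ** 2)  # times the local gradient of tanh
    manual_grad = grad_pre.T @ x                # gradient of the first layer's weights

print(torch.allclose(first.weight.grad, manual_grad, atol=1e-6))
```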

## Summary

- A linear layer works like linear regression with `in_features` inputs and `out_features` outputs (followed by softmax it becomes multiclass logistic regression).
- A perceptron has __no hidden layers__.
- A __Multilayer Perceptron (MLP)__ is a standard neural network with multiple layers.
- We need multiple layers interspersed with activations in order to achieve non-linear behaviour.
- The main activation functions currently are `ReLU` and `LeakyReLU`.
- The main problem with `sigmoid` and `tanh` is saturation and the dying gradient (though they are still used in some neural network blocks, such as recurrent ones).
- `ReLU` may suffer from the dying neurons phenomenon, which may hurt the neural network.
- It is rarely the most probable cause of poor network performance, though.
- Remember to use wide enough layers (say `50` or `100` neurons), depending on the task (if the model does not learn, it might need more parameters).
- Sufficiently wide and/or deep neural networks can approximate any continuous function.

## Challenges

- Play around with [Tensorflow Neural Network playground](http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.97988&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false)
- Run a few experiments with LeakyReLU (dying gradients, dying neurons)
- What is neural network pruning?
- How does the Maxout activation function work? What are the upsides and downsides of using it?
- How does the SELU activation function work? What are the upsides and downsides of using it?
- Can you somehow show the impact of non-zero centered activation?