# Neural Network Architecture and Hyperparameters

## Activation functions between layers
Sigmoid and Softmax are usually applied on the last layer of a network.

Sigmoid
* Output between 0 and 1
* Can be used anywhere in the network
* Gradients approach zero for low and high values of x (saturation)
* Some gradient values are so small that they prevent the weights from updating (vanishing gradient) -> training the network is challenging
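
The saturation described above can be checked numerically. This is a minimal sketch: the input values -10, 0 and 10 are arbitrary choices for illustration, not taken from the notes.

```python
import torch

# Evaluate the sigmoid gradient at a few inputs (values chosen for illustration)
x = torch.tensor([-10.0, 0.0, 10.0], requires_grad=True)
y = torch.sigmoid(x)

# Sum the outputs so backward() yields the element-wise derivative dy/dx
y.sum().backward()

sigmoid_grads = x.grad
print(sigmoid_grads)  # largest at x = 0 (0.25), close to 0 at x = -10 and x = 10
```

The gradient peaks at 0.25 for x = 0 and shrinks towards 0 at both extremes, which is why stacking many sigmoid layers tends to produce vanishing gradients.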

Softmax
* Outputs between 0 and 1 that sum to 1
* Saturates
* Normally used only on the last layer, not between hidden layers
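
A small sketch of softmax on the output of a last layer. The logits below are made-up values for one sample and three classes.

```python
import torch
import torch.nn as nn

# Hypothetical raw scores (logits) for one sample and three classes
logits = torch.tensor([[1.0, 2.0, 3.0]])

# dim=-1 normalizes across the class dimension
softmax = nn.Softmax(dim=-1)
probabilities = softmax(logits)

print(probabilities)        # each value lies between 0 and 1
print(probabilities.sum())  # the values sum to 1 (up to floating-point error)
```

Because the outputs behave like class probabilities, softmax is a natural fit for the output layer of a multi-class classifier.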

### ReLU
Rectified Linear Unit

f(x) = max(x, 0)

* positive input: output equal to input
* negative input: 0

* Widely used
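* No upper bound
* Gradient is 1 for every positive input, so it does not shrink towards 0 -> avoids the vanishing gradient problem for positive inputs
* However, for negative inputs both the output and the gradient are 0, so a neuron whose input stays negative stops updating ("dying ReLU")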

In [6]:
import torch
import torch.nn as nn

# Create a ReLU function with PyTorch
relu_pytorch = nn.ReLU()

# Apply your ReLU function on x, and calculate gradients
x = torch.tensor(-1.0, requires_grad=True)
y = relu_pytorch(x)
y.backward()

# Print the gradient of the ReLU function for x
gradient = x.grad
print(gradient) # 0, as input value is -1

tensor(0.)
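
To complement the single negative input above, the following sketch applies ReLU to a few illustrative values on both sides of 0 and prints the outputs and gradients together.

```python
import torch
import torch.nn as nn

relu = nn.ReLU()

# Illustrative inputs on both sides of 0
x_vals = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)
relu_out = relu(x_vals)

# Sum the outputs so backward() yields the element-wise derivative
relu_out.sum().backward()

print(relu_out)     # negatives clipped to 0, positives unchanged
print(x_vals.grad)  # 0 for negative inputs, 1 for positive inputs
```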


### Leaky ReLU

* positive input: output equal to input
* negative input: output is the input multiplied by a small coefficient (0.01 by default in PyTorch, set via `negative_slope`) -> non-zero gradient for negative input

In [4]:
# Create a leaky relu function in PyTorch
leaky_relu_pytorch = nn.LeakyReLU(negative_slope=0.05)

x = torch.tensor(-2.0)
# Call the above function on the tensor x
output = leaky_relu_pytorch(x)
print(output) # tensor(-0.1000), i.e. -2.0 * 0.05
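
The cell above shows the output only. As a sketch of the non-zero gradient claim, the same function can be differentiated at a negative input, reusing the `negative_slope=0.05` from the example.

```python
import torch
import torch.nn as nn

leaky_relu = nn.LeakyReLU(negative_slope=0.05)

# Negative input, tracked for gradients
x_neg = torch.tensor(-2.0, requires_grad=True)
y_neg = leaky_relu(x_neg)
y_neg.backward()

leaky_grad = x_neg.grad
print(leaky_grad)  # equals negative_slope, not 0
```

Unlike plain ReLU, the gradient for a negative input is the slope itself, so the weights feeding this neuron can still be updated.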

## Architecture

Fully connected NN
* Linear layers are fully connected
* A neuron computes a linear operation using N+1 parameters: N weights (one per output of the previous layer) plus 1 bias
* 3 types of layers: input, hidden, output
* Size of input layer is number of features
* Size of output layer is number of classes
* Number of features and classes are imposed by dataset
* "Model capacity" = number of parameters in the model; adding hidden layers or neurons increases it -> the model can fit more complex problems

In [9]:
# example model
model = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 2))
# model capacity
# 1st layer: 4 neurons, 8 inputs -> 4 * (8 + 1) = 36 parameters
# 2nd layer: 2 neurons, 4 inputs -> 2 * (4 + 1) = 10 parameters

total = 0
for param in model.parameters():
    total += param.numel()
total

46
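
The total above can also be broken down per parameter tensor, which makes the `neurons * (inputs + 1)` arithmetic visible. This sketch rebuilds the same example model; `param_counts` and `total_params` are names introduced here for illustration.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 4), nn.Linear(4, 2))

# Map each parameter tensor's name to its number of elements
param_counts = {name: param.numel() for name, param in model.named_parameters()}
for name, count in param_counts.items():
    print(name, count)

total_params = sum(param_counts.values())
print(total_params)
```

The weights account for 8 * 4 = 32 and 4 * 2 = 8 parameters, and the biases for 4 and 2, which adds up to the 46 computed above.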

## Learning rate and momentum

Training a neural network amounts to solving an optimization problem.
Stochastic Gradient Descent (SGD) is a commonly used optimizer. It takes 2 main parameters:
* learning rate: step size = gradient * learning rate (typically 0.0001 to 0.01)
* momentum: adds inertia from previous updates, which helps the optimizer move past local dips instead of getting stuck; it does not guarantee reaching the global minimum (typically 0.85 to 0.99)

In [None]:
import torch.optim as optim

sgd = optim.SGD(model.parameters(), lr=0.01, momentum=0.95)
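
To see what the two parameters do, here is a toy sketch that minimizes `w**2` for a single weight. The starting value, `lr=0.1` and `momentum=0.9` are illustrative assumptions, and the expected numbers assume PyTorch's default momentum formulation (no dampening, no Nesterov).

```python
import torch
import torch.optim as optim

# Toy problem (illustrative): minimize loss = w**2, starting from w = 1.0
w = torch.tensor(1.0, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1, momentum=0.9)

history = []
for step in range(2):
    optimizer.zero_grad()   # clear gradients from the previous step
    loss = w ** 2
    loss.backward()         # gradient is 2 * w
    optimizer.step()        # update w using the gradient and the momentum buffer
    history.append(w.item())

print(history)
```

The first step moves `w` from 1.0 to about 0.8 (gradient 2 times learning rate 0.1). The second step lands near 0.46 rather than the 0.64 that plain gradient descent would give, because the momentum buffer carries over 0.9 times the previous update.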

## Layer initialization and transfer learning
Layers can be initialized by sampling from a uniform distribution. With its default arguments, `nn.init.uniform_` draws the initial weights from the range 0 to 1.

In [19]:
import torch.nn as nn

layer = nn.Linear(64, 128)
nn.init.uniform_(layer.weight)
layer.weight.min(), layer.weight.max()

(tensor(4.2915e-06, grad_fn=<MinBackward1>),
 tensor(0.9997, grad_fn=<MaxBackward1>))

In [23]:
# another example
layer0 = nn.Linear(16, 32)
layer1 = nn.Linear(32, 64)

# Use uniform initialization for layer0 and layer1 weights
nn.init.uniform_(layer0.weight)
nn.init.uniform_(layer1.weight)

model = nn.Sequential(layer0, layer1)
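
The range does not have to be 0 to 1: `nn.init.uniform_` accepts lower and upper bounds through its `a` and `b` arguments. The bounds of -0.1 and 0.1 below are an arbitrary choice for illustration, not a recommendation from these notes.

```python
import torch.nn as nn

layer = nn.Linear(64, 128)

# Bounds chosen for illustration only; a and b are the lower and upper limits
nn.init.uniform_(layer.weight, a=-0.1, b=0.1)

w_min, w_max = layer.weight.min().item(), layer.weight.max().item()
print(w_min, w_max)
```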

Fine-tuning
> In practice, a model is typically trained on a large dataset as a starting point, and then fine-tuned on a smaller one, __but with a smaller learning rate__.
> 
> Some layers can be left untrained (frozen): generally, freeze the early layers and tune the layers closer to the output.

Steps
* Find a model trained on a similar task
* Load weights
* Freeze some layers if necessary
* Train with a smaller learning rate
* Evaluate the loss and adjust
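
The steps above can be sketched end to end as follows. Everything here is a stand-in: `make_model` is a hypothetical architecture, the "pretrained" model is only randomly initialized, and the data is random. In a real workflow the weights would typically come from a saved checkpoint instead.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def make_model():
    # Hypothetical architecture shared by the pretrained and fine-tuned models
    return nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))

# Stand-in for a model trained on a similar task (here just randomly initialized)
pretrained = make_model()

# Load the weights into a fresh model
model = make_model()
model.load_state_dict(pretrained.state_dict())

# Freeze the early layer and keep a copy of its weights for comparison
for param in model[0].parameters():
    param.requires_grad = False
frozen_before = model[0].weight.clone()

# Train with a smaller learning rate, passing only the trainable parameters
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = optim.SGD(trainable, lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()

# One training step on random stand-in data
features = torch.randn(32, 64)
labels = torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = criterion(model(features), labels)
loss.backward()
optimizer.step()

print(loss.item())                                  # evaluate the loss
print(torch.equal(model[0].weight, frozen_before))  # frozen layer is unchanged
```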

In [22]:
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.Linear(128, 256))

# Note: this example freezes both layers, so no parameter is left to train.
# When fine-tuning, keep the layers closest to the output trainable.
for name, param in model.named_parameters():

    # Check if the parameters belong to the first layer
    if name == '0.weight' or name == '0.bias':

        # Freeze the parameters
        param.requires_grad = False

    # Check if the parameters belong to the second layer
    if name == '1.weight' or name == '1.bias':

        # Freeze the parameters
        param.requires_grad = False
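
An alternative sketch, assuming only the first layer should be frozen: `requires_grad_(False)` can be called on a whole module instead of matching parameter names. Counting the parameters afterwards is a quick way to confirm what remains trainable.

```python
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.Linear(128, 256))

# Freeze the whole first layer in one call
model[0].requires_grad_(False)

# Count parameters by whether they will receive gradient updates
trainable_count = sum(p.numel() for p in model.parameters() if p.requires_grad)
frozen_count = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(trainable_count, frozen_count)
```

Here the first layer holds 128 * (64 + 1) = 8320 frozen parameters, and the second layer keeps 256 * (128 + 1) = 33024 trainable ones.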