# Feedforward networks
Notebook inspired by the [PyTorch official tutorials](https://pytorch.org/tutorials/)

This section covers:
- a simple way of building a fully connected feedforward network
- usage of <tt>nn.Sequential</tt> to build a model

To begin, we discuss subclasses of <tt>torch.nn.Module</tt>, the PyTorch base class meant to encapsulate behaviors specific to PyTorch models and their components.

One important behavior of <tt>torch.nn.Module</tt> is registering parameters. If a particular Module subclass has learnable weights, these weights are expressed as instances of <tt>torch.nn.Parameter</tt>. The Parameter class is a subclass of <tt>torch.Tensor</tt>, with the special behavior that when a Parameter is assigned as an attribute of a Module, it is added to the list of that module's parameters. These parameters may be accessed through the <tt>parameters()</tt> method of the Module class.
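The registration behavior described above can be checked with a minimal sketch. The `ScaleShift` class and its attribute names are hypothetical, introduced only for illustration; the point is that an `nn.Parameter` attribute shows up in `parameters()`, while a plain tensor attribute does not.

```python
import torch
from torch import nn


class ScaleShift(nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self, n_features):
        super().__init__()
        # nn.Parameter attributes are registered automatically.
        self.scale = nn.Parameter(torch.ones(n_features))
        self.shift = nn.Parameter(torch.zeros(n_features))
        # A plain tensor attribute is NOT registered as a parameter.
        self.not_a_param = torch.zeros(n_features)

    def forward(self, x):
        return x * self.scale + self.shift


module = ScaleShift(3)

# Names of the registered parameters, in registration order.
registered = [name for name, _ in module.named_parameters()]
print(registered)  # ['scale', 'shift']

# Total number of learnable values: 3 (scale) + 3 (shift).
n_weights = sum(p.numel() for p in module.parameters())
print(n_weights)  # 6
```

Summing `numel()` over `parameters()` in this way is a common idiom for counting the learnable weights of any model.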

## Perform standard imports

In [4]:
import torch
from torch import nn


## Simple tiny model
As a first example, here is a very small model with two linear layers, a ReLU activation between them, and a softmax at the output.

In [5]:
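class TinyModel(torch.nn.Module):

    def __init__(self):
        super().__init__()

        # Layers are created once here and registered as submodules.
        self.linear1 = torch.nn.Linear(100, 200)  # 100 inputs -> 200 hidden units
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)   # 200 hidden units -> 10 outputs
        # No dim is passed, so PyTorch warns and chooses the dimension implicitly;
        # torch.nn.Softmax(dim=1) is the explicit choice for (batch, features) input.
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        # Apply the layers in order: linear -> ReLU -> linear -> softmax.
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x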

This shows the fundamental structure of a PyTorch model: there is an
``__init__()`` method that defines the layers and other components of a
model, and a ``forward()`` method where the computation gets done.
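As a minimal sketch of how the two methods work together, the snippet below runs a random batch through a model with the same layer sizes. The class name `TwoLayerNet` and the batch size of 8 are assumptions made for illustration, and `dim=1` is passed to the softmax explicitly. Calling the instance like a function is the usual way to trigger `forward()`.

```python
import torch
from torch import nn


class TwoLayerNet(nn.Module):
    """Stand-in with the same layout: linear -> ReLU -> linear -> softmax."""

    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(100, 200)
        self.activation = nn.ReLU()
        self.linear2 = nn.Linear(200, 10)
        self.softmax = nn.Softmax(dim=1)  # normalize over the feature dimension

    def forward(self, x):
        x = self.activation(self.linear1(x))
        return self.softmax(self.linear2(x))


net = TwoLayerNet()
batch = torch.rand(8, 100)   # 8 samples with 100 features each

probs = net(batch)           # calling the module runs forward()
print(probs.shape)           # torch.Size([8, 10])

# Each row is a probability distribution, so it should sum to 1.
row_sums = probs.sum(dim=1)
print(row_sums)
```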

### EXERCISE: create an instance of it and ask it to report on its parameters

You can print the model, or any of its submodules, to learn about
its structure.

In [6]:
# YOUR CODE HERE
tinymodel = TinyModel()

print('The model:')
print(tinymodel)

print('\n\nJust one layer:')
print(tinymodel.linear2)

print('\n\nModel params:')
for param in tinymodel.parameters():
    print(param)

print('\n\nLayer params:')
for param in tinymodel.linear2.parameters():
    print(param)

The model:
TinyModel(
  (linear1): Linear(in_features=100, out_features=200, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=200, out_features=10, bias=True)
  (softmax): Softmax(dim=None)
)


Just one layer:
Linear(in_features=200, out_features=10, bias=True)


Model params:
Parameter containing:
tensor([[-0.0736, -0.0023, -0.0267,  ..., -0.0019, -0.0361,  0.0325],
        [-0.0105,  0.0442,  0.0950,  ...,  0.0156, -0.0942, -0.0199],
        [-0.0357,  0.0251, -0.0690,  ..., -0.0732, -0.0215, -0.0945],
        ...,
        [ 0.0882, -0.0652, -0.0024,  ..., -0.0236, -0.0484, -0.0670],
        [-0.0731,  0.0219,  0.0312,  ..., -0.0835, -0.0183, -0.0921],
        [ 0.0605,  0.0670,  0.0746,  ..., -0.0120,  0.0307,  0.0406]],
       requires_grad=True)
Parameter containing:
tensor([ 3.4916e-02,  3.7676e-02,  5.4920e-02, -7.7045e-02, -3.7277e-02,
        -2.5462e-02, -1.0615e-02, -4.3839e-02, -9.4245e-02, -2.5768e-02,
         2.1719e-02, -8.9354e-02,  7.0671e-02, -6.4426


## Data Manipulation Layers and Functions
So far we have looked at linear layers. In the next lessons we will study convolutional and recurrent layers.

However, there are other layer types that perform important functions in models
but don’t participate in the learning process themselves.

### Data Manipulation Layers
**Normalization layers** re-center and normalize the output of one layer
before feeding it to another. Centering and scaling the intermediate
tensors has a number of beneficial effects, such as letting you use
higher learning rates without exploding/vanishing gradients.



In [7]:
my_tensor = torch.rand(1, 4, 4) * 20 + 5
print(my_tensor)

print(my_tensor.mean())

norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print(normed_tensor)

print(normed_tensor.mean())

tensor([[[16.5197, 24.8638, 23.9220, 16.0693],
         [16.6390, 11.0045, 16.8072,  9.7922],
         [16.2298, 12.9146,  5.7717,  7.6408],
         [13.9772,  7.2556, 24.4991, 12.5563]]])
tensor(14.7789)
tensor([[[-0.9405,  1.1117,  0.8800, -1.0513],
         [ 0.9644, -0.8009,  1.0171, -1.1807],
         [ 1.3449,  0.5474, -1.1710, -0.7213],
         [-0.0951, -1.1697,  1.5871, -0.3223]]],
       grad_fn=<NativeBatchNormBackward0>)
tensor(3.7253e-08, grad_fn=<MeanBackward0>)


Running the cell above, we’ve added a large scaling factor and offset to
an input tensor; you should see the input tensor’s ``mean()`` somewhere
in the neighborhood of 15. After running it through the normalization
layer, you can see that the values are smaller and grouped around zero;
in fact, the mean should be very close to zero (on the order of 1e-8 in
the output above).

This is beneficial because many activation functions (discussed below)
have their strongest gradients near 0, but sometimes suffer from
vanishing or exploding gradients for inputs that drive them far away
from zero. Keeping the data centered around the area of steepest
gradient will tend to mean faster, better learning and higher feasible
learning rates.
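To make the normalization step concrete, here is a minimal sketch that reproduces the output of `BatchNorm1d` by hand. It assumes a freshly created layer in training mode, where the learnable scale is 1 and the learnable offset is 0, so the layer reduces to subtracting the per-channel mean and dividing by the per-channel standard deviation. The seed and tensor shape are arbitrary choices.

```python
import torch

torch.manual_seed(0)
x = torch.rand(1, 4, 4) * 20 + 5      # shape (batch, channels, length)

norm_layer = torch.nn.BatchNorm1d(4)  # modules start in training mode
layer_out = norm_layer(x)

# Per-channel statistics, computed over the batch and length dimensions.
mean = x.mean(dim=(0, 2), keepdim=True)
var = x.var(dim=(0, 2), unbiased=False, keepdim=True)  # biased variance

# eps is the small constant the layer adds for numerical stability.
manual_out = (x - mean) / torch.sqrt(var + norm_layer.eps)

print(torch.allclose(layer_out, manual_out, atol=1e-5))
print(layer_out.mean())  # very close to zero
```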

**Dropout layers** are a tool for encouraging *sparse representations*
in your model - that is, pushing it to do inference with less data.

Dropout layers work by randomly setting parts of the input tensor to zero
*during training*, which forces the model to learn against this masked or
reduced input. Dropout is turned off for inference: it is inactive once the
module is in evaluation mode (``eval()``).
For example:




In [8]:
my_tensor = torch.rand(1, 4, 4)

dropout = torch.nn.Dropout(p=0.4)
print(dropout(my_tensor))
print(dropout(my_tensor))

tensor([[[0.0000, 0.0000, 0.6719, 0.0000],
         [0.7051, 0.3668, 0.6258, 0.8969],
         [0.2865, 0.5560, 0.4910, 0.0000],
         [1.3266, 0.6856, 0.0000, 1.5237]]])
tensor([[[1.2388, 0.0536, 0.6719, 1.3306],
         [0.0000, 0.0000, 0.0000, 0.0000],
         [0.0000, 0.5560, 0.0000, 0.0786],
         [0.0000, 0.6856, 0.8308, 0.0000]]])


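Above, you can see the effect of dropout on a sample tensor. You can use
the optional ``p`` argument to set the probability of an individual
element of the input being zeroed; if you don’t, it defaults to 0.5.
The surviving elements are scaled by ``1 / (1 - p)``, which is why some
values in the output above are larger than 1.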
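The difference between training and inference can be checked with a short sketch. The tensor of ones and the seed are arbitrary choices for illustration. In training mode the elements that are kept are scaled by `1 / (1 - p)`, and in evaluation mode the layer passes its input through unchanged.

```python
import torch

torch.manual_seed(0)
x = torch.ones(1000)
dropout = torch.nn.Dropout(p=0.4)

# Training mode: elements are zeroed at random, the rest are rescaled.
dropout.train()
train_out = dropout(x)
kept = train_out[train_out != 0]
zero_fraction = (train_out == 0).float().mean().item()
print(zero_fraction)   # roughly 0.4
print(kept.unique())   # a single value, 1 / (1 - 0.4)

# Evaluation mode: dropout is a no-op.
dropout.eval()
eval_out = dropout(x)
print(torch.equal(eval_out, x))  # True
```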

### Activation Functions

Activation functions make deep learning possible. A neural network is
really a program - with many parameters - that *simulates a mathematical
function*. If all we did was multiply tensors by layer weights
repeatedly, we could only simulate *linear functions;* further, there
would be no point to having many layers, as the whole network
could be reduced to a single matrix multiplication. Inserting
*non-linear* activation functions between layers is what allows a deep
learning model to approximate non-linear functions, rather than just
linear ones.
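The collapse of stacked linear layers into a single one can be verified numerically. The sketch below uses arbitrary layer sizes and a fixed seed; it combines the weights and biases of two `nn.Linear` layers into one matrix and one bias vector, and compares the result with applying the two layers in sequence.

```python
import torch
from torch import nn

torch.manual_seed(0)
lin1 = nn.Linear(4, 8)
lin2 = nn.Linear(8, 3)
x = torch.rand(5, 4)

# Two linear layers applied back to back, with no activation in between.
stacked = lin2(lin1(x))

# W2 (W1 x + b1) + b2  ==  (W2 W1) x + (W2 b1 + b2)
W = lin2.weight @ lin1.weight
b = lin2.weight @ lin1.bias + lin2.bias
collapsed = x @ W.T + b

print(torch.allclose(stacked, collapsed, atol=1e-5))

# With a ReLU in between, the single-matrix shortcut generally no longer holds.
with_relu = lin2(torch.relu(lin1(x)))
print(torch.allclose(with_relu, collapsed, atol=1e-5))
```

The first comparison should print `True`; the second typically prints `False`, because the ReLU zeroes every negative intermediate value.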

``torch.nn`` provides modules encapsulating all of the major
activation functions, including ReLU and its many variants, Tanh,
Hardtanh, Sigmoid, and more. It also includes other functions, such as
Softmax, that are most useful at the output stage of a model.

### Loss Functions

Loss functions tell us how far a model’s prediction is from the correct
answer. PyTorch contains a variety of loss functions, including the common
MSE (mean squared error, i.e. the mean of the squared differences),
Cross Entropy Loss and Negative Log Likelihood Loss (both useful for
classifiers), and others.
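As a small illustration of these losses, the sketch below evaluates `nn.MSELoss` on hand-picked values and checks that `nn.CrossEntropyLoss` applied to raw scores gives the same result as `nn.LogSoftmax` followed by `nn.NLLLoss`. The tensors and the seed are made up for the example.

```python
import torch
from torch import nn

# Mean squared error: the average of the squared differences.
pred = torch.tensor([1.0, 2.0, 3.0])
true = torch.tensor([1.0, 2.0, 5.0])
mse = nn.MSELoss()(pred, true)
print(mse)  # (0 + 0 + 4) / 3

# Classification: 4 samples, 3 classes, integer class labels as targets.
torch.manual_seed(0)
logits = torch.randn(4, 3)
targets = torch.tensor([0, 2, 1, 2])

# CrossEntropyLoss expects raw scores (logits), not probabilities.
ce = nn.CrossEntropyLoss()(logits, targets)

# NLLLoss expects log-probabilities, so LogSoftmax is applied first.
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(ce, nll))  # True
```

Because `CrossEntropyLoss` already includes the log-softmax step, a classifier trained with it would normally output raw logits rather than softmax probabilities.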


## The sequential module

``nn.Sequential`` is a container module: it lets you build a PyTorch neural network by chaining existing modules, without writing an explicit ``forward()`` method for the chain. In effect, it wraps a list of ``nn`` modules and behaves as a single module.

Modules are added to the container in the order in which they are passed to the constructor, and the input is passed through them in that same order.
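To make the ordering concrete, here is a minimal sketch with arbitrary layer sizes. It compares the output of an `nn.Sequential` container with the result of applying its submodules by hand in constructor order; the submodules are retrieved by integer index.

```python
import torch
from torch import nn

torch.manual_seed(0)
stack = nn.Sequential(
    nn.Linear(6, 4),
    nn.ReLU(),
    nn.Linear(4, 2),
)
x = torch.rand(3, 6)

# The container feeds the input through its modules in order.
out = stack(x)

# Same computation, applying the submodules one by one.
manual = stack[2](stack[1](stack[0](x)))

print(torch.allclose(out, manual))  # True
print(len(stack))                   # 3
```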

Let us look at an example: as usual, we define our neural network by subclassing ``nn.Module`` and
initialize the neural network layers in ``__init__``. Every ``nn.Module`` subclass implements
the operations on input data in the ``forward`` method.

Modules are concatenated using the ``nn.Sequential`` container.

In [9]:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        output = self.linear_relu_stack(x)
        return output

In [10]:
model = NeuralNetwork()
print(model)

NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)


### EXERCISE: use the sequential module to build a feedforward network
The network should be as follows:
1. It should comprise 10 hidden layers
2. Each hidden layer should have a different number of units
3. For each hidden layer the activation function should be the rectified linear unit function
4. The activation of the output layer should be the softmax function (this replaces the sigmoid function when dealing with a multiclass classification problem).

In [11]:
# YOUR CODE HERE
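class NetworkSequential(nn.Module):
    def __init__(self, layers_sizes, output_size):
        super().__init__()
        self.deep_net = nn.Sequential()
        # layers_sizes = [input_size, hidden_1, ..., hidden_n]: each consecutive
        # pair of sizes defines one hidden Linear layer followed by a ReLU.
        for i_layer, (n_in, n_out) in enumerate(zip(layers_sizes[:-1], layers_sizes[1:]), start=1):
            self.deep_net.add_module(f'linear{i_layer}', nn.Linear(n_in, n_out))
            self.deep_net.add_module(f'activation{i_layer}', nn.ReLU())
        # Output layer. It returns raw logits: the softmax asked for in item 4
        # is not applied here (it could be appended as nn.Softmax(dim=1)).
        self.deep_net.add_module('classifier', nn.Linear(layers_sizes[-1], output_size))

    def forward(self, x):
        return self.deep_net(x)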

In [12]:
output_size = 10
input_size = 1000
n_hidden = 10
layers_sizes = [int(input_size / 1.5**i) for i in range(n_hidden+1)]
print(layers_sizes)
mynetwork = NetworkSequential(layers_sizes, output_size)

[1000, 666, 444, 296, 197, 131, 87, 58, 39, 26, 17]


In [13]:
print(mynetwork)

NetworkSequential(
  (deep_net): Sequential(
    (linear1): Linear(in_features=1000, out_features=666, bias=True)
    (activation1): ReLU()
    (linear2): Linear(in_features=666, out_features=444, bias=True)
    (activation2): ReLU()
    (linear3): Linear(in_features=444, out_features=296, bias=True)
    (activation3): ReLU()
    (linear4): Linear(in_features=296, out_features=197, bias=True)
    (activation4): ReLU()
    (linear5): Linear(in_features=197, out_features=131, bias=True)
    (activation5): ReLU()
    (linear6): Linear(in_features=131, out_features=87, bias=True)
    (activation6): ReLU()
    (linear7): Linear(in_features=87, out_features=58, bias=True)
    (activation7): ReLU()
    (linear8): Linear(in_features=58, out_features=39, bias=True)
    (activation8): ReLU()
    (linear9): Linear(in_features=39, out_features=26, bias=True)
    (activation9): ReLU()
    (linear10): Linear(in_features=26, out_features=17, bias=True)
    (activation10): ReLU()
    (classifier): Lin

In [14]:
from collections import OrderedDict

# nn.Sequential also accepts an OrderedDict, which gives each submodule a name.
model = nn.Sequential(OrderedDict([
          ('conv1', nn.Conv2d(1,20,5)),
          ('relu1', nn.ReLU()),
          ('conv2', nn.Conv2d(20,64,5)),
          ('relu2', nn.ReLU())
        ]))