# Neural Networks
## Homework 1: Implementing advanced activation functions

**Name**: Gabriele G. Di Marzo

**Matricola**: 2012633

Upload the completed notebook **before 30/11/2022 at 23:59** on the Google Classroom page.

In [None]:
import torch
import matplotlib.pyplot as plt


### Objective

The purpose of this homework is to implement a new layer inside PyTorch, by properly subclassing `nn.Module`. **Before proceeding**, carefully read the following documentation:

+ [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html?highlight=module#torch.nn.Module)
+ [PyTorch: Custom Module](https://pytorch.org/tutorials/beginner/examples_nn/polynomial_module.html)

You can also (optionally) learn more about activation functions by reading the survey in [1].

### Introduction: families of activation functions

In the course up to now, we have seen the application of the sigmoid, the softmax, and the ReLU. However, many additional activation functions exist [1], with varying strengths and drawbacks.

Several of them are designed as variants of the ReLU. The **S-Shaped ReLU** (SReLU) [2] is defined as:

$$
\phi(s) = \begin{cases} t^r + a^r(s - t^r) & \text{ if } s > t^r \\ s & \text{ if } t^l \le s \le t^r \\ t^l + a^l(s - t^l) & \text{ if } s < t^l \end{cases} \,.
$$

The four parameters $t^r, a^r, t^l, a^l$ are trained via back-propagation, and they are **different for each unit in the layer**. 
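
To make the piecewise definition concrete, here is a minimal functional sketch that evaluates it for fixed scalar parameters. The function name `srelu_reference` and the parameter values are illustrative assumptions, not part of the assignment. The boundary points $s = t^l$ and $s = t^r$ are assigned to the identity region, which is harmless because the function is continuous there.

```python
import torch

def srelu_reference(s, t_l, a_l, t_r, a_r):
    # Hypothetical helper: evaluates the SReLU formula for fixed (non-trainable) parameters.
    right = t_r + a_r * (s - t_r)   # linear piece used when s > t_r
    left = t_l + a_l * (s - t_l)    # linear piece used when s < t_l
    return torch.where(s > t_r, right, torch.where(s < t_l, left, s))

s = torch.tensor([-2.0, 0.5, 3.0])
# Illustrative values: t_l = 0, a_l = 0.1, t_r = 1, a_r = 2
out = srelu_reference(s, t_l=0.0, a_l=0.1, t_r=1.0, a_r=2.0)
print(out)  # values: -0.2, 0.5, 5.0
```

In the trainable version each of the four parameters becomes a vector with one entry per unit.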


### Exercise 1: implementing an activation function (1 point)

Let us start with the simpler **[exponential linear squashing](https://paperswithcode.com/method/elish)** (ELiSH) activation function:

$$
\phi(x) = \begin{cases} \sigma(x)x & \text{ if } x \ge 0 \\ \frac{\exp(x) - 1}{1 + \exp(-x)} & \text{ otherwise} \end{cases}
$$

**Exercise 1**: complete the following stub.

In [None]:
def elish(x):
  # x is a generic torch.Tensor, and this function must compute the ELiSH activation function.
  # Since sigmoid(x) = 1 / (1 + exp(-x)), the negative branch equals sigmoid(x) * (exp(x) - 1).
  return torch.where(x >= 0, torch.sigmoid(x) * x, torch.sigmoid(x) * (torch.exp(x) - 1))


**Hints for a correct implementation**:

1. There are several ways of implementing an if/else operation like the one above in PyTorch. In general, the simplest implementation of "*if a then b, else c*" is `torch.where(a, b, c)` (see the documentation for [torch.where](https://pytorch.org/docs/stable/generated/torch.where.html)). Any working variant is accepted here.
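
One caveat with `torch.where(a, b, c)`: both `b` and `c` are computed for every element, and the selection only happens afterwards. For large positive inputs, `torch.exp(x)` overflows in the unused branch, which can turn the gradient into NaN even though the forward value is fine. The sketch below shows one possible workaround, assuming it is acceptable to clamp the input of the negative branch at zero; the name `elish_stable` is hypothetical.

```python
import torch

def elish_stable(x):
    # Clamp the input of the negative branch so exp() never sees large positive values.
    # For x < 0 the clamp is a no-op, so the selected values are unchanged.
    x_neg = torch.clamp(x, max=0.0)
    negative = torch.sigmoid(x_neg) * torch.expm1(x_neg)  # (exp(x) - 1) / (1 + exp(-x))
    positive = torch.sigmoid(x) * x
    return torch.where(x >= 0, positive, negative)

x = torch.tensor([0.2, -0.4, 100.0], requires_grad=True)
y = elish_stable(x)
y.sum().backward()
print(y)       # first two entries are close to 0.11 and -0.13
print(x.grad)  # all entries are finite
```

On the interval [-5, +5] used in this homework the plain `torch.where` version does not overflow, so this is only a precaution for wider input ranges.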

Here is a simple sanity check for the correct implementation:

In [None]:
elish(torch.FloatTensor([[0.2, -0.4]])) # Should be approximately [[0.11, -0.13]]

### Exercise 2: some visualization experiments (1 point)

**Exercise 2.1**: plot the ELiSH function in [-5, +5].

In [None]:
x = torch.arange(-5, 5, 0.1, requires_grad=True )
y = elish(x)
plt.plot(x.detach().numpy(), y.detach().numpy())
plt.xlim([-5, 5])
plt.ylim(y.detach().numpy().min(), y.detach().numpy().max())
plt.grid(True)
plt.show()

**Exercise 2.2**: using the utilities from `torch.autograd` ([torch.autograd](https://pytorch.org/docs/stable/autograd.html)), **compute and plot** the derivative of ELiSH using automatic differentiation.
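
As a hedged illustration of the autograd utilities on a simpler function (the sigmoid, whose derivative is known in closed form), the sketch below uses `torch.autograd.grad`. Because each output element depends only on the matching input element, differentiating the sum of the outputs yields the element-wise derivative. Variable names are illustrative.

```python
import torch

x = torch.linspace(-5, 5, 101, requires_grad=True)
y = torch.sigmoid(x)

# Each y[i] depends only on x[i], so d(sum(y))/dx[i] equals dy[i]/dx[i].
(dy_dx,) = torch.autograd.grad(y.sum(), x)

# Closed-form derivative of the sigmoid, for comparison.
expected = (y * (1 - y)).detach()
print(torch.allclose(dy_dx, expected, atol=1e-6))
```

The same pattern applies to any element-wise function: calling `y.backward(torch.ones_like(x))` is equivalent to `y.sum().backward()` and stores the derivative in `x.grad`.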

In [None]:
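# Each y[i] depends only on x[i], so back-propagating a vector of ones stores dy[i]/dx[i] in x.grad.
y.backward(torch.ones_like(x))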

In [None]:
# Plot the gradient of the ELiSH function
plt.grid(True)
plt.plot(x.detach().numpy(), x.grad.detach().numpy(), label="derivative of ELiSH")
plt.xlim([-5, 5])
plt.legend()
plt.show()


**Exercise 2.3 (sanity check)**: build a model using the previously defined activation function, and test it on a random mini-batch of data:

In [None]:
# TODO: complete the definition of the model
class RandomModel(torch.nn.Module): 
  def __init__(self, funct): 
    super().__init__()
    self.l1 = torch.nn.Linear(10, 10)
    self.l2 = torch.nn.Linear(10, 2)
    self.funct = funct
  
  def forward(self, x): 
    x = self.l1(x)
    x = self.funct(x)
    x = self.l2(x)
    return x
    

model = RandomModel(elish)

print(model(torch.randn((5, 10))))
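
If the activation should also be usable inside `torch.nn.Sequential`, one option is to wrap the function in a small module. This is a sketch under assumptions: the class name `FunctionalActivation` is hypothetical, and `torch.tanh` stands in for any element-wise function such as `elish`.

```python
import torch

class FunctionalActivation(torch.nn.Module):
    # Hypothetical wrapper: turns a plain tensor function into a parameter-free layer.
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x)

model = torch.nn.Sequential(
    torch.nn.Linear(10, 10),
    FunctionalActivation(torch.tanh),  # replace torch.tanh with elish in the notebook
    torch.nn.Linear(10, 2),
)
out = model(torch.randn(5, 10))
print(out.shape)  # torch.Size([5, 2])
```

The wrapper holds no parameters of its own, so the parameter count of the model is determined by the linear layers alone.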

### Exercise 3: implementing a trainable activation function (2 points)

**Exercise 3:** define a `torch.nn.Module` implementing the SReLU.

**Hints for a correct implementation**:
* The layer should **only** implement the activation function. Ideally, it will always be used in combination with a fully-connected layer with no activation function.
* Think carefully about how you want to initialize the parameters.
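
The mechanics involved (registering per-unit tensors as `nn.Parameter` and relying on broadcasting over the batch dimension) can be illustrated on a simpler trainable activation. The sketch below assumes a leaky-ReLU-style function with one learnable negative slope per unit; the class name `PerUnitLeakyReLU` and the initial slope of 0.25 are illustrative choices, not part of the assignment.

```python
import torch

class PerUnitLeakyReLU(torch.nn.Module):
    # Hypothetical example layer: one learnable negative slope per unit.
    def __init__(self, units, init_slope=0.25):
        super().__init__()
        # Shape (units,) broadcasts against inputs of shape (batch, units).
        self.slope = torch.nn.Parameter(torch.full((units,), init_slope))

    def forward(self, x):
        return torch.where(x >= 0, x, self.slope * x)

layer = PerUnitLeakyReLU(3)
x = torch.tensor([[1.0, -2.0, -4.0]])
out = layer(x)
out.sum().backward()
print(out)               # values: 1.0, -0.5, -1.0
print(layer.slope.grad)  # gradient w.r.t. each slope: 0.0, -2.0, -4.0
```

A constant initialization like this one makes the layer behave predictably at the start of training, whereas a random one can produce negative slopes or, for threshold parameters, regions in the wrong order.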

In [None]:
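from torch.nn.parameter import Parameter

class SReLU(torch.nn.Module):
  def __init__(self, units):
    super().__init__()
    self.units = units

    # Left segment: slope a^l (alpha1) and threshold t^l (beta1), one per unit.
    # Right segment: slope a^r (alpha2) and threshold t^r (beta2), one per unit.
    # The initialization is a design choice: t^l = 0 < t^r = 1 keeps the three regions
    # ordered, and with a^l = 0.1, a^r = 1 the layer starts as a leaky ReLU with slope 0.1.
    self.alpha1 = Parameter(torch.full((self.units,), 0.1))
    self.beta1 = Parameter(torch.zeros(self.units))

    self.alpha2 = Parameter(torch.ones(self.units))
    self.beta2 = Parameter(torch.ones(self.units))

  def forward(self, x):
    # Parameters of shape (units,) broadcast over the batch dimension of x.
    left = self.beta1 + self.alpha1 * (x - self.beta1)
    right = self.beta2 + self.alpha2 * (x - self.beta2)
    return torch.where(x < self.beta1, left, torch.where(x > self.beta2, right, x))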
    

As a sanity check, initialize an SReLU layer and count the number of parameters:

In [None]:
# Initialize the layer
layer = SReLU(2)
# Count the parameters
numb_params = sum([p.numel() for p in layer.parameters() if p.requires_grad])
p_tot = [p for p in layer.parameters()]
print( numb_params ) # Should print 8!

### Exercise 4: training a model with trainable activation functions (1 point)

We will use the following dataset from TensorFlow Datasets:
https://www.tensorflow.org/datasets/catalog/german_credit_numeric

In [None]:
import tensorflow_datasets as tfds
import torch.optim as optim 
from collections import OrderedDict
from torch.utils.data import DataLoader
from sklearn import preprocessing

In [None]:
class dataset(torch.utils.data.Dataset):
  def __init__(self,X,Y):
    self.X = X                           
    self.Y = Y       

  def __len__(self):
    return len(self.X)                  

  def __getitem__(self, idx):
    return [self.X[idx], self.Y[idx]]


In [None]:
train_data = tfds.load('german_credit_numeric', split='train[:75%]', as_supervised=True)
test_data = tfds.load('german_credit_numeric', split='train[75%:]', as_supervised=True)

Xtrain, ytrain = train_data.batch(5000).get_single_element()
Xtrain, ytrain = preprocessing.normalize(Xtrain.numpy()), ytrain.numpy()
Xtrain, ytrain = torch.Tensor(Xtrain), torch.Tensor(ytrain)

Xtest, ytest = test_data.batch(5000).get_single_element()
Xtest, ytest = preprocessing.normalize(Xtest.numpy()), ytest.numpy()
Xtest, ytest = torch.Tensor(Xtest), torch.Tensor(ytest)

In [None]:
train_loader = DataLoader(dataset(Xtrain, ytrain), batch_size=64, shuffle=True)
test_loader = DataLoader(dataset(Xtest, ytest), batch_size=64, shuffle=False)

**Exercise 4**: write a `nn.Module` using the previous `SReLU`, and train it on the german_credit_numeric dataset.
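
Before training on the real data, it can help to check on synthetic inputs that the activation parameters actually receive gradients and change after an optimizer step. The sketch below is self-contained and makes several assumptions: `TinySReLU` is a compact stand-in for the layer from Exercise 3 with an illustrative initialization, the data are random, and the layer sizes are arbitrary apart from the 24 input features.

```python
import torch

class TinySReLU(torch.nn.Module):
    # Compact stand-in for the SReLU layer (illustrative initialization).
    def __init__(self, units):
        super().__init__()
        self.t_l = torch.nn.Parameter(torch.zeros(units))
        self.a_l = torch.nn.Parameter(torch.full((units,), 0.1))
        self.t_r = torch.nn.Parameter(torch.ones(units))
        self.a_r = torch.nn.Parameter(torch.ones(units))

    def forward(self, x):
        right = self.t_r + self.a_r * (x - self.t_r)
        left = self.t_l + self.a_l * (x - self.t_l)
        return torch.where(x > self.t_r, right, torch.where(x < self.t_l, left, x))

torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(24, 10), TinySReLU(10),
    torch.nn.Linear(10, 1), torch.nn.Sigmoid(),
)
X = torch.randn(64, 24)                    # synthetic features
y = torch.randint(0, 2, (64, 1)).float()   # synthetic binary labels

optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
before = net[1].a_l.detach().clone()

loss = torch.nn.functional.binary_cross_entropy(net(X), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The slope of the left segment should have moved after one step.
print((net[1].a_l.detach() - before).abs().max())
```

If the printed value were exactly zero, the activation parameters would not be registered with the optimizer or would not be reached by the backward pass.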

In [None]:
'''
# Unused baseline with fixed LeakyReLU activations (disabled by the surrounding string).
class basicModel(torch.nn.Module):

  def __init__(self):
    super().__init__()
    self.lin1 = torch.nn.Linear(24, 10)
    self.lin2 = torch.nn.Linear(10, 10)
    self.lin3 = torch.nn.Linear(10, 5)
    self.out = torch.nn.Linear(5, 1)

  def forward(self, x):
    x = torch.nn.functional.leaky_relu(self.lin1(x))
    x = torch.nn.functional.leaky_relu(self.lin2(x))
    x = torch.nn.functional.leaky_relu(self.lin3(x))
    return torch.sigmoid(self.out(x))
'''

model = torch.nn.Sequential(OrderedDict([
          
          ('Linear1', torch.nn.Linear(24,10)),
          #('relu', torch.nn.ReLU()),
          ('SReLU1', SReLU(10)),
          
          ('Lin2', torch.nn.Linear(10, 10)),
          #('relu2', torch.nn.ReLU()),
          ('SReLU2', SReLU(10)),
          ('drop2', torch.nn.Dropout(p=0.5)),
          
          ('Linear3', torch.nn.Linear(10, 5)), 
          #('relu3', torch.nn.ReLU()),
          ('SReLU3', SReLU(5)),

          ('Linear4', torch.nn.Linear(5,1)),
          ('Sigmoid', torch.nn.Sigmoid())
        ]))
criterion = torch.nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [None]:
losses = []
epochs = 325

model.train()
for epoch in range(epochs):
  total_loss = 0.0
  for inputs, labels in train_loader:
    y_pred = model(inputs)
    # BCELoss expects predictions and targets of the same shape: (batch, 1)
    loss = criterion(y_pred, labels.unsqueeze(1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Accumulate the batch loss, weighted by the batch size
    total_loss += loss.item() * inputs.size(0)

  # Store the average training loss of the epoch
  losses.append(total_loss / len(train_loader.dataset))

model.eval()

In [None]:
plt.plot(range(epochs), losses, 'orange')
plt.title('Training loss')
plt.show()
print(f'last loss : {losses[-1]}')


In [None]:
def count_correct(loader):
  # Number of correctly classified samples, thresholding the sigmoid output at 0.5
  correct = 0
  for inputs, labels in loader:
    preds = (model(inputs).squeeze(1) >= 0.5).float()
    correct += (preds == labels).sum().item()
  return correct

model.eval()  # disable dropout during evaluation
with torch.no_grad():
  acc_test = count_correct(test_loader)
  acc_train = count_correct(train_loader)

print(f'test accuracy   : {acc_test*100/len(Xtest)}')
print(f'train accuracy  : {acc_train*100/len(Xtrain)}')


**Optionally**, you can plot the distribution (histogram) of the parameters after training.

In [None]:
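# Histogram of all trainable parameters of the model after training
params = torch.cat([p.detach().flatten() for p in model.parameters()])
plt.hist(params.numpy(), bins=30)
plt.show()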

## Final checklist
1. Carefully check all code. Insert comments when needed. Search for "TODO" to see if you forgot something.
2. Run everything one final time. *Please do not send us notebooks with errors or cells that are not working.*
3. Upload the completed notebook **before 30/11/2022 at 23:59** on the Google Classroom page.

### References

[1] Apicella, A., Donnarumma, F., Isgrò, F. and Prevete, R., 2021. [A survey on modern trainable activation functions](https://arxiv.org/abs/2005.00817).

[2] Jin, X., Xu, C., Feng, J., Wei, Y., Xiong, J. and Yan, S., 2016. [Deep learning with s-shaped rectified linear activation units](https://arxiv.org/abs/1512.07030). In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 30, No. 1).