# Tutorial: Deploying Regularizers with Giotto-deep

**Author: Henry Kirveslahti**

In this tutorial we discuss the technical details for implementing regularizers and their use in *giotto-deep*. For a less technical introduction to regularization, please refer to the notebook *Basic Tutorial: Regularization with Giotto-deep*.

The notebook is organized as follows:

1. Example of a custom regularizer
2. Hyper-parameter tuning
3. Ad hoc regularizers

In [None]:
import numpy as np
import torch
import torch.nn as nn
from torch.optim import SGD, Adam, RMSprop
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from gdeep.trainer import Trainer
from gdeep.trainer.regularizer import Regularizer
from gdeep.trainer.regularizer import TihonovRegularizer
from gdeep.search import GiottoSummaryWriter
from gdeep.models import ModelExtractor
from gdeep.utility import DEVICE
from gdeep.search import HyperParameterOptimization
from gdeep.models import FFNet
writer = GiottoSummaryWriter()

## 1. Custom Regularizers
*Giotto-deep* already has built-in support for $p$-norm regularization, but the framework also allows for defining custom regularizers. Below we define the elastic net. It is similar to the existing $p$-norm regularizer, but the penalty term reads

$$
p_i = \lambda_1 \sum \big( ||\beta||_1 \big) + \lambda_2 \sum \big( ||\beta||_2^2 \big)
$$

Typically, a regularizer has just one penalty coefficient $\lambda$. The elastic net has two of these, so we need to override the default behavior by defining the `__init__` method.

In [None]:
class ElasticNet(Regularizer):
    def __init__(self, lambda1, lambda2):
        # Two penalty coefficients instead of the single default one,
        # so we store both ourselves.
        self.lambda1 = lambda1
        self.lambda2 = lambda2

    def regularization_penalty(self, model):
        """
        The penalty is a combination of the L1 norm and the squared L2 norm.
        """
        total = torch.tensor(0, dtype=float)
        for parameter in model.parameters():
            total = total + self.lambda1 * torch.norm(parameter, 1) \
                  + self.lambda2 * torch.norm(parameter, 2)**2
        return total

This is a simple regularizer, much in the spirit of the $p$-norm regularizers, in that it requires neither preprocessing nor parameter updates.
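
To check the penalty formula by hand, the following minimal sketch recomputes it in plain Python, without *giotto-deep* or PyTorch. The function name `elastic_net_penalty` and the toy weights are illustrative assumptions and are not part of the library; each inner list plays the role of one parameter tensor of the model.

```python
def elastic_net_penalty(parameters, lambda1, lambda2):
    """Sum lambda1 * ||w||_1 + lambda2 * ||w||_2^2 over all parameter groups."""
    total = 0.0
    for weights in parameters:
        l1_norm = sum(abs(w) for w in weights)
        l2_norm_squared = sum(w * w for w in weights)
        total += lambda1 * l1_norm + lambda2 * l2_norm_squared
    return total


# Hypothetical weights: one group with ||w||_1 = 7 and ||w||_2^2 = 25.
toy_parameters = [[3.0, -4.0]]
penalty = elastic_net_penalty(toy_parameters, lambda1=0.1, lambda2=0.01)
print(penalty)  # approximately 0.1 * 7 + 0.01 * 25 = 0.95
```

The loop has the same shape as the one in `regularization_penalty`: every parameter group contributes its own $L_1$ term and squared $L_2$ term to the total.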

## 2. Hyper-parameter tuning
An important aspect of regularization is hyper-parameter tuning. To this end, we can use the `HyperParameterOptimization` class (HPO). Let us first revisit the example from the last notebook: we saw how a value of $\lambda$ of about 0.2 boosted the regression coefficient $\alpha_1$ while eliminating the other coefficient $\alpha_2$ that had a higher signal-to-noise ratio. Let us see which value of $\lambda$ gives us the best performance when we predict on the validation set.

To recap, we first rerun the models from last time on a smaller dataset (this does not take long):

In [None]:
rng = np.random.default_rng()
S=100
z0=rng.standard_normal(S)
z1=0.9*z0+0.1*rng.standard_normal(S)
z2=0.85*z0+0.15*rng.standard_normal(S)
y=z0+rng.standard_normal(S)
X=np.stack([z1,z2],1)
y=y.reshape(-1,1)
y=y.astype(float)
X=X.astype(float)

In [None]:
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.1)
tensor_x_t = torch.Tensor(train_x)
tensor_x_t=tensor_x_t.float()
tensor_y_t = torch.from_numpy(train_y)
tensor_y_t=tensor_y_t.float()
tensor_x_v = torch.Tensor(val_x)
tensor_y_v = torch.from_numpy(val_y).float()  # float32, like the training targets
train_dataset = TensorDataset(tensor_x_t,tensor_y_t)
dl_tr = DataLoader(train_dataset,batch_size=10)
val_dataset = TensorDataset(tensor_x_v,tensor_y_v)
dl_val = DataLoader(val_dataset,batch_size=10)

class Net(nn.Module):
    def __init__(self,featdim='2'):
        super(Net, self).__init__() 
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(int(featdim), 1, bias=False),  # featdim is passed as a string
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

In [None]:
network=Net('2')

In [None]:
def l2_norm(prediction, y):
    return torch.norm(prediction - y, p=2).to(DEVICE)

In [None]:
loss_fn = nn.MSELoss()
pipe = Trainer(network, (dl_tr, dl_val), loss_fn, writer,l2_norm)
pipe.train(SGD, 20, False, {"lr": 0.1})

In [None]:
pipe2 = Trainer(network, (dl_tr, dl_val), loss_fn, writer,l2_norm,regularizer=TihonovRegularizer(0.2,p=1))

In [None]:
pipe2.train(SGD, 20, False, {"lr": 0.01})

### The optimization - LASSO
Next we run 100 HPO trials to find the best value of $\lambda$ for the LASSO in the range $[0.05,0.5]$ with step size $0.01$. We specify the regularization parameters by putting the regularizer, together with its parameters, in a dictionary; a three-element list such as `[0.05,0.5,0.01]` gives the lower bound, the upper bound and the step size. For details on HPO, please see the HPO tutorial.
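
To get a feeling for the size of this search space, the sketch below enumerates the candidate values implied by a range and a step size. It is plain Python and only assumes that the three numbers are read as lower bound, upper bound and step, as described in the text; the helper `candidate_values` is hypothetical, and how the HPO samples from the range internally is not shown here.

```python
def candidate_values(low, high, step):
    """List low, low + step, ... up to (at most) high, rounded to avoid float drift."""
    n_steps = int((high - low) / step + 1e-9)
    return [round(low + i * step, 10) for i in range(n_steps + 1)]


lasso_grid = candidate_values(0.05, 0.5, 0.01)
print(len(lasso_grid), lasso_grid[0], lasso_grid[-1])  # 46 candidates from 0.05 to 0.5
```

Under this reading there are 46 distinct values of $\lambda$, so 100 trials restricted to this grid would necessarily revisit some of them.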

In [None]:
search = HyperParameterOptimization(pipe, "accuracy", 100, best_not_last=True)
search.regularize=True
search.store_pickle = True
reg=TihonovRegularizer
optimizers_params = {"lr": [0.01]}
dataloaders_params = {}
models_hyperparams = {}
regularization_params={'regularizer':[reg],'lamda':[0.05,0.5,0.01],'p':[1]}

In [None]:
# starting the HPO
search.start(
    [SGD],
    30,
    False,
    optimizers_params,
    dataloaders_params,
    models_hyperparams,
    regularization_params=regularization_params,
    n_accumulated_grads=0,
)

### Optimization - Custom regularizer
Next we do the same thing for our custom regularizer that we defined above.

In [None]:
search = HyperParameterOptimization(pipe, "accuracy", 20, best_not_last=True)
search.regularize=True
search.store_pickle = True
reg=ElasticNet
optimizers_params = {"lr": [0.01]}
dataloaders_params = {}
models_hyperparams = {}
regularization_params={'regularizer':[reg], 'lambda1':[0.15,0.85,0.01],'lambda2':[0.0001,0.1,0.01]}

In [None]:
search.start(
    [SGD],
    30,
    False,
    optimizers_params,
    dataloaders_params,
    models_hyperparams,
    regularization_params=regularization_params,
    n_accumulated_grads=0,
)

## 3. Ad hoc regularizers

The penalties in the regularizers that we have seen so far have been straightforward functions of the model parameters. Here we show an example of an ad hoc regularizer that directly penalizes the behavior of the model. Such regularizers may depend on external parameters, and the logic is completely wrapped in the regularizer object.

First we generate some data:

In [None]:
S=1000
X=np.linspace(0,2*np.pi,S)
y=3*np.sin(X)+0.5*rng.standard_normal(S)
plt.plot(X,y)

In [None]:
train_x, val_x, train_y, val_y = train_test_split(X, y, test_size=0.1)
tensor_x_t = torch.Tensor(train_x).reshape(-1, 1)
tensor_x_t=tensor_x_t.float()
tensor_y_t = torch.from_numpy(train_y).reshape(-1, 1)
tensor_y_t=tensor_y_t.float()
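# reshape and cast the validation tensors exactly like the training tensors
tensor_x_v = torch.Tensor(val_x).reshape(-1, 1).float()
tensor_y_v = torch.from_numpy(val_y).reshape(-1, 1).float()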
train_dataset = TensorDataset(tensor_x_t,tensor_y_t)
dl_tr = DataLoader(train_dataset,batch_size=10)
val_dataset = TensorDataset(tensor_x_v,tensor_y_v)
dl_val = DataLoader(val_dataset,batch_size=10)

class model1(nn.Module):
    def __init__(self):
        super(model1, self).__init__()
        self.seqmodel = FFNet(arch=[1, 10,10,10, 1])

    def forward(self, x):
        return self.seqmodel(x)


model = model1()

We fit a basic unregularized model:

In [None]:
loss_fn = nn.MSELoss()
pipe = Trainer(model, (dl_tr, dl_val), loss_fn, writer,l2_norm)
pipe.train(SGD, 200, False, {"lr": 0.1})

In [None]:
resp=pipe.model(dl_tr.dataset.tensors[0])
X_t=dl_tr.dataset.tensors[0].detach().numpy().reshape(-1)
y_t=resp.detach().numpy().reshape(-1)

Let us take a look at the graph of the model we defined:

In [None]:
ind=np.argsort(X_t)
plt.plot(X_t[ind],y_t[ind])

Next we define an ad hoc regularizer that penalizes the function for attaining values higher than 2. We do this in a deliberately crude way to demonstrate the regularization logic. In general, this kind of restriction could also be imposed effectively by a suitable model architecture.

Our regularizer evaluates the model on a grid, which is defined in the `preprocess` step. The penalty consists of evaluating the model on this grid and then picking out the points where the model exceeds 2. The penalty is the sum of the squared model values at these points.

In [None]:
class CapReg(Regularizer):
    def preprocess(self):
        """
        X: grid of points in ascending order
        """
        self.X=torch.linspace(0,2*torch.pi,1000)
        return True

    def regularization_penalty(self, model):
        """
        We penalize the squared values of the function at the grid points where it exceeds 2.
        """
        res=model(self.X.reshape(-1,1)).reshape(-1)
        # select the grid points where the model output exceeds the cap
        inds1=torch.where(res>2)
        X1=self.X[inds1]
        # re-evaluate the model on those points and sum the squared values
        res1=model(X1.reshape(-1,1)).reshape(-1)
        return torch.sum(res1**2)
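
The same penalty logic can be tried on known functions without training anything. The sketch below is plain Python and mirrors the grid of 1000 points on $[0, 2\pi]$; the helper `cap_penalty` and the two test functions are illustrative assumptions, not part of *giotto-deep*.

```python
import math


def cap_penalty(f, grid, cap=2.0):
    """Sum of the squared values of f over the grid points where f exceeds cap."""
    values = [f(x) for x in grid]
    return sum(v * v for v in values if v > cap)


n_points = 1000
grid = [2 * math.pi * i / (n_points - 1) for i in range(n_points)]

print(cap_penalty(lambda x: 1.5 * math.sin(x), grid))      # 0: never exceeds 2
print(cap_penalty(lambda x: 3.0 * math.sin(x), grid) > 0)  # True: peaks at 3
```

A function that stays below the cap is not penalized at all, while one that crosses it is penalized only through the grid points above the cap.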

In [None]:
reg=CapReg(lamda=1/(2*S))

In [None]:
pipe2 = Trainer(model, (dl_tr, dl_val), loss_fn, writer,l2_norm,regularizer=reg)


The regularization penalty that we defined is computed for every batch in our dataset, so it contributes to the gradient update at every batch. A reasonable choice for the penalty coefficient $\lambda$ should therefore be inversely proportional to the number of batches.
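
The arithmetic behind this scaling can be sketched as follows. The numbers come from the data above (900 training points after the split, batch size 10, $\lambda = 1/(2S)$ with $S = 1000$); the helper `lamda_for_budget` and the idea of a fixed per-epoch "budget" are illustrative assumptions.

```python
import math

n_train = 900      # 90% of S = 1000 points after the train/validation split
batch_size = 10
n_batches = math.ceil(n_train / batch_size)

lamda = 1 / (2 * 1000)  # the value passed to CapReg above, 1 / (2 * S)

# The penalty enters the loss once per batch, so over one epoch it is
# weighted n_batches * lamda in total.
weight_per_epoch = n_batches * lamda
print(n_batches, weight_per_epoch)  # 90 batches, total weight about 0.045


def lamda_for_budget(per_epoch_weight, n_batches):
    """Hypothetical helper: keep the per-epoch weight fixed as n_batches changes."""
    return per_epoch_weight / n_batches


# Doubling the number of batches halves the coefficient.
print(lamda_for_budget(weight_per_epoch, 2 * n_batches))
```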

In [None]:
pipe2.train(SGD, 10, False, {"lr": 0.1})

Let us now take a look at the graphs of the two models:

In [None]:
resp=pipe2.model(dl_tr.dataset.tensors[0])
X_t2=dl_tr.dataset.tensors[0].detach().numpy().reshape(-1)
y_t2=resp.detach().numpy().reshape(-1)
ind2=np.argsort(X_t2)

In [None]:
plt.plot(X_t[ind],y_t[ind], label = 'unregularized')
plt.plot(X_t2[ind2],y_t2[ind2], label = 'regularized')
plt.legend()
plt.show()

We see that the regularizer does penalize the graph for taking values greater than 2. This type of regularization also has side effects on the rest of the graph, because the model is quite simple and there is no straightforward connection between the function value being higher than 2 and the network weights. This could be improved by tailoring the model architecture, possibly in conjunction with a suitable regularizer.