# EE-559: Practical Session 5

## Introduction

The objective of this session is to illustrate on a 2D synthetic toy data-set how poorly a naive weight initialization procedure performs when a network has multiple layers of different sizes.

In [1]:
import torch
import math

from torch import optim
from torch import Tensor
from torch import nn

## Toy data-set

Write a function

1. generate_disc_set(nb)

that returns a pair torch.Tensor, torch.LongTensor of dimensions respectively nb × 2 and nb, corresponding to the input and target of a toy data-set where the input is uniformly distributed in [-1, 1] × [-1, 1] and the label is 1 inside the disc of radius sqrt(2 / pi) centered at the origin, and 0 outside.

Create a train set and a test set of 1,000 samples each, and normalize their mean and variance to 0 and 1. A simple sanity check is to ensure that the two classes are balanced: a disc of squared radius 2 / pi has area 2, which is half the area of the square.

**Hint**: My version of generate_disc_set is 172 characters.

In [2]:
def generate_disc_set(nb):
    # Inputs uniformly distributed in [-1, 1] x [-1, 1]
    input = torch.empty(nb, 2).uniform_(-1, 1)
    # Label 1 inside the disc of squared radius 2 / pi, 0 outside
    target = (input.pow(2).sum(1) < 2 / math.pi).long()
    return input, target

train_input, train_target = generate_disc_set(1000)
test_input, test_target = generate_disc_set(1000)

mean, std = train_input.mean(), train_input.std()

train_input.sub_(mean).div_(std)
test_input.sub_(mean).div_(std)

mini_batch_size = 100
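
The sanity check mentioned above can be sketched as follows. This is a minimal, self-contained version rather than part of the original solution: it redefines the generator with the labelling convention of the statement (1 inside the disc), and the fixed seed is an assumption added only to make the run reproducible.

```python
import math
import torch

torch.manual_seed(0)  # assumption: fixed seed, only for reproducibility

def generate_disc_set(nb):
    # Inputs uniform in [-1, 1] x [-1, 1]; label 1 inside the disc of squared radius 2 / pi
    input = torch.empty(nb, 2).uniform_(-1, 1)
    target = (input.pow(2).sum(1) < 2 / math.pi).long()
    return input, target

train_input, train_target = generate_disc_set(1000)

# The disc has area pi * (2 / pi) = 2, half of the square's area of 4,
# so roughly half of the labels should be 1.
fraction_positive = train_target.float().mean().item()

# Normalize with the global mean and standard deviation of the train set
mean, std = train_input.mean(), train_input.std()
train_input.sub_(mean).div_(std)

print('fraction of label 1: {:.3f}'.format(fraction_positive))
print('mean {:.3f} std {:.3f}'.format(train_input.mean().item(), train_input.std().item()))
```

A fraction far from 0.5 would point to a wrong radius or a wrong threshold on the squared norm.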

## Training and test

Write functions

1. train_model(model, train_input, train_target)
2. compute_nb_errors(model, data_input, data_target)

The first should train the model with cross-entropy and 250 epochs of standard SGD with a learning rate η = 0.1, and mini-batches of size 100.

The second should also use mini-batches, and return an integer.

**Hint**: My versions of train_model and compute_nb_errors are respectively 512 and 457 characters.

In [3]:
def train_model(model, train_input, train_target):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr = 1e-1)
    nb_epochs = 250

    for e in range(nb_epochs):
        for b in range(0, train_input.size(0), mini_batch_size):
            output = model(train_input.narrow(0, b, mini_batch_size))
            loss = criterion(output, train_target.narrow(0, b, mini_batch_size))
            model.zero_grad()
            loss.backward()
            optimizer.step()

In [4]:
def compute_nb_errors(model, data_input, data_target):

    nb_data_errors = 0

    for b in range(0, data_input.size(0), mini_batch_size):
        output = model(data_input.narrow(0, b, mini_batch_size))
        _, predicted_classes = torch.max(output, 1)
        for k in range(mini_batch_size):
            if data_target[b + k] != predicted_classes[k]:
                nb_data_errors = nb_data_errors + 1

    return nb_data_errors
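
The per-sample loop above can also be written with tensor operations. The sketch below is one possible variant, not the version the hint refers to; the function name compute_nb_errors_vectorized and the mini_batch_size argument are illustrative choices. It uses Tensor.split, which also handles a last mini-batch that is smaller than the others, and torch.no_grad since no gradient is needed for evaluation.

```python
import torch
from torch import nn

def compute_nb_errors_vectorized(model, data_input, data_target, mini_batch_size=100):
    nb_data_errors = 0
    with torch.no_grad():
        batches = zip(data_input.split(mini_batch_size), data_target.split(mini_batch_size))
        for input, target in batches:
            # Predicted class = index of the largest output for each sample
            predicted_classes = model(input).argmax(1)
            nb_data_errors += (predicted_classes != target).sum().item()
    return nb_data_errors

# Tiny check with a linear model whose weights are set by hand:
# with identity weights and zero bias, the predicted class is the larger coordinate.
model = nn.Linear(2, 2)
with torch.no_grad():
    model.weight.copy_(torch.eye(2))
    model.bias.zero_()

data_input = torch.tensor([[1., 0.], [0., 1.], [2., 1.], [0., 3.], [1., 0.]])
data_target = torch.tensor([0, 1, 1, 1, 0])

# Only the third sample is misclassified (predicted 0, target 1)
nb_errors = compute_nb_errors_vectorized(model, data_input, data_target, mini_batch_size=2)
print(nb_errors)
```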

## Models

Write

1. create_shallow_model()

that returns an MLP with 2 input units, a single hidden layer of size 128, and 2 output units, and

2. create_deep_model()

that returns an MLP with 2 input units, hidden layers of sizes respectively 4, 8, 16, 32, 64, 128, and 2 output units.

**Hint**: You can use the nn.Sequential container to make things simpler. My versions of these two functions are respectively 132 and 355 characters long.

In [5]:
def create_shallow_model():
    return nn.Sequential(
        nn.Linear(2, 128),
        nn.ReLU(),
        nn.Linear(128, 2)
    )

def create_deep_model():
    return nn.Sequential(
        nn.Linear(2, 4),
        nn.ReLU(),
        nn.Linear(4, 8),
        nn.ReLU(),
        nn.Linear(8, 16),
        nn.ReLU(),
        nn.Linear(16, 32),
        nn.ReLU(),
        nn.Linear(32, 64),
        nn.ReLU(),
        nn.Linear(64, 128),
        nn.ReLU(),
        nn.Linear(128, 2)
    )
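
Since both models follow the same Linear / ReLU pattern, they can also be built from a list of layer sizes. The helper create_mlp below is hypothetical, introduced only for illustration; it is a sketch of an alternative, not the construction used in the rest of this session.

```python
from torch import nn

def create_mlp(sizes):
    # sizes lists the number of units of every layer, input and output included
    layers = []
    for nb_in, nb_out in zip(sizes[:-1], sizes[1:]):
        layers.append(nn.Linear(nb_in, nb_out))
        layers.append(nn.ReLU())
    layers.pop()  # no ReLU after the output layer
    return nn.Sequential(*layers)

shallow_model = create_mlp([2, 128, 2])
deep_model = create_mlp([2, 4, 8, 16, 32, 64, 128, 2])

print(len(shallow_model), len(deep_model))
```

The deep model built this way has seven linear layers and six ReLU, the same structure as create_deep_model.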

## Benchmarking

Compute and print the train and test errors of these two models when they are initialized either with the default PyTorch rule, or with a normal distribution of standard deviation 10^-3, 10^-2, 10^-1, 1, and 10.

The error rate with the shallow network for any initialization should be around 1.5%. It should be around 3% with the deep network using the default rule, and around 50% most of the time with the other initializations.

**Hint**: My version is 562 characters long.

In [6]:
# std = -1 is a sentinel: keep the default PyTorch initialization (printed as -1.000000)
for std in [ -1, 1e-3, 1e-2, 1e-1, 1e-0, 1e1 ]:

    for m in [ create_shallow_model, create_deep_model ]:

        model = m()

        if std > 0:
            with torch.no_grad():
                for p in model.parameters(): p.normal_(0, std)

        train_model(model, train_input, train_target)

        print('std {:s} {:f} train_error {:.02f}% test_error {:.02f}%'.format(
            m.__name__,
            std,
            compute_nb_errors(model, train_input, train_target) / train_input.size(0) * 100,
            compute_nb_errors(model, test_input, test_target) / test_input.size(0) * 100
        )
        )

std create_shallow_model -1.000000 train_error 0.40% test_error 1.10%
std create_deep_model -1.000000 train_error 3.60% test_error 3.40%
std create_shallow_model 0.001000 train_error 1.60% test_error 2.00%
std create_deep_model 0.001000 train_error 49.60% test_error 50.10%
std create_shallow_model 0.010000 train_error 1.20% test_error 1.70%
std create_deep_model 0.010000 train_error 49.60% test_error 50.10%
std create_shallow_model 0.100000 train_error 0.80% test_error 1.90%
std create_deep_model 0.100000 train_error 49.60% test_error 50.10%
std create_shallow_model 1.000000 train_error 0.60% test_error 1.00%
std create_deep_model 1.000000 train_error 50.40% test_error 49.90%
std create_shallow_model 10.000000 train_error 0.00% test_error 1.40%
std create_deep_model 10.000000 train_error 50.40% test_error 49.90%
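
One way to look into the 50% error rates of the deep network is to measure the scale of its outputs right after initialization, before any training. The sketch below is not part of the original session: it rebuilds the deep model, uses stand-in inputs drawn like the toy data-set, and reports how much each output varies across samples for every standard deviation. The fixed seed and the name spreads are assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)  # assumption: fixed seed, only for reproducibility

def create_deep_model():
    sizes = [2, 4, 8, 16, 32, 64, 128]
    layers = []
    for nb_in, nb_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(nb_in, nb_out), nn.ReLU()]
    layers.append(nn.Linear(128, 2))
    return nn.Sequential(*layers)

# Stand-in inputs: uniform in [-1, 1] x [-1, 1], normalized like the train set
x = torch.empty(1000, 2).uniform_(-1, 1)
x = (x - x.mean()) / x.std()

spreads = {}
for std in [1e-3, 1e-2, 1e-1, 1e-0, 1e1]:
    model = create_deep_model()
    with torch.no_grad():
        for p in model.parameters():
            p.normal_(0, std)
        output = model(x)
    # Standard deviation of each output across the samples, averaged over the two outputs
    spreads[std] = output.std(0).mean().item()
    print('std {:f} output spread {:.3e}'.format(std, spreads[std]))
```

With a small standard deviation the signal shrinks at every layer, so the outputs barely depend on the input; with a large one it grows at every layer. Both situations are consistent with the deep network staying at chance level, whereas the shallow network has a single hidden layer and is much less sensitive to this effect.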
