# Debugging

Author: **Christian Lessig et al.**

`christian.lessig@ecmwf.int`

## Introduction

We often spend more time debugging code than writing it. This is particularly true for Python.

Debugging usually consists of three steps:

1. Localize the problem.
2. Understand what precisely goes wrong.
3. Fix the problem.

The third step is usually the easy one once the first two have been accomplished.
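As a small illustration of the first step, the sketch below uses the standard-library `traceback` module to find the function in which an exception was raised. The functions `scale` and `pipeline` are hypothetical and exist only for this example.

```python
import traceback


def scale(values, factor):
    # multiplies every value by factor; fails if factor is not a number
    out = []
    for v in values:
        out.append(v * factor)
    return out


def pipeline(values):
    # deliberate bug for the example: factor should be a number, not None
    return scale(values, None)


try:
    pipeline([1.0, 2.0])
except TypeError as err:
    # the last entry of the extracted traceback is the innermost frame,
    # i.e. the place where the exception was actually raised
    frames = traceback.extract_tb(err.__traceback__)
    innermost = frames[-1]
    print(f'{type(err).__name__} raised in {innermost.name}(), line {innermost.lineno}')
```

The innermost frame only says where the code failed. Here the actual mistake sits one level up, in `pipeline`, which is why localizing and understanding are separate steps.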

To understand 

In [8]:
import os
from importlib import reload
import code

import torch

In [14]:
import model
reload( model)

net = model.MLP( dim_in=512, dim_out=512)

# check if GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = net.to(device)


In [15]:
# test if we can evaluate the network

t_in = torch.rand( (16, 512)).to(device)
t_out = net( t_in)

> /etc/ecmwf/nfs/dh2_home_a/nacl/training/ml-training-course/1-model-debugging/model.py(38)forward()
     36 
     37     import pdb; pdb.set_trace()
---> 38     for layer in self.layers :
     39       x = layer( x)
     40 

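The output above is the prompt of the Python debugger, which stops at the `pdb.set_trace()` call inside `forward`. The following is a minimal sketch of the same idea with a breakpoint that can be switched off. The `MLP` class here is a hypothetical stand-in (the actual `model.py` is not shown), and the `debug` flag and `num_layers` parameter are assumptions made for this example.

```python
import torch


class MLP(torch.nn.Module):
    """Hypothetical stand-in for model.MLP; the real class is not shown here."""

    def __init__(self, dim_in, dim_out, num_layers=2, debug=False):
        super().__init__()
        self.debug = debug
        dims = [dim_in] * num_layers + [dim_out]
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(dims[i], dims[i + 1]) for i in range(num_layers)])

    def forward(self, x):
        if self.debug:
            # breakpoint() enters pdb by default, like `import pdb; pdb.set_trace()`
            breakpoint()
        for layer in self.layers:
            x = layer(x)
        return x


# with debug=False the forward pass runs through without stopping
net = MLP(dim_in=512, dim_out=512)
t_out = net(torch.rand((16, 512)))
print(t_out.shape)
```

Once inside the debugger, `p x.shape` prints the shape of the input, `n` executes the next line, `c` continues execution and `q` quits.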

In [5]:
# test if data loading works

dataset = CustomDataSet( len=32768, batch_size=128, dim_data=512)
lossfct = torch.nn.MSE()

data_iter = iter(dataset)
(source, target) = next(data_iter)

pred = net( source)
loss = lossfct( pred, target)

AttributeError: module 'torch.nn' has no attribute 'MSE'
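The error message already localizes the problem: `torch.nn` has no class called `MSE`. The mean squared error loss in PyTorch is `torch.nn.MSELoss`. A minimal sketch of how one might find the correct name and check the loss on toy tensors (the tensors are made up for the example):

```python
import torch

# list the names in torch.nn that mention MSE to find the actual class name
candidates = [name for name in dir(torch.nn) if 'MSE' in name]
print(candidates)

lossfct = torch.nn.MSELoss()

# toy check: prediction all zeros, target all ones
pred = torch.zeros((4, 8))
target = torch.ones((4, 8))
loss = lossfct(pred, target)
print(loss.item())  # 1.0, the mean of (0 - 1)^2 over all elements
```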

In [None]:
# training loop

optimizer = torch.optim.AdamW( net, lr=0.00005)

# parallel data loader
loader_params = { 'batch_size': None, 'batch_sampler': None, 'shuffle': False, 
                   'num_workers': 8, 'pin_memory': True}
dataloader = torch.utils.data.DataLoader( dataset, **loader_params, sampler = None)

num_epochs = 8
batches_per_epoch = 128

# data_iter = iter( dataset)
data_iter = iter( dataloader)

optimizer.zero_grad()
for epoch in range( num_epochs) :
  for bidx in range(batches_per_epoch) :

    (source, target) = next(data_iter)

    pred = net( source)
    loss = lossfct( pred, target)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

  print( f'Finished epoch={epoch} with loss={loss}.')
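For comparison, here is a hedged sketch of a training loop that runs end to end. It differs from the cell above in three places that look like likely sources of errors: the optimizer receives `net.parameters()` rather than the module, each batch is moved to the same device as the network, and the loader is iterated directly so that every epoch starts a fresh pass instead of running into `StopIteration`. `RandomBatches` is a hypothetical stand-in for `CustomDataSet`, whose implementation is not shown, and a single linear layer replaces the MLP.

```python
import torch


class RandomBatches(torch.utils.data.IterableDataset):
    """Hypothetical stand-in for CustomDataSet: yields ready-made (source, target) batches."""

    def __init__(self, num_batches, batch_size, dim_data):
        super().__init__()
        self.num_batches = num_batches
        self.batch_size = batch_size
        self.dim_data = dim_data

    def __iter__(self):
        for _ in range(self.num_batches):
            source = torch.rand((self.batch_size, self.dim_data))
            yield source, 2.0 * source  # toy target: a linear map of the input


device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = torch.nn.Linear(32, 32).to(device)
lossfct = torch.nn.MSELoss()
# the optimizer takes the parameters of the network, not the module itself
optimizer = torch.optim.AdamW(net.parameters(), lr=1e-3)

dataset = RandomBatches(num_batches=16, batch_size=8, dim_data=32)
# batch_size=None disables automatic batching, since the dataset already yields batches
dataloader = torch.utils.data.DataLoader(dataset, batch_size=None, num_workers=0)

num_epochs = 2
losses = []
for epoch in range(num_epochs):
    # iterating the loader directly starts a fresh pass in every epoch
    for source, target in dataloader:
        # data and network have to live on the same device
        source, target = source.to(device), target.to(device)
        optimizer.zero_grad()
        loss = lossfct(net(source), target)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    print(f'Finished epoch={epoch} with loss={losses[-1]:.4f}.')
```

The sketch uses `num_workers=0` so that it runs anywhere. With several workers and an iterable dataset, each worker gets its own copy of the dataset, so the dataset has to take care of splitting the work between them.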

## What to do if my training doesn't work?

- Try to overfit!
  - The loss function needs to be meaningful in the first place.