# Debugging

Author: **Christian Lessig et al.**

`christian.lessig@ecmwf.int`

## Introduction

We often spend more time debugging code than writing it. This is particularly true for Python.

Debugging usually consists of three steps:

1. Localize the problem.
2. Understand what precisely goes wrong.
3. Fix the problem.

The third step is usually the easy one once the first two have been accomplished.

To localize the problem and understand the issue, it is often important to understand the software stack that runs your code. Many error messages originate somewhere in the stack and not directly in the user code.

<img src="ml_stack.png" width="400px" >

In simple cases when execution breaks, localizing the problem means parsing the error message and mapping it to the code and the call stack. The problem might very well originate elsewhere, but the place where the code breaks is your entry point for localizing and understanding the root cause.
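As a minimal sketch of mapping an error to the call stack, the snippet below uses the standard-library `traceback` module to list the frames that led to an exception. The functions `load_sample` and `run_step` are hypothetical stand-ins for user code and are not part of the model or dataset used later.

```python
import traceback


def load_sample(index):
    # hypothetical user code: fails for indices outside the toy data
    data = [0.1, 0.2, 0.3]
    return data[index]


def run_step(index):
    # the caller does not raise the error itself, but it is on the call stack
    return load_sample(index) * 2.0


def summarize_exception(exc):
    """Return (function name, line number) per frame, outermost call first."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [(frame.name, frame.lineno) for frame in frames]


try:
    run_step(7)
except IndexError as exc:
    call_chain = summarize_exception(exc)
    for name, lineno in call_chain:
        print(f'{name} (line {lineno})')
    print(f'error: {type(exc).__name__}: {exc}')
```

The last entry is where the exception was raised; the entries before it show how execution got there, which is often where the root cause sits.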

Once an entry point into the problem has been found, one can investigate what goes wrong. This almost always means setting a breakpoint before the offending line and inspecting the state of the program and the code. Simple typos might not require this, but in all other circumstances it is easier to use a breakpoint. In Python one can break with:

```
import pdb; pdb.set_trace()
```

This opens a debugger shell at the line following the statement. Alternatively, one can use:

```
import code
code.interact( local=locals())
```

This opens an interactive Python shell at the calling line but does not provide the functionality of a debugger (e.g. a stack trace). However, it can be useful for quick inspection or code development.
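When the offending line runs many times, it can help to break only in the suspicious case. The following is a small sketch of such a guarded breakpoint; the function `normalize` and its `debug` flag are hypothetical and only serve as an illustration.

```python
import pdb


def normalize(values, debug=False):
    """Scale values so that they sum to one."""
    total = sum(values)
    if debug and total == 0:
        # stop only in the suspicious case; in the pdb shell,
        # `p values` prints a variable, `w` shows the stack, `c` continues
        pdb.set_trace()
    return [v / total for v in values]


# regular call, the breakpoint is never reached
print(normalize([1.0, 1.0, 2.0]))
```

Calling `normalize([0.0, 0.0], debug=True)` would drop into the debugger right before the division fails, with `values` and `total` available for inspection.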

A common cause of bugs is that an assumption about the input or output data is violated. This can be the shape of a tensor (easy), unexpected values (difficult), or something more subtle (very difficult). In the interactive debugger shell you can investigate which of these applies.
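Such assumptions can also be checked explicitly. The sketch below tests the two simpler cases named above, shape and unexpected values, on a random tensor. The helper `check_batch` is hypothetical and not part of the tutorial code, and the expected size of 512 is chosen only to match the dimensions used in the cells that follow.

```python
import torch


def check_batch(t, expected_last_dim):
    """Return a list of violated assumptions (empty if none)."""
    problems = []
    if t.shape[-1] != expected_last_dim:
        problems.append(f'last dim is {t.shape[-1]}, expected {expected_last_dim}')
    if not torch.isfinite(t).all():
        problems.append('tensor contains NaN or Inf')
    return problems


good = torch.rand((16, 4, 512))

# same data with a single corrupted entry
bad = good.clone()
bad[0, 0, 0] = float('nan')

print(check_batch(good, 512))             # no violations
print(check_batch(bad, 512))              # value check fails
print(check_batch(good[..., :256], 512))  # shape check fails
```

The same expressions can be typed directly into the debugger shell at a breakpoint instead of being wrapped in a function.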

In [9]:
import os
from importlib import reload
import code

import torch

In [2]:
import model
reload( model)
from model import MLP

net = MLP( dim_in=512, dim_out=512)

# check if GPU is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
net = net.to(device)




In [3]:
# test if we can evaluate the network

t_in = torch.rand( (16, 4, 512)).to(device)
t_out = net( t_in)

In [6]:
# test if data loading works

import dataset
reload( dataset)
from dataset import CustomDataSet

custom_dataset = CustomDataSet( len=32768, batch_size=128, dim_data=512)
data_iter = iter(custom_dataset)

lossfct = torch.nn.MSELoss()

# load sample
(source, target) = next(data_iter)
source, target = source.to(device), target.to(device)

# evaluate network
pred = net( source)

# compute loss
loss = lossfct( pred, target)

print( f'loss : {loss}')

loss : 1.0001879930496216


In [7]:
# training loop

optimizer = torch.optim.AdamW( net.parameters(), lr=0.00005)

# parallel data loader
loader_params = { 'batch_size': None, 'batch_sampler': None, 'shuffle': False, 
                   'num_workers': 8, 'pin_memory': True}
dataloader = torch.utils.data.DataLoader( custom_dataset, **loader_params, sampler = None)

num_epochs = 8

optimizer.zero_grad()
for epoch in range( num_epochs) :

  # data_iter = iter( custom_dataset)
  data_iter = iter( dataloader)
  
  for bidx, (source, target) in enumerate(data_iter) :

    source, target = source.to(device), target.to(device)
    
    pred = net( source)
    loss = lossfct( pred, target)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

  print( f'Finished epoch={epoch} with loss={loss}.')

Finished epoch=0 with loss=2.2095817257650197e-05.
Finished epoch=1 with loss=1.923728398800506e-10.
Finished epoch=2 with loss=1.1075946410032955e-10.
Finished epoch=3 with loss=7.202569096698141e-11.
Finished epoch=4 with loss=4.95193538951888e-11.
Finished epoch=5 with loss=3.534422060580411e-11.
Finished epoch=6 with loss=2.5699102568221832e-11.
Finished epoch=7 with loss=1.912282832083889e-11.


In [8]:
idx = torch.arange( 512)
loss = lossfct( source[idx], target[idx])

IndexError: index 128 is out of bounds for dimension 0 with size 128
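
The message already narrows things down: the index runs up to 511, while dimension 0 of the indexed tensor has only 128 entries. A minimal sketch of how this could be confirmed in the debugger shell is given below. It uses a random stand-in for `source`; the shape `(128, 512)` is an assumption based on `batch_size=128` and `dim_data=512` in the dataset construction.

```python
import torch

# stand-in for one batch; the shape is an assumption, see the text above
source = torch.rand((128, 512))
idx = torch.arange(512)

# compare what the index asks for with what the tensor provides
print(f'source.shape = {tuple(source.shape)}')
print(f'idx range    = [{idx.min().item()}, {idx.max().item()}]')

# entries of idx that are valid along dimension 0
valid = idx < source.shape[0]
print(f'valid indices: {int(valid.sum())} of {idx.numel()}')

# reproduce the failure in isolation (on the CPU)
try:
    source[idx]
except IndexError as exc:
    print(f'IndexError: {exc}')
```

Which fix is right depends on what the index was meant to select.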