<a href="https://colab.research.google.com/github/dkgithub/wiehl24/blob/main/skorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Working with PyTorch can become involved. Many tools try to spare you from writing out all the little details.
The most commonly used is Lightning. Here, we use skorch. It provides a Keras-like interface that interacts smoothly with sklearn.

In [3]:
#!rm -rf helpers # if an enforce reinstall is necessary
![ ! -d helpers ] && git clone --recursive https://github.com/dkgithub/erum_ml_school_helpers helpers
!pip install wget



In [4]:
!pip install wget torchinfo skorch livelossplot



In [52]:
# load the helpers package and other software
import helpers as hlp
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import torch
import torchinfo
import skorch as sk
from livelossplot import PlotLosses

In [6]:
#check for accelerators
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('torch',torch.__version__)
print('device type is',device)
# compare the device type; a torch.device does not reliably equal a plain string
if device.type == 'cuda': print(torch.cuda.get_device_name())
from os import environ
if "COLAB_TPU_ADDR" in environ and environ["COLAB_TPU_ADDR"]:
  print("A TPU is connected.")


torch 2.1.0+cu121
device type is cpu


In [7]:
# first, we define a preprocessing function that (e.g.) takes the
# constituents and returns another representation of them

#def preprocess_constituents(constituents):
#    return constituents[:, :120].reshape((-1, 480))

def preprocess_constituents(constituents):
  # sum all constituents to get the jet 4-momentum
  c_sum=constituents.sum(axis=1)
  metric=np.array([1.,-1.,-1.,-1.]) # Minkowski metric g_mu_nu
  # Minkowski product of each constituent with the jet 4-momentum
  c_inv=(constituents*metric*c_sum[:,None,:]).sum(axis=2)
  return c_inv
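
To make the preprocessing step concrete, here is a small worked example of the same computation on a toy jet. It is a sketch under two assumptions that the notebook does not state explicitly: the constituents array has shape `(n_jets, n_constituents, 4)`, and the four components are ordered `(E, px, py, pz)`, as the metric signature suggests. The numbers are made up for illustration.

```python
import numpy as np

# Toy batch: 1 jet with 2 constituents, components assumed to be (E, px, py, pz)
constituents = np.array([[[5.0, 3.0, 0.0, 4.0],
                          [3.0, 0.0, 0.0, 3.0]]])

jet = constituents.sum(axis=1)               # jet 4-momentum, shape (1, 4)
metric = np.array([1.0, -1.0, -1.0, -1.0])   # Minkowski metric g_mu_nu

# p_i . p_jet = E_i*E_jet - px_i*px_jet - py_i*py_jet - pz_i*pz_jet
c_inv = (constituents * metric * jet[:, None, :]).sum(axis=2)

print(c_inv.shape)  # (1, 2)
print(c_inv)        # [[3. 3.]]

# Summing over constituents gives p_jet . p_jet, the squared jet mass
jet_mass_sq = (jet * metric * jet).sum(axis=1)
print(jet_mass_sq)  # [6.]
```

Each jet is thus reduced from a list of 4-vectors to one Lorentz-invariant number per constituent, which is why 200 constituents yield 200 input features.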


In [37]:
# here, we define a function to construct the datasets
def makeDataset(name=None,nFiles=None):
  if name not in ['train','valid','test']:
    print('Need a proper data split name')
    return
  # default: two files for training, one otherwise; an explicit nFiles is kept
  if nFiles is None: nFiles = 2 if name == 'train' else 1
  c_vectors, _, labels = hlp.data.load(name, stop_file=nFiles)
  # run the preprocessing
  c_vectors = preprocess_constituents(c_vectors)
  # create torch tensors from numpy arrays, map to float32,
  # and move to GPU if available - device must be defined
  c_tensor      = torch.from_numpy(c_vectors).float().to(device)
  label_tensor  = torch.from_numpy(labels   ).float().to(device)
  # Then, we create a dataset from our tensors
  print(f'dataset {name} \tlength',len(label_tensor),'\tshape',c_tensor.shape)
  return torch.utils.data.TensorDataset(c_tensor,label_tensor)

In [48]:
# construct a dataloader from dataset of batch size bs
def makeDataLoader(dataset,bs=500,shuffle=True):
  loader  = torch.utils.data.DataLoader(dataset, batch_size=bs, shuffle = shuffle)
  n_inner=len(loader)  # number of batches per epoch
  print(f'The dataloader is able to produce {n_inner:3d} batches of {bs} data points each.')
  return loader

In [46]:
dataset_train = makeDataset('train')
dataset_valid = makeDataset('valid')
dataset_test  = makeDataset('test')

dataset train 	length 100000 	shape torch.Size([100000, 200])
dataset valid 	length 50000 	shape torch.Size([50000, 200])
dataset test 	length 50000 	shape torch.Size([50000, 200])


In [51]:
train_loader = makeDataLoader(dataset_train,500)
valid_loader = makeDataLoader(dataset_valid,10000,False)


The dataloader is able to produce 200 batches of 500 data points each.
The dataloader is able to produce   5 batches of 10000 data points each.
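
The batch counts printed above follow from simple arithmetic, which the sketch below reproduces. `n_batches` is a hypothetical helper written for this illustration, not part of PyTorch; it mirrors how `len()` of a `DataLoader` over a fixed-size dataset behaves with and without `drop_last`.

```python
import math

def n_batches(n_samples, batch_size, drop_last=False):
    """Number of batches per epoch for a dataset of fixed size."""
    if drop_last:
        # an incomplete final batch is discarded
        return n_samples // batch_size
    # an incomplete final batch is kept
    return math.ceil(n_samples / batch_size)

print(n_batches(100000, 500))    # 200
print(n_batches(50000, 10000))   # 5
print(n_batches(100000, 512))    # 196, the last batch holds only 160 samples
```

The message "batches of bs data points each" is therefore exact only when the batch size divides the dataset length, as it does for both loaders here.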


skorch works with callbacks. Callbacks are called at certain points in the training loop, most notably at epoch start, epoch end, batch start, and batch end. The most common callbacks, e.g. for scoring, are predefined.
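
The callback idea itself can be sketched in plain Python. The hook names below mirror those of `skorch.callbacks.Callback`, but everything else is a toy written for this illustration: `toy_fit` and `CountBatches` are hypothetical and do not reproduce skorch's actual training loop.

```python
class Callback:
    """Base class: every hook is a no-op unless a subclass overrides it."""
    def on_epoch_begin(self, net, **kwargs): pass
    def on_epoch_end(self, net, **kwargs): pass
    def on_batch_begin(self, net, **kwargs): pass
    def on_batch_end(self, net, **kwargs): pass

class CountBatches(Callback):
    """Hypothetical callback that records how many batches each epoch saw."""
    def __init__(self):
        self.per_epoch = []
    def on_epoch_begin(self, net, **kwargs):
        self._count = 0
    def on_batch_end(self, net, **kwargs):
        self._count += 1
    def on_epoch_end(self, net, **kwargs):
        self.per_epoch.append(self._count)

def toy_fit(callbacks, n_epochs=2, n_batches=3):
    """Stand-in for a training loop; only the hook calls are real here."""
    net = None  # placeholder for the network object handed to each hook
    for _ in range(n_epochs):
        for cb in callbacks: cb.on_epoch_begin(net)
        for _ in range(n_batches):
            for cb in callbacks: cb.on_batch_begin(net)
            # ... forward pass, loss, backward pass, optimizer step ...
            for cb in callbacks: cb.on_batch_end(net)
        for cb in callbacks: cb.on_epoch_end(net)

counter = CountBatches()
toy_fit([counter])
print(counter.per_epoch)  # [3, 3]
```

The training loop never needs to know what a callback does; it only promises to call each hook at the right moment, which is what makes scoring, logging, or plotting pluggable.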

In [13]:
from skorch.callbacks import EpochScoring
auc = EpochScoring(scoring='roc_auc', lower_is_better=False)

((100000, 200), torch.Size([100000, 1]))
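
As a reminder of what the `roc_auc` score measures, the sketch below computes it directly from its probabilistic definition: the fraction of (signal, background) pairs in which the signal event receives the higher score. `roc_auc_pairwise` is a hypothetical helper for illustration only; it builds a full pairwise matrix and is far too memory-hungry for the datasets used here, where the library implementation should be preferred.

```python
import numpy as np

def roc_auc_pairwise(y_true, y_score):
    """AUC as the fraction of correctly ordered (positive, negative) pairs.

    Ties between a positive and a negative score count as half a pair.
    """
    y_true = np.asarray(y_true).ravel()
    y_score = np.asarray(y_score).ravel()
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]   # all positive-negative score differences
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(roc_auc_pairwise([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why the callback is created with `lower_is_better=False`.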

In [None]:
# we construct a network: a stack of fully connected layers
from torch import nn
import torch.nn.functional as F

class ClassifierModule(nn.Module):
    # n_classes=2 is an assumption (binary labels); adjust to the task
    def __init__(self,in_size=200,mid_size=200,n_layers=5,n_classes=2):
        super().__init__()
        self.in_size  = in_size
        self.mid_size = mid_size
        self.n_layers = n_layers
        # one input layer followed by n_layers hidden layers of equal width
        self.linears = nn.ModuleList(
            [nn.Linear(in_size, mid_size)]
            + [nn.Linear(mid_size, mid_size) for _ in range(n_layers)])
        self.output = nn.Linear(mid_size, n_classes)

    def forward(self, X, **kwargs):
        # ModuleList can act as an iterable, or be indexed using ints
        for layer in self.linears:
            X = F.relu(layer(X))
        # NeuralNetClassifier works with class probabilities by default
        X = F.softmax(self.output(X), dim=-1)
        return X


from skorch import NeuralNetClassifier
net = NeuralNetClassifier(
    ClassifierModule,
    max_epochs=50,
    lr=0.1,
#    callbacks=[LivePlot],
    callbacks=[auc],
#     device='cuda',  # uncomment this to train with CUDA
)
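
Before fitting, the data has to match what the classifier expects. With its default criterion (`torch.nn.NLLLoss`), `NeuralNetClassifier` needs `float32` features and one integer class index per sample, whereas the labels above are stored as floats of shape `(N, 1)`. The sketch below shows the conversion on random stand-in arrays; the array names and sizes are assumptions for illustration, and the final `fit` call is left commented out because it requires skorch and the `net` object defined above.

```python
import numpy as np

# Stand-in data with the shapes seen above: features (N, 200), labels (N, 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 200))
labels = rng.integers(0, 2, size=(8, 1)).astype(np.float64)

X_fit = X.astype(np.float32)                  # torch layers default to float32
y_fit = labels.reshape(-1).astype(np.int64)   # class indices, shape (N,)

print(X_fit.dtype, X_fit.shape)  # float32 (8, 200)
print(y_fit.dtype, y_fit.shape)  # int64 (8,)

# With skorch installed and `net` defined as above, training would then be:
# net.fit(X_fit, y_fit)
```

skorch handles batching and the train/validation split internally, so plain arrays are sufficient here and no explicit `DataLoader` is needed for this route.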