<a href="https://colab.research.google.com/github/dkgithub/wiehl24/blob/main/skorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Working with PyTorch can become involved. There are many tool that try to avoid writimg out all the litlle details.
Most commonly used is lighning. Here, we use skorch. It provides a Keras like interface that interacts smoothly with sklearn.

In [1]:
#!rm -rf helpers # if an enforce reinstall is necessary
![ ! -d helpers ] && git clone --recursive https://github.com/dkgithub/erum_ml_school_helpers helpers
!pip install wget

Cloning into 'helpers'...
remote: Enumerating objects: 50, done.[K
remote: Counting objects: 100% (50/50), done.[K
remote: Compressing objects: 100% (37/37), done.[K
remote: Total 50 (delta 25), reused 31 (delta 12), pack-reused 0[K
Receiving objects: 100% (50/50), 12.21 KiB | 12.21 MiB/s, done.
Resolving deltas: 100% (25/25), done.
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... [?25l[?25hdone
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9655 sha256=75b048f68a0ea9e157e1f0b148594f0378d2175e6f07ea48c785da3439cb93ca
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [3]:
!pip install wget torchinfo skorch livelossplot



In [4]:
# load the helpers package and other software
import helpers as hlp
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import torch
import torchinfo
import skorch as sk
from livelossplot import PlotLosses

In [5]:
#check for accelerators
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print('torch',torch.__version__)
print('device type is',device)
if device == 'cuda' :print(torch.cuda.get_device_name())
from os import environ
if "COLAB_TPU_ADDR" in environ and environ["COLAB_TPU_ADDR"]:
  print("A TPU is connected.")


torch 2.1.0+cu121
device type is cuda


In [6]:
# first, we define a preprocessing function that (e.g.) takes the
# constiuents and returns another representation of them

#def preprocess_constituents(constituents):
#    return constituents[:, :120].reshape((-1, 480))

def preprocess_constituents(constituents):
  # sum all constituents to get jet 4-momenta
  c_sum=constituents.sum(axis=1)
  metric=np.array([1.,-1.,-1.,-1.]) #g_mu_nu
  # calculating invariants wrt. to jet
  c_inv=(constituents*metric*c_sum[:,None,:]).sum(axis=2)
  return c_inv


In [7]:
# here, we define a function to construct the datasets
def makeDataset(name=None,nFiles=None):
  if name not in ['train','valid','test']:
    print(f'Need a proper data split name')
    return
  if name == 'train' and nFiles == None: nFiles = 2
  else: nFiles = 1
  c_vectors, _, labels = hlp.data.load(name, stop_file=nFiles)
  # run the preprocessing
  c_vectors = preprocess_constituents(c_vectors)
  # create torch tensors from numpy arrays, map to float32,
  # and move to GPU if available - device must be defined
  c_tensor      = torch.from_numpy(c_vectors).float().to(device)
  label_tensor  = torch.from_numpy(labels   ).float().to(device)
  # Then, we create a dataset from our tensors
  print(f'dataset {name} \tlength',len(label_tensor),'\tshape',c_tensor.shape)
  return torch.utils.data.TensorDataset(c_tensor,label_tensor)

In [8]:
# A fuction to construct a dataloader from dataset of batch size bs
def makeDataLoader(dataset,bs=500,shuffle=True):
  loader  = torch.utils.data.DataLoader(dataset, batch_size=bs, shuffle = shuffle)
  n_inner=float(len(loader))
  print(f'The dataloader is able to produce {int(n_inner):3d} batches of {bs} data points each.')
  return loader

In [9]:
dataset_train = makeDataset('train')
dataset_valid = makeDataset('valid')
dataset_test  = makeDataset('test')

dataset train 	length 100000 	shape torch.Size([100000, 200])
dataset valid 	length 50000 	shape torch.Size([50000, 200])
dataset test 	length 50000 	shape torch.Size([50000, 200])


In [10]:
train_loader = makeDataLoader(dataset_train,500)
valid_loader = makeDataLoader(dataset_valid,10000,False)


The dataloader is able to produce 200 batches of 500 data points each.
The dataloader is able to produce   5 batches of 10000 data points each.


In [12]:
# we construct a network
from torch import nn
class model(nn.Module):
  def __init__(self,in_size=200,mid_size=200,n_layers=5):
    super().__init__()
    self.in_size  = in_size
    self.mid_size = mid_size
    self.n_layers = n_layers
    self.inLay    = nn.Linear(in_size,mid_size)
    self.linears  = nn.ModuleList([nn.Linear(mid_size, mid_size) for i in range(n_layers)])
    self.bnorms   = nn.ModuleList([nn.BatchNorm1d(mid_size) for i in range(n_layers)])
    self.outLay   = nn.Linear(mid_size, 1)

  def forward(self, x):
    x = self.inLay(x)
    x = torch.relu(x)
    # ModuleList can act as an iterable, or be indexed using ints
    for i,lay in enumerate(self.linears):
      x = lay(x)
      self.bnorms[i](x)
      x = torch.relu(x)
      x = self.outLay(x)
      x = torch.sigmoid(x)


Skorch works with callbacks. Callbacks are called at certain points in the processing loop. Especially: epoch start,epoch end, batch start and batch end. Most common callbacks, e.g. scoring are predefined.

In [13]:
from skorch.callbacks import EpochScoring,EpochTimer
auc = EpochScoring(scoring=['accuracy','roc_auc'], lower_is_better=False)

In [15]:

from skorch import NeuralNetClassifier
net = NeuralNetClassifier(
    model,
    max_epochs=50,
    lr=0.01,
#    callbacks=[LivePlot],
    callbacks=[auc],
     device=device
)

In [22]:
c_vectors_train, _, labels_train = hlp.data.load("train", stop_file=2)
c_vec = preprocess_constituents(c_vectors_train)

In [32]:
net.fit(c_vec.astype(np.float32),labels_train)

Re-initializing module.
Re-initializing criterion.
Re-initializing optimizer.


RuntimeError: mat1 and mat2 shapes cannot be multiplied (128x1 and 200x200)

In [31]:
c_vec.astype(np.float32)

array([[ 6641.8994 ,  1080.4357 ,  1038.2922 , ...,     0.     ,
            0.     ,     0.     ],
       [ 6370.779  ,  4102.971  ,  2991.785  , ...,     0.     ,
            0.     ,     0.     ],
       [ 2131.0713 ,  3402.4973 ,  2623.3242 , ...,     0.     ,
            0.     ,     0.     ],
       ...,
       [11724.193  ,  1235.5631 ,  1854.0349 , ...,     0.     ,
            0.     ,     0.     ],
       [ 6667.823  ,  3243.4338 ,  3912.8542 , ...,     0.     ,
            0.     ,     0.     ],
       [ 2828.156  ,   786.15955,  1421.4751 , ...,     0.     ,
            0.     ,     0.     ]], dtype=float32)