In [6]:
print(f'x number of features: {x_feats}')
print(f'y number of features: {y_feats}')

x number of features: 4128
y number of features: 552


In [7]:
mlp = km.parallel.models.LitMLP(
    in_channels=x_feats,
    hidden_channels=100,
    out_channels=y_feats
)
mlp

LitMLP(
  (net): Sequential(
    (0): Normalize()
    (1): Linear(in_features=4128, out_features=100, bias=True)
    (2): SiLU()
    (3): Linear(in_features=100, out_features=100, bias=True)
    (4): SiLU()
    (5): Linear(in_features=100, out_features=100, bias=True)
    (6): SiLU()
    (7): Linear(in_features=100, out_features=552, bias=True)
  )
)
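The layer sizes in the printed model are enough to work out the parameter count by hand. The sketch below does that arithmetic, assuming that `Normalize` and `SiLU` hold no learnable parameters and that weights are stored as 4-byte floats; both are assumptions, not something the printout states.

```python
# Layer sizes copied from the LitMLP representation printed above.
layer_sizes = [(4128, 100), (100, 100), (100, 100), (100, 552)]


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


total_params = sum(linear_params(i, o) for i, o in layer_sizes)
size_mb = total_params * 4 / 1e6  # assumption: float32, 4 bytes per parameter

print(f"trainable parameters: {total_params:,}")
print(f"estimated size: {size_mb:.3f} MB")
```

The result should agree with the `488 K` parameters and `1.955` MB that the `Trainer` summary reports further down.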

## Dataset creation and data loading mechanics

In [8]:
!cat data.py

# MIT License
# 
# Copyright (c) 2022 alxyok
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# 
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, A

* `batch_size` sets the number of elements in a batch of data.
* `num_workers` sets the number of worker processes the DataLoader can spawn to handle data loading and Dataset batching.

In [10]:
datamodule = km.parallel.data.FlattenedDataModule(
    batch_size=256,
    num_workers=16
)
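To get a feel for what `batch_size` implies, the number of batches per epoch follows directly from the dataset size. The sketch below is plain arithmetic; `num_batches` is a hypothetical helper and the sample count is made up for illustration, since the notebook does not state the real one.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Batches per epoch; the last, smaller batch is kept unless drop_last is set."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


batch_size = 256          # same value as in the cell above
num_samples = 100_000     # hypothetical dataset size, for illustration only

print(num_batches(num_samples, batch_size))                  # 391
print(num_batches(num_samples, batch_size, drop_last=True))  # 390
```

A larger `batch_size` means fewer, heavier steps per epoch; `num_workers` does not change this count, only how many processes prepare the batches in parallel.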

## Orchestrating the training

All the training instrumentation is handled by an object called the `Trainer`. You can set parameters such as `max_epochs`, the `accelerator` type, and the logical device numbers via `devices`.

Other notable arguments:
* `callbacks` to hook custom logic into the training loop (between epochs or batches, for example)
* `gradient_clip_val` and `gradient_clip_algorithm` to set up gradient clipping
* `logger` to interface with loss and metrics logging
* `resume_from_checkpoint` to resume a previously started training run
* `amp_backend` to switch to the Nvidia Apex framework for Automatic Mixed Precision support

In [11]:
trainer = pl.Trainer(
    max_epochs=1,
    logger=pl.loggers.tensorboard.TensorBoardLogger(
        save_dir=km.LOGS_PATH,
        name='flattened_mlp_logs',
        log_graph=True
    ),
    deterministic=True,
    # amp_backend='apex'
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
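The `gradient_clip_val` argument mentioned above is easier to reason about with a small example. The sketch below shows what clipping by norm does conceptually, on a plain list of numbers; `clip_by_norm` is a hypothetical helper written for illustration and is not the code Lightning runs.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever scale down; the small epsilon avoids dividing by zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print(f"norm before: {norm_before:.3f}")
print(f"norm after:  {norm_after:.3f}")
```

Gradients whose norm is already below the threshold pass through unchanged, so the direction of the update is preserved and only oversized steps are shortened.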


Training on CPU is a one-liner.

In [12]:
trainer.fit(model=mlp, datamodule=datamodule)


  | Name | Type       | Params
------------------------------------
0 | net  | Sequential | 488 K 
------------------------------------
488 K     Trainable params
0         Non-trainable params
488 K     Total params
1.955     Total estimated model params size (MB)


                                                                      

Global seed set to 42


Epoch 0:   3%|▎         | 98/3816 [00:24<15:48,  3.92it/s, loss=1.57, v_num=2, train_loss=1.060] 

  rank_zero_warn("Detected KeyboardInterrupt, attempting graceful shutdown...")


Testing: 100%|█████████▉| 422/424 [01:24<00:00,  5.70it/s]
--------------------------------------------------------------------------------
DATALOADER:0 TEST RESULTS
{'test_loss': 1.473724365234375, 'test_loss_epoch': 1.473724365234375}
--------------------------------------------------------------------------------
Testing: 100%|██████████| 424/424 [01:24<00:00,  4.99it/s]


Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 242, in _feed
    send_bytes(obj)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/opt/conda/lib/python3.7/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)

[{'test_loss': 1.473724365234375, 'test_loss_epoch': 1.473724365234375}]

Never forget to test. The handy thing with the `Trainer` is that if a `.test()` is called somewhere at runtime and the run is interrupted (a `KeyboardInterrupt` in the log above), Lightning catches the interruption, attempts a graceful shutdown, and then tries to run the test anyway.

In [None]:
trainer.test(model=mlp, datamodule=datamodule)
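The interrupt-then-test behavior described above follows a familiar pattern: catch the interruption inside the training loop, shut down cleanly, and let execution continue to the test step. The sketch below imitates that pattern in plain Python; `run` and its arguments are hypothetical, and this is not Lightning's actual implementation.

```python
def run(train_steps, interrupt_at=None):
    """Train for train_steps, survive a simulated Ctrl+C, then always test."""
    log = []
    try:
        for step in range(train_steps):
            if step == interrupt_at:
                raise KeyboardInterrupt  # simulates the user interrupting the run
            log.append(f"train step {step}")
    except KeyboardInterrupt:
        # The interruption is swallowed here instead of aborting the program.
        log.append("interrupted, shutting down gracefully")
    log.append("test")  # reached whether or not training was interrupted
    return log


for line in run(train_steps=5, interrupt_at=2):
    print(line)
```

Because the exception never leaves the training call, whatever comes after it, here the test step, still executes.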

Now let's go single-node, multi-GPU.

In [None]:
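# `accelerator` and `devices` are Trainer arguments, not `fit()` arguments
trainer = pl.Trainer(
    max_epochs=1,
    accelerator='gpu',
    devices=[0, 1, 2, 3],
    deterministic=True
)
trainer.fit(model=mlp, datamodule=datamodule)
trainer.test(model=mlp, datamodule=datamodule)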
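With four entries in `devices`, a data-parallel run typically gives each GPU its own slice of the dataset. The notebook does not say which strategy is used, so the sketch below only illustrates the general idea of striding sample indices across ranks; `shard_indices` is a hypothetical helper, and real samplers may also shuffle and pad the shards to equal length.

```python
def shard_indices(num_samples: int, world_size: int, rank: int) -> list:
    """Indices handled by one process: every world_size-th sample, offset by rank."""
    return list(range(rank, num_samples, world_size))


world_size = 4    # one process per entry in devices=[0, 1, 2, 3]
num_samples = 10  # tiny, made-up dataset for illustration

shards = [shard_indices(num_samples, world_size, rank) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Every sample lands on exactly one rank, so each GPU processes roughly a quarter of the data per epoch.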