# PyTorch

Rev.: 2022-04-09

Example of training a convolutional neural network using the MNIST database. Uses multi-GPU and multi-node data parallelism running on the LNCC Santos Dumont supercomputer.

Adapted from the IDRIS documentation: http://www.idris.fr/eng/jean-zay/gpu/jean-zay-gpu-torch-multi-eng.html (see that page for more details)

<hr style="height:10px;border-width:0;background-color:red">

Currently we are running on the node:

In [1]:
! hostname

sdumont18


In [2]:
! module list

No Modulefiles Currently Loaded.


* PyTorch checkpoints: <http://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html>

In [18]:
%%bash
cd /scratch${PWD#/prj}
mkdir -p checkpoint
rm -f checkpoint/*
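
The `${PWD#/prj}` expansion in the cell above removes a leading `/prj` from the current directory, so `/scratch${PWD#/prj}` names the matching directory under `/scratch`. A minimal Python sketch of the same mapping, assuming the `/prj` and `/scratch` trees mirror each other as the cells in this notebook suggest; `to_scratch` is a hypothetical helper name and the example path is made up:

```python
def to_scratch(path, prj_root="/prj", scratch_root="/scratch"):
    """Map a path under prj_root to the same relative path under scratch_root.

    Mirrors the shell expansion /scratch${PWD#/prj}: the prefix is removed
    only when the path starts with it; otherwise the path is appended as-is.
    """
    if path.startswith(prj_root):
        path = path[len(prj_root):]
    return scratch_root + path


# Hypothetical project directory, for illustration only.
print(to_scratch("/prj/myproject/user/pt"))
```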

In [18]:
import os, torchvision, torchvision.transforms as transforms

SCRATCH = os.environ['PWD'].replace('/prj/', '/scratch/') + '/pytorch'
os.environ['SCRATCH'] = SCRATCH
print(SCRATCH)

torchvision.datasets.MNIST(root=SCRATCH,
                            train=True,
                            transform=transforms.ToTensor(),
                            download=True)

/scratch/ampemi/eduardo.miranda2/pytorch
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw/train-images-idx3-ubyte.gz




Extracting /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw/train-images-idx3-ubyte.gz to /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw/train-labels-idx1-ubyte.gz




Extracting /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw/train-labels-idx1-ubyte.gz to /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw/t10k-images-idx3-ubyte.gz




Extracting /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw/t10k-images-idx3-ubyte.gz to /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw/t10k-labels-idx1-ubyte.gz




Extracting /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw/t10k-labels-idx1-ubyte.gz to /scratch/ampemi/eduardo.miranda2/pytorch/MNIST/raw



Dataset MNIST
    Number of datapoints: 60000
    Root location: /scratch/ampemi/eduardo.miranda2/pytorch
    Split: Train
    StandardTransform
Transform: ToTensor()

In [63]:
%%writefile mnist-distributed.py 

import os
from datetime import datetime
from time import time
import argparse
import torch.multiprocessing as mp
import torchvision
import torchvision.transforms as transforms
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
import sdenv

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-b', '--batch-size', default=128, type=int,
        help='global batch size; it is divided into per-worker mini-batches')
    parser.add_argument('-e','--epochs', default=2, type=int, metavar='N',
        help='number of total epochs to run')
    parser.add_argument('-c','--checkpoint', default=None, type=str,
        help='path to checkpoint to load')
    args = parser.parse_args()

    train(args)   

Overwriting mnist-distributed.py


In [64]:
%%writefile -a mnist-distributed.py

class ConvNet(nn.Module):

    def __init__(self, num_classes=10):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out

Appending to mnist-distributed.py
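
The `7*7*32` input size of the final `nn.Linear` layer follows from the layer parameters in `ConvNet`. The sketch below redoes that arithmetic with the standard output-size formula for convolution and pooling (no dilation); `conv_out` is a helper name introduced here for illustration, and the 28x28 input size is the MNIST image size.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28                                              # MNIST images are 28x28
size = conv_out(size, kernel=5, stride=1, padding=2)   # layer1 conv -> 28
size = conv_out(size, kernel=2, stride=2)              # layer1 pool -> 14
size = conv_out(size, kernel=5, stride=1, padding=2)   # layer2 conv -> 14
size = conv_out(size, kernel=2, stride=2)              # layer2 pool -> 7

fc_inputs = size * size * 32                           # 32 channels after layer2
print(size, fc_inputs)                                 # prints: 7 1568
```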


In [65]:
%%writefile -a mnist-distributed.py

def train(args):
    # NCCL is the only backend that currently supports InfiniBand
    # and GPUDirect.
    # Configure the distribution method: define the address and port of
    # the master node and initialise the communication backend (NCCL).
    # Alternative to 'env://': init_method=sdenv.MASTER_TCP
    dist.init_process_group(
        backend = 'nccl',
        init_method = 'env://',
        world_size=sdenv.size, 
        rank=sdenv.rank
    )
    
    # distribute model
    torch.cuda.set_device(sdenv.local_rank)
    gpu = torch.device("cuda")
    model = ConvNet().to(gpu)
    ddp_model = DistributedDataParallel(
        model, 
        device_ids=[sdenv.local_rank])
    if args.checkpoint is not None:
        map_location = {'cuda:%d' % 0: 'cuda:%d' % sdenv.local_rank}
        ddp_model.load_state_dict(
            torch.load(args.checkpoint, map_location=map_location))
    
    # distribute batch size (mini-batch)
    batch_size = args.batch_size 
    batch_size_per_gpu = batch_size // sdenv.size
    
    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss()  
    optimizer = torch.optim.SGD(ddp_model.parameters(), 1e-4)

    # load data with distributed sampler
    train_dataset = torchvision.datasets.MNIST(
        root=os.environ['DSDIR'],
        train=True,
        transform=transforms.ToTensor(),
        download=False)
    
    train_sampler = torch.utils.data.distributed.DistributedSampler(
        train_dataset,
        num_replicas=sdenv.size,
        rank=sdenv.rank)

    train_loader = torch.utils.data.DataLoader(
        dataset=train_dataset,
        batch_size=batch_size_per_gpu,
        shuffle=False,
        num_workers=0,
        pin_memory=True,
        sampler=train_sampler)

    # training (timers and display handled by process 0)
    if sdenv.rank == 0: 
        start = datetime.now()
    total_step = len(train_loader)
    
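    for epoch in range(args.epochs):
        # pass the epoch to the sampler so each epoch uses a different shuffle
        train_sampler.set_epoch(epoch)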

        if sdenv.rank == 0: start_dataload = time()
        
        for i, (images, labels) in enumerate(train_loader):
            
            # distribution of images and labels to all GPUs
            images = images.to(gpu, non_blocking=True)
            labels = labels.to(gpu, non_blocking=True) 
            
            if sdenv.rank == 0: stop_dataload = time()

            if sdenv.rank == 0: start_training = time()
            
            # forward pass
            outputs = ddp_model(images)
            loss = criterion(outputs, labels)

            # backward and optimize
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            if sdenv.rank == 0: 
                stop_training = time() 
            if (i + 1) % 200 == 0 and sdenv.rank == 0:
                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Time data load: {:.3f}ms, Time training: {:.3f}ms'\
                      .format(epoch + 1, 
                              args.epochs, 
                              i + 1, 
                              total_step, 
                              loss.item(), 
                              (stop_dataload - start_dataload)*1000,
                              (stop_training - start_training)*1000))
            if sdenv.rank == 0: start_dataload = time()
                    
        # save a checkpoint at the end of every epoch
        if sdenv.rank == 0:
            torch.save(
                ddp_model.state_dict(), 
                './checkpoint/{}GPU_{}epoch.checkpoint'.format(sdenv.size, epoch+1))

    if sdenv.rank == 0:
        print(">>> Training complete in: "+str(datetime.now()-start))

Appending to mnist-distributed.py
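
The interplay between `--batch-size`, the number of processes, and the step count reported in the logs (`Step [200/469]`) can be checked with a little arithmetic. This is a sketch assuming the `DistributedSampler` and `DataLoader` defaults used in the script (`drop_last=False`) and the 60000 training images reported earlier; `steps_per_epoch` is a helper name introduced here.

```python
import math


def steps_per_epoch(dataset_size, batch_size, world_size):
    """Return (per-GPU batch size, steps per epoch) for one worker.

    Assumes the DistributedSampler defaults (drop_last=False): every rank
    gets ceil(dataset_size / world_size) samples, and the DataLoader keeps
    the last, possibly smaller, batch.
    """
    per_gpu_batch = batch_size // world_size
    samples_per_rank = math.ceil(dataset_size / world_size)
    return per_gpu_batch, math.ceil(samples_per_rank / per_gpu_batch)


# Columns: processes, per-GPU batch, effective global batch, steps per epoch
for world_size in (1, 4, 12):
    per_gpu, steps = steps_per_epoch(60000, 128, world_size)
    print(world_size, per_gpu, per_gpu * world_size, steps)
```

Because of the integer division, a world size that does not divide the batch size shrinks the effective global batch: with 12 processes each GPU gets 10 samples per step, for a global batch of 120 rather than 128.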


In [66]:
%%writefile -a mnist-distributed.py

if __name__ == '__main__':   
    # display info
    if sdenv.rank == 0:
        print(">>> Training on ", len(sdenv.hostnames), 
              " nodes and ", sdenv.size, 
              " processes, master node is ", sdenv.MASTER_ADDR)
    print("- Process {} corresponds to GPU {} of node {}"\
          .format(sdenv.rank, 
          sdenv.local_rank, 
          sdenv.node_rank))

    main()

Appending to mnist-distributed.py
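
The script relies on a module named `sdenv` that is not listed in this notebook. Purely as an illustration, and not as the actual contents of `sdenv.py`, the sketch below shows how the rank-related values could be derived from the per-task variables that `srun` sets; `slurm_layout` and the example values are hypothetical, and the real module also provides `hostnames` and `MASTER_ADDR`, which are not covered here.

```python
import os


def slurm_layout(env=os.environ):
    """Read the process layout from the variables that srun sets per task."""
    return {
        "rank": int(env["SLURM_PROCID"]),         # global rank of this task
        "local_rank": int(env["SLURM_LOCALID"]),  # task index within its node
        "node_rank": int(env["SLURM_NODEID"]),    # index of the node in the job
        "size": int(env["SLURM_NTASKS"]),         # total number of tasks
    }


# Made-up values for one task of a 3-node job with 4 tasks per node.
example_env = {
    "SLURM_PROCID": "6",
    "SLURM_LOCALID": "2",
    "SLURM_NODEID": "1",
    "SLURM_NTASKS": "12",
}
print(slurm_layout(example_env))
```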


In [71]:
%%bash
BASE=/scratch${PWD#/prj}
cp mnist-distributed.py sdenv.py $BASE
mkdir -p $BASE/checkpoint
rm -f $BASE/checkpoint/*
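
The files written to `checkpoint/` come from `ddp_model.state_dict()`. `DistributedDataParallel` keeps the wrapped network under its `module` attribute, so every key in such a checkpoint starts with `module.`; loading it into a plain `ConvNet` (for example for single-process inference) requires dropping that prefix. A minimal sketch on a plain dictionary, where `strip_ddp_prefix` is a helper name introduced here and the string values stand in for tensors:

```python
def strip_ddp_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with the DDP wrapper prefix removed from keys."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Stand-in for a loaded checkpoint: real values would be tensors.
ddp_state = {"module.fc.weight": "W", "module.fc.bias": "b"}
print(sorted(strip_ddp_prefix(ddp_state)))  # prints: ['fc.bias', 'fc.weight']
```

With PyTorch available, the cleaned dictionary could then be passed to `model.load_state_dict(...)` after reading the file with `torch.load(path, map_location='cpu')`.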

Note: the PyTorch version used here does not run on the K40 GPUs of the B715 nodes; it reports that the GPU is not supported.

In [72]:
%%writefile monogpu.srm
#!/bin/bash
#SBATCH --job-name monogpu              # SLURM_JOB_NAME
#SBATCH --partition sequana_gpu_shared  # SLURM_JOB_PARTITION
#SBATCH --nodes=1                       # SLURM_JOB_NUM_NODES
#SBATCH --ntasks-per-node=1             # SLURM_NTASKS_PER_NODE
#SBATCH --cpus-per-task=10              # SLURM_CPUS_PER_TASK
#SBATCH --time=00:10:00                 # Limit execution time

# VARIABLES OF INTEREST IN THE SLURM ENVIRONMENT
# <https://slurm.schedmd.com/sbatch.html>
# SLURM_PROCID
#     The MPI rank (or relative process ID) of the current process.
# SLURM_LOCALID
#     Node local task ID for the process within a job.
# SLURM_NODEID
#     ID of the nodes allocated. 

echo '========================================'
echo '- Job ID:' $SLURM_JOB_ID
echo '- # of nodes in the job:' $SLURM_JOB_NUM_NODES
echo '- # of tasks per node:' $SLURM_NTASKS_PER_NODE
echo '- # of tasks:' $SLURM_NTASKS
echo '- # of cpus per task:' $SLURM_CPUS_PER_TASK
echo '- Dir from which sbatch was invoked:' ${SLURM_SUBMIT_DIR##*/}
echo -n '- Nodes allocated to the job: '
nodeset -e $SLURM_JOB_NODELIST

# go to the work directory from which sbatch was invoked
cd $SLURM_SUBMIT_DIR

module load sequana/current
module load cuda/10.2                                              
                                              
# load the Python environment
SCR=/scratch${PWD#/prj}
BASE=/scratch${HOME#/prj}
ENV=miniconda
source $BASE/$ENV/etc/profile.d/conda.sh
conda activate $BASE/$ENV
cd $SCR

# run
echo -n '<1. starting python script > ' && date
echo '-- output -----------------------------'

srun  python  -u mnist-distributed.py  --epochs 8  --batch-size 128

echo '-- end --------------------------------'
echo -n '<2. quit>                    ' && date

Overwriting monogpu.srm


In [73]:
sub = !sbatch monogpu.srm
print(sub[0])
job = sub[0].replace('Submitted batch job ','')

Submitted batch job 10475101
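
The cell above extracts the job ID by removing the fixed text `Submitted batch job ` from the first output line. A slightly more defensive variant, sketched here with a regular expression, fails loudly when `sbatch` prints something else (for instance an error message); `parse_job_id` is a helper name introduced for this example.

```python
import re


def parse_job_id(sbatch_line):
    """Extract the numeric job ID from sbatch's 'Submitted batch job N' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_line)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_line!r}")
    return match.group(1)


print(parse_job_id("Submitted batch job 10475101"))  # prints: 10475101
```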


In [74]:
import time
c = [job]
while job in c:
    print(end='.')
    time.sleep(10)
    c = !squeue --job {job} --noheader --format "%i"
print('')
out = !echo /scratch${PWD#/prj}/slurm-
%cat {out[0] + job}.out

..............
- Job ID: 10475101
- # of nodes in the job: 1
- # of tasks per node: 1
- # of tasks: 1
- # of cpus per task: 10
- Dir from which sbatch was invoked: pt
- Nodes allocated to the job: sdumont8041
Loading SEQUANA Software environment
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'cuda/10.2'
<1. starting python script > Sex Mar 25 11:37:49 -03 2022
-- output -----------------------------
>>> Training on  1  nodes and  1  processes, master node is  172.20.15.42
- Process 0 corresponds to GPU 0 of node 0
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:20324 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [sdumont8041-ic1]:20324 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [sdumont8041-ic1]:20324 (errno: 97 - Address family not supported by protocol).
Epoc
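
The polling loop in the cell above waits indefinitely while the job is listed by `squeue`. A possible variant with a timeout is sketched below; the queue query is passed in as a callable so the logic can be tried without a scheduler. `wait_for_job` and its parameters are names introduced here, and in the notebook `list_jobs` would wrap the same `squeue --job ... --noheader --format "%i"` call.

```python
import time


def wait_for_job(job_id, list_jobs, interval=10, timeout=3600, sleep=time.sleep):
    """Block until job_id is no longer listed; return False on timeout.

    list_jobs is any callable returning the job IDs currently in the queue.
    """
    waited = 0
    while job_id in list_jobs():
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval
    return True


# Simulated queue: the job is listed twice, then disappears.
answers = iter([["10475101"], ["10475101"], []])
print(wait_for_job("10475101", lambda: next(answers), sleep=lambda s: None))
```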

<hr style="height:10px;border-width:0;background-color:green">

In [38]:
! squeue --job 10475084

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          10475084 sequana_g  monogpu eduardo. CG       0:18      1 sdumont8030


In [42]:
%cat /scratch${PWD#/prj}/slurm-10475084.out

- Job ID: 10475084
- # of nodes in the job: 1
- # of tasks per node: 1
- # of tasks: 1
- # of cpus per task: 10
- Dir from which sbatch was invoked: pt
- Nodes allocated to the job: sdumont8030
Loading SEQUANA Software environment
ModuleCmd_Load.c(213):ERROR:105: Unable to locate a modulefile for 'cuda/10.2'
<1. starting python script > Sex Mar 25 11:17:34 -03 2022
-- output -----------------------------
00:00.0 Host bridge: Intel Corporation Sky Lake-E DMI3 Registers (rev 07)
00:04.0 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 07)
00:04.1 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 07)
00:04.2 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 07)
00:04.3 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 07)
00:04.4 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 07)
00:04.5 System peripheral: Intel Corporation Sky Lake-E CBDMA Registers (rev 07)
00:04.6 System peripheral: Intel

<hr style="height:10px;border-width:0;background-color:green">

In [1]:
import time
a = !sbatch monogpu.srm
print(a[0])
while True:
    time.sleep(10)
    b = ! squeue --user $USER --name=monogpu
    if len(b) < 2: break
b = !echo /scratch${PWD#/prj}/slurm-
%cat {b[0]+a[0].replace('Submitted batch job ','')}.out

Submitted batch job 10474567
- Job ID: 10474567
- # of nodes in the job: 1
- # of tasks per node: 1
- # of tasks: 1
- # of cpus per task: 10
- Dir from which sbatch was invoked: pt
- Nodes allocated to the job: sdumont8042
<1. starting python script > Qui Mar 24 22:56:55 -03 2022
-- output -----------------------------
>>> Training on  1  nodes and  1  processes, master node is  172.20.15.43
- Process 0 corresponds to GPU 0 of node 0
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:20324 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [sdumont8042-ic1]:20324 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [sdumont8042-ic1]:20324 (errno: 97 - Address family not supported by protocol).
Epoch [1/8], Step [200/469], Loss: 2.0711, Time data load: 11.317ms, Time training: 2.136ms
Epoch [1/8], S

<hr style="height:10px;border-width:0;background-color:green">

In [33]:
import time
a = !sbatch monogpu.srm
print(a[0])
while True:
    time.sleep(10)
    b = ! squeue --user $USER --name=monogpu
    if len(b) < 2: break
b = !echo /scratch${PWD#/prj}/slurm-
%cat {b[0]+a[0].replace('Submitted batch job ','')}.out

Submitted batch job 10474458
- Job ID: 10474458
- # of nodes in the job: 1
- # of tasks per node: 1
- # of tasks: 1
- # of cpus per task: 10
- Dir from which sbatch was invoked: pt
- Nodes allocated to the job: sdumont8033
<1. starting python script > Qui Mar 24 20:24:52 -03 2022
-- output -----------------------------
>>> Training on  1  nodes and  1  processes, master node is  172.20.15.34
- Process 0 corresponds to GPU 0 of node 0
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:20324 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [sdumont8033-ic1]:20324 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [sdumont8033-ic1]:20324 (errno: 97 - Address family not supported by protocol).
Epoch [1/8], Step [200/469], Loss: 2.0588, Time data load: 7.616ms, Time training: 1.623ms
Epoch [1/8], St

<hr style="height:10px;border-width:0;background-color:green">

In [22]:
import time
a = !sbatch monogpu.srm
print(a[0])
while True:
    time.sleep(10)
    b = ! squeue --user $USER --name=monogpu
    if len(b) < 2: break
b = !echo /scratch${PWD#/prj}/slurm-
%cat {b[0]+a[0].replace('Submitted batch job ','')}.out

Submitted batch job 10473863
- Job ID: 10473863
- # of nodes in the job: 1
- # of tasks per node: 1
- # of tasks: 1
- # of cpus per task: 10
- Dir from which sbatch was invoked: pytorch
- Nodes allocated to the job: sdumont8033
which: no fi_info in (/scratch/ampemi/eduardo.miranda2/env2/bin:/scratch/ampemi/eduardo.miranda2/env2/condabin:/opt/xcs/App/Scripts:/specific/bin:/specific/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/opt/xcs/Admin/Scripts:/prj/ampemi/eduardo.miranda2/.local/bin:/prj/ampemi/eduardo.miranda2/bin)
<1. starting python script > Qui Mar 24 11:07:39 -03 2022
-- output -----------------------------
>>> Training on  1  nodes and  1  processes, master node is  sdumont8033
- Process 0 corresponds to GPU 0 of node 0
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:9000 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:558] [c10d] The client socket cannot be initialized to connect to [sdumont8033-ic1]:9

<hr style="height:10px;border-width:0;background-color:red">

## Example of multi-GPU mono-node execution

In [24]:
%%writefile mononode.srm
#!/bin/bash
#SBATCH --job-name mononode             # SLURM_JOB_NAME
#SBATCH --partition sequana_gpu_shared  # SLURM_JOB_PARTITION
#SBATCH --nodes=1                       # SLURM_JOB_NUM_NODES
#SBATCH --ntasks-per-node=4             # SLURM_NTASKS_PER_NODE
#SBATCH --cpus-per-task=10              # SLURM_CPUS_PER_TASK
#SBATCH --time=00:10:00                 # Limit execution time

# VARIABLES OF INTEREST IN THE SLURM ENVIRONMENT
# <https://slurm.schedmd.com/sbatch.html>
# SLURM_PROCID
#     The MPI rank (or relative process ID) of the current process.
# SLURM_LOCALID
#     Node local task ID for the process within a job.
# SLURM_NODEID
#     ID of the nodes allocated. 

echo '========================================'
echo '- Job ID:' $SLURM_JOB_ID
echo '- # of nodes in the job:' $SLURM_JOB_NUM_NODES
echo '- # of tasks per node:' $SLURM_NTASKS_PER_NODE
echo '- # of tasks:' $SLURM_NTASKS
echo '- # of cpus per task:' $SLURM_CPUS_PER_TASK
echo '- Dir from which sbatch was invoked:' ${SLURM_SUBMIT_DIR##*/}
echo -n '- Nodes allocated to the job: '
nodeset -e $SLURM_JOB_NODELIST

# go to the work directory from which sbatch was invoked
cd $SLURM_SUBMIT_DIR

# load the Python environment
SCR=/scratch${PWD#/prj}
BASE=/scratch${HOME#/prj}
source $BASE/env2/etc/profile.d/conda.sh
conda activate $BASE/env2
cd $SCR

# run
echo -n '<1. starting python script > ' && date
echo '-- output -----------------------------'

srun python -u mnist-distributed.py --epochs 8 --batch-size 128

echo '-- end --------------------------------'
echo -n '<2. quit>                    ' && date

Writing mononode.srm
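
With four tasks on the node, each process trains on its own quarter of the dataset, selected by the `DistributedSampler` in `mnist-distributed.py`. The sketch below imitates that partitioning in plain Python with shuffling left out: the index list is padded until it divides evenly, then each rank takes every `world_size`-th index. `shard_indices` is a name introduced here, and the 10-element dataset is only for readability.

```python
import math


def shard_indices(dataset_size, world_size, rank):
    """Indices one rank would receive, without shuffling.

    Follows the scheme used by DistributedSampler with drop_last=False:
    pad the index list (by repeating from the start) until it divides
    evenly, then give each rank every world_size-th index.
    """
    per_rank = math.ceil(dataset_size / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_size))
    indices += indices[: total - dataset_size]      # padding
    return indices[rank:total:world_size]


for rank in range(4):
    print(rank, shard_indices(10, 4, rank))
```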


In [25]:
import time
a = !sbatch mononode.srm
print(a[0])
while True:
    time.sleep(10)
    b = ! squeue --user $USER --name=mononode
    if len(b) < 2: break
b = !echo /scratch${PWD#/prj}/slurm-
%cat {b[0]+a[0].replace('Submitted batch job ','')}.out

Submitted batch job 10473923
- Job ID: 10473923
- # of nodes in the job: 1
- # of tasks per node: 4
- # of tasks: 4
- # of cpus per task: 10
- Dir from which sbatch was invoked: pytorch
- Nodes allocated to the job: sdumont8037
which: no fi_info in (/scratch/ampemi/eduardo.miranda2/env2/bin:/scratch/ampemi/eduardo.miranda2/env2/condabin:/opt/xcs/App/Scripts:/specific/bin:/specific/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/opt/xcs/Admin/Scripts:/prj/ampemi/eduardo.miranda2/.local/bin:/prj/ampemi/eduardo.miranda2/bin)
<1. starting python script > Qui Mar 24 11:42:10 -03 2022
-- output -----------------------------
- Process 1 corresponds to GPU 1 of node 0
- Process 2 corresponds to GPU 2 of node 0
- Process 3 corresponds to GPU 3 of node 0
>>> Training on  1  nodes and  4  processes, master node is  sdumont8037
- Process 0 corresponds to GPU 0 of node 0
[W socket.cpp:401] [c10d] The server socket cannot be initialized on [::]:9000 (errno: 97 - Address famil

<hr style="height:10px;border-width:0;background-color:red">

## Example of multi-GPU multi-node execution

In [28]:
%%writefile multinode.srm
#!/bin/bash
#SBATCH --job-name multinode            # SLURM_JOB_NAME
#SBATCH --partition sequana_gpu_shared  # SLURM_JOB_PARTITION
#SBATCH --nodes=3                       # SLURM_JOB_NUM_NODES
#SBATCH --ntasks-per-node=4             # SLURM_NTASKS_PER_NODE
#SBATCH --cpus-per-task=10              # SLURM_CPUS_PER_TASK
#SBATCH --time=00:10:00                 # Limit execution time

# VARIABLES OF INTEREST IN THE SLURM ENVIRONMENT
# <https://slurm.schedmd.com/sbatch.html>
# SLURM_PROCID
#     The MPI rank (or relative process ID) of the current process.
# SLURM_LOCALID
#     Node local task ID for the process within a job.
# SLURM_NODEID
#     ID of the nodes allocated. 

echo '========================================'
echo '- Job ID:' $SLURM_JOB_ID
echo '- # of nodes in the job:' $SLURM_JOB_NUM_NODES
echo '- # of tasks per node:' $SLURM_NTASKS_PER_NODE
echo '- # of tasks:' $SLURM_NTASKS
echo '- # of cpus per task:' $SLURM_CPUS_PER_TASK
echo '- Dir from which sbatch was invoked:' ${SLURM_SUBMIT_DIR##*/}
echo -n '- Nodes allocated to the job: '
nodeset -e $SLURM_JOB_NODELIST

# go to the work directory from which sbatch was invoked
cd $SLURM_SUBMIT_DIR

# load the Python environment
SCR=/scratch${PWD#/prj}
BASE=/scratch${HOME#/prj}
source $BASE/env2/etc/profile.d/conda.sh
conda activate $BASE/env2
cd $SCR

# run
echo -n '<1. starting python script > ' && date
echo '-- output -----------------------------'

srun python -u mnist-distributed.py --epochs 8 --batch-size 128

echo '-- end --------------------------------'
echo -n '<2. quit>                    ' && date

Writing multinode.srm
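
With 3 nodes and 4 tasks per node, the job runs 12 processes. Assuming Slurm places tasks in consecutive blocks per node, which is consistent with the `Process ... corresponds to GPU ... of node ...` lines in the job output, the node and the local GPU index follow from the global rank by integer division. `locate` is a helper name introduced for this sketch.

```python
def locate(rank, tasks_per_node):
    """Return (node_rank, local_rank) for a global rank under block placement."""
    return divmod(rank, tasks_per_node)


for rank in (1, 5, 6, 8, 10):
    node, gpu = locate(rank, tasks_per_node=4)
    print(f"- Process {rank} corresponds to GPU {gpu} of node {node}")
```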


In [29]:
import time
a = !sbatch multinode.srm
print(a[0])
while True:
    time.sleep(10)
    b = ! squeue --user $USER --name=multinode
    if len(b) < 2: break
b = !echo /scratch${PWD#/prj}/slurm-
%cat {b[0]+a[0].replace('Submitted batch job ','')}.out

Submitted batch job 10473925
- Job ID: 10473925
- # of nodes in the job: 3
- # of tasks per node: 4
- # of tasks: 12
- # of cpus per task: 10
- Dir from which sbatch was invoked: pytorch
- Nodes allocated to the job: sdumont8033 sdumont8034 sdumont8037
which: no fi_info in (/scratch/ampemi/eduardo.miranda2/env2/bin:/scratch/ampemi/eduardo.miranda2/env2/condabin:/opt/xcs/App/Scripts:/specific/bin:/specific/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/opt/xcs/Admin/Scripts:/prj/ampemi/eduardo.miranda2/.local/bin:/prj/ampemi/eduardo.miranda2/bin)
<1. starting python script > Qui Mar 24 11:51:58 -03 2022
-- output -----------------------------
- Process 6 corresponds to GPU 2 of node 1
- Process 8 corresponds to GPU 0 of node 2
- Process 1 corresponds to GPU 1 of node 0
- Process 10 corresponds to GPU 2 of node 2
- Process 7 corresponds to GPU 3 of node 1
- Process 9 corresponds to GPU 1 of node 2
- Process 5 corresponds to GPU 1 of node 1
[W socket.cpp:558] [c10