# Santos Dumont (SD) - Numba CPU MPI B710

In [2]:
# Show the login node's resources
! lscpu | head -n 15 | grep "Model \|CPU(s):\|Thre\|Core\|NUMA\|MHz"

CPU(s):                24
Thread(s) per core:    1
Core(s) per socket:    12
NUMA node(s):          2
Model name:            Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz
CPU MHz:               2865.820


### Test the execution

In [1]:
%%bash
module load intel_psxe/2020
source /opt/intel/parallel_studio_xe_2020/intelpython3/etc/profile.d/conda.sh
unset I_MPI_PMI_LIBRARY
time mpiexec -n 1 python -m cProfile -s cumtime numbampib710.py > numbampi.txt


real	0m7.854s
user	0m13.608s
sys	0m1.420s


In [2]:
! head numbampi.txt

Heat: 750.0000 | Time: 0.6431 | MPISize: 1
         2638411 function calls (2405433 primitive calls) in 6.981 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    910/1    0.008    0.000    6.983    6.983 {built-in method builtins.exec}
        1    0.040    0.040    6.983    6.983 numbampib710.py:1(<module>)
   629/32    0.004    0.000    4.278    0.134 <frozen importlib._bootstrap>:978(_find_and_load)
   629/32    0.003    0.000    4.277    0.134 <frozen importlib._bootstrap>:948(_find_and_load_unlocked)
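
In the profile above, `_find_and_load` accounts for about 4.28 s of the 6.98 s total, so module imports take up most of the profiled time, while the script itself reports `Time: 0.6431`. The same kind of report can be collected from inside Python with the standard-library `cProfile` and `pstats` modules. The following is a minimal sketch; `workload` and `profile_top` are hypothetical names and are not part of `numbampib710.py`.

```python
import cProfile
import io
import pstats


def workload(n=200_000):
    # Hypothetical stand-in for the real solver: a sum of squares.
    return sum(i * i for i in range(n))


def profile_top(func, limit=5):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Same ordering as `-s cumtime` on the command line.
    stats.sort_stats("cumulative").print_stats(limit)
    return result, buffer.getvalue()


result, report = profile_top(workload)
print(report)
```

Profiling only the function of interest this way keeps interpreter start-up and import costs out of the report.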


Test with 16 processes:

In [3]:
%%bash
module load intel_psxe/2020
source /opt/intel/parallel_studio_xe_2020/intelpython3/etc/profile.d/conda.sh
unset I_MPI_PMI_LIBRARY
mpiexec -n 16 python numbampib710.py

Heat: 750.0000 | Time: 1.1839 | MPISize: 16


### Copy the Python source file to /scratch

In [5]:
! cp  numbampib710.py  /scratch${PWD#/prj}
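
The destination `/scratch${PWD#/prj}` uses shell parameter expansion: `${PWD#/prj}` removes a leading `/prj` from the current directory, so the copy lands in the matching directory under `/scratch`. A small Python sketch of the same mapping follows; the function name and the example path are hypothetical.

```python
def scratch_path(pwd, prefix="/prj", scratch_root="/scratch"):
    """Mirror the shell expression /scratch${PWD#/prj}."""
    # ${PWD#/prj} drops a leading "/prj" if present; otherwise PWD is unchanged.
    tail = pwd[len(prefix):] if pwd.startswith(prefix) else pwd
    return scratch_root + tail


# Hypothetical project directory, for illustration only.
print(scratch_path("/prj/myproject/alice/notebooks"))
# -> /scratch/myproject/alice/notebooks
```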

### Slurm batch script

In [6]:
%%writefile numbampi.srm
#!/bin/bash
#SBATCH --ntasks=96            # Total number of tasks
#SBATCH --job-name numbampi    # Job name (8 characters)
#SBATCH --partition cpu_dev    # Partition (queue) to use
#SBATCH --time=00:01:00        # Maximum run time
#SBATCH --exclusive            # Exclusive use of the nodes

echo '- Job ID:' $SLURM_JOB_ID
# SLURM_NTASKS_PER_NODE is only set when --ntasks-per-node is requested, so it prints empty here
echo '- Tarefas por no:' $SLURM_NTASKS_PER_NODE
echo '- Qtd. de nos:' $SLURM_JOB_NUM_NODES
echo '- Tot. de tarefas:' $SLURM_NTASKS
echo '- Nos alocados:' $SLURM_JOB_NODELIST
nodeset -e $SLURM_JOB_NODELIST

# Modules
module load intel_psxe/2020
source /opt/intel/parallel_studio_xe_2020/intelpython3/etc/profile.d/conda.sh

# Change to the working directory
cd /scratch${PWD#/prj}

# Executable
EXEC='python numbampib710.py'

# Launch the run
srun --mpi=pmi2  -n $SLURM_NTASKS  $EXEC

Writing numbampi.srm


## Submit to the dev execution queue

In [7]:
%%bash
sbatch numbampi.srm
squeue --user $(whoami) -h -r | wc -l
squeue --partition=cpu_dev -h -r | wc -l
squeue --start --name=numbampi --format "%S %.8i %.9P %.5j %.2t %.5M %.5D %.4C"

Submitted batch job 2542360
1
2
START_TIME    JOBID PARTITION  NAME ST  TIME NODES CPUS
N/A  2542360   cpu_dev numba PD  0:00     4   96
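
The pending job shows 4 nodes for 96 CPUs, which matches 24-core nodes (the `scontrol` output further down reports `CPUTot=24`). A minimal sketch of that arithmetic, assuming one task per core and fully packed nodes; `nodes_needed` is a hypothetical helper, not a Slurm call.

```python
import math


def nodes_needed(ntasks, cores_per_node):
    """Smallest number of nodes that gives every task its own core."""
    return math.ceil(ntasks / cores_per_node)


# 96 tasks on 24-core nodes, as in the job above.
print(nodes_needed(96, 24))  # 4
```

A task count that is not a multiple of 24 would leave cores idle on the last node, which matters when the nodes are allocated with `--exclusive`.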


Check whether it has already run:

In [8]:
! squeue --start --name=numbampi --format "%S %.8i %.9P %.5j %.2t %.5M %.5D %.4C"

START_TIME    JOBID PARTITION  NAME ST  TIME NODES CPUS


Show the file containing the output:

In [9]:
! cat /scratch${PWD#/prj}/slurm-2542360.out

- Job ID: 2542360
- Tarefas por no:
- Qtd. de nos: 4
- Tot. de tarefas: 96
- Nos alocados: sdumont[1245-1248]
sdumont1245 sdumont1246 sdumont1247 sdumont1248
Heat: 602.6262 | Time: 3.2165 | MPISize: 96


In this case we submitted the job to `cpu_dev`, a "fast" partition intended for tests and small jobs.
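
The three result lines recorded in this notebook can be compared directly. The sketch below parses them and prints the speedup relative to the single-process run, along with whether the reported heat agrees with it. The 1- and 16-process runs were made on the login node and the 96-process run on compute nodes, so the timings are only indicative. The regular expression assumes the `Heat | Time | MPISize` layout shown in the outputs.

```python
import re

# Result lines copied from the three runs in this notebook.
RUNS = """\
Heat: 750.0000 | Time: 0.6431 | MPISize: 1
Heat: 750.0000 | Time: 1.1839 | MPISize: 16
Heat: 602.6262 | Time: 3.2165 | MPISize: 96
"""

LINE = re.compile(
    r"Heat:\s*([\d.]+)\s*\|\s*Time:\s*([\d.]+)\s*\|\s*MPISize:\s*(\d+)"
)


def parse_runs(text):
    """Return a list of (mpi_size, heat, seconds) tuples."""
    return [
        (int(m.group(3)), float(m.group(1)), float(m.group(2)))
        for m in LINE.finditer(text)
    ]


runs = parse_runs(RUNS)
base_size, base_heat, base_time = runs[0]
for size, heat, seconds in runs:
    speedup = base_time / seconds
    same_heat = abs(heat - base_heat) < 1e-4
    print(f"{size:3d} ranks: speedup {speedup:.2f}, "
          f"heat matches serial run: {same_heat}")
```

For this problem size, both parallel runs are slower than the serial one. The 96-process run also reports a different heat value (602.6262 instead of 750.0000), which is worth investigating before relying on that result.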

## Analyzing past jobs

In [10]:
! sacct --jobs=2542360 --format=jobname,ncpus,nnodes,maxrss,maxrssnode%13,start,elapsed,cputime

   JobName      NCPUS   NNodes     MaxRSS    MaxRSSNode               Start    Elapsed    CPUTime 
---------- ---------- -------- ---------- ------------- ------------------- ---------- ---------- 
  numbampi         96        4                          2021-11-06T14:09:26   00:00:12   00:19:12 
     batch         24        1          0   sdumont1245 2021-11-06T14:09:26   00:00:12   00:04:48 
    python         96        4          0   sdumont1248 2021-11-06T14:09:27   00:00:11   00:17:36 
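
In this table `CPUTime` is the allocated CPU time, that is, `Elapsed` multiplied by `NCPUS`, not the CPU time actually consumed. The sketch below checks that relationship against the three rows above; the helper names are hypothetical.

```python
def hms_to_seconds(text):
    """Convert HH:MM:SS to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def seconds_to_hms(total):
    """Convert seconds to HH:MM:SS."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# (JobName, NCPUS, Elapsed, CPUTime) copied from the sacct table above.
ROWS = [
    ("numbampi", 96, "00:00:12", "00:19:12"),
    ("batch", 24, "00:00:12", "00:04:48"),
    ("python", 96, "00:00:11", "00:17:36"),
]

for name, ncpus, elapsed, cputime in ROWS:
    computed = seconds_to_hms(ncpus * hms_to_seconds(elapsed))
    print(f"{name:8s} NCPUS x Elapsed = {computed} (sacct: {cputime})")
```

The `python` step ran for 11 s of wall time, while the script itself reported `Time: 3.2165`. The remainder is presumably start-up and import overhead.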


In [11]:
! scontrol show node sdumont1248

NodeName=sdumont1248 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
   AvailableFeatures=cpu
   ActiveFeatures=cpu
   Gres=(null)
   NodeAddr=sdumont1248 NodeHostName=sdumont1248 Version=17.02
   OS=Linux RealMemory=64000 AllocMem=0 FreeMem=62648 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=cpu_dev 
   BootTime=2021-11-03T23:41:01 SlurmdStartTime=2021-11-03T23:41:59
   CfgTRES=cpu=24,mem=62.50G
   AllocTRES=
   CapWatts=n/a
   Socket_CapWatts=n/a
   CurrentWatts=5 LowestJoules=82 ConsumedJoules=2206893
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
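
The `scontrol` output is a sequence of `Key=Value` tokens, so it is easy to pull fields out of it. The sketch below parses a few lines copied from the output above and checks that `Sockets x CoresPerSocket x ThreadsPerCore` equals `CPUTot`, and that `RealMemory` (in megabytes) agrees with the `mem=62.50G` in `CfgTRES`. The parser is a hypothetical helper and assumes that no value contains spaces.

```python
# A few lines copied from the scontrol output above.
NODE_INFO = """\
NodeName=sdumont1248 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
   OS=Linux RealMemory=64000 AllocMem=0 FreeMem=62648 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   CfgTRES=cpu=24,mem=62.50G
"""


def parse_node(text):
    """Split whitespace-separated Key=Value tokens into a dict."""
    fields = {}
    for token in text.split():
        # Split on the first "=" only, so CfgTRES keeps its full value.
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


node = parse_node(NODE_INFO)
cores = (int(node["Sockets"]) * int(node["CoresPerSocket"])
         * int(node["ThreadsPerCore"]))
mem_gib = int(node["RealMemory"]) / 1024  # RealMemory is given in megabytes
print(cores, int(node["CPUTot"]), mem_gib)  # 24 24 62.5
```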
   

