# This is a notebook to summarize the important lessons we learned at the March 2020 GPU hackathon


# Pinned memory

Pinned memory helps when we move the same data from device to host (DtH) many times. Transfers into a preallocated, page-locked (pinned) CPU buffer are much faster than transfers into pageable memory. In our case, this applies best at the end of each patch, where we move data back to the host. It only pays off if the buffer is reused many times, because allocating pinned memory is expensive.

How to pin memory in cupy:

In [None]:
#preallocate first
nwavestep = args.nwavestep
flux_out = np.empty((25,nwavestep))
ivar_out = np.empty((25,nwavestep))
Rdata_out = np.empty((25*nwavestep,25*nwavestep)) #hardcode for now
xflux_out = np.empty((25,nwavestep))
A_out = np.empty((50000,25*nwavestep))
iCov_out = np.empty((25*nwavestep,25*nwavestep))

#then pin
flux_pinned = _pin_memory(flux_out)
ivar_pinned = _pin_memory(ivar_out)
Rdata_pinned = _pin_memory(Rdata_out)
xflux_pinned = _pin_memory(xflux_out)
A_pinned = _pin_memory(A_out)
iCov_pinned = _pin_memory(iCov_out)

#collect the pinned buffers in a plain list (the shapes differ, so np.array is not suitable)
pinned_list = [flux_pinned, ivar_pinned, Rdata_pinned, xflux_pinned, A_pinned, iCov_pinned]


Pass the list of pinned buffers to wherever it is needed:


In [None]:
#excerpt: the enclosing try/except from the wrapper is omitted here
results = ex2d(img.pix, img.ivar*(img.mask==0), psfdata, pinned_list, bspecmin[b],
    bnspec[b], wave, regularize=args.regularize, ndecorr=args.decorrelate_fibers,
    bundlesize=bundlesize, wavesize=args.nwavestep, verbose=args.verbose,
    full_output=True, nsubbundles=args.nsubbundles)

Pad the GPU result to the buffer shape and copy it into the pinned buffer:

In [None]:
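#pad to the fixed buffer shape, since get(out=...) needs matching shape and dtype
fx_gpu_padded = cp.zeros((nspec,nwavesize))
fx_gpu_padded[0:nspec,0:nwave] = fx_gpu
#copy device -> host directly into the pinned buffer
fx_gpu_padded.get(out=pinned_list[0])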

Return the data on the host through the pinned buffers:

In [None]:
results = dict(flux=pinned_list[0], ivar=pinned_list[1], R=pinned_list[2], xflux=pinned_list[3], A=pinned_list[4], iCov=pinned_list[5])

# Using MPS

MPS (NVIDIA Multi-Process Service) lets GPU work from multiple processes, such as MPI ranks, run concurrently on the same GPU instead of being time-sliced one process at a time.

We opted not to explore streams because we thought we'd get equally good benefits from MPS with less work.

Using MPS on corigpu:

In [None]:
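#!/bin/bash
#start the MPS control daemon once, from rank 0 only
if [ "$SLURM_PROCID" -eq 0 ]; then
    nvidia-cuda-mps-control -d
fi

#give the daemon time to come up before any rank touches the GPU
sleep 10

python -u gpu_wrapper_specter.py -o out.fits --nwavestep 75 --nspec 100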

Then run the code through this MPS wrapper script.

This was an ugly workaround. It may be possible to run without it now.

# Using multiple GPUs via MPI

In [None]:
try:
    cp.cuda.Device(rank).use()
    print("moving work to %s" %(rank))
except Exception as e:
    #print("e", e)
    print("only 1 gpu, will continue on Device 0")
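
The cell above relies on `Device(rank)` failing when there are more ranks than GPUs. One alternative, sketched here as an assumption rather than what we ran, is to map ranks onto the visible GPUs round-robin. The function names are hypothetical. `cp.cuda.runtime.getDeviceCount()` and `cp.cuda.Device(...).use()` are the CuPy calls involved.

```python
def device_for_rank(rank, n_devices):
    """Round-robin mapping from an MPI rank to a GPU index."""
    if n_devices < 1:
        raise ValueError("need at least one GPU")
    return rank % n_devices

def use_gpu_for_rank(rank):
    """Select the GPU for this rank and return its index (needs CuPy and a GPU)."""
    import cupy as cp  # imported lazily so the mapping can be tested without a GPU
    n_devices = cp.cuda.runtime.getDeviceCount()
    dev = device_for_rank(rank, n_devices)
    cp.cuda.Device(dev).use()
    return dev
```

With several nodes, the rank within the node is probably the better input than the global rank, since each node numbers its GPUs from 0.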

# Profiling the code

## nvprof

nvprof is no longer recommended. NVIDIA now officially supports Nsight Systems and Nsight Compute instead.

The best reason to still use this tool is to profile MPI jobs, since it can display data from several ranks stacked together.

## nsight systems

On Cori GPU, run nsys to write a .qdrep file, then move it to a laptop for local analysis.

In [None]:
srun nsys profile -s none -o desi_nsys_02252020 -t cuda,nvtx --force-overwrite true --stats=true python -u gpu_wrapper_specter.py -o test.fits --nspec 50 --nwavestep 50
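
The `-t cuda,nvtx` option above traces NVTX ranges, which only show up if the Python code emits them. A minimal sketch of a context manager that does this with CuPy's `cupy.cuda.nvtx.RangePush` / `RangePop` is below. The name `nvtx_range` and the label are hypothetical, and the fallback to a no-op when CuPy is missing is an assumption made so the sketch runs anywhere.

```python
from contextlib import contextmanager

try:
    from cupy.cuda import nvtx  # CuPy's NVTX bindings
except ImportError:
    nvtx = None  # no CuPy available: ranges become no-ops

@contextmanager
def nvtx_range(label):
    """Mark the enclosed code as a named NVTX range in the nsys timeline."""
    if nvtx is not None:
        nvtx.RangePush(label)
    try:
        yield
    finally:
        # always close the range, even if the body raises
        if nvtx is not None:
            nvtx.RangePop()
```

Usage would look like `with nvtx_range("ex2d_patch"): ...` around the section of interest.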

## nsight compute

This is really slow, and I never actually ran it to completion.

Here the kernel name passed to -k is what the compiler calls the kernel. You can find it by looking at the nsys output.

In [None]:
time srun nv-nsight-cu-cli -k dlaed4 -o desi_ncom_02282020 -f python -u gpu_wrapper_specter.py -o out.fits --nspec 50 --nwavestep 50


# Haswell cpu time to beat