<hr style="border-color:coral; border-width:4px"></hr>

# Configuring the GPU kernel

<hr style="border-color:coral; border-width:4px"></hr>


Configuring a GPU kernel means specifying the grid dimensions (number of blocks), the block dimensions (threads per block), and the amount of dynamic shared memory per block.  For the `sm` example, the kernel is launched on a 1d grid of 1d thread blocks.

    dim3 block(threads_per_block);
    dim3 grid(num_blocks);  

    int shared_memory = shared_memory_per_thread*threads_per_block;

    kernel<<<grid,block,shared_memory>>>(dev_t, dev_sm_id);

The inputs to the `sm` command are then `<num_blocks>`, `<threads_per_block>`, and, optionally, `<shared_memory>`.  The shared memory input defaults to 0; otherwise, supply it in bytes.  Each `worker` kernel does 1 second of work.
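
The arithmetic behind this launch configuration can be sketched in Python, the language of this notebook. The helper `launch_config` is hypothetical and not part of `sm.cu`; the sketch assumes `shared_memory_per_thread` is given in bytes and relies on the NVIDIA warp size of 32 threads.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def launch_config(num_blocks, threads_per_block, shared_memory_per_thread=0):
    """Summarize a 1d launch (hypothetical helper mirroring the host code above)."""
    return {
        "total_threads": num_blocks * threads_per_block,
        # Threads are scheduled in warps, so a block always occupies whole warps.
        "warps_per_block": math.ceil(threads_per_block / WARP_SIZE),
        # Dynamic shared memory is requested per block in the launch parameters
        # (assumption: the per-thread amount is in bytes).
        "shared_memory_per_block": shared_memory_per_thread * threads_per_block,
    }


print(launch_config(56, 1))
```

For a launch of 56 blocks with 1 thread each, this gives 56 threads in total, one warp per block, and no dynamic shared memory.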

The goal is to **expose as much parallelism as possible** within the limits of the given GPU architecture.

In [3]:
%matplotlib notebook
%pylab

Using matplotlib backend: nbAgg
Populating the interactive namespace from numpy and matplotlib


In [10]:
%%bash

# Usage (on Redhawk):
#        $ srun --nodelist=<nodeN> sm <num_blocks> <threads_per_block> <shared_memory>

# Compile and run on Redhawk

nvcc -o sm sm.cu

srun --nodelist=node5 sm 56 1

clock rate = 875500
scale_factor = 0
Memory requirement : 0.44 (kB)
Device has 14 SMs
Distribution of blocks on SMs
------------------------------------------------------------------------------

SM    #blocks/SM  #threads/SM    work/SM(s)       block list/SM
------------------------------------------------------------------------------
 0             4            4       4.00     ( 13,  27,  41,  55)
 1             4            4       4.00     ( 12,  26,  40,  54)
 2             4            4       4.00     ( 11,  25,  39,  53)
 3             4            4       4.00     ( 10,  24,  38,  52)
 4             4            4       4.00     (  9,  23,  37,  51)
 5             4            4       4.00     (  8,  22,  36,  50)
 6             4            4       4.00     (  7,  21,  35,  49)
 7             4            4       4.00     (  6,  20,  34,  48)
 8             4            4       4.00     (  5,  19,  33,  47)
 9             4            4       4.00     (  4,  18,  32,  46)
1
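
The block lists in the visible rows above follow a simple pattern, which the sketch below reproduces. The function names are hypothetical, and the mapping `sm = num_sms - 1 - (block % num_sms)` is an assumption inferred from this one run; the hardware block scheduler does not guarantee it.

```python
from collections import defaultdict


def distribute_blocks(num_blocks, num_sms=14):
    """Map block ids to SMs using the pattern seen in the output above.

    Assumption: sm = num_sms - 1 - (block % num_sms), inferred from one run
    only; real hardware is free to place blocks differently.
    """
    blocks_on_sm = defaultdict(list)
    for block in range(num_blocks):
        blocks_on_sm[num_sms - 1 - (block % num_sms)].append(block)
    return dict(blocks_on_sm)


def work_per_sm(blocks_on_sm, seconds_per_block=1.0):
    """Total work assigned to each SM when every block does 1 second of work."""
    return {sm: seconds_per_block * len(blocks) for sm, blocks in blocks_on_sm.items()}


layout = distribute_blocks(56)
work = work_per_sm(layout)
for sm in sorted(layout):
    print(sm, len(layout[sm]), work[sm], layout[sm])
```

Note that this sums the work assigned to each SM. It need not equal the elapsed time, since an SM may run several blocks concurrently.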

<hr style="border-width:2px; border-style:solid"></hr>

## Question

<hr style="border-width:2px; border-style:solid"></hr>

How many blocks can we specify and still keep the time at 1 second? 

In [None]:
%%bash

srun --nodelist=node5 sm 56 1

<hr style="border-width:2px; border-style:solid"></hr>

## Question

<hr style="border-width:2px; border-style:solid"></hr>

How many threads per block can we specify and still keep the time at 1 second? 

In [None]:
%%bash

srun --nodelist=node5 sm 56 1

<hr style="border-color:coral; border-width:2px"></hr>

## Efficiency metrics and events

<hr style="border-color:coral; border-width:2px"></hr>

The NVIDIA profiler, `nvprof`, reports metrics and events that show how well we are using the available resources.

Here are examples of the metrics and events we can analyze.

    Available Metrics:
                                Name   Description
    Device 0 (GeForce GTX TITAN X):
                       sm_efficiency:  The percentage of time at least one warp is active on a specific multiprocessor

                  achieved_occupancy:  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor
    ..............

Events

    warps_launched:  Number of warps launched.
    
    ................
    
    sm_cta_launched:  Number of blocks launched.

Listings of the available metrics and events are produced by the `nvprof` queries below.
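
To make the ratio-style metrics concrete, the following sketch evaluates `achieved_occupancy` as defined above and `warp_execution_efficiency` as defined in the metric listing. The function names are hypothetical, the inputs are illustrative values rather than profiler measurements, and `max_warps_per_sm=64` is an assumption that should be checked against the specifications of the device in use.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def achieved_occupancy(avg_active_warps, max_warps_per_sm=64):
    """Average active warps per active cycle over the SM's warp capacity.

    Assumption: max_warps_per_sm=64 is a placeholder; it is device dependent.
    """
    return avg_active_warps / max_warps_per_sm


def warp_execution_efficiency(avg_active_threads_per_warp):
    """Average active threads per warp over the warp size, as a percentage."""
    return 100.0 * avg_active_threads_per_warp / WARP_SIZE


# Illustrative inputs: 4 single-thread blocks resident on one SM give at most
# 4 active warps, each with 1 active thread out of 32.
print(achieved_occupancy(4))         # 0.0625
print(warp_execution_efficiency(1))  # 3.125
```

With one thread per block, at most 1 of the 32 threads in each warp does useful work, which is one reason to experiment with larger values of `threads_per_block`.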

In [7]:
%%bash

srun --nodelist=node1 nvprof --query-metrics

Available Metrics:
                            Name   Description
Device 0 (GeForce GTX TITAN X):
                   sm_efficiency:  The percentage of time at least one warp is active on a specific multiprocessor

              achieved_occupancy:  Ratio of the average active warps per active cycle to the maximum number of warps supported on a multiprocessor

                             ipc:  Instructions executed per cycle

                      issued_ipc:  Instructions issued per cycle

                   inst_per_warp:  Average number of instructions executed by each warp

               branch_efficiency:  Ratio of non-divergent branches to total branches expressed as percentage

       warp_execution_efficiency:  Ratio of the average active threads per warp to the maximum number of threads per warp supported on a multiprocessor

warp_nonpred_execution_efficiency:  Ratio of the average active threads per warp executing non-predicated instructions to the maximum number of threads 

In [8]:
%%bash

srun --nodelist=node1 nvprof --query-events

Available Events:
                            Name   Description
Device 0 (GeForce GTX TITAN X):
	Domain domain_a:
       tex0_cache_sector_queries:  Number of texture cache 0 requests. This increments by 1 for each 32-byte access.

       tex1_cache_sector_queries:  Number of texture cache 1 requests. This increments by 1 for each 32-byte access.

        tex0_cache_sector_misses:  Number of texture cache 0 misses. This increments by 1 for each 32-byte access.

        tex1_cache_sector_misses:  Number of texture cache 1 misses. This increments by 1 for each 32-byte access.

               elapsed_cycles_sm:  Elapsed clocks

	Domain domain_b:
           fb_subp0_read_sectors:  Number of DRAM read requests to sub partition 0, increments by 1 for 32 byte access.

           fb_subp1_read_sectors:  Number of DRAM read requests to sub partition 1, increments by 1 for 32 byte access.

          fb_subp0_write_sectors:  Number of DRAM write requests to sub partition 0, increments by 1 for 3