<hr style="border-width:4px; border-color:coral"></hr>

## Simple CUDA programs

<hr style="border-width:4px; border-color:coral"></hr>

The simplest possible GPU program launches a single kernel with a single thread.  The kernel designation `__global__` indicates that the function `kernel` runs on the device (the GPU) and can be launched from the host (the CPU, in this case). 

In [19]:
%%file demo_00.cu

__global__ void kernel( void ) 
{
    ;
}

int main(void) 
{
    /* Launch one block containing a single thread */
    kernel<<<1,1>>>();

    return 0;
}



Writing demo_00.cu


In [33]:
%%bash 

nvcc -o demo_00 demo_00.cu

srun demo_00

<hr style="border-width:2px; border-color:black"></hr>
In a slightly more complicated program, each thread uses the built-in variables `blockIdx`, `blockDim`, and `threadIdx` to compute its global thread index.

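A kernel launch is *asynchronous*: control returns to the CPU as soon as the kernel is launched, and the CPU code continues without waiting.  In this simple program, the CPU code could reach the end of `main` before the GPU kernel has completed, in which case the output from the kernel may never appear.  To force the host to wait for the device, we include the command 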

    cudaDeviceSynchronize();
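
Because the launch returns immediately, errors are not reported at the launch statement itself. The following is a minimal sketch, not part of the original demos, of how the host might check for errors around the synchronization point. It uses the CUDA runtime calls `cudaGetLastError`, `cudaDeviceSynchronize`, and `cudaGetErrorString`; the helper name `launch_and_wait` is hypothetical and chosen only for illustration.

```cuda
#include <stdio.h>

__global__ void kernel(void)
{
    printf("Hello from the device\n");
}

/* Hypothetical helper: launch the kernel, then wait for it.
   Returns 0 on success and 1 on any reported error. */
int launch_and_wait(void)
{
    /* The launch returns immediately; the kernel may not have run yet */
    kernel<<<1,1>>>();

    /* Errors in the launch itself (e.g. an invalid configuration) show up here */
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
    {
        printf("Launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* The host blocks here until the device has finished its work */
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
    {
        printf("Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}

int main(void)
{
    return launch_and_wait();
}
```

Checking the return value of `cudaDeviceSynchronize` is a design choice rather than a requirement; the demos in this section omit it for brevity.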
    

In [31]:
%%file demo_01.cu

#include <stdio.h>

__global__ void kernel( void ) 
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    printf("Thread %d; Block %d : Hello, World from global thread index %d\n", threadIdx.x, blockIdx.x,ix);
}

int main(void) 
{
    dim3 grid(3);
    dim3 block(4);
    kernel<<<grid,block>>>();
    cudaDeviceSynchronize();

    return 0;
}


Overwriting demo_01.cu


In [40]:
%%bash 

nvcc  -o demo_01 demo_01.cu

# On R2 : We need to specify the 'gpuq' partition (i.e. the queue). 
srun -p gpuq demo_01

# On Redhawk : Each node has access to a GPU, so we don't need to specify a partition
# srun demo_01

Thread 0; Block 0 : Hello, World from global thread index 0
Thread 1; Block 0 : Hello, World from global thread index 1
Thread 2; Block 0 : Hello, World from global thread index 2
Thread 3; Block 0 : Hello, World from global thread index 3
Thread 0; Block 2 : Hello, World from global thread index 8
Thread 1; Block 2 : Hello, World from global thread index 9
Thread 2; Block 2 : Hello, World from global thread index 10
Thread 3; Block 2 : Hello, World from global thread index 11
Thread 0; Block 1 : Hello, World from global thread index 4
Thread 1; Block 1 : Hello, World from global thread index 5
Thread 2; Block 1 : Hello, World from global thread index 6
Thread 3; Block 1 : Hello, World from global thread index 7


<hr style="border-width:2px; border-color:black"></hr>
Reordering this output shows how the local block and thread indices map to a global thread index.  Note also that, in this run, the print statements from each block appear together, whereas the order among blocks is undefined. 

**Block 0**

    Thread 0; Block 0 : Hello, World from global thread index 0
    Thread 1; Block 0 : Hello, World from global thread index 1
    Thread 2; Block 0 : Hello, World from global thread index 2
    Thread 3; Block 0 : Hello, World from global thread index 3

**Block 1**

    Thread 0; Block 1 : Hello, World from global thread index 4
    Thread 1; Block 1 : Hello, World from global thread index 5
    Thread 2; Block 1 : Hello, World from global thread index 6
    Thread 3; Block 1 : Hello, World from global thread index 7

**Block 2**

    Thread 0; Block 2 : Hello, World from global thread index 8
    Thread 1; Block 2 : Hello, World from global thread index 9
    Thread 2; Block 2 : Hello, World from global thread index 10
    Thread 3; Block 2 : Hello, World from global thread index 11
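
The mapping `blockIdx.x*blockDim.x + threadIdx.x` can also be checked without relying on the order of the printed lines. The sketch below is an illustration under stated assumptions: each thread stores its block and thread index at its global index, and the host then prints the entries in order. The kernel name `record`, the helper `count_mismatches`, and the macros `NBLOCKS`, `NTHREADS`, and `NTOTAL` are hypothetical names introduced here.

```cuda
#include <stdio.h>

#define NBLOCKS  3
#define NTHREADS 4
#define NTOTAL   (NBLOCKS*NTHREADS)

/* Each thread records its block and thread index at its global index */
__global__ void record(int *block_id, int *thread_id)
{
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    block_id[ix]  = blockIdx.x;
    thread_id[ix] = threadIdx.x;
}

/* Hypothetical helper: returns the number of entries for which
   block*NTHREADS + thread does not reproduce the global index */
int count_mismatches(void)
{
    int block_id[NTOTAL], thread_id[NTOTAL];
    int *dev_block, *dev_thread;

    cudaMalloc((void**)&dev_block,  NTOTAL*sizeof(int));
    cudaMalloc((void**)&dev_thread, NTOTAL*sizeof(int));

    record<<<NBLOCKS,NTHREADS>>>(dev_block, dev_thread);

    /* A blocking cudaMemcpy runs after the preceding kernel has finished */
    cudaMemcpy(block_id,  dev_block,  NTOTAL*sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(thread_id, dev_thread, NTOTAL*sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_block);
    cudaFree(dev_thread);

    int mismatches = 0;
    for (int ix = 0; ix < NTOTAL; ix++)
    {
        printf("global index %2d : Block %d; Thread %d\n",
               ix, block_id[ix], thread_id[ix]);
        if (block_id[ix]*NTHREADS + thread_id[ix] != ix)
        {
            mismatches++;
        }
    }
    return mismatches;
}

int main(void)
{
    printf("mismatches: %d\n", count_mismatches());
    return 0;
}
```

Since the host loops over the global index, the lines come out in order regardless of the order in which the blocks ran on the device.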


<hr style="border-width:2px; border-color:black"></hr>
We can pass arguments to kernels in the usual way.  Below, we also create a function that can be called only from device code, using the `__device__` designation.  Unlike a `__global__` kernel, a `__device__` function cannot be launched from the host.

In [38]:
%%file add.cu

#include <stdio.h>

__device__ int addem( int a, int b ) 
{
    return a + b;
}

__global__ void add( int a, int b, int *c ) 
{
    *c = addem( a, b );
}

int main(void) 
{
    int a,b,c;
    int *dev_c;

    /* Allocate memory on the device */
    cudaMalloc( (void**)&dev_c, sizeof(int));

    a = 2;
    b = 7;
    add<<<1,1>>>(a, b, dev_c );

    cudaDeviceSynchronize();

    /* Copy contents of dev_c back to c */
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    
    printf( "%d + %d = %d\n", a,b,c);

    cudaFree(dev_c);

    return 0;
}

Overwriting add.cu


In [41]:
%%bash 

nvcc -o add add.cu

# On R2 : 
srun -p gpuq add

# On Redhawk
# srun add

2 + 7 = 9
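
In the program above, `a` and `b` are passed by value, while the result comes back through a pointer to device memory. As a hedged variation, the sketch below passes the inputs through device memory as well, which requires copying them with `cudaMemcpyHostToDevice` before the launch. The helper name `add_on_device` is hypothetical; the rest follows the same pattern as `add.cu`.

```cuda
#include <stdio.h>

__device__ int addem(int a, int b)
{
    return a + b;
}

/* All three arguments are pointers to device memory */
__global__ void add(const int *a, const int *b, int *c)
{
    *c = addem(*a, *b);
}

/* Hypothetical helper: adds a and b on the device and returns the sum */
int add_on_device(int a, int b)
{
    int c = 0;
    int *dev_a, *dev_b, *dev_c;

    /* Allocate device memory for the two inputs and the result */
    cudaMalloc((void**)&dev_a, sizeof(int));
    cudaMalloc((void**)&dev_b, sizeof(int));
    cudaMalloc((void**)&dev_c, sizeof(int));

    /* Copy the inputs from the host to the device */
    cudaMemcpy(dev_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    add<<<1,1>>>(dev_a, dev_b, dev_c);

    /* Copy the result back; this blocking copy runs after the kernel */
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);

    return c;
}

int main(void)
{
    int a = 2, b = 7;
    printf("%d + %d = %d\n", a, b, add_on_device(a, b));
    return 0;
}
```

For a pair of scalars, passing by value is simpler; the pointer form matters once the inputs are arrays.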


<hr style="border-width:2px; border-color:black"></hr>

We can also pass arrays to CUDA kernels and fill array entries just as we would do in a normal C program.  

In [32]:
%%file simple_parallel.cu

#include <stdio.h>

__global__ void add( int *c) 
{
    /* With one thread per block, threadIdx.x is always 0, so blockIdx.x is the global index */
    int id = blockIdx.x;  
    c[id] = id;
}

int main(void) 
{
    const int N = 10;  /* const, so that c[N] below has a compile-time size */
    
    /* Allocate memory on the device */
    int *dev_c;
    cudaMalloc( (void**)&dev_c, N*sizeof(int));

    /* Launch N thread blocks of 1 thread per block */
    dim3 grid(N);  /* N x 1 grid of blocks */
    dim3 block(1); /* 1x1 thread block */
    add<<<grid,block>>>(dev_c);
    
    cudaDeviceSynchronize();

    /* Copy contents of dev_c back to c */
    int c[N];
    cudaMemcpy( c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    for(int i = 0; i < N; i++)
    {
        printf( "c[%d] = %d\n",i,c[i]);
    }

    cudaFree(dev_c);

    return 0;
}




Overwriting simple_parallel.cu


In [33]:
%%bash

nvcc -o simple_parallel simple_parallel.cu

srun -p gpuq simple_parallel

c[0] = 0
c[1] = 1
c[2] = 2
c[3] = 3
c[4] = 4
c[5] = 5
c[6] = 6
c[7] = 7
c[8] = 8
c[9] = 9
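
The example above uses one thread per block. A possible variation, sketched here under stated assumptions, uses several threads per block and combines `blockIdx.x`, `blockDim.x`, and `threadIdx.x` into a global index, as in `demo_01.cu`. Since the array length need not be a multiple of the block size, the kernel guards against indices past the end of the array. The kernel name `fill`, the helper `fill_and_sum`, and the block size of 4 are hypothetical choices made for illustration.

```cuda
#include <stdio.h>

#define N 10

__global__ void fill(int *c, int n)
{
    int id = blockIdx.x*blockDim.x + threadIdx.x;

    /* Threads whose index falls past the end of the array do nothing */
    if (id < n)
    {
        c[id] = id;
    }
}

/* Hypothetical helper: fills c[0..N-1] on the device, prints the
   entries, and returns their sum */
int fill_and_sum(void)
{
    const int threads = 4;                         /* assumed block size */
    const int blocks  = (N + threads - 1)/threads; /* round up to cover N */

    int *dev_c;
    cudaMalloc((void**)&dev_c, N*sizeof(int));

    /* blocks*threads may exceed N, which is why the kernel checks id < n */
    fill<<<blocks,threads>>>(dev_c, N);

    int c[N];
    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev_c);

    int sum = 0;
    for (int i = 0; i < N; i++)
    {
        printf("c[%d] = %d\n", i, c[i]);
        sum += c[i];
    }
    return sum;
}

int main(void)
{
    fill_and_sum();
    return 0;
}
```

With `N = 10` and 4 threads per block, the launch uses 3 blocks, so 12 threads run and the last two are idle.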
