# Hello World From GPU

The best way to learn a new programming language is to write programs in it. In this section, we will write our first kernel, a function that runs on the GPU.

First, let's check that the CUDA compiler is installed properly by running the following commands on a Linux system:

In [1]:
!which nvcc
!nvcc --version

/usr/local/cuda/bin/nvcc
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85


Let's check whether a GPU accelerator card is attached to our machine:

In [2]:
!ls -l /dev/nv*

crw-rw-rw- 1 root root 195,   0 Jun 26 08:27 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jun 26 08:27 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Jun 26 08:28 /dev/nvidia-modeset
crw-rw-rw- 1 root root 240,   0 Jun 26 08:28 /dev/nvidia-uvm
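
Device files only show that the driver is loaded. As a complementary check, here is a minimal sketch that asks the CUDA runtime itself how many devices it can see. The file name `device_query.cu` is hypothetical, and the output depends entirely on the machine, so none is shown here.

```cuda
// device_query.cu (hypothetical file name)
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int deviceCount = 0;

    // Ask the runtime how many CUDA-capable devices are visible.
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Found %d CUDA device(s)\n", deviceCount);

    // Print the name and compute capability of each device.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("Device %d: %s (compute capability %d.%d)\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```

It can be compiled and run the same way as the other programs in this section, for example `nvcc device_query.cu -o device_query` followed by `./device_query`.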


Now we are ready to write our first CUDA C program. To write a CUDA C program, we need to:
1. Create a source code file with the special file name extension `.cu`.
2. Compile the program using the CUDA `nvcc` compiler.
3. Run the executable file, which contains the kernel code that executes on the GPU, from the command line.


In [3]:
%%file hello_world_gpu.cu 
#include <stdio.h>

// The qualifier __global__ tells the compiler that the function will be called 
// from the CPU and executed on the GPU.

__global__ void helloFromGPU(void)
{
    printf(".............Hello World from GPU!.............\n");
}

int main(void){
    // hello from cpu
    printf("<------------Hello World from CPU!-------------->\n");
    
    // Launch the kernel
    // The parameters within the triple angle brackets are the execution configuration,
    // which specifies how many threads will execute the kernel. In this example, we will run 10 GPU threads.
    helloFromGPU <<<1, 10>>>();
    
    
    // explicitly destroy and clean up all resources associated with the current
    // device in the current process
    cudaDeviceReset();
    return 0;
}

Overwriting hello_world_gpu.cu


In [4]:
%%bash
nvcc hello_world_gpu.cu -o hello_world_gpu
./hello_world_gpu

<------------Hello World from CPU!-------------->
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............


## CUDA PROGRAM STRUCTURE 

A typical CUDA program consists of five main steps:
1. Allocate GPU memory.
2. Copy data from CPU memory to GPU memory.
3. Invoke the CUDA kernel to perform the program-specific computation.
4. Copy data back from GPU memory to CPU memory.
5. Free GPU memory.

In the simple program `hello_world_gpu.cu`, only the third step appears: invoking the kernel.
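
For comparison, here is a minimal sketch that goes through all five steps. The kernel `doubleElements`, the file name `five_steps.cu`, and the array size are illustrative assumptions, not part of the original program, and error checking is omitted to keep the structure visible.

```cuda
// five_steps.cu (hypothetical file name)
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread doubles one array element.
__global__ void doubleElements(float *data, int n)
{
    int i = threadIdx.x;
    if (i < n)
        data[i] = 2.0f * data[i];
}

int main(void)
{
    const int n = 10;
    const size_t bytes = n * sizeof(float);

    // Host (CPU) input data.
    float h_data[n];
    for (int i = 0; i < n; ++i)
        h_data[i] = (float)i;

    // Step 1: allocate GPU memory.
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, bytes);

    // Step 2: copy data from CPU memory to GPU memory.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Step 3: invoke the kernel with one block of n threads.
    doubleElements<<<1, n>>>(d_data, n);

    // Step 4: copy the result back from GPU memory to CPU memory.
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // Step 5: free GPU memory.
    cudaFree(d_data);

    for (int i = 0; i < n; ++i)
        printf("%.1f ", h_data[i]);
    printf("\n");
    return 0;
}
```

In real code, each runtime call returns a `cudaError_t` that should be checked.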

- Remove the call to the [cudaDeviceReset function](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1gef69dd5c6d0206c2b8d099abac61f217) and run the program again:

In [5]:
%%file hello_world_gpu.cu 
#include <stdio.h>

__global__ void helloFromGPU(void)
{
    printf(".............Hello World from GPU!.............\n");
}

int main(void){
    // hello from cpu
    printf("<------------Hello World from CPU!-------------->\n");
    
    helloFromGPU <<<1, 10>>>();
    // explicitly destroy and clean up all resources associated with the current
    // device in the current process
    //cudaDeviceReset();
    return 0;
}

Overwriting hello_world_gpu.cu


In [6]:
%%bash
nvcc hello_world_gpu.cu -o hello_world_gpu
./hello_world_gpu

<------------Hello World from CPU!-------------->


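- Replace the function `cudaDeviceReset` with `cudaDeviceSynchronize`. In the previous run only the CPU message appeared: a kernel launch is asynchronous, so `main` can return before the kernel's `printf` output has been flushed to the host. `cudaDeviceSynchronize` makes the host wait until the kernel has finished.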

In [7]:
%%file hello_world_gpu.cu 
#include <stdio.h>

__global__ void helloFromGPU(void)
{
    printf(".............Hello World from GPU!.............\n");
}

int main(void){
    // hello from cpu
    printf("<------------Hello World from CPU!-------------->\n");
    
    helloFromGPU <<<1, 10>>>();
   
    cudaDeviceSynchronize();
    return 0;
}

Overwriting hello_world_gpu.cu


In [8]:
%%bash
nvcc hello_world_gpu.cu -o hello_world_gpu
./hello_world_gpu

<------------Hello World from CPU!-------------->
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............
.............Hello World from GPU!.............


- Each thread that executes the kernel is given a thread ID, unique within its block, that is accessible within the kernel through the built-in `threadIdx.x` variable.

In [9]:
%%file hello_world_gpu.cu 
#include <stdio.h>

__global__ void helloFromGPU(void)
{   
    if (threadIdx.x == 5)
        printf(".............Hello World from GPU thread %d!.............\n", threadIdx.x);
}

int main(void){
    // hello from cpu
    printf("<------------Hello World from CPU!-------------->\n");
    
    helloFromGPU <<<1, 10>>>();
   
    cudaDeviceSynchronize();
    return 0;
}

Overwriting hello_world_gpu.cu


In [10]:
%%bash
nvcc hello_world_gpu.cu -o hello_world_gpu
./hello_world_gpu

<------------Hello World from CPU!-------------->
.............Hello World from GPU thread 5!.............
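
Because `threadIdx.x` restarts at 0 in every block, a launch with more than one block needs the block index as well to tell threads apart. The sketch below is an illustration under that assumption: the kernel name `printIndices` and the `<<<2, 4>>>` configuration are hypothetical choices, and it uses the built-in `blockIdx.x` and `blockDim.x` variables to compute a global index. The order in which the lines are printed is not guaranteed.

```cuda
// thread_index.cu (hypothetical file name)
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread reports its block, its index within
// the block, and the global index derived from both.
__global__ void printIndices(void)
{
    int block = (int)blockIdx.x;    // index of this block in the grid
    int thread = (int)threadIdx.x;  // index of this thread in its block
    int globalIdx = block * (int)blockDim.x + thread;

    printf("block %d, thread %d -> global index %d\n", block, thread, globalIdx);
}

int main(void)
{
    // Launch 2 blocks of 4 threads each: 8 threads in total.
    printIndices<<<2, 4>>>();

    // Wait for the kernel so that its output is flushed before exit.
    cudaDeviceSynchronize();
    return 0;
}
```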


## IS CUDA C PROGRAMMING DIFFICULT?

The main difference between CPU programming and GPU programming is the level of programmer exposure to GPU architectural features. Thinking in parallel and having a basic understanding of GPU architecture enables you to write parallel programs that scale to hundreds of cores as easily as you write a sequential program.


If you want to write efficient code as a parallel programmer, you need a basic knowledge of CPU architectures. For example, **locality** is a very important concept in parallel programming. 
- **Locality** refers to the reuse of data so as to reduce memory access latency. 

There are two basic types of reference locality:

- Temporal locality refers to the reuse of data and/or resources within relatively small time durations.
- Spatial locality refers to the use of data elements within relatively close storage locations. 
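
To make spatial locality concrete, here is a minimal host-only sketch (plain C that `nvcc` also compiles) that sums the same matrix twice: once walking along rows, where consecutive accesses are adjacent in memory, and once walking down columns, where each access jumps a full row ahead. The matrix size and the file name `locality.cu` are arbitrary assumptions, and the measured times depend on the machine and compiler flags, so no timings are claimed here.

```cuda
// locality.cu (hypothetical file name; host code only)
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ROWS 4096
#define COLS 4096

int main(void)
{
    // One contiguous row-major matrix: element (r, c) lives at m[r * COLS + c].
    float *m = (float *)malloc((size_t)ROWS * COLS * sizeof(float));
    if (m == NULL)
        return 1;
    for (size_t k = 0; k < (size_t)ROWS * COLS; ++k)
        m[k] = 1.0f;

    // Row-wise traversal: consecutive accesses touch neighboring addresses.
    clock_t t0 = clock();
    double sumRow = 0.0;
    for (int r = 0; r < ROWS; ++r)
        for (int c = 0; c < COLS; ++c)
            sumRow += m[(size_t)r * COLS + c];

    // Column-wise traversal: consecutive accesses are COLS elements apart.
    clock_t t1 = clock();
    double sumCol = 0.0;
    for (int c = 0; c < COLS; ++c)
        for (int r = 0; r < ROWS; ++r)
            sumCol += m[(size_t)r * COLS + c];
    clock_t t2 = clock();

    printf("row-wise:    sum = %.0f, %.3f s\n", sumRow,
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column-wise: sum = %.0f, %.3f s\n", sumCol,
           (double)(t2 - t1) / CLOCKS_PER_SEC);

    free(m);
    return 0;
}
```

Both loops compute the same sum; only the access pattern differs.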

Modern CPU architectures use large caches to optimize for applications with good spatial and temporal locality. It is the programmer’s responsibility to design their algorithm to efficiently use CPU cache. Programmers must handle low-level cache optimizations, but have no introspection into how threads are being scheduled on the underlying architecture because the CPU does not expose that information.

CUDA exposes you to the concepts of both memory hierarchy and thread hierarchy, extending your ability to control thread execution and scheduling to a greater degree, using:
- Memory hierarchy structure
- Thread hierarchy structure

For example, a special memory, called shared memory, is exposed by the CUDA programming model. Shared memory can be thought of as a software-managed cache, which provides great speed-up by conserving bandwidth to main memory. With shared memory, you can control the locality of your code directly.
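
As a minimal sketch of what "software-managed" means, the kernel below stages data in shared memory by hand and then reads it back in a different order. The kernel name `reverseInBlock`, the array size `N`, and the file name are hypothetical. The example uses a single block and is meant only to show the mechanics, not a speed-up.

```cuda
// shared_reverse.cu (hypothetical file name)
#include <stdio.h>
#include <cuda_runtime.h>

#define N 8

// Hypothetical kernel: reverse an N-element array using one block of N threads.
__global__ void reverseInBlock(int *data)
{
    // Shared memory: one copy per block, visible to all threads of the block.
    __shared__ int tile[N];

    int t = threadIdx.x;

    // Each thread loads one element from global memory into shared memory.
    tile[t] = data[t];

    // Barrier: wait until every thread in the block has written its element.
    __syncthreads();

    // Each thread reads an element that was loaded by a different thread.
    data[t] = tile[N - 1 - t];
}

int main(void)
{
    int h_data[N];
    for (int i = 0; i < N; ++i)
        h_data[i] = i;

    int *d_data = NULL;
    cudaMalloc((void **)&d_data, N * sizeof(int));
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

    reverseInBlock<<<1, N>>>(d_data);

    cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    for (int i = 0; i < N; ++i)
        printf("%d ", h_data[i]);
    printf("\n");
    return 0;
}
```

Without the `__syncthreads()` call, a thread could read a slot of `tile` before the thread responsible for it had written it.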

CUDA abstracts away the hardware details and does not require applications to be mapped to traditional graphics APIs. 
At its core are three key abstractions: 
- a hierarchy of thread groups, 
- a hierarchy of memory groups, 
- and barrier synchronization, 

which are exposed to us as a minimal set of language extensions. 
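
The following sketch puts the three abstractions side by side in one small kernel: a grid of blocks (thread hierarchy), a per-block `__shared__` array (memory hierarchy), and `__syncthreads()` (barrier synchronization). The kernel name `blockSum`, the sizes, and the file name are assumptions made for illustration, and the reduction assumes a power-of-two block size.

```cuda
// block_sum.cu (hypothetical file name)
#include <stdio.h>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 4   // must be a power of two for this reduction
#define NUM_BLOCKS 2
#define N (THREADS_PER_BLOCK * NUM_BLOCKS)

// Hypothetical kernel: each block sums its own slice of the input.
__global__ void blockSum(const int *in, int *out)
{
    // Memory hierarchy: per-block scratch space in shared memory.
    __shared__ int partial[THREADS_PER_BLOCK];

    // Thread hierarchy: locate this thread within its block and the grid.
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    partial[t] = in[i];
    __syncthreads();  // barrier: all loads are done

    // Tree reduction: halve the number of active threads each round.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            partial[t] += partial[t + stride];
        __syncthreads();  // barrier: this round's sums are complete
    }

    // Thread 0 of each block writes the block's result.
    if (t == 0)
        out[blockIdx.x] = partial[0];
}

int main(void)
{
    int h_in[N], h_out[NUM_BLOCKS];
    for (int i = 0; i < N; ++i)
        h_in[i] = i + 1;  // 1..8, so the block sums are 10 and 26

    int *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, N * sizeof(int));
    cudaMalloc((void **)&d_out, NUM_BLOCKS * sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(d_in, d_out);

    cudaMemcpy(h_out, d_out, NUM_BLOCKS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int b = 0; b < NUM_BLOCKS; ++b)
        printf("block %d sum = %d\n", b, h_out[b]);
    return 0;
}
```

The barrier only synchronizes threads within one block, which is why each block produces its own partial result.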