
At the time of writing this document, the CUDA toolkit is already installed in the Colab
environment (in previous semesters this was not the case, so it had to be installed manually).
We can check the compiler version by running the following command in a cell:

In [1]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


The first line of the next cell uses pip to install the nvcc4jupyter plugin on the machine that hosts the Jupyter environment. The second line loads the plugin, which makes the CUDA compiler usable from notebook cells:


In [2]:
!pip install git+https://github.com/andreinechaev/nvcc4jupyter.git
%load_ext nvcc_plugin

Collecting git+https://github.com/andreinechaev/nvcc4jupyter.git
  Cloning https://github.com/andreinechaev/nvcc4jupyter.git to /tmp/pip-req-build-k6z8qojk
  Running command git clone --filter=blob:none --quiet https://github.com/andreinechaev/nvcc4jupyter.git /tmp/pip-req-build-k6z8qojk
  Resolved https://github.com/andreinechaev/nvcc4jupyter.git to commit 0a71d56e5dce3ff1f0dd2c47c29367629262f527
  Preparing metadata (setup.py) ... done
directory /content/src already exists
Out bin /content/result.out


Once the installation has finished, we can run CUDA C/C++ code by putting the cell magic
%%cu at the beginning of each cell.
For instance, this code implements the typical “hello world”:

In [3]:
%%cu
#include <stdio.h>

__global__ void hello_kernel(void) {
    printf("Hello world from the device!\n");
}

int main(void) {
    printf("Hello world from the host!\n");
    hello_kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Hello world from the host!
Hello world from the device!



Threading: If we know the problem size and the block size, we can calculate the number of blocks.
**1. Provide the code that generates an output similar to this:**
To get the maximum mark in this exercise, you have to
explain every implementation decision:

Host and Device Code Separation:
The CUDA programming model consists of host (CPU) and device (GPU) code. The main function runs on the host and launches kernels which run on the device. This separation is crucial for managing computations that are offloaded to the GPU.

Kernel Design (printThreadIds):
The kernel is designed to be lightweight and autonomous. Each instance (thread) executes the same code but operates on different data, following the SIMT (Single Instruction, Multiple Thread) architecture. This design ensures efficient parallel execution where each thread knows its unique position in the thread grid.

Global ID Calculation:
The global ID for each thread is calculated using blockId * blockSize + threadId. This formula ensures a unique ID across the entire grid. It's a standard approach in CUDA for identifying threads when they need to work on different parts of an array or dataset.
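
The formula can be checked on the host without a GPU. The following is a minimal sketch in plain C++ that emulates the arithmetic a kernel performs with blockIdx.x, blockDim.x and threadIdx.x; the helper names `global_id` and `all_global_ids` are hypothetical and not part of CUDA.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of blockIdx.x * blockDim.x + threadIdx.x.
// Hypothetical helper for illustration only.
int global_id(int block_id, int block_size, int thread_id) {
    assert(thread_id >= 0 && thread_id < block_size);  // thread must lie inside its block
    return block_id * block_size + thread_id;
}

// Enumerate every (block, thread) pair of a 1D grid and collect the IDs,
// block by block and thread by thread.
std::vector<int> all_global_ids(int num_blocks, int block_size) {
    std::vector<int> ids;
    for (int b = 0; b < num_blocks; ++b) {
        for (int t = 0; t < block_size; ++t) {
            ids.push_back(global_id(b, block_size, t));
        }
    }
    return ids;
}
```

For the 5 x 5 launch used in this exercise, `all_global_ids(5, 5)` yields 0 through 24 with no repeats, and `global_id(4, 5, 0)` is 20, which matches the line printed by thread 0 of block 4.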

Use of Built-in Variables (blockIdx, threadIdx, blockDim):
These are CUDA built-in variables that provide each thread with its context within the grid and block. They are essential for determining the thread's position and for computing its global ID.

Kernel Launch Configuration:
The numBlocks and blockSize variables define the execution configuration passed between <<< and >>>. Both are set to 5 to match the required output: 5 blocks of 5 threads each give 25 threads, with global IDs 0 to 24.
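
When the problem size is not a multiple of the block size, the number of blocks has to be rounded up so that every element gets a thread. A minimal sketch of that calculation in plain C++ follows; `blocks_needed` is a hypothetical helper, not a CUDA API call.

```cpp
#include <cassert>

// Smallest number of blocks whose threads cover problem_size elements
// (integer ceiling division). Hypothetical helper for illustration.
int blocks_needed(int problem_size, int block_size) {
    assert(problem_size >= 0 && block_size > 0);
    return (problem_size + block_size - 1) / block_size;
}
```

For example, `blocks_needed(25, 5)` is 5, as in this exercise, while `blocks_needed(16, 3)` is 6 because the last block is only partly used. In that case the kernel needs a bounds check so that the surplus threads do nothing.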

Use of printf in Kernel:
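CUDA supports a limited form of printf within kernel code, mainly for debugging. It is used here to send each thread's information to the standard output of the host. The output is buffered on the device and flushed at synchronization points such as cudaDeviceSynchronize, and the order in which blocks print is not guaranteed, which is why the blocks appear out of order in the output below. This is for demonstration and learning purposes; in production code you would typically avoid I/O inside kernels.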

Synchronization with cudaDeviceSynchronize:
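Kernel launches are asynchronous: control returns to the host as soon as the kernel has been queued. cudaDeviceSynchronize blocks the host until all previously launched device work has completed. Here it ensures that the kernel finishes and its printf output is flushed before main returns; more generally, it is essential whenever the host needs results that the device has produced.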

Error Checking (to be implemented in a complete solution):
While not included in the provided snippet, proper error checking after each CUDA API call and kernel launch is critical for robustness and correctness. It allows for the detection and handling of runtime errors, such as failed kernel launches or issues with memory allocation.

Resource Management (to be implemented in a complete solution):
Deallocating any dynamically allocated memory on the device and resetting the device at the end of the program are best practices that prevent resource leaks and ensure a clean state for subsequent CUDA operations.

In [4]:
%%cu
#include <stdio.h>

// CUDA Kernel function to print thread IDs
__global__ void printThreadIds() {
    int blockId = blockIdx.x;
    int threadId = threadIdx.x;
    int blockSize = blockDim.x;

    // Calculate global ID
    int globalId = blockId * blockSize + threadId;

    // Print the message
    printf("Hi! My Id is %d, I am the thread %d out of %d in block %d\n", globalId, threadId, blockSize, blockId);
}

int main() {
    // Define the number of blocks and threads per block
    int numBlocks = 5;
    int blockSize = 5; // This means each block contains 5 threads

    // Launch the kernel
    printThreadIds<<<numBlocks, blockSize>>>();

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    return 0;
}


Hi! My Id is 20, I am the thread 0 out of 5 in block 4
Hi! My Id is 21, I am the thread 1 out of 5 in block 4
Hi! My Id is 22, I am the thread 2 out of 5 in block 4
Hi! My Id is 23, I am the thread 3 out of 5 in block 4
Hi! My Id is 24, I am the thread 4 out of 5 in block 4
Hi! My Id is 5, I am the thread 0 out of 5 in block 1
Hi! My Id is 6, I am the thread 1 out of 5 in block 1
Hi! My Id is 7, I am the thread 2 out of 5 in block 1
Hi! My Id is 8, I am the thread 3 out of 5 in block 1
Hi! My Id is 9, I am the thread 4 out of 5 in block 1
Hi! My Id is 15, I am the thread 0 out of 5 in block 3
Hi! My Id is 16, I am the thread 1 out of 5 in block 3
Hi! My Id is 17, I am the thread 2 out of 5 in block 3
Hi! My Id is 18, I am the thread 3 out of 5 in block 3
Hi! My Id is 19, I am the thread 4 out of 5 in block 3
Hi! My Id is 10, I am the thread 0 out of 5 in block 2
Hi! My Id is 11, I am the thread 1 out of 5 in block 2
Hi! My Id is 12, I am the thread 2 out of 5 in block 2
Hi! My Id is 13

Memory Allocation
**2. Regarding the code:**
• The code does not work properly. What have you done to
correct it?
I added cudaMemcpy calls before the kernel launch to transfer the input data from host memory to device memory.
I also made sure that the device memory is released with cudaFree once the result has been copied back to the host, to avoid memory leaks.

• Change the value of “BLOCKSIZE” to, for instance, “3”.
How does it affect the execution compared to the original
output?
With BLOCKSIZE = 3 the kernel is launched with only three threads per block, and since GRIDSIZE is 1 there are only three threads in total. Each thread updates the element at its own threadIdx.x, so only the first three characters of the string are modified: the program prints "Hello Worlo" instead of "Hello World".
In general, if GRIDSIZE * BLOCKSIZE is less than N, not all elements of a and b are processed and the operation is incomplete.
In summary, the launch configuration must provide at least as many threads as there are data elements. If there are more elements than threads, some data will not be processed unless more thread blocks are added to the grid or each thread handles several elements.
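
The effect of the block size can be reproduced on the host. The sketch below emulates a single-block launch of hello_decoder in plain C++, with one loop iteration standing in for each thread. The input string and offsets are copied from the exercise; `decode_with_threads` is a hypothetical helper, and it keeps only the five visible characters of the word.

```cpp
#include <cassert>
#include <string>

const int N = 16;  // array length, as in the exercise

// Emulates one block of hello_decoder: "thread" t adds b[t] to a[t]
// for t in [0, block_size). Hypothetical helper for illustration.
std::string decode_with_threads(int block_size) {
    assert(block_size >= 0);
    char a[N] = "Hello";                      // remaining bytes are zero
    const int b[N] = {15, 10, 6, 0, -11, 1};  // remaining entries are zero
    for (int t = 0; t < block_size && t < N; ++t) {
        a[t] = static_cast<char>(a[t] + b[t]);
    }
    return std::string(a, 5);  // the five visible characters
}
```

Under these assumptions, `decode_with_threads(32)` returns "World", while `decode_with_threads(3)` returns "Worlo": the fourth character has an offset of 0 and the fifth is never touched.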

In [5]:
%%cu
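#include <stdio.h>
#include <stdlib.h> // EXIT_SUCCESS

const int N = 16;
const int GRIDSIZE = 1; //number of thread blocks
const int BLOCKSIZE = 32; //number of threads per thread block

__global__ void hello_decoder(char *a, int *b) {
    // BLOCKSIZE may exceed N, so threads past the end of the arrays do nothing
    if (threadIdx.x < N) {
        a[threadIdx.x] += b[threadIdx.x];
    }
}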

int main() {
    char a[N] = "Hello\0\0\0\0\0\0";
    int b[N] = {15, 10, 6, 0, -11, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    char *ad;
    int *bd;
    const int csize = N*sizeof(char);
    const int isize = N*sizeof(int);

    printf("%s ", a);

    cudaMalloc((void**)&ad, csize);
    cudaMalloc((void**)&bd, isize);

    // Copy input data from host to device
    cudaMemcpy(ad, a, csize, cudaMemcpyHostToDevice);
    cudaMemcpy(bd, b, isize, cudaMemcpyHostToDevice);

    // Kernel launch
    hello_decoder<<<GRIDSIZE, BLOCKSIZE>>>(ad, bd);

    // Copy output data from device to host
    cudaMemcpy(a, ad, csize, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(ad);
    cudaFree(bd);

    printf("%s\n", a);

    return EXIT_SUCCESS;
}


Hello World



**3. Provide the kernel code that solves the problem and answer
the following questions:**

• How differently are transfers between CPU and
GPU managed?

Managed memory (cudaMallocManaged) simplifies memory management by providing a single address space that is accessible from both the CPU and the GPU. The CUDA runtime migrates data between host and device automatically, so no explicit cudaMemcpy calls are needed. However, this on-demand page migration can add performance overhead.
In contrast, explicit memory transfers require the programmer to manage separate host and device allocations and to call cudaMemcpy to move data between them.

• Check that it does not return an error (you can attach a
screenshot).

After the kernel finishes, the host code checks the result by computing the maximum absolute difference between each element of y and VALUE. If any element differs, it prints an error message. The program prints nothing on success, so an empty output after running it means that no wrong values were detected.
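
The same check can be tried on a small array on the host. This is a minimal sketch in plain C++ that mirrors the initialization pattern and the error computation of the exercise; the function names `fill_inputs` and `max_abs_error` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Fills x and y with the pattern of the exercise: x[i] = i % value and
// y[i] = value - x[i], so x[i] + y[i] equals value for every i.
void fill_inputs(std::vector<float>& x, std::vector<float>& y, int value) {
    assert(x.size() == y.size() && value > 0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float val = static_cast<float>(i % value);
        x[i] = val;
        y[i] = static_cast<float>(value) - val;
    }
}

// Largest absolute deviation of any element of y from the expected value.
float max_abs_error(const std::vector<float>& y, float expected) {
    float error = 0.0f;
    for (float v : y) {
        error = std::fmax(error, std::fabs(v - expected));
    }
    return error;
}
```

After adding x to y element by element, `max_abs_error(y, 20.0f)` is exactly 0 because all the values involved are small integers that float represents exactly. A single wrong element makes the result non-zero.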

• How long does it take to run (you can use the extension
%%time at the beginning of the cell, or the Unix command
time before the binary execution)?

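To measure how long the program takes to run, add the %%time magic command at the beginning of the Jupyter cell, or put the Unix time command before the binary when running it from a shell.
The actual time depends on the GPU's capabilities and the current load on the system. For large problem sizes, such as PROBLEMSIZE = 1000000000, the two float arrays alone occupy about 8 GB of managed memory, so the run can take a significant amount of time.
Note that the two timing cells at the end of this notebook fail because exercise.cu does not exist in the working directory, so the times they report do not measure the program.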

In [10]:
%%cu
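#include <stdio.h>   // printf
#include <math.h>    // fmax, fabs
#define VALUE 20
#define PROBLEMSIZE 1000000000

// Grid-stride loop: each thread starts at its global index and advances by
// the total number of threads in the grid, so any grid size covers the array.
__global__ void add(float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < PROBLEMSIZE; i += stride) {
        y[i] += x[i];
    }
}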

int main(void) {
    float *x, *y;
    cudaMallocManaged(&x, PROBLEMSIZE * sizeof(float));
    cudaMallocManaged(&y, PROBLEMSIZE * sizeof(float));
    for (int i = 0; i < PROBLEMSIZE; i++) {
        float val = (float)(i % VALUE);
        x[i] = val;
        y[i] = (VALUE - val);
    }

    int blockSize = 256;
    int numBlocks = (PROBLEMSIZE + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(x, y);
    cudaDeviceSynchronize();


    float error = 0.0f;
    for (int i = 0; i < PROBLEMSIZE; i++)
        error = fmax(error, fabs(y[i] - VALUE));
    if (error != 0)
        printf("Wrong result. Check your code, especially your kernel\n");

    cudaFree(x);
    cudaFree(y);
    return 0;
}
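
The add kernel above walks the arrays with a grid-stride loop. The following host-side sketch in plain C++ emulates that loop for every simulated thread and counts how often each element is visited, to illustrate that the pattern covers the array exactly once whatever the grid size. `visit_counts` and its parameters are hypothetical names, and the emulation runs sequentially rather than in parallel.

```cpp
#include <cassert>
#include <vector>

// Emulates the grid-stride loop of the add kernel on the host and returns
// how many times each element index was visited. Hypothetical helper.
std::vector<int> visit_counts(int problem_size, int num_blocks, int block_size) {
    assert(problem_size >= 0 && num_blocks > 0 && block_size > 0);
    std::vector<int> counts(problem_size, 0);
    const int stride = block_size * num_blocks;      // blockDim.x * gridDim.x
    for (int b = 0; b < num_blocks; ++b) {            // plays the role of blockIdx.x
        for (int t = 0; t < block_size; ++t) {        // plays the role of threadIdx.x
            const int index = b * block_size + t;     // global thread index
            for (int i = index; i < problem_size; i += stride) {
                ++counts[i];
            }
        }
    }
    return counts;
}
```

With 1000 elements and only 2 blocks of 32 threads, each simulated thread handles several elements; with 10 elements and 4 blocks of 8 threads, the surplus threads skip the loop entirely. In both cases every count is 1.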





In [15]:
! nvcc -o exercise exercise.cu
! time ./exercise

[01m[Kcc1plus:[m[K [01;31m[Kfatal error: [m[Kexercise.cu: No such file or directory
compilation terminated.
/bin/bash: line 1: ./exercise: No such file or directory

real	0m0.001s
user	0m0.000s
sys	0m0.000s


In [16]:
%%time
! nvcc -o exercise exercise.cu
! ./exercise

[01m[Kcc1plus:[m[K [01;31m[Kfatal error: [m[Kexercise.cu: No such file or directory
compilation terminated.
/bin/bash: line 1: ./exercise: No such file or directory
CPU times: user 11.5 ms, sys: 90 µs, total: 11.6 ms
Wall time: 209 ms
