# Hands-On 6: Portable Parallel Programming with CUDA
*   Enzo Bacelar Conte Gebauer
*   Luiz Guilherme Guerreiro
*   Maria Eduarda Lopes de Morais Brito

## `Specifications`


    GPU: RTX 4060 Ti 8 GB
    CPU: AMD Ryzen 7 5700X
    RAM: 16 GB 3200 MHz


In [4]:
%%writefile saxpy.cu
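#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Element-wise y = a*x + y with a fixed at 1, i.e. y[i] = x[i] + y[i].
__global__ void saxpy(int n, float *x, float *y)
{
  // Global index, so the kernel stays correct when launched with several blocks.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = x[i] + y[i];
}

void printVector(float *vector, int n)
{
  for (int i = 0; i < n; ++i)
    printf("%1.0f\t", vector[i]);

  printf("\n\n");
}

// Fill the vector with 1, 2, ..., n.
void generateVector(float *vector, int n)
{
  for (int i = 0; i < n; ++i)
    vector[i] = i + 1;
}

int main(int argc, char *argv[])
{
  if (argc < 2) {
    fprintf(stderr, "Usage: %s <n>\n", argv[0]);
    return 1;
  }

  int n = atoi(argv[1]);
  if (n <= 0) {
    fprintf(stderr, "n must be a positive integer\n");
    return 1;
  }

  // Host vectors
  float *x, *y;

  x = (float*) malloc(sizeof(float) * n);
  y = (float*) malloc(sizeof(float) * n);

  generateVector(x, n);
  printVector(x, n);

  generateVector(y, n);
  printVector(y, n);

  // Device vectors
  float *xd, *yd;

  cudaMalloc( (void**)&xd, sizeof(float) * n );
  cudaMalloc( (void**)&yd, sizeof(float) * n );

  // Explicit host-to-device copies
  cudaMemcpy(xd, x, sizeof(float) * n, cudaMemcpyHostToDevice);
  cudaMemcpy(yd, y, sizeof(float) * n, cudaMemcpyHostToDevice);

  // A block holds at most 1024 threads, so cover n with as many blocks as needed.
  int NUMBER_OF_THREADS_PER_BLOCK = 256;
  int NUMBER_OF_BLOCKS = (n + NUMBER_OF_THREADS_PER_BLOCK - 1) / NUMBER_OF_THREADS_PER_BLOCK;
  saxpy<<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK >>>(n, xd, yd);

  cudaDeviceSynchronize();

  // Explicit device-to-host copy of the result
  cudaMemcpy(y, yd, sizeof(float) * n, cudaMemcpyDeviceToHost);
  printVector(y, n);

  cudaFree(xd);
  cudaFree(yd);

  free(x);
  free(y);

  return 0;
}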

Overwriting saxpy.cu


## Run the Code

In [5]:
!nvcc saxpy.cu -o saxpy

In [6]:
!./saxpy 10

1	2	3	4	5	6	7	8	9	10	

1	2	3	4	5	6	7	8	9	10	

2	4	6	8	10	12	14	16	18	20	



## `Unified Memory (cudaMallocManaged)`

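The program in `saxpy-cudaMallocManaged.cu` uses `cudaMallocManaged` to allocate two vectors of $n$ floats, initializes them on the host, and then adds them element-wise in a CUDA kernel. Because the memory is managed, the same pointers are used on the host and on the device, and no explicit `cudaMemcpy` calls are needed.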

In [7]:
%%writefile saxpy-cudaMallocManaged.cu
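#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Element-wise y = a*x + y with a fixed at 1, i.e. y[i] = x[i] + y[i].
__global__ void saxpy(int n, float *x, float *y)
{
  // Global index, so the kernel stays correct when launched with several blocks.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = x[i] + y[i];
}

void printVector(float *vector, int n)
{
  for (int i = 0; i < n; ++i)
    printf("%1.0f\t", vector[i]);

  printf("\n\n");
}

// Fill the vector with 1, 2, ..., n.
void generateVector(float *vector, int n)
{
  for (int i = 0; i < n; ++i)
    vector[i] = i + 1;
}

int main(int argc, char *argv[])
{
  if (argc < 2) {
    fprintf(stderr, "Usage: %s <n>\n", argv[0]);
    return 1;
  }

  int n = atoi(argv[1]);
  if (n <= 0) {
    fprintf(stderr, "n must be a positive integer\n");
    return 1;
  }

  float *x, *y;

  // Managed allocations: the same pointers are valid on the host and the device.
  cudaMallocManaged(&x, sizeof(float) * n);
  cudaMallocManaged(&y, sizeof(float) * n);

  generateVector(x, n);
  printVector(x, n);
  generateVector(y, n);
  printVector(y, n);

  // A block holds at most 1024 threads, so cover n with as many blocks as needed.
  int NUMBER_OF_THREADS_PER_BLOCK = 256;
  int NUMBER_OF_BLOCKS = (n + NUMBER_OF_THREADS_PER_BLOCK - 1) / NUMBER_OF_THREADS_PER_BLOCK;

  saxpy <<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK >>> (n, x, y);

  // Wait for the kernel to finish before the host reads y.
  cudaDeviceSynchronize();

  printVector(y, n);

  cudaFree(x);
  cudaFree(y);

  return 0;
}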

Writing saxpy-cudaMallocManaged.cu


## Run the Code

In [8]:
!nvcc saxpy-cudaMallocManaged.cu -o saxpy-cudaMallocManaged

In [9]:
!./saxpy-cudaMallocManaged 8

1	2	3	4	5	6	7	8	

1	2	3	4	5	6	7	8	

2	4	6	8	10	12	14	16	



## Questions


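### How does Unified Memory work?
Unified Memory gives the host and the accelerator a single address space: memory allocated with `cudaMallocManaged` is reachable through the same pointer from CPU and GPU code. The CUDA runtime migrates the data automatically to whichever processor accesses it, so the program needs no explicit `cudaMemcpy` calls between host and device. This makes the code shorter and simpler.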
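How the automatic data movement behaves depends on the GPU and the platform. As a minimal sketch (not part of the original hands-on), the program below asks the CUDA runtime whether a device can allocate managed memory and whether it can access managed memory concurrently with the CPU. Querying device `0` is an assumption; the attribute names are the ones defined by the CUDA runtime API.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
  int device = 0;                   // assumption: query the first GPU in the system
  int managed = 0, concurrent = 0;

  // 1 if the device can allocate managed (unified) memory on this system
  cudaError_t err = cudaDeviceGetAttribute(&managed, cudaDevAttrManagedMemory, device);

  // 1 if the device can coherently access managed memory concurrently with the CPU
  if (err == cudaSuccess)
    err = cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

  if (err != cudaSuccess) {
    fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    return 1;
  }

  printf("managed memory supported: %d\n", managed);
  printf("concurrent managed access: %d\n", concurrent);
  return 0;
}
```

It can be compiled like the other programs here, for example with `nvcc query.cu -o query` (the file name is hypothetical). A value of 0 for concurrent managed access means the host should not touch managed data while a kernel is running.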

### Is `cudaMallocManaged` slower than `cudaMalloc`?
It can be. `cudaMalloc` allocates memory on the GPU only, and the program decides exactly when data moves by calling `cudaMemcpy`. `cudaMallocManaged` allocates Unified Memory that both the CPU and the GPU can access, so the runtime has to migrate data between them automatically; the resulting page faults and migrations add latency. In practice, `cudaMalloc` with explicit transfers tends to perform better when fine-grained control over data movement between CPU and GPU is critical, at the cost of extra manual memory-management code.
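One way to reduce the page-fault overhead while keeping a single pointer is to prefetch managed memory before it is used. The following is a minimal sketch, not part of the original hands-on. It assumes the `cudaMemPrefetchAsync(ptr, bytes, device, stream)` signature of CUDA toolkits up to 12.x (newer toolkits changed how the destination is passed), a GPU that supports concurrent managed access, and an illustrative vector length of $2^{20}$. The kernel is the same addition used in this hands-on.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float *x, float *y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    y[i] = x[i] + y[i];
}

int main(void)
{
  const int n = 1 << 20;                    // illustrative size (assumption)
  const size_t bytes = sizeof(float) * n;
  float *x, *y;
  int device = 0;

  cudaGetDevice(&device);                   // GPU currently in use
  cudaMallocManaged(&x, bytes);
  cudaMallocManaged(&y, bytes);

  // Initialize on the host; the pages are resident in CPU memory afterwards.
  for (int i = 0; i < n; ++i) {
    x[i] = i + 1;
    y[i] = i + 1;
  }

  // Move both vectors to the GPU in bulk instead of faulting page by page.
  cudaMemPrefetchAsync(x, bytes, device, 0);
  cudaMemPrefetchAsync(y, bytes, device, 0);

  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  saxpy<<<blocks, threads>>>(n, x, y);

  // Bring the result back to the CPU before the host reads it.
  cudaMemPrefetchAsync(y, bytes, cudaCpuDeviceId, 0);
  cudaDeviceSynchronize();

  printf("y[0] = %1.0f, y[n-1] = %1.0f\n", y[0], y[n - 1]);

  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

The prefetch calls are hints: if the device does not support them, they return an error, but the kernel still computes the same result through on-demand migration. Whether prefetching closes the gap to `cudaMalloc` plus `cudaMemcpy` depends on the workload and should be measured rather than assumed.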

## References

M. Boratto. *Hands-On Supercomputing with Parallel Computing*. 2022. Available: https://github.com/muriloboratto/Hands-On-Supercomputing-with-Parallel-Computing