## GPGPU Architecture

- GPU Overview.
- e.g., Intel Arc Pro, AMD Radeon Pro, Nvidia Tesla.
- CPUs have low throughput on compute-intensive tasks.
- CPUs are slower at enhancing images and rendering graphics.
- GPUs outdo CPUs at 3D rendering because the work consists of many independent operations that can run in parallel.
- GPU cores are specialized processors for handling graphics manipulations.

## GPU vs. CPU
- Parallelism
- Instruction Set Architecture
- Memory Hierarchy
- Floating-Point Performance
- Power Efficiency

## Key Milestones in Evolution:
- Graphics Rendering
- Programmability
- General-Purpose Computing
- CUDA and GPGPU
- GPU Libraries and Frameworks:
    - cuDNN (CUDA Deep Neural Network library)
    - TensorRT
    - CUDA Toolkit
- AI and Deep Learning
- Heterogeneous Computing.
- Heterogeneous Co-processing model.

## GP-GPU Overview
- General-Purpose GPU (GPGPU): using a GPU for general-purpose scientific and engineering computing.
- Refer to the slides.

## Applications:
- Big data processing
- Machine Learning and AI
- Scientific simulations
- Virtual reality and gaming
- Cryptography and cyber security.

## Performance
- Refer to the slides with info from Nvidia.

## Outline of GP-GPU
- Peripheral Component Interconnect Express (PCI Express) bus: connects the host CPU and the GPU.

## Understand GP-GPU Architecture
- A typical GP-GPU architecture consists of the following components:
- GPU core
- Instruction cache.
- Texture units
- Interconnect network
- Streaming multiprocessors
- Data Processing Unit
- Load/Store Unit
- Special Function Unit.

## Nvidia Tesla V100 Architecture

## Architecture of SM and Execution Model

## Memory Structure:
- Per-thread registers
- Per-thread local memory
- Per-block shared memory
- Per-grid global memory
- Per-grid constant memory
- Per-grid texture memory

## Programming models for GPGPU
- Low level 
    - CUDA - Compute Unified Device Architecture
    - OpenCL
- Directives:
    - OpenMP - Open Multi-Processing
    - OpenACC - Open Accelerator

- Directive-based: the code stays ordinary C. The compiler links in a runtime library, and the directives tell the C compiler how to compile and execute the marked regions.
- Low level: a dedicated toolchain is used (for example, nvcc for CUDA). There are no directives; the device code is written differently and customized for the hardware.

## Basic Terms:
- Host - The CPU and its memory
- Device - The GPU and its memory
- Kernel - A function compiled for the device and executed on the device by many threads
- Block - A group of threads
- Grid - A group of blocks
- Warp - A group of 32 threads within a block that are scheduled and executed together
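
To make the thread hierarchy concrete, the following host-side sketch computes where a thread sits in a one-dimensional grid. The type `ThreadCoord` and the helper functions are hypothetical names chosen for illustration and are not part of the CUDA API; the warp size of 32 is the value used on Nvidia GPUs.

```cpp
#include <cstddef>

// Hypothetical helper type for illustration; not part of the CUDA API.
struct ThreadCoord {
    std::size_t block;   // index of the block within the grid
    std::size_t thread;  // index of the thread within its block
};

constexpr std::size_t kWarpSize = 32;  // threads per warp on Nvidia GPUs

// Global index of a thread in a 1D grid: blocks are laid out back to back.
constexpr std::size_t global_index(ThreadCoord t, std::size_t threads_per_block) {
    return t.block * threads_per_block + t.thread;
}

// Warp that a thread belongs to inside its own block.
constexpr std::size_t warp_in_block(ThreadCoord t) {
    return t.thread / kWarpSize;
}

// Number of blocks needed so that each of n elements gets its own thread.
constexpr std::size_t blocks_needed(std::size_t n, std::size_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}
```

For instance, thread 5 of block 2 with 256 threads per block has global index 2 * 256 + 5 = 517, and 1000 elements need 4 such blocks.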

## CUDA Architecture
- C code runs serially on the CPU.
- Parallel execution is expressed by the kernel function, which is executed by a set of threads in parallel on the GPU.
- Control returns to the host when the kernel finishes.

## OpenCL
- Lengthy code.
- Refer to the slides.

## OpenMP

## OpenACC

## Introduction to OpenACC
- Very similar to OpenMP directives.
- Works for Fortran, C, and C++.
- Supported compilers: PGI, Cray, GCC.

## Execution Model
- The program runs on the host CPU.
- Compute-intensive regions (kernels) and their related data are offloaded to the accelerator (GPU).
- The compute kernels are executed by the GPU.
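
A minimal sketch of this offload model, assuming an OpenACC-capable compiler: the host owns the vectors and only the marked loop is handed to the accelerator. The function name `saxpy` and the choice of data clauses are illustrative. If the compiler does not support OpenACC, the pragma is ignored and the loop simply runs on the CPU.

```cpp
#include <vector>

// Computes y = a * x + y. Assumes y has at least as many elements as x.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();  // raw pointers so the data clauses can name the arrays
    float* yp = y.data();

    // copyin: host -> device before the region.
    // copy:   host -> device before, device -> host after.
    #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
    // Execution continues on the host here, with y updated.
}
```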

## Levels of Parallelism
1. Gang
2. Worker
3. Vector (useful for SIMD)

## Key Concepts
- Vector: threads work in SIMD fashion.
    - Threads are the individual tasks that are executed in parallel on the GPU.
    - Threads are organized into warps, which are groups of 32 threads each.
    - All threads within a warp execute the same instruction together on a single streaming multiprocessor (SM).
- Worker: a group of threads that can be scheduled and executed on a streaming multiprocessor (SM) within the GPU.
- Gang: workers are organized into gangs. Gangs work independently of each other.
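
One common way to map the three levels onto code is a nest of three loops, sketched below. The function `increment_3d` and the flattened array layout are assumptions made for illustration; how gangs, workers, and vector lanes map to hardware is up to the compiler and device.

```cpp
#include <vector>

// Adds 1 to every element of a flattened nz x ny x nx array.
// Assumes data.size() == nz * ny * nx.
void increment_3d(std::vector<float>& data, int nz, int ny, int nx) {
    float* d = data.data();
    const int total = nz * ny * nx;

    #pragma acc parallel loop gang copy(d[0:total])
    for (int z = 0; z < nz; ++z) {          // outer loop: spread across gangs
        #pragma acc loop worker
        for (int y = 0; y < ny; ++y) {      // middle loop: spread across workers
            #pragma acc loop vector
            for (int x = 0; x < nx; ++x) {  // inner loop: SIMD-style vector lanes
                d[(z * ny + y) * nx + x] += 1.0f;
            }
        }
    }
}
```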

## OpenACC Syntax
- For C/C++: #pragma acc directive clauses
- Fortran: !$acc directive clauses

## Compiling an OpenACC Program:
- Compilers that support OpenACC usually require an option that enables the feature:
    - PGI: -acc
- Example: pgcc -acc -Minfo=all test.c
- -Minfo=all makes the compiler print informational messages about how it handled the code, such as which loops were parallelized.

## Properties:
- Incremental
- Single source
- Interoperable
- Portable

## Compute constructs:
- parallel
    - Defines a region to be executed on an accelerator
    - Work sharing parallelism has to be defined manually.
- kernels
    - Defines a region to be compiled into a series of kernels that are executed in sequence on an accelerator
    - Work sharing parallelism is defined automatically for the separate kernels.
- With similar work sharing, both can perform equally well.
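
The difference can be sketched with the same loop written both ways. The function names are illustrative. In the `parallel` variant the loop is marked explicitly; in the `kernels` variant the compiler analyzes the region and decides how to parallelize it.

```cpp
#include <vector>

// Variant 1: parallel. The programmer marks the loop for work sharing.
void scale_parallel(std::vector<double>& v, double factor) {
    double* p = v.data();
    const int n = static_cast<int>(v.size());
    #pragma acc parallel copy(p[0:n])
    {
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            p[i] *= factor;
        }
    }
}

// Variant 2: kernels. The compiler decides how to parallelize the loop.
void scale_kernels(std::vector<double>& v, double factor) {
    double* p = v.data();
    const int n = static_cast<int>(v.size());
    #pragma acc kernels copy(p[0:n])
    {
        for (int i = 0; i < n; ++i) {
            p[i] *= factor;
        }
    }
}
```

Both functions produce the same result; they differ only in who is responsible for the work sharing.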

## Compute Constructs: Kernels
- specified by:
- #pragma acc kernels [clause [,clause]...] new-line structured block
- Each loop in the region is compiled into a separate kernel.
- The kernels are executed one after another, in order.

- The instructor shows code with two for loops, one after another.
- Note that, unlike in OpenMP, we do not have to write a kernels directive for each separate for loop.
    - With kernels, the compiler takes care of creating multiple gangs and assigning workers and vectors for each loop.
- With parallel, however, we have to specify a directive before each loop.
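
The instructor's code is not reproduced in these notes; the following is a stand-in sketch of the same idea, with two consecutive loops inside one `kernels` region. The function name and the work done in each loop are assumptions made for illustration.

```cpp
#include <vector>

// Fills a with 0..n-1, then sets b[i] = 2 * a[i].
// Assumes b has at least as many elements as a.
void fill_then_double(std::vector<int>& a, std::vector<int>& b) {
    int* ap = a.data();
    int* bp = b.data();
    const int n = static_cast<int>(a.size());

    // One directive covers both loops; a compiler would typically
    // generate one kernel per loop and run them in order.
    #pragma acc kernels copyout(ap[0:n], bp[0:n])
    {
        for (int i = 0; i < n; ++i) {  // first loop: first kernel
            ap[i] = i;
        }
        for (int i = 0; i < n; ++i) {  // second loop: second kernel, after the first
            bp[i] = 2 * ap[i];
        }
    }
}
```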

- Compute construct 
    - parallel
    - kernels
    
- Loop construct
    - loop

- These constructs differ in how the compiler handles them.


## Work sharing construct: loop
- #pragma acc loop [clause [,clause]...] new-line for loop
- combined constructs
    - #pragma acc kernels loop
    - #pragma acc parallel loop
- Loop index variables are private variables by default.
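
A small sketch of the combined `parallel loop` form, using a hypothetical `row_sums` function. The outer index `r` is private by default, and `sum` is private because it is declared inside the loop body. The inner loop is marked `seq` here simply to keep the example free of reductions.

```cpp
#include <vector>

// Sums each row of a row-major rows x cols matrix.
// Assumes m.size() == rows * cols.
std::vector<int> row_sums(const std::vector<int>& m, int rows, int cols) {
    std::vector<int> out(rows, 0);
    const int* mp = m.data();
    int* op = out.data();
    const int total = rows * cols;

    // Combined construct: a parallel region that contains exactly one loop.
    #pragma acc parallel loop copyin(mp[0:total]) copyout(op[0:rows])
    for (int r = 0; r < rows; ++r) {      // r is private to each iteration
        int sum = 0;                      // declared in the body, so also private
        #pragma acc loop seq
        for (int c = 0; c < cols; ++c) {  // runs sequentially within each row
            sum += mp[r * cols + c];
        }
        op[r] = sum;
    }
    return out;
}
```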

## Clauses
- Refer to the slides.

## Parallel/kernels clauses
- if clause
- async clause
- num_gangs clause
- num_workers clause
- vector_length clause
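
The sketch below puts all five clauses on one directive. The function name, the threshold in `if`, and the gang, worker, and vector sizes are arbitrary illustrative values, not recommendations; suitable numbers depend on the hardware and the problem size.

```cpp
#include <vector>

// Squares every element of v in place.
void square_all(std::vector<float>& v) {
    float* p = v.data();
    const int n = static_cast<int>(v.size());

    // if:            offload only when the condition is true, otherwise run on the host
    // async:         do not block the host; work is placed on queue 1
    // num_gangs, num_workers, vector_length: requested sizes of the three levels
    #pragma acc parallel loop if(n > 1000) async(1) \
        num_gangs(64) num_workers(4) vector_length(128) copy(p[0:n])
    for (int i = 0; i < n; ++i) {
        p[i] = p[i] * p[i];
    }

    // Block until everything on queue 1 has finished before v is used again.
    #pragma acc wait(1)
}
```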

## Private
- Each loop iteration gets its own copy of the listed variables.
- Syntax: private(var1, var2, var3, ...)
- Avoids race conditions on those variables.
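
A minimal sketch of `private`, assuming a scratch array that is declared outside the loop. Without the clause, all iterations would share the one `tmp` buffer and could overwrite each other's values. The function name and the computation are illustrative.

```cpp
#include <vector>

// Computes out[i] = (a[i] + b[i]) * (a[i] - b[i]).
// Assumes a, b, and out all have the same length.
void sum_times_diff(const std::vector<int>& a, const std::vector<int>& b,
                    std::vector<int>& out) {
    const int* ap = a.data();
    const int* bp = b.data();
    int* op = out.data();
    const int n = static_cast<int>(a.size());
    int tmp[2];  // scratch buffer declared outside the loop

    // private(tmp): every iteration works on its own copy of tmp.
    #pragma acc parallel loop private(tmp) copyin(ap[0:n], bp[0:n]) copyout(op[0:n])
    for (int i = 0; i < n; ++i) {
        tmp[0] = ap[i] + bp[i];
        tmp[1] = ap[i] - bp[i];
        op[i] = tmp[0] * tmp[1];
    }
}
```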

## Reduction
- Syntax: reduction(operator:variable)
- The instructor shows a code snippet with and without reduction.
- Without reduction the output is wrong; with reduction the output is correct.
- The snippet parallelizes the sum of an array that contains an arithmetic progression (AP) with a = 1, d = 1, and last_n = 10.
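
The instructor's snippet is not included in these notes; the sketch below is a reconstruction of the reduction variant under the assumption that the array holds the ten values 1, 2, ..., 10. The helper `make_ap` and the function name `sum_with_reduction` are hypothetical. Without the `reduction` clause, parallel iterations would update `sum` concurrently and could lose updates.

```cpp
#include <vector>

// Builds an arithmetic progression: first, first + diff, ... (count terms).
std::vector<int> make_ap(int first, int diff, int count) {
    std::vector<int> v(count);
    for (int i = 0; i < count; ++i) {
        v[i] = first + i * diff;
    }
    return v;
}

// Sums the array in parallel; each thread accumulates a partial sum
// and the partial sums are combined with + at the end.
int sum_with_reduction(const std::vector<int>& v) {
    const int* p = v.data();
    const int n = static_cast<int>(v.size());
    int sum = 0;

    #pragma acc parallel loop reduction(+:sum) copyin(p[0:n])
    for (int i = 0; i < n; ++i) {
        sum += p[i];
    }
    return sum;
}
```

For the progression 1 to 10 the expected result is 10 * (1 + 10) / 2 = 55.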


## Clauses for Data directive
- if(condition)
- copy(list)
- copyin(list)
- copyout(list)
- create(list)
- present(list)

- The instructor presents a code snippet with the copyin, copyout, and present clauses.
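
As a stand-in for that snippet, the following sketch combines the three clauses. The outer function places the arrays on the device with a `data` region, and the inner function uses `present` to state that the data is already there, so no extra transfer happens. Both function names are hypothetical.

```cpp
#include <vector>

// Inner routine: expects the caller to have put a, b, and c on the device.
void add_on_device(const int* a, const int* b, int* c, int n) {
    #pragma acc parallel loop present(a[0:n], b[0:n], c[0:n])
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Outer routine: manages data movement. Assumes a and b have the same length.
std::vector<int> add_vectors(const std::vector<int>& a, const std::vector<int>& b) {
    const int n = static_cast<int>(a.size());
    std::vector<int> c(n);
    const int* ap = a.data();
    const int* bp = b.data();
    int* cp = c.data();

    // copyin:  inputs go host -> device on entry.
    // copyout: result goes device -> host on exit.
    #pragma acc data copyin(ap[0:n], bp[0:n]) copyout(cp[0:n])
    {
        add_on_device(ap, bp, cp, n);
    }
    return c;
}
```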

## Shaping of arrays:
- Array slicing in OpenACC.
- Syntax: x[start:count]
    - x = variable name
    - start = start index
    - count = total number of elements to be selected from start.
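
A short sketch of the slice syntax in a data clause, using a hypothetical `double_slice` function. Only the `count` elements beginning at `start` are transferred; note that the second number is a length, not an end index.

```cpp
#include <vector>

// Doubles count elements of x, beginning at index start.
// Assumes start + count <= x.size().
void double_slice(std::vector<int>& x, int start, int count) {
    int* xp = x.data();

    // xp[start:count] covers indices start .. start + count - 1.
    #pragma acc parallel loop copy(xp[start:count])
    for (int i = start; i < start + count; ++i) {
        xp[i] *= 2;
    }
}
```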

## Summary of OpenACC Directives:
```C++
// Manage data movement:
#pragma acc data copyin(a,b) copyout(c)
{
    ...
    // Initiate parallel execution:
    #pragma acc parallel
    {
        // Optimize Loop Mappings:
        #pragma acc loop gang vector
        for (int i = 0; i < n; ++i) {
            c[i] = a[i] + b[i];
            ...
        }
    }
}
```