# Accelerator Programming:

## What is an Accelerator:
    
- Hardware Acceleration is the use of computer hardware designed to perform specific functions more efficiently, when compared to software running on a general-purpose central processing unit (CPU).

- Any computation that software performs on a general-purpose CPU can, in principle, also be implemented in custom-made hardware.

- Custom-made hardware built for hardware acceleration is called an accelerator.


## Types of Accelerators:
- GPU
- FPGA 
- ASIC

## GPU:
- GPU = Graphics Processing Unit.
- Originally created for accelerating graphics rendering in video games.
- They are now used to accelerate any task that can be parallelized.
- GPUs have many cores (often thousands).
- CPUs are optimized for serial computational workloads while GPUs are optimized for parallel computational workloads.
- The instructor distinguishes concurrency from parallelism: concurrent tasks may simply be interleaved one after another on a single core, whereas the keyword for GPUs is "in parallel", meaning tasks execute at the same time on different cores.
- Each CPU core has to be fast on its own because the CPU is supposed to handle a variety of tasks, including loading the OS, dealing with RAM, running device drivers, etc.
    
## FPGA:
- FPGA = Field-Programmable Gate Array.
- Used to offload computationally intensive workloads from the main CPU, allowing it to focus on other tasks.
- FPGAs are modular (reconfigurable), which means the user can change the hardware architecture to suit the workload.
- This is because they are "programmable"; in a GPU, by contrast, the architecture is fixed and cannot be changed.
- Applications: financial modelling, scientific simulations, encryption/decryption, network traffic management, etc.
- They are also used in financial and algorithmic trading, where low latency is crucial.

## FPGA Architecture:
- refer to the slides.
- DSP blocks = Digital Signal Processing blocks
- RAM blocks = Random Access Memory blocks



## CPU:
- ALUs (arithmetic logic units) are at the heart of the cores of both CPUs and GPUs.
- In the simplified model used in the lecture, a quad-core CPU has 4 ALUs; each core also has its own control block and L1 cache (the fastest cache in the computer).
- The instructor says that the farther a memory is from the core, the more time and power the computer has to expend to access it.
- This is why the L1 cache is the fastest.
- Similarly, the L2 cache is a little farther from the cores, and the L3 cache is farther still.
- Finally there is DRAM (main memory), which sits outside the CPU chip and is the farthest of these memories from the cores.
- Caches store a small amount of data in fixed-size blocks called cache lines (the lecture described these as pages).
- If there is a cache miss, the CPU has to look in the next cache level and, ultimately, go to RAM.
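
To make the cost of cache misses more concrete, here is a minimal sketch that sums the same matrix in two orders. The names `N`, `matrix`, `fill_matrix`, `sum_row_major`, and `sum_col_major` are hypothetical and chosen only for this illustration. Both functions return the same total; the row-major version walks through neighbouring addresses and so tends to reuse each fetched cache line, while the column-major version jumps `N` elements at every step and typically misses the cache more often.

```c
#include <stddef.h>

#define N 512  /* hypothetical matrix dimension chosen for illustration */

static int matrix[N][N];

/* Fill the matrix with small known values so both sums can be compared. */
void fill_matrix(void)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            matrix[i][j] = (int)((i + j) % 7);
}

/* Row-major traversal: consecutive iterations touch neighbouring addresses. */
long sum_row_major(void)
{
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += matrix[i][j];
    return sum;
}

/* Column-major traversal: each step jumps N ints ahead in memory. */
long sum_col_major(void)
{
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += matrix[i][j];
    return sum;
}
```

Timing the two functions (for example with `clock()` from `<time.h>`) is one way to observe the difference; the size of the gap depends on the machine and the compiler settings.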

## CPU strengths:
- large memory
- fast clock speed

## CPU weakness:
- low memory bandwidth
- low performance/watt
- expensive cache miss

## GPU strengths:
- high bandwidth memory
- more compute resources
- latency tolerant via parallelism
- high performance/watt

## GPU weakness:
- relatively low memory capacity
- low per-thread performance



# Nvidia GPU Tesla V100:
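- Tensor cores execute matrix arithmetic in parallel (like addition, multiplication, etc.); on the V100 each tensor core performs a fused multiply-add on small matrices.
- SFU = Special Function Unit, which handles functions such as sine, cosine, and reciprocal; the LD/ST (load/store) units move data between the cores and memory, including the texture units at the bottom of the diagram.
- GPC = Graphics Processing Cluster; each GPC contains several streaming multiprocessors (SMs).
- Tex = texture unit.
- HBM = High Bandwidth Memory
- NVLink is Nvidia's proprietary high-speed interconnect for linking GPUs to each other (and to some CPUs).
- nvcc is Nvidia's compiler for building and running CUDA code.
- CUDA kernels can be identified by the `__global__` keyword in the source code.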
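
The matrix arithmetic that a tensor core performs can be written out as an ordinary loop nest. The sketch below is plain serial C, not tensor core code: it computes D = A×B + C on a 4x4 tile, which is the operation a V100 tensor core carries out in hardware. The name `matrix_fma`, the flat row-major layout, and the use of `float` throughout are assumptions for illustration (the hardware takes half-precision inputs).

```c
#define TILE 4  /* tile size of the matrix multiply-add */

/* d = a * b + c for TILE x TILE matrices stored row by row in flat arrays. */
void matrix_fma(const float *a, const float *b, const float *c, float *d)
{
    for (int i = 0; i < TILE; i++) {
        for (int j = 0; j < TILE; j++) {
            float acc = c[i * TILE + j];          /* start from C */
            for (int k = 0; k < TILE; k++)
                acc += a[i * TILE + k] * b[k * TILE + j];
            d[i * TILE + j] = acc;
        }
    }
}
```

Every `d[i][j]` is independent of the others, which is why this operation maps so well onto parallel hardware.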

## Other use cases of GPUs:
- GPUs are also used by Nvidia and other companies to build autonomous (self-driving) cars.
- This task involves processing 360-degree video, and speeding that up requires parallel computation.
- This is a real-time, safety-critical application. Therefore, speedup is crucial and hence GPUs are necessary.

## Tensor Processing Unit (TPU):
- A TPU is an accelerator designed by Google for accelerating machine learning (ML) workloads.
- They are custom-designed ASICs (Application-Specific Integrated Circuits) optimized for ML algorithms.
- They provide high performance and low power consumption.

## TPU architecture:
- Refer to the slides.

## TPU core components:
- Matrix multiplier unit
- Unified buffer
- Activation unit
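
As a loose mental model of how these three components cooperate, the sketch below runs one neural-network layer in plain C. It is only an illustration of the roles, not Google's design: the names `matrix_multiply`, `activation_relu`, and `run_layer`, the layer sizes, the use of `int`, and the choice of ReLU as the activation are all assumptions made for this example.

```c
#define IN_DIM 3   /* hypothetical layer sizes for illustration */
#define OUT_DIM 2

/* Role of the matrix multiplier unit: out = weights * in. */
void matrix_multiply(const int *weights, const int *in, int *out)
{
    for (int r = 0; r < OUT_DIM; r++) {
        int acc = 0;
        for (int c = 0; c < IN_DIM; c++)
            acc += weights[r * IN_DIM + c] * in[c];
        out[r] = acc;
    }
}

/* Role of the activation unit: apply ReLU element by element. */
void activation_relu(int *values)
{
    for (int r = 0; r < OUT_DIM; r++)
        if (values[r] < 0)
            values[r] = 0;
}

/* One layer: read inputs from the buffer, multiply, activate, and
 * leave the result in a buffer where the next layer can read it. */
void run_layer(const int *weights, const int *buffer_in, int *buffer_out)
{
    matrix_multiply(weights, buffer_in, buffer_out);
    activation_relu(buffer_out);
}
```

Here `buffer_in` and `buffer_out` stand in for the unified buffer, which holds the data passed from one layer to the next.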



## Use case of Accelerators:
- Use of specialized hardware (Accelerator) to speed up work, often with parallel processing.

## Programming on Accelerators:
- GPU:
    - CUDA (proprietary, by Nvidia)
    - OpenACC (an open, directive-based standard backed by Nvidia among others; lets developers accelerate applications of their choice)
    - SYCL (a newer entrant, designed to be usable on many kinds of hardware)
- FPGA:
    - SYCL
- ASICs:
    - SYCL

- We will learn about SYCL more in-depth in these sessions.
- The instructor says that if you switch from Nvidia to AMD, CUDA and OpenACC won't work, due to vendor lock-in.
- SYCL's motivation is to make it easier to port source code from one architecture to another.
- Apps like "Final Cut Pro" were cited in the lecture as using CUDA for parallel processing of video editing tasks.

# Programming Languages:
- CUDA:
    - The CUDA C keyword `__global__` indicates that a function:
        - runs on the device.
        - is called from the host.
- nvcc is the compiler for CUDA code.
- nvcc splits the source file into host and device components.
- Nvidia's compiler handles device functions like kernel().
- The host compiler handles host functions like main().
- According to the instructor, OpenACC (directives added to C, C++, or Fortran code) is the older approach, and people are moving to SYCL, which is based on C++.
- GCC is typically the default compiler for the host code, while nvcc compiles the device code.
- nvcc is a compiler driver: it hands the host code to a host compiler such as GCC rather than being built on GCC itself.

## Example of CUDA code:

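```c
#include <stdio.h>

// __global__ marks a kernel: it runs on the device and is called from the host.
__global__ void kernel(void)
{
    /* ... */
}

int main(void){
    // <<<blocks, threads per block>>> is the launch configuration.
    // 1, 1 is a placeholder that launches a single thread.
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();  // wait for the kernel to finish
    printf("Hello, World!\n");
    return 0;
}
```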

## Flow of Accelerator Programming:
1. Copy data from main memory to GPU memory.
2. The CPU instructs the GPU to start processing.
3. The GPU executes the work in parallel on each core.
4. Copy the results from the GPU memory to the main memory.

- See the slides for the image of the processing flow in CUDA.
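
The four steps can be mimicked on the host alone, which may help before looking at real CUDA. The sketch below is a simulation under stated assumptions: a second heap buffer plays the role of GPU memory, `memcpy` stands in for the host-to-device and device-to-host copies (`cudaMemcpy` in CUDA), and `fake_kernel` and `run_on_fake_device` are hypothetical names. Nothing here runs on a GPU.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in for a kernel: doubles each element. On a GPU, each
 * iteration would be handled by a separate thread. */
static void fake_kernel(float *device_data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        device_data[i] *= 2.0f;
}

/* Returns 0 on success, -1 if the "device" allocation fails. */
int run_on_fake_device(float *host_data, size_t n)
{
    /* A second heap buffer plays the role of GPU memory. */
    float *device_data = malloc(n * sizeof *device_data);
    if (device_data == NULL)
        return -1;

    /* Step 1: copy input from main memory to "GPU memory". */
    memcpy(device_data, host_data, n * sizeof *device_data);

    /* Steps 2 and 3: the CPU launches the work; the "GPU" runs it. */
    fake_kernel(device_data, n);

    /* Step 4: copy the results back to main memory. */
    memcpy(host_data, device_data, n * sizeof *device_data);

    free(device_data);
    return 0;
}
```

Keeping the two buffers separate makes the point that the GPU cannot see host memory directly in this model, so both copies are part of the cost of offloading.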

## Another example of CUDA code:
- Code that calculates sin^2 and cos^2 and adds them on the GPU is on the slides.

## OpenACC:
- OpenACC, short for Open Accelerators, is a programming standard for parallel computing.
- It allows programmers to simplify the process of offloading computations to GPUs while still using familiar programming languages like C, C++ and Fortran.
- Offloading is requested through compiler options; the notes record `--openacc -targets`, and the exact flags depend on the compiler.
- Refer to the slides for the full code.
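
For a feel of what an OpenACC directive looks like, here is a minimal sketch (not the code from the slides). The function name `saxpy_acc` is hypothetical. The `copyin` clause sends `x` to the device, and `copy` sends `y` and brings it back. A compiler without OpenACC support ignores the pragma and runs the loop serially, so the same source still works on a plain CPU.

```c
/* y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy_acc(int n, float a, const float *x, float *y)
{
    /* copyin: x is only read on the device.
     * copy:   y is sent to the device and copied back afterwards. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

With GCC, OpenACC is enabled with `-fopenacc`; NVIDIA's HPC compilers use `-acc`. Whether the loop is actually offloaded depends on how the compiler was built and what hardware is present.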

## OpenMP
- OpenMP 4.0 introduced a new directive (`target`) for offloading to accelerators.
- The clauses for data transfer are similar to those of OpenACC.
- Sample OpenMP code is in the slides.
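
A minimal sketch of OpenMP offloading (again, not the code from the slides) might look like this. The function name `vector_add_omp` is hypothetical. The `map(to: ...)` clause copies the inputs to the device and `map(from: ...)` copies the result back, which mirrors OpenACC's `copyin` and `copyout`.

```c
/* c[i] = a[i] + b[i] for i in [0, n). */
void vector_add_omp(int n, const float *a, const float *b, float *c)
{
    /* target: offload this region to an accelerator if one is available.
     * teams distribute parallel for: spread the iterations over the
     * device's threads. */
    #pragma omp target teams distribute parallel for \
        map(to: a[0:n], b[0:n]) map(from: c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

With GCC this is enabled by `-fopenmp`. If no offload device is configured the region runs on the host, and without `-fopenmp` the pragma is ignored and the loop runs serially.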

## OpenCL:
- OpenCL is an open standard API maintained by the Khronos Group.
- It leverages CPUs, GPUs, DSPs, and other processors to accelerate parallel computing.
- The kernel function typically lives in one file and the host code (written in C/C++) in a different file.
- The host program loads and builds the kernel code (often at run time), and the kernel is then loaded onto the accelerator.
- It allows you to write accelerated, portable code across various devices and architectures.

## SYCL:
- SYCL is a Khronos standard; it provides a high-level abstraction layer over C++.
- It extends C++ in two key ways:
    - device discovery
    - device control
- It allows you to write both host and device code in the same source file.
- It allows you to write portable code across different devices and architectures.
- Helps avoid vendor lock-in.
- Short and lengthy sample code is available on the slides.
<pre>
                        SYCL
            /             |             \
        GPU             FPGA            ASICs
         |
        Nvidia/AMD
</pre>

## Now we will log in to PARAM Utkarsh and learn how to write code and execute programs on that supercomputer.