## CMP-3004
## Computer Organization

### Spring 2022


## Flynn’s taxonomy

![](./multi1.png)

### Flynn’s Taxonomy

- **SISD:** A single processor executes a single instruction stream (Uniprocessors)

- **SIMD:** A single instruction stream is executed in lockstep by multiple processing elements, each operating on different data

- **MISD:** A sequence of data is passed to a set of processors, each of which executes a different instruction sequence on it

- **MIMD:** A set of processors simultaneously execute different instruction sequences on different data sets
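
The four categories follow from two questions: how many instruction streams are there, and how many data streams? A minimal sketch of that classification is shown here; the function name `flynn_class` and its parameters are illustrative, not part of any standard library.

```python
def flynn_class(instruction_streams: int, data_streams: int) -> str:
    """Return the Flynn category for the given number of streams (each >= 1)."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("stream counts must be at least 1")
    instr = "S" if instruction_streams == 1 else "M"  # Single or Multiple instruction
    data = "S" if data_streams == 1 else "M"          # Single or Multiple data
    return f"{instr}I{data}D"


print(flynn_class(1, 1))  # SISD: a uniprocessor
print(flynn_class(1, 8))  # SIMD: one instruction applied to 8 data elements
print(flynn_class(4, 4))  # MIMD: 4 processors, each with its own program and data
```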

### Computer architecture

![](./multi2.png)

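```python
def sum3(a, b, c):
    # High-level view: add three values and return the result
    return a + b + c
```

```asm
# Assembly view of the same function (illustrative RISC-V-style code;
# a, b, c are assumed to be held in x1, x2, x3)
add x5, x1, x2    # x5 = a + b
add x5, x5, x3    # x5 = (a + b) + c
ret               # return to the caller with the result in x5
```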

## Parallel and multiprocessor

- Parallelism results in
    - higher throughput
    - better fault tolerance
    - better cost-performance ratio
- If we have $n$ processors running in parallel, a computation job would ideally take $1/n$ of the time (perfect speedup)

### Amdahl's law

Perfect speedup is not achievable in practice
- Some part of the work is always inherently serial and must be done by a single processor
- The larger the sequential fraction, the less cost-effective the parallel architecture
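
Amdahl's law makes this quantitative: if a fraction $p$ of the work can be parallelized over $n$ processors, the speedup is $1 / ((1 - p) + p/n)$. The sketch below evaluates that formula; the function name `amdahl_speedup` and the sample fractions are chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallelizable fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not sped up; the parallel part is divided among n processors
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


print(f"{amdahl_speedup(1.0, 8):.2f}")     # 8.00 (perfect speedup, no serial part)
print(f"{amdahl_speedup(0.9, 10):.2f}")    # 5.26
print(f"{amdahl_speedup(0.9, 1000):.2f}")  # 9.91
```

With 10% serial work the speedup can never exceed 10, no matter how many processors are added.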

![](./raid6.png)

## Symmetric multiprocessors

- Two or more similar processors of comparable capability
- A bus or other connection scheme is used to share the main memory and I/O facilities
- All processors share access to I/O devices
- All processors can perform the same functions
- An integrated operating system provides interaction between processors and their programs
- Processes or threads are scheduled across all of the processors

### Symmetric multiprocessors

![](./multi3.png)

### Communication between processors

- Processors can communicate by leaving messages and status information in common data areas of the shared memory
- Each processor may also have its own private main memory and I/O channels in addition to the shared resources


![](./multi4.png)

### Symmetric multiprocessor organization

The time-shared bus is the simplest mechanism for constructing a multiprocessor system

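- Simplicity: each processor's interface to the bus is the same as in a single-processor system
- Flexibility: the system is easy to expand by attaching more processors to the bus
- Reliability: the bus is a passive medium, so the failure of an attached device should not bring down the whole system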

![](./multi5.png)

### Symmetric multiprocessor organization

- The main drawback to the bus organization is performance
    - the bus cycle time limits the speed of the system
- Cache memory per processor
    - reduces the number of bus accesses dramatically
- Two or three levels of cache: 
    - L1 cache internal
    - L2 cache either internal or external
    - Some processors now employ an L3 cache as well
- Cache coherence problem
    - if a word is altered in one cache, it could conceivably invalidate a word in another cache
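
The coherence problem can be reproduced with a toy model. The sketch below assumes two private write-back caches with no coherence protocol at all; the `PrivateCache` class and the address `0x10` are hypothetical and exist only for this illustration.

```python
class PrivateCache:
    """A per-processor write-back cache with no coherence protocol (toy model)."""

    def __init__(self, memory):
        self.memory = memory   # shared main memory: address -> value
        self.lines = {}        # locally cached copies: address -> value

    def read(self, addr):
        if addr not in self.lines:               # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                  # hit: return the local copy

    def write(self, addr, value):
        self.lines[addr] = value                 # write-back: memory is not updated yet


memory = {0x10: 1}
cache_a = PrivateCache(memory)
cache_b = PrivateCache(memory)

cache_a.read(0x10)           # both processors cache the same word
cache_b.read(0x10)
cache_a.write(0x10, 99)      # processor A alters its copy
stale = cache_b.read(0x10)   # processor B still sees the old value
print(stale)                 # 1
```

Processor B reads 1 although A has already written 99, which is exactly the inconsistency a coherence protocol has to prevent.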

### Symmetric multiprocessor organization

**MESI protocol:** provides cache consistency on an SMP. The data cache includes two status bits per tag, so each line can be in one of four states

- Modified: The line has been modified (different from main memory) and is available only in this cache
- Exclusive: The line is the same as that in main memory and is not present in any other cache
- Shared: The line is the same as that in main memory and may be present in another cache
- Invalid: The line does not contain valid data
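
To make the four states concrete, the sketch below tracks the state of a single memory line across several caches. It is a simplified model under stated assumptions: only processor reads and writes are modeled, bus transactions and write-backs are implied rather than simulated, and the names `State` and `MesiLine` are hypothetical.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class MesiLine:
    """MESI state of one memory line in each of n caches (simplified model)."""

    def __init__(self, n_caches):
        self.states = [State.INVALID] * n_caches

    def read(self, cpu):
        if self.states[cpu] is not State.INVALID:
            return  # read hit: the state does not change
        holders = [i for i, s in enumerate(self.states)
                   if i != cpu and s is not State.INVALID]
        for i in holders:
            # A Modified copy would be written back first; every holder becomes Shared
            self.states[i] = State.SHARED
        # Shared if another cache holds the line, otherwise Exclusive
        self.states[cpu] = State.SHARED if holders else State.EXCLUSIVE

    def write(self, cpu):
        for i in range(len(self.states)):
            if i != cpu:
                self.states[i] = State.INVALID  # all other copies are invalidated
        self.states[cpu] = State.MODIFIED

    def snapshot(self):
        return "".join(s.value for s in self.states)


line = MesiLine(2)
line.read(0)
print(line.snapshot())   # EI: only cache 0 holds the line
line.read(1)
print(line.snapshot())   # SS: both caches hold a clean copy
line.write(0)
print(line.snapshot())   # MI: cache 0 modified it, cache 1 was invalidated
```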

### MESI cache line states

![](./multi6.png)

### MESI State Transition Diagram

![](./multi7.png)

### L1-L2 cache consistency

- The L1 cache does not connect directly to the bus, so it cannot snoop bus traffic itself
- Some scheme is needed to maintain data integrity across both levels of cache and across all caches in the SMP configuration
- Extend the MESI protocol to the L1 caches (L1 cache includes bits to indicate the state)
- **Goal:** for any line that is present in both an L2 cache and its corresponding L1 cache, the L1 line state should track the state of the L2 line

## Compute Unified Device Architecture (CUDA)

- Software architecture that enables GPUs to be programmed using high-level programming languages such as C and C++
- CUDA requires an NVIDIA GPU
- CUDA is an elegant solution to the problem of representing parallelism in algorithms
    - though not every algorithm can be expressed in a data-parallel form

### Basic definitions

- A CUDA program can be divided into
    - Code to be run on the host (CPU)
    - Code to be run on the device (GPU)
    - The code related to the transfer of data between the host and the device

- The **host** (CPU) interfaces with the user and controls the **device** (GPU) 
    - The host executes the serial portion of the application

- The **device** (GPU) is connected to the host
    - The device executes the data-parallel, compute-intensive portion of an application

- **Kernel** is the code executed by the **device** 
    - It is a function callable from the host and executed in parallel by many threads
    - An application or library function might consist of one or more kernels
    - A kernel can be written in C with additional keywords to express parallelism

- A **thread** is a single instance of the kernel function 
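
The roles of host, device, kernel, and thread can be sketched in plain Python. This is an emulation only, not the CUDA API: the `launch` helper is a hypothetical stand-in for a kernel launch, it runs the threads one after another instead of in parallel, and list copies stand in for host/device transfers.

```python
def vector_add_kernel(thread_id, a, b, out):
    """Kernel: every thread runs this same function on its own element."""
    if thread_id < len(out):            # guard: there may be more threads than elements
        out[thread_id] = a[thread_id] + b[thread_id]


def launch(kernel, n_threads, *args):
    """Hypothetical stand-in for a kernel launch: one kernel instance per thread ID."""
    for thread_id in range(n_threads):  # a real device runs these in parallel
        kernel(thread_id, *args)


# Host code: the serial portion of the application
host_a = [1, 2, 3, 4]
host_b = [10, 20, 30, 40]

# Host-to-device transfer (emulated by copying the lists)
dev_a, dev_b = list(host_a), list(host_b)
dev_out = [0] * len(host_a)

# Device code: the data-parallel portion, launched with 8 threads
launch(vector_add_kernel, 8, dev_a, dev_b, dev_out)

# Device-to-host transfer (emulated by copying the result back)
result = list(dev_out)
print(result)  # [11, 22, 33, 44]
```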

### GPU v. CPU

![](./gpu1.png)

- A GPU consists of streaming multiprocessors (SMs)
- A GPU uses a massively parallel SIMD (single instruction, multiple data) architecture to perform mainly mathematical operations
- A GPU doesn’t require the same complex capabilities as the CPU’s control logic 
    - such as out-of-order execution, branch prediction, and data-hazard handling
    - nor does it require large amounts of cache memory

### NVIDIA Fermi architecture

![](./gpu2.png)

- 16 streaming multiprocessors (SMs)
    - 32 CUDA cores each
- Every CUDA core is an execution unit for integer and floating-point numbers
- GigaThread is the global scheduler
    - responsible for distributing thread blocks to the warp schedulers of all SMs

### Streaming multiprocessor

![](./gpu3.png)

- Each of the 16 streaming multiprocessors (SMs) contains 32 CUDA cores
- Every CUDA core is an execution unit for integer and floating-point numbers


### Warp scheduler

![](./gpu6.png)

- The dual warp scheduler breaks up each thread block it is processing into warps
- A warp is a bundle of 32 threads with consecutive thread IDs that start at the same program address
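
The partitioning of a thread block into warps can be sketched as follows. The helper `split_into_warps` is hypothetical and only models how thread IDs are grouped, not how the hardware schedules them.

```python
WARP_SIZE = 32  # threads per warp, as stated above


def split_into_warps(block_size, warp_size=WARP_SIZE):
    """Group the thread IDs 0..block_size-1 into warps of consecutive IDs."""
    return [
        list(range(start, min(start + warp_size, block_size)))
        for start in range(0, block_size, warp_size)
    ]


warps = split_into_warps(100)
print(len(warps))                 # 4 warps for a block of 100 threads
print(warps[1][0], warps[1][-1])  # 32 63: the second warp holds IDs 32..63
print(len(warps[-1]))             # 4: the last warp is only partially filled
```

A block size that is a multiple of 32 avoids the partially filled last warp.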

### Special function units (SFUs)

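- Each SM has four SFUs
- The SFUs execute transcendental instructions, such as sine, cosine, reciprocal, and square root
    - each SFU executes one instruction per thread per clock cycle, so a 32-thread warp takes eight clock cycles across the four SFUs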

### Load/store units (LD/ST)

The LD/ST units load data from memory and store results back to memory

### Memory hierarchy

![](./gpu4.png)

### Memory hierarchy

![](./gpu5.png)