# CS3210

### Wang Xiyu

## September 1, 2025

# 1 Computer architecture

## 1.1 Parallelism

- Concurrency:
  - Multiple tasks can start, run, and complete in overlapping time periods
  - Tasks may not be executing at the same instant
  - Multiple execution flows make progress over the same period, either by interleaving their execution or by executing at the same time
- Parallelism:
  - Multiple tasks run simultaneously
  - Tasks not only make progress but also execute at the same instant
- Single processor:
  - Bit-level parallelism:
    - Parallelism gained by increasing the processor word size, e.g. adding two 64-bit numbers takes one instruction on a 64-bit machine but multiple instructions on a 32-bit machine
  - Instruction-level parallelism:
    - Pipelining [time parallelism]: the number of pipeline stages is the maximum achievable speedup
    - Superscalar [space parallelism]: duplicates the pipeline so that multiple instructions can be in the same execution stage. Scheduling is challenging: more structural hazards, but fewer cycles per instruction
  - Thread-level parallelism:
    - Simultaneous multithreading (SMT): the processor provides hardware support for multiple thread-level contexts
    - Hyper-Threading (Intel's SMT implementation): executes multiple threads per physical core at the same time
- Multiprocessor:
  - Shared memory
  - Distributed memory
- Multicore processor architecture:
  - Hierarchical design:
    - Multiple cores share multiple caches
    - Cache size increases from leaves to root, as more cores share the same cache
    - External memory is shared by all cores
  - Pipelined design:
    - Data elements are processed by multiple execution cores in a pipelined way
    - Useful for sequential data element processing
  - Network-based design:
    - Cores, their local caches, and memory are connected via an interconnection network
      - Efficient on-chip interconnection: enough bandwidth, scalable
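
The pipelining bullet above states that the number of stages bounds the achievable speedup. A minimal sketch of why, assuming an idealized pipeline in which every stage takes one cycle and there are no stalls or hazards (the function name `pipeline_speedup` is illustrative, not from the lecture):

```python
def pipeline_speedup(num_instructions: int, num_stages: int) -> float:
    """Ideal speedup of a pipelined over an unpipelined processor.

    Assumption: each stage takes exactly one cycle and nothing ever stalls.
    """
    if num_instructions < 1 or num_stages < 1:
        raise ValueError("need at least one instruction and one stage")
    # Without pipelining, every instruction occupies all stages in turn.
    unpipelined_cycles = num_instructions * num_stages
    # With pipelining, the first instruction takes num_stages cycles to
    # fill the pipeline; each later instruction completes one cycle after
    # the previous one.
    pipelined_cycles = num_stages + (num_instructions - 1)
    return unpipelined_cycles / pipelined_cycles


if __name__ == "__main__":
    for n in (1, 10, 1000):
        print(n, pipeline_speedup(n, num_stages=5))
```

Under these assumptions, a 5-stage pipeline gives a speedup of 1.0 for a single instruction and approaches, but never reaches, 5 as the instruction count grows.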

#### Multiprocessing vs. multithreading:

- Multiprocessing: high overhead, e.g. from context switches; able to utilize multiple processing units
- Multithreading: low overhead, but effectively utilising the same processing unit
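
As a rough illustration of the multithreading side, the sketch below starts several threads that update one shared counter. It uses Python's standard `threading` module; `run_threads` and its parameters are hypothetical names chosen for this example. The threads share one address space, which keeps sharing cheap but means a lock is needed to keep the updates correct.

```python
import threading


def run_threads(num_threads: int = 4, increments: int = 10_000) -> int:
    """Increment one shared counter from several threads and return it."""
    counter = 0
    lock = threading.Lock()

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # The read-modify-write below is not atomic, so guard it.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every thread has finished
    return counter


if __name__ == "__main__":
    print(run_threads())
```

In the default CPython build, the global interpreter lock lets only one thread execute Python bytecode at a time, so this example shows concurrency (interleaved progress) rather than parallelism.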

Table 1: Distinguishing "processor", "core", "processing unit", and "logical core"

| Term                                   | What it is                                                                                   | Independence / Context                                           | Shares With                                                                            | OS Sees / Notes                                                                                          |
|----------------------------------------|----------------------------------------------------------------------------------------------|------------------------------------------------------------------|----------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------|
| Processor (CPU / package)              | Physical chip/package containing compute resources (cores, caches, I/O, memory controller).  | Multiple cores; each core is an independent execution engine.    | All cores share off-core resources (e.g., memory controller, sometimes LLC).           | OS sees one processor package with $N$ cores; socket count in NUMA systems.                              |
| Core (physical core)                   | An independent execution pipeline (fetch/decode/execute).                                    | Can run its own instruction stream; has private L1/L2 (often).   | Shares last-level cache (LLC) and memory controller with other cores on the processor. | OS schedules threads to cores; true parallelism across cores.                                            |
| Processing unit (execution context)    | Generic term for a hardware context that can run instructions (a core or a hardware thread). | Independent program counter, registers, and a stack.             | If it is a hardware thread, shares core pipelines with sibling threads.                | Ambiguous in literature; often equals "schedulable hardware context".                                    |
| Logical core (hardware thread, SMT/HT) | A virtualized execution context exposed by SMT/Hyper-Threading.                              | Own PC/regs/stack, but shares core's execution units and caches. | Siblings on the same physical core compete for pipelines, cache ports, bandwidth.      | OS treats each logical core as a CPU; throughput gain depends on resource contention and ILP.            |
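
The last column of Table 1 can be checked from a program. A small sketch, assuming Python's standard `os` module: `os.cpu_count()` reports logical CPUs (hardware threads), not physical cores, and `os.sched_getaffinity` exists only on some Unix platforms, so it is guarded. The helper names are illustrative.

```python
import os


def logical_cpu_count() -> int:
    """Logical CPUs the OS reports (hardware threads, not physical cores)."""
    count = os.cpu_count()  # may be None if it cannot be determined
    return count if count is not None else 1


def usable_cpu_count() -> int:
    """Logical CPUs this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # not available on every platform
        return len(os.sched_getaffinity(0))
    return logical_cpu_count()


if __name__ == "__main__":
    print("logical CPUs:", logical_cpu_count())
    print("usable by this process:", usable_cpu_count())
```

On a machine with SMT enabled, the first number is typically a multiple of the physical core count.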

## 1.2 Memory Organization

#### Parallel computers:

- Distributed-memory: multiple computers
  - Each node is an independent unit with its own processor, memory, etc.
  - Memory modules are physically distributed; memory is local and private to each node
- Shared-memory: multiprocessor
  - Programs and threads access memory through a shared memory provider, unaware of the actual hardware memory architecture
  - Requires cache coherence and memory consistency
  - Cache coherence: when one processing unit (PU) updates data locally, other PUs should see the change reflected in their cached copies of the same data
  - Memory consistency: defines the order in which memory operations from one PU become visible to other PUs
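
The cache coherence requirement above can be illustrated with a toy model. This is only a sketch under strong assumptions: one value per address, writes go straight to memory, and a write simply drops other units' copies. The class `ToyCoherentSystem` and its methods are hypothetical; real hardware protocols are considerably more involved.

```python
class ToyCoherentSystem:
    """Toy shared memory with one private cache per processing unit."""

    def __init__(self, num_units: int) -> None:
        self.memory: dict[int, int] = {}
        self.caches: list[dict[int, int]] = [{} for _ in range(num_units)]

    def read(self, unit: int, addr: int) -> int:
        cache = self.caches[unit]
        if addr not in cache:
            # Cache miss: fetch the current value from main memory.
            cache[addr] = self.memory.get(addr, 0)
        return cache[addr]

    def write(self, unit: int, addr: int, value: int) -> None:
        # Assumption: write-through, so memory is always up to date.
        self.memory[addr] = value
        self.caches[unit][addr] = value
        # Drop stale copies held by other units so that their next read
        # misses and observes the new value.
        for other, cache in enumerate(self.caches):
            if other != unit:
                cache.pop(addr, None)


if __name__ == "__main__":
    system = ToyCoherentSystem(num_units=2)
    system.write(0, 0x10, 1)
    print(system.read(1, 0x10))
    system.write(0, 0x10, 2)
    print(system.read(1, 0x10))
```

Without the invalidation loop in `write`, unit 1 would keep returning its old cached value after unit 0's second write, which is exactly the situation coherence is meant to rule out.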

#### Shared memory model:

- Uniform memory access (UMA): all processors see the same latency when accessing main memory. Contention makes this unsuitable for a large number of processors.
- Non-uniform memory access (NUMA)
- Cache-coherent non-uniform memory access (ccNUMA)
- Cache-only memory architecture (COMA)
- Hybrid (distributed shared memory)
- Latency: time taken for a request from the processor to be serviced by the memory
- Bandwidth: the rate at which the memory system can provide data to the processor
- Stall: when the processor cannot run the next instruction in an instruction stream because it depends on a previous instruction
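
To see how latency and bandwidth interact, a common first-order estimate is access time = latency + size / bandwidth. The sketch below uses that model, which is an assumption added here rather than something stated in the notes; the function name and the numbers in the usage example are purely illustrative.

```python
def transfer_time_ns(size_bytes: int, latency_ns: float,
                     bandwidth_gb_per_s: float) -> float:
    """Estimate the time to fetch `size_bytes` as latency + size / bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    # 1 GB/s = 10**9 bytes per 10**9 ns = 1 byte per ns.
    bytes_per_ns = bandwidth_gb_per_s
    return latency_ns + size_bytes / bytes_per_ns


if __name__ == "__main__":
    # Illustrative values only: 100 ns latency, 25 GB/s bandwidth.
    print(transfer_time_ns(64, 100.0, 25.0))         # one 64-byte request
    print(transfer_time_ns(1024 * 1024, 100.0, 25.0))  # one 1 MiB request
```

With these illustrative values, the small request is dominated by latency while the large one is dominated by bandwidth.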

Table 2: Flynn's taxonomy: instruction/data streams, examples, and caveats

| Class | Instr.<br>Streams | $egin{array}{c} \mathbf{Data} \\ \mathbf{Streams} \end{array}$ | Typical Examples                                                                           | Notes / Caveats                                                                                                                                         |
|-------|-------------------|----------------------------------------------------------------|--------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
| SISD  | 1                 | 1                                                              | Classic single-core CPUs; pipelined or superscalar scalar execution.                       | Pipelining and superscalar execution improve throughput, but pipelining does not add new data streams and superscalar execution does not add new instruction streams. |
| SIMD  | 1                 | Many                                                           | Vector/SIMD ISAs (SSE/AVX/NEON), GPU warp/warp-lane execution, vector processors.          | Same instruction applied to multiple data elements in lock-step; divergence harms efficiency (masks).                                                   |
| MISD  | Many              | 1                                                              | Rare/mostly theoretical; certain fault-tolerant or systolic designs sometimes cited.       | Not common in general-purpose computing; examples are niche/controversial.                                                                              |
| MIMD  | Many              | Many                                                           | Multicore/multiprocessor systems, clusters; CPUs with SMT (each HW thread its own stream). | Threads/processes can execute different code on different data; includes shared-memory and distributed models.                                          |
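
The SIMD row notes that divergence harms efficiency. A toy sketch of why, assuming a simplified lock-step model in which every lane executes both sides of a branch and a mask selects which result to keep (plain Python lists stand in for vector lanes; `masked_select` is a hypothetical helper, not a real SIMD API):

```python
from typing import Callable, List


def masked_select(lanes: List[int],
                  predicate: Callable[[int], bool],
                  then_op: Callable[[int], int],
                  else_op: Callable[[int], int]) -> List[int]:
    """Emulate `if predicate(x): then_op(x) else: else_op(x)` in lock-step."""
    # Step 1: one "instruction" computes the mask for all lanes.
    mask = [predicate(x) for x in lanes]
    # Step 2: all lanes execute the then-path, whether they need it or not.
    then_results = [then_op(x) for x in lanes]
    # Step 3: all lanes execute the else-path as well.
    else_results = [else_op(x) for x in lanes]
    # Step 4: the mask picks the result each lane actually keeps.
    return [t if m else e for m, t, e in zip(mask, then_results, else_results)]


if __name__ == "__main__":
    result = masked_select([1, 2, 3, 4],
                           predicate=lambda x: x % 2 == 0,
                           then_op=lambda x: x * 10,
                           else_op=lambda x: -x)
    print(result)  # [-1, 20, -3, 40]
```

In this model a divergent branch costs the work of both paths for every lane, even though each lane keeps only one result.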

Hybrids: Modern systems often combine MIMD (many cores/threads) with SIMD (per-core vectors). A single program may be MIMD + SIMD (e.g., OpenMP across cores + AVX within each core; GPUs: MIMD across warps/SMs, SIMD within a warp).