# CS405 Computer System Architecture

Dr CK Raju

Dept of CSE

August 6, 2018



#### Presentation Outline

- Module I: Parallel Computer Models
  - Evolution of Computer Architecture
  - System Attributes to Performance
  - Multiprocessors and Multicomputers
  - Multivector and SIMD





Architecture of a Simple Processor



- Architecture of a Simple Processor
- Parallel Computer Models



- Architecture of a Simple Processor
- Parallel Computer Models
- Evolution of Computer Architecture



- Architecture of a Simple Processor
- Parallel Computer Models
- Evolution of Computer Architecture
- System Attributes to Performance



- Architecture of a Simple Processor
- Parallel Computer Models
- Evolution of Computer Architecture
- System Attributes to Performance
- Multiprocessors and Multicomputers



- Architecture of a Simple Processor
- Parallel Computer Models
- Evolution of Computer Architecture
- System Attributes to Performance
- Multiprocessors and Multicomputers
- Multivector and SIMD Computers



- Architecture of a Simple Processor
- Parallel Computer Models
- Evolution of Computer Architecture
- System Attributes to Performance
- Multiprocessors and Multicomputers
- Multivector and SIMD Computers
- Conditions of Parallelism



#### Hardware Architecture



ALU (arithmetic, logical and shift), FPU (Floating Point Unit), IR (Instruction Register), PC (Program Counter), PSW (Program Status Word)



## Algorithm of a Simple Processor

#### repeat forever begin

- fetch the next instruction from main memory almost same for all instructions
- execute i.e. carry out the instruction fetched might vary with every instruction



## Algorithm of a Simple Processor

# repeat forever begin

- fetch the next instruction from main memory almost same for all instructions
- execute i.e. carry out the instruction fetched might vary with every instruction

Note: There are two views for the processor

- (a) In one view, machine instruction set architecture (ISA) defines microprocessor
- (b) In other view, hardware design and circuitry defines the microprocessor end



| Generation           | Technology              | Systems               |
|----------------------|-------------------------|-----------------------|
| First (1945-54)      | Vacuum tubes, relays    | Eniac, IBM 701        |
| Second (1955-64)     | Transistors             | Univac, IBM 7090,     |
| Third (1965-74)      | Integrated Circuits     | IBM 360, PDP-8        |
| Fourth (1975-90)     | Microprocessors         | VAX 9000, Cray X-MP   |
| Fifth (1991 onwards) | Artificial Intelligence | IBM ES 9000, Xeon PHI |



| Generation | Architecture               | Systems               |
|------------|----------------------------|-----------------------|
| First      | PC, ALU, FP Arithmetic     | Eniac, IBM 701        |
| Second     | Multiplexed Memory Access  | Univac, IBM 7090,     |
| Third      | SSI, MSI, Cache            | IBM 360, PDP-8        |
| Fourth     | LSI, VLSI, multiprocessors | VAX 9000, Cray X-MP   |
| Fifth      | Scalable Multicomputers    | IBM ES 9000, Xeon PHI |



| Generation | Software                        | Systems               |
|------------|---------------------------------|-----------------------|
| First      | Assembly, Machine               | Eniac, IBM 701        |
| Second     | HLL with compilers              | Univac, IBM 7090,     |
| Third      | Multiprogramming OS             | IBM 360, PDP-8        |
| Fourth     | Multiprocessor OS, Parallel     | VAX 9000, Cray X-MP   |
| Fifth      | Superscalar, massively parallel | IBM ES 9000, Xeon PHI |









 Lookahead techniques were introduced to prefetch instructions in order to overlap instruction fetch decode execute and to enable functional parallelism



- Lookahead techniques were introduced to prefetch instructions in order to overlap instruction fetch decode execute and to enable functional parallelism
- Functional parallelism multiple functional units executed simultaneously and practise pipelining





- Lookahead techniques were introduced to prefetch instructions in order to overlap instruction fetch decode execute and to enable functional parallelism
- Functional parallelism multiple functional units executed simultaneously and practise pipelining
- Pipelining performing identical operations repeatedly over vector data strings - either implicitly or explicitly



- Lookahead techniques were introduced to prefetch instructions in order to overlap instruction fetch decode execute and to enable functional parallelism
- Functional parallelism multiple functional units executed simultaneously and practise pipelining
- Pipelining performing identical operations repeatedly over vector data strings - either implicitly or explicitly

Your smartphone could be an octa-core processor-based ...



- Lookahead techniques were introduced to prefetch instructions in order to overlap instruction fetch decode execute and to enable functional parallelism
- Functional parallelism multiple functional units executed simultaneously and practise pipelining
- Pipelining performing identical operations repeatedly over vector data strings - either implicitly or explicitly

Your smartphone could be an octa-core processor-based ... What should be its parallel architecture ...



- Lookahead techniques were introduced to prefetch instructions in order to overlap instruction fetch decode execute and to enable functional parallelism
- Functional parallelism multiple functional units executed simultaneously and practise pipelining
- Pipelining performing identical operations repeatedly over vector data strings - either implicitly or explicitly

Your smartphone could be an octa-core processor-based ... What should be its parallel architecture ... think, refer and give a feedback in the next session



 Michael Flynn - introduced classification based on instruction streams and data streams



- Michael Flynn introduced classification based on instruction streams and data streams
- Conventional sequentional machines were SISD single instruction stream over single data stream



- Michael Flynn introduced classification based on instruction streams and data streams
- Conventional sequentional machines were SISD single instruction stream over single data stream
- Vector computers equipped with scalar or vector hardware appear as SIMD



- Michael Flynn introduced classification based on instruction streams and data streams
- Conventional sequentional machines were SISD single instruction stream over single data stream
- Vector computers equipped with scalar or vector hardware appear as SIMD
- Parallel computers are reserved for MIMD machines



- Michael Flynn introduced classification based on instruction streams and data streams
- Conventional sequentional machines were SISD single instruction stream over single data stream
- Vector computers equipped with scalar or vector hardware appear as SIMD
- Parallel computers are reserved for MIMD machines
- A MISD architecture called systolic arrays for pipelined execution of specific algorithms is also present





Machine performance is usually dependent on a variety of factors like

hardware technology



- hardware technology
- innovative architectural features



- hardware technology
- innovative architectural features
- efficient resource management



- hardware technology
- innovative architectural features
- efficient resource management
- algorithm design



Machine performance is usually dependent on a variety of factors like

- hardware technology
- innovative architectural features
- efficient resource management
- algorithm design
- data structures, language efficiency, programmer skills, compiler technology etc

Note: Generally its *turnaround time* that is considered that includes disk and memory access, input and output activities, compilation time, OS overhead and CPU time. Since CPU time includes system CPU time and user CPU time, we focus only on *user CPU time* for the purpose of benchmarking performance.



CPU driven by clock with constant cycle time T



- CPU driven by clock with constant cycle time T
- ullet Clock Rate is f=1/T





- CPU driven by clock with constant cycle time T
- Clock Rate is f = 1/T
- ullet Size of program is determined by Instruction Count I<sub>c</sub> no of instructions to be executed in the program





- CPU driven by clock with constant cycle time T
- Clock Rate is f = 1/T
- Size of program is determined by Instruction Count I<sub>c</sub> no of instructions to be executed in the program
- Different instructions use different clock cycles



- CPU driven by clock with constant cycle time T
- Clock Rate is f = 1/T
- Size of program is determined by Instruction Count I<sub>c</sub> no of instructions to be executed in the program
- Different instructions use different clock cycles
- Cycles per instruction CPI is therefore used as a parameter to measure time





- CPU driven by clock with constant cycle time T
- Clock Rate is f = 1/T
- Size of program is determined by Instruction Count I<sub>c</sub> no of instructions to be executed in the program
- Different instructions use different clock cycles
- Cycles per instruction CPI is therefore used as a parameter to measure time
- Average CPI is found out for the program (taking into consideration) all its instructions and the clock cycles needed for executing these)







 $\bullet$  Let Instruction Count of a program be  $I_{c}$ 



- $\bullet$  Let Instruction Count of a program be  $I_c$
- CPU time =  $I_c \times CPI \times T$  (const cycle time = 1/f)



- Let Instruction Count of a program be Ic
- CPU time =  $I_c \times CPI \times T$  (const cycle time = 1/f)
- Usually an instruction might contain fetch (instruction), decode (instruction), fetch operands, execute, and store (results)



- Let Instruction Count of a program be I<sub>c</sub>
- CPU time =  $I_c \times CPI \times T$  (const cycle time = 1/f)
- Usually an instruction might contain fetch (instruction), decode (instruction), fetch operands, execute, and store (results)
- Here instruction decode and execute are carried out by CPU. Other stages need access to memory.



- Let Instruction Count of a program be Ic
- CPU time =  $I_c \times CPI \times T$  (const cycle time = 1/f)
- Usually an instruction might contain fetch (instruction), decode (instruction), fetch operands, execute, and store (results)
- Here instruction decode and execute are carried out by CPU. Other stages need access to memory.
- A memory cycle is time needed to complete one memory reference usually memory cycle is k times processor cycle (ratio)



- Let Instruction Count of a program be I<sub>c</sub>
- CPU time =  $I_c \times CPI \times T$  (const cycle time = 1/f)
- Usually an instruction might contain fetch (instruction), decode (instruction), fetch operands, execute, and store (results)
- Here instruction decode and execute are carried out by CPU. Other stages need access to memory.
- A memory cycle is time needed to complete one memory reference usually memory cycle is k times processor cycle (ratio)
- Therefore total time =  $I_c \times (p + m \times k) \times T$  where p processor cycles needed for decode and execute, m is no of memory references needed and k is ratio between memory cycle and processor cycle, the instruction count and T is the processor cycle time



 $\bullet$  The five performance factors -  $I_c$  , p, m, k, and T are influenced by four system attributes



- $\bullet$  The five performance factors  $I_c$  , p, m, k, and T are influenced by four system attributes
- ISA instruction set architecture, Compiler technology, CPU implementation and control, Cache and memory hierarchy



- ullet The five performance factors  $I_c$  , p, m, k, and T are influenced by four system attributes
- ISA instruction set architecture, Compiler technology, CPU implementation and control, Cache and memory hierarchy
- $\bullet$  ISA affects program length (Ic) and processor cycles needed (p)



- $\bullet$  The five performance factors  $I_c$  , p, m, k, and T are influenced by four system attributes
- ISA instruction set architecture, Compiler technology, CPU implementation and control, Cache and memory hierarchy
- ullet ISA affects program length ( $I_c$ ) and processor cycles needed (p)
- Compiler technology affects program length ( $I_c$ ), processor cycles needed (p) and memory reference count (m)



- $\bullet$  The five performance factors  $I_c$  , p, m, k, and T are influenced by four system attributes
- ISA instruction set architecture, Compiler technology, CPU implementation and control, Cache and memory hierarchy
- ullet ISA affects program length ( $I_c$ ) and processor cycles needed (p)
- Compiler technology affects program length  $(I_c)$ , processor cycles needed (p) and memory reference count (m)
- $\bullet$  CPU implementation and control determine total processor time p  $\times$  T needed



- $\bullet$  The five performance factors  $I_c$  , p, m, k, and T are influenced by four system attributes
- ISA instruction set architecture, Compiler technology, CPU implementation and control, Cache and memory hierarchy
- ullet ISA affects program length ( $I_c$ ) and processor cycles needed (p)
- Compiler technology affects program length (I<sub>c</sub>), processor cycles needed (p) and memory reference count (m)
- $\bullet$  CPU implementation and control determine total processor time p  $\times$  T needed
- Memory technology and hierarchy design influence memory access latency  $(k \times T)$

## Performance Factors versus System Attributes

# THE STATE OF COMPUTING System Attributes to Performance

| System<br>Attributes                       | Performance Factors |                                                       |                                             |                              |                 |  |
|--------------------------------------------|---------------------|-------------------------------------------------------|---------------------------------------------|------------------------------|-----------------|--|
|                                            | Instruction         | Average Cycles per Instruction (CPI)                  |                                             |                              | Processor Cycle |  |
|                                            | Count (I)           | Processor<br>Cycles per<br>Instruction<br>(CPI and p) | Memory<br>References per<br>Instruction (m) | Memory Access<br>Latency (k) | Time (τ)        |  |
| Instruction-set<br>Architecture            |                     |                                                       |                                             |                              |                 |  |
| Compiler<br>Technology                     | V                   |                                                       | $\square$                                   |                              |                 |  |
| Processor<br>Implementation<br>and Control |                     | <b>V</b>                                              |                                             |                              | <b>V</b>        |  |
| Cache and<br>Memory<br>Hierarchy           |                     |                                                       |                                             | <b>V</b>                     | <b>V</b>        |  |

## Multiprocessors and Multicomputers

There are two main categories of parallel computer systems

- systems with common memory that are shared
- systems with distributed memory that are not shared



## Shared memory Multiprocessors

Shared memory multiprocessor models are of three types:

- Uniform-memory-access (UMA)
- Nonuniform-memory-access (NUMA)
- Cache-only memory architecture (COMA)

These systems differ in how the memory and peripheral resources are shared or distributed.





#### UMA Model I

- Physical memory uniformly shared by all processors, with equal access time to all words
- Processors may have local cache memories
- Peripherals also shared in some fashion
- Tightly coupled systems use a common bus, crossbar, or multistage network to connect processors, peripherals and memories
- Many manufacturers have multiprocessor (MP) extensions of uniprocessor (UP) product lines



#### UMA Model II

- Synchronization and communication among processors achieved through shared variables in common memory
- Symmetric MP systems all processors have access to all peripherals, and any processor can run the OS and I/O device drivers
- Asymmetric MP systems not all peripherals accessible by all processors; kernel runs only on selected processors (master node); others are called attached processors (AP)





## The UMA Multiprocessor Model





#### NUMA Model I

- Shared memories, but access time depends on the location of the data item
- Shared memory is distributed among the processors as local memory, but each of these is still accessible by all processors (with varying access times)
- Memory access is fastest from the locally connected processor, with the interconnection network adding delays for other processor accesses.
- Additionally there may be global memory in a multiprocessor system, with two separate interconnection networks, one for clusters of processors and their cluster memories, and the other for the global shared memories

#### **Shared Local Memories**





#### Hierarchical Cluster Model





### **COMA Model**

- In the COMA model, processors only have cache memories; the caches, taken together, form a global address space
- Each cache has an associated directory that aids remote machines in the lookups; hierarchical directories may exist in machines based on this model
- Initial data placement is not critical, as cache blocks will eventually migrate to where they are needed



## Cache-Only Memory Architecture





#### Other Models

There are also other models used in multiprocessor systems, based on a combination of models just described.

- Cache-coherent non-uniform memory access (each processor has a cache directory, and the system has a distributed shared memory)
- cache-coherent cache-only model (processors have caches, no shared memory, caches must be kept coherent)



## Some Early Commercial Multiprocessor Systems

| Company<br>and<br>Model           | Hardware and<br>Architecture                                                                       | Software and<br>Applications                                                       | Remarks                                                                  |
|-----------------------------------|----------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------|--------------------------------------------------------------------------|
| Sequent<br>Symmetry<br>S-81       | Bus-connected with 30 i386 processors, IPC via SLIC bus; Weitek floating-point accelerator.        | DYNIX/OS,<br>KAP/Sequent<br>preprocessor,<br>transaction<br>multiprocessing.       | Latter models<br>designed with<br>faster<br>processors of<br>the family. |
| IBM<br>ES/9000<br>Model<br>900/VF | 6 ES/9000 processors with vector facilities, crossbar connected to I/O channels and shared memory. | OS support: MVS,<br>VM KMS,<br>AIX/370, parallel<br>Fortran, VSF V2.5<br>compiler. | channels,                                                                |
| BBN TC-<br>2000                   | 512 M88100 processors with local memory connected by a Butterfly switch, a NUMA machine.           | Ported Mach/OS with multiclustering, parallel Fortran, time-critical applications. | Latter models<br>designed with<br>faster<br>processors of<br>the family. |





## Multicomputer Models

- Multicomputers consist of multiple computers, or nodes interconnected by a message passing network
- Each node is autonomous, with its own processor and local memory and sometimes local peripherals
- The message-passing network provides point-to-point static connections among the nodes
- Local memories are not shared, so traditional multicomputers are sometimes called no-remote-memory-access (or NORMA) machines
- Inter-node communication is achieved by passing messages through the static connection network

29 / 38

## Generic Message Passing Multicomputer





## Multicomputer Generations

- Each multicomputer uses routers and channels in its interconnection network, and heterogeneous systems may involve mixed mode types and uniform data representation and communication protocols
- First generation: hypercube architecture, software-controlled message switching, processor boards
- Second generation: mesh-connected architecture, hardware message switching, software for medium-grain distributed computing
- Third generation: Fine-grained distributed computing, with each VLSI chip containing the processor and communication resources.



## Some Early Commercial Multicomputer Systems

| System<br>Features                                   | Intel Paragon<br>XP/S                                                                  | nCUBE/2 6480                                                                                      | Parsys<br>SuperNode1000                                                    |
|------------------------------------------------------|----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------|
| Node Types<br>and<br>Memory                          | 50 MHz i860 XP computing nodes with 16–128 Mbytes per node, special I/O service nodes. | Each node contains<br>a CISC 64-bit CPU,<br>with FPU, 14 DMA<br>ports, with 1–64<br>Mbytes /node. | EC-funded Esprit supernode built with multiple T-800 Transputers per node. |
| Network<br>and I/O                                   | 2-D mesh with<br>SCSI, HIPPI, VME,<br>Ethernet, and<br>custom I/O.                     | 13-dimensional<br>hypercube of 8192<br>nodes, 512-Gbyte<br>memory, 64 I/O<br>boards.              | Reconfigurable interconnect, expandable to have 1024 processors.           |
| OS and<br>Software<br>Task<br>Parallelism<br>Support | OSF conformance with 4.3 BSD, visualization and programming support.                   | Vertex/OS or UNIX supporting message passing using wormhole routing.                              | IDRIS/OS<br>UNIX-<br>compatible.                                           |
| Application<br>Drivers                               | General sparse matrix methods, parallel data manipulation, strategic computing.        | Scientific number crunching with scalar nodes, database processing.                               | Scientific and academic applications.                                      |
| Performance<br>Remarks                               | 5–300 Gflops peak<br>64-bit results, 2.8–<br>160 GIPS peak<br>integer                  | 27 Gflops peak, 36<br>Gbytes/s I/O                                                                | 200 MIPS to 13<br>GIPS peak.                                               |



## Multivector and SIMD Computers

#### Vector Processors

- Vector Processor Variants
  - Vector Supecomputers
  - Attached Processors
- Vector Processor Models / Architecture
  - Register-to-register architecture (vectors are retrieved from registers)
  - Memory-to-memory architecture (vectors are retrieved from memory)
- Representative Systems
  - Cray-I
  - Cray Y-MP (2, 4 or 8 processors with 16Gflops peak performance)
  - Convex C1, C2, C3 series (C3800 family with 8 processors, 4 GB main memory, 2 Gflops peak performance)
  - DEC VAX 9000 (pipeline chaining support)



## Architecture of Vector Supercomputer - register-to-register





## Representative Vector Supercomputers

| System<br>Model | Vector Hardware Architecture     | Compiler and                 |
|-----------------|----------------------------------|------------------------------|
| Model           | and Capabilities                 | Software Support             |
| Convex          | GaAs-based multiprocessor        | Advanced C, Fortran,         |
| C3800           | with 8 processors and            | and Ada vectorizing          |
| family          | 500-Mbyte/s access port.         | and parallelizing compilers. |
|                 | 4 Gbytes main memory.            | Also support inter-          |
|                 | 2 Gflops peak                    | procedural optimization,     |
|                 | performance with                 | POSIX 1003.1/OS              |
|                 | concurrent scalar/vector         | plus I/O interfaces          |
|                 | operations.                      | and visualization system     |
| Digital         | Integrated vector processing     | MS or ULTRIX/OS,             |
| VAX 9000        | in the VAX environment,          | VAX Fortran and              |
| System          | 125-500 Mflops                   | VAX Vector Instruction       |
|                 | peak performance.                | Emulator (VVIEF)             |
|                 | 63 vector instructions.          | for vectorized               |
|                 | 16 x 64 x 64 vector registers.   | program debugging.           |
|                 | Pipeline chaining possible.      |                              |
| Cray Research   | Y-MP runs with 2, 4, or          | CF77 compiler for            |
| Y-MP and        | 8 processors, 2.67 Gflop         | automatic vectorization,     |
| C-90            | peak with Y-MP8256. C-90         | scalar optimization,         |
|                 | has 2 vector pipes/CPU           | and parallel processing.     |
|                 | built with 10K gate ECL          | UNICOS improved              |
|                 | with 16 Gflops peak performance. | from UNIX/V and              |
|                 |                                  | Berkeley BSD/OS.             |





## Multivector and SIMD Computers

#### SIMD Supercomputers

- SIMD Machine Model
  - $S = \langle N, C, I, M, R \rangle$
  - N: No of PEs in the machine
  - C: Set of instructions (scalar/program flow) directly executed by control unit
  - I: Set of instructions broadcast by CU to all PEs for parallel execution
  - M: Set of masking schemes (to know which processors are active)
  - R: Set of data routing functions
- Representative Systems
  - MasPar MP-1 (1024 to 16384 PEs)
  - CM-2 (65536 PEs)
  - DAP600 Family (upto 4096 PEs)
  - Illiac-IV (64 PEs)



## Operational Model of SIMD Supercomputers





## History of Multivector and SIMD tracks

