### Flynn's Classification



- Of the four machine models, most parallel computers built in the past assumed the MIMD model for general purpose computations.
- The SIMD and MISD models are more suitable for special-purpose computations.
- For this reason, MIMD is the most popular model, SIMD next, and MISD the least popular model being applied in commercial machines.





(d) MISD architecture (the systolic array)

### **Multiprocessors and Multicomputers**

- These physical models are distinguished by having a shared common memory or unshared distributed memories.
- The processors in a multiprocessor system communicate with each other through shared variables in a common memory.
- Each computer node in a multicomputer system has a local memory, unshared with other nodes. Inter-processor communication is done through message passing among the nodes.
- There are **three** types of **shared memory multiprocessor**:
  - UMA (Uniform Memory Access)
  - NUMA (Non- uniform Memory Access)
  - COMA (Cache Only Memory)

# **Uniform Memory Access (UMA)**

- Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
- Identical processors.
- Equal access and access times to memory.
- Sometimes called CC-UMA Cache Coherent UMA.
- Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
- Multiprocessors are tightly coupled.
- The UMA model is suitable for general-purpose and times haring applications by multiple users.



# Non-Uniform Memory Access (NUMA)

- Often made by physically linking two or more SMPs
- One SMP can directly access memory of another SMP
- Not all processors have equal access time to all memories
- Memory access across link is slower.
- If cache coherency is maintained, then may also be called CC-NUMA
  - Cache Coherent NUMA

# Non-Uniform Memory Access (NUMA)



(a) Shared local memories (e.g. the N Butterfly) (b) A hierarchical cluster model (e.g. the Cedar system at the University of Illinois)

# The COMA model (Cache only Memory Access)

 The COMA model is a special case of NUMA machine in which the distributed main memories are converted to caches.

All caches form a global address space and there is no memory hierarchy at

each processor node.



P: Processor; C: Cache; D: Directory

#### **Distributed-Memory Multicomputers**

- The system consists of multiple computers, often called nodes, interconnected by a message-passing network.
- Each node is an autonomous computer consisting of a processor, local memory, and sometimes attached disks or I/O peripherals.



#### **PRAM Variants**



#### Parallel Multiplication of nxn matrices on CREW PRAM

- Input: Two n x n matrices A and B
- Output: n x n Product C matrix
- #PEs: n³, later reduced to n³/log₂n
- Time Complexity: O(log<sub>2</sub>n) for both cases

```
Step 2
Step 1

    ℓ ← n

      1. Read A(i, k)
                                                   Repeat
                                                           \ell \leftarrow \ell/2
      2. Read B(k, j)
                                                           if (k < \ell) then
      3. Compute A(i, k) \times B(k, j)
                                                              begin
     4. Store in C(i, j, k)
                                                                  Read C(i, j, k)
                                                                  Read C(i, j, k+\ell)
                                                                  Compute C(i, j, k) + C(i, j, k + \ell)
                                                                  Store in C(i, j, k)
                                                        until (\ell = 1)
```





### **VLSI Complexity Model**

- Parallel computers rely on the use of VLSI chips to fabricate the major components such as processor arrays, memory arrays, and large-scale switching networks.
- The **AT**<sup>2</sup> **model** models the constraints while fabricating VLSI chip, these constraints include:
- Memory Bound on Chip Area: The amount of information processed by the chip can be visualized as information flow upward across the chip area. Each bit can flow through a unit area of the horizontal chip slice. Thus, the chip area bounds the amount of memory bits stored on the chip.
- I/O Bound on Volume AT: The volume of the rectangular cube is represented by the product AT. As information flows through the chip for a period of time T, the number of input bits cannot exceed the volume AT



(a) Memory-limited bound on chip area
A and I/O-limited bound on chip history
represented by the volume AT

### **VLSI Complexity Model**

- Bisection Communication Bound: a communication limited lower bound on the bisection area.
- The bisection area represents the maximum amount of information exchange between the two halves of the chip circuit during the time period T.
- If S be the problem size involved in computation, then it has been seen that there exists a lower bound f(S) such that:

$$O(f(S)) \leq AT^2$$



#### Estimating chip area 'A' and compute time 'T' nxn matrix multiplication

Doall 10 for 
$$0 \le i, j \le n-1$$

PE $(i, j)$  sets  $C(i, j)$  to 0 /Initialization/

Do 50 for  $0 \le k \le n-1$ 

Doall 20 for  $0 \le i \le n-1$ 

PE $(i, k)$  broadcasts  $A(i, k)$  along its row bus

Doall 30 for  $0 \le j \le n-1$ 

PE $(k, j)$  broadcasts  $B(k, j)$  along its column bus

PE $(k, j)$  broadcasts  $B(k, j)$  along its column bus

PE $(i, j)$  now has  $A(i, k)$  and  $B(k, j)$ ,  $0 \le i, j \le n-1$ /

Doall 40 for  $0 \le i, j \le n-1$ 

PE $(i, j)$  computes  $C(i, j) \leftarrow C(i, j) + A(i, k) \times B(k, j)$ 

Continue

The above algorithm has a sequential loop along the dimension indexed by k. It takes n time units (iterations) in this k-loop. Thus, we have T = O(n). Therefore,  $AT^2 = O(n^2) \cdot (O(n))^2 = O(n^4)$ 



A **4 x 4** mesh of processing-elements (PB) with broadcast buses on each row and on each column

#### Comparison between dataflow and control-flow computers



(a) A sample program and its dataflow graph

(c) Data-driven execution on a 4-processor dataflow computer in 14 cycles

#### (b) Sequential execution on a uniprocessor in 48 cycles



(d) Parallel execution on a shared-memory 4-processor system in 14 cycles

| Machine Model       | Control Flow (control-driven)                                                            | Dataflow (data-driven)                                                                   | Reduction (demand-driven)                                                                                    |  |
|---------------------|------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------|--|
| Basic<br>Definition | Conventional computation; token of control indicates when a statement should be executed | Eager evaluation; statements<br>are executed when all of their<br>operands are available | Lazy evaluation; statements<br>are executed only when<br>their result is required for<br>another computation |  |
| Advantages          | Full control  The most successful model for commercial products                          | Very high potential for parallelism                                                      | Only required instructions are executed                                                                      |  |
|                     | Complex data and control structures are easily implemented                               | High throughput                                                                          | High degree of parallelism                                                                                   |  |
|                     |                                                                                          | Free from side effects                                                                   | Easy manipulation of data<br>structures                                                                      |  |
| Disadvantages       | In theory, less efficient than the other two                                             | Time lost waiting for unneeded arguments                                                 | Does not support sharing of<br>objects with changing local<br>state                                          |  |
|                     | Difficult in preventing run-time errors                                                  | High control overhead                                                                    |                                                                                                              |  |
|                     |                                                                                          | Difficult in manipulating data structures                                                | Time needed to propagate<br>demand tokens                                                                    |  |

#### SYSTEM INTERCONNECT ARCHITECTURES

- These include networks which are used for interconnecting computer subsystems or for constructing multiprocessors or multicomputers.
- These networks can be used for internal connections among processors, memory modules, and HO adaptors in a centralized system, or for distributed networking of rnulticoniputor nodes.
- The topology of an interconnection network can be either static or dynamic.
  - Static networks are formed of point-to-point direct connections which will not change during program execution.
    - They are used for fined connections among subsystems of e centralized system or multiple computing nodes of a distributed system.
  - Dynamic networks are implemented with switched channels, which are dynamically configured to match the communication demand in user programs.
    - They include buses, crossbar switches, multistage networks, and routers which are often used in shared-memory multiprocessors.

#### SYSTEM INTERCONNECT ARCHITECTURES

- Node Degree (d): The number of edges {links or channels} incident on a node.
  - In the case of unidirectional channels, the number of channels into a node is the indegree, and that out of a node is the outdegree.
  - The node degree should be kept a (small) constant, in order to reduce oost.
- Diameter (D): of a network is the maximum shortest path between any two nodes.
  - The path length is measured by the number of links traversed.

#### SYSTEM INTERCONNECT ARCHITECTURES

- Bisection Width (b): When a given network is cut into two equal halves, the minimum number of edges (channels) along the cut is called the channel bisection width.
  - If a channel has w bit wires, then wire bisection width B = bw, reflecting the wiring density of network.
  - o If B is fixed, then **w** = **B/b**, providing a good indicator of the maximum communication bandwidth along the bisection of a network.
- Data-Routing Functions: A data-routing network is used for inter-PE data exchange.
  - Commonly seen data-routing functions among the PEs include shifting, rotation, permutation (one-to-one), broadcast (one-to-all), multicast (one-to-many), shuffle, exchange, etc.
  - These routing functions can be implemented on ring, mesh, hypercube, or multistage networks.
  - E.g. permutation pi = (a,b,c)(d,e) means a->b, b->c, c->a, d->e, e->d, where (a,b,c) has period of 3 and (d,e) has period of 2. Combining the two will result into the permutation of cycle 3x2=6.

# Hypercube Routing Functions



(a) A 3-cube with nodes denoted as C2C1C0 in binary



Fig. 2.15 Three routing functions defined by a binary 3-cube

# Other static connection networks





| Network type            | Node<br>degree, d | Network<br>diameter, | No. of<br>links, l | Bisection width, B | Symmetry | Remarks on<br>network size                             |
|-------------------------|-------------------|----------------------|--------------------|--------------------|----------|--------------------------------------------------------|
| Linear Array            | 2                 | N-1                  | N-1                | 1                  | No       | N nodes                                                |
| Ring                    | 2                 | LN/2J                | N                  | 2                  | Yes      | N nodes                                                |
| Completely<br>Connected | N-1               | 1                    | N(N - 1)/2         | $(N/2)^2$          | Yes      | N nodes                                                |
| Binary<br>Tree          | 3                 | 2(h - 1)             | N-1                | 1                  | No       | Tree height $h = \lceil \log_2 N \rceil$               |
| Star                    | N-1               | 2                    | N-1                | LN/2 ]             | No       | N nodes                                                |
| 2D-Mesh                 | 4                 | 2(r-1)               | 2N-2r              | r                  | No       | $r \times r$ mesh<br>where $r = \sqrt{N}$              |
| Illiac<br>Mesh          | 4                 | r-1                  | 2N                 | 2r                 | No       | Equivalent to a chordal ring of $r = \sqrt{N}$         |
| 2D-Torus                | 4                 | 2[1/2]               | 2 <i>N</i>         | 2r                 | Yes      | $r \times r$ torus<br>where $r = \sqrt{N}$             |
| Hypercube               | n                 | "                    | nN/2               | N/2                | Yes      | N nodes,<br>$n = \log_2 N$<br>(dimension)              |
| ccc                     | 3                 | 2k - 1 + Lk/2        | 3N/2               | N/(2k)             | Yes      | $N = k \times 2^k$ nodes with a cycle length $k \ge 3$ |
| k-ary n-cube            | 2n                | n_k/2_               | nN                 | 2k**-1             | Yes      | $N = k^n$ nodes                                        |