



# PMPH Introduction: Course Organization, Hardware Trends, List Homomorphism

Cosmin E. Oancea cosmin.oancea@diku.dk

Department of Computer Science (DIKU) University of Copenhagen

September 2023 PMPH Lecture Slides



# **Intended Learning Outcomes**

List Homomorphism: a way of writing inherently-parallel programs.

- explain when and where PMPH'23 lectures and lab sessions are located in space and time,
- explain Moore's law and argue, based on hardware trends and technological constraints, why all modern and future architectures (will) adopt some form of massive parallelism,
- explain what a list-homomorphic program is, and be able to apply it to build programs;
- illustrate and apply the 1<sup>st</sup> Theorem of List Homomorphism to transform said programs into inherently parallel ones.



- 1 Brief History: Parallelism Paves the Path to Higher Performance
- 2 Course Organization
- 3 Hardware Trends of Critical Components of a Parallel System
  - Processor
  - Memory
  - Interconnect
- 4 Technological Challenges/Constraints
  - Power
  - Reliability
  - Wire Delays
  - Design Complexity
  - CMOS Endpoint
- 5 List Homomorphisms (LH)
  - List Homomorphism Definition and 1st Theorem
  - Almost Homomorphisms Gorlatch'96



### Moore's Law (1960s)

- "Number of transistors in a dense integrated circuit doubles approximately every two years."
- Rephrased as:
  - computing power doubles every 19-24 months, and
  - cost effectiveness (performance/cost) keeps pace.

### **Brief History**

- ICPP, ISCA (1980s/90s): parallel architectures were a popular topic.
- Whatever happened? Mid-1990s "killer micro":
  - path of least resistance: keep increasing the speed of a single CPU
  - commercial arena: multiprocessors were just a uniprocessor extension.
- What changed? The multiprocessor trend in academia & industry:
  - power complexity: $P_{dynamic} \sim Freq^3$. Example!
  - Memory WALL: ever-increasing performance gap between processor & memory.

# Processor: Clock Frequency/Rate

1990-2004: the clock rate increased exponentially.



2004: Intel cancels the Pentium 4 @ 4 GHz and shifts focus to multi-cores.



### Biggest Challenge For Parallel Hardware



### Important Juncture:

- Trend Today: the number of cores grows exponentially.
- Biggest Challenge: develop efficient Massively-Parallel Software!
- Design programs with parallelism in mind, rather than hacking some parallelism out of a sequential implementation.




### When and Where?

#### Lectures:

- Monday 13:00 – 15:00 (øv-4-0-02, Ole Maaløes Vej 5, Biocenter)
- Wednesday 13:00 – 15:00 (Aud 06, Universitetsparken 5, HCØ)

#### Labs:

- Monday 15:00 – 17:00 (øv-4-0-32, Ole Maaløes Vej 5, Biocenter)
- Wednesday 15:00 – 17:00 (aud-NBB 2.0.G.064/070, Jagtvej 155)

#### Flexible Schedule on Wednesday:

- We also reserved (aud NBB 2.0.G.064/070, Jagtvej 155) for Wednesday 10:00 – 12:00;
- In case there is a vast majority of you who prefer starting in the morning, we can re-schedule the lecture and lab on Wednesday.

# Tentative Lecture/Lab Schedule

Your TAs: Anders Holst (anersholst@gmail.com) and Nikolaj Hey Hinnerskov (nihi@di.ku.dk)

The TAs mainly grade and provide feedback on the weekly assignments and moderate the Absalon/Discord discussions.

Cosmin will lead the lectures & labs.

#### Continuous-evaluation assessment:

- four individual weekly assignments: 40% of final grade
- one group project + final presentation and discussion: 60% of final grade.
  - you may choose from multiple possible projects, presented tentatively during a lab at some point (to be announced),
  - or discuss your own project with me;
  - projects can be very practical (in CUDA or Futhark) or more theoretical.



### What does PMPH study?

The hardware track studies the design space of the critical components of parallel hardware:

- processor (ILP, intra- and inter-core)
- memory hierarchy (coherency)
- interconnect (inter-core or core-cache routing)

The software track studies programming models for expressing data parallelism, plus ways to reason about and optimize parallelism:

- High level of abstraction: list homomorphism $\equiv$ functional map-reduce style + flattening
- Low level of abstraction: loops and transformations rooted in data dependency analysis
- Lecture notes are available for the software track!

The lab track applies in practice various optimizations/transformations learned on the software track.




## Reading

Please read chapter 1, "Motivation, hardware trends and technological constraints" from lecture notes.

I am going to do a quick overview of it.



### **Abstractions**

- A program is to a process/thread what a recipe is for cooking.
- Processor (core): hardware entity capable of sequencing & executing a thread's instructions.
- Multithreaded (MT) cores: run multiple threads, each in its own thread context.
- Multiprocessor: set of processors connected to execute a workload
  - mass-produced, off-the-shelf, each with several cores & levels of cache
  - trend towards migrating system functions onto the chip: memory controllers, external cache directories, network interface



# Processor: Clock Frequency/Rate

Historically, the clock rate (at which instructions are executed) increased exponentially (1990-2004).





### Closer Look at Clock Rate



- Technology (process shrinkage): with every generation, transistors' switching speed increases by 41%.
- Pipeline depth: more stages $\Rightarrow$ less complex stages $\Rightarrow$ fewer gates per stage
  - the number of gate delays dropped by 25% every process generation.
- Improved circuit design

Clock Rate Increase is Not Sustainable:

- Deeper pipelines: difficult to build useful stages with < 10 gates,
- Wire delays: wire transmission speed improves much more slowly than switching speed,
- Circuits clocked at higher rates consume more power!



### **Processor: Feature Size & Number of Transistors**



Each process generation offers new resources. How best to use the > 100 billion transistors? Large-scale CMPs (100s-1000s of cores):

- more cache, better memory-system design
- fetch and decode multiple instructions per clock
- run multiple threads per core and on multiple cores



# **Memory Systems**

- (Main) Memory Wall: growing gap between processor and memory speed. Processor cannot execute faster than memory system can deliver data and instructions!
- Want Big, Fast & Cheap Memory System
  - access time increases with memory size, as it is dominated by wire delays $\Rightarrow$ this will not change in future technologies
  - multi-level hierarchies (rely on the principle of locality)
  - efficient management is KEY, e.g., cache coherence.
  - Cost and size of the memories in a basic PC in 2008:

| Memory      | Size   | Marginal Cost | Cost per MB | Access Time |
|-------------|--------|---------------|-------------|-------------|
| L2 Cache    | 1 MB   | \$20/MB       | \$20        | 5 nsec      |
| Main Memory | 1 GB   | \$50/GB       | 5c          | 200 nsec    |
| Disk        | 500 GB | \$100/500GB   | 0.02c       | 5 msec      |

## Memory Wall? Which Memory Wall??

- DRAM density increases 4× every 3 years, BUT
- DRAM speed increases by only 7% per year (processor speed by 50%)!
- The perception was that the Memory Wall would last forever!
- The Memory Wall stopped growing around 2002.
- Multi/many-cores ⇒ shift from a latency wall to a bandwidth wall.
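To get a feeling for how quickly the two growth rates quoted above diverge, here is a minimal Haskell sketch. It assumes both rates stay constant year after year, which is a simplification; the names `procGrowth`, `dramGrowth` and `gapAfter` are made up for this illustration.

```haskell
-- Yearly growth factors quoted on the slide.
procGrowth, dramGrowth :: Double
procGrowth = 1.50   -- processor speed: +50% per year
dramGrowth = 1.07   -- DRAM speed: +7% per year

-- Relative processor/DRAM speed gap after n years, normalised to 1 at year 0.
-- Assumption: both rates are constant over the whole period.
gapAfter :: Int -> Double
gapAfter n = (procGrowth / dramGrowth) ^ n
```

Under these assumptions the gap grows by a factor of about 1.4 per year, so `gapAfter 10` is roughly 29.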



# **Disk Memory**

- Historically, disk performance improved by 40% per year.
- DiskTime = AccessTime + TransferTime (AccessTime = Seek + Latency)
- Historically, transfer time has dominated, but
- today, transfer and access times are of the same order (milliseconds);
- in the future, access time will dominate, but the processor-disk gap is still large.



Seek time: time for the head to reach the right track; latency: time to reach the first record on the track. Both depend on rotation speed & are independent of block size.

### Interconnection Networks

### Present at many layers:

- On-Chip Interconnects: forward values between pipeline stages, AND between execution units AND connect cores to shared cache banks.
- System Interconnects: connect processors (CMPs) to memory and IO
- I/O Interconnects (usually a bus, e.g., PCI): connect various devices to the System Bus
- Inter-Systems Interconnects: connect separate systems (chassis or boxes) & include
  - SANs: connect systems at very short distance
  - LANs, WANs (not interesting for us).
- Internet: global world-wide interconnect (not interesting for us).




## **Technological Constraints**

- In the Past: tradeoff between cost (area) and time (performance).
- Today: design is challenged by several technological limits
  - Major new constraint is Power
  - wire delays
  - reliability
  - complexity of design
- Parallelism seems to address all these constraints well.



### Power

- Total Power = Dynamic + Static (Leakage)
  - $P_{dynamic} = \alpha C V^2 f$: consumed by a gate when it switches state
  - $P_{static} = V I_{sub} \sim V e^{-k V_T / T}$ (caches)

$V$ supply voltage, $f$ clock rate, $\alpha$ activity factor, $\alpha f$ the rate at which gates switch, $T$ temperature.

- Dynamic power favors parallel processing over higher clock rate
  - $P_{dynamic} \sim f^3$, mostly dissipated in the processor
  - increase clock frequency $4\times \Rightarrow 4\times$ speedup @ $64\times$ dynamic power!
  - replicate a uniprocessor $4\times \Rightarrow 4\times$ speedup @ $4\times$ power
- Static power: dissipated in all circuits, at all times, regardless of frequency and of whether the circuit switches or not.
  - proportional to the area of the circuit, but independent of clock rate and circuit activity
  - negligible 15 years ago, but as feature size decreased, so did the threshold voltage $V_T$ every generation
  - recently overtook dynamic power as the major source of dissipation!
- Power/Energy are critical problems, e.g., costly & many battery-operated devices.
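The cubic dependence quoted above can be motivated as follows, under the common simplifying assumption (not stated on the slide) that the supply voltage has to scale roughly linearly with the clock rate, $V \sim f$:

$$P_{dynamic} = \alpha C V^2 f \;\sim\; \alpha C f^2 \cdot f \;=\; \alpha C f^3$$

Under that assumption, raising the clock rate by $4\times$ costs $4^3 = 64\times$ the dynamic power, whereas four replicated cores running at the original $V$ and $f$ cost about $4 \cdot \alpha C V^2 f$, i.e., $4\times$. The comparison ignores static power and assumes the workload parallelizes perfectly.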



### Reliability

- Transient Failures (Soft Errors):
  - Corruption Sources: cosmic rays, alpha particles radiating from the packaging material, electrical noise;
  - Charge stored in a transistor Q = C V
  - Supply voltage V decreases every generation (a consequence of feature-size shrinking)
  - As Q decreases, it is easier to flip bits
  - Device operational but values have been corrupted
  - DRAM/SRAM error detection and correction capability
- Intermittent/Temporary Failures:
  - last longer, should try to continue execution
  - aging or temporary environmental variation, e.g., temperature
- Permanent Failures: device will never function again, must be isolated & replaced by spare
- Chip Multiprocessors: promote better reliability
  - using threads for redundant execution,
  - faulty cores can be disabled ⇒ natural failsafe degradation



# Wire Delays

- Miniaturization ⇒ transistors switch faster, but the propagation delay of signals on wires does not scale as well.
- Wire propagation delay $\sim RC$, with $R \sim L / CS_{area}$ (wire length over cross-section area). Miniaturization $\Rightarrow$ the cross-section area keeps shrinking each generation, which annuls the benefit of the length shrinking.
- Wires can be pipelined like logic.
- Deeper pipelines are better because communication is limited to only a few stages.
- Impact of wire delays also favors multiprocessors, since communication traffic is hierarchical:
  - most communication is local
  - inter-core communication is occasional



# **Design Complexity**

- Design verification has become the dominant cost of chip development today and is a major design constraint.
- Chip density increases much faster than the productivity of verification engineers (new tools and speed of systems):
  - register-transfer language level, i.e., logic is correct
  - core level, i.e., correctness of forwarding, memory disambiguation,
  - multi-core level, e.g., cache coherence, memory consistency.
- Vast majority of chip resources dedicated to storage also due to verification complexity:
  - trivial to increase the size of: caches, store buffers, load/store/fetch queues, reorder buffers, directory for cache coherence, etc.
- Design Trend Favors Multiprocessors: easier to replicate the same structure multiple times than to design a large, complex one.

# CMOS (Endpoint) Meets Quantum Physics

- CMOS is rapidly reaching the limits of miniaturization.
- Feature size: half-pitch distance (half the distance between two metal wires). Gate length $\sim 1/2$ feature size.
- If present trends continue, feature size < 10 nm by 2020.
- Radius of an atom: $0.1 \sim 0.2$ nm $\Rightarrow$ gate length quickly reaches atomic distances, which are governed by quantum physics, i.e., binary logic is replaced with probabilistic states.
- Not clear what will follow (?)



#### Realm of finite lists:

- ++ denotes list concatenation:
  - $[1, 2, 3] ++ [4, 5, 6, 7] \equiv [1, 2, 3, 4, 5, 6, 7]$
- empty list [] is the neutral element: [] ++  $x \equiv x$  ++ []  $\equiv x$

### LH: a special form of divide and conquer programming:

```
h [] = e
h [x] = f x
h (x ++ y) = (h x) ⊙ (h y)
```

A well-defined program requires that no matter how the input list is partitioned into x ++ y, the result is the same!

```
-- computes the length of a list
-- (how many elements a list has)
len :: [T] -> Int
len (x ++ y) = (len x) ???? (len y)
```

```
-- Assume p :: T -> Bool given,
-- compute whether all elements of
-- a list satisfy predicate p.
all_p :: [T] -> Bool
all_p [] = True
all_p [x] = p x
all_p (x ++ y) = (all_p x) && (all_p y)
```

Why would it be incorrect to say that $all_p$ [] = False?
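One way to explore this question is to note that `[x]` can also be written as `[] ++ [x]`, so a well-defined program must give the same answer through both routes. The sketch below models just that one split; the predicate `p = even` and the helper names are arbitrary choices for the illustration, not part of the slides.

```haskell
-- Sample element predicate (an assumption made for this example).
p :: Int -> Bool
p = even

-- all_p ([] ++ [x]) = (all_p []) && (all_p [x]),
-- parameterised over the value chosen for all_p [].
viaEmptySplit :: Bool -> Int -> Bool
viaEmptySplit emptyCase x = emptyCase && p x

-- Direct singleton rule: all_p [x] = p x.
viaSingleton :: Int -> Bool
viaSingleton x = p x
```

With `True` as the empty case, `viaEmptySplit True 4` agrees with `viaSingleton 4`. With `False`, `viaEmptySplit False 4` is `False` although `p 4` holds, so the result would depend on how the list is partitioned.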

# Math Preliminaries: Monoid & Homomorphism

### Definition (Monoid)

Assume set S and  $\odot: S \times S \to S$ .  $(S, \odot)$  is called a Monoid if it satisfies the following two axioms:

- (1) Associativity:  $\forall x, y, z \in S$  we have  $(x \odot y) \odot z \equiv x \odot (y \odot z)$  and
- (2) Identity Element:  $\exists e \in S$  such that  $\forall a \in S$ ,  $e \odot a \equiv a \odot e \equiv a$ .

($(S, \odot)$ is called a group if it also satisfies that every element is invertible, i.e., $\forall a \in S$, $\exists a^{-1} \in S$ such that $a \odot a^{-1} \equiv a^{-1} \odot a \equiv e$.)

E.g.,  $(\mathbb{N},+)$ ,  $(\mathbb{Z},\times)$ ,  $(\mathbb{L}_T,++)$ , where  $\mathbb{L}_T$  denotes lists of elements of type T, and ++ list concatenation.

### Definition (Monoid Homomorphism)

A monoid homomorphism from monoid  $(S, \oplus)$  to monoid  $(T, \odot)$  is a function  $h: S \to T$  such that  $\forall u, v \in S$ ,  $h(u \oplus v) \equiv h(u) \odot h(v)$ .
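As a concrete instance of the definition, Haskell's `length` maps the monoid of lists under `++` to the monoid of integers under `+`. The following sketch checks the law on given inputs; `homLaw` and `identityLaw` are names invented for this example, and checking a few inputs is of course not a proof.

```haskell
-- h (u ++ v) == h u + h v, with h = length, for one pair of lists.
homLaw :: [Int] -> [Int] -> Bool
homLaw u v = length (u ++ v) == length u + length v

-- The neutral element [] of (++) is mapped to the neutral element 0 of (+).
identityLaw :: Bool
identityLaw = length ([] :: [Int]) == 0
```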

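**EXERCISE:** for a well-defined list homomorphism $h$ with $h\ [] = e$ and $h\ (x ++ y) = (h\ x) \odot (h\ y)$, prove that $(Img(h), \odot)$ is a monoid with neutral element $e$.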



# Basic Blocks of Parallel Programming: Map

map :: $(\alpha \to \beta) \to [\alpha] \to [\beta]$ has inherently parallel semantics.

$$map\ f\ [a_1, a_2, \ldots, a_n] \equiv [f\ a_1, f\ a_2, \ldots, f\ a_n]$$



# Basic Blocks of Parallel Programming: Reduce

reduce :: $(\alpha \to \alpha \to \alpha) \to \alpha \to [\alpha] \to \alpha$

$$reduce\ \odot\ e\ [a_1, a_2, \ldots, a_n] \equiv e \odot a_1 \odot a_2 \odot \ldots \odot a_n$$

where $\odot$ is an associative binary operator.
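The two building blocks can be tried out directly in Haskell. This is a sequential reference sketch: `map` is the Prelude function, while `reduce` is defined here with `foldl`, which is only one possible evaluation order. A parallel implementation is free to regroup the applications of the operator, which is safe because the operator is assumed associative.

```haskell
-- Sequential reference definition of reduce (assumes op is associative
-- and e is its neutral element).
reduce :: (a -> a -> a) -> a -> [a] -> a
reduce op e = foldl op e

-- Each element is transformed independently of the others.
squares :: [Int]
squares = map (\a -> a * a) [1, 2, 3, 4]

-- 0 + 1 + 4 + 9 + 16
total :: Int
total = reduce (+) 0 squares
```

Here `squares` evaluates to `[1, 4, 9, 16]` and `total` to `30`.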





# 1st List Homomorphism Theorem

Any list homomorphism can be written as a map followed by a reduce:

```
h [] = e
h [x] = f x
h (x ++ y) = (h x) ⊙ (h y)

h ≡ (reduce ⊙ e) ∘ (map f)
```

Important Note: $\odot$ needs to be associative and e needs to be the neutral element of $\odot$!

```
-- one :: T -> Int, one(x) = 1
len :: [T] -> Int
len [] = 0
len [x] = one x      -- ≡ 1
len (x ++ y) = (len x) + (len y)

len ≡ (reduce (+) 0) ∘ (map one)
```

```
all_p [] = True
all_p [x] = p x
all_p (x ++ y) = (all_p x) && (all_p y)

all_p ≡ (reduce (&&) True) ∘ (map p)
```
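The map-reduce forms of `len` and `all_p` translate almost literally into runnable Haskell. In this sketch `reduce` is defined via `foldr` as a sequential stand-in, `const 1` plays the role of `one`, and `allP` takes the predicate as an explicit argument; these are choices made for the example.

```haskell
-- Sequential stand-in for reduce (op associative, e its neutral element).
reduce :: (a -> a -> a) -> a -> [a] -> a
reduce op e = foldr op e

-- len ≡ (reduce (+) 0) ∘ (map one)
len :: [a] -> Int
len = reduce (+) 0 . map (const 1)

-- all_p ≡ (reduce (&&) True) ∘ (map p)
allP :: (a -> Bool) -> [a] -> Bool
allP p = reduce (&&) True . map p
```

For example, `len "abcd"` is `4`, `allP even [2, 4]` is `True`, and `allP even [2, 3]` is `False`.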



# **List Homomorphism Invariants**

### Theorem (List-Homomorphism Promotions)

Given unary functions f, g and an associative binary operator $\odot$ with neutral element $e_{\odot}$, then:

- 1. $(map\ f) \circ (map\ g) \equiv map\ (f \circ g)$
- 2. $(map\ f) \circ (reduce\ (++)\ []) \equiv (reduce\ (++)\ []) \circ (map\ (map\ f))$
- 3. $(reduce\ \odot\ e_{\odot}) \circ (reduce\ (++)\ []) \equiv (reduce\ \odot\ e_{\odot}) \circ (map\ (reduce\ \odot\ e_{\odot}))$
  - 2. 3. $\Rightarrow$ code generation: the list is segmented, segments are distributed on different processors, computation proceeds locally on each processor, and the local results are reduced.
  - 2. 3. $\Leftarrow$ flattening optimization: uncovers more parallelism
  - e.g., $map\ f\ [1..4] = ((map\ f) \circ (reduce\ (++)\ []))\ [[1,2],[3,4]] =^{prom2}\ ?$
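The three promotion rules can be spot-checked on concrete data. The sketch below instantiates $\odot$ with `+`, picks arbitrary sample functions `f` and `g`, and defines `reduce` via `foldr`; all of these are assumptions for the illustration, and a check on one input does not replace a proof.

```haskell
-- Sequential stand-in for reduce (op associative, e its neutral element).
reduce :: (a -> a -> a) -> a -> [a] -> a
reduce op e = foldr op e

-- Sample input and functions (arbitrary choices).
xss :: [[Int]]
xss = [[1, 2], [3, 4], [5]]

f, g :: Int -> Int
f = (* 2)
g = (+ 1)

promotion1, promotion2, promotion3 :: Bool
-- 1. (map f) ∘ (map g) ≡ map (f ∘ g)
promotion1 = (map f . map g) (concat xss) == map (f . g) (concat xss)
-- 2. (map f) ∘ (reduce (++) []) ≡ (reduce (++) []) ∘ (map (map f))
promotion2 = (map f . reduce (++) []) xss == (reduce (++) [] . map (map f)) xss
-- 3. with ⊙ = (+): reduce over the flattened list ≡ reduce of per-segment reduces
promotion3 = (reduce (+) 0 . reduce (++) []) xss
          == (reduce (+) 0 . map (reduce (+) 0)) xss
```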

### List Homomorphism Invariants: Demo

2. $(map\ f) \circ (reduce\ (++)\ []) \equiv (reduce\ (++)\ []) \circ (map\ (map\ f))$

Assume $distr_p$ distributes a list into p sublists of roughly the same number of elements.

Assume $distr_{shp}$ distributes a list according to a given shape, e.g., if arr = [1,2,3,4,5] and shp = [2,3] then $distr_{shp}$ arr = [[1,2], [3,4,5]].

Assume the representation of a list of lists is flat, i.e., we keep a shape around.



### **List Homomorphism Invariants: Demo** ⇒

2. $(map\ f) \circ (reduce\ (++)\ []) \equiv (reduce\ (++)\ []) \circ (map\ (map\ f))$

Composing both sides with $distr_p$, and using $(reduce\ (++)\ []) \circ distr_p \equiv id$:

$$map\ f \equiv (reduce\ (++)\ []) \circ (map\ (map\ f)) \circ distr_p$$

Useful for code generation.



### List Homomorphism Invariants: Demo ←

2. $(map\ f) \circ (reduce\ (++)\ []) \equiv (reduce\ (++)\ []) \circ (map\ (map\ f))$

$$map\ (map\ f) \equiv distr_{shp} \circ (map\ f) \circ (reduce\ (++)\ [])$$

Useful for load balancing (optimization).
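Both directions of the demo can be tried on small inputs. The following Haskell sketch uses hypothetical implementations `distrP` (assumes `p >= 1`) and `distrShp` of the two distribution functions, operating on nested lists rather than the flat representation mentioned above.

```haskell
-- Hypothetical distr_p: split into p chunks of near-equal size (p >= 1).
distrP :: Int -> [a] -> [[a]]
distrP p xs = go p xs
  where
    go 0 _  = []
    go k ys = chunk : go (k - 1) rest
      where
        n             = (length ys + k - 1) `div` k   -- ceiling division
        (chunk, rest) = splitAt n ys

-- Hypothetical distr_shp: split according to a list of segment lengths.
distrShp :: [Int] -> [a] -> [[a]]
distrShp []       _  = []
distrShp (n : ns) xs = seg : distrShp ns rest
  where (seg, rest) = splitAt n xs

-- Direction ⇒ : map f ≡ concat ∘ map (map f) ∘ distr_p
forward :: Bool
forward = map (* 2) xs == (concat . map (map (* 2)) . distrP 2) xs
  where xs = [1 .. 5] :: [Int]

-- Direction ⇐ : map (map f) ≡ distr_shp ∘ map f ∘ concat
backward :: Bool
backward = map (map (* 2)) xss == (distrShp shp . map (* 2) . concat) xss
  where xss = [[1, 2], [3, 4, 5]] :: [[Int]]
        shp = map length xss
```

In the forward direction each chunk could be handed to a different processor; in the backward direction the nested map becomes one flat map over all elements, with the shape used only to restore the nesting.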



# Optimizing Map-Reduce Computation (Exercise)

### Theorem (Optimized Map Reduce)

Assume $distr_p :: [\alpha] \to [[\alpha]]$ distributes a list into p sublists, each containing about the same number of elements. Denoting $redomap\ \odot\ f\ e_{\odot} \equiv (reduce\ \odot\ e_{\odot}) \circ (map\ f)$, the following equality holds:

$$redomap\ \odot\ f\ e_{\odot} \equiv (reduce\ \odot\ e_{\odot}) \circ (map\ (redomap\ \odot\ f\ e_{\odot})) \circ distr_p$$

- Prove it using the promotion lemmas!
- Hint: $(reduce\ (++)\ []) \circ distr_p \equiv id$, hence
- $redomap\ \odot\ f\ e_{\odot} \equiv (reduce\ \odot\ e_{\odot}) \circ (map\ f) \circ (reduce\ (++)\ []) \circ distr_p$
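A small Haskell sketch can make the intended shape of the optimized computation tangible: each chunk is map-reduced locally, and the partial results are then reduced. The chunked form below is what one obtains from the hint by applying promotions 2, 3 and 1 in turn; `distrP` is a hypothetical chunking helper (assumes `p >= 1`), and `reduce` is a sequential stand-in.

```haskell
-- Sequential stand-in for reduce (op associative, e its neutral element).
reduce :: (a -> a -> a) -> a -> [a] -> a
reduce op e = foldr op e

-- redomap ⊙ f e ≡ (reduce ⊙ e) ∘ (map f)
redomap :: (b -> b -> b) -> (a -> b) -> b -> [a] -> b
redomap op f e = reduce op e . map f

-- Hypothetical distr_p: split into p chunks of near-equal size (p >= 1).
distrP :: Int -> [a] -> [[a]]
distrP p xs = go p xs
  where
    go 0 _  = []
    go k ys = chunk : go (k - 1) rest
      where
        n             = (length ys + k - 1) `div` k
        (chunk, rest) = splitAt n ys

-- Chunked form: local redomap per chunk, then a reduce of the partial results.
redomapChunked :: Int -> (b -> b -> b) -> (a -> b) -> b -> [a] -> b
redomapChunked p op f e = reduce op e . map (redomap op f e) . distrP p
```

For instance, `redomapChunked 3 (+) (* 2) 0 [1 .. 10]` and `redomap (+) (* 2) 0 [1 .. 10]` both evaluate to `110`. Agreement on examples is only a sanity check; the exercise still asks for the proof.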



# Maximum Segment Sum Problem (MSSP)

"Systematic Extraction and Implementation of Divide-and-Conquer Parallelism", Sergei Gorlatch, 1996.

Intuition: a non-homomorphic function g can sometimes be lifted to a homomorphic one f, by computing a baggage of extra info.

The initial function is obtained by projecting the homomorphic result: $g = \pi \circ f$

### **Maximum-Segment Sum Problem (MSSP)**:

Given a list of integers, find the contiguous segment of the list whose members have the largest sum among all such segments.

The result is only the maximal sum (not the segment's members). For simplicity, let us assume we are interested only in **positive sums**.

E.g., mss [1, -2, 3, 4, -1, 5, -6, 1] = 11 (the corresponding segment is [3, 4, -1, 5]).



# **MSSP: Preliminary Reasoning**

### A first straightforward/naive attempt:

```
mss [] = 0
mss [a] = a ↑ 0                     -- ↑ denotes max
mss (x ++ y) = mss(x) ??? mss(y)
```



#### How to combine mss1 and mss2?

#### Which case is problematic?

Answer: when the segment of interest lies partly in x and partly in y!

# MSSP: A Better Reasoning

The problematic case is when the segment of interest lies partly in x and partly in y!

We need to compute extra information:

- maximum concluding segment
- maximum initial segment
- total segment sum







Compute the mis, mcs, mss, and ts for the concatenation of the two segments ($\uparrow$ denotes max):

$$\begin{aligned}
mis &= mis1 \uparrow (ts1 + mis2) \\
mcs &= mcs2 \uparrow (mcs1 + ts2) \\
mss &= mss1 \uparrow mss2 \uparrow (mcs1 + mis2) \\
ts  &= ts1 + ts2
\end{aligned}$$



# Maximum-Segment Sum = Near Homomorphism

```
-- Correct Solution (Haskellish)
-- x ↑ y = if (x >= y) then x else y
(mssx, misx, mcsx, tsx) ⊙ (mssy, misy, mcsy, tsy) =
    ( mssx ↑ mssy ↑ (mcsx + misy)
    , misx ↑ (tsx + misy)
    , (mcsx + tsy) ↑ mcsy
    , tsx + tsy
    )
f x  = (x ↑ 0, x ↑ 0, x ↑ 0, x)
emss = (reduce ⊙ (0,0,0,0)) . (map f)
mss  = π₁ . emss
       where π₁ (a, _, _, _) = a
```

The baggage: 3 extra integers (misx, mcsx, tsx) and a constant number of integer operations per communication stage.
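The Haskellish solution can be turned into plain Haskell with few changes. In this sketch `max` replaces $\uparrow$, the operator $\odot$ becomes a named function `combine`, `f` is renamed `lift`, and a sequential `foldr` stands in for the parallel reduce; these names are choices made for the example.

```haskell
-- (mss, mis, mcs, ts): max segment sum, max initial segment sum,
-- max concluding segment sum, total sum.
type MssTuple = (Int, Int, Int, Int)

-- The associative operator of the near-homomorphism.
combine :: MssTuple -> MssTuple -> MssTuple
combine (mssx, misx, mcsx, tsx) (mssy, misy, mcsy, tsy) =
  ( mssx `max` mssy `max` (mcsx + misy)
  , misx `max` (tsx + misy)
  , (mcsx + tsy) `max` mcsy
  , tsx + tsy
  )

-- Lifts a single element into the extended (baggage-carrying) domain.
lift :: Int -> MssTuple
lift x = (x `max` 0, x `max` 0, x `max` 0, x)

-- Map, reduce, then project out the first component.
mss :: [Int] -> Int
mss xs = let (m, _, _, _) = foldr combine (0, 0, 0, 0) (map lift xs) in m
```

On the example from the problem statement, `mss [1, -2, 3, 4, -1, 5, -6, 1]` evaluates to `11`.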



# **Longest Satisfying Segment Problems**

- A class of problems that require finding the longest segment of a list for which some property holds, such as:
  - longest sequence of zeros, or longest sequence made from the same number, or longest sorted sequence.
- Not all predicates can be written as a list homomorphism, e.g., longest sequence whose sum is 0.

#### Restrict The Shape of the Predicate to:

```
p []           = True
p [x]          = ...
p [x, y]       = ...
p (x : y : zs) = (p [x, y]) ∧ p (y : zs)
```



# **Longest Satisfying Segment Problems**

#### Extra Baggage:

- As before, the length of the longest initial/concluding satisfying segments (lis/lcs), and the total list length (tl).
- When considering the concatenation of the (lcs, lis) pair, it is not guaranteed that the result satisfies the predicate, e.g., (sorted x) ∧ (sorted y) ⇏ sorted (x ++ y).
- We also need the last element of lcs and the first element of lis, in order to compute whether (lcs x) is connected to (lis y), i.e., p [lastx, firsty] == True.



# Longest Satisfying Segment Problem: Exercise

### Exercise: fill in the blanks, test in Futhark for zeros/same/sorted

```
(lssx, lisx, lcsx, tlx, firstx, lastx) ⊙
(lssy, lisy, lcsy, tly, firsty, lasty)
  = (newlss, newlis, newlcs, tlx+tly, first, last)
    where
      connect = ...
      newlss  = ...
      newlis  = ...
      newlcs  = ...
      first   = if tlx == 0 then firsty else firstx
      last    = if tly == 0 then lastx else lasty
f x = (xmatch, xmatch, xmatch, 1, x, x)
    where xmatch = if (p [x]) then 1 else 0
elss = (reduce (⊙) (0,0,0,0,0,0)) . (map f)
lss = π₁ . elss
      where π₁ (a, _, _, _, _, _) = a
```

#### **Conclusion**

What have we talked about today?

- Hardware Parallelism: the only way of scalably increasing the compute power.
- Big Challenge: having parallel commodity software.
- List-Homomorphism: a way of reasoning about parallelism and of building inherently parallel programs.

