Understanding efficient parallel scan

I first gave this talk at the Hacker Dojo in Mountain View, California on October 24, 2013.

Revisions:

  • 2015-09-30: added circuit figures

You can find the slides (PDF) in my talks folder.

Abstract

This talk will demonstrate high-level algorithm design for parallel functional programming, by means of the example of (parallel) prefix computation---known as "scan" to functional programmers. We'll start with a simple and precise scan specification that leads directly to an algorithm performing quadratic work and hence quadratic time in a sequential implementation. An easy observation leads to a familiar sequential algorithm performing linear work. This version, however, does not lend itself to parallel execution. A conventional divide-and-conquer idea leads to a functional algorithm that can execute in O(log n) parallel time (given sufficient computational resources) and O(n log n) work.

Playing with variations and poking at asymmetries, a beautiful generalization reveals itself. An inversion of representation then leads to a linear-work algorithm while retaining logarithmic time, without changing the scan source code. This inversion yields an example of non-regular (or "nested") algebraic data types, which are relatively obscure even in the functional programming community. (I suspect that these non-regular types are often more parallel-friendly than their familiar counterparts, as well as naturally capturing some shape/size constraints.) Yet another modest variation leads to O(n log log n) work and O(log log n) time.

These three parallel algorithms can be decomposed into five simple pieces, which can be systematically recomposed to automatically generate parallel scan algorithms for a broad range of data structures.
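To make the progression concrete, here is a small Haskell sketch of the first three versions the abstract describes: the quadratic-work specification, the linear-work sequential scan, and the divide-and-conquer parallel scan. These definitions are my own illustrative reconstruction (names included), not the code from the talk; the real development is in understanding-parallel-scan.lhs.

```haskell
import Data.List (inits)

-- Quadratic-work specification: the i-th output is the combination of
-- the first i inputs, so each prefix is recomputed from scratch.
scanSpec :: Monoid a => [a] -> [a]
scanSpec = map mconcat . inits

-- Linear-work sequential scan: thread an accumulator left to right.
-- Correct, but inherently sequential.
scanSeq :: Monoid a => [a] -> [a]
scanSeq = go mempty
  where
    go acc []     = [acc]
    go acc (x:xs) = acc : go (acc <> x) xs

-- Divide and conquer: scan each half independently (in parallel),
-- then adjust the right half by the left half's total.
-- O(log n) depth with O(n log n) work.
scanDC :: Monoid a => [a] -> [a]
scanDC []  = [mempty]
scanDC [x] = [mempty, x]
scanDC xs  = ls ++ map (ltot <>) (tail rs)
  where
    (l, r) = splitAt (length xs `div` 2) xs
    ls     = scanDC l
    rs     = scanDC r
    ltot   = last ls   -- total of the left half
```

All three return n+1 outputs for an n-element input (the prefixes plus the overall total), which keeps the divide-and-conquer recombination uniform.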
