// Parallel Iterators
/*
This primer explains how to write parallel iterators in Chapel,
which can be used to drive parallel ``forall`` loops. It assumes
that the reader already knows how to define serial iterators in
Chapel, as summarized in the :ref:`iterators primer (iterators.chpl)
<primers-iterators>`, for example.

Chapel has two main flavors of parallel iterators: `Standalone`
parallel iterators are the simpler form and can be used to define
the parallelism for a simple forall loop like ``forall i in
myIter(...)``. `Leader-follower` iterators are a more involved form
that supports zippered forall loops, like ``forall (...,i,...) in
zip(..., myIter(...), ...)``. Note that defining only the
leader-follower iterators is also sufficient for use in a
non-zippered forall loop, though often with additional overhead
compared to a standalone iterator.

For a more thorough introduction to leader-follower iterators, refer
to our PGAS 2011 paper, `User-Defined Parallel Zippered Iterators in
Chapel`_. Note that there are known limitations and issues with
Chapel's current definition of parallel iterators, and we anticipate
addressing these and improving them over time. To that end, this
primer should be considered a snapshot of their status in the
current implementation.
*/
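//
// For intuition, when only leader-follower iterators are defined, a
// non-zippered ``forall i in myIter(...)`` is compiled roughly as a
// degenerate one-way zippering: the leader yields chunks of work and
// a single follower iterates over each chunk. The following sketch
// uses hypothetical ``.lead()``/``.follow()`` names to stand in for
// the leader and follower overloads (they are not real methods):
/* .. code-block:: chapel

     for work in myIter.lead() do      // leader: creates tasks, yields work
       for i in myIter.follow(work) do // follower: serially yields indices
         // ...loop body...
*/
// The extra follower layer is the source of the additional overhead
// mentioned above, relative to a standalone parallel iterator that
// yields indices directly.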
/*
Motivating Example: a `count` iterator
--------------------------------------
*/
// In this primer, we're going to create a simple iterator named
// ``count`` that can be invoked in either ``for`` or ``forall``
// loops. ``count`` will be defined to take an argument ``n`` as
// input and an optional argument ``low`` (set to 1 by default), and
// it will yield ``n`` integers starting with ``low``.
//
// We'll use the following config const ``numTasks`` to indicate the
// degree of parallelism that ``count()`` should use in its forall
// loops. By default, we've set it to the maximum amount of
// parallelism expected on the current locale, but this can be
// overridden on the executable command-line using the
// ``--numTasks=<n>`` option.
//
config const numTasks = here.maxTaskPar;
if numTasks < 1 then
  halt("numTasks must be a positive integer");
//
// If compiled with ``verbose`` set to ``true``, the parallel
// iterators in this primer will print indications of what they're
// doing under the surface.
//
config param verbose = false;
//
// Next, we declare a problem size for this test. By default we use a
// small problem size to make the output readable. Of course, to use
// the parallelism effectively you'd want to use a much larger problem
// size (override on the execution command-line using the
// ``--probSize=<n>`` option).
//
config const probSize = 15;
var A: [1..probSize] real;
//
// To get started, we'll define a traditional serial iterator for
// ``count``. In part, this is for purposes of illustration in this
// primer; but it is also necessary because Chapel's current
// implementation of parallel iterators requires a serial overload of
// the iterator as well, to model the expected signature and yielded
// type.
//
iter count(n: int, low: int=1) {
  for i in low..#n do
    yield i;
}
//
// Here are some simple loops using the serial iterator to demonstrate
// it. First we iterate over all indices in our problem size to
// initialize array ``A``:
//
for i in count(probSize) do
  A[i] = i:real;
writeln("After serial initialization, A is:");
writeln(A);
writeln();
//
// Then we override the default value of ``low`` in order to negate
// the "middle" elements of ``A``:
//
for i in count(n=probSize/2, low=probSize/4) do
  A[i] = -A[i];
writeln("After negating the middle of A:");
writeln(A);
writeln();
//
// For serial zippered iteration, nothing is required other than
// this single iterator:
//
for (i, a) in zip(count(probSize), A) do
  a = i/10.0;
writeln("After re-assigning A using zippering:");
writeln(A);
writeln();
/*
.. _primers-parIters-standalone-parallel:
A standalone parallel `count` iterator
--------------------------------------
*/
// To create a parallel version of ``count``, we will declare a second
// overload of the iterator with the same signature, but an additional
// ``param`` argument named ``tag`` of built-in enumerated type
// ``iterKind``, to distinguish it from the serial version. The
// author of a standalone parallel iterator should use a ``where``
// clause to distinguish this overload from others. Specifically,
// when the Chapel compiler attempts to implement a forall loop like
// ``forall i in count(...)``, it will attempt to resolve the iterator
// by passing in ``iterKind.standalone`` as its value, to distinguish
// it from the serial iterator above. This argument is what marks
// this version of the iterator as a parallel iterator. After the
// ``tag`` argument, the rest of the argument list should exactly
// match that of the serial iterator. For the ``count()`` example,
// this means providing the same ``n`` and ``low`` arguments as
// before.
//
// Unlike serial iterators, parallel iterators are allowed to contain
// ``yield`` statements within parallel constructs like ``coforall``,
// ``cobegin``, and ``forall``. In our implementation of the
// standalone parallel version of ``count`` here, we use a
// ``coforall`` loop to define ``numTasks`` tasks and then divide the
// iteration space up amongst them. Specifically, each task calls
// into a helper routine defined at the bottom of this file,
// ``computeChunk()`` that computes its sub-range of the values to be
// counted as a function of its task ID (``tid``) and the total number
// of tasks (``numTasks``). The iterator also contains debugging
// output which can be enabled by compiling with ``-sverbose=true``.
//
iter count(param tag: iterKind, n: int, low: int = 1)
       where tag == iterKind.standalone {
  if verbose then
    writeln("Standalone parallel count() is creating ", numTasks, " tasks");

  coforall tid in 0..#numTasks {
    const myIters = computeChunk(low..#n, tid, numTasks);
    if verbose then
      writeln("task ", tid, " owns ", myIters);
    for i in myIters do
      yield i;
  }
}
// Though not shown in this example, standalone parallel iterators may
// also target multiple locales using features like ``on`` statements
// or distributed arrays.
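//
// For illustration, a multi-locale variant of the standalone
// iterator might first carve the iteration space across locales with
// ``on`` statements, and then across tasks within each locale. The
// following is only a sketch under that assumption -- it reuses the
// ``computeChunk()`` helper defined at the bottom of this file, uses
// a distinct name (``countDist``) so as not to suggest it is part of
// this primer's compiled code, and has not been tuned:
/* .. code-block:: chapel

     iter countDist(param tag: iterKind, n: int, low: int = 1)
            where tag == iterKind.standalone {
       coforall loc in Locales do on loc {
         // each locale owns a contiguous slice of the full range...
         const locIters = computeChunk(low..#n, loc.id, numLocales);
         coforall tid in 0..#here.maxTaskPar {
           // ...which is subdivided again amongst its local tasks
           const myIters = computeChunk(locIters, tid, here.maxTaskPar);
           for i in myIters do
             yield i;
         }
       }
     }
*/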
/*
.. _primers-parIters-standalone-usage:
Using the standalone parallel `count` iterator
----------------------------------------------
*/
// Having defined a standalone parallel iterator, we can execute the
// same loops as before, but using forall loops to make the execution
// parallel. Since these forall loops are not using zippered
// iteration, the standalone version of the ``count()`` iterator is
// used.
//
forall i in count(probSize) do
  A[i] = i:real;
writeln("After parallel initialization, A is:");
writeln(A);
writeln();
//
// Invoking it again with a different low value:
//
forall i in count(n=probSize/2, low=probSize/4) do
  A[i] = -A[i];
writeln("After negating the middle of A in parallel:");
writeln(A);
writeln();
/*
.. _primers-parIters-leader-follower:
Leader-follower iterators
-------------------------
*/
// The parallel iterators for zippered forall loops are necessarily
// more involved since each iterand expression may have its own
// preferred way of doing parallel iteration, yet its yielded values
// must be combined with the other iterands in a way that generates a
// coherent result tuple. To deal with this challenge, given a
// forall loop of the form:
/* .. code-block:: chapel

     forall (a, b, c) in zip(A, B, C) do
       // ...loop body...
*/
// Chapel's definition designates the first iterand -- in this case,
// ``A`` -- as the 'leader'. In addition, all iterands in the
// zippering, including the leader, are considered 'followers' (so
// for this loop, ``A``, ``B``, and ``C`` are all followers).
//
// Given such a loop, the compiler will roughly translate it into
// the following loop structure:
/* .. code-block:: chapel

     for work in A.lead() do   // implemented by inlining the leader
       for (a, b, c) in zip(A.follow(work), B.follow(work), C.follow(work)) do
         // ...loop body...
*/
// where ``.lead()`` and ``.follow()`` represent the leader-follower
// iterators using a simplified naming scheme.
//
// Note that since Chapel's implicitly parallel features are defined
// in terms of zippered iteration, they are also implemented using
// leader-follower iterators. For example, ``A = B + C;`` will be
// converted to an equivalent zippered parallel loop and then to the
// leader-follower idiom shown above. ``foo(A, B, C)`` goes through a
// similar transformation, where ``foo()`` is defined to take scalar
// arguments and is promoted in this call. In both cases, ``A``
// serves as the leader and ``A``, ``B``, and ``C`` are all followers.
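//
// For instance, the whole-array statement ``A = B + C;`` behaves as
// though it had been written as the following zippered forall loop
// (a sketch of the conceptual rewrite, not the literal code the
// compiler emits):
/* .. code-block:: chapel

     forall (a, b, c) in zip(A, B, C) do
       a = b + c;
*/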
/*
.. _primers-parIters-roles:
Leader-follower Roles
---------------------
*/
// At a high level, the role of a leader iterator is to:
//
// a) create the parallel tasks used to implement the forall loop,
// b) associate the tasks with specific locales, if desired, and
// c) assign work (e.g., iterations) to each parallel task.
//
// The leader typically creates the parallelism using task-parallel
// features like coforall loops, and it associates tasks with locales
// using locality features like on-clauses. The leader specifies work
// for tasks by having each task it creates yield some representation
// of the work it owns.
//
// The role of the follower iterator is to take as an input argument
// a chunk of work (as yielded by a leader) and to serially iterate
// over and yield the elements/values corresponding to those
// iterations in order.
//
// Let's consider how these roles might play out for our ``count()``
// iterator:
/*
.. _primers-parIters-leader:
count: leader
-------------
*/
// As with the standalone parallel iterator, leader and follower
// iterators are defined as overloads of the serial version of the
// iterator, once again distinguished by an initial
// ``param`` argument of type ``iterKind``. To
// invoke the leader iterator and differentiate it from the other
// overloads, the compiler will pass in the value ``iterKind.leader`` to
// this argument. The author of the leader iterator should use a
// ``where`` clause to distinguish this overload from the others. As
// with the standalone iterator, the rest of the argument list should
// match that of the serial iterator exactly.
//
// The implementation of our ``count()`` leader iterator is relatively
// similar to the standalone case. It again uses a coforall loop to
// create a number of tasks equal to the number specified by our
// ``numTasks`` configuration constant. However, rather than iterating
// over and yielding the values owned by each task, it will instead
// yield a version of the range itself as a means of telling the follower
// iterators what to do.
//
// To be a legal leader iterator, we could simply have each task yield
// its range as the representation of the work we want the follower to
// perform. However, to support zippering our leader with follower
// iterators written by others, we typically take the convention of
// having iterators over 1D or dense rectangular index spaces yield
// tuples of ranges shifted to a 0-based coordinate system. In this
// way, the leader-follower iterators have a common representation for
// the work even though each may use its own indexing system. This
// permits, for example, arrays of the same size/shape to be zippered
// together even if they have different indices.
//
// For this reason, rather than yielding subranges of ``low..#n``,
// we'll yield subranges of ``0..n-1`` and rely on the follower to
// shift the values back to their preferred coordinate system. As
// a result, we translate each task's range by ``-low`` to shift it
// from ``low``-based coordinates to 0-based coordinates; and then we
// make a 1-tuple out of it.
//
// Note the debugging output inserted into this iterator. While
// learning about leader-follower iterators, it's useful to turn
// this debugging output on by compiling with ``-sverbose=true``.
//
iter count(param tag: iterKind, n: int, low: int=1)
       where tag == iterKind.leader {
  if verbose then
    writeln("In count() leader, creating ", numTasks, " tasks");

  coforall tid in 0..#numTasks {
    const myIters = computeChunk(low..#n, tid, numTasks);
    const zeroBasedIters = myIters.translate(-low);
    if verbose then
      writeln("task ", tid, " owns ", myIters, " yielded as: ", zeroBasedIters);
    yield (zeroBasedIters,);  // yield a 1-tuple of our sub-range
  }
}
//
// As mentioned at the outset, this leader is fairly static and
// simple. More generally, a leader can introduce tasks more
// dynamically, partition work between the tasks more dynamically,
// etc. See :mod:`DynamicIters` for some more interesting examples of
// leader iterators, including those that use dynamic partitioning.
//
/*
.. _primers-parIters-follower:
count: follower
---------------
*/
// The follower is another overload of the same iterator name, this
// time taking the ``iterKind.follower`` param enumeration as its
// first argument. The subsequent arguments should match the leader
// and serial iterators exactly again (so, ``n`` and ``low`` for our
// example). The final argument must be named ``followThis``, and it
// represents the data yielded by the leader (in our case, the
// 1-tuple of 0-based ranges).
//
// The goal of the follower is to do the iteration specified by the
// ``followThis`` argument, serially yielding the elements
// corresponding to those iterations. In our case, this involves
// plucking the range back out of the 1-tuple of ranges, and shifting
// it back to our ``low``-based coordinate system. We then use a
// standard for-loop to iterate over that range and yield the
// corresponding indices. Followers, as the name suggests, tend not
// to be very sophisticated, and simply do what the leader tells them
// to.
//
// As with the leader, this follower has been authored to support
// debugging output when compiled with ``-sverbose=true``.
//
iter count(param tag: iterKind, n: int, low: int=1, followThis)
       where tag == iterKind.follower && followThis.size == 1 {
  const (followInds,) = followThis;
  const lowBasedIters = followInds.translate(low);
  if verbose then
    writeln("Follower received ", followThis, " as work chunk; shifting to ",
            lowBasedIters);
  for i in lowBasedIters do
    yield i;
}
//
// Now let's use our leader/follower iterators. In the following
// loop, ``count()`` serves as the leader and follower while the ``A``
// array is just a follower. This works because ``A`` is a
// rectangular array whose follower iterator also accepts tuples of
// 0-based ranges like the ones ``count()``'s leader is yielding. If
// we were to have ``count()`` yield something else (like a raw
// subrange of ``low..#n``), then the two could not be zippered
// correctly because they wouldn't be speaking the same language --
// either in terms of the type of work being yielded (range vs.
// 1-tuple of range) or in the description of the work (``low``-based
// indices vs. 0-based indices).
//
forall (i, a) in zip(count(probSize), A) do
  a = i/10.0;
writeln("After re-assigning A using parallel zippering:");
writeln(A);
writeln();
//
// We can also zipper in the opposite order, making ``A`` the leader,
// in which case ``count()`` no longer controls the degree of parallelism
// and work assignment since it is no longer the leader. Instead,
// ``A``'s leader iterator (defined as part of its domain map) is invoked.
// For standard Chapel arrays and domain maps, these leader-follower
// iterators are controlled by the ``dataPar*`` configuration constants
// described in doc/rst/usingchapel/executing.rst.
//
forall (a, i) in zip(A, count(probSize)) do
  a = i/100.0;
writeln("After re-assigning A using parallel zippering and A as the leader:");
writeln(A);
writeln();
//
// Finally, as mentioned at the outset, operations that are equivalent
// to zippering also use leader-follower iterators, so for example
// the following whole-array assignment will use ``A``'s leader and
// ``count()``'s follower:
//
A = count(probSize, low=100);
writeln("After re-assigning A using whole-array assignment:");
writeln(A);
writeln();
/*
.. _primers-parIters-closing-notes:
Closing notes
-------------
*/
// Chapel data types like records and classes can support iteration
// by defining iterator methods (invoked by name) or ``these()``
// iterators, which permit iterating over variables of the type
// directly. Such iterator methods can be overloaded with standalone
// and/or leader-follower versions to enable parallel iteration over
// the variable.
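//
// As a sketch of that pattern (using a hypothetical record and field
// names -- this is not part of this primer's compiled code), a
// record's serial ``these()`` iterator and a standalone parallel
// overload of it might look like:
/* .. code-block:: chapel

     record Triple {
       var x, y, z: int;

       // serial these(): supports 'for v in myTriple'
       iter these() {
         yield x; yield y; yield z;
       }

       // standalone overload: supports 'forall v in myTriple';
       // each cobegin statement is a task that yields one field
       iter these(param tag: iterKind)
              where tag == iterKind.standalone {
         cobegin {
           yield x;
           yield y;
           yield z;
         }
       }
     }
*/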
//
// As mentioned at the outset, our leader-follower scheme has a number
// of changes planned for it such as interface improvements and better
// error checking. We'll update this primer over time as we improve
// these features.
//
// **Definition of helper function used above:**
//
// The following utility function partitions a range into
// ``numChunks`` sub-ranges and returns a range representing the
// indices for sub-range ``myChunk`` (counting from 0). The sizes of
// the sub-ranges differ from one another by at most 1. If the
// remainder ``rem = numElems % numChunks`` is 0, each sub-range
// contains ``elemsPerChunk = numElems / numChunks`` indices.
// Otherwise, the first ``rem`` sub-ranges get ``elemsPerChunk + 1``
// indices each, and the remaining chunks (``rem`` through
// ``numChunks-1``) get ``elemsPerChunk``. For simplicity, this
// routine works only for unstrided ranges with the default index
// type of ``int``; these constraints could be relaxed with more
// effort.
//
proc computeChunk(r: range, myChunk, numChunks) {
  const numElems = r.size;
  const elemsPerChunk = numElems/numChunks;
  const rem = numElems%numChunks;
  var myLow = r.low;
  if myChunk < rem {
    myLow += (elemsPerChunk+1)*myChunk;
    return myLow..#(elemsPerChunk+1);
  } else {
    myLow += ((elemsPerChunk+1)*rem + elemsPerChunk*(myChunk-rem));
    return myLow..#elemsPerChunk;
  }
}
// .. _User-Defined Parallel Zippered Iterators in Chapel: http://pgas11.rice.edu/papers/ChamberlainEtAl-Chapel-Iterators-PGAS11.pdf