Use faster lud_diagonal #11

Munksgaard · 2020-01-17T13:35:05Z

This version of lud_diagonal uses intra-group parallelism and is faster
with the right tuning parameters and incremental flattening:

Before:

$ cat lud-clean.fut.tuning
main.suff_intra_par_14=32
main.suff_intra_par_16=1024
main.suff_intra_par_18=2000000000
main.suff_intra_par_20=32
main.suff_outer_par_13=2000000000
main.suff_outer_par_15=2000000000
main.suff_outer_par_17=2000000000
main.suff_outer_par_19=1024

$ FUTHARK_INCREMENTAL_FLATTENING=1 futhark bench --backend=opencl lud-clean.fut   
Compiling lud-clean.fut...
Results for lud-clean.fut (using lud-clean.fut.tuning):
dataset data/16by16.in:     292.60μs (avg. of 10 runs; RSD: 0.12)
dataset data/64.in:         517.90μs (avg. of 10 runs; RSD: 0.11)
dataset data/256.in:       2167.60μs (avg. of 10 runs; RSD: 0.14)
dataset data/512.in:       4181.10μs (avg. of 10 runs; RSD: 0.12)
dataset data/2048.in:     20156.60μs (avg. of 10 runs; RSD: 0.12)

After:

$ cat lud-clean.fut.tuning 
main.suff_intra_par_14=32
main.suff_intra_par_16=1024
main.suff_intra_par_18=32
main.suff_intra_par_20=32
main.suff_intra_par_24=32
main.suff_intra_par_6=32
main.suff_outer_par_13=2000000000
main.suff_outer_par_15=2000000000
main.suff_outer_par_17=1568
main.suff_outer_par_19=1024

$ FUTHARK_INCREMENTAL_FLATTENING=1 futhark bench --backend=opencl lud-clean.fut   
Compiling lud-clean.fut...
Results for lud-clean.fut (using lud-clean.fut.tuning):
dataset data/16by16.in:     142.40μs (avg. of 10 runs; RSD: 0.10)
dataset data/64.in:         215.80μs (avg. of 10 runs; RSD: 0.03)
dataset data/256.in:       1260.80μs (avg. of 10 runs; RSD: 0.03)
dataset data/512.in:       2692.20μs (avg. of 10 runs; RSD: 0.04)
dataset data/2048.in:     19300.50μs (avg. of 10 runs; RSD: 0.09)

Both benchmarks were run on gpu04-diku-apl.
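To put a number on "faster", here is a minimal sketch that computes the per-dataset speedup from the two benchmark runs above. The runtimes are copied from the futhark bench output; the dictionary and function names are illustrative and not part of the benchmark suite.

```python
# Mean runtimes in microseconds, copied from the two futhark bench runs above.
before = {"16by16": 292.60, "64": 517.90, "256": 2167.60, "512": 4181.10, "2048": 20156.60}
after = {"16by16": 142.40, "64": 215.80, "256": 1260.80, "512": 2692.20, "2048": 19300.50}


def speedups(old, new):
    """Return the old/new runtime ratio per dataset (values above 1 mean `new` is faster)."""
    return {name: old[name] / new[name] for name in old}


ratios = speedups(before, after)
for name, ratio in ratios.items():
    print(f"{name:>7}: {ratio:.2f}x")
```

On these numbers the gain is about 2x for 16by16.in and 64.in, about 1.5x to 1.7x for 256.in and 512.in, and only about 1.04x for 2048.in, which is smaller than the RSD reported for that dataset.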

Munksgaard · 2020-01-22T13:30:48Z

In comparison, lud.fut performs about the same as the old lud-clean.fut:

$ cat lud.fut.tuning 
main.suff_intra_par_19=1024
main.suff_intra_par_21=1024
main.suff_intra_par_23=32
main.suff_intra_par_25=32
main.suff_outer_par_18=2000000000
main.suff_outer_par_20=2000000000
main.suff_outer_par_22=7200
main.suff_outer_par_24=50176

$ FUTHARK_INCREMENTAL_FLATTENING=1 futhark bench --backend=opencl lud.fut
Compiling lud.fut...
Results for lud.fut (using lud.fut.tuning):
dataset data/16by16.in:     303.60μs (avg. of 10 runs; RSD: 0.09)
dataset data/64.in:         548.50μs (avg. of 10 runs; RSD: 0.10)
dataset data/256.in:       2297.50μs (avg. of 10 runs; RSD: 0.12)
dataset data/512.in:       6138.60μs (avg. of 10 runs; RSD: 0.05)
dataset data/2048.in:     24013.00μs (avg. of 10 runs; RSD: 0.12)

whereas the Rodinia implementation is a lot faster than ours:

$ ./lud -i ../../../data/lud/64.dat                                               
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/64.dat
num_devices = 1
Time consumed(microseconds): 82.000000
Time consumed(microseconds): 61.000000
Time consumed(microseconds): 59.000000
Time consumed(microseconds): 68.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 57.000000
Time consumed(microseconds): 57.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 67.000000

$ ./lud -i ../../../data/lud/256.dat                                              
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/256.dat
num_devices = 1
Time consumed(microseconds): 287.000000
Time consumed(microseconds): 293.000000
Time consumed(microseconds): 260.000000
Time consumed(microseconds): 264.000000
Time consumed(microseconds): 322.000000
Time consumed(microseconds): 266.000000
Time consumed(microseconds): 278.000000
Time consumed(microseconds): 299.000000
Time consumed(microseconds): 258.000000
Time consumed(microseconds): 274.000000
Time consumed(microseconds): 268.000000

$ ./lud -i ../../../data/lud/512.dat                                              
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/512.dat
num_devices = 1
Time consumed(microseconds): 543.000000
Time consumed(microseconds): 564.000000
Time consumed(microseconds): 653.000000
Time consumed(microseconds): 564.000000
Time consumed(microseconds): 536.000000
Time consumed(microseconds): 558.000000
Time consumed(microseconds): 538.000000
Time consumed(microseconds): 526.000000
Time consumed(microseconds): 544.000000
Time consumed(microseconds): 638.000000
Time consumed(microseconds): 537.000000

$ ./lud -i ../../../data/lud/2048.dat                                             
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/2048.dat
num_devices = 1
Time consumed(microseconds): 2237.000000
Time consumed(microseconds): 2176.000000
Time consumed(microseconds): 2167.000000
Time consumed(microseconds): 2185.000000
Time consumed(microseconds): 2175.000000
Time consumed(microseconds): 2155.000000
Time consumed(microseconds): 2172.000000
Time consumed(microseconds): 2158.000000
Time consumed(microseconds): 2155.000000
Time consumed(microseconds): 2157.000000
Time consumed(microseconds): 2155.000000
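As a rough quantification of "a lot faster", the sketch below takes the median of the eleven Rodinia timings per dataset and divides the lud.fut mean runtime by it. It assumes the two tools time comparable work, which this thread does not confirm, and it compares a median against a mean, so treat the ratios as indicative only. All names are illustrative.

```python
from statistics import median

# Rodinia timings in microseconds, copied from the ./lud runs above (11 runs each).
rodinia = {
    "64": [82, 61, 59, 68, 58, 57, 57, 58, 58, 58, 67],
    "256": [287, 293, 260, 264, 322, 266, 278, 299, 258, 274, 268],
    "512": [543, 564, 653, 564, 536, 558, 538, 526, 544, 638, 537],
    "2048": [2237, 2176, 2167, 2185, 2175, 2155, 2172, 2158, 2155, 2157, 2155],
}

# lud.fut mean runtimes in microseconds, copied from the futhark bench output above.
futhark = {"64": 548.50, "256": 2297.50, "512": 6138.60, "2048": 24013.00}

# Median Rodinia runtime per dataset, and how many times slower lud.fut is.
medians = {name: median(times) for name, times in rodinia.items()}
ratios = {name: futhark[name] / medians[name] for name in futhark}

for name in futhark:
    print(f"{name:>5}: Rodinia median {medians[name]} us, lud.fut {ratios[name]:.1f}x slower")
```

Under those assumptions, Rodinia comes out roughly 8x to 11x faster than lud.fut across the four datasets.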

athas · 2020-01-27T10:09:06Z

What is the difference between lud.fut and lud-clean.fut? And if we really want a version that uses this kind of hack, should it then go in the "clean" version?

Munksgaard · 2020-01-27T10:16:13Z

I'm not sure there's much point in keeping both lud.fut and lud-clean.fut, as both seem to perform about the same (before this PR). I believe the original intent was to have a nice implementation that was easy (or easier) to understand and modify and one that ran fast, but that doesn't seem necessary any more.

athas · 2020-01-27T12:42:50Z

I'll merge this if you create one single lud.fut that contains your grand vision and get rid of lud-clean.fut.

Munksgaard · 2020-01-28T09:46:17Z

> I'll merge this if you create one single lud.fut that contains your grand vision and get rid of lud-clean.fut.

I've pushed new commits that merge lud-clean and lud, and apply the changes to lud_diagonal that I've been working on.

Munksgaard force-pushed the faster-lud branch from 908a1ff to 2b0dcb8 on January 20, 2020 07:35

Munksgaard added 2 commits January 28, 2020 09:56

Combine lud-clean.fut and lud.fut into one

3574c35

lud-clean was originally created as a nicer but slower implementation of lud. However, it is no longer any slower, so we should replace lud with lud-clean. This commit does so.

Use faster lud_diagonal

305f771

This version of lud_diagonal uses intra-group parallelism and is much faster with the right tuning parameters.

Munksgaard force-pushed the faster-lud branch from 81caf4c to 305f771 on January 28, 2020 09:00

athas merged commit 26964cb into diku-dk:master Jan 28, 2020

Munksgaard deleted the faster-lud branch January 28, 2020 11:45

athas added a commit that referenced this pull request Oct 27, 2021

Merge pull request #11 from Munksgaard/faster-lud

6d19699

Use faster lud_diagonal Former-commit-id: 26964cb

athas added a commit that referenced this pull request Oct 27, 2021

Merge pull request #11 from Munksgaard/faster-lud

fd66c98

Use faster lud_diagonal Former-commit-id: 26964cb Former-commit-id: 6d19699

athas added a commit that referenced this pull request Oct 27, 2021

Merge pull request #11 from Munksgaard/faster-lud

5f9f3d8

Use faster lud_diagonal Former-commit-id: 26964cb
