Use faster lud_diagonal #11

Merged
merged 2 commits into diku-dk:master from Munksgaard:faster-lud on Jan 28, 2020
Conversation

Munksgaard (Contributor) commented on Jan 17, 2020

This version of lud_diagonal uses intra-group parallelism and is faster
with the right tuning parameters and incremental flattening:

Before:

$ cat lud-clean.fut.tuning
main.suff_intra_par_14=32
main.suff_intra_par_16=1024
main.suff_intra_par_18=2000000000
main.suff_intra_par_20=32
main.suff_outer_par_13=2000000000
main.suff_outer_par_15=2000000000
main.suff_outer_par_17=2000000000
main.suff_outer_par_19=1024

$ FUTHARK_INCREMENTAL_FLATTENING=1 futhark bench --backend=opencl lud-clean.fut   
Compiling lud-clean.fut...
Results for lud-clean.fut (using lud-clean.fut.tuning):
dataset data/16by16.in:     292.60μs (avg. of 10 runs; RSD: 0.12)
dataset data/64.in:         517.90μs (avg. of 10 runs; RSD: 0.11)
dataset data/256.in:       2167.60μs (avg. of 10 runs; RSD: 0.14)
dataset data/512.in:       4181.10μs (avg. of 10 runs; RSD: 0.12)
dataset data/2048.in:     20156.60μs (avg. of 10 runs; RSD: 0.12)

After

$ cat lud-clean.fut.tuning 
main.suff_intra_par_14=32
main.suff_intra_par_16=1024
main.suff_intra_par_18=32
main.suff_intra_par_20=32
main.suff_intra_par_24=32
main.suff_intra_par_6=32
main.suff_outer_par_13=2000000000
main.suff_outer_par_15=2000000000
main.suff_outer_par_17=1568
main.suff_outer_par_19=1024

$ FUTHARK_INCREMENTAL_FLATTENING=1 futhark bench --backend=opencl lud-clean.fut   
Compiling lud-clean.fut...
Results for lud-clean.fut (using lud-clean.fut.tuning):
dataset data/16by16.in:     142.40μs (avg. of 10 runs; RSD: 0.10)
dataset data/64.in:         215.80μs (avg. of 10 runs; RSD: 0.03)
dataset data/256.in:       1260.80μs (avg. of 10 runs; RSD: 0.03)
dataset data/512.in:       2692.20μs (avg. of 10 runs; RSD: 0.04)
dataset data/2048.in:     19300.50μs (avg. of 10 runs; RSD: 0.09)

on gpu04-diku-apl.
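For a rough sense of the improvement, the numbers above correspond to speedups of about 1.5-2.4x on the 16by16 through 512 datasets and roughly 4% on 2048. As far as I understand the incremental-flattening machinery, the suff_outer_par_*/suff_intra_par_* entries in the tuning files are threshold parameters generated by the compiler; a huge value such as 2000000000 means that code version is never considered sufficiently parallel, which steers execution towards the intra-group version instead.

The sketch below is a hypothetical, simplified illustration of the general idea, not the code in this PR: an unblocked LU factorisation of a single b-by-b block, with the column scaling and the trailing-submatrix update written as maps. When a function like this is mapped over all diagonal blocks, incremental flattening can execute each block's inner maps as intra-group parallelism inside one workgroup, provided b is small enough.

```futhark
-- Hypothetical sketch, not the lud_diagonal from this PR: LU factorisation
-- (no pivoting) of one b-by-b block.  The loop over the pivot index k is
-- sequential; the per-block updates are maps that incremental flattening
-- can run with intra-group parallelism when b fits in a workgroup.
let lud_diagonal_sketch [b] (blk: [b][b]f32): [b][b]f32 =
  loop blk = blk for k < b do
    let pivot = blk[k, k]
    -- Scale the sub-diagonal part of column k: L[i,k] = A[i,k] / A[k,k].
    let blk = map2 (\i row ->
                      map2 (\j x -> if i > k && j == k then x / pivot else x)
                           (iota b) row)
                   (iota b) blk
    -- Update the trailing submatrix: A[i,j] -= L[i,k] * U[k,j].
    in map2 (\i row ->
               map2 (\j x -> if i > k && j > k
                             then x - blk[i, k] * blk[k, j]
                             else x)
                    (iota b) row)
            (iota b) blk
```

If I remember the tooling correctly, tuning files like the ones above can also be produced automatically with `futhark autotune --backend=opencl lud-clean.fut` rather than written by hand.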

Munksgaard force-pushed the Munksgaard:faster-lud branch from 908a1ff to 2b0dcb8 on Jan 20, 2020
Munksgaard (Contributor, Author) commented on Jan 22, 2020

In comparison, lud.fut performs about the same as the old lud-clean.fut:

$ cat lud.fut.tuning 
main.suff_intra_par_19=1024
main.suff_intra_par_21=1024
main.suff_intra_par_23=32
main.suff_intra_par_25=32
main.suff_outer_par_18=2000000000
main.suff_outer_par_20=2000000000
main.suff_outer_par_22=7200
main.suff_outer_par_24=50176

$ FUTHARK_INCREMENTAL_FLATTENING=1 futhark bench --backend=opencl lud.fut
Compiling lud.fut...
Results for lud.fut (using lud.fut.tuning):
dataset data/16by16.in:     303.60μs (avg. of 10 runs; RSD: 0.09)
dataset data/64.in:         548.50μs (avg. of 10 runs; RSD: 0.10)
dataset data/256.in:       2297.50μs (avg. of 10 runs; RSD: 0.12)
dataset data/512.in:       6138.60μs (avg. of 10 runs; RSD: 0.05)
dataset data/2048.in:     24013.00μs (avg. of 10 runs; RSD: 0.12)

whereas the Rodinia implementation is a lot faster than ours:

$ ./lud -i ../../../data/lud/64.dat                                               
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/64.dat
num_devices = 1
Time consumed(microseconds): 82.000000
Time consumed(microseconds): 61.000000
Time consumed(microseconds): 59.000000
Time consumed(microseconds): 68.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 57.000000
Time consumed(microseconds): 57.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 67.000000

$ ./lud -i ../../../data/lud/256.dat                                              
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/256.dat
num_devices = 1
Time consumed(microseconds): 287.000000
Time consumed(microseconds): 293.000000
Time consumed(microseconds): 260.000000
Time consumed(microseconds): 264.000000
Time consumed(microseconds): 322.000000
Time consumed(microseconds): 266.000000
Time consumed(microseconds): 278.000000
Time consumed(microseconds): 299.000000
Time consumed(microseconds): 258.000000
Time consumed(microseconds): 274.000000
Time consumed(microseconds): 268.000000

$ ./lud -i ../../../data/lud/512.dat                                              
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/512.dat
num_devices = 1
Time consumed(microseconds): 543.000000
Time consumed(microseconds): 564.000000
Time consumed(microseconds): 653.000000
Time consumed(microseconds): 564.000000
Time consumed(microseconds): 536.000000
Time consumed(microseconds): 558.000000
Time consumed(microseconds): 538.000000
Time consumed(microseconds): 526.000000
Time consumed(microseconds): 544.000000
Time consumed(microseconds): 638.000000
Time consumed(microseconds): 537.000000

$ ./lud -i ../../../data/lud/2048.dat                                             
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/2048.dat
num_devices = 1
Time consumed(microseconds): 2237.000000
Time consumed(microseconds): 2176.000000
Time consumed(microseconds): 2167.000000
Time consumed(microseconds): 2185.000000
Time consumed(microseconds): 2175.000000
Time consumed(microseconds): 2155.000000
Time consumed(microseconds): 2172.000000
Time consumed(microseconds): 2158.000000
Time consumed(microseconds): 2155.000000
Time consumed(microseconds): 2157.000000
Time consumed(microseconds): 2155.000000
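Putting these numbers next to the tuned lud.fut results above, and assuming the times printed by the Rodinia binary and the futhark bench averages measure comparable work, Rodinia is roughly 8-11x faster on these datasets: about 58µs vs 548.5µs on 64, 274µs vs 2297.5µs on 256, 544µs vs 6138.6µs on 512, and 2167µs vs 24013.0µs on 2048 (using the median of the printed Rodinia runs).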
athas (Member) commented on Jan 27, 2020

What is the difference between lud.fut and lud-clean.fut? And if we really want a version that uses this kind of hack, should it then go in the "clean" version?

Munksgaard (Contributor, Author) commented on Jan 27, 2020

I'm not sure there's much point in keeping both lud.fut and lud-clean.fut, as both seem to perform about the same (before this PR). I believe the original intent was to have a nice implementation that was easy (or easier) to understand and modify and one that ran fast, but that doesn't seem necessary any more.

athas (Member) commented on Jan 27, 2020

I'll merge this if you create one single lud.fut that contains your grand vision and get rid of lud-clean.fut.

Munksgaard added 2 commits on Jan 28, 2020:

lud-clean was originally created as a nicer but slower implementation of lud. However, it is not actually any slower any more, so we should replace lud with lud-clean. This commit does so.

This version of lud_diagonal uses intra-group parallelism and is much faster with the right tuning parameters.
Munksgaard force-pushed the Munksgaard:faster-lud branch from 81caf4c to 305f771 on Jan 28, 2020
Munksgaard (Contributor, Author) commented on Jan 28, 2020

> I'll merge this if you create one single lud.fut that contains your grand vision and get rid of lud-clean.fut.

I've pushed new commits that merge lud-clean and lud, and apply the changes to lud_diagonal that I've been working on.

athas merged commit 26964cb into diku-dk:master on Jan 28, 2020

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)
Munksgaard deleted the Munksgaard:faster-lud branch on Jan 28, 2020