Use faster lud_diagonal #11

Merged
merged 2 commits into from
Jan 28, 2020

Conversation

Munksgaard
Contributor

This version of lud_diagonal uses intra-group parallelism and is faster
with the right tuning parameters and incremental flattening:

Before:

$ cat lud-clean.fut.tuning
main.suff_intra_par_14=32
main.suff_intra_par_16=1024
main.suff_intra_par_18=2000000000
main.suff_intra_par_20=32
main.suff_outer_par_13=2000000000
main.suff_outer_par_15=2000000000
main.suff_outer_par_17=2000000000
main.suff_outer_par_19=1024

$ FUTHARK_INCREMENTAL_FLATTENING=1 futhark bench --backend=opencl lud-clean.fut   
Compiling lud-clean.fut...
Results for lud-clean.fut (using lud-clean.fut.tuning):
dataset data/16by16.in:     292.60μs (avg. of 10 runs; RSD: 0.12)
dataset data/64.in:         517.90μs (avg. of 10 runs; RSD: 0.11)
dataset data/256.in:       2167.60μs (avg. of 10 runs; RSD: 0.14)
dataset data/512.in:       4181.10μs (avg. of 10 runs; RSD: 0.12)
dataset data/2048.in:     20156.60μs (avg. of 10 runs; RSD: 0.12)

After:

$ cat lud-clean.fut.tuning 
main.suff_intra_par_14=32
main.suff_intra_par_16=1024
main.suff_intra_par_18=32
main.suff_intra_par_20=32
main.suff_intra_par_24=32
main.suff_intra_par_6=32
main.suff_outer_par_13=2000000000
main.suff_outer_par_15=2000000000
main.suff_outer_par_17=1568
main.suff_outer_par_19=1024

$ FUTHARK_INCREMENTAL_FLATTENING=1 futhark bench --backend=opencl lud-clean.fut   
Compiling lud-clean.fut...
Results for lud-clean.fut (using lud-clean.fut.tuning):
dataset data/16by16.in:     142.40μs (avg. of 10 runs; RSD: 0.10)
dataset data/64.in:         215.80μs (avg. of 10 runs; RSD: 0.03)
dataset data/256.in:       1260.80μs (avg. of 10 runs; RSD: 0.03)
dataset data/512.in:       2692.20μs (avg. of 10 runs; RSD: 0.04)
dataset data/2048.in:     19300.50μs (avg. of 10 runs; RSD: 0.09)

on gpu04-diku-apl.
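
To put the two runs side by side, here is a minimal Python sketch (not part of this PR) that computes the per-dataset speedup from the runtimes reported above. The dictionary names and the `speedups` helper are made up for illustration; the numbers are copied from the benchmark output.

```python
# Runtimes in microseconds, copied from the `futhark bench` output above.
# All names here are hypothetical and only serve the illustration.
before_us = {
    "16by16": 292.60,
    "64": 517.90,
    "256": 2167.60,
    "512": 4181.10,
    "2048": 20156.60,
}
after_us = {
    "16by16": 142.40,
    "64": 215.80,
    "256": 1260.80,
    "512": 2692.20,
    "2048": 19300.50,
}


def speedups(before, after):
    """Return the before/after runtime ratio per dataset (>1 means faster)."""
    return {name: before[name] / after[name] for name in before}


lud_clean_speedups = speedups(before_us, after_us)

for name, ratio in lud_clean_speedups.items():
    print(f"{name:>8}: {ratio:.2f}x")
```

On these numbers the gain is roughly 1.5x to 2.4x for the datasets up to 512 and about 4% on 2048.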

@Munksgaard
Contributor Author

In comparison, lud.fut performs about the same as the old lud-clean.fut:

$ cat lud.fut.tuning 
main.suff_intra_par_19=1024
main.suff_intra_par_21=1024
main.suff_intra_par_23=32
main.suff_intra_par_25=32
main.suff_outer_par_18=2000000000
main.suff_outer_par_20=2000000000
main.suff_outer_par_22=7200
main.suff_outer_par_24=50176

$ FUTHARK_INCREMENTAL_FLATTENING=1 futhark bench --backend=opencl lud.fut
Compiling lud.fut...
Results for lud.fut (using lud.fut.tuning):
dataset data/16by16.in:     303.60μs (avg. of 10 runs; RSD: 0.09)
dataset data/64.in:         548.50μs (avg. of 10 runs; RSD: 0.10)
dataset data/256.in:       2297.50μs (avg. of 10 runs; RSD: 0.12)
dataset data/512.in:       6138.60μs (avg. of 10 runs; RSD: 0.05)
dataset data/2048.in:     24013.00μs (avg. of 10 runs; RSD: 0.12)

whereas the Rodinia implementation is a lot faster than ours:

$ ./lud -i ../../../data/lud/64.dat                                               
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/64.dat
num_devices = 1
Time consumed(microseconds): 82.000000
Time consumed(microseconds): 61.000000
Time consumed(microseconds): 59.000000
Time consumed(microseconds): 68.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 57.000000
Time consumed(microseconds): 57.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 58.000000
Time consumed(microseconds): 67.000000

$ ./lud -i ../../../data/lud/256.dat                                              
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/256.dat
num_devices = 1
Time consumed(microseconds): 287.000000
Time consumed(microseconds): 293.000000
Time consumed(microseconds): 260.000000
Time consumed(microseconds): 264.000000
Time consumed(microseconds): 322.000000
Time consumed(microseconds): 266.000000
Time consumed(microseconds): 278.000000
Time consumed(microseconds): 299.000000
Time consumed(microseconds): 258.000000
Time consumed(microseconds): 274.000000
Time consumed(microseconds): 268.000000

$ ./lud -i ../../../data/lud/512.dat                                              
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/512.dat
num_devices = 1
Time consumed(microseconds): 543.000000
Time consumed(microseconds): 564.000000
Time consumed(microseconds): 653.000000
Time consumed(microseconds): 564.000000
Time consumed(microseconds): 536.000000
Time consumed(microseconds): 558.000000
Time consumed(microseconds): 538.000000
Time consumed(microseconds): 526.000000
Time consumed(microseconds): 544.000000
Time consumed(microseconds): 638.000000
Time consumed(microseconds): 537.000000

$ ./lud -i ../../../data/lud/2048.dat                                             
WG size of kernel = 16 X 16
Reading matrix from file ../../../data/lud/2048.dat
num_devices = 1
Time consumed(microseconds): 2237.000000
Time consumed(microseconds): 2176.000000
Time consumed(microseconds): 2167.000000
Time consumed(microseconds): 2185.000000
Time consumed(microseconds): 2175.000000
Time consumed(microseconds): 2155.000000
Time consumed(microseconds): 2172.000000
Time consumed(microseconds): 2158.000000
Time consumed(microseconds): 2155.000000
Time consumed(microseconds): 2157.000000
Time consumed(microseconds): 2155.000000
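
A rough way to quantify the gap is to take the median of the Rodinia timings above and compare it with the new lud-clean.fut averages. The sketch below is Python and not part of the PR; it assumes the Rodinia `.dat` files and the Futhark `.in` files of the same size hold comparable inputs and that both tools time the same portion of the computation, neither of which is confirmed here. All variable names are hypothetical.

```python
from statistics import median

# Rodinia timings in microseconds, copied from the ./lud output above.
rodinia_us = {
    "64": [82, 61, 59, 68, 58, 57, 57, 58, 58, 58, 67],
    "256": [287, 293, 260, 264, 322, 266, 278, 299, 258, 274, 268],
    "512": [543, 564, 653, 564, 536, 558, 538, 526, 544, 638, 537],
    "2048": [2237, 2176, 2167, 2185, 2175, 2155, 2172, 2158, 2155, 2157, 2155],
}

# New lud-clean.fut averages in microseconds (the "After" run).
futhark_us = {
    "64": 215.80,
    "256": 1260.80,
    "512": 2692.20,
    "2048": 19300.50,
}

# The median is less sensitive than the mean to a slow first run.
rodinia_median_us = {name: median(runs) for name, runs in rodinia_us.items()}

# How many times slower the Futhark average is than the Rodinia median.
slowdown = {
    name: futhark_us[name] / rodinia_median_us[name] for name in futhark_us
}

for name in futhark_us:
    print(f"{name:>5}: Rodinia median {rodinia_median_us[name]} us, "
          f"Futhark {futhark_us[name]} us, ratio {slowdown[name]:.2f}x")
```

With these figures the Rodinia medians are between roughly 3.7x and 8.9x lower than the Futhark averages, and the gap widens with the dataset size.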

@athas
Member

athas commented Jan 27, 2020

What is the difference between lud.fut and lud-clean.fut? And if we really want a version that uses this kind of hack, should it then go in the "clean" version?

@Munksgaard
Contributor Author

I'm not sure there's much point in keeping both lud.fut and lud-clean.fut, as both seem to perform about the same (before this PR). I believe the original intent was to have a nice implementation that was easy (or easier) to understand and modify and one that ran fast, but that doesn't seem necessary any more.

@athas
Member

athas commented Jan 27, 2020

I'll merge this if you create one single lud.fut that contains your grand vision and get rid of lud-clean.fut.

lud-clean was originally created as a nicer but slower implementation of lud.
However, it is not actually any slower any more, so we should replace lud with
lud-clean. This commit does so.

This version of lud_diagonal uses intra-group parallelism and is much faster
with the right tuning parameters.
@Munksgaard
Contributor Author

Munksgaard commented Jan 28, 2020

> I'll merge this if you create one single lud.fut that contains your grand vision and get rid of lud-clean.fut.

I've pushed new commits that merge lud-clean and lud, and apply the changes to lud_diagonal that I've been working on.

@athas athas merged commit 26964cb into diku-dk:master Jan 28, 2020
@Munksgaard Munksgaard deleted the faster-lud branch January 28, 2020 11:45
athas added a commit that referenced this pull request Oct 27, 2021
Use faster lud_diagonal

Former-commit-id: 26964cb
athas added a commit that referenced this pull request Oct 27, 2021
Use faster lud_diagonal

Former-commit-id: 26964cb
Former-commit-id: 6d19699