Optimized DCT Implementation #24

Closed
abonander opened this issue Dec 27, 2018 · 12 comments
abonander (Owner) commented Dec 27, 2018

Seeking an optimized DCT implementation with a compatible license.

Options:

Dubious:

  • FFTW; licensed under GPL, not compatible with MIT
    We could provide an interface to link it in, like the old UserDCT implementation, but composable with the new 3.0-alpha API.
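As a sketch of what such a pluggable interface could look like (the names Dct2dFn and with_dct_fn here are hypothetical, not the actual 3.0-alpha API):

    // Hypothetical: a user-supplied 2D DCT, e.g. backed by FFTW behind a feature
    // flag, operating on row-major data in place.
    type Dct2dFn = fn(data: &mut [f64], width: usize, height: usize);

    struct HasherConfig {
        dct_2d: Dct2dFn,
    }

    impl HasherConfig {
        // Swap in an external DCT without this crate linking GPL code itself.
        fn with_dct_fn(mut self, f: Dct2dFn) -> Self {
            self.dct_2d = f;
            self
        }
    }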

cc @ejmahler

ejmahler commented:

As a quick test, I benchmarked a rustdct-based implementation with the following code:

fn external_dct(data: &mut [f64], width: usize, height: usize) {
	let mut planner = rustdct::DCTplanner::new();
	let width_dct = planner.plan_dct2(width);
	let height_dct = planner.plan_dct2(height);

	let mut scratch = vec![0f64; data.len()];

	// width DCT
	for (src, dest) in data.chunks_mut(width).zip(scratch.chunks_mut(width)) {
		width_dct.process_dct2(src, dest);
	}

	// transpose
	unsafe { transpose(width, height, &scratch, data) };

	// height DCT
	for (src, dest) in data.chunks_mut(height).zip(scratch.chunks_mut(height)) {
		height_dct.process_dct2(src, dest);
	}

	// transpose back
	unsafe { transpose(height, width, &scratch, data) };
}

and "transpose" is copied from the rustFFT project.

"external_dct_no_transpose" is the same as "external_dct", but with the transpose lines commented out. This represents the theoretical fastest possible execution - but in reality cache problems might prevent anything from ever going that fast.

test external_impl_004x004              ... bench:         691 ns/iter (+/- 121)
test external_impl_008x008              ... bench:         891 ns/iter (+/- 52)
test external_impl_016x016              ... bench:       1,958 ns/iter (+/- 12)
test external_impl_256x256              ... bench:     884,817 ns/iter (+/- 461,570)
test external_no_transpose_impl_004x004 ... bench:         627 ns/iter (+/- 96)
test external_no_transpose_impl_008x008 ... bench:         825 ns/iter (+/- 480)
test external_no_transpose_impl_016x016 ... bench:       1,751 ns/iter (+/- 286)
test external_no_transpose_impl_256x256 ... bench:     771,039 ns/iter (+/- 270,178)
test naive_impl_004x004                 ... bench:       2,180 ns/iter (+/- 661)
test naive_impl_008x008                 ... bench:      16,341 ns/iter (+/- 216)
test naive_impl_016x016                 ... bench:     134,652 ns/iter (+/- 76,653)
test naive_impl_256x256                 ... bench: 587,429,464 ns/iter (+/- 72,507,053)

The results show that using rustdct over the current algorithm is a clear win, but that's not what we're testing. We want to know how much the transposes hurt runtime. For 8x8, the transposing version takes 8% longer; for 16x16, 12% longer; for 256x256, 15% longer.

So theoretically, the maximum improvement you can get by eliminating the transposes is 8-15%.

ejmahler commented Dec 27, 2018

test external_impl_004x004              ... bench:         304 ns/iter (+/- 107)
test external_impl_008x008              ... bench:         589 ns/iter (+/- 138)
test external_impl_016x016              ... bench:       1,625 ns/iter (+/- 351)
test external_impl_256x256              ... bench:     870,007 ns/iter (+/- 396,721)
test external_no_transpose_impl_004x004 ... bench:         251 ns/iter (+/- 80)
test external_no_transpose_impl_008x008 ... bench:         505 ns/iter (+/- 125)
test external_no_transpose_impl_016x016 ... bench:       1,408 ns/iter (+/- 942)
test external_no_transpose_impl_256x256 ... bench:     768,939 ns/iter (+/- 306,382)

This is a small variation where the same DCT planner (and thus the same DCT instance) is re-used across iterations. For 8x8, we see that the computation takes 50% longer when we have to recompute the DCT instance every time.

So, separately from the discussion of transposing vs. striding the column DCT: if you end up using rustdct, I recommend re-architecting your hash so that the same DCT instance(s) are reused multiple times.
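A minimal sketch of that re-architecture, reusing the planner output from the snippet above (treat the exact trait-object types as an assumption about rustdct's API of that era):

    use std::sync::Arc;

    // Hypothetical hasher that plans its DCTs once in the constructor and then
    // reuses the same instances for every image it hashes.
    struct DctHasher {
        width_dct: Arc<dyn rustdct::DCT2<f64>>,
        height_dct: Arc<dyn rustdct::DCT2<f64>>,
    }

    impl DctHasher {
        fn new(width: usize, height: usize) -> Self {
            let mut planner = rustdct::DCTplanner::new();
            DctHasher {
                width_dct: planner.plan_dct2(width),
                height_dct: planner.plan_dct2(height),
            }
        }
    }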

abonander (Owner, Author) commented Dec 27, 2018

Those numbers are compelling but I have a few questions:

  • how is the naive benchmark implemented?

  • how is transpose implemented? I'm trying to avoid unsafe code where possible. In a previous implementation that used transposition I had a naive O(N * M) loop swapping rows and columns.

  • are you compiling with -C target-cpu=native (see the example invocation below), and what series is the CPU you're benchmarking on?

If the benchmark were properly vectorized, I wouldn't expect the naive implementation to be an order of magnitude slower than one using doubles.
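For reference, the invocation I mean is nothing project-specific, just an environment variable:

    RUSTFLAGS="-C target-cpu=native" cargo bench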

I also want to clarify that the DCT data should be getting reused outside of the loop; the coefficients are calculated ahead of time and then reused by the same Hasher instance.

My column-indexing adapter may also be introducing some unnecessary bounds checks. I wonder if I can eliminate those without unsafe code.
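One safe pattern worth trying for the column walk (a hypothetical sketch, not the actual adapter code) is to drive it with iterators so there is no per-element indexing for the optimizer to guard:

    // Copy column `col` of a row-major, `width`-wide matrix into `out`.
    // skip() + step_by() walk the column without indexed accesses, so there
    // are no per-element bounds checks to elide in the first place.
    fn read_column(data: &[f64], width: usize, col: usize, out: &mut [f64]) {
        for (dst, src) in out.iter_mut().zip(data.iter().skip(col).step_by(width)) {
            *dst = *src;
        }
    }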

Note that the latest HEAD on 3-alpha has a slightly different implementation as well. That one doesn't use any unsafe code that I can remember off the top of my head.

ejmahler commented Dec 27, 2018

You can see the full test code in my fork of img_hash: ejmahler@0262388

  • The naive benchmark is the following:
fn bench_naive(b: &mut test::Bencher, width: usize, height: usize) {

    let mut signal = vec![0f64; width * height];

    b.iter(|| { img_hash::dct_2d(&mut signal, width); });
}
#[bench] fn naive_impl_004x004(b: &mut test::Bencher) { bench_naive(b, 4, 4); }
#[bench] fn naive_impl_008x008(b: &mut test::Bencher) { bench_naive(b, 8, 8); }
#[bench] fn naive_impl_016x016(b: &mut test::Bencher) { bench_naive(b, 16, 16); }
//#[bench] fn naive_impl_256x256(b: &mut test::Bencher) { bench_naive(b, 256, 256); }

And I modified img_hash to mark the dct_2d function as pub.

  • transpose is implemented via the same loop-tiling method as RustFFT, which improves cache friendliness over the naive transpose algorithm (see the sketch after this list). It uses unsafe, but I'm interested in trying a version that is completely safe. It will be slower, but not by a lot. If it turns out to be the same speed, I'm interested in converting RustFFT's version to use safe code.

  • I haven't run with any special commands - just "cargo bench". I'll have time to run them with cpu-specific options tonight, or you have the code so you can run them yourself.

  • Final thought: These benchmarks were run on a laptop, so it's possible that you'll see the results you expect if you run them on a desktop.
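As a rough illustration of the tiling idea mentioned above (a safe sketch with a made-up BLOCK_SIZE, not RustFFT's actual implementation):

    const BLOCK_SIZE: usize = 16; // hypothetical tile edge; the real value is tuned

    // Loop-tiled transpose: walk the matrix in BLOCK_SIZE x BLOCK_SIZE tiles so
    // both the reads and the scattered writes stay within a few cache lines at
    // a time instead of striding across the whole matrix.
    fn transpose_tiled(width: usize, height: usize, input: &[f64], output: &mut [f64]) {
        for y0 in (0..height).step_by(BLOCK_SIZE) {
            for x0 in (0..width).step_by(BLOCK_SIZE) {
                for y in y0..(y0 + BLOCK_SIZE).min(height) {
                    for x in x0..(x0 + BLOCK_SIZE).min(width) {
                        output[x * height + y] = input[y * width + x];
                    }
                }
            }
        }
    }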

abonander (Owner, Author) commented Dec 27, 2018

Yeah, you're using the implementation from master, which does trig in the loop instead of precalculating coefficients. I'll implement benchmarks on my 3-alpha branch and see how they stack up.

Cache thrashing shouldn't be an issue, since the coefficient matrix and the scratch space should all fit in L1. However, trig operations don't vectorize, so I think they're dragging down the loop.
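A sketch of what "precalculating coefficients" means here (my reading of the 3-alpha approach, not the actual branch code, in f64 for consistency with the snippets above; the branch itself uses f32):

    use std::f64::consts::PI;

    // Precompute the DCT-II basis once per size:
    // coeffs[k * len + n] = cos(PI / len * (n + 0.5) * k)
    fn dct2_coefficients(len: usize) -> Vec<f64> {
        let mut coeffs = vec![0f64; len * len];
        for k in 0..len {
            for n in 0..len {
                coeffs[k * len + n] = (PI / len as f64 * (n as f64 + 0.5) * k as f64).cos();
            }
        }
        coeffs
    }

    // The transform itself is then a trig-free multiply-add loop that can vectorize.
    fn dct2_with_coeffs(input: &[f64], coeffs: &[f64], output: &mut [f64]) {
        let len = input.len();
        for (k, out) in output.iter_mut().enumerate() {
            *out = input.iter()
                .zip(&coeffs[k * len..][..len])
                .map(|(x, c)| x * c)
                .sum();
        }
    }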

Also, I'm using f32 in my branch because high precision isn't really necessary for this application; I think doubling the number of elements that fit in a SIMD register will beat any f64 FFT implementation hands-down, but I'm obviously going to try to verify that conjecture.

I'm also going to benchmark on my 7820X, which supports AVX2 fused multiply-add; I think by default the compiler only assumes SSE2 or something like that.

abonander (Owner, Author) commented:

I noticed just now that RustDCT is generic over the floating-point type and can operate on f32 directly. Very nice.

I should have benchmark results tonight.

abonander (Owner, Author) commented Dec 28, 2018

Yeah, RustDCT + transpose beats my naive DCT implementation hands-down even in the smallest case, so I'm completely fine with switching over.

I've spent all night trying to optimize a safe transpose, but your implementation beats my best effort by 20% on larger matrices: 0764a76

Looking at the optimized assembly, I haven't completely eliminated bounds checks, but the ones that remain don't seem to be in the hottest part of the loop, because replacing the indexing with get_unchecked_mut() only improves performance noticeably on the smallest matrices; at larger sizes the difference is lost in the noise.

I think this technique could be adapted to your tiled implementation, though; with some more tweaking I might be able to eliminate the bounds checks entirely in safe code.
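One shape that helps in safe code (a sketch of the general technique, not the code in the commit above) is to reborrow each row as a chunks_exact slice, so the read side carries no checks at all:

    // The input side is bounds-check-free: chunks_exact(width) hands out rows
    // whose length the optimizer knows statically. The scattered writes into
    // `output` are the part that still needs convincing.
    fn transpose_by_rows(width: usize, height: usize, input: &[f64], output: &mut [f64]) {
        for (y, row) in input.chunks_exact(width).take(height).enumerate() {
            for (x, &value) in row.iter().enumerate() {
                output[x * height + y] = value;
            }
        }
    }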

I also tried doubling BLOCK_SIZE; it improves performance on matrices that are taller than they are wide but worsens it in the opposite case. For square matrices, performance seems to depend mostly on alignment to cache lines. It might be worth switching BLOCK_SIZE dynamically, and maybe using non-square blocks that more closely match the aspect ratio of the input matrix.

Speaking of which, I bet we could squeeze out a good bit more performance with either implementation if we used allocations aligned to cache lines; my transpose seems to be a lot more sensitive to cache alignment. However, performance on the smaller matrices worsens if I add a leading loop that processes the unaligned elements and then passes the remaining aligned slice to .chunks_exact().
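One safe way to get a cache-line-aligned scratch buffer (a sketch, assuming 64-byte lines): over-allocate slightly and hand out a sub-slice that starts at the first aligned element:

    // Resize `buf` with one cache line of slack, then return a `len`-element
    // view whose first element sits on a 64-byte boundary.
    fn cache_aligned_view(buf: &mut Vec<f64>, len: usize) -> &mut [f64] {
        const LINE: usize = 64;
        const PAD: usize = LINE / std::mem::size_of::<f64>(); // 8 elements of slack
        buf.resize(len + PAD, 0.0);
        let misalign = (buf.as_ptr() as usize % LINE) / std::mem::size_of::<f64>();
        let start = if misalign == 0 { 0 } else { PAD - misalign };
        &mut buf[start..start + len]
    }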

ejmahler commented Jan 1, 2019

I've done a little more research on this, and have learned some things:

  • For small sizes, the allocation of scratch space is by far the slowest part of the 2D DCT2:

These are benchmark results with scratch allocations:

test external_impl_004x004              ... bench:       1,832 ns/iter (+/- 56)
test external_impl_008x008              ... bench:       1,955 ns/iter (+/- 99)
test external_impl_016x016              ... bench:       2,562 ns/iter (+/- 40)
test external_impl_256x256              ... bench:     518,990 ns/iter (+/- 23,939)
test external_no_transpose_impl_004x004 ... bench:       1,815 ns/iter (+/- 51)
test external_no_transpose_impl_008x008 ... bench:       1,908 ns/iter (+/- 50)
test external_no_transpose_impl_016x016 ... bench:       2,420 ns/iter (+/- 64)
test external_no_transpose_impl_256x256 ... bench:     464,690 ns/iter (+/- 11,559)

And these are the same benchmarks but with the scratch preallocated or omitted:

test external_impl_004x004              ... bench:          82 ns/iter (+/- 6)
test external_impl_008x008              ... bench:         210 ns/iter (+/- 17)
test external_impl_016x016              ... bench:         793 ns/iter (+/- 28)
test external_impl_256x256              ... bench:     535,170 ns/iter (+/- 50,945)
test external_no_transpose_impl_004x004 ... bench:          52 ns/iter (+/- 1)
test external_no_transpose_impl_008x008 ... bench:         154 ns/iter (+/- 7)
test external_no_transpose_impl_016x016 ... bench:         638 ns/iter (+/- 13)
test external_no_transpose_impl_256x256 ... bench:     469,070 ns/iter (+/- 12,303)

As you can see, allocation seems to add about 1750ns all on its own. (The 256x256 results are slower for the no-allocation version, but I suspect that's just noise.) Note that these benchmarks were all run on a Windows PC, and it looks like allocations are pretty expensive on Windows.

  • I implemented a fixed 8x8 version that doesn't need transposes:
test bench_fixed_transposeless_8x8      ... bench:          50 ns/iter (+/- 2)
test external_impl_008x008              ... bench:         210 ns/iter (+/- 17)

So the transposeless version is 4x faster than getting trait objects from the DCT planner and doing transposes etc., and 3x faster than using the size-8 butterflies directly but still doing transposes. So eliminating the transposes would definitely speed things up.

  • The implementation you have in your alpha branch does 3 allocations around the call to dct2_2d (one for the image data, one for the crop, and one inside dct2_2d itself). I can imagine rolling all of these into a single allocation and using slice::split_at_mut to divide it up afterwards (see the sketch after this list). I don't know how long imageops::resize takes, but if we assume it costs about the same as an allocation, then reducing the allocations would cut the runtime of the hash roughly in half, from 4 allocations to 2.

  • Assuming the biggest performance impact comes from those 2 "allocations" above, we have 1750 x 2 = 3500ns as the base runtime cost of this hash. On top of this, we either have 50ns for my transposeless version or 210ns for the DCTplanner version, for a total of 3550ns vs 3710ns, a difference of about 5%. So we can get maybe a 5% improvement by rewriting RustDCT or exposing an alternate path that supports an offset+stride - not worth the engineering effort, IMO.
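A sketch of the single-allocation idea from the third bullet (sizes are placeholders; real code would take them from the hash config):

    fn hash_with_one_allocation(width: usize, height: usize) {
        let len = width * height;
        // One allocation backs both the image floats and the DCT scratch space.
        let mut backing = vec![0f64; len * 2];
        let (image_data, scratch) = backing.split_at_mut(len);
        // ... fill `image_data` from the resized image, then run the row/column
        // DCTs against `scratch` exactly as in the earlier external_dct snippet ...
        let _ = (image_data, scratch);
    }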

ejmahler commented Jan 1, 2019

Finally, I'm annoyed at having to copy+paste transposes everywhere, so I intend to publish a crate this week with a few different transpose routines in it. I'll let you know when it's up.

ejmahler commented Jan 2, 2019

https://crates.io/crates/transpose

If you come up with a way to write this code safely, I'd be happy to accept a pull request.

abonander (Owner, Author) commented:

I may look at it closer at a later date, but as far as img_hash goes, it'd be nice to just be able to forget about the details of the 2D DCT; transpose as an external crate is a great step in that direction. I'm happy to leave it there for now; I'm eager to get 3.0 out the door.

I unfortunately can't avoid the allocation when resizing the image; the image API doesn't really give me any options there. That, and because some hash algorithms require different resize dimensions, the required capacity can vary drastically and dynamically, so fixed-size scratch arrays aren't really practical there.

However, I can combine the allocation of the image data when converting it to floats (probably what you meant) with the allocation of the scratch space in the DCT function. That's quite trivial.

crop_dct_2d doesn't actually allocate, by the way; it copies within the vector and then truncates it. Taking and returning ownership seemed more idiomatic than taking &mut Vec<f32>, which most people (and Clippy) see and think should be &mut [f32] instead.
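For anyone reading along, the copy-and-truncate pattern looks roughly like this (a sketch; the real crop_dct_2d signature may differ, and the crate uses f32):

    // Keep only the top-left crop_width x crop_height corner of a row-major,
    // `width`-wide matrix, compacting rows in place and then shrinking the Vec.
    // Assumes crop_width <= width and crop_height * width <= data.len().
    fn crop_2d(mut data: Vec<f64>, width: usize, crop_width: usize, crop_height: usize) -> Vec<f64> {
        for row in 1..crop_height {
            let src = row * width;
            data.copy_within(src..src + crop_width, row * crop_width);
        }
        data.truncate(crop_width * crop_height);
        data
    }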

abonander (Owner, Author) commented:

Calling this good with rustdct and transpose.
