Consider relaxing initialisation requirement of scratch buffers #105

Open
WalterSmuts opened this issue Dec 8, 2022 · 28 comments

@WalterSmuts
Contributor

The Fft trait requires scratch buffers to be initialised by defining their type as &mut [Complex<T>]. This type carries an implicit requirement that the memory be initialised. Since the fft algorithm should ALWAYS write to the scratch buffer before reading from it, it does not really require the memory to be initialised. To capture this in the type, I think we'd need &mut MaybeUninit<[Complex<T>]> or perhaps &mut [MaybeUninit<Complex<T>>].

The benefit would be a slight performance improvement. Applications would no longer be required to fill the scratch buffer with a needless default value, and wrapping libraries such as easyfft could implement fft operations on slices and arrays without requiring the Default trait bound on the elements.

I haven't looked at the implementation of the fft algorithms to see if this is practical to express. It could be that it gets in the way of the logic as it's currently written. It would also require another major version change.
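[Editor's note] A minimal sketch of the allocation difference the proposal targets. This is not RustFFT API: it uses plain f32 elements instead of Complex<T> and only contrasts the two ways of obtaining a scratch buffer.

```rust
use std::mem::MaybeUninit;

fn main() {
    let len = 1 << 20;

    // Today: the element type must provide a value to fill with, so the
    // whole buffer is written once before the FFT ever touches it.
    let scratch_today: Vec<f32> = vec![0.0; len];

    // Proposed: allocate without touching the memory. MaybeUninit makes the
    // "may be garbage" state explicit to the compiler, so this is safe code.
    let scratch_proposed: Vec<MaybeUninit<f32>> = vec![MaybeUninit::uninit(); len];

    assert_eq!(scratch_today.len(), len);
    assert_eq!(scratch_proposed.len(), len);
}
```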

@HEnquist
Contributor

HEnquist commented Dec 8, 2022

This is quite similar to #79

@WalterSmuts
Contributor Author

Interesting. I hadn't thought about the case of the out-of-place output. That's true, it does apply there as well.

Forcing all users to do this would be a large ergonomics burden on users,

I'm not sure I understand the burden. Note, it won't change the process_with_scratch API. It would only change the process_outofplace_with_scratch API. It would however change it from:

    fn process_outofplace_with_scratch(
        &self,
        input: &mut [Complex<T>],
        output: &mut [Complex<T>],
        scratch: &mut [Complex<T>]
    )

to:

    fn process_outofplace_with_scratch_uninit(
        &self,
        input: &mut [Complex<T>],
        output: [MaybeUninit<Complex<T>>],
        scratch: &mut [MaybeUninit<Complex<T>>]
    ) -> [Complex<T>]

Note the output is being consumed and the function now returns a [Complex<T>]. This would point to the same memory; only the MaybeUninit type wrapper is removed. As for creation of the slices... You could just provide an allocate_uninit_slice helper function that's generic over the element type and the length of the slice.
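[Editor's note] The helper named above doesn't exist in RustFFT or std; a minimal sketch of what an allocate_uninit_slice could look like in safe, stable Rust:

```rust
use std::mem::MaybeUninit;

// Hypothetical helper (not part of RustFFT): allocate a boxed slice of
// uninitialised elements without writing to the element payloads.
fn allocate_uninit_slice<T>(len: usize) -> Box<[MaybeUninit<T>]> {
    let mut v: Vec<MaybeUninit<T>> = Vec::with_capacity(len);
    // resize_with fills with MaybeUninit::uninit(), which constructs each
    // element without initialising its payload bytes.
    v.resize_with(len, MaybeUninit::uninit);
    v.into_boxed_slice()
}

fn main() {
    let scratch = allocate_uninit_slice::<f32>(1024);
    assert_eq!(scratch.len(), 1024);
}
```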

and adding it as an alternative API would require a very large amount of boilerplate that would be too large to maintain and document.

This double API would also be useful for folks who DO have an initialised array, perhaps a buffer that gets re-used as the output. I may be missing the large amounts of boilerplate and documentation required. Won't you be able to define the one in terms of the other? Something like:

fn process_outofplace_with_scratch(
    &self,
    input: &mut [Complex<T>],
    output: &mut [Complex<T>],
    scratch: &mut [Complex<T>],
) {
    output = self.process_outofplace_with_scratch_uninit(
        input,
        MaybeUninit::new(output.clone()),
        scratch,
    );
}

The compiler should be able to optimise out the clone... But that's yet to be tested.

@HEnquist
Contributor

HEnquist commented Dec 8, 2022

I think the speed benefit alone is too small to motivate a change. Anyone who cares about performance should be reusing the same buffers for every call, so the initialization is only done once. I would also guess that allocating the space for the buffer takes more time than clearing it. And compared to creating an FFT instance it's nothing.

@HEnquist
Contributor

HEnquist commented Dec 8, 2022

Is it allowed to cheat with unsafe?

    let mut buf: Vec<f32> = Vec::with_capacity(1000);
    unsafe {
        buf.set_len(1000);
    }

This should skip the clearing of the new vector. I'm also assuming that anyone who wants to squeeze the last tiny drops of performance from their code isn't afraid of a little unsafe :)

WalterSmuts added a commit to WalterSmuts/fftconvolve that referenced this issue Dec 9, 2022
Unfortunately this requires an extra Default trait bound which results
in requiring a major version change.

Practically it's not an issue since all elements that you'd want to
convolve would implement Default, but it's a bit annoying.

Default is required for initializing the scratch-buffer, which is
strictly not necessary but requires changes to `rustfft`. It's currently
pending this issue in rustfft:
ejmahler/RustFFT#105
@WalterSmuts
Contributor Author

Is it allowed to cheat with unsafe?

AFAIK the answer is no. The documentation provides two invariants that need to hold:

Safety

  • new_len must be less than or equal to capacity().
  • The elements at old_len..new_len must be initialized.

The first one is pretty obvious. If the allocator didn't allocate enough memory for the Vec, then accessing past the Vec will, in the best case, cause a segfault, and in the worst case access another allocation it doesn't have ownership of, at which point program execution can do just about anything, i.e. undefined behaviour. AFAICT you should be able to ensure this invariant holds by calling Vec::reserve before calling set_len.
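[Editor's note] A quick sketch of checking that first invariant with Vec::reserve (the stable std method for growing capacity); the second invariant is exactly what the next paragraph is about:

```rust
fn main() {
    let mut buf: Vec<f32> = Vec::new();
    // Vec::reserve guarantees capacity for at least this many more elements
    // (or panics/aborts on allocation failure), satisfying set_len's first
    // safety invariant. The second invariant, initialisation, still would
    // not hold for the f32 elements.
    buf.reserve(1000);
    assert!(buf.capacity() >= 1000);
    assert_eq!(buf.len(), 0); // reserve does not change the length
}
```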

The second item is more subtle. It's quite obvious that the elements won't be initialised, but why is that a problem? Well, not every byte pattern is valid for every type. Some types, like bool, have only two valid byte representations, and holding anything else is immediate UB. From the MaybeUninit docs:

Similarly, entirely uninitialized memory may have any content, while a bool must always be true or false. Hence, creating an uninitialized bool is undefined behavior:

use std::mem::{self, MaybeUninit};

let b: bool = unsafe { mem::uninitialized() }; // undefined behavior! ⚠️
// The equivalent code with `MaybeUninit<bool>`:
let b: bool = unsafe { MaybeUninit::uninit().assume_init() }; // undefined behavior! ⚠️

From what I can dig up, it applies to floats too: rust-lang/unsafe-code-guidelines#71 (comment)

@WalterSmuts
Contributor Author

Went down a bit of a rabbit hole to find out WHY these rules around UB are so strict. The above was enough to convince me that the spec requires all values to be valid, but I was still not sure why it uses such strict rules.

This blog post by Ralf Jung gave me a bit more insight in exactly how invalid byte representations of types can lead to UB.

I'm not yet convinced that an uninitialised scratch buffer definitely WILL lead to UB when actually executed, but the post was enough to show that there is subtle nuance to the subject of UB and that the pointy-hatted compiler folk have some good reasons for making things complicated.

@HEnquist
Contributor

The first one is pretty obvious. If the allocator didn't allocate enough memory for the Vec, then accessing past the Vec will, in the best case, cause a segfault, and in the worst case access another allocation it doesn't have ownership of, at which point program execution can do just about anything, i.e. undefined behaviour. AFAICT you should be able to ensure this invariant holds by calling Vec::reserve before calling set_len.

I used with_capacity, which should either allocate the desired capacity or panic. So this part should be fine.

The second item is more subtle. It's quite obvious that they won't be initialised, but why is that a problem? Well, the compiler doesn't know what type items we're working with. Some items like bool only has two valid byte representations and having anything else is immediate UB. From the MaybeUninit docs:

This is what I meant by cheating. RustFFT will never read from any location in the scratch or output buffer that it hasn't already written to. So there should never be a situation where the possibly-garbage content gets read.
The same would be true when using MaybeUninit: it would need to write to uninitialized locations, and would only read from initialized ones.
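[Editor's note] The write-before-read discipline described here can be made visible in the types; a sketch (not RustFFT code) where the only unsafe step is the final claim that every slot was written:

```rust
use std::mem::MaybeUninit;

// Sketch: the callee writes every slot before reading any of them,
// mirroring the discipline RustFFT follows with its scratch buffers.
fn fill_then_sum(scratch: &mut [MaybeUninit<f32>]) -> f32 {
    for (i, slot) in scratch.iter_mut().enumerate() {
        slot.write(i as f32); // initialise before any read
    }
    // Safety: every element was written in the loop above.
    scratch
        .iter()
        .map(|slot| unsafe { slot.assume_init_read() })
        .sum()
}

fn main() {
    let mut scratch = [MaybeUninit::<f32>::uninit(); 4];
    assert_eq!(fill_then_sum(&mut scratch), 6.0); // 0 + 1 + 2 + 3
}
```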

@WalterSmuts
Contributor Author

WalterSmuts commented Dec 10, 2022

Have a look at that blog post. It outlines situations where you don't necessarily need to read from uninitialised variables for UB to occur. The compiler can do optimisations that rely on the assumption that all types that aren't wrapped in MaybeUninit will be initialised properly. And if these assumptions aren't met, the optimisations will change the semantics of the program.

Perhaps there aren't any such optimisations today for numbers or arrays, but it's a hairy part of the compiler and it's difficult to say for sure. Either way, it's currently not part of the spec, so it could become an issue in the future.

@HEnquist
Contributor

Yes I read it.
The problem I see is that any reasonable implementation of your suggestion would essentially need to do the same, i.e. just assume the scratch and output buffers are initialised before starting to use them. Then my "cheat" should be just as good (or bad).
Doing it the right way becomes quite ugly: https://doc.rust-lang.org/std/mem/union.MaybeUninit.html#initializing-an-array-element-by-element
And it gets really awful if you try to do it in the simd code.

My main issue with this whole thing is that people who are really interested in performance will be really careful to always reuse their buffers to avoid allocations. Then this becomes a lot of extra complication in order to slightly speed up a one-time operation.

@WalterSmuts
Contributor Author

The problem I see is that any reasonable implementation of your suggestion would essentially need to do the same.

I think the crucial difference is that the rust compiler would emit different metadata to LLVM about the variable while it's wrapped in a MaybeUninit. LLVM would then not be free to apply some set of optimisations.

My main issue with his whole thing is that people who are really interested in performance will be really careful to always reuse their buffers to avoid allocations. Then this becomes a lot of extra complications in order to slightly speed up a one-time operation.

Sure, that's valid.

@RamType0

The important thing is that just assigning an uninitialized value to a variable of type T causes undefined behavior, even if we never read from it before assigning an initialized value.

In contrast, assigning an uninitialized value to a variable of type MaybeUninit<T> is valid, as long as we don't call assume_init before assigning an initialized value.

The solution using MaybeUninit<T> is valid by definition.
The solution using set_len happens to work with the CURRENT implementation, but could actually cause undefined behavior in the future.

Keeping the current API pushes people toward the set_len solution, and plants "undefined behavior time bombs" here and there.

@ejmahler
Owner

Is there a way to use MaybeUninit for the output of an out-of-place transform which doesn't require callers to use unsafe code to extract the result?

@WalterSmuts
Contributor Author

Is there a way to use MaybeUninit for the output of an out-of-place transform which doesn't require callers to use unsafe code to extract the result?

Yes, I showed an example of the API I'd propose in a previous message:

    fn process_outofplace_with_scratch_uninit(
        &self,
        input: &mut [Complex<T>],
        output: [MaybeUninit<Complex<T>>],
        scratch: &mut [MaybeUninit<Complex<T>>]
    ) -> [Complex<T>]

Creating the scratch buffer and output buffer would be as simple as:

// Statically sized
use std::mem::MaybeUninit;
let scratch: &mut [MaybeUninit<f32>] = &mut [MaybeUninit::uninit(); 100];

// Dynamically sized
let scratch_buffer_len = 100;
let scratch: Box<[MaybeUninit<f32>]> = vec![MaybeUninit::uninit(); scratch_buffer_len].into();

Similar code can be written for the output buffer. There's no need for the client to delve into unsafe. The logic of process_outofplace_with_scratch_uninit would need to call assume_init() (an unsafe method) somewhere, but that's in the library code, not for the client to worry about.

I've been feeling a bit unsatisfied about why these uninitialised values are UB. I started a thread in the rust language zulip forums to see what the rust folks say. My conclusion is that this is on the edge of the classification. "Officially" it's UB but in practice, it's likely to not exhibit any issues. There are no examples of optimisations folks can give that would break this case. MiniRust even currently considers it defined behaviour.

@ejmahler
Owner

I agree that creating the buffer can be done in entirely safe code by the caller.

What about reading the result? Will the caller have to call assume_init() themselves in order to make use of RustFFT's output? Or am I missing something?

@WalterSmuts
Contributor Author

Will the caller have to call assume_init() themselves in order to make use of RustFFT's output?

Nope. This logic can be inside the process_outofplace_with_scratch_uninit function.

Or am I missing something?

I think you may be missing that the output parameter of the process_outofplace_with_scratch_uninit method is consumed and returned as [Complex<T>]. This is necessary because we need the API to change the type of the output variable. The only way to do this is to consume the variable, transmute the type, and return a different type. The memory being pointed to is all the same though (modulo compiler optimisations).
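[Editor's note] In practice the consume-and-return shape needs an owned buffer, since a bare [Complex<T>] is unsized; a sketch with Box<[MaybeUninit<T>]>, using f32 elements for brevity:

```rust
use std::mem::MaybeUninit;

// Sketch: consume an owned uninit buffer and return the same allocation
// with the MaybeUninit wrapper stripped. No copy takes place.
// Safety: the caller must guarantee every element is initialised.
unsafe fn assume_init_boxed(buf: Box<[MaybeUninit<f32>]>) -> Box<[f32]> {
    // MaybeUninit<T> is #[repr(transparent)] over T, so the layouts match
    // and the pointer cast is sound once every element is initialised.
    Box::from_raw(Box::into_raw(buf) as *mut [f32])
}

fn main() {
    let mut buf: Box<[MaybeUninit<f32>]> = vec![MaybeUninit::uninit(); 3].into();
    for (i, slot) in buf.iter_mut().enumerate() {
        slot.write(i as f32);
    }
    // Safety: all elements were written above.
    let done = unsafe { assume_init_boxed(buf) };
    assert_eq!(&*done, &[0.0, 1.0, 2.0]);
}
```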

@ejmahler
Owner

Ah I see, I missed the return value.

@HEnquist
Contributor

If you then want to reuse the output buffer, what do you do then?

@ejmahler
Owner

ejmahler commented Dec 16, 2022

I guess you'd just let the returned output buffer go out of scope, which would release the borrow of the MaybeUninit version of the buffer, letting you reuse it again. I haven't tried it but it seems like the compiler would be perfectly happy with that.

To that end, I suppose the signature would have to be more specifically:

fn process_outofplace_with_scratch_uninit<'a>(
        &self,
        input: &'_ mut [Complex<T>],
        output: &'a mut [MaybeUninit<Complex<T>>],
        scratch: &'_ mut [MaybeUninit<Complex<T>>]
    ) -> &'a mut [Complex<T>]

Is this correct?

@WalterSmuts
Contributor Author

If you then want to reuse the output buffer, what do you do then?

You can just wrap the output buffer in the MaybeUninit again. Something like let buffer = MaybeUninit::new(output_from_previous_computation).

@WalterSmuts
Contributor Author

WalterSmuts commented Dec 16, 2022

I guess you'd just let the returned output buffer go out of scope, which would release the borrow of the MaybeUninit version of the buffer

Notice my suggestion doesn't borrow the buffer. It would "consume" the buffer. I'm not familiar enough with the borrowing semantics to know if your suggestion would work. I would consider it a different suggestion for the same API.

EDIT:

To that end, I suppose the signature would have to be more specifically:

fn process_outofplace_with_scratch_uninit<'a>(
        &self,
        input: &'_ mut [Complex<T>],
        output: &'a mut [MaybeUninit<Complex<T>>],
        scratch: &'_ mut [MaybeUninit<Complex<T>>]
    ) -> &'a mut [Complex<T>]

Is this correct?

I think this may work but you'll require another method on MaybeUninit:
https://doc.rust-lang.org/std/mem/union.MaybeUninit.html#method.assume_init_mut

@ejmahler
Owner

Looking at the API, i'm realizing that assume_init et al work on single elements.

To assume a slice of MaybeUninit are all initialized, you have to call slice_assume_init_mut which isn't stable yet. And there is no "consuming" version.
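[Editor's note] The unstable slice_assume_init_mut can be reproduced on stable Rust with a raw-pointer cast; this sketch also exercises the reuse pattern from the earlier comments, where dropping the returned borrow frees the buffer for another call:

```rust
use std::mem::MaybeUninit;

// Stable stand-in for the (then-unstable) slice_assume_init_mut.
// Safety: the caller must guarantee every element is initialised.
unsafe fn slice_assume_init_mut<T>(s: &mut [MaybeUninit<T>]) -> &mut [T] {
    // MaybeUninit<T> is #[repr(transparent)] over T, so the cast is sound.
    &mut *(s as *mut [MaybeUninit<T>] as *mut [T])
}

fn init_all(buf: &mut [MaybeUninit<f32>]) -> &mut [f32] {
    for (i, slot) in buf.iter_mut().enumerate() {
        slot.write(i as f32);
    }
    // Safety: every element was written in the loop above.
    unsafe { slice_assume_init_mut(buf) }
}

fn main() {
    let mut storage = [MaybeUninit::<f32>::uninit(); 4];
    {
        let out = init_all(&mut storage);
        assert_eq!(out[3], 3.0);
    } // the borrow ends here...
    let out_again = init_all(&mut storage); // ...so the buffer can be reused
    assert_eq!(out_again[0], 0.0);
}
```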

@HEnquist
Contributor

Can someone show an example of a project where implementing this would make a measurable improvement?

@WalterSmuts
Contributor Author

Not sure about specific applications, but I can get some rough numbers for my specific machine:

use std::mem::MaybeUninit;
use std::time::Instant;

fn main() {
    let start = Instant::now();
    let mut buffer: Box<[f32; 1_000_000_000]> = Box::new([0_f32; 1_000_000_000]);
    let allocation_duration = start.elapsed();
    println!(
        "Time elapsed allocating with zeroing is: {:?}",
        allocation_duration
    );

    let start = Instant::now();
    for val in buffer.iter_mut() {
        *val = 1.0;
    }
    let write_duration = start.elapsed();
    println!(
        "Time elapsed to write ones with zeroing is: {:?}",
        write_duration
    );
    println!(
        "Total time elapsed with zeroing is: {:?}",
        allocation_duration + write_duration
    );

    let start = Instant::now();
    #[allow(invalid_value)]
    let mut buffer: Box<[f32; 1_000_000_000]> =
        unsafe { Box::new(MaybeUninit::uninit().assume_init()) };
    let allocation_duration = start.elapsed();
    println!(
        "Time elapsed allocating without zeroing is: {:?}",
        allocation_duration
    );

    let start = Instant::now();
    for val in buffer.iter_mut() {
        *val = 1.0;
    }
    let write_duration = start.elapsed();
    println!(
        "Time elapsed to write ones without zeroing is: {:?}",
        write_duration
    );
    println!(
        "Total time elapsed without zeroing is: {:?}",
        allocation_duration + write_duration
    );
}

This results in the following output when run with --release optimisations:

Time elapsed allocating with zeroing is: 404.011991ms
Time elapsed to write ones with zeroing is: 261.576366ms
Total time elapsed with zeroing is: 665.588357ms
Time elapsed allocating without zeroing is: 13.056µs
Time elapsed to write ones without zeroing is: 486.652389ms
Total time elapsed without zeroing is: 486.665445ms

My interpretation is that the first allocation is also doing the page faulting. The second allocation is 13.056µs, i.e. next to no time, because there is almost nothing to do other than just reserving the memory. To get an accurate measurement you need to write to each byte so you can ensure the entire buffer is faulted in.

For this specific situation (my machine and working with a billion samples) I would save 35% of the time the application would otherwise have spent allocating. Which is on the order of a few hundreds of milliseconds. Hard to say if it's relevant for initialisation of some applications.

This percentage gets smaller as your buffer size decreases. For a million samples you get ~20% speedup and for 1000 samples you get ~8% speedup (although this is pretty susceptible to noise).

@HEnquist
Contributor

Here is a variation of the same, with an FFT instead to give a more realistic comparison.

use std::time::Instant;
use rustfft::{FftPlanner, num_complex::Complex};

const LEN: usize = 32*1024*1024;

fn main() {
    let mut data = vec![Complex::new(0.0, 0.0); LEN];
    let mut planner = FftPlanner::<f32>::new();
    let fft = planner.plan_fft_forward(LEN);
    let scratch_len = fft.get_outofplace_scratch_len();
    let start = Instant::now();
    let mut output: Vec<Complex<f32>> = vec![Complex::new(0.0, 0.0); LEN];
    let mut scratch: Vec<Complex<f32>> = vec![Complex::new(0.0, 0.0); scratch_len];
    let allocation_duration = start.elapsed();
    println!(
        "Time elapsed allocating with zeroing is: {:?}",
        allocation_duration
    );

    let start = Instant::now();
    fft.process_outofplace_with_scratch(&mut data, &mut output, &mut scratch);
    let write_duration = start.elapsed();
    println!(
        "Time elapsed to process with zeroing is: {:?}",
        write_duration
    );
    println!(
        "Total time elapsed with zeroing is: {:?}",
        allocation_duration + write_duration
    );

    let mut data = vec![Complex::new(0.0, 0.0); LEN];
    let start = Instant::now();
    let mut output: Vec<Complex<f32>> = Vec::with_capacity(LEN);
    let mut scratch: Vec<Complex<f32>> = Vec::with_capacity(scratch_len);
    unsafe { output.set_len(LEN)};
    unsafe { scratch.set_len(scratch_len)};
    let allocation_duration = start.elapsed();
    println!(
        "Time elapsed allocating without zeroing is: {:?}",
        allocation_duration
    );

    let start = Instant::now();
    fft.process_outofplace_with_scratch(&mut data, &mut output, &mut scratch);
    let process_duration = start.elapsed();
    println!(
        "Time elapsed to process without zeroing is: {:?}",
        process_duration
    );
    println!(
        "Total time elapsed without zeroing is: {:?}",
        allocation_duration + process_duration 
    );
}

I get quite a large variation of the results, but averaging the last five runs I get:
total with zeroing: 309 ms
total without zeroing: 318 ms

It's pretty consistent that it runs faster with zeroing.

@WalterSmuts
Contributor Author

I also initially got a couple of runs where the non-zeroing version is slower. Weird. Not really sure what's happening there. Could be noise. CPU heating up. Warming up some cache somewhere. Not sure.

I ran watch cargo run --release for a bit and consistently get ~10 ms slower for the zeroing version. Can't seem to trigger another case where the non-zeroing version is slower.

@ejmahler
Owner

ejmahler commented Dec 21, 2022

Not sure how relevant this is, but when I was doing all the benchmarking for rustfft's avx code, one important thing I found for consistency was disabling turbo boost. The easiest way I've found to do that is via the BIOS. Without turbo boost, your benchmarks will be slower across the board, but more consistent in exchange.

@WalterSmuts
Contributor Author

I tried out @ejmahler's suggestion and it doesn't seem to make much of a difference to consistency. But since I re-attempted the test I noticed I was running with the nightly compiler previously. Running with stable (out of date) I got the same results as @HEnquist. I suspect there were recent changes to the compiler that changed the optimisations applied to uninit data.

I downloaded the newest nightly (8dfb33954 2022-12-25) and stable (1.66.0) versions of rustc and both of them consistently run faster with the non-zero-ing version. @HEnquist I suspect if you updated your compiler you'd see the same results.
