
Measure execution speed #133

Closed
bjorn3 opened this issue Nov 2, 2018 · 31 comments
Labels
optimize-speed

Comments

@bjorn3
Owner

bjorn3 commented Nov 2, 2018

No description provided.

@bjorn3

This comment has been minimized.

@sunfishcode
Contributor

sunfishcode commented Nov 3, 2018

Currently in Cranelift the IR verifier is enabled by default, which can take a lot of time. Can you benchmark with the "enable_verifier" setting disabled?

@bjorn3
Owner Author

bjorn3 commented Nov 3, 2018

This is just execution speed.

@sunfishcode
Contributor

sunfishcode commented Nov 3, 2018

Ah, please update the issue title then :-). Also, you may want to try setting Cranelift's opt_level to best.

@bjorn3 bjorn3 changed the title Measure compilation speed Measure execution speed Nov 3, 2018
@bjorn3
Owner Author

bjorn3 commented Nov 3, 2018

Compilation speed is at a decent level already. Running hyperfine with opt_level set to best right now.

Edit: doesn't seem to change much: flags_builder.set("opt_level", "best").unwrap();
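For reference, the settings mentioned in this thread are all configured through the same flags builder. A minimal sketch of the configuration fragment (assuming the cranelift-codegen `settings` API of this era, where `set` comes from the `Configurable` trait):

```rust
use cranelift_codegen::settings::{self, Configurable};

let mut flags_builder = settings::builder();
// Skip the IR verifier outside of debugging sessions; it costs compile time.
flags_builder.set("enable_verifier", "false").unwrap();
// Ask Cranelift for its most optimized output.
flags_builder.set("opt_level", "best").unwrap();
let flags = settings::Flags::new(flags_builder);
```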

@bjorn3

This comment has been minimized.

@sunfishcode
Contributor

sunfishcode commented Nov 5, 2018

At a high level, it's not too surprising that Cranelift's execution speed on Rust would be in the ballpark of LLVM's O0 on Rust, because it's not doing any inlining. The rough short-term plan is to enable the MIR inliner to help with this.

There's probably a bunch of low-hanging fruit too, just making sure common Rust constructs are compiled well.

@lachlansneff

lachlansneff commented Nov 8, 2018

Are you multithreading compilation? Cranelift is inherently very good at parallel compilation.

@bjorn3
Owner Author

bjorn3 commented Nov 8, 2018

@lachlansneff No, rustc's TyCtxt is not thread safe (!Send + !Sync), and I believe Cranelift's Module isn't either (!Sync).

@sunfishcode
Contributor

sunfishcode commented Nov 8, 2018

That's true, cranelift-codegen can be run with multiple instances in parallel, but cranelift-module doesn't yet make use of that.

@bjorn3

This comment has been minimized.

@bjorn3
Owner Author

bjorn3 commented Nov 11, 2018

[Flamegraph for mod_bench_inline]

@sunfishcode
Contributor

sunfishcode commented Nov 11, 2018

I took a quick look at the code. Here are some notes:

- Some of the code will get better once Cranelift has more support for i8 and the associated workarounds are removed.
- Is -Zmir-opt-level=3 in use when building libcore? I'm seeing things like core::cmp::impls::<impl core::cmp::PartialOrd for u32>::lt not being inlined, which is the kind of thing we're really going to want to inline.
- If I'm reading this correctly, there's a small memmove in there, which the small memcpy/memmove/memset optimization should help with, once bytecodealliance/cranelift#586 is fixed.
- There's a codegen abort when I enable opt_level=best. I'll investigate that.

@bjorn3
Owner Author

bjorn3 commented Nov 11, 2018

Is -Zmir-opt-level=3 in use when building libcore?

Yes, the whole sysroot.

If I'm reading this correctly, there's a small memmove in there

I am currently using my own code for copying locals:

CValue::ByRef(from, _src_layout) => {
    let size = dst_layout.size.bytes() as i32;
    // FIXME emit_small_memcpy has a bug as of commit CraneStation/cranelift@b2281ed
    // fx.bcx.emit_small_memcpy(fx.module.target_config(), addr, from, size, layout.align.abi() as u8, src_layout.align.abi() as u8);
    let mut offset = 0;
    while size - offset >= 8 {
        let byte = fx
            .bcx
            .ins()
            .load(fx.pointer_type, MemFlags::new(), from, offset);
        fx.bcx.ins().store(MemFlags::new(), byte, addr, offset);
        offset += 8;
    }
    while size - offset >= 4 {
        let byte = fx.bcx.ins().load(types::I32, MemFlags::new(), from, offset);
        fx.bcx.ins().store(MemFlags::new(), byte, addr, offset);
        offset += 4;
    }
    while offset < size {
        let byte = fx.bcx.ins().load(types::I8, MemFlags::new(), from, offset);
        fx.bcx.ins().store(MemFlags::new(), byte, addr, offset);
        offset += 1;
    }
}

So that memmove comes from the copy_nonoverlapping intrinsic:

fx.bcx.call_memmove(fx.module.target_config(), dst, src, byte_amount);

Which likely came from core::mem::swap: https://github.com/rust-lang/rust/blob/b76ee83254ec0398da554f25c2168d917ba60f1c/src/libcore/iter/range.rs#L228

There's a codegen abort when I enable opt_level=best. I'll investigate that.

😭

@bjorn3

This comment has been minimized.

@lachlansneff

lachlansneff commented Nov 11, 2018

As a side note, it's interesting to see functions like index and len getting called when they should definitely be inlined.

@sunfishcode
Contributor

sunfishcode commented Nov 12, 2018

Another perf issue: https://github.com/CraneStation/cranelift/issues/597 .

@bjorn3
Owner Author

bjorn3 commented Nov 16, 2018

Now that bytecodealliance/cranelift#598 is merged, commit 8233ade enables opt_level=best for -Copt-level=3 (e.g. the sysroot and mod_bench_inline).

mod_bench_inline is now faster than mod_bench_llvm_0 🎉

Benchmark #1: ./target/out/mod_bench
  Time (mean ± σ):      7.048 s ±  0.120 s    [User: 7.041 s, System: 0.000 s]
  Range (min … max):    6.944 s …  7.360 s
 
Benchmark #2: ./target/out/mod_bench_inline
  Time (mean ± σ):      3.975 s ±  0.100 s    [User: 3.972 s, System: 0.000 s]
  Range (min … max):    3.830 s …  4.122 s
 
Benchmark #3: ./target/out/mod_bench_llvm_0
  Time (mean ± σ):      4.243 s ±  0.059 s    [User: 4.240 s, System: 0.000 s]
  Range (min … max):    4.168 s …  4.329 s
 
Benchmark #4: ./target/out/mod_bench_llvm_1
  Time (mean ± σ):      1.625 s ±  0.015 s    [User: 1.622 s, System: 0.001 s]
  Range (min … max):    1.607 s …  1.649 s
 
Benchmark #5: ./target/out/mod_bench_llvm_2
  Time (mean ± σ):     422.1 ms ±   3.0 ms    [User: 419.6 ms, System: 0.0 ms]
  Range (min … max):   419.2 ms … 429.1 ms
 
Benchmark #6: ./target/out/mod_bench_llvm_3
  Time (mean ± σ):     421.5 ms ±   3.1 ms    [User: 419.2 ms, System: 0.0 ms]
  Range (min … max):   419.2 ms … 428.8 ms
 
Summary
  './target/out/mod_bench_llvm_3' ran
    1.00 ± 0.01 times faster than './target/out/mod_bench_llvm_2'
    3.86 ± 0.05 times faster than './target/out/mod_bench_llvm_1'
    9.43 ± 0.25 times faster than './target/out/mod_bench_inline'
   10.07 ± 0.16 times faster than './target/out/mod_bench_llvm_0'
   16.72 ± 0.31 times faster than './target/out/mod_bench'

@lachlansneff

lachlansneff commented Nov 16, 2018

@sunfishcode Are there any obvious optimizations that we're missing here?

@bjorn3

This comment has been minimized.

bjorn3 added a commit that referenced this issue Nov 16, 2018
@lachlansneff

lachlansneff commented Nov 16, 2018

@bjorn3 It looks like that's using the debug version of the codegen backend. Shouldn't that be the release version to maximize compilation speed?

@sunfishcode
Contributor

sunfishcode commented Nov 16, 2018

Here's a summary of the ideas from above for how we can improve performance from here:

@bjorn3
Owner Author

bjorn3 commented Nov 16, 2018

It looks like that's using the debug version of the codegen backend.

Oops :) Benchmarking it in release mode atm.

This fix for the small memcpy/etc. optimization, and then update this code to make use of it.

And more importantly

// FIXME emit_small_memcpy has a bug as of commit CraneStation/cranelift@b2281ed
// fx.bcx.emit_small_memcpy(fx.module.target_config(), addr, from, size, layout.align.abi() as u8, src_layout.align.abi() as u8);

@bjorn3
Owner Author

bjorn3 commented Nov 16, 2018

Now with --release:

Benchmark #1: rustc -Zalways-encode-mir -Cpanic=abort -Zcodegen-backend=/home/bjorn/Documenten/rustc_codegen_cranelift/target/release/librustc_codegen_cranelift.so -L crate=target/out --out-dir target/out --sysroot ~/.xargo/HOST example/mod_bench.rs --crate-type bin -Zmir-opt-level=3 -Og --crate-name mod_bench_inline
  Time (mean ± σ):      86.0 ms ±   5.6 ms    [User: 57.3 ms, System: 20.2 ms]
  Range (min … max):    81.1 ms … 106.6 ms
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark #2: rustc example/mod_bench.rs --crate-type bin -Copt-level=0 -o target/out/mod_bench_llvm_0 -Cpanic=abort
  Time (mean ± σ):     115.0 ms ±   7.2 ms    [User: 107.4 ms, System: 19.9 ms]
  Range (min … max):   107.4 ms … 138.3 ms
 
Benchmark #3: rustc example/mod_bench.rs --crate-type bin -Copt-level=1 -o target/out/mod_bench_llvm_1 -Cpanic=abort
  Time (mean ± σ):     129.3 ms ±   7.0 ms    [User: 112.6 ms, System: 18.8 ms]
  Range (min … max):   122.4 ms … 151.2 ms
 
Benchmark #4: rustc example/mod_bench.rs --crate-type bin -Copt-level=2 -o target/out/mod_bench_llvm_2 -Cpanic=abort
  Time (mean ± σ):     102.7 ms ±   6.0 ms    [User: 88.4 ms, System: 16.8 ms]
  Range (min … max):    97.3 ms … 123.8 ms
 
Benchmark #5: rustc example/mod_bench.rs --crate-type bin -Copt-level=3 -o target/out/mod_bench_llvm_3 -Cpanic=abort
  Time (mean ± σ):     103.0 ms ±   6.5 ms    [User: 87.8 ms, System: 17.8 ms]
  Range (min … max):    97.4 ms … 125.8 ms
 
Summary
  'rustc -Zalways-encode-mir -Cpanic=abort -Zcodegen-backend=/home/bjorn/Documenten/rustc_codegen_cranelift/target/release/librustc_codegen_cranelift.so -L crate=target/out --out-dir target/out --sysroot ~/.xargo/HOST example/mod_bench.rs --crate-type bin -Zmir-opt-level=3 -Og --crate-name mod_bench_inline' ran
    1.19 ± 0.10 times faster than 'rustc example/mod_bench.rs --crate-type bin -Copt-level=2 -o target/out/mod_bench_llvm_2 -Cpanic=abort'
    1.20 ± 0.11 times faster than 'rustc example/mod_bench.rs --crate-type bin -Copt-level=3 -o target/out/mod_bench_llvm_3 -Cpanic=abort'
    1.34 ± 0.12 times faster than 'rustc example/mod_bench.rs --crate-type bin -Copt-level=0 -o target/out/mod_bench_llvm_0 -Cpanic=abort'
    1.50 ± 0.13 times faster than 'rustc example/mod_bench.rs --crate-type bin -Copt-level=1 -o target/out/mod_bench_llvm_1 -Cpanic=abort'

@lachlansneff

lachlansneff commented Nov 16, 2018

Yay, we are now technically a faster debug backend for rustc! 😀

There are a couple of compile-time optimizations in the pipeline that should hopefully improve this.

@bjorn3
Owner Author

bjorn3 commented Nov 16, 2018

Yes, at least on this small benchmark.

@bstrie
Contributor

bstrie commented Nov 16, 2018

To help inform us as to how excited we ought to be, is there a document somewhere describing the path that would need to be taken to get Cranelift upstreamed into rustc for use with debug builds? As far as we random onlookers know, it could be anywhere from "oh, it's basically done, we just need to flip a switch" to "years and years away, don't hold your breath". :)

@bjorn3
Owner Author

bjorn3 commented Nov 16, 2018

is there a document somewhere describing the path that would need to be taken to get Cranelift upstreamed into rustc for use with debug builds?

No, getting this even close to upstreaming is blocked on at least rust-lang/rust#55627 and supporting libstd (#146). I haven't spoken to any Rust devs about this; I want to get an MVP first before making this more widely known.

"oh, it's basically done, we just need to flip a switch"

This is not the case.

"years and years away, don't hold your breath"

I hope not :)

@bjorn3
Owner Author

bjorn3 commented Nov 17, 2018

Minimized some outdated benchmark results, because they are long.

@bjorn3
Owner Author

bjorn3 commented Dec 28, 2018

tcx.encode_metadata and the MIR borrow checker will need to be optimized. 😄 While compiling core, they each take more time than cg_clif itself does in release mode.

[speedscope trace of a std release-mode compile]

Edit: just realized that I was profiling only the last part of the compilation.

@bjorn3
Owner Author

bjorn3 commented Mar 17, 2021

I may open more focused issues in the future.

@bjorn3 bjorn3 closed this as completed Mar 17, 2021