
Optimize optimizer eliminate stage #3732

Merged: 26 commits merged from the aphs-optimize-optimizer-4 branch into emscripten-core:incoming on Sep 18, 2015

Conversation

@aidanhs (Contributor) commented Sep 2, 2015

These are just speed and style changes - there should be no output difference at all.

Using the .bc file from #3718, before:

# EMCC_DEBUG=1 emcc -O3 -o out.js slow/libemscripten_worker.bc 2>&1 | grep 'emcc step\|total time'
DEBUG    root: emcc step "parse arguments and setup" took 0.01 seconds
DEBUG    root: emcc step "bitcodeize inputs" took 0.00 seconds
DEBUG    root: emcc step "process inputs" took 0.00 seconds
DEBUG    root: emcc step "calculate system libraries" took 0.43 seconds
DEBUG    root: emcc step "link" took 0.63 seconds
DEBUG    root: emcc step "post-link" took 10.19 seconds
DEBUG    root: emcc step "emscript (llvm=>js)" took 8.62 seconds
DEBUG    root: emcc step "source transforms" took 0.34 seconds
DEBUG    root: emcc step "js opts" took 88.50 seconds
DEBUG    root: emcc step "final emitting" took 0.04 seconds
DEBUG    root: total time: 108.75 seconds

After:

# EMCC_DEBUG=1 emcc -O3 -o out.js slow/libemscripten_worker.bc 2>&1 | grep 'emcc step\|total time'
DEBUG    root: emcc step "parse arguments and setup" took 0.01 seconds
DEBUG    root: emcc step "bitcodeize inputs" took 0.00 seconds
DEBUG    root: emcc step "process inputs" took 0.00 seconds
DEBUG    root: emcc step "calculate system libraries" took 0.43 seconds
DEBUG    root: emcc step "link" took 0.63 seconds
DEBUG    root: emcc step "post-link" took 10.35 seconds
DEBUG    root: emcc step "emscript (llvm=>js)" took 9.11 seconds
DEBUG    root: emcc step "source transforms" took 0.78 seconds
DEBUG    root: emcc step "js opts" took 67.66 seconds
DEBUG    root: emcc step "final emitting" took 0.05 seconds
DEBUG    root: total time: 89.01 seconds

So a 20-25% optimisation speedup and a 15-20% overall speedup. The optimizer tests pass; I've not yet run any others, but will do so now.

I've got a few other optimisations, but want to get these and the switch optimisation (#3733) fix done first.

#ifdef PROFILING
tstmtscan += clock() - start;
start = clock();
#endif
Review comment from a contributor:

Indentation looks off here and on the block above.

@aidanhs force-pushed the aphs-optimize-optimizer-4 branch 2 times, most recently from 4f73ab0 to 31e528a (September 2, 2015 14:15)
@aidanhs (Contributor, Author) commented Sep 2, 2015

All core asm2 tests succeed.

@kripken (Member) commented Sep 2, 2015

Reading the code, this all looks good (and I like the refactoring), and those numbers sound great. However, I tested on another codebase (a large game engine) and I actually see a slowdown: 7.37 seconds to 8.27.

I also tested on Poppler from the test suite, with EMCC_CORES=1 to make it slow enough to be noticeable, and I get 2.54 to 2.60 seconds, so a tiny slowdown, probably just noise.

This makes me think we should test this more, it might be that it helps the original testcase that motivated this, but harms others.

@aidanhs (Contributor, Author) commented Sep 3, 2015

Is the game engine available for me to compile? Possibly as a .bc? Big slow testcases are good. I'll take a look at poppler as well and see if I can massage it into something usable.

I'm kinda surprised it's slower, but on reflection I can see that the changes I made could punish lots of small functions - I wonder if that's what's happening. There is a fairly invasive change I was considering, but I delayed it to see what the thoughts were so far. Let me work on that and see if it can deliver any improvement.

@kripken (Member) commented Sep 3, 2015

The engine isn't open, even as a .bc. But I'll test locally tomorrow with each patch separately, to at least try to narrow down the issue.

I'm hoping though that this isn't specific to unity. Maybe poppler or bananabread (I can provide a bc for that) with -s LINKABLE=1 (to disable dead code elimination and leave a lot more stuff) might also show something.

@aidanhs (Contributor, Author) commented Sep 3, 2015

I suspect you'll find it's the final commit, 31e528a. I should've really gone through the benchmarks before; I'll try to do so tomorrow.

@kripken (Member) commented Sep 3, 2015

Yes, it does look like the last commit ("Avoid iterating over irrelevant tracked variables") is where the unity slowdown comes from.

@aidanhs force-pushed the aphs-optimize-optimizer-4 branch 2 times, most recently from 7a28638 to 0edde45 (September 4, 2015 23:44)
@aidanhs (Contributor, Author) commented Sep 4, 2015

Can you try the latest branch against unity?

@aidanhs (Contributor, Author) commented Sep 4, 2015

Sorry, that posted before I was ready. I've got rid of the problematic commit and there's still a speedup: the lambdas test run goes from 42s to 18s. There's even a very tiny sqlite speedup (~3%).

@aidanhs force-pushed the aphs-optimize-optimizer-4 branch 7 times, most recently from 6eb10bc to 46b372f (September 7, 2015 18:59)
@aidanhs (Contributor, Author) commented Sep 7, 2015

I've added a couple of new commits after managing to grab the unity .bc file from a unity install and I'm mostly happy with where this is up to.

I benchmarked the optimizer run directly against an arbitrary unity js chunk (14M), the sqlite testcase (14M) and the lambdas testcase (22M). Each measurement is for all O3 optimisation passes, best of 5 runs, user time as reported by the time bash builtin:

Case      Before branch   After branch
Unity     27.431s         26.247s
Sqlite    1m20.520s       1m18.458s
Lambdas   1m19.287s       52.067s

It's worth taking a look at "Avoid array bounds checking for traverse* functions" - it's theoretically less safe because Ref[] does bounds checks, but I feel that these functions are small enough to be verifiable.
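For context, the shape of that trade-off looks roughly like the sketch below; the Node type and method names are stand-ins I've made up, not the optimizer's actual Ref/Value implementation:

    #include <cstddef>
    #include <vector>

    // Stand-in for an AST node wrapper; not the optimizer's real types.
    // Checked access pays for a range check on every child lookup; unchecked
    // access relies on the small traverse* functions already knowing the shape
    // of the nodes they walk.
    struct Node {
      std::vector<Node*> children;

      Node* checkedChild(std::size_t i) const {   // bounds-checked, throws on a bad index
        return children.at(i);
      }
      Node* uncheckedChild(std::size_t i) const { // no bounds check, as in the commit
        return children.data()[i];
      }
    };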

@kripken (Member) commented Sep 8, 2015

I still see a slowdown here, on unity. When I set EMCC_CORES=1 to reduce noise, I get 50.88 and 50.99 seconds for the js opts stage with EMCC_DEBUG=1 before this patch, and 51.12 and 51.20 seconds after this patch. It's around a 0.5% slowdown, so not large, but still in the wrong direction.

How are you measuring your times?

@aidanhs (Contributor, Author) commented Sep 8, 2015

That's very interesting. My measurement was taken by pausing the build process during js opts, taking one of the chunks out of there and running the optimizer directly against it (optimizer $file asm eliminate simplifyExpressions simplifyIfs registerizeHarder minifyLocals asmLastOpts last minifyWhitespace).

When I do a full compilation using EMCC_CORES=1 EMCC_DEBUG=1 emcc -s LINKABLE=1 -o /tmp/out.js -O2 slow/UnityNative.bc (LINKABLE=1 to avoid dropping functions) I still get the speedups - from 243.82 and 243.83 seconds down to 232.21 and 233.13 seconds (the same order of improvement as in my table).

Assuming you're on Linux, what distro and gcc version are you using? I've got Ubuntu 14.04 and the default compiler, gcc 4.8.4 - I wonder if a different compiler version is doing something differently.

@aidanhs (Contributor, Author) commented Sep 8, 2015

Ignore me, I've done something very stupid, let me look again...

@aidanhs (Contributor, Author) commented Sep 9, 2015

Summary of changes:

  • add an explicit reminder (for people like me) to enable release mode when cmaking the optimizer
  • use a name->vec lookup for dependencies rather than name->set, as vecs are faster (a rough sketch of the idea is below)
  • new commit to clean up ref construction

Now that I've fixed my issues and tweaked the dep lookup, the unity js opts phase goes from about 54.1s down to 53.3s on a single core. other.test_js_opts passes.
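To illustrate the dependency-lookup change mentioned above, a minimal sketch with made-up container names rather than the optimizer's real ones:

    #include <string>
    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    // Before: dependents of each name tracked in a hash set.
    std::unordered_map<std::string, std::unordered_set<std::string>> depsAsSet;

    // After: a plain vector per name. Building and iterating a small vector is
    // cheaper than a hash set, at the cost of the caller avoiding (or tolerating)
    // duplicate entries.
    std::unordered_map<std::string, std::vector<std::string>> depsAsVec;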

@kripken (Member) commented Sep 9, 2015

I still see qualitatively different numbers here, no change or a tiny slowdown on unity, tiny slowdown on poppler, no change on bullet.

For poppler, for example, I run EMCC_CORES=1 EMCC_DEBUG=1 ./emcc -O2 poppler.bc -s LINKABLE=1 and I get 7.62 7.60 7.61 before, 7.67 7.68 7.67 after. Very small slowdown, but consistent.

Maybe a different cpu, or a different local compiler and stdlib, accounts for us seeing different things. But if so, then it suggests the changes here are going to vary across machines.

If we can't figure this out, it might make sense to still merge this or parts of this, as the refactorings are nice. But perhaps we should only take the refactorings for clarity, and not for speed?

@aidanhs (Contributor, Author) commented Sep 9, 2015

No no, it's not good enough to slow things down! I will rummage around for some low hanging fruit.

@kripken (Member) commented Sep 9, 2015

Another thought, there might be some bigger things to optimize in that pass. If I recall correctly, around the comment "Look for statements, including while-switch pattern" we have the main loop, which traverses the whole tree and calls scan. And scan does its own traversal on the entire tree shown to it. I don't remember all the details, but I wonder if we couldn't stop those traversals in some cases, and avoid a big chunk of O(N^2) work. But, this is tricky stuff.
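Roughly the shape being described, as a hypothetical sketch (none of these names are the eliminate pass's real ones): the outer loop visits every statement, and scan re-walks each statement's entire subtree, so total work can grow quadratically with tree size unless the inner traversal can bail out early.

    #include <cstddef>
    #include <vector>

    struct Stmt { std::vector<Stmt> children; };

    static void scan(const Stmt& s, std::size_t& work) {
      ++work;                              // per-node bookkeeping stands in for the real scan
      for (const Stmt& c : s.children) scan(c, work);
    }

    static void eliminatePass(const std::vector<Stmt>& stmts, std::size_t& work) {
      for (const Stmt& s : stmts) {        // main loop over statements
        scan(s, work);                     // full traversal of the subtree, every iteration
        // A possible improvement, as suggested above, would be to stop scan early
        // once it can no longer change the outcome for this statement.
      }
    }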

Commit messages from the new commits on the branch:

 - !isSpace in Frag() is not necessary because that case will fall
   through and abort anyway
 - any successful switch case will assign str, no need to recheck
 - there is only one callsite of parseAfterIdent, move the skipspace
   inside the call (and therefore avoid calling skipspace twice
   before expressions)
 - no need for two checks for 0 bytes
We know that there will be no assignment part of var nodes because
normalisation has either removed them or aborted because the var node
needs fixing.
@aidanhs (Contributor, Author) commented Sep 17, 2015

I've pushed a bunch of changes. I opted not to go for the harder idea, and instead pursued the low-hanging fruit. There are some speedups from

  • Reserving space in vectors on creation, rather than relying on pushing to reallocate ("Reserve space on vector creation", "Also reserve space when creation nodes during runtime").
  • Allocating vectors in an arena, like Values ("Allocate vecs in arena").
  • Improving memory locality in traverse* by sticking values in locals ("Caching values in locals is faster than hitting memory", "Cache array size, use pointer arithmetic to bypass some indirection").

Unfortunately, the final item makes the traverse* functions really ugly.
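As a rough sketch of the pattern behind those last two commits (the types and function names below are invented for illustration, not taken from the optimizer):

    #include <cstddef>
    #include <vector>

    struct Node { std::vector<Node*> children; };

    // Before: the size and each element are re-read through the vector on every iteration.
    inline void visitChildrenSlow(Node* node, void (*visit)(Node*)) {
      for (std::size_t i = 0; i < node->children.size(); i++) {
        visit(node->children[i]);
      }
    }

    // After: cache the raw pointer and the size in locals once, then index directly,
    // trading a little clarity for fewer loads and less indirection in the hot loop.
    inline void visitChildrenFast(Node* node, void (*visit)(Node*)) {
      Node** kids = node->children.data();
      std::size_t num = node->children.size();
      for (std::size_t i = 0; i < num; i++) {
        visit(kids[i]);
      }
    }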

However, the rest of the commits are generally just refactoring. You may wish to examine the following two with suspicion:

  • "Don't pass optional blocks as arguments" - removes optional parameters, in exchange for needing a function to transfer block contents from one node to another. I prefer this way because it makes it very clear that the functions are always going to create a block.
  • "Don't perform unnecessary handling of VAR nodes" ⚠️ - the commit message provides justification, but was it deliberate that processing would continue in release mode (i.e. when the assert was removed)? I figured that requiring valid asm was a small price to pay for the code simplification.

The rest of the refactorings should be easily verifiable.

I see a ~10% speedup in the optimisation of unity (54.5s to 48.5s with EMCC_CORES=1 EMCC_DEBUG=1 emcc -O2 UnityNative.bc -s LINKABLE=1), sqlite (6.45s to 5.87s with asm2.test_sqlite) and poppler (4.97s to 4.48s with asm2.test_poppler).

I've run all asm2 tests (+other.test_{js,native}_optimizer) with assertions enabled on the optimizer and they seem fine (anything that fails also fails on incoming for me).

@kripken (Member) commented Sep 18, 2015

Confirmed, I see an 8% speedup. Nice! Reviewing now.

@kripken (Member) commented Sep 18, 2015

Please add any changes after this point as followup commits, so I don't need to re-read the ones I am going through now.

@kripken (Member) commented Sep 18, 2015

That's it for my comments. Excellent work!

@aidanhs (Contributor, Author) commented Sep 18, 2015

Made review changes.

@kripken merged commit d2d975f into emscripten-core:incoming on Sep 18, 2015
@kripken (Member) commented Sep 18, 2015

Merged, thanks!

@aidanhs deleted the aphs-optimize-optimizer-4 branch on September 18, 2015 23:13