
Add a new JIT tier which is tightly coupled to the AST Interpreter #651

Merged 2 commits into pyston:master on Jul 6, 2015

Conversation

@undingen (Contributor)

This JIT is tightly coupled to the ASTInterpreter: at every CFGBlock* entry/exit
one can switch between interpreting and directly executing the generated code without having
to do any translation.
Generating the code is very fast compared to the LLVM tier, but the generated code is not as fast
as code generated by the higher LLVM tiers.
Still, because the JITed code can use runtime ICs, avoids a lot of interpretation overhead, and
stores CFGBlock-local symbols in register/stack slots, it's much faster than the interpreter.
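The per-block hand-off described here can be sketched roughly as follows. All names (Block, interpretBlock, runFunction) are illustrative stand-ins, not Pyston's actual API:

```cpp
#include <cassert>

// Sketch of the per-block hand-off described above. Every basic block
// records a pointer to its jitted entry (if any); jitted code returns the
// successor block, so execution can switch between the interpreter and
// generated code at any block boundary without translating state.
struct Block;
typedef Block* (*JittedEntry)();  // jitted code returns the next block

struct Block {
    JittedEntry code = nullptr;  // filled in once the baseline JIT ran
    Block* next = nullptr;       // successor the interpreter would pick
    int visits = 0;
};

// Stand-in for the AST interpreter executing one basic block.
Block* interpretBlock(Block* b) {
    ++b->visits;
    return b->next;
}

// Runs a function block by block, preferring jitted code when available;
// returns how many blocks were interpreted.
int runFunction(Block* entry) {
    int interpreted = 0;
    for (Block* b = entry; b != nullptr;) {
        if (b->code) {
            b = b->code();          // directly execute generated code
        } else {
            ++interpreted;
            b = interpretBlock(b);  // fall back to interpretation
        }
    }
    return interpreted;
}
```

Because the unit of switching is a whole basic block, no mid-block state translation is ever needed; the interpreter and the jitted code only meet at block boundaries.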

Current perf numbers:

first run:
              pyston (calibration)                      :    1.0s base1: 1.0 (+1.5%)
              pyston django_template.py                 :    5.2s base1: 6.7 (-23.0%)
              pyston pyxl_bench.py                      :    4.6s base1: 5.0 (-8.0%)
              pyston sqlalchemy_imperative2.py          :    5.7s base1: 6.8 (-14.9%)
              pyston django_migrate.py                  :    1.9s base1: 3.1 (-38.4%)
              pyston virtualenv_bench.py                :    6.2s base1: 8.1 (-22.9%)
              pyston interp2.py                         :    3.7s base1: 4.1 (-9.9%)
              pyston raytrace.py                        :    5.8s base1: 6.0 (-3.5%)
              pyston nbody.py                           :    7.3s base1: 7.0 (+4.1%)
              pyston fannkuch.py                        :    6.3s base1: 6.4 (-1.5%)
              pyston chaos.py                           :   17.5s base1: 18.1 (-3.6%)
              pyston fasta.py                           :    4.5s base1: 4.3 (+5.1%)
              pyston pidigits.py                        :    5.8s base1: 5.3 (+8.7%)
              pyston richards.py                        :    1.9s base1: 1.7 (+10.2%)
              pyston deltablue.py                       :    1.4s base1: 1.8 (-20.4%)
              pyston (geomean-10eb)                     :    4.6s base1: 5.1 (-9.5%)

best of 3 runs:

              pyston (calibration)                      :    1.0s base3: 1.0 (-0.5%)
              pyston django_template.py                 :    4.4s base3: 4.6 (-5.4%)
              pyston pyxl_bench.py                      :    3.9s base3: 3.8 (+2.5%)
              pyston sqlalchemy_imperative2.py          :    4.9s base3: 5.1 (-5.1%)
              pyston django_migrate.py                  :    1.7s base3: 1.7 (+1.7%)
              pyston virtualenv_bench.py                :    5.0s base3: 5.2 (-2.4%)
              pyston interp2.py                         :    3.6s base3: 4.1 (-12.4%)
              pyston raytrace.py                        :    5.5s base3: 5.4 (+1.4%)
              pyston nbody.py                           :    7.0s base3: 7.2 (-2.6%)
              pyston fannkuch.py                        :    6.1s base3: 6.3 (-2.0%)
              pyston chaos.py                           :   16.8s base3: 17.5 (-4.2%)
              pyston fasta.py                           :    4.4s base3: 4.1 (+5.8%)
              pyston pidigits.py                        :    5.7s base3: 5.3 (+8.2%)
              pyston richards.py                        :    1.7s base3: 1.5 (+11.3%)
              pyston deltablue.py                       :    1.2s base3: 1.4 (-12.9%)
              pyston (geomean-10eb)                     :    4.2s base3: 4.3 (-1.4%)


Because of the increased malloc traffic, this PR should benefit a lot from jemalloc...

@undingen (Contributor Author)

with jemalloc:
best of 3 runs:

              pyston (calibration)                      :    1.0s base3_je: 1.0 (+0.3%)
              pyston django_template.py                 :    3.9s base3_je: 4.3 (-9.5%)
              pyston pyxl_bench.py                      :    3.6s base3_je: 3.6 (-1.2%)
              pyston sqlalchemy_imperative2.py          :    4.2s base3_je: 4.6 (-7.8%)
              pyston django_migrate.py                  :    1.4s base3_je: 1.6 (-8.3%)
              pyston virtualenv_bench.py                :    4.1s base3_je: 4.7 (-12.7%)
              pyston interp2.py                         :    3.6s base3_je: 4.1 (-12.2%)
              pyston raytrace.py                        :    5.5s base3_je: 5.6 (-1.9%)
              pyston nbody.py                           :    6.9s base3_je: 6.9 (-1.2%)
              pyston fannkuch.py                        :    6.1s base3_je: 6.3 (-2.4%)
              pyston chaos.py                           :   16.6s base3_je: 17.2 (-4.0%)
              pyston fasta.py                           :    4.4s base3_je: 4.3 (+1.4%)
              pyston pidigits.py                        :    5.7s base3_je: 5.2 (+9.2%)
              pyston richards.py                        :    1.6s base3_je: 1.5 (+11.5%)
              pyston deltablue.py                       :    1.1s base3_je: 1.3 (-15.9%)
              pyston (geomean-10eb)                     :    3.9s base3_je: 4.1 (-4.2%)

@undingen (Contributor Author)

with jemalloc:
first run:

              pyston (calibration)                      :    1.0s base1_je: 1.0 (+0.1%)
              pyston django_template.py                 :    4.7s base1_je: 6.4 (-27.0%)
              pyston pyxl_bench.py                      :    4.4s base1_je: 4.9 (-10.8%)
              pyston sqlalchemy_imperative2.py          :    5.0s base1_je: 6.2 (-19.1%)
              pyston django_migrate.py                  :    1.6s base1_je: 2.9 (-45.3%)
              pyston virtualenv_bench.py                :    4.7s base1_je: 7.7 (-39.1%)
              pyston interp2.py                         :    3.6s base1_je: 4.1 (-12.4%)
              pyston raytrace.py                        :    5.7s base1_je: 5.9 (-3.8%)
              pyston nbody.py                           :    7.2s base1_je: 7.1 (+1.5%)
              pyston fannkuch.py                        :    6.2s base1_je: 6.4 (-3.3%)
              pyston chaos.py                           :   17.0s base1_je: 17.6 (-3.6%)
              pyston fasta.py                           :    4.6s base1_je: 4.4 (+5.3%)
              pyston pidigits.py                        :    5.7s base1_je: 5.2 (+9.5%)
              pyston richards.py                        :    1.7s base1_je: 1.6 (+6.6%)
              pyston deltablue.py                       :    1.3s base1_je: 1.7 (-20.4%)
              pyston (geomean-10eb)                     :    4.3s base1_je: 4.9 (-13.2%)

@undingen undingen force-pushed the inc_jit_rewriter2 branch 3 times, most recently from 4e5d195 to 359c6e2 Compare June 29, 2015 20:36
@undingen (Contributor Author)

I tried to reduce the amount of GC-allocated memory (to decrease the number of collections) by creating a wrapper around the ASTInterpreter class that just stores a pointer to the ASTInterpreter, so we only have to GC-allocate sizeof(Box) + sizeof(ASTInterpreter*), but it didn't improve performance. (Even though a large portion of the GC-allocated memory is ASTInterpreter instances.)

I also tried to align the jump targets to 16-byte addresses by emitting multi-byte NOP sequences, but this also didn't help. (It would have surprised me if it had helped; I don't think there is much to gain in this area.)
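For reference, this kind of padding is usually built from the standard Intel-recommended multi-byte NOP encodings (lengths 1 to 9 bytes). A sketch with hypothetical helper names, showing how an emitter might pad a code buffer to a 16-byte boundary:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard multi-byte NOP encodings (Intel's recommended sequences,
// lengths 1..9). A few long NOPs decode faster than a run of 0x90s.
static const std::vector<std::vector<uint8_t>> kNops = {
    { 0x90 },
    { 0x66, 0x90 },
    { 0x0F, 0x1F, 0x00 },
    { 0x0F, 0x1F, 0x40, 0x00 },
    { 0x0F, 0x1F, 0x44, 0x00, 0x00 },
    { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 },
    { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 },
    { 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
};

// Append NOPs until buf.size() is a multiple of `align` (hypothetical
// helper; the real emitter writes into its own code buffer).
void padToAlignment(std::vector<uint8_t>& buf, size_t align) {
    size_t pad = (align - buf.size() % align) % align;
    while (pad > 0) {
        size_t chunk = pad < kNops.size() ? pad : kNops.size();
        const std::vector<uint8_t>& nop = kNops[chunk - 1];
        buf.insert(buf.end(), nop.begin(), nop.end());
        pad -= chunk;
    }
}
```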

@undingen undingen force-pushed the inc_jit_rewriter2 branch 3 times, most recently from 279bca1 to fa593f6 Compare June 30, 2015 22:00
@@ -790,6 +790,8 @@ void Assembler::cmp(Indirect mem, Register reg) {
}

void Assembler::lea(Indirect mem, Register reg) {
RELEASE_ASSERT(mem.base != RSP, "We have to generate the SIB byte...");
Collaborator

I think we also need a SIB byte for R12.
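Background on why both registers need it: in the x86-64 ModRM byte, a base register whose low three bits are 100b cannot be encoded directly in the r/m field, because that value signals "a SIB byte follows". RSP (4) and R12 (12) share those low bits, which is what the assert and the review comment above are about. A minimal predicate:

```cpp
#include <cassert>

// In x86-64 ModRM encoding, r/m = 0b100 does not name a base register;
// it signals that a SIB byte follows. The REX.B bit extends the register
// number, but only the low three bits appear in the ModRM byte, so both
// RSP (4) and R12 (12) hit this case.
bool needsSIB(int baseRegNum) {
    return (baseRegNum & 0x7) == 0x4;  // true for RSP and R12
}
```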

@undingen (Contributor Author)

The main issues I see currently:

  • ugly additional members inside CFGBlock*
  • memory won't get freed
  • JitFragment and JitCodeBlock are not the best names; "baseline JIT" is also not a great name for such a simple JIT tier
  • integration with the IC/rewriter classes feels a little messy...


class JitFragment;

// Manages a fixed size memory block which stores JITed code.
Collaborator

Minor request -- I think it'd be nice to start with a little bit more detail here about the overall baseline-JIT strategy and how it relates to the interpreter. Just a quick high-level overview of what the classes in this file are for, how they get used, what our strategy is for going back and forth between the interpreter/jitted code.

@kmod (Collaborator)

kmod commented Jun 30, 2015

I'll read some more of the code in a sec, but overall my feedback is: yeah there's some stuff that is a little bit quick-and-dirty, but I think that's ok for now. I'd say, let's just be good about leaving notes on the things that we think could be better, then work on getting this in and iterating. I'm ok with this area of the code being a little bit rough to start since it feels like we'll be experimenting with it a lot.

a.push(assembler::RBP);
a.mov(assembler::RSP, assembler::RBP);

static_assert(scratch_size % 16 == 0, "stack alignment code depends on this");
Collaborator

I think we need to make sure that the eh_frame we emit matches the stack size that we ask for -- I think the eh_frame_template hardcodes a certain amount of scratch (48 bytes I think).

Contributor Author

That's why I chose to use frame pointers: IMHO, if I emit the EH info for a function with frame pointers, it doesn't contain the number of bytes by which the stack has to be adjusted, just that the stack pointer has to be restored by copying the base pointer.
At least that's what I thought was happening, based on comparing different EH table outputs.

Collaborator

oh, yeah you're right :)

@undingen undingen changed the title WIP: Add a new JIT tier which is tightly coupled to the AST Interpreter Add a new JIT tier which is tightly coupled to the AST Interpreter Jul 1, 2015
@undingen (Contributor Author)

undingen commented Jul 1, 2015

I addressed all your immediate comments, put some of the non-JIT changes into a separate pull request (#664), and rebased.
I didn't squash it yet in order to make reviewing easier.

sorted_symbol_table[source_info->getInternedStrings().get(PASSED_GENERATOR_NAME)] = generator;
// TODO: we will immediately want the liveness info again in the jit, we should pass
// it through.
std::unique_ptr<LivenessAnalysis> liveness = computeLivenessInfo(source_info->cfg);
Collaborator

Could this be getLiveness()? I think the call to computeLivenessInfo() in irgen.cpp could also be changed to access the cached version.
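The caching suggested here amounts to a lazily initialized member: compute the analysis once per CFG and hand out the cached instance afterwards. A rough sketch; the class shapes and the lazy-init detail are simplified stand-ins for the real SourceInfo:

```cpp
#include <cassert>
#include <memory>

// Simplified stand-ins for the real CFG/LivenessAnalysis types.
struct CFG {};

struct LivenessAnalysis {
    explicit LivenessAnalysis(CFG* cfg) { (void)cfg; }
};

struct SourceInfoSketch {
    CFG* cfg = nullptr;
    std::unique_ptr<LivenessAnalysis> liveness;

    // Compute on first use, then return the cached analysis.
    LivenessAnalysis* getLiveness() {
        if (!liveness)
            liveness.reset(new LivenessAnalysis(cfg));  // computed once
        return liveness.get();
    }
};
```

With this shape, both the interpreter/baseline JIT and irgen.cpp can call getLiveness() and share one computation per CFG.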

@undingen (Contributor Author)

undingen commented Jul 1, 2015

Addressed your comments in the last commit.

@kmod (Collaborator)

kmod commented Jul 1, 2015

I think the issue with unique_ptrs and forward-declared types is that the compiler will try to generate a default destructor (for the containing class), which won't work. You can get around this by declaring a destructor (i.e. `~SourceInfo();`); the fun part is you might not have to define it, since it probably doesn't get called anywhere :P
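A minimal illustration of the idiom kmod describes; Impl here is a hypothetical stand-in for the forward-declared member type:

```cpp
#include <cassert>
#include <memory>

// "Header" side: Impl is only forward-declared. Without a declared
// destructor, the compiler would generate ~SourceInfo() inline, which
// instantiates std::unique_ptr<Impl>'s deleter against the incomplete
// type and fails to compile.
class Impl;  // hypothetical stand-in for the forward-declared member type

class SourceInfo {
public:
    SourceInfo();
    ~SourceInfo();  // declared here, defined where Impl is complete
    bool hasImpl() const { return impl_ != nullptr; }

private:
    std::unique_ptr<Impl> impl_;
};

// "Source file" side: Impl is now a complete type, so the destructor
// (and constructor) can be defined.
class Impl {};
SourceInfo::SourceInfo() : impl_(new Impl) {}
SourceInfo::~SourceInfo() = default;
```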

@kmod (Collaborator)

kmod commented Jul 1, 2015

I think we might need to be careful about reentrancy -- what if we are in the baseline jit, and we hit a function call that ends up triggering the baseline jit on the same function? I haven't finished reading through the code, but assuming there isn't anything that explicitly handles this, I think there could be issues. For instance, the JitFragmentWriter will save the code_offset at the beginning of the jit, but it will end up writing the assembly to whatever the current offset is in the code block. I think this could mess up the offsets in the instructions, and also we will write the original code_offset into the CFGBlock even though some other assembly will live there :(

I think it might be enough to say "if the current codeblock offset at the end is not the same as when we started, abort the fragment"? But I'm not sure if that's necessary (or enough).

@undingen (Contributor Author)

undingen commented Jul 1, 2015

The JitCodeBlock has an is_currently_tracing member (which should probably be renamed) which makes sure there is always only one JIT fragment active per JitCodeBlock.
But with multithreading this will need some lock...
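The guard described here amounts to a boolean busy flag on the code block: a second fragment cannot be opened while one is in flight, so a reentrant attempt to jit the same function cannot interleave writes. A simplified sketch with illustrative names:

```cpp
#include <cassert>

// Simplified single-active-fragment guard. Single-threaded (GIL-protected)
// by assumption, as discussed above; a lock would be needed otherwise.
struct JitCodeBlockSketch {
    bool currently_tracing = false;
    int code_offset = 0;  // where the next fragment's code will land

    bool tryStartFragment() {
        if (currently_tracing)
            return false;  // another fragment is already being written
        currently_tracing = true;
        return true;
    }

    void finishFragment(int bytesWritten) {
        code_offset += bytesWritten;
        currently_tracing = false;
    }
};
```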

@kmod (Collaborator)

kmod commented Jul 2, 2015

Ah ok, nice :) Could we still have an assert in fragmentFinished/finishCompilation that asserts the fragmentwriter and the codeblock agree on where the code will end up?

Also I think it's safe to assume we're not getting rid of the GIL any time soon. Probably not until after Windows and Python 3 support :/

@undingen (Contributor Author)

undingen commented Jul 2, 2015

I will continue tomorrow with eliminating the shared_ptr and adding the check.
Can I squash it then, or do you prefer that I squash later because you want to review it in more detail?

@kmod (Collaborator)

kmod commented Jul 2, 2015

ok finally read through it all :) looks good to me; feel free to rebase/squash and we should be able to get this merged in tomorrow!

@undingen (Contributor Author)

undingen commented Jul 2, 2015

I made the last changes and squashed. There should be no functional changes, except that I re-enabled the callattr IC because I don't get the protobuf error any more. I don't think the issue is fixed, but I also don't think it's hiding inside the JIT (more likely the runtime ICs/rewriter); if we encounter it again I will try to track it down.
I also added a new commit with a small micro-optimization (changing the ASTInterpreter field offsets).
Should be ready for merging :-)

kmod added a commit that referenced this pull request Jul 6, 2015
Add a new JIT tier which is tightly coupled to the AST Interpreter
@kmod kmod merged commit dbe7866 into pyston:master Jul 6, 2015
@kmod (Collaborator)

kmod commented Jul 6, 2015

Awesome! Having a baseline JIT will change a lot of things :)

@undingen (Contributor Author)

undingen commented Jul 6, 2015

Cool :-)
