
Add a new JIT tier which is tightly coupled to the AST Interpreter #651

Merged 2 commits into pyston:master on Jul 6, 2015

Conversation

@undingen (Contributor)

This JIT is tightly coupled to the ASTInterpreter: at every CFGBlock* entry/exit
one can switch between interpreting and directly executing the generated code without having
to do any translation.
Generating the code is very fast compared to the LLVM tier, but the generated code is not as fast
as code generated by the higher LLVM tiers.
Still, because the JITed code can use runtime ICs, avoids a lot of interpretation overhead, and
stores CFGBlock-local symbols in register/stack slots, it's much faster than the interpreter.
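The per-block hand-off described here can be sketched roughly as follows. All names (Block, interpretBlock, runFunction) are illustrative stand-ins, not Pyston's actual API:

```cpp
#include <cassert>

// Sketch of the per-block hand-off described above. Every basic block
// records a pointer to its jitted entry (if any); jitted code returns the
// successor block, so execution can switch between the interpreter and
// generated code at any block boundary without translating state.
struct Block;
typedef Block* (*JittedEntry)();  // jitted code returns the next block

struct Block {
    JittedEntry code = nullptr;  // filled in once the baseline JIT ran
    Block* next = nullptr;       // successor the interpreter would pick
    int visits = 0;
};

// Stand-in for the AST interpreter executing one basic block.
Block* interpretBlock(Block* b) {
    ++b->visits;
    return b->next;
}

// Runs a function block by block, preferring jitted code when available;
// returns how many blocks were interpreted.
int runFunction(Block* entry) {
    int interpreted = 0;
    for (Block* b = entry; b != nullptr;) {
        if (b->code) {
            b = b->code();          // directly execute generated code
        } else {
            ++interpreted;
            b = interpretBlock(b);  // fall back to interpretation
        }
    }
    return interpreted;
}
```

Because the unit of switching is a whole basic block, no mid-block state translation is ever needed; the interpreter and the jitted code only meet at block boundaries.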

Current perf numbers:

first run:
              pyston (calibration)                      :    1.0s base1: 1.0 (+1.5%)
              pyston django_template.py                 :    5.2s base1: 6.7 (-23.0%)
              pyston pyxl_bench.py                      :    4.6s base1: 5.0 (-8.0%)
              pyston sqlalchemy_imperative2.py          :    5.7s base1: 6.8 (-14.9%)
              pyston django_migrate.py                  :    1.9s base1: 3.1 (-38.4%)
              pyston virtualenv_bench.py                :    6.2s base1: 8.1 (-22.9%)
              pyston interp2.py                         :    3.7s base1: 4.1 (-9.9%)
              pyston raytrace.py                        :    5.8s base1: 6.0 (-3.5%)
              pyston nbody.py                           :    7.3s base1: 7.0 (+4.1%)
              pyston fannkuch.py                        :    6.3s base1: 6.4 (-1.5%)
              pyston chaos.py                           :   17.5s base1: 18.1 (-3.6%)
              pyston fasta.py                           :    4.5s base1: 4.3 (+5.1%)
              pyston pidigits.py                        :    5.8s base1: 5.3 (+8.7%)
              pyston richards.py                        :    1.9s base1: 1.7 (+10.2%)
              pyston deltablue.py                       :    1.4s base1: 1.8 (-20.4%)
              pyston (geomean-10eb)                     :    4.6s base1: 5.1 (-9.5%)

best of 3 runs:

              pyston (calibration)                      :    1.0s base3: 1.0 (-0.5%)
              pyston django_template.py                 :    4.4s base3: 4.6 (-5.4%)
              pyston pyxl_bench.py                      :    3.9s base3: 3.8 (+2.5%)
              pyston sqlalchemy_imperative2.py          :    4.9s base3: 5.1 (-5.1%)
              pyston django_migrate.py                  :    1.7s base3: 1.7 (+1.7%)
              pyston virtualenv_bench.py                :    5.0s base3: 5.2 (-2.4%)
              pyston interp2.py                         :    3.6s base3: 4.1 (-12.4%)
              pyston raytrace.py                        :    5.5s base3: 5.4 (+1.4%)
              pyston nbody.py                           :    7.0s base3: 7.2 (-2.6%)
              pyston fannkuch.py                        :    6.1s base3: 6.3 (-2.0%)
              pyston chaos.py                           :   16.8s base3: 17.5 (-4.2%)
              pyston fasta.py                           :    4.4s base3: 4.1 (+5.8%)
              pyston pidigits.py                        :    5.7s base3: 5.3 (+8.2%)
              pyston richards.py                        :    1.7s base3: 1.5 (+11.3%)
              pyston deltablue.py                       :    1.2s base3: 1.4 (-12.9%)
              pyston (geomean-10eb)                     :    4.2s base3: 4.3 (-1.4%)


Because of the increased malloc traffic, this PR should benefit a lot from jemalloc...

@undingen (Contributor Author)

with jemalloc:
best of 3 runs:

              pyston (calibration)                      :    1.0s base3_je: 1.0 (+0.3%)
              pyston django_template.py                 :    3.9s base3_je: 4.3 (-9.5%)
              pyston pyxl_bench.py                      :    3.6s base3_je: 3.6 (-1.2%)
              pyston sqlalchemy_imperative2.py          :    4.2s base3_je: 4.6 (-7.8%)
              pyston django_migrate.py                  :    1.4s base3_je: 1.6 (-8.3%)
              pyston virtualenv_bench.py                :    4.1s base3_je: 4.7 (-12.7%)
              pyston interp2.py                         :    3.6s base3_je: 4.1 (-12.2%)
              pyston raytrace.py                        :    5.5s base3_je: 5.6 (-1.9%)
              pyston nbody.py                           :    6.9s base3_je: 6.9 (-1.2%)
              pyston fannkuch.py                        :    6.1s base3_je: 6.3 (-2.4%)
              pyston chaos.py                           :   16.6s base3_je: 17.2 (-4.0%)
              pyston fasta.py                           :    4.4s base3_je: 4.3 (+1.4%)
              pyston pidigits.py                        :    5.7s base3_je: 5.2 (+9.2%)
              pyston richards.py                        :    1.6s base3_je: 1.5 (+11.5%)
              pyston deltablue.py                       :    1.1s base3_je: 1.3 (-15.9%)
              pyston (geomean-10eb)                     :    3.9s base3_je: 4.1 (-4.2%)

@undingen (Contributor Author)

with jemalloc:
first run:

              pyston (calibration)                      :    1.0s base1_je: 1.0 (+0.1%)
              pyston django_template.py                 :    4.7s base1_je: 6.4 (-27.0%)
              pyston pyxl_bench.py                      :    4.4s base1_je: 4.9 (-10.8%)
              pyston sqlalchemy_imperative2.py          :    5.0s base1_je: 6.2 (-19.1%)
              pyston django_migrate.py                  :    1.6s base1_je: 2.9 (-45.3%)
              pyston virtualenv_bench.py                :    4.7s base1_je: 7.7 (-39.1%)
              pyston interp2.py                         :    3.6s base1_je: 4.1 (-12.4%)
              pyston raytrace.py                        :    5.7s base1_je: 5.9 (-3.8%)
              pyston nbody.py                           :    7.2s base1_je: 7.1 (+1.5%)
              pyston fannkuch.py                        :    6.2s base1_je: 6.4 (-3.3%)
              pyston chaos.py                           :   17.0s base1_je: 17.6 (-3.6%)
              pyston fasta.py                           :    4.6s base1_je: 4.4 (+5.3%)
              pyston pidigits.py                        :    5.7s base1_je: 5.2 (+9.5%)
              pyston richards.py                        :    1.7s base1_je: 1.6 (+6.6%)
              pyston deltablue.py                       :    1.3s base1_je: 1.7 (-20.4%)
              pyston (geomean-10eb)                     :    4.3s base1_je: 4.9 (-13.2%)

@undingen undingen force-pushed the inc_jit_rewriter2 branch 3 times, most recently from 4e5d195 to 359c6e2 Compare June 29, 2015 20:36
@undingen (Contributor Author)

I tried to reduce the amount of GC-allocated memory (to decrease the number of collections) by creating a wrapper around the ASTInterpreter class that just stores a pointer to the ASTInterpreter, so we only have to GC-allocate sizeof(Box) + sizeof(ASTInterpreter*), but it didn't improve performance. (Even though a large portion of the GC-allocated memory is ASTInterpreter instances.)

I also tried to align the jump targets to 16-byte addresses by emitting multi-byte NOP sequences, but this also didn't help. (It would have surprised me if it had helped; I don't think there is much to gain in this area.)
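For reference, this kind of padding is usually built from the standard Intel-recommended multi-byte NOP encodings (lengths 1 to 9 bytes). A sketch with hypothetical helper names, showing how an emitter might pad a code buffer to a 16-byte boundary:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard multi-byte NOP encodings (Intel's recommended sequences,
// lengths 1..9). A few long NOPs decode faster than a run of 0x90s.
static const std::vector<std::vector<uint8_t>> kNops = {
    { 0x90 },
    { 0x66, 0x90 },
    { 0x0F, 0x1F, 0x00 },
    { 0x0F, 0x1F, 0x40, 0x00 },
    { 0x0F, 0x1F, 0x44, 0x00, 0x00 },
    { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 },
    { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 },
    { 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
};

// Append NOPs until buf.size() is a multiple of `align` (hypothetical
// helper; the real emitter writes into its own code buffer).
void padToAlignment(std::vector<uint8_t>& buf, size_t align) {
    size_t pad = (align - buf.size() % align) % align;
    while (pad > 0) {
        size_t chunk = pad < kNops.size() ? pad : kNops.size();
        const std::vector<uint8_t>& nop = kNops[chunk - 1];
        buf.insert(buf.end(), nop.begin(), nop.end());
        pad -= chunk;
    }
}
```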

@undingen undingen force-pushed the inc_jit_rewriter2 branch 3 times, most recently from 279bca1 to fa593f6 Compare June 30, 2015 22:00
@@ -790,6 +790,8 @@ void Assembler::cmp(Indirect mem, Register reg) {
}

void Assembler::lea(Indirect mem, Register reg) {
RELEASE_ASSERT(mem.base != RSP, "We have to generate the SIB byte...");
Collaborator

I think we also need a SIB byte for R12.
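Background on why both registers need it: in the x86-64 ModRM byte, a base register whose low three bits are 100b cannot be encoded directly in the r/m field, because that value signals "a SIB byte follows". RSP (4) and R12 (12) share those low bits, which is what the assert and the review comment above are about. A minimal predicate:

```cpp
#include <cassert>

// In x86-64 ModRM encoding, r/m = 0b100 does not name a base register;
// it signals that a SIB byte follows. The REX.B bit extends the register
// number, but only the low three bits appear in the ModRM byte, so both
// RSP (4) and R12 (12) hit this case.
bool needsSIB(int baseRegNum) {
    return (baseRegNum & 0x7) == 0x4;  // true for RSP and R12
}
```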

@undingen (Contributor Author)

The main issues I see currently:

  • ugly additional members inside CFGBlock*
  • memory won't get freed
  • JitFragment and JitCodeBlock are not the best names; "baseline JIT" is also not a great name for such a simple JIT tier
  • integration with the IC/rewriter classes feels a little messy...


class JitFragment;

// Manages a fixed size memory block which stores JITed code.
Collaborator

Minor request -- I think it'd be nice to start with a little bit more detail here about the overall baseline-JIT strategy and how it relates to the interpreter. Just a quick high-level overview of what the classes in this file are for, how they get used, what our strategy is for going back and forth between the interpreter/jitted code.

@kmod (Collaborator)

kmod commented Jun 30, 2015

I'll read some more of the code in a sec, but overall my feedback is: yeah there's some stuff that is a little bit quick-and-dirty, but I think that's ok for now. I'd say, let's just be good about leaving notes on the things that we think could be better, then work on getting this in and iterating. I'm ok with this area of the code being a little bit rough to start since it feels like we'll be experimenting with it a lot.

a.push(assembler::RBP);
a.mov(assembler::RSP, assembler::RBP);

static_assert(scratch_size % 16 == 0, "stack alignment code depends on this");
Collaborator

I think we need to make sure that the eh_frame we emit matches the stack size that we ask for -- I think the eh_frame_template hardcodes a certain amount of scratch (48 bytes I think).

Contributor Author

That's why I chose to use frame pointers: IMHO, if I emit the EH info for a function with frame pointers, it doesn't contain the number of bytes by which the stack has to be adjusted, just that the stack pointer has to be restored by copying the base pointer.
At least that's what I thought was happening, based on comparing different EH table outputs.

Collaborator

oh, yeah you're right :)

@undingen undingen changed the title WIP: Add a new JIT tier which is tightly coupled to the AST Interpreter Add a new JIT tier which is tightly coupled to the AST Interpreter Jul 1, 2015
@undingen (Contributor Author)

undingen commented Jul 1, 2015

I addressed all your immediate comments, put some of the non-JIT changes into a separate pull request (#664), and rebased.
I didn't squash it yet in order to make reviewing easier.

sorted_symbol_table[source_info->getInternedStrings().get(PASSED_GENERATOR_NAME)] = generator;
// TODO: we will immediately want the liveness info again in the jit, we should pass
// it through.
std::unique_ptr<LivenessAnalysis> liveness = computeLivenessInfo(source_info->cfg);
Collaborator

Could this be getLiveness()? I think the call to computeLivenessInfo() in irgen.cpp could also be changed to access the cached version.
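The caching suggested here amounts to a lazily initialized member: compute the analysis once per CFG and hand out the cached instance afterwards. A rough sketch; the class shapes and the lazy-init detail are simplified stand-ins for the real SourceInfo:

```cpp
#include <cassert>
#include <memory>

// Simplified stand-ins for the real CFG/LivenessAnalysis types.
struct CFG {};

struct LivenessAnalysis {
    explicit LivenessAnalysis(CFG* cfg) { (void)cfg; }
};

struct SourceInfoSketch {
    CFG* cfg = nullptr;
    std::unique_ptr<LivenessAnalysis> liveness;

    // Compute on first use, then return the cached analysis.
    LivenessAnalysis* getLiveness() {
        if (!liveness)
            liveness.reset(new LivenessAnalysis(cfg));  // computed once
        return liveness.get();
    }
};
```

With this shape, both the interpreter/baseline JIT and irgen.cpp can call getLiveness() and share one computation per CFG.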

@undingen (Contributor Author)

undingen commented Jul 1, 2015

Addressed your comments in the last commit.

@kmod (Collaborator)

kmod commented Jul 1, 2015

I think the issue with unique_ptrs and forward-declared types is that the compiler will try to generate a default destructor (for the containing class), which won't work. You can get around this by declaring a destructor (i.e. `~SourceInfo();`); the fun part is you might not have to define it, since it probably doesn't get called anywhere :P
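A minimal illustration of the idiom kmod describes; Impl here is a hypothetical stand-in for the forward-declared member type:

```cpp
#include <cassert>
#include <memory>

// "Header" side: Impl is only forward-declared. Without a declared
// destructor, the compiler would generate ~SourceInfo() inline, which
// instantiates std::unique_ptr<Impl>'s deleter against the incomplete
// type and fails to compile.
class Impl;  // hypothetical stand-in for the forward-declared member type

class SourceInfo {
public:
    SourceInfo();
    ~SourceInfo();  // declared here, defined where Impl is complete
    bool hasImpl() const { return impl_ != nullptr; }

private:
    std::unique_ptr<Impl> impl_;
};

// "Source file" side: Impl is now a complete type, so the destructor
// (and constructor) can be defined.
class Impl {};
SourceInfo::SourceInfo() : impl_(new Impl) {}
SourceInfo::~SourceInfo() = default;
```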

@kmod (Collaborator)

kmod commented Jul 1, 2015

I think we might need to be careful about reentrancy -- what if we are in the baseline jit, and we hit a function call that ends up triggering the baseline jit on the same function? I haven't finished reading through the code, but assuming there isn't anything that explicitly handles this, I think there could be issues. For instance, the JitFragmentWriter will save the code_offset at the beginning of the jit, but it will end up writing the assembly to whatever the current offset is in the code block. I think this could mess up the offsets in the instructions, and also we will write the original code_offset into the CFGBlock even though some other assembly will live there :(

I think it might be enough to say "if the current codeblock offset at the end is not the same as when we started, abort the fragment"? But I'm not sure if that's necessary (or enough).

@undingen (Contributor Author)

undingen commented Jul 1, 2015

The JitCodeBlock has an is_currently_tracing member (which should probably be renamed) which makes sure there is always only one JIT fragment active per JitCodeBlock.
But with multithreading this will need some lock...
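The guard described here amounts to a boolean busy flag on the code block: a second fragment cannot be opened while one is in flight, so a reentrant attempt to jit the same function cannot interleave writes. A simplified sketch with illustrative names:

```cpp
#include <cassert>

// Simplified single-active-fragment guard. Single-threaded (GIL-protected)
// by assumption, as discussed above; a lock would be needed otherwise.
struct JitCodeBlockSketch {
    bool currently_tracing = false;
    int code_offset = 0;  // where the next fragment's code will land

    bool tryStartFragment() {
        if (currently_tracing)
            return false;  // another fragment is already being written
        currently_tracing = true;
        return true;
    }

    void finishFragment(int bytesWritten) {
        code_offset += bytesWritten;
        currently_tracing = false;
    }
};
```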

@kmod (Collaborator)

kmod commented Jul 2, 2015

Ah ok, nice :) Could we still have an assert in fragmentFinished/finishCompilation that asserts the fragmentwriter and the codeblock agree on where the code will end up?

Also I think it's safe to assume we're not getting rid of the GIL any time soon. Probably not until after Windows and Python 3 support :/

@undingen (Contributor Author)

undingen commented Jul 2, 2015

I will continue tomorrow with eliminating the shared_ptr and adding the check.
Can I squash it then, or do you prefer that I squash later because you want to review it in more detail?

@kmod (Collaborator)

kmod commented Jul 2, 2015

ok finally read through it all :) looks good to me; feel free to rebase/squash and we should be able to get this merged in tomorrow!

@undingen (Contributor Author)

undingen commented Jul 2, 2015

I made the last changes and squashed. There should be no functional changes, except that I re-enabled the callattr IC because I don't get the protobuf error any more. I don't think the issue is fixed, but I also don't think it's hiding inside the JIT (more likely the runtime ICs/rewriter); if we encounter it again I will try to track it down.
I also added a new commit with a small micro-optimization (changing the ASTInterpreter field offsets).
Should be ready for merging :-)

kmod added a commit that referenced this pull request Jul 6, 2015
Add a new JIT tier which is tightly coupled to the AST Interpreter
@kmod kmod merged commit dbe7866 into pyston:master Jul 6, 2015
@kmod (Collaborator)

kmod commented Jul 6, 2015

Awesome! Having a baseline JIT will change a lot of things :)

@undingen (Contributor Author)

undingen commented Jul 6, 2015

Cool :-)
