Improve checkcast/instanceof performance for interfaces on x86/x64 #14614

0xdaryl · 2022-03-01T11:04:25Z

PR #2361 introduced a number of JIT performance enhancements for checkcasts
involving interface classes. The two most significant changes were the
introduction of a one-slot, dynamic cache holding the last instance class that
successfully matched the cast class, and the inlining of the itable walk that
determines whether the instance class implements the cast class.

Each time a successful checkcast was performed the cache would be updated with
that result at runtime (it was an LRU cache). This cache was implemented as a
single, 8-byte data snippet located in the code cache. The performance of this
approach may be acceptable for checkcast sites that do not see more than one
instance class, or when there is only a single thread of execution along this
path. However, in the presence of multiple threads and multiple instance classes
the performance penalty of multiple threads writing to the same data address (the
class cache) is quite significant due to hardware cache coherency protocols.
In fact, the performance overhead gets significantly worse the more threads that
are involved.
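
To illustrate why the number of instance classes matters, here is a minimal, single-threaded sketch of a one-slot cache that counts how often the shared slot is written. `OneSlotCastCache` and `countStores` are hypothetical names for illustration only, not the JIT's actual data snippet. On real hardware, each of these stores made from a different core would force the cache line to move between cores.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a one-slot cast cache (not the JIT's data snippet).
struct OneSlotCastCache
   {
   std::atomic<std::uintptr_t> lastClass{0}; // last successful instance class
   std::atomic<std::size_t> stores{0};       // number of writes to the shared slot

   // Returns true on a hit. On a miss the slot is overwritten, which models
   // the original "update the cache at runtime" behaviour.
   bool checkAndUpdate(std::uintptr_t instanceClass)
      {
      if (lastClass.load(std::memory_order_relaxed) == instanceClass)
         return true;
      lastClass.store(instanceClass, std::memory_order_relaxed);
      stores.fetch_add(1, std::memory_order_relaxed);
      return false;
      }
   };

// Feeds a sequence of instance classes through one cache and returns the
// number of stores made to the shared slot.
std::size_t countStores(const std::vector<std::uintptr_t> &classes)
   {
   OneSlotCastCache cache;
   for (std::uintptr_t c : classes)
      cache.checkAndUpdate(c);
   return cache.stores.load();
   }
```

With a single instance class the slot is written once and then only read. With two alternating classes every check misses and writes the slot, which is the pattern that becomes expensive when several threads share the site.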

There is no information provided in #2361 for what motivated the change or the
workload it was expected to benefit.

This PR modifies that implementation as follows:

* Consult available profiling information to determine how many instance classes
  may be seen by this checkcast/instanceof site. If there is more than one then
  do not use a cache. If there is no information available then do not use a
  cache as the characteristics are unknown.
* If there is a single profiled instance class then pre-populate the cache with
  that class. Do not update the cache at runtime.
* If a cache is not used, then inline the interface table walk in mainline code
  rather than from outlined instructions.
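
The decision described in the list above can be sketched roughly as follows. `CastStrategy` and `chooseStrategy` are hypothetical names introduced for illustration, and the sketch assumes a count of zero means that no profiling information was available.

```cpp
#include <cstdint>

// Hypothetical names; the evaluator itself tracks this with a boolean.
enum class CastStrategy
   {
   PrepopulatedCache, // single profiled class: cache filled at compile time, never updated
   InlineITableWalk   // zero or several profiled classes: no cache, walk the itable inline
   };

// numProfiledClasses: distinct instance classes seen at this site by the
// profiler; 0 is assumed to mean "no information available".
CastStrategy chooseStrategy(std::uint32_t numProfiledClasses)
   {
   return (numProfiledClasses == 1)
      ? CastStrategy::PrepopulatedCache
      : CastStrategy::InlineITableWalk;
   }
```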

Three environment variables have been introduced to control behaviour of this
evaluator:

1) `TR_updateInterfaceCheckCastCacheSlot` : when set the interface cast cache
   slot will be updated at runtime (i.e., this is the original behaviour)
2) `TR_forceDisableInterfaceCastCache` : never use the cache regardless of
   profiling information.
3) `TR_forceEnableInterfaceCastCache` : force the use of the cache regardless of
   profiling information.

2) and 3) are mutually exclusive. Behaviour is undefined if both are set.

Using 1) and 3) will restore original behaviour.
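
As a rough sketch of how the three variables could combine, assuming that runtime updates only matter when the cache is in use: `CastCacheOptions`, `resolveOptions` and `resolveOptionsFromEnvironment` are hypothetical, and `std::getenv` stands in for the JIT's own `feGetEnv`.

```cpp
#include <cstdlib>

// Hypothetical result type for this sketch.
struct CastCacheOptions
   {
   bool useCache;        // emit the one-slot cache at this site
   bool updateAtRuntime; // let generated code overwrite the slot
   };

CastCacheOptions resolveOptions(bool profiledSingleClass, bool updateSlot,
                                bool forceDisable, bool forceEnable)
   {
   bool useCache = profiledSingleClass;
   if (forceDisable) useCache = false;
   // Setting both force flags is documented as undefined; in this sketch
   // the enable flag happens to win because it is applied last.
   if (forceEnable) useCache = true;
   // Assumption: updating the slot is meaningless without a cache.
   return { useCache, useCache && updateSlot };
   }

// Reads the three variables from the process environment.
CastCacheOptions resolveOptionsFromEnvironment(bool profiledSingleClass)
   {
   return resolveOptions(
      profiledSingleClass,
      std::getenv("TR_updateInterfaceCheckCastCacheSlot") != nullptr,
      std::getenv("TR_forceDisableInterfaceCastCache") != nullptr,
      std::getenv("TR_forceEnableInterfaceCastCache") != nullptr);
   }
```

Passing `updateSlot` and `forceEnable` together yields a cache that is always present and updated at runtime, which corresponds to the original behaviour.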

Signed-off-by: Daryl Maier <maier@ca.ibm.com>

0xdaryl · 2022-03-01T11:06:27Z

Jenkins test sanity.functional,extended.functional xlinux,win,osx jdk17

0xdaryl · 2022-03-02T00:46:00Z

Jenkins test sanity.functional,extended.functional win jdk17

0xdaryl · 2022-03-02T02:57:55Z

@BradleyWood @a7ehuo : would you mind reviewing this PR please?

FYI @JamesKingdon

0xdaryl · 2022-03-02T03:10:22Z

This was also tested internally within IBM on 32-bit Windows JDK8.

a7ehuo · 2022-03-02T19:25:59Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   bool doClassCache = (numSuccessfulClassChecks == 1) ? true : false;
+
+   static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL;
+   if (disableInterfaceCastCache) { doClassCache = false; }


Suggested change

-   if (disableInterfaceCastCache) { doClassCache = false; }
+   doClassCache = disableInterfaceCastCache ? false : doClassCache;

0xdaryl · 2022-05-10T17:58:27Z

I reworked the logic.

a7ehuo · 2022-03-02T19:26:42Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+   static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL;
+   if (disableInterfaceCastCache) { doClassCache = false; }
+   static bool enableInterfaceCastCache = feGetEnv("TR_forceEnableInterfaceCastCache") != NULL;
+   if (enableInterfaceCastCache) { doClassCache = true; }


Suggested change

-   if (enableInterfaceCastCache) { doClassCache = true; }
+   doClassCache = enableInterfaceCastCache ? true : doClassCache;

This expression could be simplified as enableInterfaceCastCache || doClassCache.

0xdaryl · 2022-05-10T17:58:37Z

I reworked the logic.
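
For reference, the two overrides from the diff can be folded into a single boolean expression. The sketch below checks that equivalence over all eight input combinations. The function names are hypothetical, and this is not necessarily the form the reworked code took.

```cpp
// The override sequence as written in the diff under review.
bool applyOverridesIfStyle(bool doClassCache, bool disable, bool enable)
   {
   if (disable) { doClassCache = false; }
   if (enable) { doClassCache = true; }
   return doClassCache;
   }

// The same logic as one expression, following the reviewer's
// "enable || doClassCache" simplification.
bool applyOverridesBoolean(bool doClassCache, bool disable, bool enable)
   {
   return enable || (doClassCache && !disable);
   }

// Returns true if both forms agree for every combination of inputs.
bool formsAgree()
   {
   for (int bits = 0; bits < 8; ++bits)
      {
      bool cache = (bits & 1) != 0;
      bool disable = (bits & 2) != 0;
      bool enable = (bits & 4) != 0;
      if (applyOverridesIfStyle(cache, disable, enable) !=
          applyOverridesBoolean(cache, disable, enable))
         return false;
      }
   return true;
   }
```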

a7ehuo · 2022-03-02T20:21:14Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+    */
+   TR::Register *scratchReg = (!doClassCache && isCheckCast) ? cg->allocateRegister() : NULL;
+
+   uint8_t numDeps = 2 + (scratchReg != NULL);


Is 2 here for castClassReg and instanceClassReg? If so, castClassReg could be NULL. Should numDeps be as below?

uint8_t numDeps = 1 + (castClassReg != NULL) + (scratchReg != NULL);

0xdaryl · 2022-05-10T18:00:11Z

Yes, I've changed that.
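
A small sketch of the suggested count, assuming the constant 1 accounts for instanceClassReg, which is always present. `Register` here is a hypothetical stand-in for `TR::Register`.

```cpp
#include <cstdint>

struct Register {}; // hypothetical stand-in for TR::Register

// One dependency for the instance class register (assumed always present),
// plus one each for the optional cast class and scratch registers.
std::uint8_t countDependencies(const Register *castClassReg, const Register *scratchReg)
   {
   return static_cast<std::uint8_t>(
      1 + (castClassReg != nullptr) + (scratchReg != nullptr));
   }
```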

a7ehuo · 2022-03-02T22:24:07Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

         {
-         if (tmp)
+         TR_ASSERT_FATAL(scratchReg, "Scratch register required for iTable lookup");


If a scratch register is required for the iTable lookup, shouldn't this fatal assert be placed before obtaining the iTable at line 4226? Also, scratchReg doesn't seem to be used after this line.

0xdaryl · 2022-05-22T12:22:15Z

I don't remember what I was thinking here. On this path, scratchReg will always be allocated for checkcasts and you're right that it is a little late to be checking that here. I'm just going to delete the assert as I don't think it is helpful.

a7ehuo · 2022-03-02T22:28:38Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+         // CheckCast iTable fail lookup out-of-line
+         TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg);
+
+         generateRegInstruction(TR::InstOpCode::PUSHReg, node, instanceClassReg, cg);


Why should instanceClassReg be pushed to the stack here when checkcast fails?

0xdaryl · 2022-05-22T12:22:44Z

When the checkcast fails we have to throw an exception. The instance class is one of the parameters to the JIT helper that will throw the class cast exception. It is pushed to the stack per the calling convention of that helper.

a7ehuo · 2022-03-02T22:33:58Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+       * Use the scratch register if it is available.  Otherwise, re-use the
+       * instanceClassReg register to perform itable lookup
+       */
+      TR::Register *itableReg = scratchReg ? scratchReg : instanceClassReg;


If reusing instanceClassReg, should it be pushed to the stack to be saved?

0xdaryl · 2022-05-22T12:20:50Z

No. We reach here only if we are not using the cast cache for instanceofs or checkcasts. scratchReg is used for checkcasts, and instanceClassReg is itself a scratch register that can be re-used for instanceofs. Preserving the registers by pushing onto the stack is only required when there is an out-of-line sequence, and there won't be one for instanceofs.

a7ehuo · 2022-03-02T22:52:24Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

      if (!isCheckCast)
         {
+         // Class found in itable
         generateInstruction(TR::InstOpCode::STC, node, cg);
-         }
-      generateLabelInstruction(TR::InstOpCode::JMP4, node, endLabel, cg);

-      // Not found
-      generateVFPRestoreInstruction(vfp, node, cg);
-      generateLabelInstruction(TR::InstOpCode::label, node, iTableLookUpFailLabel, cg);
-      if (isCheckCast)
+         // Fall through to endLabel
+         }
+      else


I wonder if the case (not found && !isCheckCast) is missing here.

    if (!isCheckCast)
       {
       // Class found in itable
       generateInstruction(TR::InstOpCode::STC, node, cg);
       // Fall through to endLabel
       }
    else
       {
       // isCheckCast && not found
       // What about (!isCheckCast && not found)?
       ...
       TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg);
       ...
       }

I wonder if the logic might be like this?

    if (!isCheckCast)
       {
       // Class found in itable
       generateInstruction(TR::InstOpCode::STC, node, cg);
       // Fall through to endLabel
       }

    // iTable fail lookup out-of-line
    TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg);
    if (isCheckCast)
       {
       ....
       }
    else
       {
       ...
       }
    og.endOutlinedInstructionSequence();

0xdaryl · 2022-05-22T12:21:46Z

After inlineInterfaceLookup() there are only two cases to consider here: whether we're implementing an instanceof or a checkcast. In the case of an instanceof, inlineInterfaceLookup() will have performed the test and all that remains to be done here is to set the carry flag (via STC) depending on the outcome of the test. In the case of a checkcast, we must generate an OOL sequence that will throw the ClassCastException in the event that the lookup failed.
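
At a much higher level than the generated x86 code, the two cases can be modelled as below. This sketch assumes the itable is a singly linked list of interface entries. `ITableEntry`, `implementsInterface`, `instanceOf` and `checkCast` are hypothetical names, and a C++ exception stands in for the ClassCastException thrown by the JIT helper.

```cpp
#include <stdexcept>

// Hypothetical model of an itable entry, not the VM's actual layout.
struct ITableEntry
   {
   const void *interfaceClass;
   const ITableEntry *next;
   };

// The shared lookup: walk the itable looking for the cast class.
bool implementsInterface(const ITableEntry *iTable, const void *castClass)
   {
   for (const ITableEntry *e = iTable; e != nullptr; e = e->next)
      {
      if (e->interfaceClass == castClass)
         return true;
      }
   return false;
   }

// instanceof: the lookup result is the answer; nothing needs to be thrown.
bool instanceOf(const ITableEntry *iTable, const void *castClass)
   {
   return implementsInterface(iTable, castClass);
   }

// checkcast: a failed lookup takes the rarely executed failure path.
void checkCast(const ITableEntry *iTable, const void *castClass)
   {
   if (!implementsInterface(iTable, castClass))
      throw std::runtime_error("ClassCastException");
   }
```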

BradleyWood · 2022-03-03T14:47:28Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

+    */
+   bool doClassCache = (numSuccessfulClassChecks == 1) ? true : false;
+
+   static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL;


Could you add an assertion that disableInterfaceCastCache and enableInterfaceCastCache are not both true?

BradleyWood · 2022-03-03T14:47:33Z

runtime/compiler/x/codegen/J9TreeEvaluator.cpp

   // a push 32bit immediate instruction to pass it on the stack to the jitThrowClassCastException helper
   // as the address gets sign extended. It needs to be stored in a temp register and then push the
   // register to the stack.
-   auto highClass = (comp->target().is64Bit() && ((uintptr_t)clazz) > INT_MAX) ? true : false;
+   auto highClass = (comp->target().is64Bit() && ((uintptr_t)castClass) > INT_MAX) ? true : false;


Use of ternary operator is redundant

Why use auto for highClass but not for doClassCache or elsewhere?

0xdaryl · 2022-05-10T17:51:58Z

The original author used the auto keyword throughout. I'm not a fan of that style except in limited circumstances. I'll remove the keyword from this statement and a couple other places, as well as replace the ternary operator.
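
For context on the highClass test itself, the sketch below shows the condition without the ternary, along with a model of the sign extension that makes a 32-bit immediate push unsuitable for addresses above INT_MAX. `needsRegisterPush` and `signExtendImm32` are hypothetical helper names, and the narrowing cast assumes a two's-complement target.

```cpp
#include <climits>
#include <cstdint>

// The comparison already yields a bool, so no "? true : false" is needed.
bool needsRegisterPush(bool is64Bit, std::uintptr_t classAddress)
   {
   return is64Bit && classAddress > static_cast<std::uintptr_t>(INT_MAX);
   }

// Models a 64-bit push of a 32-bit immediate: the immediate is sign-extended
// to 64 bits, so any value with bit 31 set gains 32 leading one bits.
std::uint64_t signExtendImm32(std::uint32_t imm)
   {
   return static_cast<std::uint64_t>(
      static_cast<std::int64_t>(static_cast<std::int32_t>(imm)));
   }
```

An address such as 0x80000000 would arrive at the helper as 0xFFFFFFFF80000000 if pushed as an immediate, which is why it is first loaded into a register.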

0xdaryl

Comments addressed. Fixes in forced push.

0xdaryl · 2022-05-22T15:45:55Z

Jenkins test sanity.functional,extended.functional xlinux,win,osx jdk17

0xdaryl · 2022-05-22T17:15:26Z

Jenkins test sanity.functional,extended.functional xlinux,win,osx jdk17

0xdaryl · 2022-05-23T12:24:23Z

Jenkins test extended.functional xlinux jdk17

0xdaryl · 2022-05-23T17:55:48Z

@BradleyWood @a7ehuo : Your comments addressed and CI testing passed. Please review again.

a7ehuo · 2022-05-24T15:43:40Z

LGTM. I'm not able to open this PR in a browser; I get the error "This page is taking too long to load". I have no issues opening other OpenJ9 PRs. I reviewed the update in the GitHub app on my iPad.

0xdaryl · 2022-05-24T16:28:32Z

I can't access it either with a browser and I opened a GH support ticket: https://support.github.com/ticket/personal/0/1636593

BradleyWood · 2022-05-26T19:09:22Z

LGTM

ymanton

I have to look at this via the GH app on my phone, so I've only done a very cursory review.

0xdaryl added comp:jit arch:x86 labels Mar 1, 2022

a7ehuo reviewed Mar 2, 2022

BradleyWood reviewed Mar 3, 2022

0xdaryl force-pushed the nocheckcastcache branch from 016ebf5 to 50fe6ca Compare May 22, 2022 15:39

0xdaryl force-pushed the nocheckcastcache branch from 50fe6ca to 701c865 Compare May 22, 2022 17:13

ymanton approved these changes Jun 2, 2022

ymanton merged commit 6cc63c8 into eclipse-openj9:master Jun 2, 2022