Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve checkcast/instanceof performance for interfaces on x86/x64 #14614

Merged
merged 1 commit into from
Jun 2, 2022

Conversation

0xdaryl
Copy link
Contributor

@0xdaryl 0xdaryl commented Mar 1, 2022

PR #2361 introduced a number of JIT performance enhancements for checkcasts
involving interface classes. Two of the most significant changes were the
introduction of a one-slot, dynamic cache to cache the last successful instance
class that matched the cast class. It also inlined the itable walk to determine
if the interface class implemented the cast class.

Each time a successful checkcast was performed the cache would be updated with
that result at runtime (it was an LRU cache). This cache was implemented as a
single, 8-byte data snippet located in the code cache. The performance of this
approach may be acceptable for checkcast sites that do not see more than one
instance class, or when there is only a single thread of execution along this
path. However, in the presence of multiple threads and multiple instance classes
the performance penalty of multiple threads writing to the same data address (the
class cache) is quite significant due to hardware cache coherency protocols.
In fact, the performance overhead gets significantly worse the more threads that
are involved.

There is no information provided in #2361 for what motivated the change or the
workload it was expected to benefit.

This PR modifies that implementation as follows:

  • Consult available profiling information to determine how many instance classes
    may be seen by this checkcast/instanceof site. If there are more than one then
    do not use a cache. If there is no information available then do not use a
    cache as the characteristics are unknown.

  • If there is a single profiled instance class then pre-populate the cache with
    that class. Do not update the cache at runtime.

  • If a cache is not used, then inline the interface table walk in mainline code
    rather than from outlined instructions.

Three environment variables have been introduced to control behaviour of this
evaluator:

  1. TR_updateInterfaceCheckCastCacheSlot : when set the interface cast cache
    slot will be updated at runtime (i.e., this is the original behaviour)

  2. TR_forceDisableInterfaceCastCache : never use the cache regardless of
    profiling information.

  3. TR_forceEnableInterfaceCastCache : force the use of the cache regardless of
    profiling information.

  4. and 3) are mutually exclusive. Behaviour is undefined if both are set.

Using 1) and 3) will restore original behaviour.

Signed-off-by: Daryl Maier maier@ca.ibm.com

@0xdaryl
Copy link
Contributor Author

0xdaryl commented Mar 1, 2022

Jenkins test sanity.functional,extended.functional xlinux,win,osx jdk17

@0xdaryl
Copy link
Contributor Author

0xdaryl commented Mar 2, 2022

Jenkins test sanity.functional,extended.functional win jdk17

@0xdaryl
Copy link
Contributor Author

0xdaryl commented Mar 2, 2022

@BradleyWood @a7ehuo : would you mind reviewing this PR please?

FYI @JamesKingdon

@0xdaryl
Copy link
Contributor Author

0xdaryl commented Mar 2, 2022

This was also tested internally within IBM on 32-bit Windows JDK8.

bool doClassCache = (numSuccessfulClassChecks == 1) ? true : false;

static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL;
if (disableInterfaceCastCache) { doClassCache = false; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (disableInterfaceCastCache) { doClassCache = false; }
doClassCache = disableInterfaceCastCache ? false : doClassCache;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked the logic.

static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL;
if (disableInterfaceCastCache) { doClassCache = false; }
static bool enableInterfaceCastCache = feGetEnv("TR_forceEnableInterfaceCastCache") != NULL;
if (enableInterfaceCastCache) { doClassCache = true; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (enableInterfaceCastCache) { doClassCache = true; }
doClassCache = enableInterfaceCastCache ? true : doClassCache;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This expression could be simplified as enableInterfaceCastCache || doClassCache

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked the logic.

*/
TR::Register *scratchReg = (!doClassCache && isCheckCast) ? cg->allocateRegister() : NULL;

uint8_t numDeps = 2 + (scratchReg != NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 2 here for castClassReg and instanceClassReg? If so, castClassReg could be NULL. Should numDeps be as below?

uint8_t numDeps = 1 + (castClassReg != NULL) +  (scratchReg != NULL);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've changed that.

{
if (tmp)
TR_ASSERT_FATAL(scratchReg, "Scratch register required for iTable lookup");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If scratch register is required for iTable lookup, shouldn't this fatal assert be placed before obtaining iTable at line 4226? And scratchReg doesn't seem to be used after this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember what I was thinking here. On this path, scratchReg will always be allocated for checkcasts and you're right that it is a little late to be checking that here. I'm just going to delete the assert as I don't think it is helpful.

// CheckCast iTable fail lookup out-of-line
TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg);

generateRegInstruction(TR::InstOpCode::PUSHReg, node, instanceClassReg, cg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should instanceClassReg be pushed to the stack here when checkcast fails?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the checkcast fails we have to throw an exception. The instance class is one of the parameters to the JIT helper that will throw the class cast exception. It is pushed to the stack per the calling convention of that helper.

* Use the scratch register if it is available. Otherwise, re-use the
* instanceClassReg register to perform itable lookup
*/
TR::Register *itableReg = scratchReg ? scratchReg : instanceClassReg;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If reusing instanceClassReg, should it be pushed to the stack to be saved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. We reach here only if we are not using the cast cache for instanceofs or checkcasts. scratchReg is used for checkcasts, and instanceClassReg is itself a scratch register that can be re-used for instanceofs. Preserving the registers by pushing onto the stack is only required when there is an out-of-line sequence, and there won't be one for instanceofs.

Comment on lines 4242 to +4253
if (!isCheckCast)
{
// Class found in itable
generateInstruction(TR::InstOpCode::STC, node, cg);
}
generateLabelInstruction(TR::InstOpCode::JMP4, node, endLabel, cg);

// Not found
generateVFPRestoreInstruction(vfp, node, cg);
generateLabelInstruction(TR::InstOpCode::label, node, iTableLookUpFailLabel, cg);
if (isCheckCast)
// Fall through to endLabel
}
else
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the case (not found && !isCheckCast) is missing here.

      if (!isCheckCast)
         {
         // Class found in itable
         generateInstruction(TR::InstOpCode::STC, node, cg);
         // Fall through to endLabel
         }
      else
         { 
         // isCheckCast && not found 
         // What about (!isCheckCast && not found)?
          ...
         TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg);
         ... 
         }

I wonder if the logic might be like this?

      if (!isCheckCast)
         {
         // Class found in itable
         generateInstruction(TR::InstOpCode::STC, node, cg);

         // Fall through to endLabel
         }

      // iTable fail lookup out-of-line
      TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg);
      if (isCheckCast)
         { .... }
      else
          { ... } 

      og.endOutlinedInstructionSequence();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After inlineInterfaceLookup() there are only two cases to consider here: whether we're implementing an instanceof or a checkcast. In the case of an instanceof, inlineInterfaceLookup() will have performed the test and all that remains to be done here is to set the carry flag (via STC) depending on the outcome of the test. In the case of a checkcast, we must generate an OOL sequence that will throw the ClassCastException in the event that the lookup failed.

*/
bool doClassCache = (numSuccessfulClassChecks == 1) ? true : false;

static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add an assertion that disableInterfaceCastCache and enableInterfaceCastCache are not both true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

// a push 32bit immediate instruction to pass it on the stack to the jitThrowClassCastException helper
// as the address gets sign extended. It needs to be stored in a temp register and then push the
// register to the stack.
auto highClass = (comp->target().is64Bit() && ((uintptr_t)clazz) > INT_MAX) ? true : false;
auto highClass = (comp->target().is64Bit() && ((uintptr_t)castClass) > INT_MAX) ? true : false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use of ternary operator is redundant

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use auto for highClass but not for doClassCache or elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original author used the auto keyword throughout. I'm not a fan of that style except in limited circumstances. I'll remove the keyword from this statement and a couple other places, as well as replace the ternary operator.

Copy link
Contributor Author

@0xdaryl 0xdaryl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments addressed. Fixes in forced push.

// a push 32bit immediate instruction to pass it on the stack to the jitThrowClassCastException helper
// as the address gets sign extended. It needs to be stored in a temp register and then push the
// register to the stack.
auto highClass = (comp->target().is64Bit() && ((uintptr_t)clazz) > INT_MAX) ? true : false;
auto highClass = (comp->target().is64Bit() && ((uintptr_t)castClass) > INT_MAX) ? true : false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original author used the auto keyword throughout. I'm not a fan of that style except in limited circumstances. I'll remove the keyword from this statement and a couple other places, as well as replace the ternary operator.

*/
bool doClassCache = (numSuccessfulClassChecks == 1) ? true : false;

static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

bool doClassCache = (numSuccessfulClassChecks == 1) ? true : false;

static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL;
if (disableInterfaceCastCache) { doClassCache = false; }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked the logic.

static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL;
if (disableInterfaceCastCache) { doClassCache = false; }
static bool enableInterfaceCastCache = feGetEnv("TR_forceEnableInterfaceCastCache") != NULL;
if (enableInterfaceCastCache) { doClassCache = true; }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reworked the logic.

*/
TR::Register *scratchReg = (!doClassCache && isCheckCast) ? cg->allocateRegister() : NULL;

uint8_t numDeps = 2 + (scratchReg != NULL);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've changed that.

* Use the scratch register if it is available. Otherwise, re-use the
* instanceClassReg register to perform itable lookup
*/
TR::Register *itableReg = scratchReg ? scratchReg : instanceClassReg;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. We reach here only if we are not using the cast cache for instanceofs or checkcasts. scratchReg is used for checkcasts, and instanceClassReg is itself a scratch register that can be re-used for instanceofs. Preserving the registers by pushing onto the stack is only required when there is an out-of-line sequence, and there won't be one for instanceofs.

Comment on lines 4242 to +4253
if (!isCheckCast)
{
// Class found in itable
generateInstruction(TR::InstOpCode::STC, node, cg);
}
generateLabelInstruction(TR::InstOpCode::JMP4, node, endLabel, cg);

// Not found
generateVFPRestoreInstruction(vfp, node, cg);
generateLabelInstruction(TR::InstOpCode::label, node, iTableLookUpFailLabel, cg);
if (isCheckCast)
// Fall through to endLabel
}
else
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After inlineInterfaceLookup() there are only two cases to consider here: whether we're implementing an instanceof or a checkcast. In the case of an instanceof, inlineInterfaceLookup() will have performed the test and all that remains to be done here is to set the carry flag (via STC) depending on the outcome of the test. In the case of a checkcast, we must generate an OOL sequence that will throw the ClassCastException in the event that the lookup failed.

{
if (tmp)
TR_ASSERT_FATAL(scratchReg, "Scratch register required for iTable lookup");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember what I was thinking here. On this path, scratchReg will always be allocated for checkcasts and you're right that it is a little late to be checking that here. I'm just going to delete the assert as I don't think it is helpful.

// CheckCast iTable fail lookup out-of-line
TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg);

generateRegInstruction(TR::InstOpCode::PUSHReg, node, instanceClassReg, cg);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the checkcast fails we have to throw an exception. The instance class is one of the parameters to the JIT helper that will throw the class cast exception. It is pushed to the stack per the calling convention of that helper.

@0xdaryl
Copy link
Contributor Author

0xdaryl commented May 22, 2022

Jenkins test sanity.functional,extended.functional xlinux,win,osx jdk17

PR eclipse-openj9#2361 introduced a number of JIT performance enhancements for checkcasts
involving interface classes.  Two of the most significant changes were the
introduction of a one-slot, dynamic cache to cache the last successful instance
class that matched the cast class.  It also inlined the itable walk to determine
if the interface class implemented the cast class.

Each time a successful checkcast was performed the cache would be updated with
that result at runtime (it was an LRU cache).  This cache was implemented as a
single, 8-byte data snippet located in the code cache.  The performance of this
approach may be acceptable for checkcast sites that do not see more than one
instance class, or when there is only a single thread of execution along this
path.  However, in the presence of multiple threads and multiple instance classes
the performance penalty of multiple threads writing to the same data address (the
class cache) is quite significant due to hardware cache coherency protocols.
In fact, the performance overhead gets significantly worse the more threads that
are involved.

There is no information provided in eclipse-openj9#2361 for what motivated the change or the
workload it was expected to benefit.

This PR modifies that implementation as follows:

* Consult available profiling information to determine how many instance classes
  may be seen by this checkcast/instanceof site.  If there are more than one then
  do not use a cache.  If there is no information available then do not use a
  cache as the characteristics are unknown.

* If there is a single profiled instance class then pre-populate the cache with
  that class.  Do not update the cache at runtime.

* If a cache is not used, then inline the interface table walk in mainline code
  rather than from outlined instructions.

Three environment variables have been introduced to control behaviour of this
evaluator:

1) `TR_updateInterfaceCheckCastCacheSlot` : when set the interface cast cache
   slot will be updated at runtime (i.e., this is the original behaviour)

2) `TR_forceDisableInterfaceCastCache` : never use the cache regardless of
   profiling information.

3) `TR_forceEnableInterfaceCastCache` : force the use of the cache regardless of
   profiling information.

2) and 3) are mutually exclusive.  Behaviour is undefined if both are set.

Using 1) and 3) will restore original behaviour.

Signed-off-by: Daryl Maier <maier@ca.ibm.com>
@0xdaryl
Copy link
Contributor Author

0xdaryl commented May 22, 2022

Jenkins test sanity.functional,extended.functional xlinux,win,osx jdk17

@0xdaryl
Copy link
Contributor Author

0xdaryl commented May 23, 2022

Jenkins test extended.functional xlinux jdk17

@0xdaryl
Copy link
Contributor Author

0xdaryl commented May 23, 2022

@BradleyWood @a7ehuo : Your comments addressed and CI testing passed. Please review again.

@a7ehuo
Copy link
Contributor

a7ehuo commented May 24, 2022

LGTM. I'm not able to open this PR in browsers. I got an error "This page is taking too long to load". I have no issues to open other OpenJ9 PRs. I reviewed the update in my iPad GitHub app.

@0xdaryl
Copy link
Contributor Author

0xdaryl commented May 24, 2022

I can't access it either with a browser and I opened a GH support ticket: https://support.github.com/ticket/personal/0/1636593

@BradleyWood
Copy link
Member

LGTM

Copy link
Member

@ymanton ymanton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to look at this via the GH app on my phone, so I've only done a very cursory review.

@ymanton ymanton merged commit 6cc63c8 into eclipse-openj9:master Jun 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants