-
Notifications
You must be signed in to change notification settings - Fork 706
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve checkcast/instanceof performance for interfaces on x86/x64 #14614
Conversation
Jenkins test sanity.functional,extended.functional xlinux,win,osx jdk17 |
Jenkins test sanity.functional,extended.functional win jdk17 |
@BradleyWood @a7ehuo : would you mind reviewing this PR please? FYI @JamesKingdon |
This was also tested internally within IBM on 32-bit Windows JDK8. |
bool doClassCache = (numSuccessfulClassChecks == 1) ? true : false; | ||
|
||
static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL; | ||
if (disableInterfaceCastCache) { doClassCache = false; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (disableInterfaceCastCache) { doClassCache = false; } | |
doClassCache = disableInterfaceCastCache ? false : doClassCache; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworked the logic.
static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL; | ||
if (disableInterfaceCastCache) { doClassCache = false; } | ||
static bool enableInterfaceCastCache = feGetEnv("TR_forceEnableInterfaceCastCache") != NULL; | ||
if (enableInterfaceCastCache) { doClassCache = true; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (enableInterfaceCastCache) { doClassCache = true; } | |
doClassCache = enableInterfaceCastCache ? true : doClassCache; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This expression could be simplified as enableInterfaceCastCache || doClassCache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworked the logic.
*/ | ||
TR::Register *scratchReg = (!doClassCache && isCheckCast) ? cg->allocateRegister() : NULL; | ||
|
||
uint8_t numDeps = 2 + (scratchReg != NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is 2
here for castClassReg
and instanceClassReg
? If so, castClassReg
could be NULL
. Should numDeps
be as below?
uint8_t numDeps = 1 + (castClassReg != NULL) + (scratchReg != NULL);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've changed that.
{ | ||
if (tmp) | ||
TR_ASSERT_FATAL(scratchReg, "Scratch register required for iTable lookup"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If scratch register is required for iTable lookup, shouldn't this fatal assert be placed before obtaining iTable at line 4226? And scratchReg
doesn't seem to be used after this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember what I was thinking here. On this path, scratchReg
will always be allocated for checkcasts and you're right that it is a little late to be checking that here. I'm just going to delete the assert as I don't think it is helpful.
// CheckCast iTable fail lookup out-of-line | ||
TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg); | ||
|
||
generateRegInstruction(TR::InstOpCode::PUSHReg, node, instanceClassReg, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should instanceClassReg
be pushed to the stack here when checkcast fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the checkcast fails we have to throw an exception. The instance class is one of the parameters to the JIT helper that will throw the class cast exception. It is pushed to the stack per the calling convention of that helper.
* Use the scratch register if it is available. Otherwise, re-use the | ||
* instanceClassReg register to perform itable lookup | ||
*/ | ||
TR::Register *itableReg = scratchReg ? scratchReg : instanceClassReg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If reusing instanceClassReg
, should it be pushed to the stack to be saved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. We reach here only if we are not using the cast cache for instanceofs or checkcasts. scratchReg
is used for checkcasts, and instanceClassReg
is itself a scratch register that can be re-used for instanceofs. Preserving the registers by pushing onto the stack is only required when there is an out-of-line sequence, and there won't be one for instanceofs.
if (!isCheckCast) | ||
{ | ||
// Class found in itable | ||
generateInstruction(TR::InstOpCode::STC, node, cg); | ||
} | ||
generateLabelInstruction(TR::InstOpCode::JMP4, node, endLabel, cg); | ||
|
||
// Not found | ||
generateVFPRestoreInstruction(vfp, node, cg); | ||
generateLabelInstruction(TR::InstOpCode::label, node, iTableLookUpFailLabel, cg); | ||
if (isCheckCast) | ||
// Fall through to endLabel | ||
} | ||
else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the case (not found && !isCheckCast
) is missing here.
if (!isCheckCast)
{
// Class found in itable
generateInstruction(TR::InstOpCode::STC, node, cg);
// Fall through to endLabel
}
else
{
// isCheckCast && not found
// What about (!isCheckCast && not found)?
...
TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg);
...
}
I wonder if the logic might be like this?
if (!isCheckCast)
{
// Class found in itable
generateInstruction(TR::InstOpCode::STC, node, cg);
// Fall through to endLabel
}
// iTable fail lookup out-of-line
TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg);
if (isCheckCast)
{ .... }
else
{ ... }
og.endOutlinedInstructionSequence();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After inlineInterfaceLookup()
there are only two cases to consider here: whether we're implementing an instanceof or a checkcast. In the case of an instanceof, inlineInterfaceLookup()
will have performed the test and all that remains to be done here is to set the carry flag (via STC
) depending on the outcome of the test. In the case of a checkcast, we must generate an OOL sequence that will throw the ClassCastException in the event that the lookup failed.
*/ | ||
bool doClassCache = (numSuccessfulClassChecks == 1) ? true : false; | ||
|
||
static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add an assertion that disableInterfaceCastCache
and enableInterfaceCastCache
are not both true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
// a push 32bit immediate instruction to pass it on the stack to the jitThrowClassCastException helper | ||
// as the address gets sign extended. It needs to be stored in a temp register and then push the | ||
// register to the stack. | ||
auto highClass = (comp->target().is64Bit() && ((uintptr_t)clazz) > INT_MAX) ? true : false; | ||
auto highClass = (comp->target().is64Bit() && ((uintptr_t)castClass) > INT_MAX) ? true : false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use of ternary operator is redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use auto for highClass
but not for doClassCache
or elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original author used the auto
keyword throughout. I'm not a fan of that style except in limited circumstances. I'll remove the keyword from this statement and a couple other places, as well as replace the ternary operator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comments addressed. Fixes in forced push.
// a push 32bit immediate instruction to pass it on the stack to the jitThrowClassCastException helper | ||
// as the address gets sign extended. It needs to be stored in a temp register and then push the | ||
// register to the stack. | ||
auto highClass = (comp->target().is64Bit() && ((uintptr_t)clazz) > INT_MAX) ? true : false; | ||
auto highClass = (comp->target().is64Bit() && ((uintptr_t)castClass) > INT_MAX) ? true : false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original author used the auto
keyword throughout. I'm not a fan of that style except in limited circumstances. I'll remove the keyword from this statement and a couple other places, as well as replace the ternary operator.
*/ | ||
bool doClassCache = (numSuccessfulClassChecks == 1) ? true : false; | ||
|
||
static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
bool doClassCache = (numSuccessfulClassChecks == 1) ? true : false; | ||
|
||
static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL; | ||
if (disableInterfaceCastCache) { doClassCache = false; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworked the logic.
static bool disableInterfaceCastCache = feGetEnv("TR_forceDisableInterfaceCastCache") != NULL; | ||
if (disableInterfaceCastCache) { doClassCache = false; } | ||
static bool enableInterfaceCastCache = feGetEnv("TR_forceEnableInterfaceCastCache") != NULL; | ||
if (enableInterfaceCastCache) { doClassCache = true; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reworked the logic.
*/ | ||
TR::Register *scratchReg = (!doClassCache && isCheckCast) ? cg->allocateRegister() : NULL; | ||
|
||
uint8_t numDeps = 2 + (scratchReg != NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've changed that.
* Use the scratch register if it is available. Otherwise, re-use the | ||
* instanceClassReg register to perform itable lookup | ||
*/ | ||
TR::Register *itableReg = scratchReg ? scratchReg : instanceClassReg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. We reach here only if we are not using the cast cache for instanceofs or checkcasts. scratchReg
is used for checkcasts, and instanceClassReg
is itself a scratch register that can be re-used for instanceofs. Preserving the registers by pushing onto the stack is only required when there is an out-of-line sequence, and there won't be one for instanceofs.
if (!isCheckCast) | ||
{ | ||
// Class found in itable | ||
generateInstruction(TR::InstOpCode::STC, node, cg); | ||
} | ||
generateLabelInstruction(TR::InstOpCode::JMP4, node, endLabel, cg); | ||
|
||
// Not found | ||
generateVFPRestoreInstruction(vfp, node, cg); | ||
generateLabelInstruction(TR::InstOpCode::label, node, iTableLookUpFailLabel, cg); | ||
if (isCheckCast) | ||
// Fall through to endLabel | ||
} | ||
else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After inlineInterfaceLookup()
there are only two cases to consider here: whether we're implementing an instanceof or a checkcast. In the case of an instanceof, inlineInterfaceLookup()
will have performed the test and all that remains to be done here is to set the carry flag (via STC
) depending on the outcome of the test. In the case of a checkcast, we must generate an OOL sequence that will throw the ClassCastException in the event that the lookup failed.
{ | ||
if (tmp) | ||
TR_ASSERT_FATAL(scratchReg, "Scratch register required for iTable lookup"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember what I was thinking here. On this path, scratchReg
will always be allocated for checkcasts and you're right that it is a little late to be checking that here. I'm just going to delete the assert as I don't think it is helpful.
// CheckCast iTable fail lookup out-of-line | ||
TR_OutlinedInstructionsGenerator og(iTableLookUpFailLabel, node, cg); | ||
|
||
generateRegInstruction(TR::InstOpCode::PUSHReg, node, instanceClassReg, cg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the checkcast fails we have to throw an exception. The instance class is one of the parameters to the JIT helper that will throw the class cast exception. It is pushed to the stack per the calling convention of that helper.
Jenkins test sanity.functional,extended.functional xlinux,win,osx jdk17 |
PR eclipse-openj9#2361 introduced a number of JIT performance enhancements for checkcasts involving interface classes. Two of the most significant changes were the introduction of a one-slot, dynamic cache to cache the last successful instance class that matched the cast class. It also inlined the itable walk to determine if the interface class implemented the cast class. Each time a successful checkcast was performed the cache would be updated with that result at runtime (it was an LRU cache). This cache was implemented as a single, 8-byte data snippet located in the code cache. The performance of this approach may be acceptable for checkcast sites that do not see more than one instance class, or when there is only a single thread of execution along this path. However, in the presence of multiple threads and multiple instance classes the performance penalty of multiple threads writing to the same data address (the class cache) is quite significant due to hardware cache coherency protocols. In fact, the performance overhead gets significantly worse the more threads that are involved. There is no information provided in eclipse-openj9#2361 for what motivated the change or the workload it was expected to benefit. This PR modifies that implementation as follows: * Consult available profiling information to determine how many instance classes may be seen by this checkcast/instanceof site. If there are more than one then do not use a cache. If there is no information available then do not use a cache as the characteristics are unknown. * If there is a single profiled instance class then pre-populate the cache with that class. Do not update the cache at runtime. * If a cache is not used, then inline the interface table walk in mainline code rather than from outlined instructions. Three environment variables have been introduced to control behaviour of this evaluator: 1) `TR_updateInterfaceCheckCastCacheSlot` : when set the interface cast cache slot will be updated at runtime (i.e., this is the original behaviour) 2) `TR_forceDisableInterfaceCastCache` : never use the cache regardless of profiling information. 3) `TR_forceEnableInterfaceCastCache` : force the use of the cache regardless of profiling information. 2) and 3) are mutually exclusive. Behaviour is undefined if both are set. Using 1) and 3) will restore original behaviour. Signed-off-by: Daryl Maier <maier@ca.ibm.com>
Jenkins test sanity.functional,extended.functional xlinux,win,osx jdk17 |
Jenkins test extended.functional xlinux jdk17 |
@BradleyWood @a7ehuo : Your comments addressed and CI testing passed. Please review again. |
LGTM. I'm not able to open this PR in browsers. I got an error "This page is taking too long to load". I have no issues to open other OpenJ9 PRs. I reviewed the update in my iPad GitHub app. |
I can't access it either with a browser and I opened a GH support ticket: https://support.github.com/ticket/personal/0/1636593 |
LGTM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to look at this via the GH app on my phone, so I've only done a very cursory review.
PR #2361 introduced a number of JIT performance enhancements for checkcasts
involving interface classes. Two of the most significant changes were the
introduction of a one-slot, dynamic cache to cache the last successful instance
class that matched the cast class. It also inlined the itable walk to determine
if the interface class implemented the cast class.
Each time a successful checkcast was performed the cache would be updated with
that result at runtime (it was an LRU cache). This cache was implemented as a
single, 8-byte data snippet located in the code cache. The performance of this
approach may be acceptable for checkcast sites that do not see more than one
instance class, or when there is only a single thread of execution along this
path. However, in the presence of multiple threads and multiple instance classes
the performance penalty of multiple threads writing to the same data address (the
class cache) is quite significant due to hardware cache coherency protocols.
In fact, the performance overhead gets significantly worse the more threads that
are involved.
There is no information provided in #2361 for what motivated the change or the
workload it was expected to benefit.
This PR modifies that implementation as follows:
Consult available profiling information to determine how many instance classes
may be seen by this checkcast/instanceof site. If there are more than one then
do not use a cache. If there is no information available then do not use a
cache as the characteristics are unknown.
If there is a single profiled instance class then pre-populate the cache with
that class. Do not update the cache at runtime.
If a cache is not used, then inline the interface table walk in mainline code
rather than from outlined instructions.
Three environment variables have been introduced to control behaviour of this
evaluator:
TR_updateInterfaceCheckCastCacheSlot
: when set the interface cast cacheslot will be updated at runtime (i.e., this is the original behaviour)
TR_forceDisableInterfaceCastCache
: never use the cache regardless ofprofiling information.
TR_forceEnableInterfaceCastCache
: force the use of the cache regardless ofprofiling information.
and 3) are mutually exclusive. Behaviour is undefined if both are set.
Using 1) and 3) will restore original behaviour.
Signed-off-by: Daryl Maier maier@ca.ibm.com