
Portable SCC: Define Default Processor Features #7966

Open
dsouzai opened this issue Dec 4, 2019 · 46 comments
@dsouzai
Contributor

dsouzai commented Dec 4, 2019

This issue is to discuss whether or not it makes sense to define a set of processor features the compiler should target when generating AOT code. The three general approaches we can take are:

  1. The compiler has a set of processor features defined for an AOT compile.
  2. The compiler targets whatever features are available on the machine it is running on (while turning off certain features such as TM).
  3. The compiler maintains a mapping of CPU version to CPU features; even though certain platforms don't have CPU features tied to any particular CPU version, the compiler can enforce such a mapping to simplify the user experience (rather than the user having to specify all the features they want).

The question of what a compiler like GCC does came up in the discussion. Looking online, my understanding is that by default GCC compiles for the target it itself was compiled to target:

The only difference in behavior between two GCC versions built targeting different sub-architectures is the implicit default argument for the -march parameter, which is derived from the GCC's CHOST when not explicitly provided in the command line. [1]

GCC will only target the CPU it is running on if -march=native is specified [2]


[1] https://wiki.gentoo.org/wiki/GCC_optimization
[2] https://wiki.gentoo.org/wiki/Distcc#-march.3Dnative

@dsouzai
Contributor Author

dsouzai commented Dec 4, 2019

My personal preference is 1. Targeting the processor the SCC is generated on is currently what happens, but that puts the burden on the user to find an old enough machine. I also do not prefer having a mapping between CPU version and some set of CPU features, because it doesn't reflect the reality of which CPU features are available on which CPU version, and it is a non-standard mapping that has to be maintained and documented.

@DanHeidinga
Member

I'm fine starting with a system that specifies all the features to target so we have something working.

From a usability perspective, it won't take long for us (and end users!) to need a way to compactly specify groups of features. If this is a quasi-mapping to processor type as proposed in [3], great. Some other logical way to group features together is fine by me as well.

@andrewcraik
Contributor

I think grouping based on the hardware platform only makes sense when the platform itself defines such a logical grouping. For example, the aarch64 architecture defines groups of features as a package, and optional packages can be included or not in an implementation; grouping our queries to match these architectural groupings makes sense. On a platform like x86, where feature flags determine feature support and support is not necessarily tied to a generation across manufacturers, trying to tie these features to a generation / hardware feature group makes less sense. As a result, option 3 is not the right general solution in my mind.

I'm not sure if 1 or 2 is right, or if we should just define the 'default' AOT configuration and provide documentation and help on how to enable/disable features. Logical groupings based on what the compiler is accelerating or similar might make sense, but artificial hardware groupings not reflected in the underlying hardware platform (eg mapping features to processor levels when the feature is not truly tied to the processor level) seem counterproductive.

I agree the usability story does need some consideration/attention. Given that a major use case is a docker image or similar, are there any container properties we could use to help figure out what the base config should be?

Another option might be something like the Xtrace and Xdump builders to help build an AOT hardware config?

@DanHeidinga
Member

DanHeidinga commented Dec 5, 2019

How do logical groupings differ from the mtune/march settings of GCC?

We have to keep the goal in mind - making it easy for a user to create a portable SCC which includes AOT code that will be broadly applicable. Mapping features to logical groups, whether mtune/march-style or our own creation, gives users a reasonable way to control the baseline systems they want to target.

Users don't understand, and don't want to understand, the right set of flags to enable for the hardware they are deploying on. They will, at most, know they are targeting "Haswell" processors, or "Skylake", or .... when they buy their instances from cloud providers. They just want their AOT code to work in that world, even if it's not the fastest they could get, as they don't control the hardware.

Another option might be something like the Xtrace and Xdump builders to help build an AOT hardware config?

This sounds a lot like having pre-built configurations :)

@dsouzai
Contributor Author

dsouzai commented Dec 7, 2019

After taking a closer look at GCC's -march option, I don't think we should follow that approach. GCC does maintain a mapping between processor version and a set of processor features. However, GCC is widely used, and hence the mapping it maintains can be more or less considered standard. I would rather not define a new mapping that's only applicable to us, and I also don't want to have to depend on GCC's mapping.

From a usability perspective, it won't take long for us (and end users!) to need a way to compactly specify groups of features.

I'm not convinced that's true. This is something we're only trying to define for AOT code. As you said above:

Users don't understand, and don't want to understand, the right set of flags to enable for the hardware they are deploying on.

They just want their AOT code to work in that world, even if it's not the fastest they could get, as they don't control the hardware.

I don't see why anyone would care whether the AOT-generated code is targeting an ivybridge machine even though the JVM is running on, say, a skylake, so long as they get the benefits.

Having a single definition makes it easier to document and makes it consistent no matter where the SCC is generated (portability being the main goal we're after here). JIT code is still going to target the machine the JVM is running on, so the idea here is the same as always: AOT code gets your app started fast, JIT recompilation gets your app's steady state performance fast.

The set of default features I have in mind shouldn't target something as old as, say, a core2duo or P4. We can pick some reasonable set of features that should exist on most machines today, and we can easily add downgrading logic to handle the case where some features don't.

@harryyu1994
Contributor

harryyu1994 commented May 25, 2020

We now have the infrastructure to specify processor types on the fly for each compilation. It's time to decide on the actual set of portable AOT processor defaults for each platform.
@dsouzai @vijaysun-omr @mpirvu @andrewcraik @fjeremic @gita-omr . Could you guys please have some discussion to get this going? Thanks!

Also Marius suggested that we should be able to specify processor via command line options.

@harryyu1994
Contributor

Here's the list of processors:

/* List of all processors that are currently supported by OMR's processor detection */

typedef enum OMRProcessorArchitecture {

	OMR_PROCESSOR_UNDEFINED,
	OMR_PROCESSOR_FIRST,

	// 390 Processors
	OMR_PROCESSOR_S390_FIRST = OMR_PROCESSOR_FIRST,
	OMR_PROCESSOR_S390_UNKNOWN = OMR_PROCESSOR_S390_FIRST,
	OMR_PROCESSOR_S390_GP6,
	OMR_PROCESSOR_S390_Z10 = OMR_PROCESSOR_S390_GP6,
	OMR_PROCESSOR_S390_GP7,
	OMR_PROCESSOR_S390_GP8,
	OMR_PROCESSOR_S390_GP9,
	OMR_PROCESSOR_S390_Z196 = OMR_PROCESSOR_S390_GP9,
	OMR_PROCESSOR_S390_GP10,
	OMR_PROCESSOR_S390_ZEC12 = OMR_PROCESSOR_S390_GP10,
	OMR_PROCESSOR_S390_GP11,
	OMR_PROCESSOR_S390_Z13 = OMR_PROCESSOR_S390_GP11,
	OMR_PROCESSOR_S390_GP12,
	OMR_PROCESSOR_S390_Z14 = OMR_PROCESSOR_S390_GP12,
	OMR_PROCESSOR_S390_GP13,
	OMR_PROCESSOR_S390_Z15 = OMR_PROCESSOR_S390_GP13,
	OMR_PROCESSOR_S390_GP14,
	OMR_PROCESSOR_S390_ZNEXT = OMR_PROCESSOR_S390_GP14,
	OMR_PROCESSOR_S390_LAST = OMR_PROCESSOR_S390_GP14,

	// ARM Processors
	OMR_PROCESSOR_ARM_FIRST,
	OMR_PROCESSOR_ARM_UNKNOWN = OMR_PROCESSOR_ARM_FIRST,
	OMR_PROCESSOR_ARM_V6,
	OMR_PROCESSOR_ARM_V7,
	OMR_PROCESSOR_ARM_LAST = OMR_PROCESSOR_ARM_V7,

	// ARM64 / AARCH64 Processors
	OMR_PROCESSOR_ARM64_FISRT,
	OMR_PROCESSOR_ARM64_UNKNOWN = OMR_PROCESSOR_ARM64_FISRT,
	OMR_PROCESSOR_ARM64_V8_A,
	OMR_PROCESSOR_ARM64_LAST = OMR_PROCESSOR_ARM64_V8_A,

	// PPC Processors
	OMR_PROCESSOR_PPC_FIRST,
	OMR_PROCESSOR_PPC_UNKNOWN = OMR_PROCESSOR_PPC_FIRST,
	OMR_PROCESSOR_PPC_RIOS1,
	OMR_PROCESSOR_PPC_PWR403,
	OMR_PROCESSOR_PPC_PWR405,
	OMR_PROCESSOR_PPC_PWR440,
	OMR_PROCESSOR_PPC_PWR601,
	OMR_PROCESSOR_PPC_PWR602,
	OMR_PROCESSOR_PPC_PWR603,
	OMR_PROCESSOR_PPC_82XX,
	OMR_PROCESSOR_PPC_7XX,
	OMR_PROCESSOR_PPC_PWR604,
	// The following processors support SQRT in hardware
	OMR_PROCESSOR_PPC_HW_SQRT_FIRST,
	OMR_PROCESSOR_PPC_RIOS2 = OMR_PROCESSOR_PPC_HW_SQRT_FIRST,
	OMR_PROCESSOR_PPC_PWR2S,
	// The following processors are 64-bit implementations
	OMR_PROCESSOR_PPC_64BIT_FIRST,
	OMR_PROCESSOR_PPC_PWR620 = OMR_PROCESSOR_PPC_64BIT_FIRST,
	OMR_PROCESSOR_PPC_PWR630,
	OMR_PROCESSOR_PPC_NSTAR,
	OMR_PROCESSOR_PPC_PULSAR,
	// The following processors support the PowerPC AS architecture
	// PPC AS includes the new branch hint 'a' and 't' bits
	OMR_PROCESSOR_PPC_AS_FIRST,
	OMR_PROCESSOR_PPC_GP = OMR_PROCESSOR_PPC_AS_FIRST,
	OMR_PROCESSOR_PPC_GR,
	// The following processors support VMX
	OMR_PROCESSOR_PPC_VMX_FIRST,
	OMR_PROCESSOR_PPC_GPUL = OMR_PROCESSOR_PPC_VMX_FIRST,
	OMR_PROCESSOR_PPC_HW_ROUND_FIRST,
	OMR_PROCESSOR_PPC_HW_COPY_SIGN_FIRST = OMR_PROCESSOR_PPC_HW_ROUND_FIRST,
	OMR_PROCESSOR_PPC_P6 = OMR_PROCESSOR_PPC_HW_COPY_SIGN_FIRST,
	OMR_PROCESOSR_PPC_ATLAS,
	OMR_PROCESSOR_PPC_BALANCED,
	OMR_PROCESSOR_PPC_CELLPX,
	// The following processors support VSX
	OMR_PROCESSOR_PPC_VSX_FIRST,
	OMR_PROCESSOR_PPC_P7 = OMR_PROCESSOR_PPC_VSX_FIRST,
	OMR_PROCESSOR_PPC_P8,
	OMR_PROCESSOR_PPC_P9,
	OMR_PROCESSOR_PPC_LAST = OMR_PROCESSOR_PPC_P9,

	// X86 Processors
	OMR_PROCESSOR_X86_FIRST,
	OMR_PROCESSOR_X86_UNKNOWN = OMR_PROCESSOR_X86_FIRST,
	OMR_PROCESSOR_X86_INTEL_FIRST,
	OMR_PROCESSOR_X86_INTELPENTIUM = OMR_PROCESSOR_X86_INTEL_FIRST,
	OMR_PROCESSOR_X86_INTELP6,
	OMR_PROCESSOR_X86_INTELPENTIUM4,
	OMR_PROCESSOR_X86_INTELCORE2,
	OMR_PROCESSOR_X86_INTELTULSA,
	OMR_PROCESSOR_X86_INTELNEHALEM,
	OMR_PROCESSOR_X86_INTELWESTMERE,
	OMR_PROCESSOR_X86_INTELSANDYBRIDGE,
	OMR_PROCESSOR_X86_INTELIVYBRIDGE,
	OMR_PROCESSOR_X86_INTELHASWELL,
	OMR_PROCESSOR_X86_INTELBROADWELL,
	OMR_PROCESSOR_X86_INTELSKYLAKE,
	OMR_PROCESSOR_X86_INTEL_LAST = OMR_PROCESSOR_X86_INTELSKYLAKE,
	OMR_PROCESSOR_X86_AMD_FIRST,
	OMR_PROCESSOR_X86_AMDK5 = OMR_PROCESSOR_X86_AMD_FIRST,
	OMR_PROCESSOR_X86_AMDK6,
	OMR_PROCESSOR_X86_AMDATHLONDURON,
	OMR_PROCESSOR_X86_AMDOPTERON,
	OMR_PROCESSOR_X86_AMDFAMILY15H,
	OMR_PROCESSOR_X86_AMD_LAST = OMR_PROCESSOR_X86_AMDFAMILY15H,
	OMR_PROCESSOR_X86_LAST = OMR_PROCESSOR_X86_AMDFAMILY15H,

	OMR_PROCESOR_RISCV32_UNKNOWN,
	OMR_PROCESOR_RISCV64_UNKNOWN,

	OMR_PROCESSOR_DUMMY = 0x40000000 /* force wide enums */

} OMRProcessorArchitecture;

Refer to omr/include_core/omrport.h for the feature flags.

@mpirvu
Contributor

mpirvu commented May 25, 2020

I am thinking that being able to select the features for AOT through command line options is still important. In some instances, the IT people may know that the JVM is not going to run on machines older than X (pick your architecture) and may want to target that architecture as the baseline.
Therefore I am in favor of supporting the logical grouping @DanHeidinga mentioned.
To get this going off the ground we could have a single grouping to start with and gradually add new groupings targeting newer architectures.

@vijaysun-omr
Contributor

@harryyu1994 just so I'm clear on what you are expecting when you said "It's time to decide on the actual set of portable AOT processor defaults for each platform...", did you mean we should pick the default processor for each platform from the lists that you pasted in your last comment?

One approach could be to pick some processor that is reasonably old, such that a large proportion of users can reasonably be expected to have something newer, and then force the codegen to assume that processor type and see how much of a regression you get from handicapping the codegen in this way, before deciding if we should go ahead or not. Is this the approach you were also thinking of, and if so, were you essentially looking for someone familiar with the different codegens to make a processor suggestion for their platform?

@harryyu1994
Contributor

@harryyu1994 just so I'm clear on what you are expecting when you said "It's time to decide on the actual set of portable AOT processor defaults for each platform...", did you mean we should pick the default processor for each platform from the lists that you pasted in your last comment?

Yes, we should pick a default processor for each platform from the list I pasted, as well as default features (for x86).

One approach could be to pick some processor that is reasonably old, such that a large proportion of users can reasonably be expected to have something newer, and then force the codegen to assume that processor type and see how much of a regression you get from handicapping the codegen in this way, before deciding if we should go ahead or not. Is this the approach you were also thinking of, and if so, were you essentially looking for someone familiar with the different codegens to make a processor suggestion for their platform?

Yes, I'm looking for processor suggestions from people.

@mpirvu
Contributor

mpirvu commented May 26, 2020

For x86 I am proposing OMR_PROCESSOR_X86_INTELSANDYBRIDGE to be the baseline for relocatable code. It's a 9-year-old architecture that has AVX and AES instructions. If at all possible I would like this baseline to work on both Intel and AMD processors, as we start to see more and more AMD EPYC instances in the cloud.

@vijaysun-omr
Contributor

Sounds reasonable to me, though I guess the true test will be a performance run to see how much we lose by assuming this older level of architecture on a current machine, e.g. Skylake.

@harryyu1994
Contributor

harryyu1994 commented May 26, 2020

For Z and Power, just a single processor type would be sufficient as the feature flags are set based on the processor type.
For x86, we need the set of feature flags as well. (The processor type may not matter that much)

We need to come up with a mapping of processor type to feature flags for x86.

Note to self: I need to watch out for the few instances where processor type does matter on x86, and also need to look into whether it's possible for the baseline to work on both Intel and AMD.

@mpirvu
Contributor

mpirvu commented May 26, 2020

These are the flags listed for my machine which uses ivybridge CPUs:

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

I am reading that ivybridge added rdrand and F16C instructions on top of sandybridge, so we should exclude those from the list above.
We should care only about the set of the flags that the optimizer is trying to exploit though.

@harryyu1994
Contributor

harryyu1994 commented May 26, 2020

We should care only about the set of the flags that the optimizer is trying to exploit though.

In my upcoming changes:

// Only enable the features that the compiler currently uses
uint32_t enabledFeatures[] = {OMR_FEATURE_X86_FPU, OMR_FEATURE_X86_CX8, OMR_FEATURE_X86_CMOV,
                              OMR_FEATURE_X86_MMX, OMR_FEATURE_X86_SSE, OMR_FEATURE_X86_SSE2,
                              OMR_FEATURE_X86_SSSE3, OMR_FEATURE_X86_SSE4_1, OMR_FEATURE_X86_POPCNT,
                              OMR_FEATURE_X86_AESNI, OMR_FEATURE_X86_OSXSAVE, OMR_FEATURE_X86_AVX,
                              OMR_FEATURE_X86_FMA, OMR_FEATURE_X86_HLE, OMR_FEATURE_X86_RTM};

We maintain this array, which contains all the features that the optimizer tries to exploit, and we mask out all the features that we don't care about.

@harryyu1994
Contributor

Had some offline discussion with Marius and here's some notes

  1. Check the user's command line options for the default processor they want
  2. If AOT is already available in the SCC, we want to use the processor in the SCC instead of what the user specifies. Output a warning message to inform the user that their processor wasn't used
  3. We can potentially produce AOT code targeting a newer processor than the host. (In this case, we can produce the code but we can't run it on the host.) The question here is whether we should disable this or allow it.

@mpirvu
Contributor

mpirvu commented May 26, 2020

These are the features present in enabledFeatures[] that a sandybridge architecture does not have:

OMR_FEATURE_X86_OSXSAVE --> OS has enabled XSETBV/XGETBV instructions to access XCR0
OMR_FEATURE_X86_FMA --> FMA extensions using YMM state
OMR_FEATURE_X86_HLE  -> Hardware lock elision
OMR_FEATURE_X86_RTM -> Restricted transactional memory

@vijaysun-omr
Contributor

vijaysun-omr commented May 26, 2020

@harryyu1994 for your 3rd "note" did you mean that

a) the default processor level is newer than the host, i.e. it is some really old machine
OR
b) the user can specify an option to produce AOT code for a newer processor than the host

And does "disabling" mean silently not generating AOT code in that scenario OR something like reporting a usage error of some sort ?

@DanHeidinga
Member

2. If AOT is already available in the SCC, we want to use the processor in the SCC instead of what the user specifies. Output a warning message to inform the user that their processor wasn't used

I'm reading this to mean there is 1 processor defined for the SCC. Does it make sense to allow different layers of a multi-layer SCC to define a different, more restrictive (ie: newer), processor level?

We should agree on whether this is a desirable principle rather than worry about the details now.

@harryyu1994
Contributor

harryyu1994 commented May 26, 2020

@harryyu1994 for your 3rd "note" did you mean that

a) the default processor level is newer than the host, i.e. it is some really old machine
OR
b) the user can specify an option to produce AOT code for a newer processor than the host

And does "disabling" mean silently not generating AOT code in that scenario OR something like reporting a usage error of some sort ?

@vijaysun-omr I meant b). I was thinking about reporting a usage error to the user. Would we ever have a use case where the user only wants to generate AOT code for a certain processor level? So basically preparing the SCC for others.

I'm reading this to mean there is 1 processor defined for the SCC. Does it make sense to allow different layers of a multi-layer SCC to define a different, more restrictive (ie: newer), processor level?

We should agree on whether this is a desirable principle rather than worry about the details now.

@DanHeidinga

My understanding of the multi-layer cache is that it's for storing SCCs in docker images. Basically each layer of the docker image will want to have its own SCC. So for multi-layers I was thinking something like this:
The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it. Or maybe everything in docker should use the lowest possible processor settings to maximize portability.

What I meant in the original notes was for a different scenario: (so basically only considering the current outermost layer)
We already have a SCC that the JVM can run with. Then if the user wants to change it to use a different processor, maybe we want to reject that operation? Or maybe we want to treat it as an overwrite operation where we ditch the original AOT code in the SCC and generate new code. (Not sure if this is feasible; the user could just delete the SCC and generate a new one.)

@DanHeidinga
Member

We already have a SCC that the JVM can run with. Then if the user wants to change it to use to a different processor, maybe we want to reject that operation? Or maybe we want to treat it as an overwrite operation where we ditch the original AOT code in the SCC and generate new code. (Not sure if this is feasible, user could just delete the SCC and generate a new one)

Another approach would be to associate the AOT code with the Processor it requires. This would allow mixing AOT with different Processor requirements in the same cache. Not terribly useful when running on a host system but possibly more useful when a cache is being shipped around in Docker or other ways

@vijaysun-omr
Contributor

vijaysun-omr commented May 27, 2020

In relation to comment : #7966 (comment)

I feel it is okay to report a usage error in the case when a user can specify an option to produce AOT code for a newer processor than the host. If this functionality is deemed important in the future, it can be added at that time but I don't see the need to do this work now.

@vijaysun-omr
Contributor

vijaysun-omr commented May 27, 2020

I do see merit in @harryyu1994 comment "The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it. ", i.e. philosophically this may be something to allow.

In practical/implementation terms, I wonder if this is a use case that we support down the line rather than get bogged down with at present.

@DanHeidinga
Member

I do see merit in @harryyu1994 comment "The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it. ", i.e. philosophically this may be something to allow.

In practical/implementation terms, I wonder if this is a use case that we support down the line rather than get bogged down with at present.

How usable is this feature without this?

My mental model is that the docker image may be built up by different parts of the CI at different times on different machines (lots of variability in that process)!

A user may pull an Adopt created docker image with a default cache in it and then in their own CI build a common framework layer with a new cache layer. Finally, each app may reuse that image and add their own classes into the cache.

If all three of those operations happen on different machines, we need to either "pin" the processor level to the one created in the JDK base image (ie: the Adopt layer) or allow each layer to add further restrictions.

A user doesn't want to have a bigger docker image due to AOT code they can't execute.

Oh, and we'll need some API, maybe in the SCC print-stats option to tell the current processor level of the cache so later layers can specify exactly the same one.

harryyu1994 added a commit to harryyu1994/openj9 that referenced this issue Jun 23, 2020
TR::Compiler->target contains the host environment while TR::Compiler->relocatableTarget contains
the target environment that AOT compilations will use. TR::Compiler->relocatableTarget is set according
to the presence of SCC, portable options and Docker containers. Then at the beginning of every compilation
(in the constructor) we set the target to be TR::Compiler->target if JIT compilation and
TR::Compiler->relocatableTarget if AOT compilation.
The following scenarios were considered:

When there is no existing SCC and no portable option specified
(-Xshareclasses:portable, -XX:+PortableShareClasses), we use the host processor for AOT
When there is no existing SCC and has portable option specified, we use the hand-picked
portable processor for AOT
When there is no existing SCC and in container, we use the hand-picked portable processor
for AOT
When there is existing SCC and it passes processor compatibility check, we use the processor
stored in the SCC (ignoring portable options and whether we are in container or not)
JITServer is not affected. Meaning JITServer will still use JITClient's processor information for
all of its compilations regardless of the portable/inContainer/SCC options
depends on: eclipse-omr/omr#4861
Issue: eclipse-openj9#7966

Signed-off-by: Harry Yu <harryyu1994@gmail.com>
harryyu1994 added a commit to harryyu1994/openj9 that referenced this issue Jul 14, 2020
mpirvu pushed a commit that referenced this issue Jul 16, 2020
harryyu1994 added a commit to harryyu1994/openj9 that referenced this issue Jul 18, 2020
- Added "-XX:+PortableSharedCache" and "-XX:-PortableSharedCache"
- Added container detection. By default we enable portableAOT in containers.
- "-XX:-PortableSharedCache" will disable portableAOT in containers as well.
- The set of cpu features employed by the SCC can be displayed with "-Xshareclasses:printStats"

Issue: eclipse-openj9#7966

Signed-off-by: Harry Yu <harryyu1994@gmail.com>
harryyu1994 added a commit to harryyu1994/openj9 that referenced this issue Jul 24, 2020
harryyu1994 added a commit to harryyu1994/openj9 that referenced this issue Jul 29, 2020
harryyu1994 added a commit to harryyu1994/openj9 that referenced this issue Jul 29, 2020
@harryyu1994
Contributor

@zl-wang @gita-omr @mpirvu @vijaysun-omr I'm looking at enabling Portable AOT on Power. I have a few questions on how this may work on Power as its processor detection and compatibility check logic is a little bit different from x86 and Z. First I'll provide some background information:

Background

How processor detection works now:

  • always go through the OMR port library to find the cpu information
  • after initialization, all cpu-related information is packed into the OMR::CPU::_processorDescription struct
  • OMR::CPU::_processorDescription contains two pieces of information: (1) the type of the processor and (2) a set of processor feature flags

How Portable AOT works on x86:

  • on x86, AOT compatibility is determined solely by processor feature flags
  • as long as the AOT code was compiled with a subset of the processor feature flags that the host processor supports, we determine it's compatible
  • so portable AOT really comes down to picking a set of processor features

How Portable AOT works on Z:

  • on Z, we have existing code that helps us disable features when we downgrade to an older Z processor
  • here's an example on how that works:
    void
    J9::Z::CPU::applyUserOptions()
     {
     if (_processorDescription.processor < OMR_PROCESSOR_S390_Z14)
        {
        omrsysinfo_processor_set_feature(&_processorDescription, OMR_FEATURE_S390_MISCELLANEOUS_INSTRUCTION_EXTENSION_2, FALSE);
        omrsysinfo_processor_set_feature(&_processorDescription, OMR_FEATURE_S390_VECTOR_PACKED_DECIMAL, FALSE);
        omrsysinfo_processor_set_feature(&_processorDescription, OMR_FEATURE_S390_VECTOR_FACILITY_ENHANCEMENT_1, FALSE);
        omrsysinfo_processor_set_feature(&_processorDescription, OMR_FEATURE_S390_GUARDED_STORAGE, FALSE);
        }
     ...
    • If we have downgraded the processor to below Z14, then we disable OMR_FEATURE_S390_MISCELLANEOUS_INSTRUCTION_EXTENSION_2, OMR_FEATURE_S390_VECTOR_PACKED_DECIMAL, OMR_FEATURE_S390_VECTOR_FACILITY_ENHANCEMENT_1, and OMR_FEATURE_S390_GUARDED_STORAGE, which may have been set by the host CPU
  • therefore, for Portable AOT, all we need to do on Z is define a portable processor level (such as z10) and then apply this logic to disable the features that z10 doesn't have.
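
A runnable sketch of that downgrade pattern follows; the struct, threshold, and feature ids are simplified placeholders standing in for OMRProcessorDesc, OMR_PROCESSOR_S390_Z14, and the OMR_FEATURE_S390_* constants:

```cpp
#include <cassert>
#include <cstdint>

const int FEATURES_SIZE = 2; // stand-in for OMRPORT_SYSINFO_FEATURES_SIZE

// Simplified stand-in for OMRProcessorDesc.
struct ProcessorDesc
   {
   int processor;                    // ordered generation id, e.g. z10 < z14
   uint32_t features[FEATURES_SIZE]; // one bit per feature
   };

// Mirrors omrsysinfo_processor_set_feature: set or clear one feature bit.
void setFeature(ProcessorDesc &desc, int feature, bool enable)
   {
   int word = feature / 32;
   int bit  = feature % 32;
   if (enable)
      desc.features[word] |= (1u << bit);
   else
      desc.features[word] &= ~(1u << bit);
   }

// Placeholder generation and feature ids; OpenJ9's real constants differ.
const int PROCESSOR_Z14 = 14;
const int FEATURE_VECTOR_PACKED_DECIMAL = 3;
const int FEATURE_GUARDED_STORAGE = 35;

// If the (possibly user-lowered) processor type is older than z14,
// drop the features that generation introduced.
void applyUserOptions(ProcessorDesc &desc)
   {
   if (desc.processor < PROCESSOR_Z14)
      {
      setFeature(desc, FEATURE_VECTOR_PACKED_DECIMAL, false);
      setFeature(desc, FEATURE_GUARDED_STORAGE, false);
      }
   }
```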

Questions

  • Do you think we should take the x86 approach, where we manually define the portable processor feature set, or the Z approach, where logic disables features when we downgrade to an older CPU?
    • In my opinion we should follow the Z approach: unlike x86, Power has debug options (just like Z) that allow us to downgrade the processor. On Z we downgrade the processor and then disable features accordingly, but on Power we only downgrade the processor. What's the reason for not disabling features after downgrading the processor?
  • What makes things more complicated on Power is that currently we seem to look only at the processor type for the compatibility check. Should we be looking at the processor feature flags instead?
    • Here's Power's cpu compatibility check logic:
    bool
    J9::Power::CPU::isCompatible(const OMRProcessorDesc& processorDescription)
     {
     OMRProcessorArchitecture targetProcessor = self()->getProcessorDescription().processor;
     OMRProcessorArchitecture processor = processorDescription.processor;
     // Backwards compatibility only applies to p4,p5,p6,p7 and onwards
     // Looks for equality otherwise
     if ((processor == OMR_PROCESSOR_PPC_GP
         || processor == OMR_PROCESSOR_PPC_GR 
         || processor == OMR_PROCESSOR_PPC_P6 
         || (processor >= OMR_PROCESSOR_PPC_P7 && processor <= OMR_PROCESSOR_PPC_LAST))
         && (targetProcessor == OMR_PROCESSOR_PPC_GP 
          || targetProcessor == OMR_PROCESSOR_PPC_GR 
          || targetProcessor == OMR_PROCESSOR_PPC_P6 
          || (targetProcessor >= OMR_PROCESSOR_PPC_P7 && targetProcessor <= OMR_PROCESSOR_PPC_LAST)))
        {
        return targetProcessor >= processor;
        }
     return targetProcessor == processor;
     }
    • and here's x86's compatibility check logic
    bool
    J9::X86::CPU::isCompatible(const OMRProcessorDesc& processorDescription)
     {
     for (int i = 0; i < OMRPORT_SYSINFO_FEATURES_SIZE; i++)
        {
        // Check to see if the current processor contains all the features that code cache's processor has
        if ((processorDescription.features[i] & self()->getProcessorDescription().features[i]) != processorDescription.features[i])
           return false;
        }
     return true;
     }
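
The x86 check above is just a per-word bitmask subset test. A self-contained equivalent (fixed-size arrays in place of OpenJ9's types) makes the semantics easy to verify:

```cpp
#include <cassert>
#include <cstdint>

const int FEATURES_SIZE = 4; // stand-in for OMRPORT_SYSINFO_FEATURES_SIZE

// True iff every feature bit the AOT code was compiled against is also
// present in the host's feature set.
bool isSubset(const uint32_t (&aotFeatures)[FEATURES_SIZE],
              const uint32_t (&hostFeatures)[FEATURES_SIZE])
   {
   for (int i = 0; i < FEATURES_SIZE; i++)
      {
      if ((aotFeatures[i] & hostFeatures[i]) != aotFeatures[i])
         return false;
      }
   return true;
   }
```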

@zl-wang
Contributor

zl-wang commented Sep 15, 2020

Do you think we should take the x86 approach, where we manually define the portable processor feature set, or the Z approach, where logic disables features when we downgrade to an older CPU?

In my opinion we should follow the Z approach: unlike x86, Power has debug options (just like Z) that allow us to downgrade the processor. On Z we downgrade the processor and then disable features accordingly, but on Power we only downgrade the processor. What's the reason for not disabling features after downgrading the processor?

What makes things more complicated on Power is that currently we seem to look only at the processor type for the compatibility check. Should we be looking at the processor feature flags instead?

On POWER, it should be similar to the Z approach: lower processor type plus enabled features. For example, Transactional Memory: it doesn't depend strictly on the hardware; it also depends on whether the OS enables it. You cannot determine its availability solely from the processor type.

The general principle is that later CPU generations are compatible with earlier generations, ISA-wise, with very few exceptions between far-apart generations (e.g., for deprecated instructions).

@harryyu1994
Contributor

On POWER, it should be similar to the Z approach: lower processor type plus enabled features. For example, Transactional Memory: it doesn't depend strictly on the hardware; it also depends on whether the OS enables it. You cannot determine its availability solely from the processor type.

The general principle is that later CPU generations are compatible with earlier generations, ISA-wise, with very few exceptions between far-apart generations (e.g., for deprecated instructions).

Okay, so processor features are determined by both hardware and OS; this makes sense. Another question: is it true for Power that if the host environment (hardware + OS) contains all the processor features the build environment has, then we can run the AOT code (from the build environment) on the host environment?

@harryyu1994
Contributor

harryyu1994 commented Sep 15, 2020

The processor feature set contained in processorDescription should be what's actually available, not what could be available based on the processor type. We should take the OS into account when we initialize the processor feature set. After that, we can safely compare processor feature sets, similar to how we are doing it for x86. I'm hoping this works for Power.
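
That initialization step can be sketched as intersecting the hardware's feature bits with an OS-enabled mask, so a feature like TM that the kernel has disabled never appears in the description. The masks and array size here are illustrative, not what the port library actually produces:

```cpp
#include <cassert>
#include <cstdint>

const int FEATURES_SIZE = 2; // stand-in for OMRPORT_SYSINFO_FEATURES_SIZE

// Record only features that are both present in hardware and enabled by
// the OS (e.g. Transactional Memory can be disabled by the kernel on Power).
void initEffectiveFeatures(uint32_t (&effective)[FEATURES_SIZE],
                           const uint32_t (&hardware)[FEATURES_SIZE],
                           const uint32_t (&osEnabled)[FEATURES_SIZE])
   {
   for (int i = 0; i < FEATURES_SIZE; i++)
      effective[i] = hardware[i] & osEnabled[i];
   }
```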

@zl-wang
Contributor

zl-wang commented Sep 16, 2020

Okay so processor features is determined by both hardware and os, this makes sense. Another question is that is it true for Power that if the host environment (hardware + os) contains all the processor features that the build environment has then we can run the AOT code(from the build environment) on the host environment?

Yes, that is expected.
