Portable SCC: Define Default Processor Features #7966
My personal preference is 1. Targeting the processor the SCC is generated on is currently what happens, but that puts the burden on the user to find an old enough machine. I also do not prefer having a mapping between CPU version and some set of CPU features, because it doesn't reflect the reality of which CPU features are available on which CPU version, and is a non-standard mapping that has to be maintained and documented.
I'm fine starting with a system that specifies all the features to target so we have something working. From a usability perspective, it won't take long for us (and end users!) to need a way to compactly specify groups of features. If this is a quasi-mapping to processor type as proposed in [3], great. Some other logical way to group features together is fine by me as well.
I think grouping based on the hardware platform only makes sense when the platform itself defines such a logical grouping. For example, the aarch64 architecture defines groups of features as a package, and optional packages can be included or not in an implementation - grouping our queries to match these architectural groupings makes sense. On a platform like x86, where feature flags are used to determine feature support, and feature support is not necessarily tied to a generation across manufacturers, trying to tie these features to a generation / hardware feature group makes less sense. As a result, option 3 is not the right general solution in my mind. I'm not sure if 1 or 2 is right, or if we should just define the 'default' AOT configuration and provide documentation and help on how to enable/disable features. Logical groupings based on what the compiler is accelerating or similar might make sense, but artificial hardware groupings not reflected in the underlying hardware platform (e.g. mapping features to processor levels when the feature is not truly tied to the processor level) seem counterproductive.

I agree the usability story does need some consideration/attention. Given that a major use case is a docker image or similar, are there any container properties we could use to help figure out what the base config should be? Another option might be something like the Xtrace and Xdump builders to help build an AOT hardware config?
How do logical groupings differ from the mtune/march settings of GCC? We have to keep the goal in mind - making it easy for a user to create a portable SCC that includes AOT code that will be broadly applicable. Mapping features to logical groups, whether mtune/march-style or our own creation, gives users a reasonable way to control the baseline systems they want to target. Users don't understand, and don't want to understand, the right set of flags to enable for the hardware they are deploying on. They will, at most, know they are targeting "Haswell" processors, or "Skylake", or ... when they buy their instances from cloud providers. They just want their AOT code to work in that world, even if it's not the fastest they could get, since they don't control the hardware.
This sounds a lot like having pre-built configurations :)
After taking a closer look at gcc's
I'm not convinced that's true. This is something we're only trying to define for AOT code. As you said above:
I don't see why anyone would care whether the AOT-generated code is targeting an ivybridge machine even though the JVM is running on, say, a skylake, so long as they get the benefits. Having a single definition makes it easier to document and makes it consistent no matter where the SCC is generated (portability being the main goal we're after here). JIT code is still going to target the machine the JVM is running on, so the idea here is the same as always: AOT code gets your app started fast, JIT recompilation gets your app's steady-state performance fast. The set of default features I have in mind shouldn't target something as old as, say, a core2duo or P4. We can pick some reasonable set of features that should exist on most machines today, and we can easily add downgrading logic to take care of what happens when some of those features aren't present.
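As a rough sketch of that downgrading logic (assuming the OMRProcessorDesc layout of 32-bit feature words and its OMRPORT_SYSINFO_FEATURES_SIZE constant; portableDesc and hostDesc are hypothetical names), the portable defaults could simply be intersected with what the generating host actually has:

/* Sketch: drop any default feature the generating host doesn't support,
 * so AOT code never assumes hardware the SCC creator lacks. */
for (size_t i = 0; i < OMRPORT_SYSINFO_FEATURES_SIZE; i++)
   portableDesc.features[i] &= hostDesc.features[i];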
We now have the infrastructure to specify processor types on the fly for each compilation. It's time to decide on the actual set of portable AOT processor defaults for each platform. Also, Marius suggested that we should be able to specify the processor via command-line options.
Here's the list of processors:

/* List of all processors that are currently supported by OMR's processor detection */
typedef enum OMRProcessorArchitecture {
OMR_PROCESSOR_UNDEFINED,
OMR_PROCESSOR_FIRST,
// 390 Processors
OMR_PROCESSOR_S390_FIRST = OMR_PROCESSOR_FIRST,
OMR_PROCESSOR_S390_UNKNOWN = OMR_PROCESSOR_S390_FIRST,
OMR_PROCESSOR_S390_GP6,
OMR_PROCESSOR_S390_Z10 = OMR_PROCESSOR_S390_GP6,
OMR_PROCESSOR_S390_GP7,
OMR_PROCESSOR_S390_GP8,
OMR_PROCESSOR_S390_GP9,
OMR_PROCESSOR_S390_Z196 = OMR_PROCESSOR_S390_GP9,
OMR_PROCESSOR_S390_GP10,
OMR_PROCESSOR_S390_ZEC12 = OMR_PROCESSOR_S390_GP10,
OMR_PROCESSOR_S390_GP11,
OMR_PROCESSOR_S390_Z13 = OMR_PROCESSOR_S390_GP11,
OMR_PROCESSOR_S390_GP12,
OMR_PROCESSOR_S390_Z14 = OMR_PROCESSOR_S390_GP12,
OMR_PROCESSOR_S390_GP13,
OMR_PROCESSOR_S390_Z15 = OMR_PROCESSOR_S390_GP13,
OMR_PROCESSOR_S390_GP14,
OMR_PROCESSOR_S390_ZNEXT = OMR_PROCESSOR_S390_GP14,
OMR_PROCESSOR_S390_LAST = OMR_PROCESSOR_S390_GP14,
// ARM Processors
OMR_PROCESSOR_ARM_FIRST,
OMR_PROCESSOR_ARM_UNKNOWN = OMR_PROCESSOR_ARM_FIRST,
OMR_PROCESSOR_ARM_V6,
OMR_PROCESSOR_ARM_V7,
OMR_PROCESSOR_ARM_LAST = OMR_PROCESSOR_ARM_V7,
// ARM64 / AARCH64 Processors
OMR_PROCESSOR_ARM64_FISRT,
OMR_PROCESSOR_ARM64_UNKNOWN = OMR_PROCESSOR_ARM64_FISRT,
OMR_PROCESSOR_ARM64_V8_A,
OMR_PROCESSOR_ARM64_LAST = OMR_PROCESSOR_ARM64_V8_A,
// PPC Processors
OMR_PROCESSOR_PPC_FIRST,
OMR_PROCESSOR_PPC_UNKNOWN = OMR_PROCESSOR_PPC_FIRST,
OMR_PROCESSOR_PPC_RIOS1,
OMR_PROCESSOR_PPC_PWR403,
OMR_PROCESSOR_PPC_PWR405,
OMR_PROCESSOR_PPC_PWR440,
OMR_PROCESSOR_PPC_PWR601,
OMR_PROCESSOR_PPC_PWR602,
OMR_PROCESSOR_PPC_PWR603,
OMR_PROCESSOR_PPC_82XX,
OMR_PROCESSOR_PPC_7XX,
OMR_PROCESSOR_PPC_PWR604,
// The following processors support SQRT in hardware
OMR_PROCESSOR_PPC_HW_SQRT_FIRST,
OMR_PROCESSOR_PPC_RIOS2 = OMR_PROCESSOR_PPC_HW_SQRT_FIRST,
OMR_PROCESSOR_PPC_PWR2S,
// The following processors are 64-bit implementations
OMR_PROCESSOR_PPC_64BIT_FIRST,
OMR_PROCESSOR_PPC_PWR620 = OMR_PROCESSOR_PPC_64BIT_FIRST,
OMR_PROCESSOR_PPC_PWR630,
OMR_PROCESSOR_PPC_NSTAR,
OMR_PROCESSOR_PPC_PULSAR,
// The following processors support the PowerPC AS architecture
// PPC AS includes the new branch hint 'a' and 't' bits
OMR_PROCESSOR_PPC_AS_FIRST,
OMR_PROCESSOR_PPC_GP = OMR_PROCESSOR_PPC_AS_FIRST,
OMR_PROCESSOR_PPC_GR,
// The following processors support VMX
OMR_PROCESSOR_PPC_VMX_FIRST,
OMR_PROCESSOR_PPC_GPUL = OMR_PROCESSOR_PPC_VMX_FIRST,
OMR_PROCESSOR_PPC_HW_ROUND_FIRST,
OMR_PROCESSOR_PPC_HW_COPY_SIGN_FIRST = OMR_PROCESSOR_PPC_HW_ROUND_FIRST,
OMR_PROCESSOR_PPC_P6 = OMR_PROCESSOR_PPC_HW_COPY_SIGN_FIRST,
OMR_PROCESOSR_PPC_ATLAS,
OMR_PROCESSOR_PPC_BALANCED,
OMR_PROCESSOR_PPC_CELLPX,
// The following processors support VSX
OMR_PROCESSOR_PPC_VSX_FIRST,
OMR_PROCESSOR_PPC_P7 = OMR_PROCESSOR_PPC_VSX_FIRST,
OMR_PROCESSOR_PPC_P8,
OMR_PROCESSOR_PPC_P9,
OMR_PROCESSOR_PPC_LAST = OMR_PROCESSOR_PPC_P9,
// X86 Processors
OMR_PROCESSOR_X86_FIRST,
OMR_PROCESSOR_X86_UNKNOWN = OMR_PROCESSOR_X86_FIRST,
OMR_PROCESSOR_X86_INTEL_FIRST,
OMR_PROCESSOR_X86_INTELPENTIUM = OMR_PROCESSOR_X86_INTEL_FIRST,
OMR_PROCESSOR_X86_INTELP6,
OMR_PROCESSOR_X86_INTELPENTIUM4,
OMR_PROCESSOR_X86_INTELCORE2,
OMR_PROCESSOR_X86_INTELTULSA,
OMR_PROCESSOR_X86_INTELNEHALEM,
OMR_PROCESSOR_X86_INTELWESTMERE,
OMR_PROCESSOR_X86_INTELSANDYBRIDGE,
OMR_PROCESSOR_X86_INTELIVYBRIDGE,
OMR_PROCESSOR_X86_INTELHASWELL,
OMR_PROCESSOR_X86_INTELBROADWELL,
OMR_PROCESSOR_X86_INTELSKYLAKE,
OMR_PROCESSOR_X86_INTEL_LAST = OMR_PROCESSOR_X86_INTELSKYLAKE,
OMR_PROCESSOR_X86_AMD_FIRST,
OMR_PROCESSOR_X86_AMDK5 = OMR_PROCESSOR_X86_AMD_FIRST,
OMR_PROCESSOR_X86_AMDK6,
OMR_PROCESSOR_X86_AMDATHLONDURON,
OMR_PROCESSOR_X86_AMDOPTERON,
OMR_PROCESSOR_X86_AMDFAMILY15H,
OMR_PROCESSOR_X86_AMD_LAST = OMR_PROCESSOR_X86_AMDFAMILY15H,
OMR_PROCESSOR_X86_LAST = OMR_PROCESSOR_X86_AMDFAMILY15H,
OMR_PROCESOR_RISCV32_UNKNOWN,
OMR_PROCESOR_RISCV64_UNKNOWN,
OMR_PROCESSOR_DUMMY = 0x40000000 /* force wide enums */
} OMRProcessorArchitecture;

Refer to omr/include_core/omrport.h for the feature flags.
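For a sense of how those flags are consumed, a query against a detected processor description looks roughly like this (a sketch assuming the usual OMR port library access macros; check omrport.h for the exact signatures):

OMRPORT_ACCESS_FROM_OMRPORT(portLibrary);        /* standard OMR port access setup */
OMRProcessorDesc desc;
omrsysinfo_get_processor_description(&desc);     /* detect the host processor */
if (omrsysinfo_processor_has_feature(&desc, OMR_FEATURE_X86_AVX))
   {
   /* safe to emit AVX code paths */
   }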
I am thinking that being able to select the features for AOT through command line options is still important. In some instances, the IT people may know that the JVM is not going to run on machines older than X (pick your architecture) and may want to target that architecture as the baseline.
@harryyu1994 just so I'm clear on what you are expecting when you said

One approach could be to pick some processor that is reasonably old, such that a large proportion of users can reasonably be expected to have something newer than that, and then force the codegen to assume that processor type and see how much of a regression you get from handicapping the codegen in this way before deciding if we should go ahead or not. Is this the approach you were also thinking of, and if so, were you essentially looking for someone familiar with the different codegens to make a processor suggestion for their platform?
Yes, we should pick a default processor for each platform from the list I pasted, as well as default features (for x86).
Yes, I'm looking for processor suggestions from people.
For x86 I am proposing OMR_PROCESSOR_X86_INTELSANDYBRIDGE to be the baseline for relocatable code. It's a 9-year-old architecture that has AVX and AES instructions. If at all possible I would like this baseline to work on both Intel and AMD processors, as we start to see more and more AMD EPYC instances in the cloud.
Sounds reasonable to me, though I guess the true test will be a performance run to see how much we lose by assuming this older level of architecture on a current machine, e.g. Skylake.
For Z and Power, just a single processor type would be sufficient, as the feature flags are set based on the processor type. We need to come up with a mapping of processor type to feature flags for x86 (see the sketch below).

Note to self: I need to watch out for the few instances where processor type does matter on x86, and also look into whether it's possible for the baseline to work on both Intel and AMD.
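A hypothetical sketch of what such a processor-type-to-features mapping could look like, assuming the OMR_FEATURE_* values encode a 32-bit word index and bit offset (the helper and table names are made up for illustration):

static void setFeature(OMRProcessorDesc *desc, uint32_t feature)
   {
   desc->features[feature / 32] |= (1u << (feature % 32));
   }

/* Hypothetical table entry: features guaranteed by a Sandy Bridge baseline. */
static void setSandyBridgeFeatures(OMRProcessorDesc *desc)
   {
   desc->processor = OMR_PROCESSOR_X86_INTELSANDYBRIDGE;
   setFeature(desc, OMR_FEATURE_X86_SSE2);
   setFeature(desc, OMR_FEATURE_X86_SSE4_1);
   setFeature(desc, OMR_FEATURE_X86_AESNI);
   setFeature(desc, OMR_FEATURE_X86_AVX);
   }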
These are the flags listed for my machine, which uses ivybridge CPUs:

Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm arat pln pts md_clear spec_ctrl intel_stibp flush_l1d

I am reading that ivybridge added
In my upcoming changes:

// Only enable the features that the compiler currently uses
uint32_t enabledFeatures [] = {OMR_FEATURE_X86_FPU, OMR_FEATURE_X86_CX8, OMR_FEATURE_X86_CMOV,
OMR_FEATURE_X86_MMX, OMR_FEATURE_X86_SSE, OMR_FEATURE_X86_SSE2,
OMR_FEATURE_X86_SSSE3, OMR_FEATURE_X86_SSE4_1, OMR_FEATURE_X86_POPCNT,
OMR_FEATURE_X86_AESNI, OMR_FEATURE_X86_OSXSAVE, OMR_FEATURE_X86_AVX,
OMR_FEATURE_X86_FMA, OMR_FEATURE_X86_HLE, OMR_FEATURE_X86_RTM};

We maintain this array that contains all the features that the optimizer tries to exploit.
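Presumably this allow-list is then intersected with the detected features, which is also where the downgrading falls out naturally; a sketch under that assumption (srcDesc and targetDesc are hypothetical names):

/* Sketch: keep a feature only if it is both in the compiler's allow-list
 * above and actually present in the source description. */
for (size_t i = 0; i < sizeof(enabledFeatures) / sizeof(enabledFeatures[0]); i++)
   {
   uint32_t f = enabledFeatures[i];
   if (srcDesc.features[f / 32] & (1u << (f % 32)))
      targetDesc.features[f / 32] |= (1u << (f % 32));
   }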
Had some offline discussion with Marius; here are some notes:

These are the features present in
@harryyu1994 for your 3rd "note", did you mean that a) the default processor level is newer than the host, i.e. the host is some really old machine, or b) the user specified a processor level newer than the host? And does "disabling" mean silently not generating AOT code in that scenario, or something like reporting a usage error of some sort?
I'm reading this to mean there is one processor defined for the SCC. Does it make sense to allow different layers of a multi-layer SCC to define a different, more restrictive (i.e. newer) processor level? We should agree on whether this is a desirable principle rather than worry about the details now.
@vijaysun-omr I meant b). I was thinking about reporting a usage error to the user. Would we ever have a use case where the user only wants to generate AOT code for a certain processor level? So basically preparing the SCC for others to use.
My understanding of the multi-layer cache is that it's for storing SCCs in docker images. Basically each layer of the docker image will want to have its own SCC. So for multi-layers I was thinking something like this: the base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it (see the sketch below). What I meant in the original notes was for a different scenario (so basically only considering the current outermost layer).
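A sketch of that layer rule, where "more restrictive" means the new layer's required feature set is a superset of the layer below's (helper name hypothetical):

/* Hypothetical check: allow a new SCC layer only if it requires at least
 * every feature the layer below requires. */
static bool newLayerAllowed(const OMRProcessorDesc *below, const OMRProcessorDesc *newLayer)
   {
   for (size_t i = 0; i < OMRPORT_SYSINFO_FEATURES_SIZE; i++)
      if ((below->features[i] & newLayer->features[i]) != below->features[i])
         return false;  /* newLayer is missing a feature the layer below requires */
   return true;
   }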
Another approach would be to associate the AOT code with the processor it requires. This would allow mixing AOT code with different processor requirements in the same cache. Not terribly useful when running on a host system, but possibly more useful when a cache is being shipped around in Docker or other ways.
In relation to comment #7966 (comment): I feel it is okay to report a usage error in the case where a user specifies an option to produce AOT code for a newer processor than the host. If this functionality is deemed important in the future, it can be added at that time, but I don't see the need to do this work now.
I do see merit in @harryyu1994's comment "The base layer will have the least restrictive processor and the outermost layer will have the most restrictive processor. When we add another layer, as long as it's equivalent or more restrictive than the layer below it we are going to allow it.", i.e. philosophically this may be something to allow. In practical/implementation terms, I wonder if this is a use case that we support down the line rather than get bogged down with at present.
How usable is this feature without this? My mental model is that the docker image may be built up by different parts of the CI at different times on different machines (lots of variability in that process)! A user may pull an Adopt-created docker image with a default cache in it and then, in their own CI, build a common framework layer with a new cache layer. Finally, each app may reuse that image and add their own classes into the cache. If all three of those operations happen on different machines, we need to either "pin" the processor level to the one created in the JDK base image (i.e. the Adopt layer) or allow each layer to add further restrictions. A user doesn't want to have a bigger docker image due to AOT code they can't execute. Oh, and we'll need some API, maybe in the SCC
TR::Compiler->target contains the host environment, while TR::Compiler->relocatableTarget contains the target environment that AOT compilations will use. TR::Compiler->relocatableTarget is set according to the presence of an SCC, portable options, and Docker containers. Then, at the beginning of every compilation (in the constructor), we set the target to TR::Compiler->target for a JIT compilation and TR::Compiler->relocatableTarget for an AOT compilation. The following scenarios were considered:
- When there is no existing SCC and no portable option specified (-Xshareclasses:portable, -XX:+PortableShareClasses), we use the host processor for AOT
- When there is no existing SCC and a portable option is specified, we use the hand-picked portable processor for AOT
- When there is no existing SCC and we are in a container, we use the hand-picked portable processor for AOT
- When there is an existing SCC and it passes the processor compatibility check, we use the processor stored in the SCC (ignoring portable options and whether we are in a container or not)
- JITServer is not affected, meaning JITServer will still use the JITClient's processor information for all of its compilations regardless of the portable/inContainer/SCC options

depends on: eclipse-omr/omr#4861
Issue: eclipse-openj9#7966
Signed-off-by: Harry Yu <harryyu1994@gmail.com>
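In rough terms, the per-compilation selection described in this commit message amounts to something like the following sketch (the actual constructor code may differ):

// Sketch: pick the environment a compilation targets.
TR::Environment *target = comp->compileRelocatableCode()
   ? &TR::Compiler->relocatableTarget   // AOT: portable/SCC-derived processor
   : &TR::Compiler->target;             // JIT: the actual host processor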
- Added "-XX:+PortableSharedCache" and "-XX:-PortableSharedCache"
- Added container detection. By default we enable portableAOT in containers.
- "-XX:-PortableSharedCache" will disable portableAOT in containers as well.
- The set of CPU features employed by the SCC can be displayed with "-Xshareclasses:printStats"

Issue: eclipse-openj9#7966
Signed-off-by: Harry Yu <harryyu1994@gmail.com>
@zl-wang @gita-omr @mpirvu @vijaysun-omr I'm looking at enabling Portable AOT on Power. I have a few questions on how this may work on Power, as its processor detection and compatibility check logic is a little bit different from x86 and Z. First I'll provide some background information:

Background

How processor detection works now:
How Portable AOT works on x86:
How Portable AOT works on Z:
Questions
On POWER, it should be similar to the Z approach: a lower processor type plus enabled features. For example, Transactional Memory -- it doesn't depend strictly on the hardware; it also depends on whether the OS enables it or not. You cannot determine its availability solely by processor type. The general principle of hardware is that later generations of CPU are ISA-compatible with earlier generations, except for very few exceptions between far-apart generations (for deprecated instructions, e.g.).
Okay, so processor features are determined by both hardware and OS; this makes sense. Another question: is it true for Power that if the host environment (hardware + OS) contains all the processor features that the build environment has, then we can run the AOT code (from the build environment) on the host environment?
The processor feature set contained in processorDescription should be what's actually available, and not what could be available based on the processor type. We should take the OS into account when we initialize the processor feature set. After that, we can safely compare processor feature sets, similar to how we are doing it for x86 (see the sketch below). I'm hoping this works for Power.
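The comparison itself then reduces to a subset test over the feature words, e.g. this sketch (helper name hypothetical):

/* Sketch: AOT code recorded with description 'scc' can run on 'host' if
 * every feature the SCC requires is present on the host. */
static bool sccCompatibleWithHost(const OMRProcessorDesc *scc, const OMRProcessorDesc *host)
   {
   for (size_t i = 0; i < OMRPORT_SYSINFO_FEATURES_SIZE; i++)
      if ((scc->features[i] & host->features[i]) != scc->features[i])
         return false;
   return true;
   }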
Yes, that is expected.
This issue is to discuss whether or not it makes sense to define a set of processor features the compiler should target when generating AOT code. The three general approaches we can take are:
The question of what a compiler like GCC does came up in the discussion. Looking online, my understanding is that by default GCC compiles for the target it itself was compiled to target:
GCC will only target the CPU it is running on if -march=native is specified [2].

[1] https://wiki.gentoo.org/wiki/GCC_optimization
[2] https://wiki.gentoo.org/wiki/Distcc#-march.3Dnative