Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the bitPermute Evaluator for Z #2330

Merged
merged 1 commit into from
Apr 16, 2018

Conversation

dchopra001
Copy link
Contributor

Implement the bitPermute Evaluator for z/codegen

Signed-off-by: Dhruv Chopra Dhruv.C.Chopra@ibm.com

@dchopra001
Copy link
Contributor Author

Leaving this as a WIP for now as there are a couple of MemoryReference related things I need to verify and we might be able to come up with a better sequence of instructions to perform this bitPermute operation after @fjeremic takes a look.

@dchopra001
Copy link
Contributor Author

dchopra001 commented Feb 20, 2018

Functionality of the bitPermute opcode is summarized well by this comment:
#2156 (comment)

For comparison on Z, I generated what the IL and subsequent instructions look like on Z when bitPermute is not enabled on the platform (this is with no optimizations running, lastOptIndex=0):

 21    {
 22       final byte [] bitPositions =  {.......};
 23       int result = 0;
 24
 25       for (int i = 0; i < 64; i++)
 26          {
 27          result |= ((value >> bitPositions[i]) & 1) << i;
 28          }
 29       return result;
 30    }```

The corresponding IL and generated instructions for `result |= ((value >> bitPositions[i]) & 1) << i;` looks as below:

``` 7542  n425n    (  0)  istore  <auto slot 2>[#605  Auto] [flags 0x3 0x0 ] 
 7543  n424n    (  0)    ior (in GPR_0381)
 7544  n405n    (  0)      iload  <auto slot 2>[#605  Auto] [flags 0x3 0x0 ]
 7545  n423n    (  0)      ishl (in GPR_0381)
 7546  n421n    (  0)        iand (in GPR_0377)
 7547  n419n    (  0)          ishr (in GPR_0377)
 7548  n406n    (  0)            iload  <parm 0 I>[#603  Parm] 
 7549  n418n    (  0)            b2i (in GPR_0379) (unneededConv )
 7550  n417n    (  0)              bloadi  <array-shadow>[#473  Shadow] [flags 0x80000601 0x0 ] (in GPR_0379) (signExtendedTo32BitAtSource )
 7551  n416n    (  0)                aladd (internalPtr sharedMemory )
 7552  n407n    (  0)                  ==>aload (in &GPR_0371) (X!=0 sharedMemory )
 7553  n415n    (  0)                  ladd
 7554  n414n    (  0)                    i2l (in GPR_0375) (X>=0 unneededConv )
 7555  n408n    (  0)                      ==>iload (in GPR_0375) (signExtendedTo64BitAtSource )
 7556  n413n    (  0)                    lconst 8
 7557  n420n    (  0)          iconst 1
 7558  n422n    (  0)        iload  <auto slot 3>[#606  Auto] [flags 0x3 0x0 ] (in GPR_0382)
 7561  [     0x3ffdb1b3a70]   L       GPR_0377, Parm[<parm 0 I>] ?+0(GPR5) 
 7562  [     0x3ffdb1b3ce0]   LB      GPR_0379, Shadow[<array-shadow>] 8(GPR_0375,&GPR_0371) 
 7563  [     0x3ffdb1b3dd0]   NILL    GPR_0379,0x1f  
 7564  [     0x3ffdb1b3f90]   SRA     GPR_0377,#614 0(GPR_0379)
 7565  [     0x3ffdb1b4090]   NILF    GPR_0377,1 
 7566  [     0x3ffdb1b43b0]   L       GPR_0382, Auto[<auto slot 3>] ?+0(GPR5)
 7567  [     0x3ffdb1b44a0]   NILL    GPR_0382,0x1f
 7568  [     0x3ffdb1b4660]   SLLK    GPR_0381,GPR_0377,#615 0(GPR_0382)
 7569  [     0x3ffdb1b4860]   O       GPR_0381, Auto[<auto slot 2>] ?+0(GPR5) 
 7570  [     0x3ffdb1b4a50]   ST      GPR_0381, Auto[<auto slot 2>] ?+0(GPR5) 

And the instructions for the entire bitPermute operation (i.e the for loop):

7945  [     0x3ffdb1b0710]   Label L0018:
 7946  [     0x3ffdb1b0810]   FENCE   Relative [ 000003FFDB050220 ] BBStart <block_3> (in loop 3)
 7947  [     0x3ffdb1b0bf0]   CHSI     Auto[<auto slot 3>] ?+0(GPR5),0x0020
 7948  [     0x3ffdb1b0cf0]   BRC     BNM(0xa), Label L0017
 7949  [     0x3ffdb1b1000]   FENCE   Relative [ 000003FFDB050224 ] BBEnd </block_3>
 7950  [     0x3ffdb1b1260]   ASSOCREGS
 7951  [     0x3ffdb1b1400]   FENCE   Relative [ 000003FFDB051710 ] BBStart <block_4> (in loop 3)
 7952  [     0x3ffdb1b1860]   CGHSI    MethodMeta[stackOverflowMark] ?+80(GPR13),0xffff
 7953  [     0x3ffdb1b1960]   BRC     BH(0x8), Outlined Label L0080    ; Branch to OOL asyncCheck sequence
 7954  [     0x3ffdb1b2190]   Label L0081:     ; OOL asyncCheck return label
 7955  [     0x3ffdb1b2860]   LGAT    &GPR_0371, Auto[<auto slot 1>] ?+0(GPR5)
 7956  [     0x3ffdb1b2f00]   LGF     GPR_0375, Auto[<auto slot 3>] ?+0(GPR5)
 7957  [     0x3ffdb1b30f0]   CLT     GPR_0375,0xa, Shadow[<contiguous-array-size>] 4(&GPR_0371)
 7958  [     0x3ffdb1b3a70]   L       GPR_0377, Parm[<parm 0 I>] ?+0(GPR5)
 7959  [     0x3ffdb1b3ce0]   LB      GPR_0379, Shadow[<array-shadow>] 8(GPR_0375,&GPR_0371)
 7960  [     0x3ffdb1b3dd0]   NILL    GPR_0379,0x1f
 7961  [     0x3ffdb1b3f90]   SRA     GPR_0377,#614 0(GPR_0379)
 7962  [     0x3ffdb1b4090]   NILF    GPR_0377,1
 7963  [     0x3ffdb1b43b0]   L       GPR_0382, Auto[<auto slot 3>] ?+0(GPR5)
 7964  [     0x3ffdb1b44a0]   NILL    GPR_0382,0x1f
 7965  [     0x3ffdb1b4660]   SLLK    GPR_0381,GPR_0377,#615 0(GPR_0382)
 7966  [     0x3ffdb1b4860]   O       GPR_0381, Auto[<auto slot 2>] ?+0(GPR5)
 7967  [     0x3ffdb1b4a50]   ST      GPR_0381, Auto[<auto slot 2>] ?+0(GPR5)
 7968  [     0x3ffdb1b5620]   ASI      Auto[<auto slot 3>] ?+0(GPR5), 1
 7969  [     0x3ffdb1b5990]   BRC     NOP(0xf), Label L0018

looks like ~18 instructions in the loop

@dchopra001
Copy link
Contributor Author

The implemented evaluator here will generate the following set of instructions to do the entire bitPermute operation if the array length is constant:

IL:

 721  n3n      (  0)  ireturn 
 722  n4n      (  0)    ibitpermute (in GPR_0020) () 
 723  n9n      (  0)      iload
 724  n6n      (  0)      aconst 0x3ffffdbc798 (in GPR_0018) (X!=0 ) 
 725  n7n      (  0)      iconst 32 (X!=0 X>=0 )

Instructions:

728  [        0xc103f0a0]   LLIHF   GPR_0018,1023
 729  [        0xc103f190]   IILF    GPR_0018,-2373736
 730  [        0xc103f400]   XGR     GPR_0020,GPR_0020
 ...
...
737  [        0xc103fb50]   XGR     GPR_0019,GPR_0019
 738  [        0xc103fce0]   LB      GPR_0019,#586 1(GPR_0018)
 739  [        0xc103fdb0]   NILL    GPR_0019,0x1f
 740  [        0xc103ff40]   SRLG    GPR_0019,GPR_0016,#587 0(GPR_0019)
 741  [        0xc1040010]   NILF    GPR_0019,1
 742  [        0xc1040100]   SLLG    GPR_0019,GPR_0019,1
 743  [        0xc10401d0]   OGR     GPR_0020,GPR_0019

A similar sequence to that of lines 737-743 is generated 30 more times (constant length array size=32). This approach avoids branching that the old sequence of instructions perform and has less memory operations compared to before.

In the case where we don't the array length is not a constant, the generated instructions are similar but now require branching. Instructions generated by the evaluator for the above IL will be:

 740  [        0xc0c69910]   LLIHF   GPR_0019,1023
 741  [        0xc0c69a00]   IILF    GPR_0019,-5896280
 742  [        0xc0c69c70]   XGR     GPR_0021,GPR_0021

 743  [        0xc0c69f10]   LTGR    GPR_0022,GPR_0016

 744  [        0xc0c69fe0]   Label L0016:    # (Start of internal control flow)
 745  [        0xc0c6a0b0]   AFI     GPR_0022,-1
 746  [        0xc0c6a1a0]   BRC     MASK5(0x4), Label L0017
 747  [        0xc0c6a270]   XGR     GPR_0020,GPR_0020
 748  [        0xc0c6a400]   LB      GPR_0020,#585 0(GPR_0022,GPR_0019)
 749  [        0xc0c6a4d0]   NILL    GPR_0020,0x1f
 750  [        0xc0c6a660]   SRLG    GPR_0020,GPR_0017,#586 0(GPR_0020)
 751  [        0xc0c6a730]   NILF    GPR_0020,1
 752  [        0xc0c6a8e0]   SLLG    GPR_0020,GPR_0020,#587 0(GPR_0022)
 753  [        0xc0c6a9b0]   OGR     GPR_0021,GPR_0020
 754  [        0xc0c6aa80]   BRC     (null)(0xf), Label L0016
 755  [        0xc0c6ad60]   ASSOCREGS
 756  [        0xc0c6ab50]   Label L0017:    # (End of internal control flow)

~10 instructions repeated in loop.

@dchopra001
Copy link
Contributor Author

@fjeremic Let me know if you can spot an area where we can use more optimal instructions.

@@ -14117,6 +14117,168 @@ bool isSettingOp(uint8_t byteValue, TR::InstOpCode::Mnemonic SI_opcode)
}



Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's clean up this area of code and replace the triple new line with a single new line.

TR::Register *
OMR::Z::TreeEvaluator::bitpermuteEvaluator(TR::Node *node, TR::CodeGenerator *cg)
{

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the unnecessary new line?

OMR::Z::TreeEvaluator::bitpermuteEvaluator(TR::Node *node, TR::CodeGenerator *cg)
{

TR_ASSERT(node->getNumChildren() == 3, "Wrong number of children in bitpermuteEvaluator");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I have no quarrel with additional asserts, shouldn't this be the job of the IL verifier? i.e. we already have a guard for this here:

/* .childProperties = */ THREE_CHILD(TR::Int8, TR::Address, TR::Int32),

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed. I didn't know about the IL Verifier check. This check also exists in the X/codegen evaluator, so it should be removed from there as well.


bool nodeIs64Bit = node->getSize() == 8;
auto valueReg = cg->evaluate(value); // this will either be a constant or a load
auto addrReg = cg->evaluate(addr); // this will either be a constant or a load
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these comments are true. We only know what the data type will be. IMO they are misleading.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


TR::Register *tmpReg = cg->allocateRegister(TR_GPR);

//Zero result reg
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space needed after // (there are several occurences of this in the PR - let's fix them all and capitalize the first word). Should also move this comment down to the generateRRInstruction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


if (!(node->getDataType() == TR::Int32 || node->getDataType() == TR::Int16 || node->getDataType() == TR::Int8 || node->getDataType() == TR::Int64))
{
TR_ASSERT(0, " Unsupported DataType for bitPermute op");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

generateRRInstruction(cg, TR::InstOpCode::getXORRegOpCode(), node, tmpReg, tmpReg);

//load the shift amount into tmpReg
//TODO: Check to see if we need sourceMR->stopUsingMemRefRegister(cg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically yes, though in practice we have code that handles this automatically.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay. Then I'll remove these from here.

auto lengthReg = cg->evaluate(length);
auto indexReg = cg->allocateRegister(TR_GPR);

TR::RegisterDependencyConditions* deps = generateRegisterDependencyConditions((uint8_t)0, (uint8_t)4, cg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need the casts in the arguments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.


// Load array length into index and test to see if it's already zero
// cc=0 if register value is 0, cc=1 if value < 1, cc=2 if value>0
generateRRInstruction(cg, TR::InstOpCode::LTGR, node, indexReg, lengthReg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to be careful here if running on 31-bit? What is the type of lengthReg on 31-bit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think we can get away with using an LTR here because lengthReg is always a TR::Int32 as per the IL.

Copy link
Contributor Author

@dchopra001 dchopra001 Feb 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction: the correct implementation here would be:
generateRRInstruction(cg, TR::InstOpCode::getLoadTestRegOpCode(), node, indexReg, lengthReg);

where the getLoadTestRegOpCode() query is defined as below:

TR::InstOpCode::Mnemonic
OMR::Z::InstOpCode::getLoadTestRegOpCode() { return TR::Compiler->target.is64Bit() ? TR::InstOpCode::LTGR : TR::InstOpCode::LTR; }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was done incorrectly before. The correct query was added here: f1f0697

generateS390LabelInstruction(cg,TR::InstOpCode::LABEL, node, startLabel);

// now subtract 1 from indexReg
generateRILInstruction(cg, node->getDataType() == TR::Int64 ? TR::InstOpCode::AGFI : TR::InstOpCode::AFI, node, indexReg, -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we use

generateS390ImmOp(cg, TR::InstOpCode::getAddOpCode(), node, indexReg, indexReg, -1);

here instead and let the helper pick the best instruction? All add instructions produce the same condition code results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@fjeremic
Copy link
Contributor

I'll stop the review here and let you do a first round of cleanup before reviewing again.

@fjeremic
Copy link
Contributor

fjeremic commented Feb 21, 2018

A similar sequence to that of lines 737-743 is generated 30 more times (constant length array size=32). This approach avoids branching that the old sequence of instructions perform and has less memory operations compared to before.

We should definitely not do this. The amount of icache we will consume will be way worse than using a loop with a branch on count. This will be well predicted. In the case of constant length do we know what the typical constants that get passed in are? Is it always 8, 16, 32, or 64, or some other value? Perhaps we can have the best of both worlds and unroll the loop manually a few times via Duff's device or unroll fully for say maximum length of 4, otherwise fall back to a single iteration loop.

@fjeremic fjeremic self-assigned this Feb 21, 2018
@fjeremic
Copy link
Contributor

fjeremic commented Feb 21, 2018

Some more reviews on the generated code:

 740  [        0xc0c69910]   LLIHF   GPR_0019,1023
 741  [        0xc0c69a00]   IILF    GPR_0019,-5896280
 742  [        0xc0c69c70]   XGR     GPR_0021,GPR_0021

 743  [        0xc0c69f10]   LTGR    GPR_0022,GPR_0016

 744  [        0xc0c69fe0]   Label L0016:    # (Start of internal control flow)
 745  [        0xc0c6a0b0]   AFI     GPR_0022,-1
 746  [        0xc0c6a1a0]   BRC     MASK5(0x4), Label L0017
 747  [        0xc0c6a270]   XGR     GPR_0020,GPR_0020
// Why is the above XGR needed? Can't we use LGB below instead?
 748  [        0xc0c6a400]   LB      GPR_0020,#585 0(GPR_0022,GPR_0019)
 749  [        0xc0c6a4d0]   NILL    GPR_0020,0x1f
// Do we really need this NILL? Can't whoever is generating this array make sure this is the case beforehand so we don't need to mask here?
 750  [        0xc0c6a660]   SRLG    GPR_0020,GPR_0017,#586 0(GPR_0020)
 751  [        0xc0c6a730]   NILF    GPR_0020,1
// Why do we need to shift left and then shift right here? Can't we just keep the shift mask in a register and keep shifting it right and ANDing? Seems like an extra unnecessary shift.
 752  [        0xc0c6a8e0]   SLLG    GPR_0020,GPR_0020,#587 0(GPR_0022)
 753  [        0xc0c6a9b0]   OGR     GPR_0021,GPR_0020
 754  [        0xc0c6aa80]   BRC     (null)(0xf), Label L0016
 755  [        0xc0c6ad60]   ASSOCREGS
 756  [        0xc0c6ab50]   Label L0017:    # (End of internal control flow)

I feel like there is a use case for RISBG instruction in this algorithm. Have you explored that? More importantly this algorithm can be implemented in a few instructions using VBPERM on z14. We definitely need a z14 path here as I don't think any GPR solution will beat a vector one.

@dchopra001
Copy link
Contributor Author

dchopra001 commented Feb 22, 2018

We should definitely not do this. The amount of icache we will consume will be way worse than using a loop with a branch on count. This will be well predicted.

Sounds like that might be a better option. However this was done in reference to the original X implementation (So we will need to change it there as well). As mentioned in the comment I linked (#2156 (comment)), this was considered to be an optimization for a common case under JProfiling:

This change includes an X86 implementation, which consists of a series of bit tests and sets. The current implementation has an optimization for constant length arrays, a common case under JProfiling.

In the case of constant length do we know what the typical constants that get passed in are? Is it always 8, 16, 32, or 64?

I don't observe any patterns in how the tril tests are written for this. But they seem to be testing the boundaries. And I'm not familiar with how JProfiling works exactly to know what the typical constants would be.

// Why is the above XGR needed? Can't we use LGB below instead?

The XGR was used to zero out tmpReg because tmpReg is OR'ed with resultReg at the end of every loop iteration, so it's always a non-zero value. I think with LGB we will be able to avoid this though. I'll test it out.

// Do we really need this NILL? Can't whoever is generating this array make sure this is the case beforehand so we don't need to mask here?

I addressed this in one of your code review comments. I agree, we can save two instructions by removing this. But I'm not sure who the right person is that should perform this check. There are other areas in the JIT where we do this masking. And the architecture doesn't say anything about considering only the first 5 bits (it considers the first 6, hence why we never mask for 64 bit shifts).

@dchopra001
Copy link
Contributor Author

@fjeremic I've addressed your comments. PTAL.

@fjeremic
Copy link
Contributor

I don't observe any patterns in how the tril tests are written for this.

I don't think tril tests will tell you common use patterns. The point of tril tests is to have maximal coverage and test every single possibility that may arise. It does not tell us which ones are actually common in production code. Since this IL is currently only used for profiling we need to understand the common number of "things" we profile at a time. Surely we don't profile more than a handful of values at a single site, but quantitatively how many is "handful"?

TR::Node *addr = node->getChild(1);
TR::Node *length = node->getChild(2);

bool nodeIs64Bit = node->getSize() == 8;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable is unused. Can we remove?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 537a564

TR::MemoryReference * tempMR = generateS390MemoryReference(tmpReg, 0, cg);
generateRSInstruction(cg, TR::InstOpCode::getShiftRightLogicalSingleOpCode(), node, tmpReg, valueReg, tempMR);

if (!(node->getDataType() == TR::Int64))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This path has an if and an else. Can we swap the blocks to avoid the negative expression in the if clause?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in dcdefce

// This is equivalent to doing a `tmpReg & 0x1`. But on 64-bit we would have to use
// two AND immediate instructions. So instead we use a single RISBG instruction.
TR::InstOpCode::Mnemonic opCode = cg->getS390ProcessorInfo()->supportsArch(TR_S390ProcessorInfo::TR_zEC12) ? TR::InstOpCode::RISBGN : TR::InstOpCode::RISBG;
generateRIEInstruction(cg, opCode, node, tmpReg, tmpReg, 63, 191, 0);
Copy link
Contributor

@fjeremic fjeremic Feb 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we able to use generateShiftAndKeepSelected64Bit here? Also what if we're on a z196? This would cause a SIGILL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I follow what you mean. Why would this cause a SIGILL on z196? The RISBG Instruction is available on it from what what I can see.

Copy link
Contributor

@fjeremic fjeremic Feb 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I meant to say z9, not z196. Apologies for the mistake. RISBG is only supported on z10 onward:

{0xEC, 0x55, RIE_FORMAT, TR_S390ProcessorInfo::TR_z10}, // RISBG

We should use the above API and not worry about this (it already handles the z9 case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adressed in ef34b3e

@dchopra001
Copy link
Contributor Author

747  [        0xc0c6a270]   XGR     GPR_0020,GPR_0020
// Why is the above XGR needed? Can't we use LGB below instead?
 748  [        0xc0c6a400]   LB      GPR_0020,#585 0(GPR_0022,GPR_0019)

As mentioned, the XGR was used to zero out the register and then I used an LB to load a byte into an empty register. An LGB accomplishes the same thing, so I've replaced the two instructions with LGB: 400f261

@dchopra001
Copy link
Contributor Author

750  [        0xc0c6a660]   SRLG    GPR_0020,GPR_0017,#586 0(GPR_0020)
 751  [        0xc0c6a730]   NILF    GPR_0020,1
// Why do we need to shift left and then shift right here? Can't we just keep the shift mask in a register and keep shifting it right and ANDing? Seems like an extra unnecessary shift.
 752  [        0xc0c6a8e0]   SLLG    GPR_0020,GPR_0020,#587 0(GPR_0022)

I'm not sure I follow your comment here. Would you be able to provide an example?

To give a bit of background, the first two instructions are trying to do what a bt and setb instruction do on x86 (test a bit at a position specified by a register, then set a bit in a specified register based on the test results). The left shift is done in order to prepare the value to be OR'ed at the correct position in the final result. The amount shifted left is not equal to the amount shifted right.

@dchopra001
Copy link
Contributor Author

 744  [        0xc0c69fe0]   Label L0016:    # (Start of internal control flow)
 745  [        0xc0c6a0b0]   AFI     GPR_0022,-1
 746  [        0xc0c6a1a0]   BRC     MASK5(0x4), Label L0017
 747  [        0xc0c6a270]   XGR     GPR_0020,GPR_0020
// Why is the above XGR needed? Can't we use LGB below instead?
 748  [        0xc0c6a400]   LB      GPR_0020,#585 0(GPR_0022,GPR_0019)
 749  [        0xc0c6a4d0]   NILL    GPR_0020,0x1f
// Do we really need this NILL? Can't whoever is generating this array make sure this is the case beforehand so we don't need to mask here?
 750  [        0xc0c6a660]   SRLG    GPR_0020,GPR_0017,#586 0(GPR_0020)
 751  [        0xc0c6a730]   NILF    GPR_0020,1
// Why do we need to shift left and then shift right here? Can't we just keep the shift mask in a register and keep shifting it right and ANDing? Seems like an extra unnecessary shift.
 752  [        0xc0c6a8e0]   SLLG    GPR_0020,GPR_0020,#587 0(GPR_0022)
 753  [        0xc0c6a9b0]   OGR     GPR_0021,GPR_0020
 754  [        0xc0c6aa80]   BRC     (null)(0xf), Label L0016
 755  [        0xc0c6ad60]   ASSOCREGS
 756  [        0xc0c6ab50]   Label L0017:    # (End of internal control flow)

There may be a possible point of improvement here. If the result of NILF GPR_0020,1 is 0, the SLLG and OGR are effectively NOPs. We would be better off branching back(I can do this by adding another BRC after the NILF) to the top of the internal control flow in those cases (we end up executing two fewer instructions). The penalty however would be that if the result of the NILF GPR_0020,1 is not 0, then we will end up executing 4 instructions as opposed to the 3 right now.

generateRXYInstruction(cg, TR::InstOpCode::LGB, node, tmpReg, sourceMR);

// Create memory reference using tmpReg (shift amount), then shift valueReg by shift amount
TR::MemoryReference * tempMR = generateS390MemoryReference(tmpReg, 0, cg);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get a better name for this memory reference (similarly in the else path)? Perhaps shiftAmountMR? Also would it be possible to fix the pointer location inconsistencies throughout this change? I'm seen all three flavors of type* name, type * name, and type *name. Let's stick to one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in: 62c4af1

// This will generate a RISBG instruction (if supported).
// This is equivalent to doing a `tmpReg & 0x1`. But on 64-bit we would have to use
// two AND immediate instructions. So instead we use a single RISBG instruction.
generateShiftAndKeepSelected64Bit(node, cg, tmpReg, tmpReg, 63, 63, 0, true, false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we carry out the shift and AND operation using a single RISBG?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, just saw RISBG type instructions do not have register flavors. I don't think this would be possible then.

Copy link
Contributor Author

@dchopra001 dchopra001 Feb 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, to address your earlier question as well, I did explore the use of RISBG, but due to shift amounts being specified as immediates, it wasn't possible. I was thinking that the RISBG could be used as a way to preserve a specific bit (where the bit position is inside a register) and then we can shift it right accordingly. However due to no register flavours, this wasn't possible.

But taking a second look, there may still an opportunity for the constant length case. The left shift (done after the RISBG) amount is a constant here:
generateRSInstruction(cg, TR::InstOpCode::getShiftLeftLogicalSingleOpCode(), node, tmpReg, tmpReg, x);
And we can specify the left shift amount (x) as a constant via RISBG. So we may not need this final shift and can do it via RISBG.

I'll test this out.

Copy link
Contributor Author

@dchopra001 dchopra001 Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I tested this out and it looks like it's working. For the constant length case, we now have a guaranteed 4 instructions for every "loop iteration". Before it was anywhere from 5-6 instructions. The following code snippet from the commit summarizes what the RISBG instruction accomplishes:

// This will generate a RISBG instruction (if it's supported, otherwise two shift instructions).
// A RISBG instruction is equivalent to doing a `(tmpReg & 0x1) << x`. But for a 64-bit value we would have to use
// two AND immediate instructions and a shift instruction to do this. So instead we use a single RISBG instruction. 
generateShiftAndKeepSelected64Bit(node, cg, tmpReg, tmpReg, 63 - x, 63 - x, x, true, false);

In the case where a RISBG instruction is not supported by the architecture, we will get two shift instructions instead, which will result in 5 instructions instead of 4.

The commit for this change is here: c6f746f

// Create memory reference using tmpReg (shift amount), then shift valueReg by shift amount
TR::MemoryReference * tempMR = generateS390MemoryReference(tmpReg, 0, cg);
generateRSInstruction(cg, TR::InstOpCode::getShiftRightLogicalSingleOpCode(), node, tmpReg, valueReg, tempMR);
if (node->getDataType() == TR::Int64)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle register pairs on 31-bit here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, we are no longer handling the 31-bit case. So the bitPermute IL will be disabled on those architectures. This is addressed here: 1807405

@fjeremic
Copy link
Contributor

fjeremic commented Feb 23, 2018

We would be better off branching back(I can do this by adding another BRC after the NILF) to the top of the internal control flow in those cases (we end up executing two fewer instructions).

Introducing a branch here would have negative effect on performance as these operations are very fast. A shift and an OR is not expensive at all.

I'm not sure I follow your comment here. Would you be able to provide an example?

I assumed there was a flavor of RISBG type instructions work with registers. But this is not the case. It seems they all work on immediate. I rescind my earlier comment.

if (length->getOpCode().isLoadConst())
{
// Manage the constant length case
int arrayLen = length->getIntegerNodeValue<int>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to avoid using primitive types as they have no guarantees on lower bound width. Use int32_t instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in: 62c4af1

// Zero result reg
generateRRInstruction(cg, TR::InstOpCode::getXORRegOpCode(), node, resultReg, resultReg);

if (length->getOpCode().isLoadConst())
Copy link
Contributor

@fjeremic fjeremic Feb 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see this become if (length->getOpCode().isLoadConst() && length->getIntegerNodeValue<int32_t>() <= bitPermuteConstantUnrollThreshold) where this static const is defined to be some sane value so we don't end up unrolling 32 or 64 times in the worst cases. We need to determine by the use case what that threshold should be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fjeremic
Copy link
Contributor

fjeremic commented Feb 23, 2018

Final note before merging I'd like to see a z14 path using VBPERM here as this will have the best performance and path length possible on today's architecture.

return true;
}

return false;
Copy link
Contributor

@fjeremic fjeremic Feb 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just:

return TR::Compiler->target.is64Bit() || self()->use64BitRegsOn32Bit();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done:
e3cd762

@dchopra001
Copy link
Contributor Author

@fjeremic I've addressed the last set of changes requested. (z14 path is in 6652f2e). PTAL.


TR::Register *tmpReg = cg->allocateRegister(TR_GPR);

static const int8_t BIT_PERMUTE_CONSTANT_UNROLL_THRESHOLD = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a local variable and should be named as such (bitPermuteConstantUnrollThreshold). I'd like to see some documentation about this variable and what it means along with why it is currently defined to be 4. If you've done empirical measurements to find out the best value I'd like to see the test posted to either the PR or a new issue and the comment here should link to where future users may reference and reproduce your results.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in : 95356e1
Code Documentation : 1d71fdf

Comments detailing empirical measurements: #2330 (comment)
#2330 (comment)


TR::Register *resultReg = cg->allocateRegister(TR_GPR);


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the double space?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (length->getOpCode().isLoadConst() && length->getIntegerNodeValue<int32_t>() <= BIT_PERMUTE_CONSTANT_UNROLL_THRESHOLD)
{
// Manage the constant length case
int32_t arrayLen = length->getIntegerNodeValue<int>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pull this outside of the above if and use arrayLen instead of length->getIntegerNodeValue<int32_t>()? Also this line seems to be extracting an int but the type of the variable is int32_t. Can we fix this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed here: 95356e1

/// for arrays of size 5 and greater)
else if (cg->getS390ProcessorInfo()->supportsArch(TR_S390ProcessorInfo::TR_z14) &&
length->getOpCode().isLoadConst() &&
length->getIntegerNodeValue<int>() > BIT_PERMUTE_CONSTANT_UNROLL_THRESHOLD &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line shouldn't be necessary. It is always true.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, some old code I forgot to remove. Fixed here: 5231065

length->getIntegerNodeValue<int>() <= 16 )
{
char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be marked as const.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, marking these as const will require a const_cast to remove the constness when calling findOrCreateConstant(node, mask, 16).

The correct solution would be to change those APIs to accept const void * but this PR wouldn't be the right place for that. I'll open an issue to address this larger problem.

char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63};

int32_t arrayLen = length->getIntegerNodeValue<int>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can reuse this variable after making the above change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: 95356e1


TR::Register *VectorIndices = cg->allocateRegister(TR_VRF); //V17
TR::Register *VectorSource = cg->allocateRegister(TR_VRF); //V16 and V18
TR::Register *tmpVector = cg->allocateRegister(TR_VRF); //V20 and V21
Copy link
Contributor

@fjeremic fjeremic Apr 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do these comments mean? Also variable names should be camelCase to be consistent with the rest of the codebase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comments I forgot to remove. Fixed: 76d2d7b


cg->stopUsingRegister(VectorIndices);
cg->stopUsingRegister(VectorSource);
cg->stopUsingRegister(tmpVector);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a space missing here in these 3 lines (alignment with brace).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed: 76d2d7b

for (int j = 1; j < arrayLen; j++)
{
resultMask = ( (resultMask << 1) | 0x1);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not replace the whole loop with resultMask = (1 << arrayLen) - 1; - double check this is correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed here: 35f3057

@dchopra001
Copy link
Contributor Author

dchopra001 commented Apr 3, 2018

I collected some data to see if there's a performance benefit in using the VBPERM instruction on z14 for this use case. Here's the data (each test was run 20 billion times):

screen shot 2018-04-02 at 10 54 01 am

Some context:

  1. The first column is for when we unroll the loop. These instructions are available on z13 and z12. This implementation essentially unrolls the loop to do the bitPermute operation. (i.e. We don't use the VBPERM instruction here)

  2. The second column, z14CurrentIL is doing the bitPermute operation for the way the IL is currently designed. (i.e. We use the VBPERM instruction here, but only after we reverse the order of the indices (i.e. array reversal) and subtract each bit index from 63 to fit the big endianness of the Z architecture)

  1. The third column, z14NewIL, is referring to a suggested new IL that is better suited for big endian architectures. (i.e. We use the VBPERM instruction here, without having to do an array reversal and bit subtraction. The IL already provides us an indices array that is "reversed" and each index has been subtracted from 63 already).

Here's a quick plot of the above data:
screen shot 2018-04-02 at 10 57 09 am

Some observations:
As expected, the vector implementations scale really well and hence are the preferred way to go for larger array sizes.

The z14NewIL implementation has the best performance all around since it's almost on par with the loop unrolling implementation for small array sizes, and scales better for larger sizes.

The loop unrolling implementation performs better than z14CurrentIL for arrays of size <= 4.

There's a clear difference between the loop unrolling implementation (for z13 and lower) and z14currentIL implementation for small array sizes(~20%). The gap is closed if we used the new proposed IL instead.

So the question is, is it implementing a new IL opcode to better fit Big Endian architectures? The short answer is: Not for now. More below:

Currently the only consumer of the bitPermute IL is OpenJ9's JProfiling feature. For this use case we will almost always permute 2 bits or less (i.e. array size will almost always be 2). In the rare case we might permute 3 bits, but beyond that is very unlikely. The main reason being is that the space required to permute more bits grows exponentially with each addition. So this scenario is unlikely to happen in the future as well.

As seen above, there isn't much of a performance difference between the loop unrolling implementation and z14NewIL implementation for arrays of size <=2. In the rare case where the array size is 3, we would see a ~6% performance difference.. which may be negligible if such a scenario is almost never going to happen anyway. Since JProfiling won't see much of a performance benefit from a new IL, we won't implement it right now.

The bitPermute evaluator will use the loop unrolling technique for arrays of size <= 4 and the VBPERM approach used in z14CurrentIL for arrays of size 5 - 16 (VBPERM can only permute 16 bits at a time). If however z14 is not available then we default to the regular loop implementation. (Note that we can technically use the loop unrolling implementation in all cases where z14 is not available. However, as seen above this technique doesn't scale as well and is also going to be iCache heavy. Thus we cap the loop unrolling technique at arrays of size <= 4 for all scenarios).

@dchopra001
Copy link
Contributor Author

dchopra001 commented Apr 3, 2018

For future reference, I am pasting the assembly for each implementation below:

z13/ loop unrolling technique

LGFI R3, 10 // holds the value to permute
LAY R4, %0 // r4 holds the address to our array! Only needed here because we have to manually encode the register number when loading array into VR

LGB R0, 0(0, R4) // load first byte index into register
SRLG R1, R3, 0(R0) // shift to rightmost position to extract the bit
RISBG R1, R1, 63, 191, 0 // this essentially extracts the bit and sets it into its final position
OGR R2, R1 // OR the extracted bit into its final position

LGB R0, 1(0, R4) // load first byte index into register
SRLG R1, R3, 1(R0) // shift to rightmost position to extract the bit
RISBG R1, R1, 62, 190, 1 // this essentially extracts the bit and sets it into its final position
OGR R2, R1 // OR the extracted bit into its final position

z14CurrentIL:

char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63};
..
LGFI R4, 10
LAY R5, %0 //load indices array address. Not needed in actual sequence, but needed here because we have to manually encode the instructions for now
LGFI R3, 0
VLVG VRF16,R4,0(R3),3 // Store the 64-bit value in R4 into VRF16.The index where to put the value inside VRF16 is inside R3(put inside the 0th position)
VLRL VRF17,0(R5),2 // Start loading array in address referenced by R5 into Byte 15-2 = 13. Thus, byte 13 has 2, 14 has 3, 15 has 1

LAY R6, %1 //load mask array's address
VLRL VRF20,0(R6),15 // load mask array into V20

LAY R6, %1 //load index array's address
VLRL VRF21,0(R6),15 // load index array into V21

VPERM V17,V17,V19,V20 // reverse the order of the array

VS V17,V21,V17 // invert the bits

VBPERM VRF18, VRF16, VRF17 ==>BitPermute the value in V16 by the array in V17 and store the result in V18's 48-63.
LGFI R3, 0x3
VLGV R3,VRF18,0(R3),1 ==>Take the 3rd half word in VRF18 and store it in R3. Right align the half word and zero the rest
NILL R3, 0x7 //VBPERM acts on 16 index values. We must mask the final result to extract the relevant bits if we don't want all 16 values. 

z14NewIL:

LGFI R4, 10
LAY R5, %0 //load indices array address. Not needed in actual sequence, but needed here because we have to manually encode the instructions for now
LGFI R3, 0
VLVG VRF16,R4,0(R3),3 // Store the 64-bit value in R4 into VRF16.The index where to put the value inside VRF16 is inside R3(put inside the 0th position)
VLRL VRF17,0(R5),2 // Start loading array in address referenced by R5 into Byte 15-2 = 13. Thus, byte 13 has 2, 14 has 3, 15 has 1
VBPERM VRF18, VRF16, VRF17 ==>BitPermute the value in V16 by the array in V17 and store the result in V18's 48-63.
LGFI R3, 0x3
VLGV R3,VRF18,0(R3),1 ==>Take the 3rd half word in VRF18 and store it in R3. Right align the half word and zero the rest
NILL R3, 0x7 //VBPERM acts on 16 index values. We must mask the final result to extract the relevant bits if we don't want all 16 values. 

I'll post the benchmark I used personally to collect numbers for these assembly sequences shortly.

@dchopra001
Copy link
Contributor Author

For future reference, I wanted to add some context as to why a new IL was suggested.
There are two main issues with the current definition of the BitPermute IL when it is implemented on a Big Endian Architecture like Z.

Background:

On the IBM Z architecture, bit positions inside a register are numbered left to right. Meanwhile, on a Little Endian Architecture like Intel's X86 bit positions in a register are numbered right to left. The two images below show this.

Big Endian (IBM Z):
screen shot 2018-04-05 at 4 49 10 pm

x86(Intel):
screen shot 2018-04-05 at 4 49 37 pm

As can be seen above, for IBM Z's Big Endian architecture the MSB is in bit position 0. Meanwhile for the Little Endian case, the LSB is in bit position 0. Bytes within a register are also numbered in this fashion. The 0th byte on Z register is the leftmost byte, while the 0th byte on an x86 register will be the rightmost byte.

This leads into the first problem: If we wanted to extract the 0th bit position (as per the bit permute IL), the result would be different based on which architecture we are on.

The second problem is that the VBPERM instruction that maps to the bit permute IL well operates under this big endianness assumption as well. It permutes the bits in a value using an array of 16 byte-sized integers (the array is loaded into a register). The result of 0th byte permutation is placed inside the most significant bit of the result value. This is incorrect as we want the result of the 0th byte permutation to be placed inside the LSB of the result.

To solve these two problems, a new IL was proposed ("BigEndianIL"): One where the indices in the array are subtracted from 63 (so the correct positions are selected) and the order of the array is reversed (so the 0th byte in the original array will specify the value to put in the LSB of the result register).

@dchopra001
Copy link
Contributor Author

Benchmark:
benchmark.zip

@dchopra001
Copy link
Contributor Author

The failure in the build doesn't seem related. Here's the failure message:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

@fjeremic
Copy link
Contributor

fjeremic commented Apr 5, 2018

@dchopra001 whenever this is ready for final review please remove the WIP and we'll go through it again.

@dchopra001
Copy link
Contributor Author

@fjeremic It should be ready for final review now. I've already addressed all requested changes from the last review. Removing the WIP tag.

@dchopra001 dchopra001 changed the title WIP: Add the bitPermute Evaluator for Z Add the bitPermute Evaluator for Z Apr 5, 2018
///
/// More info on the measurements is available here: https://github.com/eclipse/omr/pull/2330#issuecomment-378380538
/// The differences between the generated code for the different
/// implementations can be seen here: https://github.com/eclipse/omr/pull/2330#issuecomment-378387516
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doxygen does not do anything for local comments so the triple slash is unnecessary for this entire change. Can we revert to double slash?

Copy link
Contributor Author

@dchopra001 dchopra001 Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
/// Use Z14's VBPERM instruction if possible. (Note: VBPERM supports permutation on arrays
/// of upto size 16, and beats the performance the loop unrolling technique used above
/// for arrays of size 5 and greater)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be z14 and up to. I would avoid referencing hardcoded constants (of previously declared variables) in comments such as "5". It makes refactoring easier should this constant ever change as comments do not need to be updated, i.e. it avoids eventually having stale comments which can be quite deceptive to future readers of this code.

Copy link
Contributor Author

@dchopra001 dchopra001 Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

9410f34

I've instead referred to the constant that holds the threshold value now. That should prevent stale comments from occurring.

arrayLen <= 16 )
{
char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space missing between second last element.

Copy link
Contributor Author

@dchopra001 dchopra001 Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// The result of the VBPERM op is now in resultReg's 48-63 bit locations. Extract the bits you want via a mask
// and store the final result in resultReg. This is necessary because VBPERM operates on 16 bit positions at a time.
// If the array specified by the bitPermute IL was smaller than 16, then invalid bits can be selected.
int resultMask = (1 << arrayLen) - 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resultMask should be of type int32_t to match that of arrayLen.

Copy link
Contributor Author

@dchopra001 dchopra001 Apr 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fjeremic
Copy link
Contributor

fjeremic commented Apr 6, 2018

Looks good overall. Just a couple of minor things to address and we can merge this. Good work!

@dchopra001
Copy link
Contributor Author

dchopra001 commented Apr 6, 2018

@fjeremic Let me know when you're ready to merge. I'll do a squash + rebase before this gets merged.

@fjeremic
Copy link
Contributor

fjeremic commented Apr 6, 2018

@dchopra001 latest changes look good. Please squash the commits and we'll merge.

@genie-omr build all

Enable support for and add an evaluator for the BitPermute IL for the
Z code generator. Currently the IL is only supported for 64-bit.

Signed-off-by: Dhruv Chopra <Dhruv.C.Chopra@ibm.com>
@dchopra001
Copy link
Contributor Author

@fjeremic This should be good to merge now.

@fjeremic
Copy link
Contributor

@genie-omr build all

@fjeremic fjeremic merged commit a1e32cc into eclipse:master Apr 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants