Add the bitPermute Evaluator for Z #2330

dchopra001 · 2018-02-20T23:14:51Z

Implement the bitPermute Evaluator for z/codegen

Signed-off-by: Dhruv Chopra Dhruv.C.Chopra@ibm.com

dchopra001 · 2018-02-20T23:36:09Z

Leaving this as a WIP for now as there are a couple of MemoryReference related things I need to verify and we might be able to come up with a better sequence of instructions to perform this bitPermute operation after @fjeremic takes a look.

dchopra001 · 2018-02-20T23:55:14Z

Functionality of the bitPermute opcode is summarized well by this comment:
#2156 (comment)

For comparison on Z, I generated what the IL and subsequent instructions look like on Z when bitPermute is not enabled on the platform (this is with no optimizations running, lastOptIndex=0):

 21    {
 22       final byte [] bitPositions =  {.......};
 23       int result = 0;
 24
 25       for (int i = 0; i < 64; i++)
 26          {
 27          result |= ((value >> bitPositions[i]) & 1) << i;
 28          }
 29       return result;
 30    }```

The corresponding IL and generated instructions for `result |= ((value >> bitPositions[i]) & 1) << i;` looks as below:

``` 7542  n425n    (  0)  istore  <auto slot 2>[#605  Auto] [flags 0x3 0x0 ] 
 7543  n424n    (  0)    ior (in GPR_0381)
 7544  n405n    (  0)      iload  <auto slot 2>[#605  Auto] [flags 0x3 0x0 ]
 7545  n423n    (  0)      ishl (in GPR_0381)
 7546  n421n    (  0)        iand (in GPR_0377)
 7547  n419n    (  0)          ishr (in GPR_0377)
 7548  n406n    (  0)            iload  <parm 0 I>[#603  Parm] 
 7549  n418n    (  0)            b2i (in GPR_0379) (unneededConv )
 7550  n417n    (  0)              bloadi  <array-shadow>[#473  Shadow] [flags 0x80000601 0x0 ] (in GPR_0379) (signExtendedTo32BitAtSource )
 7551  n416n    (  0)                aladd (internalPtr sharedMemory )
 7552  n407n    (  0)                  ==>aload (in &GPR_0371) (X!=0 sharedMemory )
 7553  n415n    (  0)                  ladd
 7554  n414n    (  0)                    i2l (in GPR_0375) (X>=0 unneededConv )
 7555  n408n    (  0)                      ==>iload (in GPR_0375) (signExtendedTo64BitAtSource )
 7556  n413n    (  0)                    lconst 8
 7557  n420n    (  0)          iconst 1
 7558  n422n    (  0)        iload  <auto slot 3>[#606  Auto] [flags 0x3 0x0 ] (in GPR_0382)

 7561  [     0x3ffdb1b3a70]   L       GPR_0377, Parm[<parm 0 I>] ?+0(GPR5) 
 7562  [     0x3ffdb1b3ce0]   LB      GPR_0379, Shadow[<array-shadow>] 8(GPR_0375,&GPR_0371) 
 7563  [     0x3ffdb1b3dd0]   NILL    GPR_0379,0x1f  
 7564  [     0x3ffdb1b3f90]   SRA     GPR_0377,#614 0(GPR_0379)
 7565  [     0x3ffdb1b4090]   NILF    GPR_0377,1 
 7566  [     0x3ffdb1b43b0]   L       GPR_0382, Auto[<auto slot 3>] ?+0(GPR5)
 7567  [     0x3ffdb1b44a0]   NILL    GPR_0382,0x1f
 7568  [     0x3ffdb1b4660]   SLLK    GPR_0381,GPR_0377,#615 0(GPR_0382)
 7569  [     0x3ffdb1b4860]   O       GPR_0381, Auto[<auto slot 2>] ?+0(GPR5) 
 7570  [     0x3ffdb1b4a50]   ST      GPR_0381, Auto[<auto slot 2>] ?+0(GPR5)

And the instructions for the entire bitPermute operation (i.e the for loop):

7945  [     0x3ffdb1b0710]   Label L0018:
 7946  [     0x3ffdb1b0810]   FENCE   Relative [ 000003FFDB050220 ] BBStart <block_3> (in loop 3)
 7947  [     0x3ffdb1b0bf0]   CHSI     Auto[<auto slot 3>] ?+0(GPR5),0x0020
 7948  [     0x3ffdb1b0cf0]   BRC     BNM(0xa), Label L0017
 7949  [     0x3ffdb1b1000]   FENCE   Relative [ 000003FFDB050224 ] BBEnd </block_3>
 7950  [     0x3ffdb1b1260]   ASSOCREGS
 7951  [     0x3ffdb1b1400]   FENCE   Relative [ 000003FFDB051710 ] BBStart <block_4> (in loop 3)
 7952  [     0x3ffdb1b1860]   CGHSI    MethodMeta[stackOverflowMark] ?+80(GPR13),0xffff
 7953  [     0x3ffdb1b1960]   BRC     BH(0x8), Outlined Label L0080    ; Branch to OOL asyncCheck sequence
 7954  [     0x3ffdb1b2190]   Label L0081:     ; OOL asyncCheck return label
 7955  [     0x3ffdb1b2860]   LGAT    &GPR_0371, Auto[<auto slot 1>] ?+0(GPR5)
 7956  [     0x3ffdb1b2f00]   LGF     GPR_0375, Auto[<auto slot 3>] ?+0(GPR5)
 7957  [     0x3ffdb1b30f0]   CLT     GPR_0375,0xa, Shadow[<contiguous-array-size>] 4(&GPR_0371)
 7958  [     0x3ffdb1b3a70]   L       GPR_0377, Parm[<parm 0 I>] ?+0(GPR5)
 7959  [     0x3ffdb1b3ce0]   LB      GPR_0379, Shadow[<array-shadow>] 8(GPR_0375,&GPR_0371)
 7960  [     0x3ffdb1b3dd0]   NILL    GPR_0379,0x1f
 7961  [     0x3ffdb1b3f90]   SRA     GPR_0377,#614 0(GPR_0379)
 7962  [     0x3ffdb1b4090]   NILF    GPR_0377,1
 7963  [     0x3ffdb1b43b0]   L       GPR_0382, Auto[<auto slot 3>] ?+0(GPR5)
 7964  [     0x3ffdb1b44a0]   NILL    GPR_0382,0x1f
 7965  [     0x3ffdb1b4660]   SLLK    GPR_0381,GPR_0377,#615 0(GPR_0382)
 7966  [     0x3ffdb1b4860]   O       GPR_0381, Auto[<auto slot 2>] ?+0(GPR5)
 7967  [     0x3ffdb1b4a50]   ST      GPR_0381, Auto[<auto slot 2>] ?+0(GPR5)
 7968  [     0x3ffdb1b5620]   ASI      Auto[<auto slot 3>] ?+0(GPR5), 1
 7969  [     0x3ffdb1b5990]   BRC     NOP(0xf), Label L0018

looks like ~18 instructions in the loop

dchopra001 · 2018-02-21T00:18:22Z

The implemented evaluator here will generate the following set of instructions to do the entire bitPermute operation if the array length is constant:

IL:

 721  n3n      (  0)  ireturn 
 722  n4n      (  0)    ibitpermute (in GPR_0020) () 
 723  n9n      (  0)      iload
 724  n6n      (  0)      aconst 0x3ffffdbc798 (in GPR_0018) (X!=0 ) 
 725  n7n      (  0)      iconst 32 (X!=0 X>=0 )

Instructions:

728  [        0xc103f0a0]   LLIHF   GPR_0018,1023
 729  [        0xc103f190]   IILF    GPR_0018,-2373736
 730  [        0xc103f400]   XGR     GPR_0020,GPR_0020
 ...
...
737  [        0xc103fb50]   XGR     GPR_0019,GPR_0019
 738  [        0xc103fce0]   LB      GPR_0019,#586 1(GPR_0018)
 739  [        0xc103fdb0]   NILL    GPR_0019,0x1f
 740  [        0xc103ff40]   SRLG    GPR_0019,GPR_0016,#587 0(GPR_0019)
 741  [        0xc1040010]   NILF    GPR_0019,1
 742  [        0xc1040100]   SLLG    GPR_0019,GPR_0019,1
 743  [        0xc10401d0]   OGR     GPR_0020,GPR_0019

A similar sequence to that of lines 737-743 is generated 30 more times (constant length array size=32). This approach avoids branching that the old sequence of instructions perform and has less memory operations compared to before.

In the case where we don't the array length is not a constant, the generated instructions are similar but now require branching. Instructions generated by the evaluator for the above IL will be:

 740  [        0xc0c69910]   LLIHF   GPR_0019,1023
 741  [        0xc0c69a00]   IILF    GPR_0019,-5896280
 742  [        0xc0c69c70]   XGR     GPR_0021,GPR_0021

 743  [        0xc0c69f10]   LTGR    GPR_0022,GPR_0016

 744  [        0xc0c69fe0]   Label L0016:    # (Start of internal control flow)
 745  [        0xc0c6a0b0]   AFI     GPR_0022,-1
 746  [        0xc0c6a1a0]   BRC     MASK5(0x4), Label L0017
 747  [        0xc0c6a270]   XGR     GPR_0020,GPR_0020
 748  [        0xc0c6a400]   LB      GPR_0020,#585 0(GPR_0022,GPR_0019)
 749  [        0xc0c6a4d0]   NILL    GPR_0020,0x1f
 750  [        0xc0c6a660]   SRLG    GPR_0020,GPR_0017,#586 0(GPR_0020)
 751  [        0xc0c6a730]   NILF    GPR_0020,1
 752  [        0xc0c6a8e0]   SLLG    GPR_0020,GPR_0020,#587 0(GPR_0022)
 753  [        0xc0c6a9b0]   OGR     GPR_0021,GPR_0020
 754  [        0xc0c6aa80]   BRC     (null)(0xf), Label L0016
 755  [        0xc0c6ad60]   ASSOCREGS
 756  [        0xc0c6ab50]   Label L0017:    # (End of internal control flow)

~10 instructions repeated in loop.

dchopra001 · 2018-02-21T00:21:12Z

@fjeremic Let me know if you can spot an area where we can use more optimal instructions.

fjeremic · 2018-02-21T00:03:44Z

compiler/z/codegen/OMRTreeEvaluator.cpp

@@ -14117,6 +14117,168 @@ bool isSettingOp(uint8_t byteValue, TR::InstOpCode::Mnemonic SI_opcode)
   }


+


Let's clean up this area of code and replace the triple new line with a single new line.

fjeremic · 2018-02-21T00:04:16Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+TR::Register *
+OMR::Z::TreeEvaluator::bitpermuteEvaluator(TR::Node *node, TR::CodeGenerator *cg)
+   {
+


Can we remove the unnecessary new line?

fjeremic · 2018-02-21T00:06:31Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+OMR::Z::TreeEvaluator::bitpermuteEvaluator(TR::Node *node, TR::CodeGenerator *cg)
+   {
+
+   TR_ASSERT(node->getNumChildren() == 3, "Wrong number of children in bitpermuteEvaluator");


While I have no quarrel with additional asserts, shouldn't this be the job of the IL verifier? i.e. we already have a guard for this here:

omr/compiler/il/ILOpCodeProperties.hpp

Line 11722 in 97107f4

/* .childProperties = */ THREE_CHILD(TR::Int8, TR::Address, TR::Int32),

Removed. I didn't know about the IL Verifier check. This check also exists in the X/codegen evaluator, so it should be removed from there as well.

fjeremic · 2018-02-21T00:09:06Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+
+   bool nodeIs64Bit = node->getSize() == 8;
+   auto valueReg = cg->evaluate(value); // this will either be a constant or a load
+   auto addrReg = cg->evaluate(addr);   // this will either be a constant or a load


I don't think these comments are true. We only know what the data type will be. IMO they are misleading.

fjeremic · 2018-02-21T00:09:24Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+
+   TR::Register *tmpReg = cg->allocateRegister(TR_GPR);
+
+   //Zero result reg


Space needed after // (there are several occurences of this in the PR - let's fix them all and capitalize the first word). Should also move this comment down to the generateRRInstruction.

fjeremic · 2018-02-21T00:16:14Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+
+      if (!(node->getDataType() == TR::Int32 || node->getDataType() == TR::Int16 || node->getDataType() == TR::Int8 || node->getDataType() == TR::Int64))
+         {
+         TR_ASSERT(0, " Unsupported DataType for bitPermute op");


Similar to above.

fjeremic · 2018-02-21T00:29:13Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+         generateRRInstruction(cg, TR::InstOpCode::getXORRegOpCode(), node, tmpReg, tmpReg);
+
+         //load the shift amount into tmpReg
+         //TODO: Check to see if we need sourceMR->stopUsingMemRefRegister(cg)


Technically yes, though in practice we have code that handles this automatically.

Ah okay. Then I'll remove these from here.

fjeremic · 2018-02-21T00:30:17Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+      auto lengthReg = cg->evaluate(length);
+      auto indexReg = cg->allocateRegister(TR_GPR);
+
+      TR::RegisterDependencyConditions* deps = generateRegisterDependencyConditions((uint8_t)0, (uint8_t)4, cg);


Do we need the casts in the arguments?

fjeremic · 2018-02-21T00:32:32Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+
+      // Load array length into index and test to see if it's already zero
+      // cc=0 if register value is 0, cc=1 if value < 1, cc=2 if value>0
+      generateRRInstruction(cg, TR::InstOpCode::LTGR, node, indexReg, lengthReg); 


Do we need to be careful here if running on 31-bit? What is the type of lengthReg on 31-bit?

Actually I think we can get away with using an LTR here because lengthReg is always a TR::Int32 as per the IL.

Correction: the correct implementation here would be:
generateRRInstruction(cg, TR::InstOpCode::getLoadTestRegOpCode(), node, indexReg, lengthReg);

where the getLoadTestRegOpCode() query is defined as below:

omr/compiler/z/codegen/InstOpCode.cpp

Lines 254 to 255 in 58194bc

TR::InstOpCode::Mnemonic

OMR::Z::InstOpCode::getLoadTestRegOpCode() { return TR::Compiler->target.is64Bit() ? TR::InstOpCode::LTGR : TR::InstOpCode::LTR; }

This was done incorrectly before. The correct query was added here: f1f0697

fjeremic · 2018-02-21T00:38:20Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+      generateS390LabelInstruction(cg,TR::InstOpCode::LABEL, node, startLabel);
+
+      // now subtract 1 from indexReg 
+      generateRILInstruction(cg, node->getDataType() == TR::Int64 ? TR::InstOpCode::AGFI : TR::InstOpCode::AFI, node, indexReg, -1);


Can we use

generateS390ImmOp(cg, TR::InstOpCode::getAddOpCode(), node, indexReg, indexReg, -1);

here instead and let the helper pick the best instruction? All add instructions produce the same condition code results.

fjeremic · 2018-02-21T00:43:44Z

I'll stop the review here and let you do a first round of cleanup before reviewing again.

fjeremic · 2018-02-21T00:47:40Z

A similar sequence to that of lines 737-743 is generated 30 more times (constant length array size=32). This approach avoids branching that the old sequence of instructions perform and has less memory operations compared to before.

We should definitely not do this. The amount of icache we will consume will be way worse than using a loop with a branch on count. This will be well predicted. In the case of constant length do we know what the typical constants that get passed in are? Is it always 8, 16, 32, or 64, or some other value? Perhaps we can have the best of both worlds and unroll the loop manually a few times via Duff's device or unroll fully for say maximum length of 4, otherwise fall back to a single iteration loop.

fjeremic · 2018-02-21T01:04:46Z

Some more reviews on the generated code:

 740  [        0xc0c69910]   LLIHF   GPR_0019,1023
 741  [        0xc0c69a00]   IILF    GPR_0019,-5896280
 742  [        0xc0c69c70]   XGR     GPR_0021,GPR_0021

 743  [        0xc0c69f10]   LTGR    GPR_0022,GPR_0016

 744  [        0xc0c69fe0]   Label L0016:    # (Start of internal control flow)
 745  [        0xc0c6a0b0]   AFI     GPR_0022,-1
 746  [        0xc0c6a1a0]   BRC     MASK5(0x4), Label L0017
 747  [        0xc0c6a270]   XGR     GPR_0020,GPR_0020
// Why is the above XGR needed? Can't we use LGB below instead?
 748  [        0xc0c6a400]   LB      GPR_0020,#585 0(GPR_0022,GPR_0019)
 749  [        0xc0c6a4d0]   NILL    GPR_0020,0x1f
// Do we really need this NILL? Can't whoever is generating this array make sure this is the case beforehand so we don't need to mask here?
 750  [        0xc0c6a660]   SRLG    GPR_0020,GPR_0017,#586 0(GPR_0020)
 751  [        0xc0c6a730]   NILF    GPR_0020,1
// Why do we need to shift left and then shift right here? Can't we just keep the shift mask in a register and keep shifting it right and ANDing? Seems like an extra unnecessary shift.
 752  [        0xc0c6a8e0]   SLLG    GPR_0020,GPR_0020,#587 0(GPR_0022)
 753  [        0xc0c6a9b0]   OGR     GPR_0021,GPR_0020
 754  [        0xc0c6aa80]   BRC     (null)(0xf), Label L0016
 755  [        0xc0c6ad60]   ASSOCREGS
 756  [        0xc0c6ab50]   Label L0017:    # (End of internal control flow)

I feel like there is a use case for RISBG instruction in this algorithm. Have you explored that? More importantly this algorithm can be implemented in a few instructions using VBPERM on z14. We definitely need a z14 path here as I don't think any GPR solution will beat a vector one.

dchopra001 · 2018-02-22T15:04:50Z

We should definitely not do this. The amount of icache we will consume will be way worse than using a loop with a branch on count. This will be well predicted.

Sounds like that might be a better option. However this was done in reference to the original X implementation (So we will need to change it there as well). As mentioned in the comment I linked (#2156 (comment)), this was considered to be an optimization for a common case under JProfiling:

This change includes an X86 implementation, which consists of a series of bit tests and sets. The current implementation has an optimization for constant length arrays, a common case under JProfiling.

In the case of constant length do we know what the typical constants that get passed in are? Is it always 8, 16, 32, or 64?

I don't observe any patterns in how the tril tests are written for this. But they seem to be testing the boundaries. And I'm not familiar with how JProfiling works exactly to know what the typical constants would be.

// Why is the above XGR needed? Can't we use LGB below instead?

The XGR was used to zero out tmpReg because tmpReg is OR'ed with resultReg at the end of every loop iteration, so it's always a non-zero value. I think with LGB we will be able to avoid this though. I'll test it out.

// Do we really need this NILL? Can't whoever is generating this array make sure this is the case beforehand so we don't need to mask here?

I addressed this in one of your code review comments. I agree, we can save two instructions by removing this. But I'm not sure who the right person is that should perform this check. There are other areas in the JIT where we do this masking. And the architecture doesn't say anything about considering only the first 5 bits (it considers the first 6, hence why we never mask for 64 bit shifts).

dchopra001 · 2018-02-22T15:08:54Z

@fjeremic I've addressed your comments. PTAL.

fjeremic · 2018-02-22T15:53:52Z

I don't observe any patterns in how the tril tests are written for this.

I don't think tril tests will tell you common use patterns. The point of tril tests is to have maximal coverage and test every single possibility that may arise. It does not tell us which ones are actually common in production code. Since this IL is currently only used for profiling we need to understand the common number of "things" we profile at a time. Surely we don't profile more than a handful of values at a single site, but quantitatively how many is "handful"?

fjeremic · 2018-02-22T15:58:40Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+   TR::Node *addr = node->getChild(1);
+   TR::Node *length = node->getChild(2);
+
+   bool nodeIs64Bit = node->getSize() == 8;


This variable is unused. Can we remove?

Addressed in 537a564

fjeremic · 2018-02-22T15:59:38Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+      TR::MemoryReference * tempMR = generateS390MemoryReference(tmpReg, 0, cg);
+      generateRSInstruction(cg, TR::InstOpCode::getShiftRightLogicalSingleOpCode(), node, tmpReg, valueReg, tempMR); 
+
+      if (!(node->getDataType() == TR::Int64))


This path has an if and an else. Can we swap the blocks to avoid the negative expression in the if clause?

Addressed in dcdefce

fjeremic · 2018-02-22T16:01:06Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+         // This is equivalent to doing a `tmpReg & 0x1`. But on 64-bit we would have to use
+         // two AND immediate instructions. So instead we use a single RISBG instruction.          
+         TR::InstOpCode::Mnemonic opCode = cg->getS390ProcessorInfo()->supportsArch(TR_S390ProcessorInfo::TR_zEC12) ? TR::InstOpCode::RISBGN : TR::InstOpCode::RISBG;
+         generateRIEInstruction(cg, opCode, node, tmpReg, tmpReg, 63, 191, 0);


Are we able to use generateShiftAndKeepSelected64Bit here? Also what if we're on a z196? This would cause a SIGILL.

I'm not sure I follow what you mean. Why would this cause a SIGILL on z196? The RISBG Instruction is available on it from what what I can see.

Sorry I meant to say z9, not z196. Apologies for the mistake. RISBG is only supported on z10 onward:

omr/compiler/z/codegen/ArchInstOpCodeEncoding.hpp

Line 489 in 58194bc

{0xEC, 0x55, RIE_FORMAT, TR_S390ProcessorInfo::TR_z10}, // RISBG

We should use the above API and not worry about this (it already handles the z9 case).

Adressed in ef34b3e

dchopra001 · 2018-02-23T01:37:50Z

747  [        0xc0c6a270]   XGR     GPR_0020,GPR_0020
// Why is the above XGR needed? Can't we use LGB below instead?
 748  [        0xc0c6a400]   LB      GPR_0020,#585 0(GPR_0022,GPR_0019)

As mentioned, the XGR was used to zero out the register and then I used an LB to load a byte into an empty register. An LGB accomplishes the same thing, so I've replaced the two instructions with LGB: 400f261

dchopra001 · 2018-02-23T01:55:28Z

750  [        0xc0c6a660]   SRLG    GPR_0020,GPR_0017,#586 0(GPR_0020)
 751  [        0xc0c6a730]   NILF    GPR_0020,1
// Why do we need to shift left and then shift right here? Can't we just keep the shift mask in a register and keep shifting it right and ANDing? Seems like an extra unnecessary shift.
 752  [        0xc0c6a8e0]   SLLG    GPR_0020,GPR_0020,#587 0(GPR_0022)

I'm not sure I follow your comment here. Would you be able to provide an example?

To give a bit of background, the first two instructions are trying to do what a bt and setb instruction do on x86 (test a bit at a position specified by a register, then set a bit in a specified register based on the test results). The left shift is done in order to prepare the value to be OR'ed at the correct position in the final result. The amount shifted left is not equal to the amount shifted right.

dchopra001 · 2018-02-23T02:01:42Z

 744  [        0xc0c69fe0]   Label L0016:    # (Start of internal control flow)
 745  [        0xc0c6a0b0]   AFI     GPR_0022,-1
 746  [        0xc0c6a1a0]   BRC     MASK5(0x4), Label L0017
 747  [        0xc0c6a270]   XGR     GPR_0020,GPR_0020
// Why is the above XGR needed? Can't we use LGB below instead?
 748  [        0xc0c6a400]   LB      GPR_0020,#585 0(GPR_0022,GPR_0019)
 749  [        0xc0c6a4d0]   NILL    GPR_0020,0x1f
// Do we really need this NILL? Can't whoever is generating this array make sure this is the case beforehand so we don't need to mask here?
 750  [        0xc0c6a660]   SRLG    GPR_0020,GPR_0017,#586 0(GPR_0020)
 751  [        0xc0c6a730]   NILF    GPR_0020,1
// Why do we need to shift left and then shift right here? Can't we just keep the shift mask in a register and keep shifting it right and ANDing? Seems like an extra unnecessary shift.
 752  [        0xc0c6a8e0]   SLLG    GPR_0020,GPR_0020,#587 0(GPR_0022)
 753  [        0xc0c6a9b0]   OGR     GPR_0021,GPR_0020
 754  [        0xc0c6aa80]   BRC     (null)(0xf), Label L0016
 755  [        0xc0c6ad60]   ASSOCREGS
 756  [        0xc0c6ab50]   Label L0017:    # (End of internal control flow)

There may be a possible point of improvement here. If the result of NILF GPR_0020,1 is 0, the SLLG and OGR are effectively NOPs. We would be better off branching back(I can do this by adding another BRC after the NILF) to the top of the internal control flow in those cases (we end up executing two fewer instructions). The penalty however would be that if the result of the NILF GPR_0020,1 is not 0, then we will end up executing 4 instructions as opposed to the 3 right now.

fjeremic · 2018-02-23T15:24:23Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+         generateRXYInstruction(cg, TR::InstOpCode::LGB, node, tmpReg, sourceMR);
+
+         // Create memory reference using tmpReg (shift amount), then shift valueReg by shift amount
+         TR::MemoryReference * tempMR = generateS390MemoryReference(tmpReg, 0, cg);


Can we get a better name for this memory reference (similarly in the else path)? Perhaps shiftAmountMR? Also would it be possible to fix the pointer location inconsistencies throughout this change? I'm seen all three flavors of type* name, type * name, and type *name. Let's stick to one.

Addressed in: 62c4af1

fjeremic · 2018-02-23T15:28:33Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+            // This will generate a RISBG instruction (if supported).
+            // This is equivalent to doing a `tmpReg & 0x1`. But on 64-bit we would have to use
+            // two AND immediate instructions. So instead we use a single RISBG instruction. 
+            generateShiftAndKeepSelected64Bit(node, cg, tmpReg, tmpReg, 63, 63, 0, true, false);


Why can't we carry out the shift and AND operation using a single RISBG?

Ah, just saw RISBG type instructions do not have register flavors. I don't think this would be possible then.

Yes, to address your earlier question as well, I did explore the use of RISBG, but due to shift amounts being specified as immediates, it wasn't possible. I was thinking that the RISBG could be used as a way to preserve a specific bit (where the bit position is inside a register) and then we can shift it right accordingly. However due to no register flavours, this wasn't possible.

But taking a second look, there may still an opportunity for the constant length case. The left shift (done after the RISBG) amount is a constant here:
generateRSInstruction(cg, TR::InstOpCode::getShiftLeftLogicalSingleOpCode(), node, tmpReg, tmpReg, x);
And we can specify the left shift amount (x) as a constant via RISBG. So we may not need this final shift and can do it via RISBG.

I'll test this out.

Okay I tested this out and it looks like it's working. For the constant length case, we now have a guaranteed 4 instructions for every "loop iteration". Before it was anywhere from 5-6 instructions. The following code snippet from the commit summarizes what the RISBG instruction accomplishes:

// This will generate a RISBG instruction (if it's supported, otherwise two shift instructions). // A RISBG instruction is equivalent to doing a `(tmpReg & 0x1) << x`. But for a 64-bit value we would have to use // two AND immediate instructions and a shift instruction to do this. So instead we use a single RISBG instruction. generateShiftAndKeepSelected64Bit(node, cg, tmpReg, tmpReg, 63 - x, 63 - x, x, true, false);

In the case where a RISBG instruction is not supported by the architecture, we will get two shift instructions instead, which will result in 5 instructions instead of 4.

The commit for this change is here: c6f746f

fjeremic · 2018-02-23T15:31:02Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+         // Create memory reference using tmpReg (shift amount), then shift valueReg by shift amount
+         TR::MemoryReference * tempMR = generateS390MemoryReference(tmpReg, 0, cg);
+         generateRSInstruction(cg, TR::InstOpCode::getShiftRightLogicalSingleOpCode(), node, tmpReg, valueReg, tempMR); 
+         if (node->getDataType() == TR::Int64)


Do we need to handle register pairs on 31-bit here?

As discussed, we are no longer handling the 31-bit case. So the bitPermute IL will be disabled on those architectures. This is addressed here: 1807405

fjeremic · 2018-02-23T16:09:49Z

We would be better off branching back(I can do this by adding another BRC after the NILF) to the top of the internal control flow in those cases (we end up executing two fewer instructions).

Introducing a branch here would have negative effect on performance as these operations are very fast. A shift and an OR is not expensive at all.

I'm not sure I follow your comment here. Would you be able to provide an example?

I assumed there was a flavor of RISBG type instructions work with registers. But this is not the case. It seems they all work on immediate. I rescind my earlier comment.

fjeremic · 2018-02-23T16:13:36Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+   if (length->getOpCode().isLoadConst())
+      {
+      // Manage the constant length case
+      int arrayLen = length->getIntegerNodeValue<int>();


Try to avoid using primitive types as they have no guarantees on lower bound width. Use int32_t instead.

Addressed in: 62c4af1

fjeremic · 2018-02-23T16:16:12Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+   // Zero result reg
+   generateRRInstruction(cg, TR::InstOpCode::getXORRegOpCode(), node, resultReg, resultReg);
+
+   if (length->getOpCode().isLoadConst())


I'd like to see this become if (length->getOpCode().isLoadConst() && length->getIntegerNodeValue<int32_t>() <= bitPermuteConstantUnrollThreshold) where this static const is defined to be some sane value so we don't end up unrolling 32 or 64 times in the worst cases. We need to determine by the use case what that threshold should be.

fjeremic · 2018-02-23T16:19:20Z

Final note before merging I'd like to see a z14 path using VBPERM here as this will have the best performance and path length possible on today's architecture.

fjeremic · 2018-02-27T21:30:04Z

compiler/z/codegen/OMRCodeGenerator.cpp

+      return true;
+      }
+
+   return false;


Why not just:

return TR::Compiler->target.is64Bit() || self()->use64BitRegsOn32Bit();

Done:
e3cd762

dchopra001 · 2018-04-03T14:46:23Z

@fjeremic I've addressed the last set of changes requested. (z14 path is in 6652f2e). PTAL.

fjeremic · 2018-04-03T14:47:43Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+
+   TR::Register *tmpReg = cg->allocateRegister(TR_GPR);
+
+   static const int8_t BIT_PERMUTE_CONSTANT_UNROLL_THRESHOLD = 4;


This is a local variable and should be named as such (bitPermuteConstantUnrollThreshold). I'd like to see some documentation about this variable and what it means along with why it is currently defined to be 4. If you've done empirical measurements to find out the best value I'd like to see the test posted to either the PR or a new issue and the comment here should link to where future users may reference and reproduce your results.

Renamed in : 95356e1
Code Documentation : 1d71fdf

Comments detailing empirical measurements: #2330 (comment)
#2330 (comment)

fjeremic · 2018-04-03T14:49:56Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+
+   TR::Register *resultReg = cg->allocateRegister(TR_GPR);
+
+


Can we remove the double space?

fjeremic · 2018-04-03T14:50:57Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+   if (length->getOpCode().isLoadConst() && length->getIntegerNodeValue<int32_t>() <= BIT_PERMUTE_CONSTANT_UNROLL_THRESHOLD)
+      {
+      // Manage the constant length case
+      int32_t arrayLen = length->getIntegerNodeValue<int>();


Can we pull this outside of the above if and use arrayLen instead of length->getIntegerNodeValue<int32_t>()? Also this line seems to be extracting an int but the type of the variable is int32_t. Can we fix this?

Addressed here: 95356e1

fjeremic · 2018-04-03T14:51:48Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+   /// for arrays of size 5 and greater)
+   else if (cg->getS390ProcessorInfo()->supportsArch(TR_S390ProcessorInfo::TR_z14) &&
+         length->getOpCode().isLoadConst()  &&
+         length->getIntegerNodeValue<int>() > BIT_PERMUTE_CONSTANT_UNROLL_THRESHOLD &&


This line shouldn't be necessary. It is always true.

Ah yes, some old code I forgot to remove. Fixed here: 5231065

fjeremic · 2018-04-03T14:52:08Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+         length->getIntegerNodeValue<int>() <= 16 )
+      {
+      char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
+      char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63};


These should be marked as const.

As discussed, marking these as const will require a const_cast to remove the constness when calling findOrCreateConstant(node, mask, 16).

The correct solution would be to change those APIs to accept const void * but this PR wouldn't be the right place for that. I'll open an issue to address this larger problem.

fjeremic · 2018-04-03T14:52:23Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+      char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
+      char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63};
+
+      int32_t arrayLen = length->getIntegerNodeValue<int>();


Can reuse this variable after making the above change.

Fixed: 95356e1

fjeremic · 2018-04-03T14:52:51Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+
+      TR::Register *VectorIndices = cg->allocateRegister(TR_VRF); //V17
+      TR::Register *VectorSource = cg->allocateRegister(TR_VRF);  //V16 and V18
+      TR::Register *tmpVector = cg->allocateRegister(TR_VRF); //V20 and V21


What do these comments mean? Also variable names should be camelCase to be consistent with the rest of the codebase.

Stale comments I forgot to remove. Fixed: 76d2d7b

fjeremic · 2018-04-03T14:53:53Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+
+     cg->stopUsingRegister(VectorIndices); 
+     cg->stopUsingRegister(VectorSource); 
+     cg->stopUsingRegister(tmpVector);


There seems to be a space missing here in these 3 lines (alignment with brace).

Fixed: 76d2d7b

fjeremic · 2018-04-03T14:56:45Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+      for (int j = 1; j < arrayLen; j++)
+         {
+         resultMask = ( (resultMask << 1) | 0x1);
+         }


Why not replace the whole loop with resultMask = (1 << arrayLen) - 1; - double check this is correct.

Addressed here: 35f3057

dchopra001 · 2018-04-03T20:11:54Z

I collected some data to see if there's a performance benefit in using the VBPERM instruction on z14 for this use case. Here's the data (each test was run 20 billion times):

Some context:

The first column is for when we unroll the loop. These instructions are available on z13 and z12. This implementation essentially unrolls the loop to do the bitPermute operation. (i.e. We don't use the VBPERM instruction here)
The second column, z14CurrentIL is doing the bitPermute operation for the way the IL is currently designed. (i.e. We use the VBPERM instruction here, but only after we reverse the order of the indices (i.e. array reversal) and subtract each bit index from 63 to fit the big endianness of the Z architecture)

The third column, z14NewIL, is referring to a suggested new IL that is better suited for big endian architectures. (i.e. We use the VBPERM instruction here, without having to do an array reversal and bit subtraction. The IL already provides us an indices array that is "reversed" and each index has been subtracted from 63 already).

Here's a quick plot of the above data:

Some observations:
As expected, the vector implementations scale really well and hence are the preferred way to go for larger array sizes.

The z14NewIL implementation has the best performance all around since it's almost on par with the loop unrolling implementation for small array sizes, and scales better for larger sizes.

The loop unrolling implementation performs better than z14CurrentIL for arrays of size <= 4.

There's a clear difference between the loop unrolling implementation (for z13 and lower) and z14currentIL implementation for small array sizes(~20%). The gap is closed if we used the new proposed IL instead.

So the question is, is it implementing a new IL opcode to better fit Big Endian architectures? The short answer is: Not for now. More below:

Currently the only consumer of the bitPermute IL is OpenJ9's JProfiling feature. For this use case we will almost always permute 2 bits or less (i.e. array size will almost always be 2). In the rare case we might permute 3 bits, but beyond that is very unlikely. The main reason being is that the space required to permute more bits grows exponentially with each addition. So this scenario is unlikely to happen in the future as well.

As seen above, there isn't much of a performance difference between the loop unrolling implementation and z14NewIL implementation for arrays of size <=2. In the rare case where the array size is 3, we would see a ~6% performance difference.. which may be negligible if such a scenario is almost never going to happen anyway. Since JProfiling won't see much of a performance benefit from a new IL, we won't implement it right now.

The bitPermute evaluator will use the loop unrolling technique for arrays of size <= 4 and the VBPERM approach used in z14CurrentIL for arrays of size 5 - 16 (VBPERM can only permute 16 bits at a time). If however z14 is not available then we default to the regular loop implementation. (Note that we can technically use the loop unrolling implementation in all cases where z14 is not available. However, as seen above this technique doesn't scale as well and is also going to be iCache heavy. Thus we cap the loop unrolling technique at arrays of size <= 4 for all scenarios).

dchopra001 · 2018-04-03T20:29:57Z

For future reference, I am pasting the assembly for each implementation below:

z13/ loop unrolling technique

LGFI R3, 10 // holds the value to permute
LAY R4, %0 // r4 holds the address to our array! Only needed here because we have to manually encode the register number when loading array into VR

LGB R0, 0(0, R4) // load first byte index into register
SRLG R1, R3, 0(R0) // shift to rightmost position to extract the bit
RISBG R1, R1, 63, 191, 0 // this essentially extracts the bit and sets it into its final position
OGR R2, R1 // OR the extracted bit into its final position

LGB R0, 1(0, R4) // load first byte index into register
SRLG R1, R3, 1(R0) // shift to rightmost position to extract the bit
RISBG R1, R1, 62, 190, 1 // this essentially extracts the bit and sets it into its final position
OGR R2, R1 // OR the extracted bit into its final position

z14CurrentIL:

char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63};
..
LGFI R4, 10
LAY R5, %0 //load indices array address. Not needed in actual sequence, but needed here because we have to manually encode the instructions for now
LGFI R3, 0
VLVG VRF16,R4,0(R3),3 // Store the 64-bit value in R4 into VRF16.The index where to put the value inside VRF16 is inside R3(put inside the 0th position)
VLRL VRF17,0(R5),2 // Start loading array in address referenced by R5 into Byte 15-2 = 13. Thus, byte 13 has 2, 14 has 3, 15 has 1

LAY R6, %1 //load mask array's address
VLRL VRF20,0(R6),15 // load mask array into V20

LAY R6, %1 //load index array's address
VLRL VRF21,0(R6),15 // load index array into V21

VPERM V17,V17,V19,V20 // reverse the order of the array

VS V17,V21,V17 // invert the bits

VBPERM VRF18, VRF16, VRF17 ==>BitPermute the value in V16 by the array in V17 and store the result in V18's 48-63.
LGFI R3, 0x3
VLGV R3,VRF18,0(R3),1 ==>Take the 3rd half word in VRF18 and store it in R3. Right align the half word and zero the rest
NILL R3, 0x7 //VBPERM acts on 16 index values. We must mask the final result to extract the relevant bits if we don't want all 16 values.

z14NewIL:

LGFI R4, 10
LAY R5, %0 //load indices array address. Not needed in actual sequence, but needed here because we have to manually encode the instructions for now
LGFI R3, 0
VLVG VRF16,R4,0(R3),3 // Store the 64-bit value in R4 into VRF16.The index where to put the value inside VRF16 is inside R3(put inside the 0th position)
VLRL VRF17,0(R5),2 // Start loading array in address referenced by R5 into Byte 15-2 = 13. Thus, byte 13 has 2, 14 has 3, 15 has 1
VBPERM VRF18, VRF16, VRF17 ==>BitPermute the value in V16 by the array in V17 and store the result in V18's 48-63.
LGFI R3, 0x3
VLGV R3,VRF18,0(R3),1 ==>Take the 3rd half word in VRF18 and store it in R3. Right align the half word and zero the rest
NILL R3, 0x7 //VBPERM acts on 16 index values. We must mask the final result to extract the relevant bits if we don't want all 16 values.

I'll post the benchmark I used personally to collect numbers for these assembly sequences shortly.

dchopra001 · 2018-04-05T21:11:17Z

For future reference, I wanted to add some context as to why a new IL was suggested.
There are two main issues with the current definition of the BitPermute IL when it is implemented on a Big Endian Architecture like Z.

Background:

On the IBM Z architecture, bit positions inside a register are numbered left to right. Meanwhile, on a Little Endian Architecture like Intel's X86 bit positions in a register are numbered right to left. The two images below show this.

Big Endian (IBM Z):

x86(Intel):

As can be seen above, for IBM Z's Big Endian architecture the MSB is in bit position 0. Meanwhile for the Little Endian case, the LSB is in bit position 0. Bytes within a register are also numbered in this fashion. The 0th byte on Z register is the leftmost byte, while the 0th byte on an x86 register will be the rightmost byte.

This leads into the first problem: If we wanted to extract the 0th bit position (as per the bit permute IL), the result would be different based on which architecture we are on.

The second problem is that the VBPERM instruction that maps to the bit permute IL well operates under this big endianness assumption as well. It permutes the bits in a value using an array of 16 byte-sized integers (the array is loaded into a register). The result of 0th byte permutation is placed inside the most significant bit of the result value. This is incorrect as we want the result of the 0th byte permutation to be placed inside the LSB of the result.

To solve these two problems, a new IL was proposed ("BigEndianIL"): One where the indices in the array are subtracted from 63 (so the correct positions are selected) and the order of the array is reversed (so the 0th byte in the original array will specify the value to put in the LSB of the result register).

dchopra001 · 2018-04-05T21:26:35Z

Benchmark:
benchmark.zip

dchopra001 · 2018-04-05T21:28:16Z

The failure in the build doesn't seem related. Here's the failure message:

No output has been received in the last 10m0s, this potentially indicates a stalled build or something wrong with the build itself.

fjeremic · 2018-04-05T21:42:15Z

@dchopra001 whenever this is ready for final review please remove the WIP and we'll go through it again.

dchopra001 · 2018-04-05T21:44:44Z

@fjeremic It should be ready for final review now. I've already addressed all requested changes from the last review. Removing the WIP tag.

fjeremic · 2018-04-06T02:35:06Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+   ///
+   /// More info on the measurements is available here: https://github.com/eclipse/omr/pull/2330#issuecomment-378380538
+   /// The differences between the generated code for the different 
+   /// implementations can be seen here: https://github.com/eclipse/omr/pull/2330#issuecomment-378387516


Doxygen does not do anything for local comments so the triple slash is unnecessary for this entire change. Can we revert to double slash?

fjeremic · 2018-04-06T02:36:19Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+      }
+   /// Use Z14's VBPERM instruction if possible. (Note: VBPERM supports permutation on arrays
+   /// of upto size 16, and beats the performance the loop unrolling technique used above 
+   /// for arrays of size 5 and greater)


Should be z14 and up to. I would avoid referencing hardcoded constants (of previously declared variables) in comments such as "5". It makes refactoring easier should this constant ever change as comments do not need to be updated, i.e. it avoids eventually having stale comments which can be quite deceptive to future readers of this code.

9410f34

I've instead referred to the constant that holds the threshold value now. That should prevent stale comments from occurring.

fjeremic · 2018-04-06T02:39:10Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+         arrayLen <= 16 )
+      {
+      char mask[16] = {15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
+      char inverse[16] = {63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63,63, 63};


Space missing between second last element.

fjeremic · 2018-04-06T02:40:16Z

compiler/z/codegen/OMRTreeEvaluator.cpp

+      // The result of the VBPERM op is now in resultReg's 48-63 bit locations. Extract the bits you want via a mask
+      // and store the final result in resultReg. This is necessary because VBPERM operates on 16 bit positions at a time.
+      // If the array specified by the bitPermute IL was smaller than 16, then invalid bits can be selected. 
+      int resultMask = (1 << arrayLen) - 1;


resultMask should be of type int32_t to match that of arrayLen.

fjeremic · 2018-04-06T02:43:15Z

Looks good overall. Just a couple of minor things to address and we can merge this. Good work!

dchopra001 · 2018-04-06T04:47:52Z

@fjeremic Let me know when you're ready to merge. I'll do a squash + rebase before this gets merged.

fjeremic · 2018-04-06T14:09:44Z

@dchopra001 latest changes look good. Please squash the commits and we'll merge.

@genie-omr build all

Enable support for and add an evaluator for the BitPermute IL for the Z code generator. Currently the IL is only supported for 64-bit. Signed-off-by: Dhruv Chopra <Dhruv.C.Chopra@ibm.com>

dchopra001 · 2018-04-12T17:09:09Z

@fjeremic This should be good to merge now.

fjeremic · 2018-04-12T19:01:44Z

@genie-omr build all

dchopra001 force-pushed the bitPermuteZEvaluator branch from b07cbb0 to 9601a73 Compare February 20, 2018 23:32

dchopra001 mentioned this pull request Feb 20, 2018

WIP: Add the bitPermute Evaluator for Z #2329

Closed

fjeremic reviewed Feb 21, 2018

View reviewed changes

fjeremic added the comp:compiler label Feb 21, 2018

fjeremic self-assigned this Feb 21, 2018

dchopra001 force-pushed the bitPermuteZEvaluator branch from 9601a73 to 0233803 Compare February 22, 2018 15:04

dchopra001 force-pushed the bitPermuteZEvaluator branch from 0233803 to ceca9a7 Compare February 22, 2018 15:08

fjeremic reviewed Feb 22, 2018

View reviewed changes

dchopra001 mentioned this pull request Feb 23, 2018

Remove instruction that masks the shift amount #2320

Open

fjeremic reviewed Feb 23, 2018

View reviewed changes

fjeremic reviewed Feb 27, 2018

View reviewed changes

fjeremic reviewed Apr 3, 2018

View reviewed changes

dchopra001 changed the title ~~WIP: Add the bitPermute Evaluator for Z~~ Add the bitPermute Evaluator for Z Apr 5, 2018

fjeremic reviewed Apr 6, 2018

View reviewed changes

fjeremic approved these changes Apr 6, 2018

View reviewed changes

Add the BitPermute Evaluator for Z

8be2d1a

Enable support for and add an evaluator for the BitPermute IL for the Z code generator. Currently the IL is only supported for 64-bit. Signed-off-by: Dhruv Chopra <Dhruv.C.Chopra@ibm.com>

dchopra001 force-pushed the bitPermuteZEvaluator branch from c7895fe to 8be2d1a Compare April 6, 2018 17:04

fjeremic merged commit a1e32cc into eclipse:master Apr 16, 2018

fjeremic mentioned this pull request Apr 25, 2018

[0.9] Release focus areas eclipse-openj9/openj9#1381

Closed

		@@ -14117,6 +14117,168 @@ bool isSettingOp(uint8_t byteValue, TR::InstOpCode::Mnemonic SI_opcode)
		}


		TR::Register *tmpReg = cg->allocateRegister(TR_GPR);

		//Zero result reg

	TR::InstOpCode::Mnemonic
	OMR::Z::InstOpCode::getLoadTestRegOpCode() { return TR::Compiler->target.is64Bit() ? TR::InstOpCode::LTGR : TR::InstOpCode::LTR; }


		TR::Register *tmpReg = cg->allocateRegister(TR_GPR);

		static const int8_t BIT_PERMUTE_CONSTANT_UNROLL_THRESHOLD = 4;

Add the bitPermute Evaluator for Z #2330

Add the bitPermute Evaluator for Z #2330

Conversation

dchopra001 commented Feb 20, 2018

dchopra001 commented Feb 20, 2018

dchopra001 commented Feb 20, 2018 • edited Loading

dchopra001 commented Feb 21, 2018

dchopra001 commented Feb 21, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dchopra001 Feb 22, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fjeremic commented Feb 21, 2018

fjeremic commented Feb 21, 2018 • edited Loading

fjeremic commented Feb 21, 2018 • edited Loading

dchopra001 commented Feb 22, 2018 • edited Loading

dchopra001 commented Feb 22, 2018

fjeremic commented Feb 22, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fjeremic Feb 22, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fjeremic Feb 22, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dchopra001 commented Feb 23, 2018

dchopra001 commented Feb 23, 2018

dchopra001 commented Feb 23, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dchopra001 Feb 23, 2018 • edited Loading

Choose a reason for hiding this comment

dchopra001 Feb 27, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fjeremic commented Feb 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fjeremic Feb 23, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fjeremic commented Feb 23, 2018 • edited Loading

fjeremic Feb 27, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dchopra001 commented Apr 3, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dchopra001 commented Feb 20, 2018 •

edited

Loading

dchopra001 Feb 22, 2018 •

edited

Loading

fjeremic commented Feb 21, 2018 •

edited

Loading

fjeremic commented Feb 21, 2018 •

edited

Loading

dchopra001 commented Feb 22, 2018 •

edited

Loading

fjeremic Feb 22, 2018 •

edited

Loading

fjeremic Feb 22, 2018 •

edited

Loading

dchopra001 Feb 23, 2018 •

edited

Loading

dchopra001 Feb 27, 2018 •

edited

Loading

fjeremic commented Feb 23, 2018 •

edited

Loading

fjeremic Feb 23, 2018 •

edited

Loading

fjeremic commented Feb 23, 2018 •

edited

Loading

fjeremic Feb 27, 2018 •

edited

Loading

fjeremic Apr 3, 2018 •

edited

Loading

dchopra001 commented Apr 3, 2018 •

edited

Loading

dchopra001 commented Apr 3, 2018 •

edited

Loading