Commit
GT_STORE_BLK - do not call memset for blocks containing gc pointers on heap (#95530)

Co-authored-by: Qiao Pengcheng <qiaopengcheng@loongson.cn>
EgorBo and shushanhf committed Jan 4, 2024
1 parent 89af7af commit bed5f46
Showing 21 changed files with 445 additions and 14 deletions.
44 changes: 35 additions & 9 deletions docs/design/coreclr/botr/clr-abi.md
@@ -207,7 +207,7 @@ For non-rude thread abort, the VM walks the stack, running any catch handler tha

For example:

```
```cs
try { // try 1
try { // try 2
System.Threading.Thread.CurrentThread.Abort();
@@ -223,7 +223,7 @@ L:

In this case, if the address returned in catch 2 corresponding to label L is outside try 1, then the ThreadAbortException re-raised by the VM will not be caught by catch 1, as is expected. The JIT needs to insert a block such that this is the effective code generation:

```
```cs
try { // try 1
try { // try 2
System.Threading.Thread.CurrentThread.Abort();
@@ -240,7 +240,7 @@ L:

Similarly, the automatic re-raise address for a ThreadAbortException can't be within a finally handler, or the VM will abort the re-raise and swallow the exception. This can happen due to call-to-finally thunks marked as "cloned finally", as described above. For example (this is pseudo-assembly-code, not C#):

```
```cs
try { // try 1
try { // try 2
System.Threading.Thread.CurrentThread.Abort();
@@ -256,7 +256,7 @@ L:

This would generate something like:

```
```asm
// beginning of 'try 1'
// beginning of 'try 2'
System.Threading.Thread.CurrentThread.Abort();
@@ -281,7 +281,7 @@ Finally1:

Note that the JIT must already insert a "step" block so the finally will be called. However, this isn't sufficient to support ThreadAbortException processing, because "L1" is marked as "cloned finally". In this case, the JIT must insert another step block that is within "try 1" but outside the cloned finally block, that will allow for correct re-raise semantics. For example:

```
```asm
// beginning of 'try 1'
// beginning of 'try 2'
System.Threading.Thread.CurrentThread.Abort();
@@ -399,7 +399,7 @@ To implement this requirement, for any function with EH, we create a frame-local

Note that since a slot on x86 is 4 bytes, the minimum size is 16 bytes. The idea is to have one slot for each handler that could possibly be invoked at the same time. For example, for:

```
```cs
try {
...
} catch {
@@ -419,7 +419,7 @@ When calling a finally, we set the appropriate level to 0xFC (aka "finally call"

Thus, calling a finally from JIT generated code looks like:

```
```asm
mov dword ptr [L_02+0x4 ebp-10H], 0 // This must happen before the 0xFC is written
mov dword ptr [L_02+0x8 ebp-0CH], 252 // 0xFC
push G_M52300_IG07
@@ -430,7 +430,7 @@ In this case, `G_M52300_IG07` is not the address after the 'jmp', so a simple 'c

The code this finally returns to looks like this:

```
```asm
mov dword ptr [L_02+0x8 ebp-0CH], 0
jmp SHORT G_M52300_IG05
```
@@ -479,7 +479,7 @@ Because a main function body will **always** be on the stack when one of its fun

There is one "corner case" in the VM implementation of WantsReportOnlyLeaf model that has implications for the code the JIT is allowed to generate. Consider this function with nested exception handling:

```
```cs
public void runtest() {
try {
try {
@@ -804,3 +804,29 @@ In addition to the usual registers it also preserves all float registers and `rc
`CORINFO_HELP_DISPATCH_INDIRECT_CALL` takes the call address in `rax` and it reserves the right to use and trash `r10` and `r11`.
The JIT uses the dispatch helper on x64 whenever possible as it is expected that the code size benefits outweigh the less accurate branch prediction.
However, note that the use of `r11` in the dispatcher makes it incompatible with VSD calls where the JIT must fall back to the validator and a manual call.

# Notes on Memset/Memcpy

Generally, `memset` and `memcpy` provide no atomicity guarantees. This implies that they should only be used when the memory they modify is not observable by any other thread (including the GC), or when there are no atomicity requirements according to our [Memory Model](../../specs/Memory-model.md). This is especially important when modifying heap memory that contains managed pointers - those must be updated atomically, e.g. using a pointer-sized `mov` instruction (managed pointers are always aligned) - see [Atomic Memory Access](../../specs/Memory-model.md#Atomic-memory-accesses). Note that by "update" we mean "set to zero"; setting a managed pointer to anything else requires a write barrier.

Examples:

```cs
struct MyStruct
{
long a;
string b;
}

void Test1(ref MyStruct m)
{
// We're not allowed to use memset here: 'm' may point into the GC heap,
// where 'b' (a managed pointer) must be updated with a single
// pointer-sized store
m = default;
}

MyStruct Test2()
{
// We can use memset here: the return buffer is not observable
// by any other thread or the GC
return default;
}
```
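Cast as plain C, the only kind of zeroing these rules permit for GC-visible memory looks like the sketch below: one aligned, pointer-sized store per slot, never a byte-granular `memset`. The helper name `zero_block_atomic` is invented for illustration; it assumes `dst` is pointer-aligned and `size` is a multiple of the pointer size, mirroring the alignment guarantee above.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical sketch: zero a block one pointer-sized store at a time, so any
// managed pointer inside the block is always observed as either its old value
// or null, never a torn mix of bytes.
static void zero_block_atomic(void *dst, size_t size)
{
    uintptr_t *p = (uintptr_t *)dst;
    for (size_t i = 0; i < size / sizeof(uintptr_t); i++)
    {
        p[i] = 0; // each slot is cleared in a single aligned write
    }
}
```

Each iteration's store is atomic on aligned pointer-sized data, so a concurrent GC scan never sees a torn managed pointer; `memset` makes no such per-slot guarantee.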
1 change: 1 addition & 0 deletions src/coreclr/jit/codegen.h
@@ -1240,6 +1240,7 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
#ifndef TARGET_X86
void genCodeForInitBlkHelper(GenTreeBlk* initBlkNode);
#endif
void genCodeForInitBlkLoop(GenTreeBlk* initBlkNode);
void genCodeForInitBlkRepStos(GenTreeBlk* initBlkNode);
void genCodeForInitBlkUnroll(GenTreeBlk* initBlkNode);
void genJumpTable(GenTree* tree);
71 changes: 71 additions & 0 deletions src/coreclr/jit/codegenarmarch.cpp
@@ -3243,6 +3243,72 @@ void CodeGen::genCodeForInitBlkHelper(GenTreeBlk* initBlkNode)
genEmitHelperCall(CORINFO_HELP_MEMSET, 0, EA_UNKNOWN);
}

//------------------------------------------------------------------------
// genCodeForInitBlkLoop - Generate code for an InitBlk using an inlined for-loop.
// It is used when the size is too big to unroll and a memset call is
// not allowed due to atomicity requirements.
//
// Arguments:
// initBlkNode - the GT_STORE_BLK node
//
void CodeGen::genCodeForInitBlkLoop(GenTreeBlk* initBlkNode)
{
GenTree* const dstNode = initBlkNode->Addr();
genConsumeReg(dstNode);
const regNumber dstReg = dstNode->GetRegNum();

#ifndef TARGET_ARM64
GenTree* const zeroNode = initBlkNode->Data();
genConsumeReg(zeroNode);
const regNumber zeroReg = zeroNode->GetRegNum();
#else
const regNumber zeroReg = REG_ZR;
#endif

if (initBlkNode->IsVolatile())
{
// issue a full memory barrier before a volatile initBlock Operation
instGen_MemoryBarrier();
}

// str xzr, [dstReg]
// mov offsetReg, <block size>
//.LOOP:
// str xzr, [dstReg, offsetReg]
// subs offsetReg, offsetReg, #8
// bne .LOOP

const unsigned size = initBlkNode->GetLayout()->GetSize();
assert((size >= TARGET_POINTER_SIZE) && ((size % TARGET_POINTER_SIZE) == 0));

// The loop is reversed - it makes it smaller.
// Although we zero the first pointer before the loop (the loop doesn't zero it),
// that store also works as a null check; otherwise the first iteration would
// try to access "null + potentially large offset" and hit an AV.
GetEmitter()->emitIns_R_R(INS_str, EA_PTRSIZE, zeroReg, dstReg);
if (size > TARGET_POINTER_SIZE)
{
// Extend liveness of dstReg in case it gets killed by the store.
gcInfo.gcMarkRegPtrVal(dstReg, dstNode->TypeGet());

const regNumber offsetReg = initBlkNode->GetSingleTempReg();
instGen_Set_Reg_To_Imm(EA_PTRSIZE, offsetReg, size - TARGET_POINTER_SIZE);

BasicBlock* loop = genCreateTempLabel();
genDefineTempLabel(loop);

GetEmitter()->emitIns_R_R_R(INS_str, EA_PTRSIZE, zeroReg, dstReg, offsetReg);
#ifdef TARGET_ARM64
GetEmitter()->emitIns_R_R_I(INS_subs, EA_PTRSIZE, offsetReg, offsetReg, TARGET_POINTER_SIZE);
#else
GetEmitter()->emitIns_R_R_I(INS_sub, EA_PTRSIZE, offsetReg, offsetReg, TARGET_POINTER_SIZE, INS_FLAGS_SET);
#endif
inst_JMP(EJ_ne, loop);

gcInfo.gcMarkRegSetNpt(genRegMask(dstReg));
}
}
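For reference, a hypothetical C model of the loop this function emits (the function name `init_blk_loop` is invented; the authoritative sequence is the str/subs/bne pattern in the comment above). It assumes `dst` is pointer-aligned and `size` is a nonzero multiple of the pointer size, matching the assert in the JIT code.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical C sketch of the reversed zeroing loop.
static void init_blk_loop(void *dst, size_t size)
{
    char *base = (char *)dst;

    // str xzr, [dstReg] -- zeroes the first slot and doubles as the null check
    *(uintptr_t *)base = 0;

    // mov offsetReg, <size - 8>; loop: str/subs/bne walks the offset down to 0
    for (size_t off = size - sizeof(uintptr_t); off != 0; off -= sizeof(uintptr_t))
    {
        *(uintptr_t *)(base + off) = 0; // str xzr, [dstReg, offsetReg]
    }
}
```

Counting the offset down to zero lets the backward branch reuse the flags from the subtract, which is why the reversed form is smaller than a forward loop with a separate compare.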

//------------------------------------------------------------------------
// genCall: Produce code for a GT_CALL node
//
@@ -4513,6 +4579,11 @@ void CodeGen::genCodeForStoreBlk(GenTreeBlk* blkOp)
genCodeForCpObj(blkOp->AsBlk());
break;

case GenTreeBlk::BlkOpKindLoop:
assert(!isCopyBlk);
genCodeForInitBlkLoop(blkOp);
break;

case GenTreeBlk::BlkOpKindHelper:
assert(!blkOp->gtBlkOpGcUnsafe);
if (isCopyBlk)
53 changes: 53 additions & 0 deletions src/coreclr/jit/codegenloongarch64.cpp
@@ -6351,6 +6351,54 @@ void CodeGen::genCodeForInitBlkHelper(GenTreeBlk* initBlkNode)
genEmitHelperCall(CORINFO_HELP_MEMSET, 0, EA_UNKNOWN);
}

//------------------------------------------------------------------------
// genCodeForInitBlkLoop - Generate code for an InitBlk using an inlined for-loop.
// It is used when the size is too big to unroll and a memset call is
// not allowed due to atomicity requirements.
//
// Arguments:
// initBlkNode - the GT_STORE_BLK node
//
void CodeGen::genCodeForInitBlkLoop(GenTreeBlk* initBlkNode)
{
GenTree* const dstNode = initBlkNode->Addr();
genConsumeReg(dstNode);
const regNumber dstReg = dstNode->GetRegNum();

if (initBlkNode->IsVolatile())
{
// issue a full memory barrier before a volatile initBlock Operation
instGen_MemoryBarrier();
}

const unsigned size = initBlkNode->GetLayout()->GetSize();
assert((size >= TARGET_POINTER_SIZE) && ((size % TARGET_POINTER_SIZE) == 0));

// The loop is reversed - it makes it smaller.
// Although we zero the first pointer before the loop (the loop doesn't zero it),
// that store also works as a null check; otherwise the first iteration would
// try to access "null + potentially large offset" and hit an AV.
GetEmitter()->emitIns_R_R_I(INS_st_d, EA_PTRSIZE, REG_R0, dstReg, 0);
if (size > TARGET_POINTER_SIZE)
{
// Extend liveness of dstReg in case it gets killed by the store.
gcInfo.gcMarkRegPtrVal(dstReg, dstNode->TypeGet());

const regNumber offsetReg = initBlkNode->GetSingleTempReg();
instGen_Set_Reg_To_Imm(EA_PTRSIZE, offsetReg, size - TARGET_POINTER_SIZE);

// loop begin:
// *(dstReg + offsetReg) = 0
GetEmitter()->emitIns_R_R_R(INS_stx_d, EA_PTRSIZE, REG_R0, dstReg, offsetReg);
// offsetReg = offsetReg - 8
GetEmitter()->emitIns_R_R_I(INS_addi_d, EA_PTRSIZE, offsetReg, offsetReg, -8);
// if (offsetReg != 0) goto loop; (-2 << 2: branch back two 4-byte instructions)
GetEmitter()->emitIns_R_I(INS_bnez, EA_8BYTE, offsetReg, -2 << 2);

gcInfo.gcMarkRegSetNpt(genRegMask(dstReg));
}
}

// Generate code for a load from some address + offset
// base: tree node which can be either a local address or arbitrary node
// offset: distance from the base from which to load
@@ -7262,6 +7310,11 @@ void CodeGen::genCodeForStoreBlk(GenTreeBlk* blkOp)
genCodeForCpObj(blkOp->AsBlk());
break;

case GenTreeBlk::BlkOpKindLoop:
assert(!isCopyBlk);
genCodeForInitBlkLoop(blkOp);
break;

case GenTreeBlk::BlkOpKindHelper:
if (isCopyBlk)
{
60 changes: 60 additions & 0 deletions src/coreclr/jit/codegenriscv64.cpp
@@ -6029,6 +6029,61 @@ void CodeGen::genCodeForCpBlkUnroll(GenTreeBlk* cpBlkNode)
}
}

//------------------------------------------------------------------------
// genCodeForInitBlkLoop - Generate code for an InitBlk using an inlined for-loop.
// It is used when the size is too big to unroll and a memset call is
// not allowed due to atomicity requirements.
//
// Arguments:
// initBlkNode - the GT_STORE_BLK node
//
void CodeGen::genCodeForInitBlkLoop(GenTreeBlk* initBlkNode)
{
GenTree* const dstNode = initBlkNode->Addr();
genConsumeReg(dstNode);
const regNumber dstReg = dstNode->GetRegNum();

if (initBlkNode->IsVolatile())
{
// issue a full memory barrier before a volatile initBlock Operation
instGen_MemoryBarrier();
}

const unsigned size = initBlkNode->GetLayout()->GetSize();
assert((size >= TARGET_POINTER_SIZE) && ((size % TARGET_POINTER_SIZE) == 0));

// The loop is reversed - it makes it smaller.
// Although we zero the first pointer before the loop (the loop doesn't zero it),
// that store also works as a null check; otherwise the first iteration would
// try to access "null + potentially large offset" and hit an AV.
GetEmitter()->emitIns_R_R_I(INS_sd, EA_PTRSIZE, REG_R0, dstReg, 0);
if (size > TARGET_POINTER_SIZE)
{
// Extend liveness of dstReg in case it gets killed by the store.
gcInfo.gcMarkRegPtrVal(dstReg, dstNode->TypeGet());

const regNumber tempReg = initBlkNode->GetSingleTempReg();
instGen_Set_Reg_To_Imm(EA_PTRSIZE, tempReg, size - TARGET_POINTER_SIZE);

// tempReg = dstReg + tempReg (a new interior pointer, but in a nongc region)
GetEmitter()->emitIns_R_R_R(INS_add, EA_PTRSIZE, tempReg, dstReg, tempReg);

BasicBlock* loop = genCreateTempLabel();
genDefineTempLabel(loop);
GetEmitter()->emitDisableGC(); // TODO: add gcinfo to tempReg and remove nogc

// *tempReg = 0
GetEmitter()->emitIns_R_R_I(INS_sd, EA_PTRSIZE, REG_R0, tempReg, 0);
// tempReg = tempReg - 8
GetEmitter()->emitIns_R_R_I(INS_addi, EA_PTRSIZE, tempReg, tempReg, -8);
// if (tempReg != dstReg) goto loop;
GetEmitter()->emitIns_J(INS_bne, loop, (int)tempReg | ((int)dstReg << 5));
GetEmitter()->emitEnableGC();

gcInfo.gcMarkRegSetNpt(genRegMask(dstReg));
}
}
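The RISC-V variant differs from the ARM64 and LoongArch64 versions in that `tempReg` holds an interior pointer walking down until it meets `dstReg`, rather than an offset counting down to zero. A hypothetical C model (the name `init_blk_loop_ptr` is invented; same alignment and size assumptions as the JIT's assert):

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical C sketch of the RISC-V pointer-walking variant.
static void init_blk_loop_ptr(void *dst, size_t size)
{
    uintptr_t *base = (uintptr_t *)dst;

    // sd zero, 0(dstReg) -- first store, also the null check
    *base = 0;

    // tempReg = dstReg + size - 8; loop: sd/addi/bne until tempReg == dstReg
    for (uintptr_t *p = (uintptr_t *)((char *)dst + size) - 1; p != base; p--)
    {
        *p = 0; // sd zero, 0(tempReg)
    }
}
```

The interior pointer is why this version brackets the loop with `emitDisableGC`/`emitEnableGC`: `tempReg` points into the middle of a possibly heap-allocated block and is not reported to the GC, as the TODO comment notes.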

//------------------------------------------------------------------------
// genCodeForInitBlkHelper - Generate code for an InitBlk node by the means of the VM memcpy helper call
//
@@ -6918,6 +6973,11 @@ void CodeGen::genCodeForStoreBlk(GenTreeBlk* blkOp)
genCodeForCpObj(blkOp->AsBlk());
break;

case GenTreeBlk::BlkOpKindLoop:
assert(!isCopyBlk);
genCodeForInitBlkLoop(blkOp);
break;

case GenTreeBlk::BlkOpKindHelper:
if (isCopyBlk)
{
