New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lower TEST(x, LSH(1, y)) to BT(x, y) #13626

Merged
merged 3 commits into from Sep 14, 2017

Conversation

Projects
None yet
@mikedn
Copy link
Contributor

mikedn commented Aug 28, 2017

No description provided.

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Aug 28, 2017

jit-diff fx summary:

Total bytes of diff: -67 (0.00% of base)
    diff is an improvement.
Total byte diff includes 0 bytes from reconciling methods
        Base had    0 unique methods,        0 unique bytes
        Diff had    0 unique methods,        0 unique bytes
Top file improvements by size (bytes):
         -34 : Microsoft.CodeAnalysis.dasm (0.00% of base)
         -16 : System.Collections.dasm (-0.01% of base)
         -11 : System.Private.CoreLib.dasm (0.00% of base)
          -6 : System.Text.Encoding.CodePages.dasm (-0.01% of base)
4 total files with size differences (4 improved, 0 regressed), 75 unchanged.
Top method improvements by size (bytes):
         -20 : Microsoft.CodeAnalysis.dasm - MetadataSizes:GetTableSize(ubyte,int):int:this
         -19 : System.Private.CoreLib.dasm - String:IsCharBitSet(long,ubyte):bool
         -16 : System.Collections.dasm - BitArray:Get(int):bool:this
         -14 : Microsoft.CodeAnalysis.dasm - MetadataSizes:IsPresent(ubyte):bool:this
7 total methods with size differences (7 improved, 0 regressed), 66286 unchanged.```
@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Aug 28, 2017

It's obvious from the diff that this pattern doesn't occur often in the framework. Still, this pattern may occur in code that makes use of bit vectors for optimization purposes and where good performance is expected.

For example, the managed implementation of String.IndexOfAny in dotnet/corert#4395 is ~1.2x faster when using BT.

And in some cases the speedup can grow to 3x:

int y = 0;
int x = 0;
while (x++ < 10000000) {
    if ((x & (1 << y)) == 0) {
        y++;
    }
}
return y;

generates:

G_M1752_IG01:
       50                   push     rax
G_M1752_IG02:
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx
       EB15                 jmp      SHORT G_M1752_IG04
G_M1752_IG03:
       BA01000000           mov      edx, 1
       8BC8                 mov      ecx, eax
       D3E2                 shl      edx, cl
       8B4C2404             mov      ecx, dword ptr [rsp+04H]
       85CA                 test     ecx, edx
       8BD1                 mov      edx, ecx
       7502                 jne      SHORT G_M1752_IG04
       FFC0                 inc      eax
G_M1752_IG04:
       8D4A01               lea      ecx, [rdx+1]
       894C2404             mov      dword ptr [rsp+04H], ecx
       81FA80969800         cmp      edx, 0x989680
       7CDC                 jl       SHORT G_M1752_IG03
G_M1752_IG05:
       4883C408             add      rsp, 8
       C3                   ret

BT version:

G_M1752_IG01:
G_M1752_IG02:
       33C0                 xor      eax, eax
       33D2                 xor      edx, edx
       EB09                 jmp      SHORT G_M1752_IG04
G_M1752_IG03:
       0FA3C1               bt       ecx, eax
       8BD1                 mov      edx, ecx
       7202                 jb       SHORT G_M1752_IG04
       FFC0                 inc      eax
G_M1752_IG04:
       8D4A01               lea      ecx, [rdx+1]
       81FA80969800         cmp      edx, 0x989680
       7CEC                 jl       SHORT G_M1752_IG03
G_M1752_IG05:
       C3                   ret

The fact that BT doesn't need a particular bit index register (ECX) nor a temporary register to compute the bit mask into seems to help a lot.

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Aug 28, 2017

@dotnet-bot test Tizen armel Cross Debug Build
test Windows_NT x64 corefx_baseline

@mikedn mikedn force-pushed the mikedn:test-bt branch from dc89ba4 to b58ebd9 Aug 29, 2017

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Aug 29, 2017

@dotnet-bot test Windows_NT x64 corefx_baseline

@mikedn mikedn force-pushed the mikedn:test-bt branch from b58ebd9 to 903e5fb Aug 29, 2017

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Aug 29, 2017

@russellhadley We talked about generating BT in the past, here's one use of it that appears to be useful for both performance and code size reasons.

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Aug 30, 2017

@dotnet-bot test Tizen armel Cross Debug Build

@danmosemsft

This comment has been minimized.

Copy link
Member

danmosemsft commented Aug 30, 2017

@DrewScoggins @jorive who owns the Perf Tests correctness? Can you help with this:

dotnet-bot@WIN2012-2005B0 "D:\j\workspace\perf_perflab_---1c2b0529\bin\sandbox"
[Tue 08/29/2017][11:02:35] $ py.exe "D:\j\workspace\perf_perflab_---1c2b0529\Microsoft.Benchview.JSONFormat\tools\submission.py" measurement.json  --build "D:\j\workspace\perf_perflab_---1c2b0529\build.json" --machine-data "D:\j\workspace\perf_perflab_---1c2b0529\machinedata.json" --metadata "D:\j\workspace\perf_perflab_---1c2b0529\submission-metadata.json" --group "CoreCLR" --type "private" --config-name "Release" --config Configuration "Release" --config OS "Windows_NT" --config Profile "Off" --config OptLevel "full_opt" --config JitName  "ryujit" --architecture "x86" --machinepool "PerfSnake"
[2017-08-29 11:02][ERROR] <class 'ValueError'>: Specified submission-metadata json file "D:\j\workspace\perf_perflab_---1c2b0529\submission-metadata.json" does not exist.
[2017-08-29 11:02][ERROR] Traceback (most recent call last):
  File "D:\j\workspace\perf_perflab_---1c2b0529\Microsoft.Benchview.JSONFormat\tools\submission.py", line 121, in main
    submission = read_submission_metadata(args['metadata'])
  File "D:\j\workspace\perf_perflab_---1c2b0529\Microsoft.Benchview.JSONFormat\tools\benchview\format\JSONReaders.py", line 70, in read_submission_metadata
    raise ValueError('Specified submission-metadata json file "{0}" does not exist.'.format(filename))
ValueError: Specified submission-metadata json file "D:\j\workspace\perf_perflab_---1c2b0529\submission-metadata.json" does not exist.

@danmosemsft

This comment has been minimized.

Copy link
Member

danmosemsft commented Aug 30, 2017

@dotnet/jit-contrib it would be great to get this in to unblock sharing more of String with CoreRT.

@jorive

This comment has been minimized.

Copy link
Member

jorive commented Aug 30, 2017

@dotnet-bot test Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness
@dotnet-bot test Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness

@jorive

This comment has been minimized.

Copy link
Member

jorive commented Aug 30, 2017

@danmosemsft I'm rerunning the tests. There was a bug on the PR test generator that was fixed this morning.
Sorry for the inconvenience,
-José

@CarolEidt
Copy link
Member

CarolEidt left a comment

It's unfortunate to see all the special-casing required in emitxarch.cpp, but it's not obvious to me how to avoid it. Any idea whether there is any throughput impact?
LGTM, but would like to see additional reviews.

@@ -226,6 +226,12 @@ GTNODE(JCC , GenTreeCC ,0,GTK_LEAF|GTK_NOVALUE) // Check
GTNODE(SETCC , GenTreeCC ,0,GTK_LEAF) // Checks the condition flags and produces 1 if the condition specified
// by GenTreeCC::gtCondition is true and 0 otherwise.

#ifdef _TARGET_XARCH_
GTNODE(BT , GenTreeOp ,0,GTK_BINOP|GTK_NOVALUE)
GTNODE(BTC , GenTreeOp ,0,GTK_BINOP)

This comment has been minimized.

@CarolEidt

CarolEidt Aug 30, 2017

Member

I would prefer that we not add the additional forms as GenTree operators until they are going to be used.

This comment has been minimized.

@CarolEidt

CarolEidt Aug 30, 2017

Member

Also, it would be good to add comments describing them.

This comment has been minimized.

@mikedn

mikedn Aug 31, 2017

Author Contributor

I would prefer that we not add the additional forms as GenTree operators until they are going to be used.

Will do, they're easy to add back if needed. And it may turn out that they're never needed because they have worse perf characteristics on AMD CPUs. I'm not sure it's worth the trouble to make the JIT use them only on Intel CPUs.

// The BT instruction family supports "reg/mem,reg" and "reg/mem,imm" forms.
// However, the "mem,reg" form has a slightly different semantic than the other forms,
// it treats the memory as an array indexed by bit_index / 32/64. This is rarely useful
// as these instructions are tipically produced by transforming shift instructions and

This comment has been minimized.

@CarolEidt

CarolEidt Aug 30, 2017

Member

Typo: tipically => typically

@mikedn mikedn force-pushed the mikedn:test-bt branch from 903e5fb to 073ad28 Aug 31, 2017

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Aug 31, 2017

It's unfortunate to see all the special-casing required in emitxarch.cpp, but it's not obvious to me how to avoid it

Indeed it is. When I started doing this (it's been a while and I don't remember the exact details) I thought it would be easy because like shifts it too has a one byte immediate operand. But then shift requires special casing too and BT needs even more (AFAIR partly because it has a 0F prefix). If I had time I'd refactor the whole emitter, it's long past its design limit (assuming there ever was a design to begin with). It appears that the original idea was to be "data driven" but the instruction table is far too limited to avoid a bunch of special casing in the code.

As for performance, corelib crossgen timing doesn't show any change and I've yet to figure out a way to measure things like retired instruction count.

@JosephTremoulet

This comment has been minimized.

Copy link
Contributor

JosephTremoulet commented Aug 31, 2017

I've yet to figure out a way to measure things like retired instruction count

Here's what I've used (courtesy @pgavlin) to measure it for corelib crossgen on Linux:

./build.sh x64 release skiptests ninja nopgooptimize
valgrind --tool=callgrind --dump-instr=yes --callgrind-out-file=callgrind.out ./bin/Product/Linux.x64.Release/crossgen /Platform_Assemblies_Paths ./bin/Product/Linux.x64.Release/IL ./bin/Product/Linux.x64.Release/IL/System.Private.CoreLib.dll
kcachegrind callgrind.out

And I haven't tried it myself, but @AndyAyersMS has https://github.com/AndyAyersMS/InstructionsRetiredExplorer for measuring the same on Windows.

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Aug 31, 2017

measure it for corelib crossgen on Linux:

I know but I don't have enough disk space to install Linux. And AFAIK callgrind doesn't work in a VM...

https://github.com/AndyAyersMS/InstructionsRetiredExplorer for measuring the same on Windows.

Ah, xperf + pmc, I was just looking into that, thanks for the pointer!

@BruceForstall

This comment has been minimized.

Copy link
Contributor

BruceForstall commented Aug 31, 2017

You can also use Intel 'pin', configured to count instructions: https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool

@AndyAyersMS

This comment has been minimized.

Copy link
Member

AndyAyersMS commented Aug 31, 2017

The instructions retired explorer may require very new xperfs (not 100% sure they've even shipped). Also you can't be running on a box with Hyper-V enabled. And you need to run as admin.

@AndyAyersMS

This comment has been minimized.

Copy link
Member

AndyAyersMS commented Aug 31, 2017

You might be able to enable the PMC ETL event mode via perfview instead. Haven't tried....

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Sep 1, 2017

You can also use Intel 'pin', configured to count instructions

Yep, I know that one too. Yet to get it to build, AFAIR last time I tried it couldn't find the C compiler.

The instructions retired explorer may require very new xperfs (not 100% sure they've even shipped)

Indeed, it looks like the xperf I have doesn't accept -pmcprofile. But there's tracelog that does accept -pmc. But then InstructionsRetiredExplorer can't seem to find any PerfInfo/PMCSample events in the merged etl file. Yet to figure out why but I tried Bruce Dawson's PMC event parsing script from https://github.com/google/UIforETW/tree/master/LabScripts/ETWPMCDemo and got this:

Process          Ins Retired     Branch Ins     Branch Miss
crossgen.exe     18,067,358,073  3,890,472,755  233,555,507
crossgen.exe     18,062,612,600  3,889,872,528  233,447,303
crossgen.exe     18,060,606,313  3,889,366,433  233,538,253
crossgen.exe     18,078,099,445  3,892,123,129  233,823,258
crossgen.exe     18,064,848,866  3,890,080,253  233,779,378
crossgen_bt.exe  18,102,982,290  3,898,831,420  236,963,113
crossgen_bt.exe  18,110,508,956  3,901,657,373  237,138,253
crossgen_bt.exe  18,099,535,973  3,898,380,857  236,903,141
crossgen_bt.exe  18,095,406,195  3,897,302,086  236,967,909
crossgen_bt.exe  18,099,172,364  3,898,696,667  236,855,791

So around 0.2% increase in instructions retired, assuming the numbers are correct. See this Excel file for some more details: https://1drv.ms/x/s!Av4baJYSo5pjgrhsTDG39GBXVvRdOA. For example, the reported time does look reasonable.

0.2% is a bit much and it's quite unfortunate that adding a new instruction has such an effect. I'll see if I can do something about it though I'm not very optimistic when it comes to emitter code...

@mikedn mikedn force-pushed the mikedn:test-bt branch from 073ad28 to 0ed5b7d Sep 2, 2017

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Sep 4, 2017

Probably the best option to speed up the emitter is to add special casing only where it is really needed - in emitOutputRR - because only BT reg, reg is currently needed. BT reg, imm and BT mem, imm can be useful but they're probably not worth the trouble. BT mem, reg is completely pointless.

And if only BT reg, reg is needed then it turns out that it's possible to avoid all emitter special casing if cheating is an option. Special casing is needed because:

  • The existing code path attempts to set bits 0 and 1 of the opcode. For BT these bits have no special meaning but it happens so that its opcode is A3 so we can cheat - just set them unconditionally.
  • BT's registers operand are swapped - BT ecx, edx has a different encoding than ADD ecx, edx. Here we can cheat and have CodeGen call emitIns_R_R with swapped operands. It's ugly but it works.

@karelz karelz added the area-CodeGen label Sep 6, 2017

@mikedn mikedn force-pushed the mikedn:test-bt branch from 0ed5b7d to e079bab Sep 7, 2017

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Sep 7, 2017

I replaced all the special casing from the emitter with comments and asserts. It's perhaps a bit of a hack but the added complexity and the throughput hit is just not worth it considering the low usage of BT.

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Sep 8, 2017

OSX build failed due to some Jenkins issue

00:14:17 hudson.remoting.RemotingSystemException: java.util.concurrent.ExecutionException: Invalid object ID 389 iota=390
00:14:17 	at hudson.remoting.RemoteInvocationHandler.invoke(RemoteInvocationHandler.java:268)
00:14:17 	at com.sun.proxy.$Proxy169.join(Unknown Source)
00:14:17 	at hudson.Launcher$RemoteLauncher$ProcImpl.join(Launcher.java:991)
00:14:17 	at hudson.tasks.CommandInterpreter.join(CommandInterpreter.java:135)
00:14:17 	at hudson.tasks.CommandInterpreter.perform(CommandInterpreter.java:95)

@dotnet-bot test OSX10.12 x64 Checked Build and Test

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Sep 9, 2017

OSX build got stuck...

@dotnet-bot test OSX10.12 x64 Checked Build and Test

@mikedn mikedn referenced this pull request Sep 12, 2017

Merged

Use BT in switch lowering #12552

@danmosemsft

This comment has been minimized.

Copy link
Member

danmosemsft commented Sep 12, 2017

@CarolEidt good to merge it looks like?

@CarolEidt

This comment has been minimized.

Copy link
Member

CarolEidt commented Sep 12, 2017

@dotnet/jit-contrib This LGTM, but I think a second set of eyes on this would be good.

@@ -163,6 +163,7 @@ void genCodeForShiftLong(GenTreePtr tree);

#ifdef _TARGET_XARCH_
void genCodeForShiftRMW(GenTreeStoreInd* storeInd);
void genCodeForBT(GenTreeOp* bt);

This comment has been minimized.

@briansull

briansull Sep 12, 2017

Contributor

You may want to call this BitTest rather than BT, since we will likely want this for other architectures eventually.

This comment has been minimized.

@mikedn

mikedn Sep 13, 2017

Author Contributor

ARM doesn't seem to have anything like BT. ARM64 has TBZ/TBNZ but that's quite different from BT and I'm not even sure how we can represent that in JIT's IR, it's a rather special case of conditional branch, maybe a JTRUE with a contained TEST_EQ/NE operand.

This comment has been minimized.

@briansull

briansull Sep 13, 2017

Contributor

Arm64 has tst:

  tst     w24, #24
  beq     G_M55607_IG12

Implemented with PR #13799

Note:

Test bits (immediate), setting the condition flags and discarding the result: Rn AND imm
This instruction is an alias of the ANDS (immediate) instruction.

This comment has been minimized.

@mikedn

mikedn Sep 13, 2017

Author Contributor

ARM64's tst is similar to x64's test and represented in IR by GT_TEST_EQ and GT_TEST_NE. BT is a different thing.

This comment has been minimized.

@briansull

briansull Sep 13, 2017

Contributor

What is the difference between these two?
Any bit (0-63) can be tested using Arm64 TST instruction.

The main difference seems to be that the Intel instruction sets the CF and the Arm instruction sets the ZF. (And that a memory operand is allowed on Intel)

This comment has been minimized.

@mikedn

mikedn Sep 13, 2017

Author Contributor

Any bit (0-63) can be tested using Arm64 TST instruction.

Yes, but the bit to test is specified by an immediate value. Presumably you can do something like tst w24, w23 but then w23 needs to contain a bit mask, not a bit index like in the case of BT.

@sdmaclea

This comment has been minimized.

Copy link
Member

sdmaclea commented Sep 12, 2017

@dotnet-bot test Windows_NT arm64 Checked

@sdmaclea

This comment has been minimized.

Copy link
Member

sdmaclea commented Sep 12, 2017

@mikedn It is not obvious to me how this is disabled for arm64. I do notsee the #ifdef in LowerCompare() Is it just inside another large block? Never mind it is just in the large TARGET_XARCH block.

@sdmaclea

This comment has been minimized.

Copy link
Member

sdmaclea commented Sep 12, 2017

Do not feel obligated to wait for arm64 tests to complete.

@CarolEidt

This comment has been minimized.

Copy link
Member

CarolEidt commented Sep 12, 2017

Never mind it is just in the large TARGET_XARCH block.

I had the same question, and was thinking that perhaps it is time to extract the XARCH-specific code into lowerxarch.cpp

@sdmaclea

This comment has been minimized.

Copy link
Member

sdmaclea commented Sep 12, 2017

@CarolEidt

extract the XARCH-specific code into lowerxarch.cpp

Many of my recent changes are enabling sections of this LowerCompare XARCH block for arm64.

@CarolEidt

This comment has been minimized.

Copy link
Member

CarolEidt commented Sep 12, 2017

Many of my recent changes are enabling sections of this LowerCompare XARCH block for arm64.

Ah, very good then!

@mikedn mikedn force-pushed the mikedn:test-bt branch from e079bab to 81bc72f Sep 13, 2017

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Sep 13, 2017

Rebased and fixed conflicts. It happens so that now it is obvious that the code is inside an ifdef.

@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Sep 13, 2017

@mikedn are there any plans to add similar changes for bts, btr, and btc?

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Sep 13, 2017

are there any plans to add similar changes for bts, btr, and btc?

I did experiment with those and I have some code that handles things like x |= (1 << y). The are some problems:

  • These instructions have a mem, reg form but it has poor performance and problematic behavior. That means that when x is a memory location we need to generate something like
mov eax, [mem]
bts eax, edx
mov [mem], eax

This makes these instructions slightly less useful compared to BT (which only needs an extra mov in such cases).

  • On AMD CPUs (Ryzen really as probably nobody cares about the rest) these instructions have slightly worse perf characteristics than shift instructions. On Intel CPUs it's the other way around. This makes things complicated - we either need to carefully handle only those patterns that work well on all CPUs or use these instructions only on some CPUs. The former might require running benchmarks on a Ryzen (and I don't have one), the later might complicate testing and I'm not sure if the team will agree to that.

And interestingly, most native compilers seem to avoid these instructions. AFAIR the only native compiler that I've seen using bts & co. is VC++.

@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Sep 13, 2017

The former might require running benchmarks on a Ryzen (and I don't have one)

I'm happy to run benchmarks on a Ryzen, I've got a 1800X and a 1950X (Threadripper)

@tannergooding

This comment has been minimized.

Copy link
Member

tannergooding commented Sep 13, 2017

These instructions have a mem, reg form but it has poor performance and problematic behavior. That means that when x is a memory location we need to generate something like

Fair enough. If alternative code tends to perform better across all platforms, its probably better to keep that.

@mikedn

This comment has been minimized.

Copy link
Contributor Author

mikedn commented Sep 13, 2017

I'm happy to run benchmarks on a Ryzen, I've got a 1800X and a 1950X (Threadripper)

Thanks, then I'll try to put something together when I get some time, next weekend perhaps.

I should add that adding support for SHLX/SHRX/SARX might be more useful. On Intels these have better perf than normal shifts while on AMDs they have identical perf so we can use them on all CPUs that support BMI2. They don't require the shift count to be in ECX (and as seen in an example above this may help with register allocation) and they have a 3 operand form (though I'm not sure if the JIT can fully take advantage of that).

@CarolEidt CarolEidt merged commit f386cd6 into dotnet:master Sep 14, 2017

16 checks passed

CROSS Check Build finished.
Details
CentOS7.1 x64 Debug Build and Test Build finished.
Details
OSX10.12 x64 Checked Build and Test Build finished.
Details
Tizen armel Cross Debug Build Build finished.
Details
Tizen armel Cross Release Build Build finished.
Details
Ubuntu arm Cross Release Build Build finished.
Details
Ubuntu x64 Checked Build and Test Build finished.
Details
Ubuntu x64 Formatting Build finished.
Details
Ubuntu16.04 arm Cross Debug Build Build finished.
Details
Windows_NT x64 Debug Build and Test Build finished.
Details
Windows_NT x64 Formatting Build finished.
Details
Windows_NT x64 Release Priority 1 Build and Test Build finished.
Details
Windows_NT x64 full_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Details
Windows_NT x86 Checked Build and Test Build finished.
Details
Windows_NT x86 full_opt legacy_backend CoreCLR Perf Tests Correctness Build finished.
Details
Windows_NT x86 full_opt ryujit CoreCLR Perf Tests Correctness Build finished.
Details

@mikedn mikedn deleted the mikedn:test-bt branch Dec 16, 2017

@wojtpl2

This comment has been minimized.

Copy link

wojtpl2 commented Jan 28, 2019

@mikedn I used your example in my blog about how to generate the disassembly of .NET functions and how to diff many of them with BenchmarkDotNet.

@wojtpl2

This comment has been minimized.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment