Partly mitigate bad Clang inlining decision #66

Kojoley · 2019-04-14T01:41:22Z

Because a visitor is wrapped several times during visitation it cases extra temporaries usage and useless store and loads that can only be optimized if the visitation_impl is inlined into the function that creates the wrapper. Clang inliner decides not to inline functions even with small-sized switches, resulting in a poor visitation code. Forceinline mark on those internal functions perceptibly improves the situation, though does not mitigate it completely.

LLVM ticket https://bugs.llvm.org/show_bug.cgi?id=41491

Before:

foo:
	sub	rsp, 88
	lea	rax, [rsp + 64]
	mov	qword ptr [rsp + 72], rax
	lea	rax, [rsp + 72]
	mov	qword ptr [rsp + 80], rax
	mov	eax, dword ptr [rcx]
	lea	r9, [rcx + 4]
	mov	edx, eax
	sar	edx, 31
	xor	edx, eax
	xorps	xmm0, xmm0
	movups	xmmword ptr [rsp + 48], xmm0
	lea	r8, [rsp + 80]
	mov	ecx, eax
	call	tail
	nop
	add	rsp, 88
	ret

tail:
	add	edx, -1
	cmp	edx, 8
	ja	.LBB1_2
	lea	rax, [rip + .LJTI1_0]
	movsxd	rcx, dword ptr [rax + 4*rdx]
	add	rcx, rax
	jmp	rcx

After:

foo:
	sub	rsp, 56
	lea	rax, [rsp + 40]
	mov	qword ptr [rsp + 48], rax
	lea	rdx, [rsp + 48]
	call	tail
	nop
	add	rsp, 56
	ret

tail:
	mov	ecx, dword ptr [rcx]
	mov	eax, ecx
	sar	eax, 31
	xor	eax, ecx
	add	eax, -1
	cmp	eax, 8
	ja	.LBB1_2
	lea	rcx, [rip + .LJTI1_0]
	movsxd	rax, dword ptr [rcx + 4*rax]
	add	rax, rcx
	jmp	rax

Because a visitor is wrapped several times during visitation it cases extra temporaries usage and useless store and loads that can only be optimized if the `visitation_impl` is inlined into the function that creates the wrapper. Clang inliner decides not to inline functions even with small-sized switches, resulting in a poor visitation code. Forceinline mark on those internal functions perceptibly improves the situation, though does not mitigate it completely. LLVM ticket https://bugs.llvm.org/show_bug.cgi?id=41491

coveralls · 2019-04-14T01:59:05Z

Coverage increased (+0.1%) to 95.404% when pulling dd9b0c9 on Kojoley:partly-mitigate-bad-clang-inlining-decision into 74ea828 on boostorg:develop.

apolukhin · 2019-04-23T06:02:30Z

I'm not a big fan of force inline... Is inline not enough for getting the better code?

Kojoley · 2019-04-23T14:30:03Z

Is inline not enough for getting the better code?

Nope, as you can see I have literally replaced inline with BOOST_FORCEINLINE. It seems that LLVM inliner's cost model pessimizes switches too much.

apolukhin · 2019-05-02T10:22:56Z

Clang has heuristics on explicit inline. You have added force inline in some cases where there was no explicit inline. Try replacing all the force inlines with just inline, it should give clang optimizer more info.

I do not like force inline, as it is too often misused. I think that the compiler should decide when to inline and when not. If the compiler takes the wrong decision - then the compiler should be fixed, not the code. Fixing codegen with force inline one one platform could make the codegen on other platform worse.

Kojoley · 2019-05-02T12:04:22Z

Clang has heuristics on explicit inline.

Did not know that, seems to be true. https://godbolt.org/z/q8JUle

You have added force inline in some cases where there was no explicit inline. Try replacing all the force inlines with just inline, it should give clang optimizer more info.

This will not change anything. The thing is that clang seems to decide inlining or not exclusively on function size. Explicit inline slightly increases the inline cost limit, allowing to have one additional case in switch before it will decide not to inline the function. Globally it changes nothing.

I do not like force inline, as it is too often misused.

It is not the case. The functions I have added forceinline to are exclusively internal. That's why I did not add the forceinline to variant<T>::apply_visitor member function, while it would allow optimizing out wrapper in apply_visitor free function.

I think that the compiler should decide when to inline and when not.

In ideal world yes. But you know, "zero cost abstractions" are not always zero cost in reality.

If the compiler takes the wrong decision - then the compiler should be fixed, not the code.

There is nothing wrong decision on compiler side. It did not offer perfect inlining, and I do not think inlining is something that can be done perfectly. GCC seems to be simply much more aggressive, and its just other side of the coin. The code is not compiler friendly, optimizing memory pointers is a very tough problem. If the visitor were stored and passed by value the situation should have been probably be much better.

apolukhin · 2019-05-09T07:02:04Z

Many thanks!

apolukhin merged commit f34376e into boostorg:develop May 9, 2019

Kojoley deleted the partly-mitigate-bad-clang-inlining-decision branch May 9, 2019 12:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Partly mitigate bad Clang inlining decision #66

Partly mitigate bad Clang inlining decision #66

Kojoley commented Apr 14, 2019

coveralls commented Apr 14, 2019 •

edited

apolukhin commented Apr 23, 2019

Kojoley commented Apr 23, 2019

apolukhin commented May 2, 2019

Kojoley commented May 2, 2019

apolukhin commented May 9, 2019

Partly mitigate bad Clang inlining decision #66

Partly mitigate bad Clang inlining decision #66

Conversation

Kojoley commented Apr 14, 2019

coveralls commented Apr 14, 2019 • edited

apolukhin commented Apr 23, 2019

Kojoley commented Apr 23, 2019

apolukhin commented May 2, 2019

Kojoley commented May 2, 2019

apolukhin commented May 9, 2019

coveralls commented Apr 14, 2019 •

edited