Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partly mitigate bad Clang inlining decision #66

Conversation

Kojoley
Copy link
Contributor

@Kojoley Kojoley commented Apr 14, 2019

Because a visitor is wrapped several times during visitation it cases extra temporaries usage and useless store and loads that can only be optimized if the visitation_impl is inlined into the function that creates the wrapper. Clang inliner decides not to inline functions even with small-sized switches, resulting in a poor visitation code. Forceinline mark on those internal functions perceptibly improves the situation, though does not mitigate it completely.

LLVM ticket https://bugs.llvm.org/show_bug.cgi?id=41491

Before:

foo:
	sub	rsp, 88
	lea	rax, [rsp + 64]
	mov	qword ptr [rsp + 72], rax
	lea	rax, [rsp + 72]
	mov	qword ptr [rsp + 80], rax
	mov	eax, dword ptr [rcx]
	lea	r9, [rcx + 4]
	mov	edx, eax
	sar	edx, 31
	xor	edx, eax
	xorps	xmm0, xmm0
	movups	xmmword ptr [rsp + 48], xmm0
	lea	r8, [rsp + 80]
	mov	ecx, eax
	call	tail
	nop
	add	rsp, 88
	ret

tail:
	add	edx, -1
	cmp	edx, 8
	ja	.LBB1_2
	lea	rax, [rip + .LJTI1_0]
	movsxd	rcx, dword ptr [rax + 4*rdx]
	add	rcx, rax
	jmp	rcx

After:

foo:
	sub	rsp, 56
	lea	rax, [rsp + 40]
	mov	qword ptr [rsp + 48], rax
	lea	rdx, [rsp + 48]
	call	tail
	nop
	add	rsp, 56
	ret

tail:
	mov	ecx, dword ptr [rcx]
	mov	eax, ecx
	sar	eax, 31
	xor	eax, ecx
	add	eax, -1
	cmp	eax, 8
	ja	.LBB1_2
	lea	rcx, [rip + .LJTI1_0]
	movsxd	rax, dword ptr [rcx + 4*rax]
	add	rax, rcx
	jmp	rax

Because a visitor is wrapped several times during visitation it cases extra
temporaries usage and useless store and loads that can only be optimized if
the `visitation_impl` is inlined into the function that creates the wrapper.
Clang inliner decides not to inline functions even with small-sized switches,
resulting in a poor visitation code. Forceinline mark on those internal functions perceptibly improves the situation, though does not mitigate it
completely.

LLVM ticket https://bugs.llvm.org/show_bug.cgi?id=41491
@coveralls
Copy link

coveralls commented Apr 14, 2019

Coverage Status

Coverage increased (+0.1%) to 95.404% when pulling dd9b0c9 on Kojoley:partly-mitigate-bad-clang-inlining-decision into 74ea828 on boostorg:develop.

@apolukhin
Copy link
Member

I'm not a big fan of force inline... Is inline not enough for getting the better code?

@Kojoley
Copy link
Contributor Author

Kojoley commented Apr 23, 2019

Is inline not enough for getting the better code?

Nope, as you can see I have literally replaced inline with BOOST_FORCEINLINE. It seems that LLVM inliner's cost model pessimizes switches too much.

@apolukhin
Copy link
Member

Clang has heuristics on explicit inline. You have added force inline in some cases where there was no explicit inline. Try replacing all the force inlines with just inline, it should give clang optimizer more info.

I do not like force inline, as it is too often misused. I think that the compiler should decide when to inline and when not. If the compiler takes the wrong decision - then the compiler should be fixed, not the code. Fixing codegen with force inline one one platform could make the codegen on other platform worse.

@Kojoley
Copy link
Contributor Author

Kojoley commented May 2, 2019

Clang has heuristics on explicit inline.

Did not know that, seems to be true. https://godbolt.org/z/q8JUle

You have added force inline in some cases where there was no explicit inline. Try replacing all the force inlines with just inline, it should give clang optimizer more info.

This will not change anything. The thing is that clang seems to decide inlining or not exclusively on function size. Explicit inline slightly increases the inline cost limit, allowing to have one additional case in switch before it will decide not to inline the function. Globally it changes nothing.

I do not like force inline, as it is too often misused.

It is not the case. The functions I have added forceinline to are exclusively internal. That's why I did not add the forceinline to variant<T>::apply_visitor member function, while it would allow optimizing out wrapper in apply_visitor free function.

I think that the compiler should decide when to inline and when not.

In ideal world yes. But you know, "zero cost abstractions" are not always zero cost in reality.

If the compiler takes the wrong decision - then the compiler should be fixed, not the code.

There is nothing wrong decision on compiler side. It did not offer perfect inlining, and I do not think inlining is something that can be done perfectly. GCC seems to be simply much more aggressive, and its just other side of the coin. The code is not compiler friendly, optimizing memory pointers is a very tough problem. If the visitor were stored and passed by value the situation should have been probably be much better.

@apolukhin apolukhin merged commit f34376e into boostorg:develop May 9, 2019
@apolukhin
Copy link
Member

Many thanks!

@Kojoley Kojoley deleted the partly-mitigate-bad-clang-inlining-decision branch May 9, 2019 12:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants