New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shader_jit_x64_compiler: Remove ABI overhead of LG2 and EX2 #3145

Merged
merged 2 commits into from Dec 3, 2017

Conversation

Projects
None yet
4 participants
@MerryMage
Copy link
Member

MerryMage commented Nov 25, 2017

We reimplement log2f and exp2f in assembly.

Algorithms were inspired by those found in Mesa I've used a higher order approximation for exp2 than they did for more accuracy.

Performance numbers (for 1.2 billion sequential calls of the same instruction on my machine):

Instruction Old time New time
ex2 21.5s 8.8s
lg2 13.3s 7.3s

This change is Reviewable

@MerryMage MerryMage force-pushed the MerryMage:lg2-ex2 branch from 0a8b8ae to af6d0ac Nov 25, 2017

@yuriks

This comment has been minimized.

Copy link
Member

yuriks commented Nov 26, 2017

Nice perf wins and nice, tests!


Reviewed 2 of 2 files at r1, 2 of 2 files at r2, 1 of 1 files at r3.
Review status: all files reviewed at latest revision, 9 unresolved discussions.


src/tests/CMakeLists.txt, line 24 at r1 (raw file):

target_link_libraries(tests PRIVATE ${PLATFORM_LIBRARIES} catch-single-include nihstro-headers Threads::Threads)
if (ARCHITECTURE_x86_64)
    target_link_libraries(tests PRIVATE xbyak)

That's weird, I would've expected this to be pulled transitively via video_core, so you shouldn't need to specify it here, since you're not using any xbyak functions directly.

Ah, looks like we're specifying xbyak as a PRIVATE dependency in common and video_core's CMakeLists. I think this is wrong, since they're used in public headers from those libraries, so they should be changed to PUBLIC.


src/tests/video_core/shader/shader_jit_x64_compiler.cpp, line 5 at r1 (raw file):

// Refer to the license.txt file included.

#ifdef ARCHITECTURE_x86_64

Can you conditionally include the entire via on CMakeLists instead (like is done in the main codebase). Or do you have a particular reason to do it this way?


src/tests/video_core/shader/shader_jit_x64_compiler.cpp, line 49 at r1 (raw file):

    });

    const auto run = [&](float input) {

You could encapsulate this (compiling the shader, setting up N inputs and the output, and then the run function receiving N params and returning the output) into a standalone class to avoid duplicating this setup twice, and since it seems to be widely applicable to other shader tests.


src/video_core/shader/shader_jit_x64_compiler.cpp, line 930 at r2 (raw file):

void JitShader::CompilePrelude() {
    CompilePrelude_Log2();
    CompilePrelude_Exp2();

Optional: Have these two functions return the Labels to the compiled function, which are then assigned here, making it a bit more obvious that's where their results are going.


src/video_core/shader/shader_jit_x64_compiler.cpp, line 969 at r2 (raw file):

    align(16);
    L(input_is_nan);

Any reason not to merge this block with one of the other two ret from the other two cases?


src/video_core/shader/shader_jit_x64_compiler.cpp, line 980 at r2 (raw file):

    align(16);
    log2_subroutine = getCurr();

You should be able to also use labels for the routine entry points and then use call with them, letting xbyak possibly emit a shorter relative call instead.


src/video_core/shader/shader_jit_x64_compiler.cpp, line 993 at r2 (raw file):

    and_(eax, 0x7f800000);
    and_(edx, 0x007fffff);
    movss(SCRATCH, xword[rip + c0]); // Preload c0.

Curious: How did you determine placement of this load? (i.e. why here, instead of at the beginning of this block?)


src/video_core/shader/shader_jit_x64_compiler.cpp, line 1050 at r2 (raw file):

    align(16);
    L(ret_label);

Isn't it better to just put ret_label before the final ret in the subroutine, instead of dedicating 16 bytes for it?


src/video_core/shader/shader_jit_x64_compiler.cpp, line 783 at r3 (raw file):

    add(ABI_PARAM2, static_cast<Xbyak::uint32>(offsetof(UnitState, registers.output)));
    CallFarFunction(*this, Emit);
    if (Common::GetCPUCaps().avx) {

After some research, it seems that the latest versions of gcc, clang and MSVC all generate vzeroupper before calls or rets in functions in which AVX was used. Tests: https://godbolt.org/g/K7Eeun (Note that at the time of writing the MSVC in godbolt is out of date and has a bug.)

Given that, do we still need to emit this? And in any case, would it make more sense to gate this behind AVX (the compiler define) instead of the runtime cap bit? Presumably if AVX isn't set then we don't need to worry about AVX code in the rest of the application.


Comments from Reviewable

@MerryMage

This comment has been minimized.

Copy link
Member

MerryMage commented Nov 28, 2017

Review status: all files reviewed at latest revision, 9 unresolved discussions.


src/tests/video_core/shader/shader_jit_x64_compiler.cpp, line 5 at r1 (raw file):

Previously, yuriks (Yuri Kunde Schlesner) wrote…

Can you conditionally include the entire via on CMakeLists instead (like is done in the main codebase). Or do you have a particular reason to do it this way?

No reason, will do it in CMakeLists.


src/video_core/shader/shader_jit_x64_compiler.cpp, line 969 at r2 (raw file):

Previously, yuriks (Yuri Kunde Schlesner) wrote…

Any reason not to merge this block with one of the other two ret from the other two cases?

Will do.


src/video_core/shader/shader_jit_x64_compiler.cpp, line 993 at r2 (raw file):

Previously, yuriks (Yuri Kunde Schlesner) wrote…

Curious: How did you determine placement of this load? (i.e. why here, instead of at the beginning of this block?)

Decision of ordering of instructions was based on execution port contention, done on paper. The difference in practice is minor - a couple of cycles - and is unlikely to be noticeable, but I like doing it anyway.


src/video_core/shader/shader_jit_x64_compiler.cpp, line 1050 at r2 (raw file):

Previously, yuriks (Yuri Kunde Schlesner) wrote…

Isn't it better to just put ret_label before the final ret in the subroutine, instead of dedicating 16 bytes for it?

Smaller jump instruction in the function itself.


src/video_core/shader/shader_jit_x64_compiler.cpp, line 783 at r3 (raw file):

Previously, yuriks (Yuri Kunde Schlesner) wrote…

After some research, it seems that the latest versions of gcc, clang and MSVC all generate vzeroupper before calls or rets in functions in which AVX was used. Tests: https://godbolt.org/g/K7Eeun (Note that at the time of writing the MSVC in godbolt is out of date and has a bug.)

Given that, do we still need to emit this? And in any case, would it make more sense to gate this behind AVX (the compiler define) instead of the runtime cap bit? Presumably if AVX isn't set then we don't need to worry about AVX code in the rest of the application.

Paranoia on my part, essentially. I can drop that commit.


Comments from Reviewable

@MerryMage MerryMage force-pushed the MerryMage:lg2-ex2 branch from af6d0ac to 5d5c326 Nov 28, 2017

@MerryMage

This comment has been minimized.

Copy link
Member

MerryMage commented Nov 28, 2017

Added a ShaderTest class, leaving different future requirements for Run inputs and outputs for any future tests to figure out.

Thanks for the review @yuriks.

@Subv

Subv approved these changes Nov 28, 2017

@yuriks

This comment has been minimized.

Copy link
Member

yuriks commented Nov 30, 2017

:lgtm:


Reviewed 2 of 5 files at r4, 2 of 2 files at r5, 1 of 1 files at r6.
Review status: all files reviewed at latest revision, 1 unresolved discussion.


src/video_core/shader/shader_jit_x64_compiler.cpp, line 1018 at r6 (raw file):

    xorps(SRC1, SRC1); // break dependency chain
    movss(SRC1, SCRATCH2);
    L(input_is_nan);

Just to confirm: Including the shufps in this case (after moving the label here) is intentional, right?


Comments from Reviewable

@yuriks

yuriks approved these changes Nov 30, 2017

@MerryMage

This comment has been minimized.

Copy link
Member

MerryMage commented Nov 30, 2017

Review status: all files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


src/video_core/shader/shader_jit_x64_compiler.cpp, line 1018 at r6 (raw file):

Previously, yuriks (Yuri Kunde Schlesner) wrote…

Just to confirm: Including the shufps in this case (after moving the label here) is intentional, right?

Indeed, mistake the first time.


Comments from Reviewable

MerryMage added some commits Nov 25, 2017

tests: Add tests for x64 shader jit
Tests LG2 and EX2 instructions
shader_jit_x64_compiler: Remove ABI overhead of LG2 and EX2
This involves reimplementing log2f and exp2f.

@MerryMage MerryMage force-pushed the MerryMage:lg2-ex2 branch from 9927c19 to c1aef26 Nov 30, 2017

@MerryMage

This comment has been minimized.

Copy link
Member

MerryMage commented Nov 30, 2017

Rebased.

@jroweboy jroweboy referenced this pull request Dec 1, 2017

Closed

fixed AVX #3171

@MerryMage MerryMage merged commit e23c3cd into citra-emu:master Dec 3, 2017

2 of 3 checks passed

code-review/reviewable 2 files left (yuriks)
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@MerryMage MerryMage deleted the MerryMage:lg2-ex2 branch Dec 3, 2017

addss(SCRATCH2, SCRATCH);

// Duplicate result across vector
xorps(SRC1, SRC1); // break dependency chain

This comment has been minimized.

@hrydgard

hrydgard Dec 4, 2017

Contributor

Curious, does this really help? The mov to it in the next instruction should have the same effect..

This comment has been minimized.

@MerryMage

MerryMage Dec 4, 2017

Member

My logic was that shufps depends on the entire xmm register, while movss only affects the lowest 32 bits. I could be incorrect.

This comment has been minimized.

@hrydgard

hrydgard Dec 4, 2017

Contributor

Oh, doh. I forgot about that.. Yeah, this is probably right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment