Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of CR functions in JIT64 #746

Merged
merged 3 commits into from Aug 19, 2014

Conversation

FioraAeterna
Copy link
Contributor

Use a lookup table instead of calculating it on the fly.

@phire
Copy link
Member

phire commented Aug 6, 2014

Well, it's smaller and neater. But look up tables are known to have surprises on modern CPUs.

I suspect this one is small enough, but I'd like to see benchmarks.

@FioraAeterna
Copy link
Contributor Author

We already use lookup tables in a bunch of other places (like the quantization/dequantization), so it doesn't seem like anything new, unless I'm missing something?

@neobrain
Copy link
Member

neobrain commented Aug 6, 2014

@phire what surprises would you be talking about?

Fwiw, if there's even a tiny chance of this PR regressing, I'd prefer if a hwtest covering the whole instruction input range was written (I'll work on a feature to ease writing such tests soon-ish, so maybe wait until that's published).

@FioraAeterna FioraAeterna changed the title Fastermtcrf Improve performance of CR functions in JIT64 Aug 6, 2014
@phire
Copy link
Member

phire commented Aug 6, 2014

Lookup table is 3-4 cycles if it's an L1 hit and all the other instructions depend on each other. I think it's 7 cycles best case with a worst case in the hundreds of cycles.

The old code has very few dependencies between the blocks, only the ORs depend on each other. I suspect it can do 10 cycles consistently.

Like I said, I'd be interested in benchmarks.

@FioraAeterna
Copy link
Contributor Author

We use lookup tables in a ton of other cases where constants could be generated on the fly (for cheaper than here), so I don't think this could possibly be any worse... do we have any games that make heavy enough use of these instructions to bench?

@degasus
Copy link
Member

degasus commented Aug 6, 2014

@phire Is this lookup tables bigger than the reduced code size? You've missed to code cache miss.

@FioraAeterna
Copy link
Contributor Author

749 to 736 seconds on POV-RAY benchmark, so 1.7% faster overall.

@Tilka
Copy link
Member

Tilka commented Aug 9, 2014

lgtm

I'd be interested in a benchmark comparison of this PR with and without the lookup tables.

@Sonicadvance1
Copy link
Contributor

@FioraAeterna: This comment grants you the permission to merge this pull request whenever you think it is ready. After addressing the remaining comments, click this link to merge.


@dolphin-emu-bot allowmerge

@FioraAeterna
Copy link
Contributor Author

I'm very slightly nervous about this since some of the variants of these instructions seem incredibly rare, but if something does go wrong, I have a blanket fort prepared to defend myself.

dolphin-emu-bot added a commit that referenced this pull request Aug 19, 2014
Improve performance of CR functions in JIT64
@dolphin-emu-bot dolphin-emu-bot merged commit 961c1db into dolphin-emu:master Aug 19, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
7 participants