Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use MSVC intrinsics for better performance #116

Merged
merged 1 commit into from Feb 20, 2015
Merged

Conversation

@CarterLi
Copy link
Contributor

CarterLi commented Feb 20, 2015

Try to fix #115
Not tested yet due to lacking of MSVC compiler

vitaut added a commit that referenced this pull request Feb 20, 2015
Use MSVC intrinsics for better performance
@vitaut vitaut merged commit 939193b into fmtlib:master Feb 20, 2015
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@vitaut

This comment has been minimized.

Copy link
Contributor

vitaut commented Feb 20, 2015

Merged, thanks!

@vitaut

This comment has been minimized.

Copy link
Contributor

vitaut commented Feb 20, 2015

Applied some changes in 27e0bf7 as I think the correct intrinsic to use is _BitScanReverse. Also the intrinsic was not called in release version due to assert.

@CarterLi

This comment has been minimized.

Copy link
Contributor Author

CarterLi commented Feb 21, 2015

You are correct thanks. @vitaut

A question: Why do you disable count_digits(uint32_t n) conditionally when FMT_BUILTIN_CLZLL is not avaliable? I think count_digits(uint32_t n) can coexist with the fallback version of count_digits perfectly since count_digits(uint32_t n) is only an optimization in my opinion.
Note: _BitScanReverse64 is only avaliable on WIN64. Provide count_digits(uint32_t n) should provide much better performance when compiling 32bit version.

In addition, we may implement __builtin_ctzll using __builtin_ctz when __builtin_ctzll is not avaliable (for example), since 64bit operation is much more slower than 32bit operation on x86 and int (ie int32_t) is much more widely used than long long (ie int64_t).

@vitaut

This comment has been minimized.

Copy link
Contributor

vitaut commented Feb 21, 2015

Why do you disable count_digits(uint32_t n) conditionally when FMT_BUILTIN_CLZLL is not avaliable?

Good point, there is no reason to disable optimized count_digits(uint32_t n) when FMT_BUILTIN_CLZLL is not avaliable. I've fixed this in cbc46d9.

As for implementing __builtin_clzll in terms of __builtin_clz on 32-bit platforms, this is reasonable although not critical, because, as you mentioned yourself, int is much more widely used than long long and int is already optimized on 32-bit platforms. Besides, 32-bit systems are becoming increasingly rare. But patches are welcome =).

@CarterLi

This comment has been minimized.

Copy link
Contributor Author

CarterLi commented Feb 21, 2015

well, what I want to say is that we only provide count_digits(uint64_t n) when __builtin_clz(ll) is not available. Which means: if we call count_digits with an int on 32bit platform, we convert it into 64bit long long first, then do 64bit comparations and divisions, which is slow. We'd better provide count_digits(uint32_t) at the same time.
sorry, I posted my comment without finishing it for some reason

@vitaut

This comment has been minimized.

Copy link
Contributor

vitaut commented Feb 21, 2015

Right, but __builtin_clz or equivalent should be available on most platforms. Do you know any popular platform where __builtin_clz is not provided? Not that I'm completely opposed to adding a fallback count_digits(uint32_t) variant (perhaps, by making count_digits a template), just want to make sure that we don't add anything that is almost never used.

@patlecat

This comment has been minimized.

Copy link

patlecat commented Feb 23, 2015

Ah I didn't know you're were using intrinsics already for GCC, RapidJson is also big on SSE usage to speed up their lib. MSVC coverage is very welcome. Cool :D

@patlecat

This comment has been minimized.

Copy link

patlecat commented Feb 23, 2015

Here is a way to check which intrinsics are available for MSVC (in case you don't know already):
https://msdn.microsoft.com/en-us/library/hskdteyh.aspx

@CarterLi

This comment has been minimized.

Copy link
Contributor Author

CarterLi commented Feb 23, 2015

According to the Intel Instruction Set page, Bit Scan Reverse (BSR) is supported on i386+. It's safe enough to use it without checking.
Runtime CPUID check should not be used in this case in my opinion since CPUID instruction itself takes time

EDIT: Found some code which may be useful for this project in RapidJson

@vitaut

This comment has been minimized.

Copy link
Contributor

vitaut commented Feb 23, 2015

I agree with @CarterLi that it's better to avoid runtime checks, but thanks for the suggestion @patlecat. I was thinking of adding SSE support, but it adds very little in terms of performance while making implementation much more complicated.

@patlecat

This comment has been minimized.

Copy link

patlecat commented Feb 24, 2015

@vitaut Then why did @miloyip rely on SSE for RapidJson?

You have a library here, so leave it to the developer on what nifty features to enable for his project, then you don't need CPUID, or offer a CPUID tool seperately to make the decision easier for your users.

@vitaut

This comment has been minimized.

Copy link
Contributor

vitaut commented Feb 24, 2015

@patlecat Maybe I'm missing something, but it looks like RapidJson is not using SSE for integer to string conversion: https://github.com/miloyip/rapidjson/blob/master/include/rapidjson/internal/itoa.h

Anyway, patches are welcome =)

@miloyip

This comment has been minimized.

Copy link

miloyip commented Feb 25, 2015

Yes, @vitaut . From my implementations, SSE2 does not improve the performance much in integer-to-string conversion so I didn't put that in RapidJSON. However, RapidJSON got performance boost by using SSE2/4 for skipping whitespaces during parsing. It also uses bit scan intrinsics in Grisu implementation (float-to-string conversion) as well.

RapidJSON does not use dynamic dispatching according to CPUID. It simply uses static binding. Some discussions about this can be found here.

@vitaut

This comment has been minimized.

Copy link
Contributor

vitaut commented Feb 25, 2015

Thanks, @miloyip, that's what I thought. You've done a great job with RapidJSON BTW.

@vitaut vitaut mentioned this pull request Apr 9, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.