common: Improved CRC calculation for zero buffers #11966

Merged

liewegas merged 2 commits into ceph:master on May 24, 2017
Conversation
@aclamk
Contributor

aclamk commented Nov 14, 2016

Created special function for calculating CRC for data that contains all zeros.

This function's complexity is O(1).
Expected amount of operations is 'count of 1s in length' * 16.

Signed-off-by: Adam Kupczyk akupczyk@mirantis.com

@liewegas liewegas changed the title from Improved CRC calculation. to common: Improved CRC calculation for zero buffers Nov 14, 2016

@ghost


ghost commented Nov 16, 2016

jenkins test this please (eio now ignored in master)

@ghost


ghost commented Nov 17, 2016

jenkins test this please (general jenkins failure)

@branch-predictor

According to the code, this is not O(1). Secondly, if what you're targeting are zeroed-out bufferlists that are going to be written out to a storage device, you could probably just go with a static array of precalculated CRC32s for the most common block sizes (512b and 4k in particular) and call it a day, saving CPU time and the storage space of crc_turbo_struct; right now it takes 4KB (32×32×4 bytes), but you could easily fit in 128 bytes or less.
And finally, your code just calculates the expected CRC32 for a buffer that contains a specified amount of zeros; it doesn't actually do any checksum calculation on actual data. So it could be useful only in write-only scenarios, and be a source of misleading bugs when someone miscalculates the 0-byte area size.

@aclamk


Contributor

aclamk commented Nov 24, 2016

@branch-predictor

  1. The time complexity of the algorithm is bounded by 32×32 operations that can be considered atomic.
    And since O(const × F) = O(F), we have O(1024) == O(1).
  2. I am not targeting bufferlists that are zeroed. I am targeting a case that was already implemented in Ceph: speeding up CRC calculation for bufferlists when you only append/prepend data.
    The math is that if you know the CRC for a part of the data, you can reuse it when it becomes part of larger data.
  3. Indeed, the function calculates CRC for ZEROS.
@branch-predictor


Member

branch-predictor commented Nov 29, 2016

@aclamk

  1. That's not entirely true, because you still depend on bits in the CRC values being set or not, so the actual complexity will differ between values.
  2. OK, but you still haven't provided users of your code, so at this point it is unused anyway (a separate commit binding the cases you are targeting to the new function is OK).
@aclamk


Contributor

aclamk commented Mar 28, 2017

@branch-predictor Calculating the "0" CRC is now connected to the bufferlist infrastructure.

@kestrels


Contributor

kestrels commented Apr 19, 2017

@aclamk
@branch-predictor
@tchaikov
@smatzek

Is there any reason this is not tied into src/common/crc32c* and src/test/common/test_crc32c.cc? In particular you would want to look at the code in src/arch and how it relates to src/common/crc32c.cc.

As you have it right now, this is called in one specific place (buffer lists) and is not architecture agnostic (i.e. it is specific to Intel x86, and not called for arm or ppc architectures). That is surprising since it is not written in assembly.

Also, can you provide some numbers that show your version is faster than the existing optimized assembly?

When I created a test program using your code and ran on x86, your version was consistently 2x slower for me than the existing x86 fast zero code (in src/common/crc32c_intel_fast_asm.S).

I was interested because we are working on a version of crc32c_ppc_fast_asm.S (i.e. serving the same purpose but running on the PowerPC architecture). If yours were faster it would make my drop obsolete. When I repeated the same test as the above on ppc, your algorithm was quite a bit slower there as well.

In both of the above cases (x86 and ppc) my tests used a buffer len of 1GB. I realize this is not all that scientific a test, since it doesn't really cover various buffer lengths and crc starting values, I just mean it as a reference point for discussion.

@tchaikov tchaikov self-requested a review Apr 20, 2017

@aclamk


Contributor

aclamk commented Apr 20, 2017

@kestrels

  1. The implementation is in common/crc32c.cc. The location was selected because it is implemented in generic C++, with no arch specifics.
    Tests are now located in tests/bufferlist.cc and tests/common/test_crc32c.cc.

  2. I modified the connection point; the "0" crc is now connected to all archs.

  3. Performance, from the newly created test case:

```
[ RUN ] Crc32c.zeros_performance
regular method. size=3 time= 2.7e-07 at 10.5964 MB/sec
fallback method. size=3 time=4.31e-07 at 6.6381 MB/sec
regular method. size=6 time= 1.97e-07 at 29.0459 MB/sec
fallback method. size=6 time=1.2e-07 at 47.6837 MB/sec
regular method. size=9 time= 1.88e-07 at 45.6546 MB/sec
fallback method. size=9 time=4.52e-07 at 18.9891 MB/sec
regular method. size=19 time= 2.73e-07 at 66.3729 MB/sec
fallback method. size=19 time=2.42e-07 at 74.8753 MB/sec
regular method. size=49 time= 2.85e-07 at 163.965 MB/sec
fallback method. size=49 time=3.8e-07 at 122.974 MB/sec
regular method. size=127 time= 6.58e-07 at 184.068 MB/sec
fallback method. size=127 time=6.77e-07 at 178.902 MB/sec
regular method. size=202 time= 4.5e-07 at 428.094 MB/sec
fallback method. size=202 time=1.62e-07 at 1189.15 MB/sec
regular method. size=492 time= 6.67e-07 at 703.46 MB/sec
fallback method. size=492 time=4.66e-07 at 1006.88 MB/sec
regular method. size=809 time= 5.24e-07 at 1472.37 MB/sec
fallback method. size=809 time=4.23e-07 at 1823.93 MB/sec
regular method. size=1229 time= 6.01e-07 at 1950.19 MB/sec
fallback method. size=1229 time=5.75e-07 at 2038.38 MB/sec
regular method. size=2234 time= 6.13e-07 at 3475.54 MB/sec
fallback method. size=2234 time=9.41e-07 at 2264.09 MB/sec
regular method. size=6059 time= 9.06e-07 at 6377.83 MB/sec
fallback method. size=6059 time=2.366e-06 at 2442.23 MB/sec
regular method. size=8690 time= 7.96e-07 at 10411.3 MB/sec
fallback method. size=8690 time=3.334e-06 at 2485.73 MB/sec
regular method. size=24315 time= 1.138e-06 at 20376.6 MB/sec
fallback method. size=24315 time=9.172e-06 at 2528.19 MB/sec
regular method. size=43491 time= 8.22e-07 at 50457.7 MB/sec
fallback method. size=43491 time=1.6322e-05 at 2541.13 MB/sec
regular method. size=123206 time= 8.02e-07 at 146507 MB/sec
fallback method. size=123206 time=4.6072e-05 at 2550.32 MB/sec
regular method. size=196732 time= 7.42e-07 at 252855 MB/sec
fallback method. size=196732 time=8.1529e-05 at 2301.25 MB/sec
regular method. size=287426 time= 7.93e-07 at 345663 MB/sec
fallback method. size=287426 time=0.000107467 at 2550.65 MB/sec
regular method. size=526420 time= 5.44e-07 at 922855 MB/sec
fallback method. size=526420 time=0.000213527 at 2351.15 MB/sec
regular method. size=1124344 time= 1.169e-06 at 917244 MB/sec
fallback method. size=1124344 time=0.000454927 at 2356.99 MB/sec
regular method. size=3547931 time= 9.95e-07 at 3.40057e+06 MB/sec
fallback method. size=3547931 time=0.00134473 at 2516.17 MB/sec
regular method. size=5695976 time= 1.378e-06 at 3.94202e+06 MB/sec
fallback method. size=5695976 time=0.00221495 at 2452.47 MB/sec
regular method. size=9489895 time= 1.416e-06 at 6.39143e+06 MB/sec
fallback method. size=9489895 time=0.00377843 at 2395.25 MB/sec
regular method. size=32457613 time= 1.1551e-05 at 2.67977e+06 MB/sec
fallback method. size=32457613 time=0.0126375 at 2449.38 MB/sec
regular method. size=34475894 time= 1.456e-06 at 2.25816e+07 MB/sec
fallback method. size=34475894 time=0.0132855 at 2474.79 MB/sec
regular method. size=122824026 time= 2.391e-06 at 4.89896e+07 MB/sec
fallback method. size=122824026 time=0.0477975 at 2450.63 MB/sec
regular method. size=144505134 time= 2.317e-06 at 5.94781e+07 MB/sec
fallback method. size=144505134 time=0.0566538 at 2432.51 MB/sec
regular method. size=502100579 time= 2.407e-06 at 1.98937e+08 MB/sec
fallback method. size=502100579 time=0.197493 at 2424.6 MB/sec
regular method. size=1071432243 time= 2.199e-06 at 4.64665e+08 MB/sec
fallback method. size=1071432243 time=0.413142 at 2473.23 MB/sec
regular method. size=1542444959 time= 2.917e-06 at 5.04282e+08 MB/sec
fallback method. size=1542444959 time=0.595465 at 2470.32 MB/sec
[ OK ] Crc32c.zeros_performance (1346 ms)
```

"regular" means new method
"fallback" means ceph's crc fallback method
As can be seen, execution time is almost not dependent on size.
However, for small sizes, it might be more efficient to use current methods.

uint32_t crc1 = 0;
if ((len & 1) == 1) {
  for (int b = 0; b < 32; b++) {
    if ((crc & (1UL << b)) != 0) {


@branch-predictor

branch-predictor Apr 20, 2017

Member

Since you're not using the value of "crc" after this loop, you could just do "if (crc & 1)" and "crc >>= 1", then break out early when crc becomes 0.


@aclamk

aclamk Apr 20, 2017

Contributor

@branch-predictor Nice. Will do something like that.

start = ceph_clock_now();
uint32_t crc_a = ceph_crc32c(111, nullptr, size);
end = ceph_clock_now();
std::cout << "regular method. size=" << size << " time= " << (double)(end-start)


@branch-predictor

branch-predictor Apr 20, 2017

Member

That is not going to give reliable, accurate results. At least put the crc function call in a tight loop.


@aclamk

aclamk Apr 20, 2017

Contributor

@branch-predictor It gives inaccurate results, but still accurate enough to compare.


@branch-predictor

branch-predictor Apr 20, 2017

Member

No, they're too inaccurate to be considered meaningful. Again, just execute that function in a loop (one thousand times, for example), then measure the time of the entire run and you'll get a reasonable time that can be sensibly compared to other implementations. Or just divide it by the loop count, and you'll get a reasonable approximation of the average time a single call takes.


@aclamk

aclamk Apr 20, 2017

Contributor

@branch-predictor It depends on what you intend to measure. I just measured the difference against other algorithms. If you insist, I can add a standalone test that will just show how fast a single calculation is.


@branch-predictor

branch-predictor Apr 20, 2017

Member

@aclamk Both implementations perform one call in sub-millisecond time on any reasonable CPU, with a lot of things getting in the way, and additionally you don't use a high-resolution clock. All that will give you a huge statistical error, making such measurements pointless. Running each function in a separate loop and measuring either the entire loop time, or dividing the loop time by the iteration count, will give you more reliable, repeatable results.


@kestrels

kestrels May 4, 2017

Contributor

@aclamk @branch-predictor I agree with branch-predictor that it would be best to use a high-resolution timer in this case. An example where the ceph unit tests already make use of high_resolution_clock can be found in src/test/objectstore_bench.cc. In that file look for the use of #include <chrono>, using namespace std::chrono, and high_resolution_clock::now().

<< " at " << (double)size/(1024*1024)/(double)(end-start) << " MB/sec" << std::endl;
start = ceph_clock_now();
uint32_t crc_b = ceph_crc32c_sctp(111, nullptr, size);


@branch-predictor

branch-predictor Apr 20, 2017

Member

I'd call that cheating. ;-) An implementation that uses the CRC32 feature of modern x86 CPUs is over 10x faster than ceph_crc32c_sctp.


@aclamk

aclamk Apr 20, 2017

Contributor

@branch-predictor Well... the only difference is whether we are going to implement the fallback for smaller sizes. For larger sizes a 10x difference does not matter.


@smatzek

smatzek Apr 20, 2017

Contributor

I think the difference is that we should be comparing the speed of the new algorithm vs the Intel fastzero implementation for all block sizes. @kestrels test was calling these two against each other:
crc = ceph_crc32c_intel_fast(crc, (unsigned char const *) NULL, nzerobytes);
vs
crc = ceph_crc32c_zeros(crc, nzerobytes);

in a main and run with 'time' with these results:
time ./main -f init -z 1073741824 -d
This is Intel fastzero
crc=0x058E3FA2

real 0m0.046s
user 0m0.044s
sys 0m0.000s
time ./main -f init -z 1073741824 -m
This is the new PR
crc=0x058E3FA2

real 0m0.096s
user 0m0.092s
sys 0m0.000s

On a processor like this:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Xeon(R) CPU E3-1271 v3 @ 3.60GHz


@branch-predictor

branch-predictor Apr 20, 2017

Member

@smatzek I'm afraid that you've measured the speed of launching the "main" binary instead of the code in question.


@aclamk

aclamk Apr 20, 2017

Contributor

@smatzek The first use of crc_zeros fills a 32x32 table of precalculated crc's. I suspect that this is what is responsible for the initial drop in performance.


@smatzek

smatzek Apr 20, 2017

Contributor

Yes, quite likely. As @kestrels mentioned in his original post, the test was non-scientific, just a "sniff test". @kestrels is not online today. I've re-run his test on that machine 10 times and averaged the results. The Intel fastzero assembly is still ~1.5x faster, but again the test case is not the best. I agree we should get a good representative set of input CRCs and buffer lengths and run the tests. My only point is that the comparison should be against the Intel fastzero optimization, not the slow fallback it is currently using.

RocksDB is also doing CRC. If it is doing CRC on zero buffers, we should see whether this impl beats the _mm_crc32_u32 SSE4.2 instruction RocksDB is using and, if so, get this impl into RocksDB if it impacts BlueStore performance.


@branch-predictor

branch-predictor Apr 20, 2017

Member

@smatzek _mm_crc32_u32 is an "intrinsic" (a way to embed assembly instructions in code without resorting to an actual assembly flavor, as some compilers use AT&T notation and some use Intel) wrapping the CRC32 instruction that does crc32 calculation in hardware (a feature of newer x86 CPUs). Looking at the code in RocksDB, I think RocksDB's will be slightly slower than the one used internally by Ceph.


@kestrels

kestrels May 4, 2017

Contributor

I would recommend you change this to calculate crc_b using ceph_crc32c_func instead of ceph_crc32c_sctp.

The sctp method is much, much slower and is not used on the x86, arm, or ppc64le architectures. If you make the change I'm suggesting, it would test your algorithm against whatever the architecture probe decides Ceph would be using without your algorithm.

In other words, sctp is not actually the fallback method in general; it is only the fallback method on hardware architectures other than x86, arm, or ppc64le (if there are any, I'm not even sure).


@aclamk

aclamk May 5, 2017

Contributor

@kestrels "I would recommend you change this to calculate crc_b using ceph_crc32_func instead of ceph_crc32_sctp."
Good point, I have not noticed that last refactor allowed to do this.

@branch-predictor


Member

branch-predictor commented Apr 20, 2017

@kestrels I can't confirm. For 1GB blocks it beats any other implementation -- maybe, similarly to @aclamk, you haven't measured its speed properly? It still lags (sometimes badly) on blocks around 3k and smaller... (evident data cache thrashing).

@branch-predictor


Member

branch-predictor commented Apr 20, 2017

I still think we could do much better by doing actual research on block sizes and input crcs passed to crc32 function, and hardcoding crc32 values for most common ones, which I assume would be 0 and -1 for input crc and 4k and 64k for block size.

@daxtens


daxtens commented Apr 25, 2017

Hi,

I did some work on this area a few months ago.

@aclamk - I have some questions about your implementation:

  • If I understand it correctly, your implementation masks off the bottom 4 bits ( & 15 and & ~15) of the length, and stores that in the remainder to be processed with the regular algorithm. Is that right?

  • you then go through the loop starting at line 86. That loop goes bit by bit through the length. The first four iterations will then be no-ops due to the previous masking, if I understand correctly?

  • What is the benefit of masking off the remainder? How does that speed up your algorithm?

  • My understanding of the complexity is as follows:

    • you do 32 loops through your reduction loop, which masks and xors, etc.
    • you then put the remainder (between 0 and 16) through the original implementation - this will do up to 16 loops.

So (cc @branch-predictor) if I remember complexity analysis correctly, the complexity is in terms of the length we want to zero: let's call it n. We analyse it as n goes to infinity - so we don't consider that n is bounded to be a 32-bit number. Considered that way, the first part requires as many operations as there are bits in the input: log_2 n. So we would say that it's O(log n). I realise that in the actual implementation there are at most 32 iterations, so it would appear to be O(1); but by that logic you would also say that the original implementation does at most 2^32 operations, so it's also O(1). But clearly we would say that the original implementation is O(n)!

So this would be an O(log n) implementation. That's still a win - and it's worth trying to get this cleaned up and implemented, if all the random terms don't end up making it worse.

@smatzek and @kestrels - this is conceptually similar to the work I did with you. In both cases, we go bit by bit through the input. However, my code still did the underlying multiplication in the field, rather than reducing to a precomputed table of XORable values. This has upsides and downsides: the upside is that it requires only 32 values to cover 32 bits[0], rather than 32x32 values. The downside is that rather than requiring a single XOR, it requires a number of vector ops.

You'll need to weigh up the cache pollution and all the masking and XORing vs the use of vector, noting that my implementation is not at all close to optimal.

So, how do you accurately benchmark this? The first thing would be to actually look at the lengths that are CRCed. I never got around to doing this before I left. You'd probably also want to do benchmarks of Ceph overall to look for any effects from code/data being evicted from cache.

[0]: My implementation did up to 64, you could reduce that, also consider if the periodicity of the generated values could be used to reduce it further.

@aclamk


Contributor

aclamk commented Apr 25, 2017

@daxtens
Hi,
Thank you for your attention.

  1. Masking out the 4 lowest bits is a late optimization.
    It stems from the fact that the expected number of "basic" steps performed is:
    E = {number of bits set in length} * {expected number of "1" bits in CRC at a specific step}
    and obviously
    E = {number of bits set in length} * ~16

So each bit costs ~16. The expected cost of processing the 4 lowest bits is 16*2 = 32.
It seems more reasonable to trade that for a linear calculation of the CRC (at most 16 steps).

  2. You are right - there are 4 unneeded empty iterations. Will fix.

  3. Complexity (cc @branch-predictor)
    I agree that my initial statement that it is linear was incorrect.
    If we take n as the problem length, then the complexity is O(log n).

  4. "[0]: My implementation did up to 64, you could reduce that, also consider if the periodicity of the generated values could be used to reduce it further."
    Could you please show me the mathematics of this periodicity?

  5. Optimization
    Of course, the table crc_turbo_table can be precomputed.

@kestrels


Contributor

kestrels commented Apr 25, 2017

@aclamk
@daxtens
@smatzek
@branch-predictor

My apologies to @aclamk. I was wrong about the test results I posted earlier. They were wrong due to a bug in my test code.

His algorithm seems to be faster across the board than the other options available on x86 or ppc64le.

I had built an isolated test harness to exercise the various crc fastzero options in isolation, rather than having to build them as part of a larger project (i.e. Ceph).

Link to the test program:
https://github.com/kestrels/crc32c-fastzero

My conclusions are posted in the repo itself, here:
https://github.com/kestrels/crc32c-fastzero/blob/master/output-data/conclusions.txt

Please let me know what you think.

@kestrels

I've left some suggestions on changes to make.


@@ -26,7 +40,13 @@ extern ceph_crc32c_func_t ceph_choose_crc32(void);
*/
static inline uint32_t ceph_crc32c(uint32_t crc, unsigned char const *data, unsigned length)
{
if (!data && length > 16)


@kestrels

kestrels May 4, 2017

Contributor

I think this length of 16 needs to be increased to around 2048 or 4096.

I tested on x86 and ppc64le to see at what point your algorithm is faster than the optimized assembly for fastzero (on x86 this is dropped, on ppc64le I haven't dropped it yet). On my x86 system your algorithm is faster once the length is above 4096. On my ppc64le system your algorithm is faster once the length is above 2048.


@aclamk

aclamk May 8, 2017

Contributor

@kestrels On my test machine, the best point to fall back to the regular method was 16. I have difficulty believing that anything more than 64 will be faster.


@branch-predictor

branch-predictor May 9, 2017

Member

Your algorithm makes use of a large precomputed array (coincidentally, 4 kilobytes of data). Your code will cause cache misses across these 4KB, which may be as costly as (or more costly than) actually computing the crc across up to 4KB of actual data (the cost of a cache miss in that case is balanced by the cache prefetcher). So that's why I agree on putting "length >= 2048" at least.


@aclamk

aclamk May 9, 2017

Contributor

@branch-predictor This seems unproductive. Unless you propose an algorithm to measure the best length at which to switch methods, it's just guessing.


@branch-predictor

branch-predictor May 9, 2017

Member

@aclamk I proposed it multiple times, and it boils down to measuring the average time needed for each implementation and each block size, and picking the point where one impl outperforms the other.


@branch-predictor

branch-predictor May 9, 2017

Member

Ok, I didn't.

<< " at " << (double)size/(1024*1024)/(double)(end-start) << " MB/sec" << std::endl;
start = ceph_clock_now();
uint32_t crc_b = ceph_crc32c_sctp(111, nullptr, size);

This comment has been minimized.

@kestrels

kestrels May 4, 2017

Contributor

I would recommend you change this to calculate crc_b using ceph_crc32_func instead of ceph_crc32_sctp.

The sctp method is much much slower and is not used on x86, arm, or ppc64le architectures. If you make the change I'm suggesting it would test your algorithm against whatever the architecture probe decides ceph would be using without your algorithm.

In other words, sctp is not actually the fallback method in general, it is only the fallback method on hardware architectures other than x86, arm, or ppc64le (if there are any, I'm not even sure).

for (size_t scale=1; scale < 31; scale++)
{
size_t size = (1<<scale) + rand()%(1<<scale);


@kestrels

kestrels May 4, 2017

Contributor

The way you're using rand here makes sense to me because it exercises the cases where the length is not on a power-of-2 boundary. In that case your ceph_crc32c_zeros function has a fallback method, which is to call ceph_crc32c_func on the remainder and integrate that value into your results.

for (size_t scale=1; scale < 31; scale++)
{
size_t size = (1<<scale) + rand() % (1<<scale);
ceph_crc32c(rand(), nullptr, size);


@kestrels

kestrels May 4, 2017

Contributor

Here you are using rand twice: once to exercise the case where size is not on a power-of-2 boundary, and a second time to randomize the input crc value. I'm fine with this because over the course of 100,000 iterations the input crc values should not significantly impact the time it takes.

The only reason I mention it is because @branch-predictor was talking about unpredictable results. I wasn't sure if he was objecting to the use of rand (and not just the use of low resolution timers).

Note that if you try to call 1,000 iterations (let alone 100,000 iterations) of ceph_crc32c_func instead of ceph_crc32c_zeros, with a constant buffer size of 1GB it takes several minutes to complete but with your algorithm it takes seconds. More evidence of how much faster your algorithm is than the previously existing methods.


@aclamk

aclamk May 8, 2017

Contributor

@kestrels

  1. I added an estimation of the error in the time measurement.
  2. I kept using clock_gettime, which has a resolution of 1e-9 s on Linux x86.


@branch-predictor

branch-predictor May 9, 2017

Member

@aclamk You're using ceph_clock_now, which is an alias for clock_gettime(CLOCK_REALTIME, &tp). This is not the best solution; CLOCK_MONOTONIC would be better, as it's not prone to time adjustments/deviations. It should also have better accuracy.

common, performance: Created special function for calculating CRC for data that contains all zeros.

Expected amount of operations is 'count of 1s in length' * 16. Generally it is O(log(length)).
Added test for performance comparison of "0" crc.
Added test to evaluate single crc calculation time.

Signed-off-by: Adam Kupczyk <akupczyk@mirantis.com>
@@ -266,3 +266,83 @@ TEST(Crc32c, RangeNull) {
ASSERT_EQ(crc, *check);
}
}
double estimate_clock_resolution()


@branch-predictor

branch-predictor May 9, 2017

Member

Measuring total time of a large number of iterations and then calculating an average (or median) would be more accurate and simpler.


@aclamk

aclamk May 9, 2017

Contributor

@branch-predictor But hey, now you got a result with an error estimation, how cool is THAT?

@branch-predictor


Member

branch-predictor commented May 9, 2017

common, performance: Now crc table is pre-calculated.
Signed-off-by: Adam Kupczyk <akupczyk@mirantis.com>

@liewegas liewegas merged commit 8e72e0e into ceph:master May 24, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
default Build finished.
Details
@liewegas


Member

liewegas commented May 24, 2017

This is a net improvement over the status quo, so merging. Feel free to follow up with patches to adjust the thresholds.
