common: Improved CRC calculation for zero buffers #11966
@@ -266,3 +266,83 @@ TEST(Crc32c, RangeNull) {
    ASSERT_EQ(crc, *check);
  }
}
Review comment: Measuring total time of a large number of iterations and then calculating an average (or median) would be more accurate and simpler.

Review comment: @branch-predictor But hey, now you got a result with an error estimation, how cool is THAT?

double estimate_clock_resolution()
{
  volatile char* p = (volatile char*)malloc(1024);
  utime_t start;
  utime_t end;
  std::set<double> S;
  for (int j=10; j<200; j+=1) {
    start = ceph_clock_now();
    for (int i=0; i<j; i++)
      p[i]=1;
    end = ceph_clock_now();
    S.insert((double)(end - start));
  }
  auto head = S.begin();
  auto tail = S.end();
  for (size_t i=0; i<S.size()/4; i++) {
    ++head;
    --tail;
  }
  double v = *(head++);
  double range=0;
  while (head != tail) {
    range = std::max(range, *head - v);
    v = *head;
    head++;
  }
  free((void*)p);
  return range;
}

TEST(Crc32c, zeros_performance_compare) {
  double resolution = estimate_clock_resolution();
  utime_t start;
  utime_t pre_start;
  utime_t end;
  double time_adjusted;
  using namespace std::chrono;
  high_resolution_clock::now();
  for (size_t scale=1; scale < 31; scale++)
  {
    size_t size = (1<<scale) + rand()%(1<<scale);
    pre_start = ceph_clock_now();
    start = ceph_clock_now();
    uint32_t crc_a = ceph_crc32c(111, nullptr, size);
    end = ceph_clock_now();
    time_adjusted = (end - start) - (start - pre_start);
    std::cout << "regular method. size=" << size << " time= " << (double)(end-start)
              << " at " << (double)size/(1024*1024)/(time_adjusted) << " MB/sec"
              << " error=" << resolution / time_adjusted * 100 << "%" << std::endl;

    pre_start = ceph_clock_now();
    start = ceph_clock_now();
    uint32_t crc_b = ceph_crc32c_func(111, nullptr, size);
    end = ceph_clock_now();
    time_adjusted = (end - start) - (start - pre_start);
    std::cout << "fallback method. size=" << size << " time=" << (double)(end-start)
              << " at " << (double)size/(1024*1024)/(time_adjusted) << " MB/sec"
              << " error=" << resolution / time_adjusted * 100 << "%" << std::endl;
    EXPECT_EQ(crc_a, crc_b);
  }
}

TEST(Crc32c, zeros_performance) {
  constexpr size_t ITER=100000;
  utime_t start;
  utime_t end;

  start = ceph_clock_now();
  for (size_t i=0; i<ITER; i++)
    for (size_t scale=1; scale < 31; scale++)
    {
      size_t size = (1<<scale) + rand() % (1<<scale);
      ceph_crc32c(rand(), nullptr, size);
    }
  end = ceph_clock_now();
  std::cout << "iterations=" << ITER*31 << " time=" << (double)(end-start) << std::endl;
}

Review comment: Here you are using rand twice: once to exercise the case where size is not on a power-of-2 boundary, and a second time to randomize the input crc value. I'm fine with this because over the course of 100,000 iterations the input crc values should not significantly impact the time it takes. The only reason I mention it is that @branch-predictor was talking about unpredictable results, and I wasn't sure if he was objecting to the use of rand (and not just the use of low-resolution timers). Note that if you run 1,000 iterations (let alone 100,000) of ceph_crc32c_func instead of ceph_crc32c_zeros with a constant buffer size of 1 GB, it takes several minutes to complete, but with your algorithm it takes seconds. More evidence of how much faster your algorithm is than the previously existing methods.

Review comment: @aclamk You're using ceph_clock_now, which is an alias for clock_gettime(CLOCK_REALTIME, &tp). This is not the best solution; CLOCK_MONOTONIC would be better, as it's not prone to time adjustments/deviations. It should also have better accuracy.
Review comment: I think this length of 16 needs to be increased to around 2048 or 4096. I tested on x86 and ppc64le to see at what point your algorithm is faster than the optimized assembly for fastzero (on x86 this is dropped; on ppc64le I haven't dropped it yet). On my x86 system your algorithm is faster once the length is above 4096; on my ppc64le system, once the length is above 2048.
Review comment: @kestrels On my test machine, the best point to fall back to the regular method was 16. I have difficulty believing that anything more than 64 will be faster.
Review comment: Your algorithm makes use of a large precomputed array (coincidentally, that is 4 kilobytes of data). Your code will cause cache misses across these 4 KB, which may be as costly as (or more costly than) actually computing the crc across up to 4 KB of actual data (the cost of the cache misses in that case will be balanced by the cache prefetcher). So that's why I agree on putting "length >= 2048" at least.
Review comment: @branch-predictor This seems unproductive. Unless you propose an algorithm to measure the best length at which to switch methods, it's just guessing.
Review comment: @aclamk I proposed it multiple times, and it boils down to measuring the average time needed for each implementation and each block size, and picking the point where one implementation outperforms the other.
Review comment: Ok, I didn't.