Skip to content

cdriper/CppJavaOneBillion

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 

Repository files navigation

CppJavaOneBillion

C++ implementation for "The One Billion Row Challenge".

The code is a fully portable C++20 except the work with memory mapped files (because it's still outside the standart).

The hashing for the city names is not 100% "fair". The measurement data can have only 413 predefined names and I use that fact. Initially I used 2 kinds of hashes. 32 bits without collisions (to avoid direct string comparison) and 16 bits to access a hash table with 64K buckets (with 4 maximum collisions per bucket). Later I found an algorithm for 16 bits hashes without collisions. For ~400 items per 64K buckets you have ~75% chance for at least one collision.

A file with 1 billion records is about ~12 Gb. It's very important to have enough free RAM for caching so the benchmark does not depend on your storage speed.

Optimizations which are not effective. #1. Dynamic portion load per worker thread. Doesn't affect results significantly. #2. Parsing the temperature via a precalculated table (completely remove multiplication). Looks like multiplication is slightly faster than shifting + one extra indirection. #3. "Compressed" hash table to make data more cache friendly. The current version uses only ~6% of the array. The compressed version (with one extra indirection) is slightly slower.

Notes about the performance vs Java. I'm not a Java guy and usually I develop on Windows. So I was not able to run the fastest "native" Java implementations. Moreover I discover that sometimes some "unsafe" versions are slower on my laptop if I compare them against "safe" versions with a lower rank. So the fastest version which I was able to test is gonix (safe). My C++ implementation is faster by at least 1.3x factor.

Also I found C++ version by HappyCerberus. I ported it from POSIX to Windows (yep, the mapped file) and my implementation is at least 1.4x faster.

About

C++ implementation for "The One Billion Row Challenge"

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages