This project is dedicated to investigating the Hash Table data structure.
The goal is to count how many times each word appears in a text.
By using a hash table, we can quickly track and retrieve how many times a specific word has appeared in the text.
The objectives of this project are to:
- implement a simple Hash Table that uses cache-friendly doubly linked lists to handle collisions (a sketch of the data layout follows this list);
- test and compare different hash functions under certain conditions and choose the best one;
- determine hotspots, analyze them and optimize them using:
  - inline assembly
  - intrinsics
  - a separate assembly file
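For reference, a minimal sketch of the data layout I have in mind; the struct and field names below are illustrative (only lists, size and hashFunction reappear in the code later), not a copy of the project's actual headers:

#include <stdint.h>
#include <stddef.h>

// one node of a collision chain: a word and its occurrence count
typedef struct Node
{
    const char*  word;
    size_t       count;
    struct Node* next;
    struct Node* prev;
} Node;

// doubly linked list used as a bucket
typedef struct
{
    Node*  head;
    Node*  tail;
    size_t size;
} List;

// the hash table: an array of buckets plus a pointer to the hash function
typedef struct
{
    List*    lists;
    size_t   size;
    uint64_t (*hashFunction)(const char*, int);
} HashTable;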
For benchmarks I used Shakespeare's texts, which contain about $ 30000 $ unique words, and a hash table of size $ 5009 $. The size is a prime number in order to reduce the number of collisions and to distribute the keys better across the hash table. The mean value (load factor) is $ 5.8 $.
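For clarity, the mean value here is simply the load factor, i.e. the average chain length:
$ \text{load factor} = \dfrac{\text{number of stored keys}}{\text{number of buckets}} $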
I'm assuming that if the mean value is low, the table won't be cache-friendly, which is one of the reasons for this choice; the main question is finding a hash function that produces as few collisions as possible. If the mean value is high, the number of collisions will be high, meaning that finding elements will be slower. This will be researched later on.
Let's first list all of the hash functions:
- Zero hash
- First letter ASCII hash
- Word length hash
- ASCII sum hash
- ASCII sum divided by length hash
- ROR hash
- ROL hash
- CRC32 hash
- FNV hash
The following histograms were made using Python and the matplotlib library.
This hash always returns 0, so obviously all of the elements will be stored in the first cell.
It's not great, because the time to reach the last element grows with the number of values we insert, which is not what we are looking for.
Max. amount of collisions:
The text I use as data (Shakespeare) contains only lower-case words, so we can conclude that the words are distributed across the cells numbered 97 to 122.
Not so good.
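The exact implementation isn't shown here, but a minimal sketch consistent with the description would be (function name assumed):

#include <stdint.h>

// returns the ASCII code of the first character; for lower-case words
// every key lands in one of the cells 97 ('a') ... 122 ('z')
uint64_t firstLetterHash (const char* str, int length)
{
    (void) length;
    return (uint64_t) (unsigned char) str[0];
}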
Max. amount of collisions:
The situation here is quite similar. I quickly wrote a Python script to find the longest word in the text: it is honorificabilitudinitatibus, which is 27 symbols long.
Most of the words are fewer than 20 symbols long, which means this hash is even worse than the previous one.
Max. amount of collisions:
From the previous histograms I can conclude that the average word length is about 10. We can use this result to explain the peak: ASCII values are around 100 and the average length is 10, so it's quite understandable why we see a high collision count around cell 1000.
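A minimal sketch of this hash (function name assumed):

#include <stdint.h>

// sums the ASCII codes of all characters; with an average code around 100
// and an average length around 10, most words hash to values near 1000
uint64_t asciiSumHash (const char* str, int length)
{
    uint64_t sum = 0;
    for (int i = 0; i < length; i++)
    {
        sum += (unsigned char) str[i];
    }
    return sum;
}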
Max. amount of collisions:
My thought is that the division here is unnecessary, because the plain ASCII sum hash already has a decent distribution, while dividing by the length squeezes the values into an even narrower range around the average character code.
Max. amount of collisions:
Simple code in C:
#include <stdint.h>

uint64_t rotateRightHash (const char* str, int length)
{
    uint64_t hash = 0;
    for (int i = 0; i < length; i++)
    {
        // rotate the 64-bit hash right by one bit, then mix in the next character
        hash = ( (hash >> 1) | (hash << 63) ) ^ str[i];
    }
    return hash;
}
Max. amount of collisions:
Let's compare the ROR hash in Godbolt using the x86-64 gcc 13.2 compiler with -O0 and -O2 optimizations.
With -O2 the compiler recognizes the rotate pattern and replaces it with a single ror assembly instruction.
Same as ROR, but the shifts are swapped (rotate left instead of right).
The distribution is better than ROR's, simply because the data isn't symmetrical.
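A sketch of the ROL variant, mirroring the ROR code above (only the rotation direction changes):

#include <stdint.h>

uint64_t rotateLeftHash (const char* str, int length)
{
    uint64_t hash = 0;
    for (int i = 0; i < length; i++)
    {
        // rotate the 64-bit hash left by one bit, then mix in the next character
        hash = ( (hash << 1) | (hash >> 63) ) ^ str[i];
    }
    return hash;
}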
Max. amount of collisions:
The CRC32 Hash is a simple and fast hash function that is widely used for data integrity checks, error detection, and identification of duplicates in databases.
Max. amount of collisions:
Fowler–Noll–Vo is a simple non-cryptographic hash function; I chose it because of its simplicity.
Simple code in C:
#include <stdint.h>

uint64_t FNVHash (const char* str, int length)
{
    const uint64_t FNVprime = 0x100000001B3;   // 64-bit FNV prime
    uint64_t hash = 0;                         // (the canonical FNV-1 starts from an offset basis instead of zero)
    for (int i = 0; i < length; i++)
    {
        hash *= FNVprime;    // multiply by the prime, then mix in the character
        hash ^= str[i];
    }
    return hash;
}
Max. amount of collisions:
Best result so far!
Comparison table:
$ \bold{N} $ | Hash | Table size | Mean value | Dispersion |
---|---|---|---|---|
1 |  | $ 5009 $ | $ 5.8 $ | $ 171000 $ |
2 |  | $ 5009 $ | $ 5.8 $ | $ 20000 $ |
3 |  | $ 101 $ | $ 289.7 $ | $ 19900 $ |
4 |  | $ 5009 $ | $ 5.8 $ | $ 19400 $ |
5 |  | $ 5009 $ | $ 5.8 $ | $ 9800 $ |
6 |  | $ 5009 $ | $ 5.8 $ | $ 290 $ |
7 |  | $ 5009 $ | $ 5.8 $ | $ 27 $ |
8 |  | $ 5009 $ | $ 5.8 $ | $ 13.7 $ |
9 |  | $ 5009 $ | $ 5.8 $ | $ 9.81 $ |
10 |  | $ 5009 $ | $ 5.8 $ | $ 9.76 $ |
Conclusion: from the statistics above I can say that the FNV hash is the most suitable. Also, row 3 shows that small hash tables have larger mean values and therefore larger dispersion. Dispersion shows how well the data is spread across the hash table, which is why it is important to calculate it.
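For reference, the dispersion here is presumably computed as the variance of the bucket sizes:
$ D = \dfrac{1}{m} \sum_{i=1}^{m} (n_i - \bar{n})^2, \qquad \bar{n} = \dfrac{1}{m} \sum_{i=1}^{m} n_i $
where $ m $ is the table size and $ n_i $ is the number of words stored in bucket $ i $.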
Before we start optimizing, I should say that all of the tests were run with the -O2 optimization flag; we're not interested in optimizing -O0 code because that's not what people use in reality. 100 samples were used for each test, with FNV as the hash function. The size of the hash table is 8192; the reason for this choice is explained in the modulo-operation part.
Baseline:
test N | ticks $ \cdot 10^9 $ |
---|---|
1 | $ 9.3 $ |
2 | $ 9.8 $ |
3 | $ 12.2 $ |
4 | $ 9.7 $ |
5 | $ 9.7 $ |
$ average $ | $ 10.2 \pm 1.0 $ |
First of all we need to find out which parts of the code to optimize. That's why I've used Hotspot, a GUI for Linux perf. Let's have a look at the analysis.
Indeed, most of the search time is spent on comparing two words.
strcmp() is already vectorized by the authors of the C standard library. So in order to perform better we shall use the information we know about the text: the longest word, honorificabilitudinitatibus, is 27 letters long, so it fits into a YMM register. Knowing that, let's align each word allocation by 32 bytes, which allows us to use a single AVX2 comparison.
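The allocation code itself isn't shown here; a minimal sketch of one way to do it (copyWordAligned is a hypothetical helper, not the project's function):

#include <stdlib.h>
#include <string.h>

// copy a word into a 32-byte aligned, zero-padded buffer so that a single
// 32-byte AVX2 compare never reads garbage past the terminator
char* copyWordAligned (const char* word, size_t length)
{
    char* buffer = (char*) aligned_alloc(32, 32);   // every word is shorter than 32 bytes
    memset(buffer, 0, 32);
    memcpy(buffer, word, length);
    return buffer;
}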
Here is my optimized strcmp function:
#include <immintrin.h>

int mystrcmp (const char* s1, const char* s2)
{
    // load 32 bytes of each (32-byte aligned, zero-padded) word
    __m256i s1_ = _mm256_loadu_si256((const __m256i*) s1);
    __m256i s2_ = _mm256_loadu_si256((const __m256i*) s2);
    // bytewise compare: the mask is all ones only when the words match,
    // so the result is 0 on a match and non-zero otherwise
    return ~(_mm256_movemask_epi8(_mm256_cmpeq_epi8(s1_, s2_)));
}
Results:
test | ticks $ \cdot 10^9 $ |
---|---|
1 | $ 7.7 $ |
2 | $ 7.6 $ |
3 | $ 7.6 $ |
4 | $ 7.6 $ |
5 | $ 7.7 $ |
$ average $ | $ 7.6 \pm 0.1 $ |
I've noticed that in the function fillHashTable there is one line (72) that takes longer to execute than the others; let me show you:
The code:
...
68: for (int i = 0; i < text->numLines; i++)
{
70: uint64_t listIndex = ht->hashFunction(text->lines[i].string, text->lines[i].length);
72: listIndex %= ht->size;
74: if (findElement(ht, listIndex, text->lines[i].string) == NOT_FOUND)
{
76: PushBack(&ht->lists[listIndex], text->lines[i].string);
}
}
...
The modulo operator % is expensive because it compiles to the idiv instruction on x86, and the remainder ends up in RDX (in 64-bit mode). We can optimize this, but the hash table size has to be $ 2^n $; in that case we can replace the modulo with a bitwise and using a $ 2^n - 1 $ mask:
$ x \bmod 2^n \Leftrightarrow x \,\&\, (2^n - 1) $
Let's rewrite the code using inline assembly:
Inline assembly code
asm volatile
(
    "and %1, %0\n\t"        // hash &= (htSize - 1)
    : "+r" (hash)
    : "rm" (htSize - 1)
);
The volatile qualifier tells the compiler not to optimize away or reorder this asm statement: we have already optimized this part by hand and don't want it changed.
Check GCC Extended ASM for more.
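For comparison, the same masking written in plain C (the inline assembly above just makes the chosen instruction explicit; if the size were a compile-time constant power of two, the compiler could do this substitution itself):

// htSize must be a power of two (here 8192 = 2^13, so the mask is 0x1FFF)
uint64_t listIndex = hash & (htSize - 1);   // same result as hash % htSize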
Results:
test | ticks $ \cdot 10^9 $ |
---|---|
1 | $ 7.4 $ |
2 | $ 7.4 $ |
3 | $ 7.6 $ |
4 | $ 7.3 $ |
5 | $ 7.6 $ |
I've decided to replace the multiplication with addition and shift operations. Since the 64-bit FNV prime 0x100000001B3 equals $ 2^{40} + 2^8 + 2^7 + 2^5 + 2^4 + 2^1 + 2^0 $, the multiplication can be decomposed into shifts and adds.
This:
hash *= FNVprime;
has now become this:
hash += (hash << 1) + (hash << 4) + (hash << 5) +
        (hash << 7) + (hash << 8) + (hash << 40);
But written in assembly:
Assembly code
FNVHash_asm:                        ; rdi = pointer to the string, rsi = length
    movsx rax, BYTE [rdi]           ; seed the hash with the first character
    test rsi, rsi
    je .end                         ; length == 0 -> nothing to do
    add rsi, rdi                    ; rsi now points one past the last character
.loop:
    ; multiply the hash by the FNV prime using lea, shifts and adds instead of mul
    mov rdx, rax
    lea r8, [rax + rax * 8]
    mov rcx, rax
    add rdi, 1
    sal rdx, 7
    sal rcx, 24
    lea rdx, [rdx + r8 * 2]
    add rdx, rax
    sal rax, 8
    add rdx, rax
    movsx rax, BYTE [rdi - 1]       ; load the current character
    add rcx, rdx
    xor rax, rcx                    ; hash = (hash * prime) ^ character
    cmp rsi, rdi
    jne .loop
.end:
    ret                             ; the hash is returned in rax
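Since this version lives in a separate assembly file, the C/C++ side only needs a matching declaration. A minimal sketch, assuming the file is assembled with NASM (file names and build commands are illustrative, not the project's actual ones):

#include <stdint.h>
#include <stddef.h>

// declaration on the C/C++ side; the System V AMD64 ABI passes str in rdi and length in rsi
#ifdef __cplusplus
extern "C"
#endif
uint64_t FNVHash_asm (const char* str, size_t length);

// assumed build steps:
//   nasm -f elf64 FNVHash_asm.asm -o FNVHash_asm.o
//   g++ -O2 main.cpp FNVHash_asm.o -o hashtable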
Results:
test | ticks $ \cdot 10^9 $ |
---|---|
1 | $ 7.8 $ |
2 | $ 7.4 $ |
3 | $ 7.6 $ |
4 | $ 7.4 $ |
5 | $ 7.5 $ |
$ average $ | $ 7.5 \pm 1.2 $ |
My guess is that the combined latency of the new shift and add operations roughly matches that of a single mul instruction, which would explain why there is no speedup. I need to research that.
I hit a dead end optimizing the FNV hash function, so I decided to switch to CRC32.
This is how I've implemented the CRC32 hash function in C:
#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>   // SSE4.2 _mm_crc32_* intrinsics

uint64_t CRC32Hash (const void* seed, size_t seedLength)
{
    uint64_t hash = 0xDEADDEAD;                 // initial seed value
    const char* key = (const char*) seed;
    for (size_t i = 0; i < seedLength; i++)
    {
        hash = _mm_crc32_u8(hash, key[i]);      // process one byte per step
    }
    return hash;
}
I've applied all of the current optimizations to CRC32 and added two more, which I will describe now.
The crc32 instruction is capable of processing 1, 2, 4, or even 8 bytes at a time. Of course we're going to process 8 bytes at a time by using the _mm_crc32_u64 intrinsic.
But first we need to align the memory, because an unaligned address access may span multiple cache lines, reducing performance.
The code:
CRC32_64 code
#include <stdint.h>
#include <stddef.h>
#include <nmmintrin.h>   // SSE4.2 _mm_crc32_* intrinsics

const size_t ALIGNED_BYTES = 32;

uint64_t CRC32Hash_64 (const void* seed, size_t seedLength)
{
    (void) seedLength;   // every word is stored in a zero-padded 32-byte buffer
    uint64_t hash = 0xDEADDEAD;
    const char* key = (const char*) seed;
    // process the fixed 32-byte buffer 8 bytes at a time
    for (size_t i = 0; i < ALIGNED_BYTES; i += 8)
    {
        hash = _mm_crc32_u64(hash, *(const unsigned long long*)(key + i));
    }
    return hash;
}
This one is an unfair optimization, because it is not a "hardware" optimization, but rather a "software" one.
Because we still need a $ 2^n $ table size for the modulo optimization to work, I've just doubled the current size to $ 16384 $.
Let's dive in and see the results:
test | Baseline, $ 10^9 $ ticks | Strcmp, $ 10^9 $ ticks | Modulo, $ 10^9 $ ticks | Vectorization, $ 10^9 $ ticks | Size, $ 10^9 $ ticks |
---|---|---|---|---|---|
1 | $ 9.6 $ | $ 8.2 $ | $ 7.9 $ | $ 6.4 $ | $ 6.0 $ |
2 | $ 9.7 $ | $ 8.2 $ | $ 7.6 $ | $ 6.3 $ | $ 5.7 $ |
3 | $ 10.3 $ | $ 7.9 $ | $ 7.5 $ | $ 6.4 $ | $ 5.6 $ |
4 | $ 10.0 $ | $ 7.9 $ | $ 7.5 $ | $ 6.4 $ | $ 5.7 $ |
5 | $ 10.6 $ | $ 7.9 $ | $ 7.6 $ | $ 6.5 $ | $ 5.8 $ |
$ \bold{average} $ | $ 10.0 \pm 0.7 $ | $ 8.0 \pm 0.1 $ | $ 7.6 \pm 0.1 $ | $ 6.4 \pm 0.1 $ | $ 5.8 \pm 0.1 $ |
speedup | $ \bold{1.00x} $ | $ \bold{1.25x} $ | $ \bold{1.32x} $ | $ \bold{1.56x} $ | $ \bold{1.72x} $ |
The total boost is $ 72\% $ while fighting against the -O2 compiler.
But what are the costs?
- words must be less than 32 letters long;
- the hash table size must be $ 2^n $;
- the code is x86-dependent;
- AVX2 is required.
Overall, the code is non-portable and harder to support and read, but the performance gain makes up for it.
CPU: AMD Ryzen 7 5800HS with Radeon Graphics, 3201 MHz, 8 Core(s), 16 Logical Processor(s)
RAM: 16.0 GB
OS: Ubuntu 22.04 (WSL)
Compiler: g++ 11.4.0