Goals of this project:
- Inspect different hash functions and choose the best one by minimizing the number of collisions.
- Implement 3 types of assembly optimizations to speed up the search function.
I assume the reader is familiar with the concepts of hash tables and processor architecture.
In this project I used the separate chaining method to resolve collisions.
Elements with the same hash are stored in a linked list. Here is the picture:
With a proper table size and hash function, we can search for elements in O(1).
You can read more about it here:
https://www.geeksforgeeks.org/separate-chaining-collision-handling-technique-in-hashing/
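To make the layout concrete, here is a minimal chaining sketch (illustrative names, not the project's actual structures):
typedef struct Node
{
    const char*  content;   // stored word
    struct Node* next;      // next word with the same hash
} Node;
// The table itself is an array of `size` list heads;
// a word with hash h lives in the list at index h % size.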
In this part we will inspect 7 hash functions. The main evaluation parameter will be the number of collisions.
The more collisions we have, the slower searching and inserting new members become. That's why this part is really important.
Each time I will load ~14,000 unique words into the hash table. The size of the table is deliberately set to 1000 to get a clearer picture of each hash function's capabilities.
Using the Python library matplotlib, I will draw graphs to see how many collisions we get.
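(A note on the metric: I take "dispersion" below to mean the variance of bucket loads, $D = \frac{1}{N}\sum_{i=1}^{N} (b_i - \bar{b})^2$, where $N$ is the table size, $b_i$ is the number of elements in bucket $i$, and $\bar{b}$ is the mean load. The lower the dispersion, the more evenly the hash spreads the words.)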
This is the worst hash function you could imagine: all elements will be stored in a single linked list.
Code:
uint32_t OneHash (const char* string, size_t len)
{
    return 1;   // every word lands in the same bucket
}
Dispersion: 589413
The search would be O(n) instead of O(1). I implemented it for educational purposes only.
We will return the ASCII code of the first character of the string.
uint32_t FirstLetterHash (const char* string, size_t len)
{
    return (uint32_t) string[0];
}
Dispersion: 8954
This hash function might be usable somewhere. However, we have ~14,000 words, which means a lot of collisions and many empty buckets.
We will return the string length.
uint32_t LengthHash (const char* string, size_t len)
{
    return (uint32_t) len;
}
Dispersion: 28424
We get even more collisions. An average English word is about 5 letters long, so the most loaded buckets peak at around 2500 collisions.
We will return the sum of the ASCII codes of the characters in the string.
uint32_t SumHash (const char* string, size_t len)
{
    uint32_t hashsum = 0;
    while (*string)
    {
        hashsum += *string;
        string++;
    }
    return hashsum;
}
Dispersion: 135
You can notice the huge improvement: the maximum number of collisions decreased roughly 400-fold. Still, anagrams such as "stop" and "pots" inevitably collide, since a sum does not depend on character order.
Now let's rotate the accumulated hash by one bit before mixing in each character:
uint32_t RorFunc (uint32_t num, int shift)
{
    return (num >> shift) | (num << (32 - shift));
}
uint32_t RorHash (const char* string, size_t len)
{
    uint32_t hash = 0;
    while (*string)
    {
        hash = RorFunc (hash, 1) ^ (uint32_t) *string;   // rotate right, then mix in the next byte
        string++;
    }
    return hash;
}
Dispersion: 120
We didn't get fewer collisions, but the dispersion became slightly lower.
The search should be statistically quicker if we pick a random element.
The same algorithm as the previous one, but the bits rotate to the left.
uint32_t RolFunc (uint32_t num, int shift)
{
    return (num << shift) | (num >> (32 - shift));
}
uint32_t RolHash (const char* string, size_t len)
{
    uint32_t hash = 0;
    while (*string)
    {
        hash = RolFunc (hash, 1) ^ (uint32_t) *string;   // rotate left, then mix in the next byte
        string++;
    }
    return hash;
}
Dispersion: 93
Pretty much the same result, with slightly lower dispersion.
If I had to choose between the rotate hashes and the previous hash functions, I would definitely pick the rotate hashes.
However, there is a much more efficient (and more complicated) function...
The implementation is pretty involved; we will go through it later, in the part about optimizations.
With this algorithm, the most loaded bucket holds at most 27 collisions, which is about 3 times better than the rotate hashes.
Dispersion: 15
I will use this hash function in the next part of my project because it has great potential for optimization and a small dispersion.
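For reference, the scheme the assembly implementation follows later is roughly this (a simplified MurmurHash2-style sketch; the reference algorithm additionally mixes each 4-byte block before xoring it in):
uint32_t MurMurSketch (const char* data, size_t len)
{
    const uint32_t salt = 0x5bd1e995;
    uint32_t hash = salt ^ (uint32_t) len;
    while (len >= 4)                       // consume the string in 4-byte blocks
    {
        uint32_t block = (uint32_t) (unsigned char) data[0]
                       | (uint32_t) (unsigned char) data[1] << 8
                       | (uint32_t) (unsigned char) data[2] << 16
                       | (uint32_t) (unsigned char) data[3] << 24;
        hash *= salt;
        hash ^= block;
        data += 4;
        len  -= 4;
    }
    switch (len)                           // leftover tail bytes
    {
        case 3: hash ^= (uint32_t) (unsigned char) data[2] << 16;  // fall through
        case 2: hash ^= (uint32_t) (unsigned char) data[1] << 8;   // fall through
        case 1: hash ^= (uint32_t) (unsigned char) data[0];
                hash *= salt;
    }
    hash ^= hash >> 13;                    // final avalanche
    hash *= salt;
    hash ^= hash >> 15;
    return hash;
}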
Let's look at all the functions on the same graph:
Green - Murmur hash. Black - Length hash. Red - Hashsum. Brown - First ASCII hash. Blue and Cyan - Rotate hashes. The return-1 hash is excluded.
Then let's have a closer look at our favorites:
Green - Murmur hash. Blue - Rotate right hash. Cyan - Rotate left hash. Red - Hashsum.
At this point the supremacy of Murmur hash is obvious.
In this part of the work we will speed up our search function by analysing the bottlenecks of the program.
I will use valgrind to collect profiling data and kcachegrind to visualize it. I will also implement a stress test that searches for each word in the hash table 100 times.
System info: Intel Core i5, 5th gen. Honor MagicBook 16 R5/16/512
Compiler info: g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Optimization flag: -O2.
I've chosen the -O2 optimization flag because I will use AVX2 instructions. You can check my previous project for a full explanation.
Before starting, I removed everything unneeded:
- I removed all hashes apart from Murmur hash
- I removed all kinds of dumps
I also used the <chrono> library to measure the elapsed time, as sketched below.
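A minimal sketch of the timing harness (StressTest is a hypothetical driver name, not the project's actual function):
#include <chrono>
void StressTest (HashTable* table);   // searches every loaded word 100 times
long MeasureSearchTime (HashTable* table)
{
    auto start = std::chrono::steady_clock::now ();
    StressTest (table);
    auto end = std::chrono::steady_clock::now ();
    return std::chrono::duration_cast<std::chrono::milliseconds> (end - start).count ();
}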
Average search time: 3142 $\pm$ 20 ms
Let me visualize the profiler data for you:
According to the profiler, Murmur hash affects performance quite heavily. We can rewrite it in assembly. I will implement it in a separate file and then link it with the main program.
Murmur hash assembly code
MurMurAsm:
        push rbp                    ; saving base pointer
        push rbx                    ; rbx is callee-saved
        ; rdi - string
        ; rsi - len
        mov eax, 5bd1e995h
        xor rax, rsi                ; hash = seed ^ len
        cmp rsi, 4                  ; strings shorter than 4 bytes go straight to the tail
        jl .tail
.loop:
        movzx rcx, byte [rdi]       ; coef1  = data[0]
        movzx rbx, byte [rdi + 1]   ; coef1 |= data[1] << 8
        movzx rdx, byte [rdi + 2]   ; coef1 |= data[2] << 16
        movzx r11, byte [rdi + 3]   ; coef1 |= data[3] << 24
        shl rbx, 8
        shl rdx, 16
        shl r11, 24
        or rcx, rbx
        or rcx, rdx
        or rcx, r11
        add rdi, 4
        mov edx, 5bd1e995h
        mul edx                     ; hash *= salt
        xor rax, rcx                ; hash ^= coef1
        sub rsi, 4                  ; len -= 4
        cmp rsi, 4
        jge .loop
.tail:                              ; switch (len) with fall-through, as in the reference algorithm
        cmp rsi, 3                  ; case 3
        jb .case2
        movzx rbx, byte [rdi + 2]   ; hash ^= data[2] << 16
        shl rbx, 16
        xor rax, rbx
.case2:
        cmp rsi, 2                  ; case 2
        jb .case1
        movzx rbx, byte [rdi + 1]   ; hash ^= data[1] << 8
        shl rbx, 8
        xor rax, rbx
.case1:
        cmp rsi, 1                  ; case 1
        jb .final
        movzx rbx, byte [rdi]       ; hash ^= data[0]
        xor rax, rbx
        mov edx, 5bd1e995h
        mul edx                     ; hash *= salt
.final:
        mov edx, eax                ; hash ^= hash >> 13
        shr edx, 13
        xor eax, edx
        mov edx, 5bd1e995h
        mul edx                     ; hash *= salt
        mov edx, eax                ; hash ^= hash >> 15
        shr edx, 15
        xor eax, edx
        pop rbx
        pop rbp
        ret
I reordered the instructions for better pipeline utilization; this gave an additional 5% performance boost.
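Assuming NASM syntax, the file is built with `nasm -f elf64 murmur.asm -o murmur.o` and linked via `g++ -O2 main.cpp murmur.o -o hashtable` (file names are illustrative). On the C side the routine is declared as:
extern "C" uint32_t MurMurAsm (const char* string, size_t len);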
Average search time: 2850 $\pm$ 20 ms
Version | Abs. speedup | Rel. speedup |
---|---|---|
-O2 | 1 | 1 |
Assembly Hash | 1.11 | 1.11 |
Let's look at the profiler data once again:
According to the profiler, the next target is strcmp. I will rewrite it with inline assembly.
Strcmp inline assembly implementation
asm(".intel_syntax noprefix;"
"loop:"
"push r9;"
"push r10;"
"mov r9b, [rsi];"
"mov r10b, [rdi];"
"cmp r9b, 0;"
"je end;"
"cmp r10b, 0;"
"je end;"
"cmp r9b, r10b;"
"jne end;"
"inc rdi;"
"inc rsi;"
"jmp loop;"
"end:"
"movzx rax, r9b;"
"movzx rbx, r10b;"
"sub rax, rbx;"
"pop r10;"
"pop r9;"
".att_syntax"
: "=a" (result)
: "S" (dst), "D" (src) : "rax", "rbx", "rsi", "rdi", "r9", "r10"
);
Average search time: 5416 $\pm$ 30 ms
Version | Abs. speedup | Rel. speedup |
---|---|---|
-O2 | 1 | 1 |
Assembly Hash | 1.11 | 1.11 |
Assembly strcmp | 0.58 | 0.52 |
The search became much slower because the library strcmp uses SIMD (AVX2) instructions internally. Why don't we use them too?
We work with aligned data, which allows us to write more efficient code.
Strcmp inline assembly SIMD implementation
asm( ".intel_syntax noprefix;"
"mov rsi, %1;"
"mov rdi, %2;"
"movdqu xmm1, [rsi];"
"movdqu xmm2, [rdi];"
"pcmpeqb xmm1, xmm2;"
"pmovmskb eax, xmm1;"
"test eax, eax;"
"jz equal;"
"jnz not_equal;"
"equal:\n"
"xor rax, rax;"
"not_equal:\n"
"mov rax, 1;"
".att_syntax\n"
: "=r" (result)
: "r" (dst), "r" (src)
: "rbx", "rsi", "rdi", "r10", "r11", "xmm1", "xmm2"
);
Average search time: 2967 ms
Version | Abs. speedup | Rel. speedup |
---|---|---|
-O2 | 1 | 1 |
Assembly Hash | 1.11 | 1.11 |
Assembly strcmp | 0.58 | 0.52 |
Better Assembly strcmp | 1.14 | 1.02 |
Now our strcmp function can compete with the library one. However, inline assembly has serious disadvantages: the code becomes less readable and less portable.
Since the relative gain is small, the wise choice is to drop this optimization; strcmp will be replaced later anyway.
We can see that the search function itself affects performance a lot.
From the length hash we already know that there are no words longer than 20 characters. This means each of them fits into a 32-byte __m256i value (zero-padded).
Unoptimized Search code
bool SearchMember (HashTable* self, const char content[], size_t len)
{
    uint32_t key = murmurasm (content, len) % (uint32_t) self->size;
    HashTableNode* cur_node = &(self->array[key]);
    int peers = cur_node->peers;
    for (int i = 0; i < peers; i++)
    {
        if (strcmp ((char*) cur_node->content, content) == 0)
            return SUCCESS_FOUND;
        cur_node = cur_node->next;
    }
    return NOT_FOUND;
}
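For reference, the node layout these search functions imply looks roughly like this (reconstructed from the field accesses; the real project structures may differ):
typedef struct HashTableNode
{
    const char*           content;   // stored word, 32-byte aligned and zero-padded
    struct HashTableNode* next;      // next node in the chain
    int                   peers;     // chain length, kept in the bucket's head node
} HashTableNode;
typedef struct
{
    HashTableNode* array;            // one head node per bucket
    size_t         size;             // number of buckets
} HashTable;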
With the help of AVX2 instructions we'll be able to compare strings much faster. Let's rewrite the search function with them. The strncpy into an aligned buffer is needed because _mm256_load_si256 requires 32-byte-aligned, zero-padded input.
AVX2 optimized Search code
bool SearchMemberAVX (HashTable* self, const char content[], size_t len)
{
    uint32_t key = murmurasm (content, len) % (uint32_t) self->size;
    HashTableNode* cur_node = &(self->array[key]);
    alignas(32) char word_buffer[MAX_WORD_LEN] = "";
    strncpy (word_buffer, content, len);
    __m256i content_avx = _mm256_load_si256 ((__m256i*) word_buffer);
    int peers = cur_node->peers;
    for (int i = 0; i < peers; i++)
    {
        __m256i cur_val_avx = _mm256_load_si256 ((__m256i*) cur_node->content);
        __m256i cmp_mask = _mm256_cmpeq_epi8 (content_avx, cur_val_avx);
        uint32_t int_cmp_mask = (uint32_t) _mm256_movemask_epi8 (cmp_mask);
        if (int_cmp_mask == 0xFFFFFFFF)
            return SUCCESS_FOUND;
        cur_node = cur_node->next;
    }
    return NOT_FOUND;
}
To get the most out of it, I will convert the array of char* input words into __m256i values in advance. With that, we can get rid of the strncpy call, which slows the program down.
AVX2 optimized Search code with modified input data
bool SearchMemberAVX (HashTable* self, __m256i* content, size_t len)
{
    uint32_t key = murmurasm ((const char*) content, len) % (uint32_t) self->size;
    HashTableNode* cur_node = &(self->array[key]);
    int peers = cur_node->peers;
    for (int i = 0; i < peers; i++)
    {
        __m256i diff = _mm256_xor_si256 (*content, *(cur_node->content));
        if (_mm256_testz_si256 (diff, diff))   // zero difference in all 256 bits: the words match
            return SUCCESS_FOUND;
        cur_node = cur_node->next;
    }
    return NOT_FOUND;
}
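One possible way to pre-pack the words (a sketch; PackWords is a hypothetical helper, not the project's actual code):
#include <immintrin.h>
#include <cstdlib>
#include <cstring>
__m256i* PackWords (char** words, size_t count)
{
    __m256i* packed = (__m256i*) aligned_alloc (32, count * sizeof (__m256i));
    for (size_t i = 0; i < count; i++)
    {
        alignas(32) char buffer[32] = "";                  // zero padding matters for the compare
        strncpy (buffer, words[i], sizeof (buffer) - 1);
        packed[i] = _mm256_load_si256 ((const __m256i*) buffer);
    }
    return packed;
}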
Average search time: 2056 $\pm$ 20 ms
Version | Abs. speedup | Rel. speedup |
---|---|---|
-O2 | 1 | 1 |
Assembly Hash | 1.11 | 1.11 |
AVX Search | 1.56 | 1.35 |
0x00...1ef0 - Murmur Hash assembly function
The full list of functions in kcachegrind.
According to the profiler, there are no more bottlenecks where we can get a noticeable rise in performance. But we can dive deeper into the previous optimizations.
In the second optimization we replaced Murmur hash with its assembly version. To make hash calculation even faster, we can use CRC32 instead.
CRC32 has one huge advantage: x86 provides a hardware crc32 instruction (SSE4.2), so we can compute the hash with just a few instructions in assembly.
CRC32 assembly implementation
section .text
global crc32
crc32:
        xor rax, rax                ; start with crc = 0
        crc32 rax, qword [rdi]      ; fold in the 32-byte zero-padded word,
        crc32 rax, qword [rdi+8]    ; 8 bytes at a time
        crc32 rax, qword [rdi+16]
        crc32 rax, qword [rdi+24]
        ret
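On the C side, the routine can be declared like this (a sketch; since it always reads 32 bytes, it relies on the same zero-padded __m256i word buffers as the AVX search):
extern "C" uint32_t crc32 (const __m256i* content);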
Average search time: 897 ms
Version | Abs. speedup | Rel. speedup |
---|---|---|
-O2 | 1 | 1 |
Assembly Hash | 1.11 | 1.11 |
AVX Search | 1.56 | 1.35 |
CRC32 Hash | 3.47 | 2.35 |
A huge increase in performance, and the last optimization is coming...
This optimization was saved for last on purpose, for educational reasons: everything would have been too easy with a small load factor from the beginning.
All this time we had 1000 cells in our hash table; thanks to that, we got all these optimization ideas. It is time to increase the number of cells from 1000 to 15000.
The load factor drops from about 14 words per bucket to roughly 0.93, so the number of collisions falls tremendously and the performance rises.
Average search time: 698 ms
Version | Abs. speedup | Rel. speedup |
---|---|---|
-O2 | 1 | 1 |
Assembly Hash | 1.11 | 1.11 |
AVX Search | 1.56 | 1.35 |
CRC32 Hash | 3.47 | 2.35 |
Size increase | 4.5 | 1.28 |
Since we have squeezed the most out of our program, I will stop here.
It turns out that not every optimization speeds up the program; sometimes an optimization only slows it down.
In the end, the search function's performance increased 4.5 times. Let's calculate Ded's coefficient: