This project explores various sorting algorithms employing a BitMask approach.
The Java implementation can be found in this repository.
For implementations in other languages, please refer to the following repositories:
[C# Version] (https://github.com/aldo-gutierrez/bitmasksorterCSharp)
[C++ Version] (https://github.com/aldo-gutierrez/bitmasksorterCpp)
[Javascript Version] (https://github.com/aldo-gutierrez/bitmasksorterJS)
[Python Version] (https://github.com/aldo-gutierrez/bitmasksorterPython)
A BitMask can be utilized to reduce the computational overhead required for sorting.
The following code demonstrates the calculation of the BitMask:
public static int calculateBitMask(int[] array, int start, int endP1) {
int mask = 0x00000000;
int inv_mask = 0x00000000;
for (int i = start; i < endP1; i++) {
int ei = array[i];
mask = mask | ei;
inv_mask = inv_mask | ~ei;
}
return mask & inv_mask;
}
Some properties of this BitMask are:
- This BitMask has only the bit that change
- If the BitMask is 0 it means that all the numbers are equal
First, let's consider an example: suppose we have a list of integers containing numbers from 0 to 127. In this case, the bitmask would be 1111111.
Number | Bits | |
---|---|---|
Min: | 0 | 0000000000000000 |
Max: | 127 | 0000000001111111 |
Average/Median: | 64 | 0000000001000000 |
Bits Used (b): | 7 | 0000000001111111 |
We can readily identify certain optimizations in algorithms.
- QuickSort:
- Now is easy to know which pivot use just use the average, in this case 64 (1000000)
- Each quick sort call is using a bit of information
- Radix sort:
- A typical LSD Radix sorts for integers (32 bits) uses 4 count sort passes, it could just use one
- Count sort/ Pingeon count sort
- We have already de Min/Max values in the BitMask
Second, let's consider a more complex example: suppose we have a list of integers containing numbers 85, 84, 21 and 5. In this case, the bitmask would be 1010001. As only the bits that change are part of the mask, 3 bits in this case.
Number | Bits |
---|---|
85 | 0000000001010101 |
84 | 0000000001010100 |
21 | 0000000000010101 |
5 | 0000000000000101 |
MASK | 0000000001010001 |
Number | Bits | |
---|---|---|
Mask Min: | 4 | 000000000(0)0(0)010(0) |
Mask Max: | 85 | 000000000(1)0(1)010(1) |
Mask Average/Median: | 64 | 000000000(1)0(0)000(0) |
Mask Bits Used (d): | 3 | 000000000(1)0(1)000(1) |
- QuickSort:
- Easy to know which pivot use just use the average
- Each quick sort call is using a bit of information
- Radix sort:
- We will need a modified Radix Sort that uses only the 3 bits in one pass
- Count sort/ Pingeon count sort
- We have already de Min/Max values in the BitMask
- However, we could improve the efficiency by only considering the ones in the mask, 2^3 values, 000(0) to 111(7), 8 values in total
Third, let's consider some problems with the BitMask
Example 3A: If a list contains only two numbers, values: 1111111110000000(65408) and 0000000111111111(511). For this case the mask is 1111111001111111, instead of just 1 bit, so it doesn't help too much. After some step in the sort algorithm it could be a good idea to recalculate the BitMask maybe is 0 and there is no need to further process
Example 3B: If a list contains positive and negative numbers then the mask will probably be -1 which is 0xFFFFFFFF or 0b1111111111111111111111111111111. Even if the numbers in the list are just 1 and -1, so the mask doesn't help. But if we put the positives to one side, negatives to the other we could recalculate the BitMask and get a more useful BitMask.
Example 3C: If the BitMask is 0xFFFFFFFF (0b1111111111111111111111111111111) because the numbers in the list are using the full range of Integer, then the BitMask doesn't to optimize a sort algorithm.
There are many algorithms that use a hybrid approach for example TimSort.
TimSort is a hybrid sorting algorithm that employs merge sort for large arrays of objects and insertion sort for small arrays, adjusting its strategy based on the array size (N).
I think we can use the BitMask to create a hybrid algorithm that takes into account not only N but also r (the range). We could use the BitMask for calculating the range, the range could be reduced if we only use the bits set to one as described in previous paragraphs.
AGSelectorSorterInt is an example of that hybrid algorithm that choose between different algorithms that uses BitMasks
Some variables defined in a radix sorter
r = range, for example for unsigned integers is 2^32
d = length of word or digit in bits, typical value is 8 bits (a byte)
k = number of words/digits, for an integer this is 4 (32/8)
with this notation r = 2^(k*d)
For BitMask sorters
m = length in bits set to one of the BitMask, for integers numbers it goes from 0 to 32
r = range, that is calculated as 2^m
d = length of word or digit in bits, it can be from 1..16 bits, we are using 11 for modern processors and 8 for a dual-core machine or lower
k = number of word/digits
Is similar to QuickSort but for choosing the pivot it uses the BitMask. The first bit set to one of the BitMask is used as pivot In the next recursion the next bit of the bitmask is used. This is different to other QuickSort algorithms that normally use the last element as pivot or choose the average of three pivots.
This is somehow similar to Binary MSD radix sort, but instead of choosing a digit(word) or byte a bit is used More comparisons and study is need to understand the differences in performance and implementation.
This algorithm is slower than TimSort but faster than a Stack Overflow QuickSort samples in general. However, when the number of bits is low is faster than Java TimSort.
If the bitmask is 11_1111_1111_1111_1111 then the first partition uses the mask 10_0000_0000_0000_0000, then the left and right partitions use the mask for the next bit 01_0000_0000_0000_0000, after some iterations and when there is less than a number of bits in the remaining BitMask we switch to a Java Sort or to a Count Sort Algorithm (Pigeonhole Sort or two types of Radix Sorter with bits)
In Example 3A there are only two different numbers in the list but the BitMask (m) is big (1111111001111111), In the first partition we use the first bit(1000000000000000), in the next recursive partition/iteration it will detect that all numbers are going to just one partition, so we recalculate the mask, and in this case it will produce a bitMask with value 0 so the sort is complete.
Optimizations:
- The partition is only done by the bits set to one in the BitMask as described above
- When the remaining bits in the BitMask are small it switches to a Java Sort or to a Count Sort Algorithm (Pigeonhole Sort or two types of Radix Sorter with bits)
- Multithreading support
See an analysis of the performance in the section Speed and Performance
Usage:
//Sorting int numbers
SorterInt sorter = new QuickBitSorterInt();
int[] list = ....
sorter.sort(list);
//Parallel sorting int numbers
SorterInt sorter = new QuickBitSorterMTInt();
int[] list = ....
sorter.sort(list);
//Sorting unsinged int numbers
SorterInt sorter = new QuickBitSorterInt();
sorter.setUnsigned(true);
int[] list = ....
sorter.sort(list);
RadixBitSorter is a sorter that uses the BitMask to sort with just the bits that need to be sorted. The BitMask is used to reduce the number of count sort iterations.
Some implementations of Radix Sort are very unoptimized stackoverflow, baeldung, hackerearth have a lot of this. They use sometimes base 10 and sometimes bucket sort, compared to those RadixBitSorter is way faster.
Some optimized implementations use base 256 using bit operations. A typical optimized Radix Sorter for integers (32 bits) could use 8bits (words/digits) and needs 4 steps of Count Sort.
After a lot of testing I found that in my machine (AMD Ryzen 7 4800H) 11bits is the best word size and I also found that in some older computers dual core 8bits is the best size. That was found many years ago as I later discovered. So RadixBiSorter change the size of the words according to the number of cores that are somehow correlated with the Cache Sizes.
Some other optimizations like swapping the aux buffer with the array buffer are also implemented but with mixed results.
A typical optimized Radix Sorter for integers (32 bits) could use 8bits (words/digits) and needs 4 iterations of Count Sort, with the BitMask we can reduce the iterations to 1 in some cases.
Ska Sort as I understand is an optimized C++ radix sort but is in place. RadixBitSorter is not an in place sorter as it uses and aux buffer.
RadixBitSorter is tipically faster than SkaSort, but when the bitmask is of size 30-32 bits then it can have similar or a little lower performance than SkaSort.
Radix Bit Sorter Optimizations:
- It does not need to sort by all the bits just the ones in the bitmask
- It sorts by 11 bits at a time as maximum and 4 as minimum (Instead of 8 as SkaSort and RadixByteSorterInt do)
Is a RadixBitSorter but that always uses 8bits with byte boundaries, it can or not use the BitMask to optimize the iterations of count sort
RadixByteSorterInt when working an all bytes (not using the bitmask) should have the same performance as SkaSort and tipical optimized radix sorters
RadixBitSorter is faster than RadixByteSorter and SkaSort in most of the cases and machines i tested.
Radix Byte Sorter Optimizations:
- An option to use the bitmask and detect which bytes do not use has been added to the algorithm (calculateBitMaskOptimization) and enable by default.
Usage:
//Sorting unsigned int numbers
IntSorter sorter = RadixBitSorterInt();
sorter.setUnsigned(true);
int[] list = ....
sorter.sort(list);
//Sorting int numbers
IntSorter sorter = RadixBitSorterInt();
int[] list = ....
sorter.sort(list);
It is a multi thread sorter, it combines previous QuickBitSorter and RadixBitSorter
For example if the bitmask is 00000000111111111111111111111110 23 bits, and we have a 32 thread processor then with default configuration:
- The first 5 bits are recursively processed by doing QuickBitSorter in multiple threads
- The next 2 bits are recursively processed by doing RadixBitSort in each thread
- The last 16 bits are done using a Count Sort Algorithm (Pigeonhole Sort or two types of Radix Sorter with bits)
int[] sort in java uses a not stable algorithm, we use same philosophy. Object[] sort in java is a stable algorithm, we use same philosophy.
Most of the algorithms are faster than the Java default and Parallel sorters in most of the cases.
See the performance analysis at the end of this section:
Comparison for sorting 10 Million int elements with range from 0 to 10 Million in an AMD Ryzen 7 4800H processor, Java 1.8.0_341
Algorithm | AVG CPU time [ms] |
---|---|
JavaSorterInt | 729 |
QuickBitSorterInt | 298 |
RadixBitSorterInt | 108 |
RadixByteSorterInt | 110 |
JavaSorterMTInt | 85 |
QuickBitSorterMTInt | 103 |
MixedBitSorterMTInt | 97 |
RadixBitSorterMTInt | 65 |
Comparison for sorting 10 Million int elements with range from 0 to 100000 in an AMD Ryzen 7 4800H processor, Java 1.8.0_341
Algorithm | AVG CPU time [ms] |
---|---|
JavaSorterInt | 541 |
QuickBitSorterInt | 54 |
RadixBitSorterInt | 85 |
RadixByteSorterInt | 111 |
JavaSorterMTInt | 81 |
QuickBitSorterMTInt | 54 |
MixedBitSorterMTInt | 48 |
RadixBitSorterMTInt | 54 |
Comparison for sorting 40 Million int elements with range from 0 to 1000000000 in an AMD Ryzen 7 4800H processor, Java 1.8.0_341
Algorithm | AVG CPU time [ms] |
---|---|
JavaSorterInt | 3174 |
QuickBitSorterInt | 2593 |
RadixBitSorterInt | 474 |
RadixByteSorterInt | 505 |
JavaParallelSorterInt | 365 |
QuickBitSorterMTInt | 589 |
MixedBitSorterMTInt | 449 |
RadixBitSorterMTInt | 289 |
Object Sort using the Interface Intmapper
public class Entity1 {
int id;
String name;
}
Intmapper<Entity1> mapper = new Intmapper<Entity1>() {
@Override
public int intValue(Entity1 o) {
return o.getId();
}
@Override
public int compare(Entity1 entity1, Entity1 t1) {
return Integer.compare(entity1.getId(), t1.getId());
}
};
EntityInt1[] list = ...;
ObjectSorter sorter = new RadixBitSorterObjectInt();
sorter.sort(list, mapper);
###Example 4:
Comparison for sorting 10 Million objects with int key with range from 0 to 10 Million in an AMD Ryzen 7 4800H processor, Java 1.8.0_341
Algorithm | AVG CPU time [ms] |
---|---|
JavaSorterObjectInt | 5584 |
JavaSorterMTObjectInt | 717 |
RadixBitSorterObjectInt | 813 |
###Example 5:
Comparison for sorting 10 Million elements with int key range from 0 to 100000 in an AMD Ryzen 7 4800H processor, Java 1.8.0_341
Algorithm | AVG CPU time [ms] |
---|---|
JavaSorterObjectInt | 4450 |
JavaSorterMTObjectInt | 826 |
RadixBitSorterObjectInt | 644 |
##History of improvements on performance on this project
- Support for sorting floats, doubles, longs and Objects with int fields is done
TODO Needs to be evaluated in detail
(using wikipedia names) https://en.wikipedia.org/wiki/Sorting_algorithm
- n = number of elements
- r = range (2^(k*d)) or (2^(d1+d2...dn) or (2^m)
- m = number of bits on mask
- k = number of digits/words
- d = length of a digit/word in bits (typically 8 bits)
- t = number of hardware threads
Algorithm | CPU worst | CPU average | CPU best | MEM worst | MEM average | MEM best |
---|
The table above needs to be recalculated
See the repository (https://github.com/aldo-gutierrez/bitmasksorterCpp) where a comparison has been done. In summary RadixBitSorterInt is faster than SkaSort when the mask has 30 bits or lower, m<=30, r<=2^30, with 31, 32 bits it could be slower, of similar performance or faster (faster when using a digit/word of 9 bits or larger when detected a modern processor)
- Rename code that uses K as range replace it with R
- Evaluation on complexity
- More Testing, have 100% code coverage
- Compare with [Wolf Sort] (https://github.com/scandum/wolfsort)
- Find the best values for constants for single and multi-thread and for different machines
- Have comparisons and documentation of speed in powers of two instead of some powers of ten as it will be easier to compare with other sorters
- Optimize for almost sorted list or for list that have few sorted parts as JavaSorter
- Create a object sorter algorithm ease to use and that sort for multiple columns
- Create a code generator because most Float, Double, etc sorters share the same code but for performance reasons and java limitations the Generic implementation is slower
- Test Sorters that use one half, one third, or sqrt() of memory and still stable
- Stable sorter with no additional memory
Algorithm | Positive random numbers | Negative random numbers | Unsigned numbers | Sorted | Reverse sorted |
---|---|---|---|---|---|
QuickBitSorterInt | ok | ok | ok | ok | ok |
RadixBitSorterInt | ok | ok | ok | ok | ok |
RadixByteSorterInt | ok | ok | ok | ok | ok |
QuickBitSorterMTInt | ok | ok | ok | ok | ok |
MixedBitSorterMTInt | ok | ok | ok | ok | ok |
RadixBitSorterMTInt | ok | ok | ok | ok | ok |
RadixBitSorterObject | ok | ok | not tested | ok | ok |