I measured the elapsed time and the mean absolute error (MAE) of several methods for computing the prefix sum of an array of 1M floats.
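For orientation, here is a minimal sketch of what the `simple` baseline and the MAE measurement might look like. The function names and the choice of a double-precision running total as the reference are assumptions made for illustration; the measurements do not state how the reference values were obtained.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed shape of the "simple" baseline: one running total in single precision.
std::vector<float> prefix_sum_simple(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += in[i];
        out[i] = running;
    }
    return out;
}

// Hypothetical reference: the same loop with a double-precision running total.
std::vector<double> prefix_sum_reference(const std::vector<float>& in) {
    std::vector<double> out(in.size());
    double running = 0.0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += static_cast<double>(in[i]);
        out[i] = running;
    }
    return out;
}

// Mean absolute error: average of |got[i] - want[i]| over all elements.
double mean_absolute_error(const std::vector<float>& got,
                           const std::vector<double>& want) {
    assert(got.size() == want.size());
    if (got.empty()) {
        return 0.0;
    }
    double total = 0.0;
    for (std::size_t i = 0; i < got.size(); ++i) {
        total += std::fabs(static_cast<double>(got[i]) - want[i]);
    }
    return total / static_cast<double>(got.size());
}
```

Calling `mean_absolute_error(prefix_sum_simple(data), prefix_sum_reference(data))` on the input array would give one number per method, comparable in spirit to the MAE column.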

| Type | Method | g++ 5.4.0 Time (s) | g++ 5.4.0 SpeedUp | g++ 7.5.0 Time (s) | g++ 7.5.0 SpeedUp | g++ 9.3.0 Time (s) | g++ 9.3.0 SpeedUp | clang 11.0.3* Time (s) | clang 11.0.3* SpeedUp | msvc 19.26 Time (s) | msvc 19.26 SpeedUp | Avg SpeedUp | MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| float | simple(baseline) | 1.459 | 0% | 1.255 | 0% | 0.983 | 0% | 0.901 | 0% | 1.247 | 0% | 0% | 2.037 |
| float | simple_double | 1.546 | -6% | 1.648 | -24% | 1.334 | -26% | 1.281 | -30% | 1.156 | 8% | -11% | - |
| float | sse | 0.521 | 180% | 0.549 | 128% | 0.341 | 188% | 0.444 | 103% | 0.384 | 225% | 172% | 0.683 |
| float | kahan | 5.924 | -75% | 4.789 | -74% | 3.987 | -75% | 3.497 | -74% | 4.100 | -70% | -73% | 0.000 |
| float | unroll4 | 1.401 | 4% | 1.247 | 1% | 0.972 | 1% | 0.893 | 1% | 1.017 | 23% | 9% | 2.037 |
| float | unroll4_reorder1 | 1.068 | 37% | 0.984 | 27% | 0.802 | 23% | 0.745 | 21% | 0.818 | 52% | 36% | 0.768 |
| float | unroll4_shift | 0.896 | 63% | 0.514 | 144% | 0.919 | 7% | 0.545 | 65% | 0.917 | 36% | 57% | 0.683 |
| float | unroll8 | 1.431 | 2% | 1.240 | 1% | 1.042 | -6% | 0.886 | 2% | 1.048 | 19% | 7% | 2.037 |
| float | unroll8_reorder1 | 1.062 | 37% | 0.926 | 36% | 0.763 | 29% | 0.660 | 37% | 0.780 | 60% | 44% | 1.160 |
| float | unroll8_reorder2 | 1.249 | 17% | 0.849 | 48% | 0.645 | 52% | 0.634 | 42% | 0.692 | 80% | 55% | 0.833 |
| float | unroll8_shift | 1.210 | 21% | 0.657 | 91% | 1.248 | -21% | 0.591 | 52% | 1.294 | -4% | 23% | 0.344 |
| float | unroll16 | 1.378 | 6% | 1.242 | 1% | 1.009 | -3% | 0.897 | 0% | 1.036 | 20% | 8% | 2.037 |
| float | unroll16_reorder1 | 0.880 | 66% | 0.891 | 41% | 0.715 | 37% | 0.701 | 29% | 0.728 | 71% | 52% | 1.198 |
| float | unroll16_reorder2 | 0.657 | 122% | 0.793 | 58% | 0.613 | 60% | 0.533 | 69% | 0.847 | 47% | 65% | 2.277 |
| double | simple(baseline) | 1.486 | 0% | 1.291 | 0% | 0.997 | 0% | 0.885 | 0% | 1.526 | 0% | 0% | 2.037 |
| double | kahan | 5.563 | -73% | 4.813 | -73% | 4.079 | -76% | 3.466 | -74% | 4.267 | -64% | -70% | - |
| double | unroll4 | 1.478 | 1% | 1.248 | 3% | 1.032 | -3% | 0.878 | 1% | 1.018 | 50% | 19% | 2.037 |
| double | unroll4_reorder1 | 1.079 | 38% | 1.010 | 28% | 0.789 | 26% | 0.741 | 19% | 0.794 | 92% | 51% | 0.768 |
| double | unroll4_shift | 0.927 | 60% | 0.544 | 138% | 0.919 | 8% | 0.468 | 89% | 0.967 | 58% | 70% | 0.683 |
| double | unroll8 | 1.549 | -4% | 1.226 | 5% | 0.958 | 4% | 0.883 | 0% | 1.035 | 47% | 19% | 2.037 |
| double | unroll8_reorder1 | 0.929 | 60% | 0.944 | 37% | 0.754 | 32% | 0.671 | 32% | 0.765 | 100% | 61% | 1.160 |
| double | unroll8_reorder2 | 0.808 | 84% | 0.831 | 55% | 0.619 | 61% | 0.616 | 44% | 0.673 | 127% | 83% | 0.833 |
| double | unroll8_shift | 1.240 | 20% | 0.648 | 99% | 1.208 | -18% | 0.606 | 46% | 1.269 | 20% | 32% | 0.344 |
| double | unroll16 | 1.431 | 4% | 1.247 | 4% | 1.015 | -2% | 0.904 | -2% | 1.019 | 50% | 19% | 2.037 |
| double | unroll16_reorder1 | 0.861 | 73% | 0.907 | 42% | 0.690 | 44% | 0.622 | 42% | 0.727 | 110% | 72% | 1.198 |
| double | unroll16_reorder2 | 0.621 | 139% | 0.812 | 59% | 0.619 | 61% | 0.509 | 74% | 0.802 | 90% | 85% | 2.277 |

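
The SpeedUp percentages appear consistent with `baseline_time / method_time - 1`, up to rounding of the displayed times. For example, in the g++ 5.4.0 column of the measurements above, the float baseline takes 1.459 s and `sse` takes 0.521 s, which gives about 180%; `kahan` at 5.924 s gives about -75%. The sketch below reproduces that arithmetic; the function name `speedup_percent` is hypothetical, and the formula is inferred from the numbers rather than stated anywhere in the results.

```cpp
#include <cassert>
#include <cmath>

// Relative speedup in percent, rounded to the nearest integer.
// Positive: faster than the baseline. Negative: slower than the baseline.
// The formula is inferred from the reported numbers (an assumption).
int speedup_percent(double baseline_sec, double method_sec) {
    assert(method_sec > 0.0);
    const double ratio = baseline_sec / method_sec;
    return static_cast<int>(std::lround((ratio - 1.0) * 100.0));
}
```

With this reading, a value of 100% means the method runs in half the baseline time, and -75% means it takes about four times as long.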

| Type | Method | g++ 5.4.0 Time (s) | g++ 5.4.0 SpeedUp | g++ 7.5.0 Time (s) | g++ 7.5.0 SpeedUp | g++ 9.3.0 Time (s) | g++ 9.3.0 SpeedUp | clang 11.0.3* Time (s) | clang 11.0.3* SpeedUp | msvc 19.26 Time (s) | msvc 19.26 SpeedUp | Avg SpeedUp | MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| float | simple(baseline) | 1.245 | 0% | 1.071 | 0% | 0.995 | 0% | 0.862 | 0% | 1.243 | 0% | 0% | 2.037 |
| float | simple_double | 1.614 | -23% | 1.515 | -29% | 1.263 | -21% | 1.173 | -27% | 1.269 | -2% | -17% | - |
| float | avx | 0.514 | 142% | 0.502 | 113% | 0.492 | 102% | 0.481 | 79% | 0.577 | 116% | 108% | 0.344 |
| float | sse | 0.394 | 216% | 0.358 | 199% | 0.345 | 189% | 0.405 | 113% | 0.504 | 147% | 159% | 0.683 |
| float | kahan | 4.825 | -74% | 4.246 | -75% | 3.860 | -74% | 3.406 | -75% | 4.069 | -69% | -73% | 0.000 |
| float | unroll4 | 1.179 | 6% | 1.213 | -12% | 0.992 | 0% | 0.861 | 0% | 1.249 | 0% | -1% | 2.037 |
| float | unroll4_reorder1 | 0.988 | 26% | 0.871 | 23% | 0.813 | 22% | 0.727 | 19% | 0.972 | 28% | 24% | 0.768 |
| float | unroll4_shift | 0.679 | 84% | 0.910 | 18% | 0.937 | 6% | 0.570 | 51% | 0.534 | 133% | 76% | 0.683 |
| float | unroll8 | 1.229 | 1% | 1.049 | 2% | 0.949 | 5% | 0.862 | 0% | 1.027 | 21% | 9% | 2.037 |
| float | unroll8_reorder1 | 1.026 | 21% | 0.753 | 42% | 0.734 | 36% | 0.630 | 37% | 0.792 | 57% | 43% | 1.160 |
| float | unroll8_reorder2 | 0.758 | 64% | 0.673 | 59% | 0.627 | 59% | 0.570 | 51% | 0.715 | 74% | 63% | 0.833 |
| float | unroll8_shift | 0.756 | 65% | 1.356 | -21% | 1.246 | -20% | 0.776 | 11% | 0.631 | 97% | 42% | 0.344 |
| float | unroll16 | 1.171 | 6% | 1.009 | 6% | 1.003 | -1% | 0.876 | -2% | 1.074 | 16% | 7% | 2.037 |
| float | unroll16_reorder1 | 0.829 | 50% | 0.769 | 39% | 0.680 | 46% | 0.571 | 51% | 0.773 | 61% | 53% | 1.198 |
| float | unroll16_reorder2 | 0.726 | 71% | 0.626 | 71% | 0.569 | 75% | 0.494 | 75% | 0.943 | 32% | 58% | 2.277 |
| double | simple(baseline) | 1.214 | 0% | 1.066 | 0% | 0.952 | 0% | 0.875 | 0% | 1.537 | 0% | 0% | 2.037 |
| double | kahan | 4.790 | -75% | 4.175 | -74% | 3.821 | -75% | 3.431 | -75% | 3.986 | -61% | -70% | - |
| double | unroll4 | 1.152 | 5% | 1.032 | 3% | 0.933 | 2% | 0.866 | 1% | 1.243 | 24% | 10% | 2.037 |
| double | unroll4_reorder1 | 0.975 | 25% | 0.890 | 20% | 0.810 | 18% | 0.732 | 19% | 0.959 | 60% | 35% | 0.768 |
| double | unroll4_shift | 0.683 | 78% | 0.950 | 12% | 0.919 | 4% | 0.567 | 54% | 0.523 | 194% | 98% | 0.683 |
| double | unroll8 | 1.194 | 2% | 1.063 | 0% | 0.965 | -1% | 0.870 | 1% | 1.034 | 49% | 18% | 2.037 |
| double | unroll8_reorder1 | 0.876 | 39% | 0.761 | 40% | 0.715 | 33% | 0.639 | 37% | 0.838 | 83% | 54% | 1.160 |
| double | unroll8_reorder2 | 0.791 | 54% | 0.665 | 60% | 0.642 | 48% | 0.576 | 52% | 0.721 | 113% | 76% | 0.833 |
| double | unroll8_shift | 0.750 | 62% | 1.237 | -14% | 1.259 | -24% | 0.763 | 15% | 0.624 | 146% | 61% | 0.344 |
| double | unroll16 | 1.158 | 5% | 1.031 | 3% | 0.960 | -1% | 0.874 | 0% | 1.034 | 49% | 19% | 2.037 |
| double | unroll16_reorder1 | 0.825 | 47% | 0.750 | 42% | 0.705 | 35% | 0.609 | 44% | 0.747 | 106% | 66% | 1.198 |
| double | unroll16_reorder2 | 0.752 | 61% | 0.636 | 68% | 0.587 | 62% | 0.492 | 78% | 0.958 | 61% | 66% | 2.277 |
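
The `kahan` rows stand out: an MAE of 0.000 for float, at roughly four times the baseline running time. A plausible reading is that `kahan` refers to Kahan (compensated) summation applied to the running total. The sketch below shows that technique under this assumption; the function name and signature are hypothetical and not taken from the benchmarked code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Prefix sum with Kahan (compensated) summation.
// `comp` tracks the low-order bits lost by each addition and feeds them
// back into the next one, so rounding error does not accumulate.
void prefix_sum_kahan(const std::vector<float>& in, std::vector<float>& out) {
    assert(out.size() == in.size());
    float sum = 0.0f;
    float comp = 0.0f;  // running compensation term
    for (std::size_t i = 0; i < in.size(); ++i) {
        const float y = in[i] - comp;  // apply the correction from the previous step
        const float t = sum + y;       // low-order bits of y may be lost here
        comp = (t - sum) - y;          // recover what was lost (algebraically zero)
        sum = t;
        out[i] = sum;
    }
}
```

Each element costs four dependent floating-point operations instead of one, which fits the slowdown seen in the measurements. Note that options allowing the compiler to reassociate floating-point arithmetic, such as `-ffast-math` in GCC and Clang, can simplify the compensation term away and defeat the technique.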