Parsing series of integers with SIMD
Sample programs for article Parsing series of integers with SIMD
Parsers extract integer numbers from strings. A number can be prepended by a sign character. The numbers are separated by arbitrary sequences of separator chars. All other characters are invalid and the parsers detects them and raise exception.
This repository contains:
- scalar reference implementation;
- two variants of SSE parsers; there are also separate variants designed solely of parsing unsigned numbers;
- scalar hybrid that combines ideas from SIMD parsing with scalar conversion procedures.
Requires: C++11 compiler (tested with GCC 7.3) and Python 2.7.
make to build all programs.
make run-tests to build all unit tests and then run them.
Some tests are time consuming, be patient.
make microbenchmarks.rst to run microbenchmarks.
make report-overall.rst to run performance benchmarks.
make spanmaskhistogram.rst to produce runtime analysis report
for SSE implementation.
There are several programs available in
benchmark--- test performance of given procedure
benchmark-cpuclocks--- measure performance of all procedures; display CPU clocks
benchmark-all--- compare performance of different procedures
comapre-unsigned--- are used to validate if parsers produces the same results as the reference parser
compare-avx512--- the same as above, but tests only AVX512BW implementation
statistics--- gather execution statistics from SSE parsers
Apart from these programs, there are several
that runs various unit tests; they are invoked by
All the programs generates random data which is then parsed, following command line arguments can be used to control data characteristics:
--size=NUMBER input size (in bytes) --loops=NUMBER how many times a test must be repeated [default: 1] --seed=NUMBER seed for random number generator [default: 0] --num=DISTRIBUTION distribution of lengths of numbers --sep=DISTRIBUTION distribution of lengths of gaps between numbers [default: '1'] --separators=string list of separator characters [default: ",; "] --sign=DISTRIBUTION distribution of sign in front of number [default: '1']
DISTRIBUTION is a list of weights separated with commas, which defines
distribution of items.
In case of
--num it's the count of decimal digits in a random number.
--num=1,1,1,1 will produce one-, two-, three- or four-digit
numbers with the same probability;
--num=0,0,0,1,5,1 will produce four-,
five- or six-digit numbers, but five-digits numbers with probability 5/7.
In case of
--sep it is the distribution of numbers of separator characters
between the generated numbers. The default
--sep=1 means there's always
exactly one character;
--sep=0,0,1,1,1 would put form 2 to 4 separator chars.
--sign defines distribution of set: no-character-sign, '+' and '-'.
Thus the default
--sign=1 forces just unsigned numbers;
will force all numbers negative.
- Complete AVX512 implementation to handle scalar fallback.