Arbitrary length Dice coefficients (#63)
* Refactor main C++ function to avoid using "constant" memory and avoid new/delete.

* Implement popcount on (almost) arbitrary length arrays.

* First pass at integrating arbitrary length keys. Slows things down a bit.

* Refactor Dice coefficient calculation.

* Temporary fiddling with benchmark code.

* Calculate and report popcount speed from native code implementation.

* Give some values more sensible variable names.

* Remove unused import.

* Add documentation.

* Expand reporting of various measurements.

* Comments.

* Update README.

* Bring test suite up-to-date.

* Refactor main C++ function to avoid using "constant" memory and avoid new/delete.

* Screw everything up by unrolling with C++ templates, apparently.

* Magical argument that makes the compiler generate the correct (performant) code.

* Address Brian's comments.

* Update tests; also test native code version.

* Print popcount throughput; give some variables better names.

* Make some functions static inline.

* Tidy up some expressions.

* Put some braces in the right place; make fn inline.

* Reinstate comment on origin of popcount assembler.

* Make constant a template parameter.

* Comment.

* Complete version working with multiples of 1024 bits.

* Add -march=native compiler option.

* Implementation of arbitrary length CLKs.

* Fix dumb mistakes in updating array pointer and popcounts.

* Tests for arbitrary length popcounts.

* Update some comments.

* Arbitrary length Dice coefficient.

* Rename function.

* Move native dicecoeff calculation into its own function.

* Add tests for native Dice coefficient calculation.

* Move dicecoeff tests to bloommatcher tests; move common bitarray utilities to their own file.

* Simplify slow path / reduce branches in fast path.

* Adapt entitymatcher to arbitrary length CLK interface.

* Remove unused function.

* Update README.

* Address Brian's comments.

* Exit early if filter is zero.

* Specialise popcount arrays calls on array length.

* Fix performance regression.

* Remove storage class specifiers from explicit template specialisations.

* Update README and requirements.txt files.

* Disable unused function.

* Put stars in their proper place.

* Add documentation.
Hamish Ivey-Law committed Mar 14, 2018
1 parent 8853251 commit e10ea25
Showing 9 changed files with 464 additions and 178 deletions.
2 changes: 0 additions & 2 deletions README.rst
@@ -144,8 +144,6 @@ Limitations
 - The linkage process has order n^2 time complexity - although algorithms exist to
 significantly speed this up. Several possible speedups are described
 in http://dbs.uni-leipzig.de/file/P4Join-BTW2015.pdf
-- The C++ code makes an assumption of 1024 bit keys (although this would be easy
-to change).


License
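
To make the quadratic cost in the first limitation concrete, a brute-force linkage can be sketched as below. This is an illustration only, not the project's code: `dice` and `link_brute_force` are hypothetical names, filters are modelled as Python integers rather than byte arrays, and treating the threshold as inclusive is an assumption.

```python
import heapq


def dice(a: int, b: int) -> float:
    """Dice coefficient of two bit filters stored as Python integers."""
    denom = bin(a).count("1") + bin(b).count("1")
    # Assumption: score is 0.0 when neither filter has a set bit.
    return 2.0 * bin(a & b).count("1") / denom if denom else 0.0


def link_brute_force(left, right, k, threshold):
    """Compare every filter in `left` with every filter in `right`.

    Performs len(left) * len(right) Dice computations, which is the
    order n^2 behaviour the README describes. For each left filter,
    keeps at most `k` candidates scoring at least `threshold`.
    """
    matches = {}
    for i, a in enumerate(left):
        scored = ((dice(a, b), j) for j, b in enumerate(right))
        kept = [(score, j) for score, j in scored if score >= threshold]
        matches[i] = heapq.nlargest(k, kept)
    return matches
```

Each entry of the result maps a left index to a list of `(score, right_index)` pairs, best score first.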
10 changes: 5 additions & 5 deletions _cffi_build/build_matcher.py
@@ -15,14 +15,14 @@
 "_entitymatcher",
 source,
 source_extension='.cpp',
-extra_compile_args=['-Wall', '-Wextra', '-Werror', '-O3', '-std=c++11', '-mssse3', '-mpopcnt'],
+extra_compile_args=['-Wall', '-Wextra', '-Werror', '-O3', '-std=c++11', '-march=native', '-mssse3', '-mpopcnt', '-fvisibility=hidden'
+],
 )

 ffibuilder.cdef("""
-int match_one_against_many_dice(const char * one, const char * many, int n, double * score);
-int match_one_against_many_dice_1024_k_top(const char *one, const char *many, const uint32_t *counts_many, int n, uint32_t k, double threshold, int *indices, double *scores);
-double dice_coeff_1024(const char *e1, const char *e2);
-double popcount_1024_array(const char *many, int n, uint32_t *counts_many);
+int match_one_against_many_dice_k_top(const char *one, const char *many, const uint32_t *counts_many, int n, int keybytes, uint32_t k, double threshold, int *indices, double *scores);
+double dice_coeff(const char *array1, const char *array2, int array_bytes);
+double popcount_arrays(uint32_t *counts, const char *arrays, int narrays, int array_bytes);
 """)


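As a reading aid for the new declarations, here is a minimal pure-Python sketch of what a popcount over an arbitrary-length byte array and a Dice coefficient built on it compute. It is not the repository's C++ implementation: `popcount_bytes` and `dice_coeff_reference` are hypothetical names, and returning 0.0 when neither array has a set bit is an assumption about the edge case.

```python
def popcount_bytes(array: bytes) -> int:
    """Count the set bits in a byte string of any length."""
    return bin(int.from_bytes(array, "big")).count("1")


def dice_coeff_reference(array1: bytes, array2: bytes) -> float:
    """Dice coefficient 2*|A & B| / (|A| + |B|) of two equal-length bit arrays."""
    if len(array1) != len(array2):
        raise ValueError("arrays must have the same number of bytes")
    total = popcount_bytes(array1) + popcount_bytes(array2)
    if total == 0:
        # Assumption: define the score as 0.0 when no bits are set at all.
        return 0.0
    common = bytes(a & b for a, b in zip(array1, array2))
    return 2.0 * popcount_bytes(common) / total
```

Nothing here depends on a 1024-bit (128-byte) key: the same functions accept 1-byte or 300-byte arrays, which mirrors the role of the `array_bytes` parameter in the C declarations.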