-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arbitrary length Dice coefficients #63
Conversation
…nto hlaw-refactor-cpp
…ities to their own file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Stephen is going to help review the changes to _cffi_build/dice_one_against_many.cpp
.
I'm seeing a performance regression - are you aware of this?
Develop Benchmark
Threshold: 0.7
Size 1 | Size 2 | Comparisons | Total Time (s) | Throughput
| | (match %) | (comparisons / matching)| (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
1000 | 1000 | 1e6 ( 0.01%) | 0.017 (99.6% / 0.4%) | 58.241
2000 | 2000 | 4e6 ( 0.01%) | 0.068 (99.8% / 0.2%) | 59.387
3000 | 3000 | 9e6 ( 0.01%) | 0.132 (99.8% / 0.2%) | 68.548
4000 | 4000 | 16e6 ( 0.01%) | 0.194 (99.9% / 0.1%) | 82.381
5000 | 5000 | 25e6 ( 0.01%) | 0.288 (99.9% / 0.1%) | 87.062
6000 | 6000 | 36e6 ( 0.01%) | 0.400 (99.9% / 0.1%) | 90.039
7000 | 7000 | 49e6 ( 0.01%) | 0.562 (99.8% / 0.2%) | 87.301
8000 | 8000 | 64e6 ( 0.01%) | 0.717 (99.9% / 0.1%) | 89.405
9000 | 9000 | 81e6 ( 0.01%) | 0.979 (99.9% / 0.1%) | 82.806
10000 | 10000 | 100e6 ( 0.01%) | 1.317 (99.9% / 0.1%) | 75.998
20000 | 20000 | 400e6 ( 0.01%) | 5.434 (99.9% / 0.1%) | 73.672
This Branch Benchmark
Threshold: 0.7
Size 1 | Size 2 | Comparisons | Total Time (s) | Throughput
| | (match %) | (comparisons / matching)| (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
1000 | 1000 | 1e6 ( 0.01%) | 0.028 (99.8% / 0.2%) | 35.699
2000 | 2000 | 4e6 ( 0.01%) | 0.087 (99.9% / 0.1%) | 46.138
3000 | 3000 | 9e6 ( 0.01%) | 0.175 (99.9% / 0.1%) | 51.467
4000 | 4000 | 16e6 ( 0.01%) | 0.307 (99.9% / 0.1%) | 52.154
5000 | 5000 | 25e6 ( 0.01%) | 0.474 (99.9% / 0.1%) | 52.755
6000 | 6000 | 36e6 ( 0.01%) | 0.681 (99.9% / 0.1%) | 52.918
7000 | 7000 | 49e6 ( 0.01%) | 0.917 (99.9% / 0.1%) | 53.445
8000 | 8000 | 64e6 ( 0.01%) | 1.179 (99.9% / 0.1%) | 54.330
9000 | 9000 | 81e6 ( 0.01%) | 1.486 (99.9% / 0.1%) | 54.549
10000 | 10000 | 100e6 ( 0.01%) | 1.898 (99.9% / 0.1%) | 52.730
20000 | 20000 | 400e6 ( 0.01%) | 7.653 (99.9% / 0.1%) | 52.297
anonlink/entitymatch.py
Outdated
return [] | ||
|
||
# Length must be a multple of 64 bits. | ||
assert(len(filters1[0][0]) % 8 == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you move this comment into a second argument to the assert builtin users will get a more descriptive AssertionError
""" | ||
length_f1 = len(filters1) | ||
length_f2 = len(filters2) | ||
|
||
# We assume the length is 1024 bit = 128 Bytes | ||
match_one_against_many_dice_1024_k_top = lib.match_one_against_many_dice_1024_k_top | ||
if length_f1 == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we check length_f2
as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this might be a style thing. Normally I'd expect the function to be written in such a way that it produces the correct result when either list is empty, usually achieved easily because each statement ends up being a no-op when given an empty list; indeed such was true before this PR. The only reason I check that filters1
is non-empty here is because I need to pick out its first element which in turn is just to obtain the filter length. Nothing in the rest of the code requires len(filters2) != 0
so I would say that checking it is unnecessary.
anonlink/entitymatch.py
Outdated
k, | ||
threshold, | ||
c_indices, | ||
c_scores) | ||
|
||
if matches < 0: | ||
raise Exception('Internel error: Bad key length') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be a ValueError
tests/test_bloommatcher.py
Outdated
@@ -70,6 +73,28 @@ def test_dice_4_c(self): | |||
|
|||
self.assertEqual(result, 0.0) | |||
|
|||
# Generate bit arrays that are combinations of words 0, 1, 2^63, 2^64 - 1 | |||
# of various lengths between 1 and 65 words. | |||
def test_dicecoeff(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: the comments should be the Python docstring.
tests/test_util.py
Outdated
bas = [util.generate_bitarray(1024) for i in range(100)] | ||
# Generate bit arrays that are combinations of words 0, 1, 2^63, 2^64 - 1 | ||
# of various lengths between 1 and 65 words. | ||
def test_popcount_vector(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a good idea to keep the tests part of a TestCase
- I know nose
finds functions that have test in the name but we are now running with py.test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will leave the parameterised tests as top-level functions for now as it is not possible to put functions marked as @pytest.parametrized
into unittest.TestCase
classes (see notes at the end).
tests/test_util.py
Outdated
# of various lengths between 1 and 65 words. | ||
def test_popcount_vector(): | ||
for L in bitarray_utils.key_lengths: | ||
yield check_popcount_vector, bitarray_utils.bitarrays_of_length(L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is the actual test shouldn't you be calling the check_popcount_vector
function? Or am I missing where test_popcount_vector
is called?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's my understanding that this is the way to get parameterised tests: test_popcount_vector
returns a generator that produces pairs consisting of a test function (always check_popcount_vector
here) and the parameters to pass to it (which are produced by bitarrays_of_length(L)
for varying L
). So test_popcount_vector
produces len(key_lengths)
tests.
Or did I misunderstand your question?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this is now moot as I now use the @pytest.mark.parametrize
decorator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great!
tests/test_bloommatcher.py
Outdated
# of various lengths between 1 and 65 words. | ||
def test_dicecoeff(): | ||
for L in bitarray_utils.key_lengths: | ||
yield check_dicecoeff, bitarray_utils.bitarrays_of_length(L) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mean to call check_dicecoeff
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The C++ all looks nice. About the only comment would be it looks like the specialisation for popcount<3> is not used, so possibly redundant.
Approved to merge |
* Skip conversion to cffi char[] unless required * Libraries shouldn't configure logging * Version bump to 0.6.3 * Improvements to benchmark (#58) * Refactor Dice coefficient calculation. * Temporary fiddling with benchmark code. * Calculate and report popcount speed from native code implementation. * Give some values more sensible variable names. * Remove unused import. * Add documentation. * Expand reporting of various measurements. * Comments. * Update README. * Bring test suite up-to-date. * Address Brian's comments. * Update tests; also test native code version. * Print popcount throughput; give some variables better names. * Update README with throughput data. * Refactor main C++ function to avoid use "constant" memory and avoid new/delete (#55) * Refactor main C++ function to avoid use "constant" memory and avoid new/delete. * Refactor Dice coefficient calculation. * Temporary fiddling with benchmark code. * Calculate and report popcount speed from native code implementation. * Give some values more sensible variable names. * Remove unused import. * Add documentation. * Expand reporting of various measurements. * Comments. * Update README. * Bring test suite up-to-date. * Refactor main C++ function to avoid use "constant" memory and avoid new/delete. * Address Brian's comments. * Update tests; also test native code version. * Print popcount throughput; give some variables better names. * Feature build on Travis CI (#61) Run tests with travis ci * Fix #include file name. * Use pytest (#68) * Update README and requirements.txt files. * Add missing line in README. * Use pytest on Jenkins. * Make Jenkins test commands the same as Travis. * Generate test output and coverage data properly. * Move 'checkout scm' command to start of function; remove redundant cleaning code. Fix #65 * Feature use jenkinslibrary (#70) * Update jenkinsfile to use jenkins library. * Reduce the number of OSX build and which node in Jenkinsfile (see #71) * Arbitrary length Dice coefficients (#63) * Refactor main C++ function to avoid use "constant" memory and avoid new/delete. * Implement popcount on (almost) arbitrary length arrays. * First pass at integrating arbitrary length keys. Slows things down a bit. * Refactor Dice coefficient calculation. * Temporary fiddling with benchmark code. * Calculate and report popcount speed from native code implementation. * Give some values more sensible variable names. * Remove unused import. * Add documentation. * Expand reporting of various measurements. * Comments. * Update README. * Bring test suite up-to-date. * Refactor main C++ function to avoid use "constant" memory and avoid new/delete. * Screw everything up by unrolling with C++ templates, apparently. * Magical argument that makes the compiler generate the correct (performant) code. * Address Brian's comments. * Update tests; also test native code version. * Print popcount throughput; give some variables better names. * Make some functions static inline. * Tidy up some expressions. * Put some braces in the right place; make fn inline. * Reinstate comment on origin of popcount assembler. * Make constant a template parameter. * Comment. * Complete version working with multiples of 1024 bits. * Add -march=native compiler option. * Implementation of arbitrary length CLKs. * Fix dumb mistakes in updating array pointer and popcounts. * Tests for arbitrary length popcounts. * Update some comments. * Arbitrary length Dice coefficient. * Rename function. * Move native dicecoeff calculation into its own function. * Add tests for native Dice coefficient calculation. * Move dicecoeff tests to bloommatcher tests; move common bitarray utilities to their own file. * Simplify slow path / reduce branches in fast path. * Adapt entitymatcher to arbitrary length CLK interface. * Remove unused function. * Update README. * Address Brian's comments. * Exit early if filter is zero. * Specialise popcount arrays calls on array length. * Fix performance regression. * Remove storage class specifiers from explicit template specialisations. * Update README and requirements.txt files. * Disable unused function. * Put stars in their proper place. * Add documentation. * Prepare changelog and bump version for release 0.7.0 * Add clkhash as dependency (required for benchmark) Add travis badge to readme
This PR implements (almost) arbitrary length Dice coefficient calculations and integrates them into anonlink. The main changes are in
dice_one_against_many.cpp
which extends the definitions ofpopcount_array
anddice_coeff
to arrays of length a multiple of 8 bytes. Tests are provided for both of these functions intest_utils.py
andtest_bloommatcher.py
.The PR also incorporates a certain amount of refactoring (especially of
dice_one_against_many.cpp
) which makes it slightly longer than it strictly needs to be.Closes #53.