Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arbitrary length Dice coefficients #63

Merged
merged 56 commits into from
Mar 14, 2018
Merged

Conversation

unzvfu
Copy link

@unzvfu unzvfu commented Feb 20, 2018

This PR implements (almost) arbitrary length Dice coefficient calculations and integrates them into anonlink. The main changes are in dice_one_against_many.cpp which extends the definitions of popcount_array and dice_coeff to arrays of length a multiple of 8 bytes. Tests are provided for both of these functions in test_utils.py and test_bloommatcher.py.

The PR also incorporates a certain amount of refactoring (especially of dice_one_against_many.cpp) which makes it slightly longer than it strictly needs to be.

Closes #53.

Hamish Ivey-Law added 30 commits January 15, 2018 12:03
@unzvfu unzvfu added this to the Sprint 2018-02-12 milestone Feb 20, 2018
@unzvfu unzvfu self-assigned this Feb 20, 2018
@unzvfu unzvfu requested a review from hardbyte February 20, 2018 05:54
Copy link
Collaborator

@hardbyte hardbyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Stephen is going to help review the changes to _cffi_build/dice_one_against_many.cpp.

I'm seeing a performance regression - are you aware of this?

Develop Benchmark

Threshold: 0.7
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  ( 0.01%) |  0.017  (99.6% /  0.4%) |    58.241
  2000 |   2000 |    4e6  ( 0.01%) |  0.068  (99.8% /  0.2%) |    59.387
  3000 |   3000 |    9e6  ( 0.01%) |  0.132  (99.8% /  0.2%) |    68.548
  4000 |   4000 |   16e6  ( 0.01%) |  0.194  (99.9% /  0.1%) |    82.381
  5000 |   5000 |   25e6  ( 0.01%) |  0.288  (99.9% /  0.1%) |    87.062
  6000 |   6000 |   36e6  ( 0.01%) |  0.400  (99.9% /  0.1%) |    90.039
  7000 |   7000 |   49e6  ( 0.01%) |  0.562  (99.8% /  0.2%) |    87.301
  8000 |   8000 |   64e6  ( 0.01%) |  0.717  (99.9% /  0.1%) |    89.405
  9000 |   9000 |   81e6  ( 0.01%) |  0.979  (99.9% /  0.1%) |    82.806
 10000 |  10000 |  100e6  ( 0.01%) |  1.317  (99.9% /  0.1%) |    75.998
 20000 |  20000 |  400e6  ( 0.01%) |  5.434  (99.9% /  0.1%) |    73.672

This Branch Benchmark

Threshold: 0.7
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  ( 0.01%) |  0.028  (99.8% /  0.2%) |    35.699
  2000 |   2000 |    4e6  ( 0.01%) |  0.087  (99.9% /  0.1%) |    46.138
  3000 |   3000 |    9e6  ( 0.01%) |  0.175  (99.9% /  0.1%) |    51.467
  4000 |   4000 |   16e6  ( 0.01%) |  0.307  (99.9% /  0.1%) |    52.154
  5000 |   5000 |   25e6  ( 0.01%) |  0.474  (99.9% /  0.1%) |    52.755
  6000 |   6000 |   36e6  ( 0.01%) |  0.681  (99.9% /  0.1%) |    52.918
  7000 |   7000 |   49e6  ( 0.01%) |  0.917  (99.9% /  0.1%) |    53.445
  8000 |   8000 |   64e6  ( 0.01%) |  1.179  (99.9% /  0.1%) |    54.330
  9000 |   9000 |   81e6  ( 0.01%) |  1.486  (99.9% /  0.1%) |    54.549
 10000 |  10000 |  100e6  ( 0.01%) |  1.898  (99.9% /  0.1%) |    52.730
 20000 |  20000 |  400e6  ( 0.01%) |  7.653  (99.9% /  0.1%) |    52.297

return []

# Length must be a multple of 64 bits.
assert(len(filters1[0][0]) % 8 == 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you move this comment into a second argument to the assert builtin users will get a more descriptive AssertionError

"""
length_f1 = len(filters1)
length_f2 = len(filters2)

# We assume the length is 1024 bit = 128 Bytes
match_one_against_many_dice_1024_k_top = lib.match_one_against_many_dice_1024_k_top
if length_f1 == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check length_f2 as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this might be a style thing. Normally I'd expect the function to be written in such a way that it produces the correct result when either list is empty, usually achieved easily because each statement ends up being a no-op when given an empty list; indeed such was true before this PR. The only reason I check that filters1 is non-empty here is because I need to pick out its first element which in turn is just to obtain the filter length. Nothing in the rest of the code requires len(filters2) != 0 so I would say that checking it is unnecessary.

k,
threshold,
c_indices,
c_scores)

if matches < 0:
raise Exception('Internel error: Bad key length')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be a ValueError

@@ -70,6 +73,28 @@ def test_dice_4_c(self):

self.assertEqual(result, 0.0)

# Generate bit arrays that are combinations of words 0, 1, 2^63, 2^64 - 1
# of various lengths between 1 and 65 words.
def test_dicecoeff():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: the comments should be the Python docstring.

bas = [util.generate_bitarray(1024) for i in range(100)]
# Generate bit arrays that are combinations of words 0, 1, 2^63, 2^64 - 1
# of various lengths between 1 and 65 words.
def test_popcount_vector():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a good idea to keep the tests part of a TestCase - I know nose finds functions that have test in the name but we are now running with py.test

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will leave the parameterised tests as top-level functions for now as it is not possible to put functions marked as @pytest.parametrized into unittest.TestCase classes (see notes at the end).

# of various lengths between 1 and 65 words.
def test_popcount_vector():
for L in bitarray_utils.key_lengths:
yield check_popcount_vector, bitarray_utils.bitarrays_of_length(L)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is the actual test shouldn't you be calling the check_popcount_vector function? Or am I missing where test_popcount_vector is called?

Copy link
Author

@unzvfu unzvfu Feb 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's my understanding that this is the way to get parameterised tests: test_popcount_vector returns a generator that produces pairs consisting of a test function (always check_popcount_vector here) and the parameters to pass to it (which are produced by bitarrays_of_length(L) for varying L). So test_popcount_vector produces len(key_lengths) tests.

Or did I misunderstand your question?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is now moot as I now use the @pytest.mark.parametrize decorator.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great!

# of various lengths between 1 and 65 words.
def test_dicecoeff():
for L in bitarray_utils.key_lengths:
yield check_dicecoeff, bitarray_utils.bitarrays_of_length(L)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mean to call check_dicecoeff?

Copy link
Contributor

@sjhardy sjhardy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The C++ all looks nice. About the only comment would be it looks like the specialisation for popcount<3> is not used, so possibly redundant.

@sjhardy
Copy link
Contributor

sjhardy commented Mar 14, 2018

Approved to merge

@unzvfu unzvfu merged commit e10ea25 into develop Mar 14, 2018
@unzvfu unzvfu deleted the hlaw-arbitrary-length-dice-coeff branch March 14, 2018 03:03
unzvfu pushed a commit that referenced this pull request Mar 15, 2018
* Skip conversion to cffi char[] unless required

* Libraries shouldn't configure logging

* Version bump to 0.6.3

* Improvements to benchmark (#58)

* Refactor Dice coefficient calculation.

* Temporary fiddling with benchmark code.

* Calculate and report popcount speed from native code implementation.

* Give some values more sensible variable names.

* Remove unused import.

* Add documentation.

* Expand reporting of various measurements.

* Comments.

* Update README.

* Bring test suite up-to-date.

* Address Brian's comments.

* Update tests; also test native code version.

* Print popcount throughput; give some variables better names.

* Update README with throughput data.

* Refactor main C++ function to avoid use "constant" memory and avoid new/delete (#55)

* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.

* Refactor Dice coefficient calculation.

* Temporary fiddling with benchmark code.

* Calculate and report popcount speed from native code implementation.

* Give some values more sensible variable names.

* Remove unused import.

* Add documentation.

* Expand reporting of various measurements.

* Comments.

* Update README.

* Bring test suite up-to-date.

* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.

* Address Brian's comments.

* Update tests; also test native code version.

* Print popcount throughput; give some variables better names.

* Feature build on Travis CI (#61)

Run tests with travis ci

* Fix #include file name.

* Use pytest (#68)

* Update README and requirements.txt files.

* Add missing line in README.

* Use pytest on Jenkins.

* Make Jenkins test commands the same as Travis.

* Generate test output and coverage data properly.

* Move 'checkout scm' command to start of function; remove redundant cleaning code.

Fix #65

* Feature use jenkinslibrary (#70)

* Update jenkinsfile to use jenkins library.

* Reduce the number of OSX build and which node in Jenkinsfile (see #71)

* Arbitrary length Dice coefficients (#63)

* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.

* Implement popcount on (almost) arbitrary length arrays.

* First pass at integrating arbitrary length keys. Slows things down a bit.

* Refactor Dice coefficient calculation.

* Temporary fiddling with benchmark code.

* Calculate and report popcount speed from native code implementation.

* Give some values more sensible variable names.

* Remove unused import.

* Add documentation.

* Expand reporting of various measurements.

* Comments.

* Update README.

* Bring test suite up-to-date.

* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.

* Screw everything up by unrolling with C++ templates, apparently.

* Magical argument that makes the compiler generate the correct (performant) code.

* Address Brian's comments.

* Update tests; also test native code version.

* Print popcount throughput; give some variables better names.

* Make some functions static inline.

* Tidy up some expressions.

* Put some braces in the right place; make fn inline.

* Reinstate comment on origin of popcount assembler.

* Make constant a template parameter.

* Comment.

* Complete version working with multiples of 1024 bits.

* Add -march=native compiler option.

* Implementation of arbitrary length CLKs.

* Fix dumb mistakes in updating array pointer and popcounts.

* Tests for arbitrary length popcounts.

* Update some comments.

* Arbitrary length Dice coefficient.

* Rename function.

* Move native dicecoeff calculation into its own function.

* Add tests for native Dice coefficient calculation.

* Move dicecoeff tests to bloommatcher tests; move common bitarray utilities to their own file.

* Simplify slow path / reduce branches in fast path.

* Adapt entitymatcher to arbitrary length CLK interface.

* Remove unused function.

* Update README.

* Address Brian's comments.

* Exit early if filter is zero.

* Specialise popcount arrays calls on array length.

* Fix performance regression.

* Remove storage class specifiers from explicit template specialisations.

* Update README and requirements.txt files.

* Disable unused function.

* Put stars in their proper place.

* Add documentation.

* Prepare changelog and bump version for release 0.7.0

* Add clkhash as dependency (required for benchmark)
Add travis badge to readme
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants