Arbitrary length Dice coefficients #63

unzvfu · 2018-02-20T05:54:37Z

This PR implements (almost) arbitrary length Dice coefficient calculations and integrates them into anonlink. The main changes are in dice_one_against_many.cpp which extends the definitions of popcount_array and dice_coeff to arrays of length a multiple of 8 bytes. Tests are provided for both of these functions in test_utils.py and test_bloommatcher.py.

The PR also incorporates a certain amount of refactoring (especially of dice_one_against_many.cpp) which makes it slightly longer than it strictly needs to be.

Closes #53.

…ew/delete.

…bit.

…ew/delete.

…nto hlaw-refactor-cpp

…mant) code.

…ities to their own file.

hardbyte

Looks great. Stephen is going to help review the changes to _cffi_build/dice_one_against_many.cpp.

I'm seeing a performance regression - are you aware of this?

Develop Benchmark

Threshold: 0.7
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  ( 0.01%) |  0.017  (99.6% /  0.4%) |    58.241
  2000 |   2000 |    4e6  ( 0.01%) |  0.068  (99.8% /  0.2%) |    59.387
  3000 |   3000 |    9e6  ( 0.01%) |  0.132  (99.8% /  0.2%) |    68.548
  4000 |   4000 |   16e6  ( 0.01%) |  0.194  (99.9% /  0.1%) |    82.381
  5000 |   5000 |   25e6  ( 0.01%) |  0.288  (99.9% /  0.1%) |    87.062
  6000 |   6000 |   36e6  ( 0.01%) |  0.400  (99.9% /  0.1%) |    90.039
  7000 |   7000 |   49e6  ( 0.01%) |  0.562  (99.8% /  0.2%) |    87.301
  8000 |   8000 |   64e6  ( 0.01%) |  0.717  (99.9% /  0.1%) |    89.405
  9000 |   9000 |   81e6  ( 0.01%) |  0.979  (99.9% /  0.1%) |    82.806
 10000 |  10000 |  100e6  ( 0.01%) |  1.317  (99.9% /  0.1%) |    75.998
 20000 |  20000 |  400e6  ( 0.01%) |  5.434  (99.9% /  0.1%) |    73.672

This Branch Benchmark

Threshold: 0.7
Size 1 | Size 2 | Comparisons      | Total Time (s)          | Throughput
       |        |        (match %) | (comparisons / matching)|  (1e6 cmp/s)
-------+--------+------------------+-------------------------+-------------
  1000 |   1000 |    1e6  ( 0.01%) |  0.028  (99.8% /  0.2%) |    35.699
  2000 |   2000 |    4e6  ( 0.01%) |  0.087  (99.9% /  0.1%) |    46.138
  3000 |   3000 |    9e6  ( 0.01%) |  0.175  (99.9% /  0.1%) |    51.467
  4000 |   4000 |   16e6  ( 0.01%) |  0.307  (99.9% /  0.1%) |    52.154
  5000 |   5000 |   25e6  ( 0.01%) |  0.474  (99.9% /  0.1%) |    52.755
  6000 |   6000 |   36e6  ( 0.01%) |  0.681  (99.9% /  0.1%) |    52.918
  7000 |   7000 |   49e6  ( 0.01%) |  0.917  (99.9% /  0.1%) |    53.445
  8000 |   8000 |   64e6  ( 0.01%) |  1.179  (99.9% /  0.1%) |    54.330
  9000 |   9000 |   81e6  ( 0.01%) |  1.486  (99.9% /  0.1%) |    54.549
 10000 |  10000 |  100e6  ( 0.01%) |  1.898  (99.9% /  0.1%) |    52.730
 20000 |  20000 |  400e6  ( 0.01%) |  7.653  (99.9% /  0.1%) |    52.297

hardbyte · 2018-02-20T22:15:44Z

anonlink/entitymatch.py

+        return []
+
+    # Length must be a multple of 64 bits.
+    assert(len(filters1[0][0]) % 8 == 0)


If you move this comment into a second argument to the assert builtin users will get a more descriptive AssertionError

hardbyte · 2018-02-20T22:16:26Z

anonlink/entitymatch.py

    """
    length_f1 = len(filters1)
    length_f2 = len(filters2)

-    # We assume the length is 1024 bit = 128 Bytes
-    match_one_against_many_dice_1024_k_top = lib.match_one_against_many_dice_1024_k_top
+    if length_f1 == 0:


Shouldn't we check length_f2 as well?

Hm, this might be a style thing. Normally I'd expect the function to be written in such a way that it produces the correct result when either list is empty, usually achieved easily because each statement ends up being a no-op when given an empty list; indeed such was true before this PR. The only reason I check that filters1 is non-empty here is because I need to pick out its first element which in turn is just to obtain the filter length. Nothing in the rest of the code requires len(filters2) != 0 so I would say that checking it is unnecessary.

hardbyte · 2018-02-20T22:18:18Z

anonlink/entitymatch.py

            k,
            threshold,
            c_indices,
            c_scores)

+        if matches < 0:
+            raise Exception('Internel error: Bad key length')


Should be a ValueError

hardbyte · 2018-02-20T22:22:59Z

tests/test_bloommatcher.py

@@ -70,6 +73,28 @@ def test_dice_4_c(self):

        self.assertEqual(result, 0.0)

+# Generate bit arrays that are combinations of words 0, 1, 2^63, 2^64 - 1
+# of various lengths between 1 and 65 words.
+def test_dicecoeff():


Nitpick: the comments should be the Python docstring.

hardbyte · 2018-02-20T22:26:10Z

tests/test_util.py

-        bas = [util.generate_bitarray(1024) for i in range(100)]
+# Generate bit arrays that are combinations of words 0, 1, 2^63, 2^64 - 1
+# of various lengths between 1 and 65 words.
+def test_popcount_vector():


Probably a good idea to keep the tests part of a TestCase - I know nose finds functions that have test in the name but we are now running with py.test

Will leave the parameterised tests as top-level functions for now as it is not possible to put functions marked as @pytest.parametrized into unittest.TestCase classes (see notes at the end).

hardbyte · 2018-02-20T22:27:59Z

tests/test_util.py

+# of various lengths between 1 and 65 words.
+def test_popcount_vector():
+    for L in bitarray_utils.key_lengths:
+        yield check_popcount_vector, bitarray_utils.bitarrays_of_length(L)


If this is the actual test shouldn't you be calling the check_popcount_vector function? Or am I missing where test_popcount_vector is called?

It's my understanding that this is the way to get parameterised tests: test_popcount_vector returns a generator that produces pairs consisting of a test function (always check_popcount_vector here) and the parameters to pass to it (which are produced by bitarrays_of_length(L) for varying L). So test_popcount_vector produces len(key_lengths) tests.

Or did I misunderstand your question?

Okay, this is now moot as I now use the @pytest.mark.parametrize decorator.

hardbyte · 2018-02-20T22:29:06Z

tests/test_bloommatcher.py

+# of various lengths between 1 and 65 words.
+def test_dicecoeff():
+    for L in bitarray_utils.key_lengths:
+        yield check_dicecoeff, bitarray_utils.bitarrays_of_length(L)


Mean to call check_dicecoeff?

sjhardy

The C++ all looks nice. About the only comment would be it looks like the specialisation for popcount<3> is not used, so possibly redundant.

sjhardy · 2018-03-14T02:49:58Z

Approved to merge

* Skip conversion to cffi char[] unless required * Libraries shouldn't configure logging * Version bump to 0.6.3 * Improvements to benchmark (#58) * Refactor Dice coefficient calculation. * Temporary fiddling with benchmark code. * Calculate and report popcount speed from native code implementation. * Give some values more sensible variable names. * Remove unused import. * Add documentation. * Expand reporting of various measurements. * Comments. * Update README. * Bring test suite up-to-date. * Address Brian's comments. * Update tests; also test native code version. * Print popcount throughput; give some variables better names. * Update README with throughput data. * Refactor main C++ function to avoid use "constant" memory and avoid new/delete (#55) * Refactor main C++ function to avoid use "constant" memory and avoid new/delete. * Refactor Dice coefficient calculation. * Temporary fiddling with benchmark code. * Calculate and report popcount speed from native code implementation. * Give some values more sensible variable names. * Remove unused import. * Add documentation. * Expand reporting of various measurements. * Comments. * Update README. * Bring test suite up-to-date. * Refactor main C++ function to avoid use "constant" memory and avoid new/delete. * Address Brian's comments. * Update tests; also test native code version. * Print popcount throughput; give some variables better names. * Feature build on Travis CI (#61) Run tests with travis ci * Fix #include file name. * Use pytest (#68) * Update README and requirements.txt files. * Add missing line in README. * Use pytest on Jenkins. * Make Jenkins test commands the same as Travis. * Generate test output and coverage data properly. * Move 'checkout scm' command to start of function; remove redundant cleaning code. Fix #65 * Feature use jenkinslibrary (#70) * Update jenkinsfile to use jenkins library. * Reduce the number of OSX build and which node in Jenkinsfile (see #71) * Arbitrary length Dice coefficients (#63) * Refactor main C++ function to avoid use "constant" memory and avoid new/delete. * Implement popcount on (almost) arbitrary length arrays. * First pass at integrating arbitrary length keys. Slows things down a bit. * Refactor Dice coefficient calculation. * Temporary fiddling with benchmark code. * Calculate and report popcount speed from native code implementation. * Give some values more sensible variable names. * Remove unused import. * Add documentation. * Expand reporting of various measurements. * Comments. * Update README. * Bring test suite up-to-date. * Refactor main C++ function to avoid use "constant" memory and avoid new/delete. * Screw everything up by unrolling with C++ templates, apparently. * Magical argument that makes the compiler generate the correct (performant) code. * Address Brian's comments. * Update tests; also test native code version. * Print popcount throughput; give some variables better names. * Make some functions static inline. * Tidy up some expressions. * Put some braces in the right place; make fn inline. * Reinstate comment on origin of popcount assembler. * Make constant a template parameter. * Comment. * Complete version working with multiples of 1024 bits. * Add -march=native compiler option. * Implementation of arbitrary length CLKs. * Fix dumb mistakes in updating array pointer and popcounts. * Tests for arbitrary length popcounts. * Update some comments. * Arbitrary length Dice coefficient. * Rename function. * Move native dicecoeff calculation into its own function. * Add tests for native Dice coefficient calculation. * Move dicecoeff tests to bloommatcher tests; move common bitarray utilities to their own file. * Simplify slow path / reduce branches in fast path. * Adapt entitymatcher to arbitrary length CLK interface. * Remove unused function. * Update README. * Address Brian's comments. * Exit early if filter is zero. * Specialise popcount arrays calls on array length. * Fix performance regression. * Remove storage class specifiers from explicit template specialisations. * Update README and requirements.txt files. * Disable unused function. * Put stars in their proper place. * Add documentation. * Prepare changelog and bump version for release 0.7.0 * Add clkhash as dependency (required for benchmark) Add travis badge to readme

Hamish Ivey-Law added 30 commits January 15, 2018 12:03

Refactor main C++ function to avoid use "constant" memory and avoid n…

f5444a3

…ew/delete.

Implement popcount on (almost) arbitrary length arrays.

5d5337b

First pass at integrating arbitrary length keys. Slows things down a …

3864028

…bit.

Refactor Dice coefficient calculation.

5d5338f

Temporary fiddling with benchmark code.

88e3625

Calculate and report popcount speed from native code implementation.

a705de8

Give some values more sensible variable names.

cff1cb6

Remove unused import.

603b6d4

Add documentation.

de33a67

Expand reporting of various measurements.

a458ed0

Comments.

7d2e66c

Update README.

9666eae

Bring test suite up-to-date.

6fe3663

Refactor main C++ function to avoid use "constant" memory and avoid n…

66d9b6e

…ew/delete.

Merge branch 'hlaw-refactor-cpp' of github.com:n1analytics/anonlink i…

05246f2

…nto hlaw-refactor-cpp

Merge branch 'hlaw-refactor-cpp' into hlaw-arbitrary-length-dice-coeff

8181752

Screw everything up by unrolling with C++ templates, apparently.

3a55dc4

Magical argument that makes the compiler generate the correct (perfor…

b94c555

…mant) code.

Address Brian's comments.

166f6e9

Update tests; also test native code version.

9cbc243

Print popcount throughput; give some variables better names.

cf26901

Merge branch 'hlaw-fix-issue-56' into hlaw-refactor-cpp

c986021

Merge branch 'hlaw-refactor-cpp' into hlaw-arbitrary-length-dice-coeff

e978834

Make some functions static inline.

d02f23a

Tidy up some expressions.

888e989

Put some braces in the right place; make fn inline.

c6780f0

Reinstate comment on origin of popcount assembler.

3f1104f

Make constant a template parameter.

edc7c2b

Comment.

892c599

Complete version working with multiples of 1024 bits.

f500231

Hamish Ivey-Law added 9 commits February 19, 2018 17:51

Arbitrary length Dice coefficient.

e8c77bc

Rename function.

1febd65

Move native dicecoeff calculation into its own function.

21390c4

Add tests for native Dice coefficient calculation.

75cef8e

Move dicecoeff tests to bloommatcher tests; move common bitarray util…

c338c32

…ities to their own file.

Simplify slow path / reduce branches in fast path.

4d74b1c

Adapt entitymatcher to arbitrary length CLK interface.

2d6b5f7

Remove unused function.

ab45ea8

Update README.

9ccaa8d

unzvfu added this to the Sprint 2018-02-12 milestone Feb 20, 2018

unzvfu self-assigned this Feb 20, 2018

unzvfu requested a review from hardbyte February 20, 2018 05:54

hardbyte reviewed Feb 20, 2018

View reviewed changes

sjhardy approved these changes Feb 21, 2018

View reviewed changes

Hamish Ivey-Law added 11 commits February 22, 2018 10:07

Address Brian's comments.

e515b34

Exit early if filter is zero.

446033f

Specialise popcount arrays calls on array length.

dea0a0d

Fix performance regression.

d3671a2

Remove storage class specifiers from explicit template specialisations.

93abfae

Update README and requirements.txt files.

e9706ff

Merge branch 'feature-use-pytest' into hlaw-arbitrary-length-dice-coeff

8e673fb

Merge branch 'develop' into hlaw-arbitrary-length-dice-coeff

bea2ad5

Disable unused function.

ed09687

Put stars in their proper place.

63cc6e0

Add documentation.

ef82759

unzvfu merged commit e10ea25 into develop Mar 14, 2018

unzvfu deleted the hlaw-arbitrary-length-dice-coeff branch March 14, 2018 03:03

unzvfu mentioned this pull request Mar 14, 2018

C implementation of arbitrary length Dice coefficient #53

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Arbitrary length Dice coefficients #63

Arbitrary length Dice coefficients #63

unzvfu commented Feb 20, 2018

hardbyte left a comment

hardbyte Feb 20, 2018

hardbyte Feb 20, 2018

unzvfu Feb 21, 2018

hardbyte Feb 20, 2018

hardbyte Feb 20, 2018

hardbyte Feb 20, 2018

unzvfu Feb 21, 2018

hardbyte Feb 20, 2018

unzvfu Feb 21, 2018 •

edited

Loading

unzvfu Feb 21, 2018

hardbyte Feb 21, 2018

hardbyte Feb 20, 2018

sjhardy left a comment

sjhardy commented Mar 14, 2018

Arbitrary length Dice coefficients #63

Arbitrary length Dice coefficients #63

Conversation

unzvfu commented Feb 20, 2018

hardbyte left a comment

Choose a reason for hiding this comment

Develop Benchmark

This Branch Benchmark

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

unzvfu Feb 21, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sjhardy left a comment

Choose a reason for hiding this comment

sjhardy commented Mar 14, 2018

unzvfu Feb 21, 2018 •

edited

Loading