Improvements to benchmark #58

unzvfu · 2018-02-05T03:15:40Z

This PR attempts to address issue #56. A description of the new "benchmark report" can be found in the file README.rst.

The changes to the C++ code (i) measure and return the time taken to do the comparisons, (ii) allow direct access to the popcount function and (iii) include some associated refactoring.

The changes to the Python code (i) let the popcount_vector function call the C++ implementation and (ii) modify the measurements in the benchmark.

hardbyte

Just a few comments

hardbyte · 2018-02-07T00:39:58Z

README.rst

+    Native code (no copy):      |     0.91  |   13443.87
+    Native code (w/ copy):      |   381.83  |      31.97   (99.8% copying)
+
+    Threshold: 0.5


Should we print out the k value used for the table's benchmarks as well?

It's currently set to "maximum possible", but you're right, that should be mentioned explicitly somewhere.

hardbyte · 2018-02-07T00:40:38Z

README.rst

+      4000 |   4000 |      16e6  (50.54%)   |  4.635s (88.3% / 11.7%) |     3.910
+
+    Threshold: 0.7
+    Size 1 | Size 2 | Comparisons (match %) | Total Time (simat/solv) | Throughput (1e6 cmp/s)


simat/solv isn't clear to me on first reading. I have read the explanation further down, but maybe consider different words?

How about comparisons vs matching?

Haha, indeed. I picked the names because they fit in the space rather than because they were readable.
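
For reference, the throughput figure in the 4000 x 4000 row quoted above looks consistent with dividing the number of comparisons by the time spent in the comparison phase alone (the "simat" share), not by the total time. A minimal sketch of that arithmetic, assuming that reading of the columns is right; the function name is made up for illustration and is not part of anonlink:

```python
def comparison_throughput(n_comparisons, total_seconds, comparison_fraction):
    """Return throughput in units of 1e6 comparisons per second.

    Assumption: throughput is measured against the comparison phase
    only, i.e. total time scaled by the comparison share.
    """
    comparison_seconds = total_seconds * comparison_fraction
    return n_comparisons / comparison_seconds / 1e6


# Values taken from the 4000 x 4000 row quoted above:
# 16e6 comparisons, 4.635s total, 88.3% of it spent comparing.
throughput = comparison_throughput(16e6, 4.635, 0.883)
print(f"{throughput:.2f}")
```

This lands at roughly 3.91, in line with the table; the small difference in the last digit is what you would expect from the percentage being rounded to one decimal place.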

hardbyte · 2018-02-07T00:48:25Z

anonlink/util.py

-    """
-    Note, due to the overhead of converting bitarrays into
-    bytes, it is more expensive to call our C implementation
+def popcount_vector(bitarrays, use_native=False):


Just to be consistent, either switch this parameter to use_python or switch calculate_filter_similarity to use_native.

https://github.com/n1analytics/anonlink/blob/master/anonlink/entitymatch.py#L128

hardbyte · 2018-02-07T00:48:55Z

anonlink/util.py

-    Note, due to the overhead of converting bitarrays into
-    bytes, it is more expensive to call our C implementation
+def popcount_vector(bitarrays, use_native=False):
+    """Return an array containing the popcounts of the elements of


Bit pedantic, but "array" should be "list".

hardbyte · 2018-02-07T00:50:54Z

anonlink/util.py

+    """Return an array containing the popcounts of the elements of
+    bitarrays. If use_native is True, use the native code
+    implementation and return the time spent (in milliseconds) in the
+    native code as a second return value.


Strongly suggest keeping the return types consistent by also recording the time taken in Python.

I'd probably convert into seconds too.
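
A rough sketch of what a consistent return type could look like: both paths return a (popcounts, seconds) pair. This is an illustration only, not the code in this PR. The native_popcount argument is a hypothetical stand-in for the native call, FakeBits is a dependency-free stand-in for bitarray, and the timing here wraps the whole call rather than being measured inside the native code.

```python
import time


def popcount_vector(bitarrays, native_popcount=None):
    """Return (popcounts, seconds) regardless of which path is used.

    `bitarrays` is any sequence of objects with a no-argument count()
    method. `native_popcount` is a hypothetical callable standing in
    for the native implementation; it is not anonlink's real interface.
    """
    start = time.perf_counter()
    if native_popcount is not None:
        counts = list(native_popcount(bitarrays))
    else:
        counts = [ba.count() for ba in bitarrays]
    elapsed = time.perf_counter() - start  # seconds, not milliseconds
    return counts, elapsed


class FakeBits:
    """Tiny stand-in for a bitarray so the sketch runs on its own."""

    def __init__(self, bits):
        self.bits = bits

    def count(self):
        return self.bits.count("1")


filters = [FakeBits("1011"), FakeBits("0000"), FakeBits("1111")]
counts, seconds = popcount_vector(filters)
print(counts, seconds)
```

Callers can then unpack the result the same way whichever implementation ran, and the benchmark can compare the two timings directly.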

hardbyte · 2018-02-07T00:51:31Z

anonlink/util.py

-    #                 bytes([b for f in clks for b in f.tobytes()]))
-    # lib.popcount_1024_array(many, n, c_popcounts)
-    #
-    # return [c_popcounts[i] for i in range(n)]


erm sorry about that

* Skip conversion to cffi char[] unless required
* Libraries shouldn't configure logging
* Version bump to 0.6.3
* Improvements to benchmark (#58)
* Refactor Dice coefficient calculation.
* Temporary fiddling with benchmark code.
* Calculate and report popcount speed from native code implementation.
* Give some values more sensible variable names.
* Remove unused import.
* Add documentation.
* Expand reporting of various measurements.
* Comments.
* Update README.
* Bring test suite up-to-date.
* Address Brian's comments.
* Update tests; also test native code version.
* Print popcount throughput; give some variables better names.
* Update README with throughput data.
* Refactor main C++ function to avoid use "constant" memory and avoid new/delete (#55)
* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.
* Refactor Dice coefficient calculation.
* Temporary fiddling with benchmark code.
* Calculate and report popcount speed from native code implementation.
* Give some values more sensible variable names.
* Remove unused import.
* Add documentation.
* Expand reporting of various measurements.
* Comments.
* Update README.
* Bring test suite up-to-date.
* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.
* Address Brian's comments.
* Update tests; also test native code version.
* Print popcount throughput; give some variables better names.
* Feature build on Travis CI (#61) Run tests with travis ci
* Fix #include file name.
* Use pytest (#68)
* Update README and requirements.txt files.
* Add missing line in README.
* Use pytest on Jenkins.
* Make Jenkins test commands the same as Travis.
* Generate test output and coverage data properly.
* Move 'checkout scm' command to start of function; remove redundant cleaning code. Fix #65
* Feature use jenkinslibrary (#70)
* Update jenkinsfile to use jenkins library.
* Reduce the number of OSX build and which node in Jenkinsfile (see #71)
* Arbitrary length Dice coefficients (#63)
* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.
* Implement popcount on (almost) arbitrary length arrays.
* First pass at integrating arbitrary length keys. Slows things down a bit.
* Refactor Dice coefficient calculation.
* Temporary fiddling with benchmark code.
* Calculate and report popcount speed from native code implementation.
* Give some values more sensible variable names.
* Remove unused import.
* Add documentation.
* Expand reporting of various measurements.
* Comments.
* Update README.
* Bring test suite up-to-date.
* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.
* Screw everything up by unrolling with C++ templates, apparently.
* Magical argument that makes the compiler generate the correct (performant) code.
* Address Brian's comments.
* Update tests; also test native code version.
* Print popcount throughput; give some variables better names.
* Make some functions static inline.
* Tidy up some expressions.
* Put some braces in the right place; make fn inline.
* Reinstate comment on origin of popcount assembler.
* Make constant a template parameter.
* Comment.
* Complete version working with multiples of 1024 bits.
* Add -march=native compiler option.
* Implementation of arbitrary length CLKs.
* Fix dumb mistakes in updating array pointer and popcounts.
* Tests for arbitrary length popcounts.
* Update some comments.
* Arbitrary length Dice coefficient.
* Rename function.
* Move native dicecoeff calculation into its own function.
* Add tests for native Dice coefficient calculation.
* Move dicecoeff tests to bloommatcher tests; move common bitarray utilities to their own file.
* Simplify slow path / reduce branches in fast path.
* Adapt entitymatcher to arbitrary length CLK interface.
* Remove unused function.
* Update README.
* Address Brian's comments.
* Exit early if filter is zero.
* Specialise popcount arrays calls on array length.
* Fix performance regression.
* Remove storage class specifiers from explicit template specialisations.
* Update README and requirements.txt files.
* Disable unused function.
* Put stars in their proper place.
* Add documentation.
* Prepare changelog and bump version for release 0.7.0
* Add clkhash as dependency (required for benchmark) Add travis badge to readme

Hamish Ivey-Law added 9 commits February 1, 2018 11:03

Refactor Dice coefficient calculation.

5d5338f

Temporary fiddling with benchmark code.

88e3625

Calculate and report popcount speed from native code implementation.

a705de8

Give some values more sensible variable names.

cff1cb6

Remove unused import.

603b6d4

Add documentation.

de33a67

Expand reporting of various measurements.

a458ed0

Comments.

7d2e66c

Update README.

9666eae

unzvfu self-assigned this Feb 5, 2018

unzvfu requested a review from hardbyte February 5, 2018 03:15

Bring test suite up-to-date.

6fe3663

hardbyte reviewed Feb 7, 2018

Hamish Ivey-Law added 4 commits February 7, 2018 13:45

Address Brian's comments.

166f6e9

Update tests; also test native code version.

9cbc243

Print popcount throughput; give some variables better names.

cf26901

Update README with throughput data.

28ad7ec

unzvfu merged commit eefd19b into develop Feb 9, 2018

unzvfu deleted the hlaw-fix-issue-56 branch February 9, 2018 05:29

unzvfu restored the hlaw-fix-issue-56 branch February 9, 2018 05:29

unzvfu mentioned this pull request Feb 9, 2018

Misleading or confusing aspects of anonlink.benchmark #56

Closed

unzvfu deleted the hlaw-fix-issue-56 branch February 9, 2018 05:35
