Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to benchmark #58

Merged
merged 14 commits into from
Feb 9, 2018
Merged

Improvements to benchmark #58

merged 14 commits into from
Feb 9, 2018

Conversation

unzvfu
Copy link

@unzvfu unzvfu commented Feb 5, 2018

This PR attempts to address issue #56. A description of the new "benchmark report" can be found in the file README.rst.

The changes to the C++ code are (i) measure and return the time taken to do the comparisons, (ii) allow direct access to the popcount function and (iii) some associated refactoring.

The changes to the Python code are (i) the popcount_vector function can now call the C++ implementation and (ii) modifying the measurements in the benchmark.

@unzvfu unzvfu self-assigned this Feb 5, 2018
@unzvfu unzvfu requested a review from hardbyte February 5, 2018 03:15
Copy link
Collaborator

@hardbyte hardbyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments

README.rst Outdated
Native code (no copy): | 0.91 | 13443.87
Native code (w/ copy): | 381.83 | 31.97 (99.8% copying)

Threshold: 0.5
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we print out the k value used for the table`s benchmarks as well?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's currently set to "maximum possible", but you're right, that should be mentioned explicitly somewhere.

README.rst Outdated
4000 | 4000 | 16e6 (50.54%) | 4.635s (88.3% / 11.7%) | 3.910

Threshold: 0.7
Size 1 | Size 2 | Comparisons (match %) | Total Time (simat/solv) | Throughput (1e6 cmp/s)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simat/solv isn't clear on first reading to me - I have read the explanation further down but maybe consider different words?

How about comparisons vs matching?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, indeed. I picked the names because they fit in the space rather than because they were readable.

anonlink/util.py Outdated
"""
Note, due to the overhead of converting bitarrays into
bytes, it is more expensive to call our C implementation
def popcount_vector(bitarrays, use_native=False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be consistent either switch this parameter to use_python or switch the calculate_filter_similarity to use_native.

https://github.com/n1analytics/anonlink/blob/master/anonlink/entitymatch.py#L128

anonlink/util.py Outdated
Note, due to the overhead of converting bitarrays into
bytes, it is more expensive to call our C implementation
def popcount_vector(bitarrays, use_native=False):
"""Return an array containing the popcounts of the elements of
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bit pedantic but array should be list

anonlink/util.py Outdated
"""Return an array containing the popcounts of the elements of
bitarrays. If use_native is True, use the native code
implementation and return the time spent (in milliseconds) in the
native code as a second return value.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strongly suggest keeping the return types consistent by also recording the time taken in python.

I'd probably convert into seconds too.

# bytes([b for f in clks for b in f.tobytes()]))
# lib.popcount_1024_array(many, n, c_popcounts)
#
# return [c_popcounts[i] for i in range(n)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

erm sorry about that

@unzvfu unzvfu merged commit eefd19b into develop Feb 9, 2018
@unzvfu unzvfu deleted the hlaw-fix-issue-56 branch February 9, 2018 05:29
@unzvfu unzvfu restored the hlaw-fix-issue-56 branch February 9, 2018 05:29
@unzvfu unzvfu deleted the hlaw-fix-issue-56 branch February 9, 2018 05:35
unzvfu pushed a commit that referenced this pull request Mar 15, 2018
* Skip conversion to cffi char[] unless required

* Libraries shouldn't configure logging

* Version bump to 0.6.3

* Improvements to benchmark (#58)

* Refactor Dice coefficient calculation.

* Temporary fiddling with benchmark code.

* Calculate and report popcount speed from native code implementation.

* Give some values more sensible variable names.

* Remove unused import.

* Add documentation.

* Expand reporting of various measurements.

* Comments.

* Update README.

* Bring test suite up-to-date.

* Address Brian's comments.

* Update tests; also test native code version.

* Print popcount throughput; give some variables better names.

* Update README with throughput data.

* Refactor main C++ function to avoid use "constant" memory and avoid new/delete (#55)

* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.

* Refactor Dice coefficient calculation.

* Temporary fiddling with benchmark code.

* Calculate and report popcount speed from native code implementation.

* Give some values more sensible variable names.

* Remove unused import.

* Add documentation.

* Expand reporting of various measurements.

* Comments.

* Update README.

* Bring test suite up-to-date.

* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.

* Address Brian's comments.

* Update tests; also test native code version.

* Print popcount throughput; give some variables better names.

* Feature build on Travis CI (#61)

Run tests with travis ci

* Fix #include file name.

* Use pytest (#68)

* Update README and requirements.txt files.

* Add missing line in README.

* Use pytest on Jenkins.

* Make Jenkins test commands the same as Travis.

* Generate test output and coverage data properly.

* Move 'checkout scm' command to start of function; remove redundant cleaning code.

Fix #65

* Feature use jenkinslibrary (#70)

* Update jenkinsfile to use jenkins library.

* Reduce the number of OSX build and which node in Jenkinsfile (see #71)

* Arbitrary length Dice coefficients (#63)

* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.

* Implement popcount on (almost) arbitrary length arrays.

* First pass at integrating arbitrary length keys. Slows things down a bit.

* Refactor Dice coefficient calculation.

* Temporary fiddling with benchmark code.

* Calculate and report popcount speed from native code implementation.

* Give some values more sensible variable names.

* Remove unused import.

* Add documentation.

* Expand reporting of various measurements.

* Comments.

* Update README.

* Bring test suite up-to-date.

* Refactor main C++ function to avoid use "constant" memory and avoid new/delete.

* Screw everything up by unrolling with C++ templates, apparently.

* Magical argument that makes the compiler generate the correct (performant) code.

* Address Brian's comments.

* Update tests; also test native code version.

* Print popcount throughput; give some variables better names.

* Make some functions static inline.

* Tidy up some expressions.

* Put some braces in the right place; make fn inline.

* Reinstate comment on origin of popcount assembler.

* Make constant a template parameter.

* Comment.

* Complete version working with multiples of 1024 bits.

* Add -march=native compiler option.

* Implementation of arbitrary length CLKs.

* Fix dumb mistakes in updating array pointer and popcounts.

* Tests for arbitrary length popcounts.

* Update some comments.

* Arbitrary length Dice coefficient.

* Rename function.

* Move native dicecoeff calculation into its own function.

* Add tests for native Dice coefficient calculation.

* Move dicecoeff tests to bloommatcher tests; move common bitarray utilities to their own file.

* Simplify slow path / reduce branches in fast path.

* Adapt entitymatcher to arbitrary length CLK interface.

* Remove unused function.

* Update README.

* Address Brian's comments.

* Exit early if filter is zero.

* Specialise popcount arrays calls on array length.

* Fix performance regression.

* Remove storage class specifiers from explicit template specialisations.

* Update README and requirements.txt files.

* Disable unused function.

* Put stars in their proper place.

* Add documentation.

* Prepare changelog and bump version for release 0.7.0

* Add clkhash as dependency (required for benchmark)
Add travis badge to readme
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants