Feature arbitrary size encodings #366

hardbyte · 2020-10-21T01:32:51Z

Builds on #363 as suggested by @nbgl in #163

Implements a code path in C for popcount and one-to-many Dice when the encoding size is not word aligned. This means you can finally work with encodings of just 1 byte or 13 bytes (if you are so inclined).
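To illustrate the idea (this is not the C code from this PR), here is a minimal Python sketch of counting bits over whole 8-byte words first and then over any leftover bytes, followed by a one-to-many Dice score built on top of it. The names `popcount_bytes` and `dice_one_to_many` are hypothetical, and returning 0.0 when both encodings are empty of set bits is an assumption made for the sketch.

```python
from typing import List, Sequence

WORD_SIZE = 8  # bytes per 64-bit word, matching the word size discussed in this PR


def popcount_bytes(data: bytes) -> int:
    """Count the set bits in an encoding of any byte length (hypothetical helper)."""
    n_words = len(data) // WORD_SIZE
    aligned_end = n_words * WORD_SIZE
    total = 0
    # Word-aligned part: process 8 bytes at a time.
    for start in range(0, aligned_end, WORD_SIZE):
        word = int.from_bytes(data[start:start + WORD_SIZE], "little")
        total += bin(word).count("1")
    # Tail: the remaining 0 to 7 bytes that do not fill a whole word.
    for byte in data[aligned_end:]:
        total += bin(byte).count("1")
    return total


def dice_one_to_many(query: bytes, candidates: Sequence[bytes]) -> List[float]:
    """Dice coefficient of one encoding against many (illustrative only)."""
    query_count = popcount_bytes(query)
    scores = []
    for candidate in candidates:
        if len(candidate) != len(query):
            raise ValueError("all encodings must have the same length")
        # Bits set in both encodings.
        common = bytes(a & b for a, b in zip(query, candidate))
        denominator = query_count + popcount_bytes(candidate)
        # Assumption for this sketch: score 0.0 when neither encoding has set bits.
        scores.append(2 * popcount_bytes(common) / denominator if denominator else 0.0)
    return scores
```

For example, `popcount_bytes(b"\xff" * 13)` walks one full word and then five tail bytes, which is the case that was previously unsupported.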

I haven't done any benchmarking, although I've added several tests and expanded existing ones.

Encodings of any number of bytes are now supported. Closes #163

codecov · 2020-10-21T01:51:07Z

Codecov Report

Merging #366 (47ca356) into master (03ac3d0) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #366   +/-   ##
=======================================
  Coverage   94.57%   94.57%           
=======================================
  Files          16       16           
  Lines         792      792           
=======================================
  Hits          749      749           
  Misses         43       43

Bumps [hypothesis](https://github.com/HypothesisWorks/hypothesis) from 5.41.0 to 5.41.1. - [Release notes](https://github.com/HypothesisWorks/hypothesis/releases) - [Commits](HypothesisWorks/hypothesis@hypothesis-python-5.41.0...hypothesis-python-5.41.1) Signed-off-by: dependabot-preview[bot] <support@dependabot.com>


hardbyte · 2020-11-09T19:14:01Z

@wilko77 did you find a way to make the coverage work for external PRs? As far as I can tell this is passing all the tests on each system and is ready for review.

wilko77 · 2020-11-09T23:07:51Z

I started looking into the pipeline (#365). I made the coverage optional, that works. However, we then have another step where we upload the artifacts to a feed. And this still fails. It complains that the Anonlink build service hasn't got the permissions to publish there.
That's where I got stuck. As far as I can see, the Build Service has contributor permissions, so it should be able to twine upload to the feed.
Maybe we can get the release pipeline to do the publishing to the feed.
Or can we do without that feed?

hardbyte · 2020-11-09T23:48:45Z

My guess is that the "Publish package to test feed" stage could be enabled only for the main branches - https://docs.microsoft.com/en-us/azure/devops/pipelines/process/conditions?view=azure-devops&tabs=yaml

> Maybe we can get the release pipeline to do the publishing to the feed.
> Or can we do without that feed?

Yeah, I think if you could get the release pipeline to pick up all the artifacts directly you wouldn't need the test feed. I added it so we could test linked feature branches across the multiple projects, primarily so anonlink-entity-service could test a not-yet-released version of the anonlink library by pulling from the test feed.

wilko77

Nice.
A few minor things. See comments.
Thank you.

If you merge master, then you'll get the updated CI definition, and with it the satisfaction of seeing the CI pipeline succeed.

anonlink/similarities/_dice_x86.py

anonlink/similarities/dice.cpp

wilko77 · 2020-11-11T12:59:39Z

anonlink/similarities/_dice.pyx

+    # To permit arbitrary size input_data, we may need to pad with zeros to align to a word boundary?
+    # Eg say our input is 32 bits and our array size is 8, our output_size is already correct at 4
+    # but we may want to make the input 64 bits?


I don't fully understand this. I thought you implemented arbitrary sized inputs? What exactly is the issue here?

This was me recording my thinking early on. As you know, before this PR the CPP code assumed inputs would be a multiple of WORD_SIZE (8 bytes) in length. At least for supporting popcount, we could have padded the input with zeros up to the nearest word size boundary and avoided modifying the CPP code.

Ultimately I didn't take that approach anywhere so I'll remove the comment.
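For reference, the padding alternative described above (the one that was not adopted) can be sketched in a few lines of Python. The helper names `pad_to_word_boundary` and `popcount` are hypothetical; the point is only that appending zero bytes up to the next multiple of WORD_SIZE leaves the number of set bits unchanged.

```python
WORD_SIZE = 8  # bytes, as assumed by the CPP code before this PR


def pad_to_word_boundary(data: bytes, word_size: int = WORD_SIZE) -> bytes:
    """Append zero bytes so the length becomes a multiple of word_size."""
    remainder = len(data) % word_size
    if remainder == 0:
        return data  # already aligned, nothing to add
    return data + bytes(word_size - remainder)


def popcount(data: bytes) -> int:
    """Count set bits by treating the whole buffer as one big integer."""
    return bin(int.from_bytes(data, "little")).count("1")


# The 32-bit input from the code comment becomes 64 bits after padding,
# and the count of set bits is the same before and after.
encoding = b"\xf0\x0f\x01\x80"
padded = pad_to_word_boundary(encoding)
assert len(padded) == 8
assert popcount(padded) == popcount(encoding)
```

Zero bytes also contribute nothing to a bitwise AND, so padding both operands the same way would not change a Dice score either; the cost is the extra copy of every input.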


Permit arbitrary size encodings

3668958

Encodings of any number of bytes are now supported. Closes #163

wilko77 and others added 3 commits November 2, 2020 17:15

Merge branch 'master' into feature-arbitrary-size-encodings

d9cd306

Permit arbitrary size encodings

07ebb92

Encodings of any number of bytes are now supported. Closes #163

wilko77 approved these changes Nov 11, 2020


hardbyte added 4 commits November 13, 2020 09:23

Edit docstrings and comments after code review

9cf5b05

Merge branch 'master' into feature-arbitrary-size-encodings

8b63429

Merge remote-tracking branch 'hardbyte/feature-arbitrary-size-encodin…

045a653

…gs' into feature-arbitrary-size-encodings # Conflicts: # anonlink/similarities/_dice.pyx # requirements.txt

Add test that non-byte aligned datasets raise

47ca356

hardbyte merged commit 6d4aba7 into data61:master Nov 15, 2020

hardbyte deleted the feature-arbitrary-size-encodings branch November 15, 2020 03:37
