
Hash provided block identifier before storing in DB #633

Merged: hardbyte merged 4 commits into develop from increase-block-id-size on Apr 18, 2021

Conversation

hardbyte (Collaborator)

Related to issue #632.

Block names were limited to 64 chars. For P-Sig we may very well want to use block IDs that are based on Bloom filters, and a 64-char limit would impose a maximum Bloom filter size of 512 bits (assuming a perfect encoding, unlike JSON).

This small change increases the column size to 512 chars, although I'm happy to increase it further or remove the character limit altogether.

@hardbyte hardbyte requested a review from wilko77 March 22, 2021 01:57
@wilko77 (Collaborator) left a comment

512 bytes for a block?
Say we have 15 million entities with about 1.5 million blocks. If we use 512 bytes for every block ID, then we have to insert about 6 GB worth of block IDs into the database.
However, the server only cares about matching up identical block IDs; it does not care about the particular format. The minimum number of bits needed to distinguish 1.5 million different blocks is 21 bits. That's only about 30 MB of data.
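The back-of-envelope arithmetic above can be checked directly (rough figures; the exact totals depend on how many block IDs each entity carries, which the comment doesn't pin down):

```python
import math

entities = 15_000_000
blocks = 1_500_000

# Bits needed to distinguish 1.5 million different blocks.
bits = math.ceil(math.log2(blocks))
print(bits)  # 21

# One 512-byte block ID per entity: on the order of the quoted 6 GB.
print(entities * 512 / 1e9)  # 7.68 (GB)

# One 21-bit block ID per entity: on the order of the quoted 30 MB.
print(entities * bits / 8 / 1e6)  # 39.375 (MB)
```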

Why don't we do the following:
We let the user upload whatever they want. Then, when the server writes the block information into the database, it maps the user-provided block IDs to fixed-size hash values and uses those as the internal block IDs. 128 bits should be plenty to all but eliminate the chance of a collision.
On the plus side, there would be less load on the database, as there is less data, and plain integers instead of strings. What do you think?

@hardbyte (Collaborator, Author) commented Mar 22, 2021

Oh, I like that proposal a lot. So we would keep the DB constraint at 64 chars, but hash the provided block ID during import.

As for implementation: hash via HMAC with a run-specific key? The key would be generated by the service on run creation and stored in the DB run table, with hmac_instance.hexdigest() converting to a string for storage in the existing 64-char block_id column?

>>> from hashlib import blake2b
>>> h = blake2b(key=b'pseudorandom key stored in run table', digest_size=16)
>>> h.update(b'provided block id')
>>> h.hexdigest()
'3d363ff7401e02026f4a4687d4863ced'

@hardbyte hardbyte requested a review from wilko77 March 22, 2021 20:49
@wilko77 (Collaborator) left a comment

I don't think a keyed hash function is necessary.
We see the server as a semi-trusted third party. Thus, if the block IDs generated by the data providers leak information about the PII, then the providers shouldn't upload them in that form in the first place; the server must not be allowed to see that.
All the privacy protection has to happen on the client side (as with the CLKs).
That means that if we have "leaky" block IDs, the clients have to do something about it: either encrypt, or use a keyed hash function, and only upload 'safe' block IDs.

For the server part, the sole purpose of the proposed hashing is to create fixed-size indices, thus allowing clients to upload arbitrary block IDs. You could even just use Python's built-in hash function.
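A minimal sketch of that server-side mapping (my illustration, not the PR's actual code): an unkeyed blake2b digest gives a stable fixed-size index. One caveat if going even simpler: Python's built-in hash() is salted per interpreter run (PYTHONHASHSEED), so its values would not match across worker processes.

```python
from hashlib import blake2b

def block_index(block_id: bytes) -> str:
    # Unkeyed, fixed-size (128-bit) digest: 32 hex chars,
    # comfortably inside a 64-char block_id column.
    return blake2b(block_id, digest_size=16).hexdigest()

# Arbitrary-length user-provided IDs map to fixed-size indices.
idx = block_index(b"a very long bloom-filter-based block id" * 20)
print(len(idx))  # 32
```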

@wilko77 (Collaborator) left a comment

looks good now

@@ -277,6 +277,7 @@ def compute_filter_similarity(package, project_id, run_id, threshold, encoding_s
task_span = compute_filter_similarity.span

def new_child_span(name, parent_scope=None):
log.debug(name)
Collaborator:

that's not a good debug message.

@hardbyte hardbyte changed the title Increase block identifier size Hash provided block identifier before storing in DB Mar 28, 2021
@hardbyte (Collaborator, Author) commented:
Sorry, I forgot about this PR.

Just rebased after adding another test, and combined two loops into one (importing blocks was hashing the names and then counting the sizes in two separate loops). If/once the CI passes I'll merge.
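The combined loop might look roughly like this (hypothetical names and structure; the real import code in the PR differs):

```python
from collections import defaultdict
from hashlib import blake2b

def import_blocks(block_names):
    # Single pass: hash each provided block name and tally block
    # sizes at the same time, instead of two separate loops.
    sizes = defaultdict(int)
    hashed_names = []
    for name in block_names:
        hashed = blake2b(name.encode(), digest_size=16).hexdigest()
        hashed_names.append(hashed)
        sizes[hashed] += 1
    return hashed_names, sizes

hashed, sizes = import_blocks(["block-a", "block-a", "block-b"])
print(len(sizes))  # 2
```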

@hardbyte hardbyte merged commit 0e93989 into develop Apr 18, 2021
@hardbyte hardbyte deleted the increase-block-id-size branch April 18, 2021 00:58