Feature extend binary format #505

Merged: 10 commits merged from feature_extend_binary_format into develop on Feb 19, 2020
Conversation

@wilko77 (Collaborator) commented Feb 17, 2020

This PR changes the binary format for storing CLKs in the object store.

Previously, we implicitly inferred the entity ID from the index of the CLK in the list. That was fine while the CLKs were delivered in order. With blocking, however, we no longer have that order, and thus we have to keep track of which CLK belongs to which entity ID in order to produce an interpretable result.

I chose to encode the entity ID as an unsigned int. I know there are more people on Earth than that, but hey, storage costs money...
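For concreteness, a minimal sketch of what the new per-record layout amounts to (the format string matches the change further down; the 128-byte encoding size is just an example, and `struct.Struct` stands in for the repo's `binary_format` helper):

    import struct

    # New per-record layout: network byte order, a 4-byte unsigned int
    # entity ID, then the raw bytes of one CLK (encoding_size bytes).
    encoding_size = 128  # example value
    bit_packing_struct = struct.Struct(f"!I{encoding_size}s")

    record = bit_packing_struct.pack(42, b"\x00" * encoding_size)
    assert len(record) == 4 + encoding_size  # the ID costs 4 bytes per CLK

    entity_id, clk_bytes = bit_packing_struct.unpack(record)
    assert entity_id == 42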

@hardbyte (Collaborator) left a comment:

Added my 2c

Comment on lines 277 to 278:

        else:
            return result[columns[0]]
hardbyte (Collaborator):

Not 100% enjoying returning a different type for convenience when there is only one column. I get it, and I've done it myself; I just don't like it.
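A toy illustration of the pattern being discussed (names hypothetical, not the actual code):

    def fetch(columns):
        # Hypothetical stand-in for a queried row.
        result = {"name": "Alice", "age": 42}
        if len(columns) > 1:
            return {c: result[c] for c in columns}
        else:
            return result[columns[0]]  # a bare value, not a dict

    assert fetch(["name", "age"]) == {"name": "Alice", "age": 42}
    assert fetch(["name"]) == "Alice"  # callers must special-case one column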

backend/entityservice/serialization.py (resolved thread)

        :return:
            An iterable of bytes.
        """
        bit_packing_struct = binary_format(encoding_size)

        for hash_bytes in filters:
    -       yield bit_packing_struct.pack(hash_bytes)
    +       yield bit_packing_struct.pack(*hash_bytes)
hardbyte (Collaborator):
⭐️
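The star matters because each element of `filters` is presumably now an (entity ID, bytes) pair that must be splatted into the two fields of the format; a toy check:

    import struct

    bit_packing_struct = struct.Struct("!I4s")  # toy 4-byte encoding size

    hash_bytes = (7, b"\xde\xad\xbe\xef")  # (entity ID, raw CLK bytes)
    packed = bit_packing_struct.pack(*hash_bytes)  # i.e. pack(7, b"...")
    assert bit_packing_struct.unpack(packed) == hash_bytes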

- "<encoding size>s" Store the n (e.g. 128) raw bytes of the bitarray

https://docs.python.org/3/library/struct.html
"""
bit_packing_fmt = f"!{encoding_size}s"
bit_packing_fmt = f"!I{encoding_size}s"
hardbyte (Collaborator):

We are just going to pretend the old format doesn't exist, aren't we? Should we take the opportunity to add a version byte to the format?

wilko77 (Author):

It's an internal format, so yes, I pretend that it never existed. I'm old, I forget quickly.
A version byte for each filter would mean a lot of version bytes in the object store. That would better be solved via a header. But again, since this is only used internally in the service, we will never have to differentiate between versions anyway, and thus the versioning is redundant.

hardbyte (Collaborator):

I meant a header for the whole file rather than one per encoding. But sure, not worth it right now.
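If such a file header were ever introduced, it might look roughly like this (purely hypothetical; not part of this PR):

    import struct

    # Hypothetical once-per-file header: magic marker, format version byte,
    # and the encoding size, instead of a version byte on every CLK.
    FILE_HEADER = struct.Struct("!4sBI")

    header = FILE_HEADER.pack(b"CLKS", 1, 128)
    magic, version, encoding_size = FILE_HEADER.unpack(header)
    assert (magic, version, encoding_size) == (b"CLKS", 1, 128)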


        :param data_iterable: an iterable of binary packed filters.
        :param max_bytes: if present, only read up to 'max_bytes' bytes.
        :param encoding_size: the encoding size of one filter, excluding the entity ID info
hardbyte (Collaborator):

Since we bounce between bytes and bits, I'd try to be explicit about the size.

wilko77 (Author):

good point
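Something along these lines would pin down the unit (wording and function name assumed, not the actual fix that landed):

    def binary_unpack_filters(data_iterable, max_bytes=None, encoding_size=None):
        """Deserialize binary packed filters.

        :param data_iterable: an iterable of binary packed filters.
        :param max_bytes: if present, only read up to 'max_bytes' bytes.
        :param encoding_size: the size of one filter in bytes, excluding
            the 4 bytes of entity ID information.
        """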

backend/entityservice/tasks/comparing.py (resolved thread)
Comment on lines +127 to +133:

        #TODO: use the entity ids!
        entity_ids_dp1, chunk_dp1 = zip(*chunk_dp1)
        t1 = time.time()
        log.debug("Fetching and deserializing chunk of filters for dataprovider 2")
        chunk_dp2, chunk_dp2_size = get_chunk_from_object_store(chunk_info_dp2, encoding_size)
        # TODO: use the entity ids!
        entity_ids_dp2, chunk_dp2 = zip(*chunk_dp2)
hardbyte (Collaborator):

(what you said)

wilko77 (Author):

Yes, but that is another task; I want this PR to be single purpose.
Integrating the use of the entity IDs will most likely also require changes to the helper functions in the anonlink library. I didn't want to open that can of worms just yet.
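For reference, the `zip(*...)` lines above split the (entity ID, CLK) pairs back into two parallel sequences:

    chunk_dp1 = [(0, b"aa"), (3, b"bb"), (7, b"cc")]  # (entity_id, clk) pairs

    entity_ids_dp1, clks = zip(*chunk_dp1)
    assert entity_ids_dp1 == (0, 3, 7)
    assert clks == (b"aa", b"bb", b"cc")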

    @@ -43,12 +43,12 @@ def handle_raw_upload(project_id, dp_id, receipt_token, parent_span=None):
        def filter_generator():
            log.debug("Deserializing json filters")
hardbyte (Collaborator):

This log message looks like a copy/paste error of mine, no doubt. Could you update it?

    @@ -151,20 +151,26 @@ def project_binaryclks_post(project_id):
        # connexion has already read the data before our handler is called!
        # https://github.com/zalando/connexion/issues/592
        # stream = get_stream()
    -   stream = BytesIO(request.data)
        expected_bytes = binary_format(size).size * count
    +   stream = iterable_to_stream(BytesIO(request.data))
hardbyte (Collaborator):

I'm confused. Isn't BytesIO already a stream? https://docs.python.org/3/library/io.html#binary-i-o

wilko77 (Author):

Yes, it is. I got confused by the read function: it says that it reads up to 'size' bytes. So I thought it might be safer to wrap it in a BufferedReader, but I guess that's unnecessary.
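A quick check of that: for an in-memory `BytesIO`, `read(n)` returns fewer than `n` bytes only at the end of the buffer, so the extra `BufferedReader` is indeed unnecessary:

    from io import BytesIO

    buf = BytesIO(b"abcdef")
    assert buf.read(4) == b"abcd"  # exactly 4 bytes while data remains
    assert buf.read(4) == b"ef"    # a short read happens only at the end
    assert buf.read(4) == b""      # then EOF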

Comment on lines 156 to 159:

        def entity_id_injector(filter_stream):
            for entity_id in range(count):
                yield binary_formatter.pack(entity_id, filter_stream.read(size))

        data_with_ids = b''.join(entity_id_injector(stream))
@hardbyte (Collaborator) commented Feb 17, 2020:

I want to check my understanding here. This takes the binary stream and converts it into a byte-producing generator.
This generator uses the user-supplied count and reads size bytes from the stream (where size is also user-supplied in the header), packing the bytes using the binary format and adding in an entity ID as it goes.

  • How do we handle users supplying their own entity IDs? I assume that is future work? Will they be optional/required?
  • If this code path is here for the long haul, can we do this without making a full copy of the data in memory? (a sketch of one option follows this thread)
  • Consider how this can fail with bad data from the user.

wilko77 (Author):

I just wrote a quick fix to make this endpoint work with the new format.

Until connexion fixes the stream thing, we cannot do much here.
There is also the problem that the binary format, as of now, does not consider blocking information.

I imagine that in the far future, one day, someone will revisit the binary upload. They will define a new binary format, implement it in the anonlink-client, and finally make this endpoint stunningly beautiful.

For now, though, this is not a priority.
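On the full-copy question above: one direction, if the endpoint ever needs it, would be to consume the generator in bounded parts instead of `b''.join(...)`; a sketch only, with the part-grouping helper being hypothetical:

    def entity_id_injector(filter_stream, count, size, binary_formatter):
        # One re-packed record at a time; nothing here forces a full copy.
        for entity_id in range(count):
            yield binary_formatter.pack(entity_id, filter_stream.read(size))

    def iter_parts(records, part_bytes=4 * 1024 * 1024):
        # Hypothetical: batch records into ~4 MiB parts (e.g. for a multipart
        # upload), so peak memory is one part rather than the whole payload.
        part, n = [], 0
        for record in records:
            part.append(record)
            n += len(record)
            if n >= part_bytes:
                yield b"".join(part)
                part, n = [], 0
        if part:
            yield b"".join(part)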

@hardbyte (Collaborator) left a comment:

Thanks for making those changes. Good to merge.

@wilko77 merged commit 88c3705 into develop on Feb 19, 2020
@wilko77 deleted the feature_extend_binary_format branch on February 19, 2020 03:26
@hardbyte mentioned this pull request on Feb 19, 2020