
Commit initial design of pull feature. #537

Merged
merged 5 commits into develop from design-data-pull-feature, Apr 19, 2020

Conversation


@hardbyte hardbyte commented Apr 6, 2020

Adds a design doc for a feature to "pull data" from an external object store.

If we are all happy with this design, I'll spend the rest of the week prototyping it and updating the design doc accordingly. I'm unsure whether we want to commit this kind of design document, or wait until the feature is implemented and just document the "actual" system.

Background

In early 2020 blocking was added to increase the capacity of the Anonlink system; however, the system cannot currently handle users uploading large encoded datasets in a reliable way. Investing in scaling Anonlink into a system that can deal with tens of millions of records vastly increases the chance of adoption and the general usefulness of the system.

When a client tries to upload tens of millions of records today, they must have a fast and reliable internet connection and complete the entire data upload in a single REST call. These uploads can easily be multiple GiB of data.

On the server side we have had to increase default timeouts to more than 60 seconds to handle the upload of single-digit millions of encoded records. The idea of the pull feature is to let the client point to a location from which the server can pull the data. Modifying the client and API to upload in chunks - making many smaller upload requests - was also considered. This alternative was rejected as we estimate it would take more effort to reach the same reliability; for example, the chunking, compression, retrying, and data integrity checks would all have to be custom handled.

With the Anonlink system we aim to provide everything required to take users from clean records through to an accurate concordance table. We rejected the idea of requiring a data provider to set up their own object store or otherwise host their encoded data online; instead we will modify the client to be able to upload to an object store.

@hardbyte hardbyte requested a review from wilko77 April 6, 2020 05:31

@AJChapman AJChapman left a comment


It's great that you're thinking this through from the start, and that you're looking for feedback. I particularly like the sequence diagram generated from a text source!

Another more 'agile' starting point for the document could be user stories, e.g. "As a user with a multi-gigabyte dataset on my machine and a slow, intermittent internet connection, I want my upload to happen as quickly and smoothly as possible.", "As a user with a dataset already available in an S3 bucket, I want Anonlink to fetch it directly so I don't have to download and then upload it.", etc. You could write a list of these, then extract requirements and proceed from there.

docs/designs/anonlink-upload-data-plan.md

== Upload Encodings ==

Alice -> "Object Store": Upload Binary Encoding Data


A general comment about this step: What is it about uploading to the third-party object store that makes it easier for Alice than uploading to Anonlink? Does it not suffer the same issues? If not, could Anonlink's API be changed to work the same way the object store works? If not, why not? It would at least be good to document the thought process that led to this solution.

Collaborator Author


Great point, I'll reference that in the Abstract/Purpose section of the design doc and add an Alternatives section.

The gist is we could modify the Anonlink API to deal with multipart uploads, add in the retrying mechanisms and integrity checks, and update our clients. There may be a lot of corner cases though, and if we rely on an already solid object store client and server we should get to a reliable solution faster.
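To make the tradeoff concrete, here is a hedged sketch of what the chosen approach leans on: an existing object store client such as boto3 already provides managed transfers with multipart chunking, concurrency and retries, which is exactly what we would otherwise have to build into the Anonlink API and clients ourselves. File and bucket names below are illustrative.

```python
# Illustrative only: relying on boto3's managed transfer instead of adding
# custom chunking/retry/integrity handling to the Anonlink API and clients.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")  # credentials come from the usual AWS config chain

# Managed transfer: multipart chunking, parallel parts and retries for free.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MiB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=4,
)
s3.upload_file(
    Filename="encodings.bin",    # local binary encoding file (example name)
    Bucket="my-own-bucket",      # example bucket
    Key="encodings.bin",
    Config=config,
)
```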



## Requirements


This section reads more like a design than a set of requirements. Requirements should say less about how, and more about what, e.g.:

  • A user shall be able to upload data to Anonlink,
  • The system shall accept uploads of up to 1TiB of data,
  • In the event of an upload being interrupted, a user shall be able to resume the upload at a later stage without having to re-upload data already sent.

At deployment time the user will have the option of exposing the MinIO object store used by the Anonlink Entity Service,
or - as already supported - providing external S3/MinIO access credentials. The Anonlink Entity Service's API will be
modified such that a data provider can request temporary object store credentials. The client uses these temporary
credentials to upload their encodings to the object store. The client then informs the Anonlink server via the `clks` endpoint
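As an illustration of the flow in this excerpt, the client side could look roughly like the sketch below. The REST paths, response fields and placeholder values are assumptions made for the example; only the `clks` endpoint is named in the design.

```python
# A minimal sketch of the three-step client flow, with hypothetical endpoint
# names: request temporary credentials, upload to the object store, then tell
# the Anonlink server where the encodings are.
import requests
from minio import Minio

BASE = "https://anonlink.example.com/api/v1"      # hypothetical service URL
project_id = "example-project"                    # placeholder
auth = {"Authorization": "example-upload-token"}  # placeholder

# 1. Request temporary object store credentials from the service.
creds = requests.post(f"{BASE}/projects/{project_id}/authorize-external-upload",
                      headers=auth).json()

# 2. Upload the binary encodings directly to the object store.
client = Minio(creds["endpoint"],
               access_key=creds["access_key"],
               secret_key=creds["secret_key"],
               session_token=creds["session_token"],
               secure=True)
client.fput_object(creds["bucket"], creds["path"], "encodings.bin")

# 3. Inform the Anonlink server via the `clks` endpoint where the data lives.
requests.post(f"{BASE}/projects/{project_id}/clks",
              headers=auth,
              json={"bucket": creds["bucket"], "path": creds["path"]})
```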


Why are the credentials temporary? Would it be simpler to give each user permanent permission to upload to their own sandbox? Then the process becomes:

  1. The user uploads to their sandbox,
  2. The user hits the /upload endpoint (or /upload/from_sandbox) to tell the system that it is ready.

Collaborator Author


That could work too.

It could involve some trickery to create MinIO users dynamically; the docs only show creating users/groups via the mc command line tool and don't document the underlying API. I was thinking of creating a default upload user using mc at service startup, and then using the AssumeRole API to create credentials for each data uploader who requests them.

The AssumeRole API only creates temporary credentials (valid for between 15 minutes and 12 hours).
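For illustration, the service side of that idea might look roughly like the following, assuming a MinIO deployment with a pre-created upload user and STS enabled. The user names, endpoint and inline policy are made up for the sketch.

```python
# Rough sketch: mint temporary, scoped-down credentials for one data provider
# using the (MinIO-compatible) STS AssumeRole API. All identifiers are examples.
import json
import boto3

sts = boto3.client(
    "sts",
    endpoint_url="http://minio:9000",            # the deployment's MinIO endpoint
    aws_access_key_id="anonlink-upload-user",    # default upload user created with mc
    aws_secret_access_key="upload-user-secret",
)

# Restrict the temporary credentials to this data provider's own prefix.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": ["arn:aws:s3:::uploads/dataprovider-42/*"],
    }],
}

resp = sts.assume_role(
    RoleArn="arn:xxx:xxx:xxx:xxxx",    # required by the API, not meaningful to MinIO
    RoleSessionName="dataprovider-42",
    Policy=json.dumps(policy),
    DurationSeconds=3600,              # within the 15 min - 12 h window
)
temporary_credentials = resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken
```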

From a user's point of view I don't see it making a big difference; the majority of users will be using our client tool, which can happily make the three calls in sequence. Also, this is only for the case where we are exposing our object store; if the user already has object store credentials, e.g. for S3, they can upload at their leisure without having to think about temporary credentials. It would be nice for our client tool to deal with this nicely/transparently.

The default case using temporary credentials for our object store:

$ anonlink upload mydata.csv <AUTH params etc>

Where the user wants our client to upload to their own bucket, optionally providing the AWS credential profile to use:

$ anonlink upload mydata.csv [--profile easd] --upload-to=s3://my-own-bucket/mydata.csv <AUTH params etc>

If the user wants to use already uploaded data, they will have to provide the Anonlink Entity Service with appropriate credentials. I definitely don't want to assume we can share a user's AWS credentials with the service, and I'm not sure whether we should be asking the user's AWS account for (temporary) credentials, or just require everything to be explicitly provided.

This means something like:

$ anonlink upload s3://my-own-bucket/mydata.csv

is probably not going to work unless the user has provided us explicit credentials to share with the service, or explicitly told us that our client should create temporary, restricted credentials to share with the service. Telling us could be done via additional command line arguments, an ~/.anonlink config file, or environment variables.

cc @wilko77 do you have opinions here?

...some time later...

Anonlink --> "Object Store" : Pull/copy encoding data


When are the temporary credentials revoked?

Collaborator Author


You probably gathered this from my earlier answer, but they time out.

Comment on lines 312 to 333
```sh
$ anonlink upload mydata.csv <AUTH params etc>
```

Where the user wants our client to upload to their own bucket, optionally providing the AWS credential profile to use:

```sh
$ anonlink upload mydata.csv [--profile easd] --upload-to=s3://my-own-bucket/mydata.csv <AUTH params etc>
```

If the user wants to use already uploaded data, they will have to either explicitly provide the anonlink entity service
with credentials that are appropriate to share with the service, or explicitly request temporary credentials.

```sh
$ anonlink upload s3://my-own-bucket/mydata.csv [--profile easd] --request-read-only-credentials \
<AUTH params etc>
```

or
```sh
$ anonlink upload s3://my-own-bucket/mydata.csv [--profile easd] --share-aws-credentials-with-server \
<AUTH params etc>
```
Collaborator


These are a lot of different ways to upload data, all under the same command. That might lead to a long list of conditionals in the code, and the docs for the command will be rather long to cover all those cases.
How about we split the cases where we link to already uploaded data and put them into a different command?

Collaborator Author


That's a good idea; I've added `anonlink serverpull` for that second use case. I'm thinking it might be a stretch to implement that in the first revision though.
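For what that split could look like in the client, here is a hypothetical sketch using click; the option handling is elided and only the `upload`/`serverpull` command names come from this discussion.

```python
# Hypothetical CLI layout: `anonlink upload` for pushing local data,
# `anonlink serverpull` for pointing the service at already-uploaded data.
import click

@click.group()
def cli():
    """Anonlink client (sketch)."""

@cli.command()
@click.argument("datafile", type=click.Path(exists=True))
def upload(datafile):
    """Upload a local encoding file, either via temporary credentials for
    the service's object store or to the user's own bucket."""
    click.echo(f"would upload {datafile}")

@cli.command()
@click.argument("object_url")
def serverpull(object_url):
    """Ask the service to pull data already sitting in an object store,
    supplying credentials it is allowed to use."""
    click.echo(f"would ask the server to pull {object_url}")

if __name__ == "__main__":
    cli()
```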

Collaborator


That's fine. I think we should aim to implement the most common use case, and it's probably fine to leave the others in the planning stage.

@hardbyte hardbyte requested a review from wilko77 April 17, 2020 05:28
@hardbyte hardbyte merged commit f17f336 into develop Apr 19, 2020
@hardbyte hardbyte deleted the design-data-pull-feature branch April 19, 2020 03:03