
Commit initial design of pull feature. #537

Merged
merged 5 commits into develop from design-data-pull-feature, Apr 19, 2020

Conversation


@hardbyte hardbyte commented Apr 6, 2020

Adds a design doc for a feature to "pull data" from an external object store.

If we are all happy with this design, I'll spend the rest of the week prototyping it and updating the design doc accordingly. I'm unsure whether we want to commit this kind of design document, or wait until the feature is implemented and just document the "actual" system.

Background

In early 2020 blocking was added to increase the capacity of the Anonlink system; however, the system cannot currently handle users uploading large encoded datasets in a reliable way. Investing in scaling Anonlink into a system that can deal with tens of millions of records vastly increases the chance of adoption and the general usefulness of the system.

When a client tries to upload tens of millions of records today, they must have a fast and reliable internet connection and complete the entire data upload in a single REST call. These uploads can easily be multiple GiB of data.

On the server side we have had to increase default timeouts to more than 60 seconds to handle the upload of single-digit millions of encoded records. The idea of the pull feature is to let the client point to a location from which the server can pull the data. Modifying the client and API to upload in chunks - making many smaller upload requests - was also considered. This alternative was rejected as we estimate it would take more effort to reach the same reliability; for example, the chunking, compression, retrying, and data integrity checks would all have to be custom handled.

With the Anonlink system we aim to provide everything required to take users from clean records through to an accurate concordance table. We rejected the idea of requiring a data provider to set up their own object store or otherwise host their encoded data online; instead we will modify the client to be able to upload to an object store.

@hardbyte hardbyte requested a review from wilko77 April 6, 2020 05:31

@AJChapman AJChapman left a comment


It's great that you're thinking this through from the start, and that you're looking for feedback. I particularly like the sequence diagram generated from a text source!

Another more 'agile' starting point for the document could be user stories, e.g. "As a user with a multi-gigabyte dataset on my machine and a slow, intermittent internet connection, I want my upload to happen as quickly and smoothly as possible.", "As a user with a dataset already available in an S3 bucket, I want Anonlink to fetch it directly so I don't have to download and then upload it.", etc. You could write a list of these, then extract requirements and proceed from there.

docs/designs/anonlink-upload-data-plan.md

== Upload Encodings ==

Alice -> "Object Store": Upload Binary Encoding Data


A general comment about this step: What is it about uploading to the third-party object store that makes it easier for Alice than uploading to Anonlink? Does it not suffer the same issues? If not, could Anonlink's API be changed to work the same way the object store works? If not, why not? It would at least be good to document the thought process that led to this solution.

Collaborator Author


Great point, I'll reference that in the Abstract/Purpose section of the design doc and add an Alternatives section.

The gist is we could modify the Anonlink API to deal with multipart uploads, add in the retrying mechanisms and integrity checks, and update our clients. There may be a lot of corner cases though, and if we rely on an already solid object store client and server we should get to a reliable solution faster.
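To make the tradeoff concrete, here is a hedged sketch of what the chosen approach leans on: an existing object store client such as boto3 already provides managed transfers with multipart chunking, concurrency and retries, which is exactly what we would otherwise have to build into the Anonlink API and clients ourselves. File and bucket names below are illustrative.

```python
# Illustrative only: relying on boto3's managed transfer instead of adding
# custom chunking/retry/integrity handling to the Anonlink API and clients.
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")  # credentials come from the usual AWS config chain

# Managed transfer: multipart chunking, parallel parts and retries for free.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MiB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=4,
)
s3.upload_file(
    Filename="encodings.bin",    # local binary encoding file (example name)
    Bucket="my-own-bucket",      # example bucket
    Key="encodings.bin",
    Config=config,
)
```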



## Requirements


This section reads more like a design than a set of requirements. Requirements should say less about how, and more about what, e.g.:

  • A user shall be able to upload data to Anonlink,
  • The system shall accept uploads of up to 1TiB of data,
  • In the event of an upload being interrupted, a user shall be able to resume the upload at a later stage without having to re-upload data already sent.

At deployment time the user will have the option of exposing the MinIO object store used by the Anonlink Entity Service,
or - as already supported - providing external S3/MinIO access credentials. The Anonlink Entity Service's API will be
modified such that a data provider can request temporary object store credentials. The client uses these temporary
credentials to upload their encodings to the object store. The client then informs the Anonlink server via the `clks` endpoint
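As an illustration of the flow in this excerpt, the client side could look roughly like the sketch below. The REST paths, response fields and placeholder values are assumptions made for the example; only the `clks` endpoint is named in the design.

```python
# A minimal sketch of the three-step client flow, with hypothetical endpoint
# names: request temporary credentials, upload to the object store, then tell
# the Anonlink server where the encodings are.
import requests
from minio import Minio

BASE = "https://anonlink.example.com/api/v1"      # hypothetical service URL
project_id = "example-project"                    # placeholder
auth = {"Authorization": "example-upload-token"}  # placeholder

# 1. Request temporary object store credentials from the service.
creds = requests.post(f"{BASE}/projects/{project_id}/authorize-external-upload",
                      headers=auth).json()

# 2. Upload the binary encodings directly to the object store.
client = Minio(creds["endpoint"],
               access_key=creds["access_key"],
               secret_key=creds["secret_key"],
               session_token=creds["session_token"],
               secure=True)
client.fput_object(creds["bucket"], creds["path"], "encodings.bin")

# 3. Inform the Anonlink server via the `clks` endpoint where the data lives.
requests.post(f"{BASE}/projects/{project_id}/clks",
              headers=auth,
              json={"bucket": creds["bucket"], "path": creds["path"]})
```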


Why are the credentials temporary? Would it be simpler to give each user permanent permission to upload to their own sandbox? Then the process becomes:

  1. The user uploads to their sandbox,
  2. The user hits the /upload endpoint (or /upload/from_sandbox) to tell the system that it is ready.

Collaborator Author


That could work too.

It could involve some trickery to create MinIO users dynamically; the docs only show creating users/groups via the mc command line tool and don't document the underlying API. I was thinking of creating a default upload user using mc at service startup, and then using the AssumeRole API to create credentials for each data uploader who requests them.

The AssumeRole API only creates temporary credentials (valid for between 15 minutes and 12 hours).
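For illustration, the service side of that idea might look roughly like the following, assuming a MinIO deployment with a pre-created upload user and STS enabled. The user names, endpoint and inline policy are made up for the sketch.

```python
# Rough sketch: mint temporary, scoped-down credentials for one data provider
# using the (MinIO-compatible) STS AssumeRole API. All identifiers are examples.
import json
import boto3

sts = boto3.client(
    "sts",
    endpoint_url="http://minio:9000",            # the deployment's MinIO endpoint
    aws_access_key_id="anonlink-upload-user",    # default upload user created with mc
    aws_secret_access_key="upload-user-secret",
)

# Restrict the temporary credentials to this data provider's own prefix.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:PutObject"],
        "Resource": ["arn:aws:s3:::uploads/dataprovider-42/*"],
    }],
}

resp = sts.assume_role(
    RoleArn="arn:xxx:xxx:xxx:xxxx",    # required by the API, not meaningful to MinIO
    RoleSessionName="dataprovider-42",
    Policy=json.dumps(policy),
    DurationSeconds=3600,              # within the 15 min - 12 h window
)
temporary_credentials = resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken
```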

From a user's point of view I don't see it making a big difference; the majority of users will be using our client tool, which can happily make the three calls in sequence. Also, this is only for the case where we are exposing our object store; if the user already has object store credentials, e.g. for S3, they can upload at their leisure without having to think about temporary credentials. It would be nice for our client tool to deal with this nicely/transparently.

The default case using temporary credentials for our object store:

$ anonlink upload mydata.csv <AUTH params etc>

Where the user wants our client to upload to their own bucket, optionally providing the AWS credential profile to use:

$ anonlink upload mydata.csv [--profile easd] --upload-to=s3://my-own-bucket/mydata.csv <AUTH params etc>

If the user wants to use already uploaded data, they will have to provide the Anonlink Entity Service with appropriate credentials. I definitely don't want to assume we can share a user's AWS credentials with the service, and I'm not sure whether we should be asking the user's AWS account for (temporary) credentials, or just require everything to be explicitly provided.

This means something like:

$ anonlink upload s3://my-own-bucket/mydata.csv

is probably not going to work unless the user has provided us explicit credentials to share with the service, or explicitly told us that our client should create temporary, restricted credentials to share with the service. Telling us could be done via additional command line arguments, an ~/.anonlink config file, or environment variables.

cc @wilko77 do you have opinions here?

...some time later...

Anonlink --> "Object Store" : Pull/copy encoding data


When are the temporary credentials revoked?

Collaborator Author


You probably gathered this from my earlier answer, but they time out.

Comment on lines 312 to 333
```sh
$ anonlink upload mydata.csv <AUTH params etc>
```

Where the user wants our client to upload to their own bucket, optionally providing the AWS credential profile to use:

```sh
$ anonlink upload mydata.csv [--profile easd] --upload-to=s3://my-own-bucket/mydata.csv <AUTH params etc>
```

If the user wants to use already uploaded data, they will have to either explicitly provide the anonlink entity service
with credentials that are appropriate to share with the service, or explicitly request temporary credentials.

```sh
$ anonlink upload s3://my-own-bucket/mydata.csv [--profile easd] --request-read-only-credentials \
<AUTH params etc>
```

or
```sh
$ anonlink upload s3://my-own-bucket/mydata.csv [--profile easd] --share-aws-credentials-with-server \
<AUTH params etc>
```
Collaborator


These are a lot of different ways to upload data, all under the same command. That might lead to a long list of conditionals in the code, and the docs for the command will be rather long to cover all those cases.
How about we split the cases where we link to already uploaded data and put them into a different command?

Collaborator Author


That's a good idea; I've added `anonlink serverpull` for that second use case. I'm thinking it might be a stretch to implement that in the first revision though.
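For what that split could look like in the client, here is a hypothetical sketch using click; the option handling is elided and only the `upload`/`serverpull` command names come from this discussion.

```python
# Hypothetical CLI layout: `anonlink upload` for pushing local data,
# `anonlink serverpull` for pointing the service at already-uploaded data.
import click

@click.group()
def cli():
    """Anonlink client (sketch)."""

@cli.command()
@click.argument("datafile", type=click.Path(exists=True))
def upload(datafile):
    """Upload a local encoding file, either via temporary credentials for
    the service's object store or to the user's own bucket."""
    click.echo(f"would upload {datafile}")

@cli.command()
@click.argument("object_url")
def serverpull(object_url):
    """Ask the service to pull data already sitting in an object store,
    supplying credentials it is allowed to use."""
    click.echo(f"would ask the server to pull {object_url}")

if __name__ == "__main__":
    cli()
```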

Collaborator


That's fine. I think we should aim to implement the most common use case, and it's probably fine to leave the others in the planning stage.

@hardbyte hardbyte requested a review from wilko77 April 17, 2020 05:28
@hardbyte hardbyte merged commit f17f336 into develop Apr 19, 2020
@hardbyte hardbyte deleted the design-data-pull-feature branch April 19, 2020 03:03