Commit initial design of pull feature. #537
Conversation
It's great that you're thinking this through from the start, and that you're looking for feedback. I particularly like the sequence diagram generated from a text source!
Another more 'agile' starting point for the document could be user stories, e.g. "As a user with a multi-gigabyte dataset on my machine and a slow, intermittent internet connection, I want my upload to happen as quickly and smoothly as possible.", "As a user with a dataset already available in an S3 bucket, I want Anonlink to fetch it directly so I don't have to download and then upload it.", etc. You could write a list of these, then extract requirements and proceed from there.
== Upload Encodings ==

Alice -> "Object Store": Upload Binary Encoding Data
A general comment about this step: what is it about uploading to the third-party object store that makes it easier for Alice than uploading to Anonlink? Does it not suffer the same issues? If not, could Anonlink's API be changed to work the same way the object store works? If not, why not? It would at least be good to document the thought process that led to this solution.
Great point, I'll reference that in the Abstract/Purpose section of the design doc and add an Alternatives section.
The gist is that we could modify the Anonlink API to deal with multipart uploads, add retrying mechanisms and integrity checks, and update our clients. There may be a lot of corner cases though, and if we rely on an already-solid object store client and server we should get to a reliable solution faster.
## Requirements
This section reads more like a design than a set of requirements. Requirements should say less about how, and more about what, e.g.:
- A user shall be able to upload data to Anonlink,
- The system shall accept uploads of up to 1TiB of data,
- In the event of an upload being interrupted, a user shall be able to resume the upload at a later stage without having to re-upload data already sent.
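To make the resumability requirement concrete, here is a minimal sketch of the bookkeeping a resumable client would need. This is not the actual protocol: the chunk size, and the idea that the server reports which chunk indices it has already received, are assumptions for illustration.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024  # hypothetical 8 MiB part size

def plan_resume(total_size: int, received: set) -> list:
    """Return the chunk indices still to be uploaded, given the set of
    indices the server acknowledges having stored. On resume, only the
    missing chunks are (re)sent."""
    n_chunks = (total_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    return [i for i in range(n_chunks) if i not in received]

def chunk_digest(data: bytes) -> str:
    """Per-chunk integrity check, so a corrupted part can be detected
    and re-sent individually rather than restarting the whole upload."""
    return hashlib.sha256(data).hexdigest()
```

For example, a 20 MiB upload spans three 8 MiB chunks; if the server already holds chunks 0 and 2, only chunk 1 needs to be re-sent.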
At deployment time the user will have the option of exposing the MinIO object store used by the Anonlink Entity Service, or - as already supported - providing external S3/MinIO access credentials. The Anonlink Entity Service's API will be modified such that a data provider can request temporary object store credentials. The client uses these temporary credentials to upload their encodings to the object store. The client then informs the Anonlink server via the `clks` endpoint
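The three-step flow described above (request temporary credentials, upload to the object store, notify the `clks` endpoint) could be wired together in the client roughly as below. All function names here are illustrative, not the real API; each step is injected as a callable so the sketch stays independent of any concrete HTTP client.

```python
from typing import Callable

def upload_via_object_store(
    request_temp_credentials: Callable[[], dict],
    upload_to_store: Callable[[dict, str], str],
    notify_clks_endpoint: Callable[[str], None],
    local_path: str,
) -> str:
    """Illustrative three-call sequence a client tool could run."""
    creds = request_temp_credentials()                 # 1. API hands out temporary credentials
    object_name = upload_to_store(creds, local_path)   # 2. client uploads encodings to the store
    notify_clks_endpoint(object_name)                  # 3. tell Anonlink where the data landed
    return object_name
```

Because the client tool makes all three calls itself, the temporary-credential handshake stays invisible to the user.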
Why are the credentials temporary? Would it be simpler to give each user permanent permission to upload to their own sandbox? Then the process becomes:
- The user uploads to their sandbox,
- The user hits the /upload endpoint (or /upload/from_sandbox) to tell the system that it is ready.
That could work too.
It could involve some trickery to create MinIO users dynamically; the docs only show creating users/groups via the `mc` command-line tool and don't document the underlying API. I was thinking of creating a default upload user with `mc` at service startup, and then using the AssumeRole API to create credentials for each data uploader who requests them. The AssumeRole API only creates temporary credentials (valid between 15 minutes and 12 hours).
From a user's point of view I don't see it making a big difference: the majority of users will be using our client tool, which can happily make the three calls in sequence. Also, this only applies where we are exposing our object store; if the user already has object store credentials, e.g. for S3, they can upload at their leisure without having to think about temporary credentials. It would be nice for our client tool to handle this transparently.
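As a small illustration of that constraint: the credential lifetime corresponds to the `DurationSeconds` parameter of the AssumeRole call, and the client could clamp a requested lifetime into the quoted 15-minute-to-12-hour window before asking. This is a sketch of that clamping, not MinIO's actual validation logic.

```python
MIN_DURATION_S = 15 * 60       # 15 minutes: minimum AssumeRole credential lifetime
MAX_DURATION_S = 12 * 60 * 60  # 12 hours: maximum lifetime quoted above

def clamp_duration(requested_s: int) -> int:
    """Clamp a requested credential lifetime (in seconds) into the
    valid AssumeRole window before sending the request."""
    return max(MIN_DURATION_S, min(MAX_DURATION_S, requested_s))
```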
The default case using temporary credentials for our object store:
$ anonlink upload mydata.csv <AUTH params etc>
Where the user wants our client to upload to their own bucket, optionally providing the AWS credential profile to use:
$ anonlink upload mydata.csv [--profile easd] --upload-to=s3://my-own-bucket/mydata.csv <AUTH params etc>
If the user wants to use already uploaded data, they will have to provide the Anonlink Entity Service with appropriate credentials. I definitely don't want to assume we can share a user's AWS credentials with the service, and I'm not sure whether we should be asking the user's AWS account for (temporary) credentials, or just require it all to be explicitly provided.
This means something like:
$ anonlink upload s3://my-own-bucket/mydata.csv
is probably not going to work unless the user has provided us explicit credentials to share with the service, or explicitly told us that our client should create temporary, restricted credentials to share with the service. Telling us could be via additional command-line arguments, an `~/.anonlink` config file, or environment variables.
cc @wilko77 do you have opinions here?
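If we go the config route, those three sources would need a precedence order. A sketch, assuming the common convention of explicit CLI flag over environment variable over config file; the flag, variable, and key names are all hypothetical, not decided API.

```python
import os

def resolve_credentials(cli_value, config: dict, env=None):
    """Pick the credentials to share with the service, most explicit
    source first: CLI flag > environment variable > ~/.anonlink config.
    All names below are placeholders for whatever we actually choose."""
    if env is None:
        env = os.environ
    if cli_value is not None:
        return cli_value
    if "ANONLINK_SHARED_CREDENTIALS" in env:        # hypothetical variable name
        return env["ANONLINK_SHARED_CREDENTIALS"]
    return config.get("shared_credentials")          # hypothetical config key
```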
...some time later...

Anonlink --> "Object Store" : Pull/copy encoding data
When are the temporary credentials revoked?
You probably gathered this from my earlier answer, but they time out.
```sh
$ anonlink upload mydata.csv <AUTH params etc>
```

Where the user wants our client to upload to their own bucket, optionally providing the AWS credential profile to use:

```sh
$ anonlink upload mydata.csv [--profile easd] --upload-to=s3://my-own-bucket/mydata.csv <AUTH params etc>
```

If the user wants to use already uploaded data, they will have to either explicitly provide the Anonlink Entity Service with credentials that are appropriate to share with the service, or explicitly request temporary credentials.

```sh
$ anonlink upload s3://my-own-bucket/mydata.csv [--profile easd] --request-read-only-credentials \
    <AUTH params etc>
```

or

```sh
$ anonlink upload s3://my-own-bucket/mydata.csv [--profile easd] --share-aws-credentials-with-server \
    <AUTH params etc>
```
These are a lot of different ways to upload data, all under the same command. That might lead to a long list of conditionals in the code, and the docs for the command will be rather long to cover all those cases.
How about we split the cases where we link to already uploaded data and put them into a different command?
That's a good idea, I've added `anonlink serverpull` for that second use case. I'm thinking it might be a stretch to implement that in the first revision though.
That's fine. I think we should aim to implement the most common use case, and it's probably fine to leave the others in the planning stage.
Adds a design doc for a feature to "pull data" from an external object store.
If we are all happy with this design, I'll spend the rest of the week prototyping it and update the design doc accordingly. Unsure if we want to commit this kind of design document, or wait until the feature is implemented and just document the "actual" system.
Background
In early 2020, blocking was added to increase the capacity of the Anonlink system; however, the system cannot currently handle users uploading large encoded datasets in a reliable way. Investing in scaling Anonlink to a system that can deal with tens of millions of records vastly increases the chance of adoption and the general usefulness of the system.
When a client tries to upload tens of millions of records today, they must have a fast and reliable internet connection and complete the entire data upload in a single REST call. These uploads can easily be multiple GiB of data.
On the server side we have had to increase default timeouts to more than 60 seconds to handle the upload of single-digit millions of encoded records. The idea of the pull feature is to allow the client to point to a location from which the server can pull the data. Modifying the client and API to upload in chunks - making many smaller upload requests - was also considered. This alternative was rejected, as we estimate it would take more effort to reach the same reliability: chunking, compression, retrying, and data-integrity checks would all have to be handled by custom code.
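To make the rejected alternative concrete, here is a rough sketch of just one of the pieces we would have to hand-roll and maintain: a per-chunk retry loop with integrity verification. The names and behaviour are hypothetical; in particular, the assumption that the server echoes back a digest of each received chunk is invented for illustration.

```python
import hashlib

def send_chunk_with_retry(send, chunk: bytes, max_attempts: int = 3) -> int:
    """Resend one chunk until the digest the (hypothetical) server echoes
    back matches our own, or attempts run out. Returns attempts used.
    A real client would also sleep between attempts (backoff)."""
    expected = hashlib.sha256(chunk).hexdigest()
    for attempt in range(1, max_attempts + 1):
        try:
            if send(chunk) == expected:   # server confirms an intact copy
                return attempt
        except ConnectionError:
            continue                      # transient failure: try again
    raise RuntimeError("chunk upload failed after %d attempts" % max_attempts)
```

An object store client gives us this logic (and multipart assembly, compression, and auth) for free, which is the core of the argument above.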
With the Anonlink system we aim to provide everything required to take users from clean records through to an accurate concordance table. We rejected the idea of requiring a data provider to set up their own object store or otherwise host their encoded data online; instead, we will modify the client to be able to upload to an object store.