
Conversation

@HazAT (Member) commented Jan 29, 2018

This adds an endpoint for chunk uploads.
https://paper.dropbox.com/doc/Exposed-Blob-Upload-Service-A7Qd5HBtLvNsVI5SvHjUV

This is only the first part of a series of PRs; next up will be the assemble endpoint.
tl;dr: we accept chunks and create FileBlobs, which will later be assembled by a different endpoint.

These requests require a valid auth token with the project:releases permission.

GET -> /api/0/chunk-upload/
Returns metadata describing how the client (mostly sentry-cli) should structure the upload.

Response:

{
    "url": "https://sentry.io",
    "chunksPerRequest": 16,
    "hashAlgorithm": "sha1",
    "chunkSize": 1048576,
    "concurrency": 4
}
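The client-side contract implied by this metadata can be sketched roughly like this (a hypothetical helper, not sentry-cli's actual code; the chunk size and SHA-1 naming follow the response above):

```python
import hashlib

def chunk_payload(data, chunk_size=1048576):
    """Split a byte payload into chunkSize pieces, naming each chunk
    by the SHA-1 checksum of its contents (hashAlgorithm above)."""
    chunks = []
    for offset in range(0, len(data), chunk_size):
        piece = data[offset:offset + chunk_size]
        chunks.append((hashlib.sha1(piece).hexdigest(), piece))
    return chunks
```

Each (checksum, piece) pair then maps to one `-F file=@<checksum>` part of the multipart POST below.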

POST -> /api/0/chunk-upload/

curl -X POST \
  http://localhost:8000/api/0/chunk-upload/ \
  -H 'content-type: multipart/form-data; boundary=----WebKitFormBoundary7MA4YWxkTrZu0gW' \
  -F file=@/Users/haza/Desktop/upload-test/08aec68f1f2240b718982cc0924b5684e8149174 \
  -F file=@/Users/haza/Desktop/upload-test/9073a9c28759b5e5a6d0d8afb5e1c5ca13dbdc45

This POST accepts multiple files, up to MAX_CHUNKS_PER_REQUEST, and creates a FileBlob in the database for each one.
The filename must be the checksum of the file's contents, because we validate against it.
If there is a mismatch, the request fails (this should never happen ¯\_(ツ)_/¯).
The request also validates chunk size and count.
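The validation described above can be sketched as follows (a hypothetical stand-in, not the actual Sentry handler; the constants mirror the GET response):

```python
import hashlib

MAX_CHUNKS_PER_REQUEST = 16  # chunksPerRequest from the GET response
DEFAULT_BLOB_SIZE = 1048576  # chunkSize from the GET response

def validate_chunks(files):
    """files: list of (filename, content) tuples.
    Returns an error string, or None if everything checks out."""
    if len(files) > MAX_CHUNKS_PER_REQUEST:
        return 'Too many chunks'
    for name, content in files:
        if len(content) > DEFAULT_BLOB_SIZE:
            return 'Chunk too large'
        # the filename must equal the SHA-1 of the chunk's contents
        if hashlib.sha1(content).hexdigest() != name:
            return 'Checksum mismatch'
    return None
```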

The tests should cover all scenarios.

@HazAT HazAT self-assigned this Jan 29, 2018
ghost commented Jan 29, 2018

1 Warning
⚠️ You should update CHANGES due to the size of this PR

Generated by 🚫 danger

from sentry.api.bases.project import ProjectReleasePermission


UPLOAD_ENDPOINT = 'https://sentry.io'
Contributor:

Why is this a hardcoded value? This isn't going to work for on-premise.

Member Author:

Right, is there a way to get the current URL or something?
We were thinking about not serving the URL of the installation, but instead serving upload.sentry.io, which handles things a bit differently.

Contributor:

Yeah, the system.url-prefix option. I think we should introduce a new option here so it can be overridden in config, but have it default back to system.url-prefix.
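The suggested fallback could look roughly like this (get_upload_url is a hypothetical helper, and a plain dict stands in for Sentry's options store):

```python
def get_upload_url(options):
    """Prefer a dedicated upload-url option; fall back to the
    installation's system.url-prefix when it is unset."""
    return (options.get('system.upload-url-prefix')
            or options.get('system.url-prefix'))
```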

def get(self, request):
    return Response(
        {
            'url': UPLOAD_ENDPOINT,
Contributor:

What's the reason for returning this value in the first place? I understand the value of this endpoint being able to report the server's supported parameters, but this in theory just points back to itself; it's the exact same endpoint that you hit for the GET. In what situation would it be different?

Member Author:

As I said before, the idea is to maybe handle the upload through a different endpoint, e.g. upload.sentry.io, which has some kind of keep-alive or whatever.
I am not really sure yet whether that endpoint would basically just be another instance of Sentry with different webserver settings.

{
    'url': UPLOAD_ENDPOINT,
    'chunkSize': DEFAULT_BLOB_SIZE,
    'chunksPerRequest': MAX_CHUNKS_PER_REQUEST,
Contributor:

Is there a reason to even accept multiple chunks per request? If we limited it to just one per request, you'd be able to achieve more throughput through parallelization. Anything within chunksPerRequest is going to be uploaded serially.

Member Author:

The idea behind this was that sending more data per request saves the overhead of opening new HTTP connections.

Member:

If we get sentry-cli to keep the connection alive, it should probably not matter too much. But we will probably not implement parallel uploads in the CLI...

Contributor:

I'd argue that it'd be outweighed by the parallelism. For example, this is what gsutil does for its own chunked uploads to Google Storage. I can't think of anything else that does this hybrid approach where you can send chunks, but in batches. I don't have a strong opinion, but I don't think it's that valuable.

See stuff like: https://cloud.google.com/storage/docs/json_api/v1/how-tos/resumable-upload

I also like the idea of using the Content-Range header, but it might be less relevant for us here since the chunks are fixed sizes, etc.

But anyways, I'd really just vote for handling 1 chunk per upload to reduce some complexity.

Contributor:

@mattrobenolt 1 chunk per upload will be awful for Europeans. It means 1000 requests, each with a minimum latency of 60-150ms. We would mostly be waiting on the network here.
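A quick back-of-envelope calculation along these lines (illustrative numbers: 1000 chunks and a 100 ms round trip, transfer time ignored):

```python
import math

chunks, rtt_ms = 1000, 100

serial_ms = chunks * rtt_ms        # 1 chunk per request, serial: 100 s of latency
requests = math.ceil(chunks / 16)  # 63 requests at 16 chunks each
rounds = math.ceil(requests / 4)   # 16 rounds at concurrency 4
batched_ms = rounds * rtt_ms       # ~1.6 s of latency overhead
```

So batching plus modest concurrency cuts the pure round-trip overhead by roughly 60x under these assumptions.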


UPLOAD_ENDPOINT = 'https://sentry.io'
MAX_CHUNKS_PER_REQUEST = 16
MAX_CONCURRENCY = 4
Contributor:

Is this just a hint to the person that calls this? Because there's nothing that's actually enforcing this.

Member Author:

Yeah, it should be a hint. @mitsuhiko mentioned that if we introduce a different endpoint which handles connections differently, we can change it afterward.

return Response({'error': 'Checksum mismatch'},
                status=status.HTTP_400_BAD_REQUEST)

return Response(status=status.HTTP_200_OK)
Contributor:

nit: Shouldn't this be 201 Created?

Member Author:

We've discussed this and went with 200.
But I don't have a strong opinion; we can go with 201.

Member Author:

Ah, and also: if the endpoint already has the chunks, they will not get "created", because they are already there.
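That idempotency argument can be sketched like so (store_chunk is a hypothetical helper; a dict stands in for the FileBlob table):

```python
def store_chunk(store, checksum, content):
    """Store a chunk keyed by its checksum. Re-uploading an existing
    chunk is a no-op, so the handler behaves more like "ensure exists"
    than "create" - which is why 201 would not always be accurate."""
    created = checksum not in store
    store.setdefault(checksum, content)
    return created  # True only if the blob was actually new
```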

Member:

In case of a single chunk per request (see previous comment), we could even go with PUT and 200 OK.

Contributor:

That's fine.

Member Author:

So I will leave it like this, since a PUT would mean a new function, and 201 will not always be correct?! 🤷‍♂️

from sentry.api.base import Endpoint
from sentry.api.bases.project import ProjectReleasePermission

UPLOAD_ENDPOINT_SETTING = options.get('system.upload-url-prefix')
Contributor:

Don't do this at module load time. Query these inside of ChunkUploadEndpoint.get.
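A minimal sketch of the per-request lookup being suggested (ChunkUploadEndpoint here is a stripped-down stand-in, with a dict faking the options store):

```python
class ChunkUploadEndpoint:
    def __init__(self, options):
        self.options = options  # stand-in for sentry.options

    def get_url(self):
        # Looked up on every call rather than cached at module load,
        # so config changes take effect without a process restart.
        return (self.options.get('system.upload-url-prefix')
                or self.options.get('system.url-prefix'))
```

With the module-level version, changing the option after import would have no effect; here, it is picked up on the next request.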

@mattrobenolt (Contributor) left a comment:

🙆‍♂️

@HazAT HazAT merged commit be01269 into master Feb 1, 2018
@HazAT HazAT deleted the feature/chunk-upload branch February 1, 2018 07:41
@github-actions github-actions bot locked and limited conversation to collaborators Dec 22, 2020