
Add support for aws_chunked with s3v4. #996

Closed
timuralp wants to merge 1 commit into develop from feature/aws-chunked

Conversation

@timuralp (Contributor) commented Aug 3, 2016

When uploading to S3 from a stream, it would be useful to opt into S3
aws_chunked uploads with v4 signatures. The mechanism is documented
here:
http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-streaming.html

The feature is implemented through a new auth class --
S3SigV4ChunkedAuth. The class wraps the body of the request with a
ChunkedUploadWrapper instance, which takes care of producing fixed-size
upload chunks and computing the required signature for each one.
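
For context, here is a minimal sketch of the per-chunk signing and framing that the linked SigV4 streaming documentation describes; the function names and structure are illustrative, not the PR's actual ChunkedUploadWrapper implementation:

```python
import hashlib
import hmac


def sign_chunk(chunk, prev_signature, signing_key, timestamp, scope):
    # Per-chunk string to sign, as described in the SigV4 streaming docs.
    # The first chunk uses the request's "seed" signature as prev_signature;
    # signing_key is the normal derived SigV4 signing key (bytes).
    string_to_sign = '\n'.join([
        'AWS4-HMAC-SHA256-PAYLOAD',
        timestamp,                  # e.g. '20160803T000000Z'
        scope,                      # e.g. '20160803/us-east-1/s3/aws4_request'
        prev_signature,
        hashlib.sha256(b'').hexdigest(),
        hashlib.sha256(chunk).hexdigest(),
    ])
    return hmac.new(signing_key, string_to_sign.encode('utf-8'),
                    hashlib.sha256).hexdigest()


def frame_chunk(chunk, signature):
    # Wire format per chunk: <size-in-hex>;chunk-signature=<sig>\r\n<data>\r\n
    # A final zero-length chunk terminates the aws-chunked body.
    header = '%x;chunk-signature=%s\r\n' % (len(chunk), signature)
    return header.encode('utf-8') + chunk + b'\r\n'
```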

To opt into this scheme, the caller is required to set the aws_chunked
option in the client configuration; it is only used with the
PutObject operation.
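
A rough sketch of what opting in might look like from the caller's side; the exact configuration spelling below is hypothetical, and only the aws_chunked option itself is what this PR proposes:

```python
import botocore.session
from botocore.config import Config

session = botocore.session.get_session()

# Hypothetical spelling of the opt-in: this PR adds an 'aws_chunked' client
# configuration option, but the exact config key layout may differ.
client = session.create_client(
    's3',
    config=Config(signature_version='s3v4', s3={'aws_chunked': True}),
)

# Only PutObject would use aws-chunked uploads; the body is streamed in
# fixed-size, individually signed chunks rather than buffered for hashing.
with open('large-file.bin', 'rb') as body:
    client.put_object(Bucket='my-bucket', Key='large-file.bin', Body=body)
```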

Fixes #995

@timuralp (Contributor, Author) commented Aug 3, 2016

I wasn't sure if this is the right approach to the problem, and the PR doesn't have any unit tests at the moment. If there is agreement that this makes sense, or guidance on how to rework it, I'm happy to keep working on the PR and add the tests.


@codecov-io commented Aug 3, 2016

Codecov Report

Merging #996 into develop will decrease coverage by 1.35%.
The diff coverage is 89.14%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #996      +/-   ##
===========================================
- Coverage    98.03%   96.68%   -1.36%     
===========================================
  Files           45       44       -1     
  Lines         7345     7264      -81     
===========================================
- Hits          7201     7023     -178     
- Misses         144      241      +97
| Impacted Files | Coverage Δ |
|---|---|
| botocore/signers.py | 97.25% <80%> (-1.19%) ⬇️ |
| botocore/auth.py | 96.29% <89.51%> (-2.01%) ⬇️ |
| botocore/configloader.py | 75.94% <0%> (-24.06%) ⬇️ |
| botocore/compat.py | 68.14% <0%> (-23.71%) ⬇️ |
| botocore/translate.py | 92.68% <0%> (-7.32%) ⬇️ |
| botocore/handlers.py | 94.13% <0%> (-2.19%) ⬇️ |
| botocore/docs/sharedexample.py | 97.16% <0%> (-1.42%) ⬇️ |
| botocore/docs/bcdoc/style.py | 94.05% <0%> (-1.17%) ⬇️ |
| botocore/paginate.py | 96.59% <0%> (-0.61%) ⬇️ |
| ... and 26 more | |

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 81f8c4c...d27e1e6.


@jamesls (Member) commented Aug 3, 2016

Linking issues: the boto3 tracking issue is boto/boto3#751.

Pending discussion over on the issue.


@timuralp timuralp force-pushed the feature/aws-chunked branch from 8cf59e6 to b1e1297 Aug 11, 2016
@timuralp timuralp force-pushed the feature/aws-chunked branch from b1e1297 to d27e1e6 Jul 22, 2017
@kyleknap kyleknap removed the large label Feb 27, 2020
michalc added a commit to uktrade/public-data-api that referenced this issue Aug 3, 2020
Exposing this via GET, rather than via a POST body as S3 does, is
deliberate.

- We expect that queries won't ever be too long: it is beyond the scope
  of the project to expose anything other than a basic filtering API,
  and maybe some very basic aggregate functions at most. Being limited
  by the length of URLs is, I think, fine.

- We could have URL-based caching, potentially using CloudFront down the
  line if we wanted to.

- Sharing/debugging should be fairly straightforward. You can paste a
  URL in a browser and see the filtered data. Literally any HTTP client
  should work: no need for any special library [other than a JSON
  decoder].

Not using boto3 is deliberate. This does mean we have low-level code,
but...

- To actually output JSON, we do have to do a little bit of faffing with
  bytes anyway, since S3 Select does not output valid JSON: it outputs
  JSON objects concatenated together with a delimiter.

- S3 Select [at least, MinIO's implementation used for testing] appears
  to output Unicode escape sequences for characters that don't need
  them, e.g. '\u0026' instead of just '&'. Not using boto3 means we can
  address issues like this in as performant a way as possible [even if
  we don't do much optimisation now, we are free to later]. One of the
  potential criticisms of depending on S3 was vendor lock-in, so we
  should at least support the main implementations, at least for now.

- boto3 does not always support all of AWS, specifically with S3. For
  example, boto/botocore#996 has been open for 4 years (to the day!) at
  this point, so we should not be thwarted by, or have to work around,
  any limitation of boto3: I suspect its architecture is not optimised
  for low-level/streaming access, which is exactly the sort of thing
  likely to be useful in this project.

- I'm in favour of keeping the option to use asyncio open, at least for
  now in this early stage of the project, and boto3 doesn't appear to
  support it out of the box. In its current form, moving to asyncio
  shouldn't be a massive project, since the dependencies are a web
  server and a web client [both of which would likely be aiohttp].
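
As an aside on the first boto3 bullet in the commit message above, here is a minimal sketch of the byte-level joining it alludes to, assuming S3 Select records separated by a newline record delimiter; the delimiter and the helper name are illustrative assumptions, not code from the referenced commit:

```python
def records_to_json_array(raw_records, delimiter=b'\n'):
    # S3 Select emits JSON objects joined by a record delimiter rather than
    # one valid JSON document; join the records into a JSON array by hand.
    # The newline delimiter here is an assumption (it is configurable).
    records = [record for record in raw_records.split(delimiter) if record]
    return b'[' + b','.join(records) + b']'

# For example:
# records_to_json_array(b'{"a": 1}\n{"a": 2}\n') == b'[{"a": 1},{"a": 2}]'
```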
@github-actions (bot) commented Feb 27, 2021

Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. Because it has been longer than one year since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment to prevent automatic closure, or if the issue is already closed, please feel free to reopen it.


@timuralp (Contributor, Author) commented Mar 3, 2021

I don't know if there is anything I could've done to draw more attention to this or generate discussion. Given the lack of interest in discussing this feature, it probably makes sense to close it.

