Utilize io.Reader not io.ReadSeeker where possible #915
Thanks for contacting us @DavidJFelix, I'd like to learn more about your use case. In the web service's code, does it cache the contents it will be uploading to S3 as a file or in memory? In addition, are you referring to physical memory or virtual memory? Since the SDK uses the AWS signature version 4, a digest hash of the body needs to be computed, and in order to do this the SDK must read the contents of the body.

One solution for streaming content is to use the `s3manager` package's Uploader:

```go
uploader := s3manager.NewUploader(sess)
err := uploader.Upload(params /* *s3manager.UploadInput */)
```

An additional solution could be to use the SDK's presigner with the S3 PutObject operation, though all retry functionality would then be the responsibility of the code that sends the URL:

```go
req, _ := svc.PutObjectRequest(params)
urlStr, err := req.Presign(15 * time.Minute /* URL expiry time */)
```

With that said, locally I tried to reproduce this with a 300MB file from disk being uploaded to S3, but the memory usage of the process stayed at about 6MB. The process's virtual memory did go up 300MB, but that's just because the OS creates this mapping for the file; the memory is not actually used. I also tried an experiment where I switched to using […]
So my use case right now is that I have a web service which manages file uploads and stores them on a user's behalf on S3. I'll look into the `s3manager` Uploader.
In addition, I didn't mention this before, but you can control the concurrency of the Uploader.
```go
func handler(w http.ResponseWriter, r *http.Request) {
	basicPost(r.Body)
}

func basicPost(r io.Reader) {
	sess, err := session.NewSession(&aws.Config{Region: aws.String("us-east-1")})
	if err != nil {
		fmt.Println(err)
	}
	svc := s3manager.NewUploader(sess)
	input := &s3manager.UploadInput{
		Bucket: aws.String("bucket"),
		Key:    aws.String("whatever"),
		Body:   r,
	}
	_, err = svc.Upload(input)
	if err != nil {
		fmt.Println(err)
	}
}
```

I know that declaring the session per call is possibly not efficient for concurrent requests, but I wanted to ensure everything that could be cleaned up would be. All I'm doing right now is watching Activity Monitor in OS X and checking the memory (not virtual memory) field. goamz uses approx. 8MB.
Use case (download side): streaming the ciphertext of 4K video (mp4 files). The ciphertext is in S3 in files that are about 4GB in size. Presume that you get the video in chunks. Video must download chunks within a timing deadline to the EC2 instance that decrypts the stream for the user. So the chunks coming from S3 must be small, because you can't read the first byte out of a chunk until all bytes of the chunk have come down. That means you end up making a huge number of requests to S3 for chunks. The alternative is that you just pass in an io.Reader that only reads forward. The call to S3 should then return immediately with the io.Reader still open, so that when we get range requests from the browser, we just keep reading more out of the io.Reader.

We have a similar issue with uploads. If uploading allowed the multipart MIME part's io.Reader to be used as a parameter to the S3 manager, then once the browser has uploaded a quarter of the file, a quarter of the ciphertext would already be in S3; when the whole file is uploaded, the whole ciphertext file would exist in S3. What we are doing now as a workaround is to let the user upload the file into the EC2 instance, and promise to get the ciphertext of the file into S3 as soon as possible.

When I was investigating this, I think I supplied my own io.ReadSeeker to log what it was doing... what I discovered was that it seeked to the end to get a file size before it began, and it looked like it seeked one byte ahead while doing something.

With video, the stream is often unbounded. You won't know the content length or be able to compute a hash of it until you hit EOF, so you end up just having to upload it first. Live streaming is the act of downloading a file (video, logs) that is still uploading. If everything used forward-only IO, this would become straightforward, other than the issue of not knowing the content length or hash until the stream actually ends.
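The forward-only flow described above can be sketched with the standard library's `io.Pipe`, which connects a producer and a consumer without ever buffering the whole stream. This is a minimal illustration, not SDK code; the chunk contents and the `streamForward` helper are made up for the example:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// streamForward pipes chunks through an io.Pipe and consumes them
// forward-only, the way an io.Reader-based uploader would: the consumer
// can start reading before the producer has finished writing.
func streamForward(chunks []string) string {
	pr, pw := io.Pipe()
	go func() {
		defer pw.Close()
		for _, c := range chunks {
			io.Copy(pw, strings.NewReader(c)) // producer: one chunk at a time
		}
	}()
	var sb strings.Builder
	io.Copy(&sb, pr) // consumer reads until the producer closes the pipe
	return sb.String()
}

func main() {
	fmt.Println(streamForward([]string{"part-1 ", "part-2 ", "part-3"}))
}
```

Only the bytes currently in flight are held in memory, which is the property the comment above is asking for.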
I'm hitting this issue too. I need to stream multipart requests coming from clients to my server and up to S3 without too much buffering. The SDK, as it currently stands, does not seem to be designed for this use case and requires buffering an entire part before being able to push it up to S3.
I ended up switching to https://github.com/rlmcpherson/s3gof3r which does this correctly.
This is mainly because the official AWS SDK does not support streaming multipart requests without buffering too much. More info can be found at aws/aws-sdk-go#915.
@c4milo I'm also testing the use of s3gof3r, since goamz seems to have been unmaintained for the past 2 years. goamz worked, utilizing 30MB of RAM at peak and 8MB idle. s3gof3r seems to be a solid interface, and once I implement it I'll report back with an anecdotal memory profile.
@DavidJFelix, try limiting the uploader's concurrency:

```go
func handler(w http.ResponseWriter, r *http.Request) {
	basicPost(r.Body)
}

func basicPost(r io.Reader) {
	sess, err := session.NewSession(&aws.Config{Region: aws.String("us-east-1")})
	if err != nil {
		fmt.Println(err)
	}
	svc := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
		u.Concurrency = 1
	})
	input := &s3manager.UploadInput{
		Bucket: aws.String("bucket"),
		Key:    aws.String("whatever"),
		Body:   r,
	}
	_, err = svc.Upload(input)
	if err != nil {
		fmt.Println(err)
	}
}
```

Setting `Concurrency` to 1 limits the number of parts that are buffered and uploaded at any one time.
@jasdel I don't have a problem with it doing things concurrently, I have a problem with it not releasing 100MB that is gained at some point during transmission.
Reviewing the code, it looks like part of the memory issue with the S3 manager is that the buffer used for the reader is recreated for each part. I think if this was updated to use a pool of buffers it would reduce the overall memory allocs/frees.
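The pooling idea suggested above can be sketched with `sync.Pool` from the standard library, reusing part-sized buffers across part uploads instead of allocating a fresh slice each time. This is a sketch, not SDK code; the `partPool`/`uploadPart` names are illustrative, and the 5 MiB size mirrors S3's minimum multipart part size:

```go
package main

import (
	"fmt"
	"sync"
)

const partSize = 5 * 1024 * 1024 // S3's minimum multipart part size

// partPool hands out reusable part-sized buffers so each part upload
// does not allocate (and later free) its own 5 MiB slice.
var partPool = sync.Pool{
	New: func() interface{} { return make([]byte, partSize) },
}

// uploadPart copies data into a pooled buffer and returns the number of
// bytes staged; a real uploader would send buf[:n] as one UploadPart call.
func uploadPart(data []byte) int {
	buf := partPool.Get().([]byte)
	defer partPool.Put(buf) // return the buffer for the next part
	n := copy(buf, data)
	return n
}

func main() {
	fmt.Println(uploadPart([]byte("hello")), uploadPart(make([]byte, partSize)))
}
```

Under steady concurrency the pool converges on roughly `Concurrency` live buffers, rather than one allocation per part.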
@jasdel are you able to reproduce what I'm seeing with the above code retaining about 100MB of RAM even after the request is complete?
@DavidJFelix I'm working to reproduce this. With a benchmark I can see using an […]
Just to clarify: it doesn't seem to be leaking memory per request; it seems to hit 100MB and then stay there, even on subsequent requests (each of which should create a new manager). The issue is that it never returns to the starting RAM of ~3MB.
Thanks for the update @DavidJFelix. The memory used by the process may not immediately drop, as the Go runtime will hold onto memory allocated by the process. The s3manager, when not using an […]. If possible, it's best to share the managers and session across uploads, especially if the SDK were to add byte-array pools to optimize out some of the allocs.

In your use case, is the upload manager being used to upload multiple 100MB files, or is a single file uploaded?

With the following benchmark, the uploader allocates about 25MB per upload, about 5MB per part. This highlights why the SDK should optimize with a byte-buffer pool. Run with:

```
go test -run NONE -bench=.* -benchmem
```

```go
package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"net/http"
	"sync"
	"testing"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/request"
	"github.com/aws/aws-sdk-go/awstesting/unit"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func BenchmarkS3Manager_Upload(b *testing.B) {
	buf := make([]byte, 1024*1024*20)
	mockSvc := mockedSvc()
	uploader := s3manager.NewUploaderWithClient(mockSvc, func(u *s3manager.Uploader) {
		// u.PartSize = 64 * 1024 * 1024 // 64MB per part
	})
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		uploader.Upload(&s3manager.UploadInput{
			Bucket: aws.String("jasdel-bucket02"),
			Key:    aws.String("somekey"),
			Body:   bytes.NewBuffer(buf),
		})
	}
}

// mockedSvc returns an S3 client whose requests never hit the network,
// so the benchmark measures only the uploader's own allocations.
func mockedSvc() *s3.S3 {
	var m sync.Mutex
	partNum := 0
	svc := s3.New(unit.Session)
	svc.Handlers.Unmarshal.Clear()
	svc.Handlers.UnmarshalMeta.Clear()
	svc.Handlers.UnmarshalError.Clear()
	svc.Handlers.Send.Clear()
	svc.Handlers.Send.PushBack(func(r *request.Request) {
		m.Lock()
		defer m.Unlock()
		r.HTTPResponse = &http.Response{
			StatusCode: 200,
			Body:       ioutil.NopCloser(bytes.NewReader([]byte{})),
		}
		switch data := r.Data.(type) {
		case *s3.CreateMultipartUploadOutput:
			data.UploadId = aws.String("UPLOAD-ID")
		case *s3.UploadPartOutput:
			partNum++
			data.ETag = aws.String(fmt.Sprintf("ETAG%d", partNum))
		case *s3.CompleteMultipartUploadOutput:
			data.Location = aws.String("https://location")
			data.VersionId = aws.String("VERSION-ID")
		case *s3.PutObjectOutput:
			data.VersionId = aws.String("VERSION-ID")
		}
	})
	return svc
}
```
I also have a use case where I want to stream content from one online source where I have an MD5 of the full content, and ensure that the content enters S3 with the same hash (end-to-end validation of the content copy). PutObject would solve this, but the use of ReadSeeker makes it difficult to use an http.Response.Body. The Uploader helps with the ReadSeeker issue, but the downside is that I lose the end-to-end hashes on anything with multipart uploads. I can solve that by putting an io.TeeReader into the uploader and writing to an MD5 hash, to verify by the end that the downloaded MD5 was correct, but I could still do with the UploadPart calls automatically validating the MD5s of the chunks. Given an overall MD5 for the download, chunk validation on upload, and checking that the content has the right ETag after the upload, I can be pretty certain that the copy happened correctly.
I was thinking of […]

Now, if the structure looked like this, streaming + MD5 could be supported with […]
So essentially, as long as the S3 PutObject on-the-wire request requires the MD5 before the actual content, we can't support streaming, because we have to buffer the content somewhere. Streaming upload could probably be achieved with multipart, insofar as we can add additional data after the content, but of course that's not pretty, and it would increase bandwidth consumption because the content would have to be base64 encoded. So that's really a shame. The only solution for supporting streaming would require opting out of the MD5 mechanism (as a workaround, s3gof3r stores an additional file for just the MD5; goamz probably does not use MD5 at all). That's not cool either. Oh well.. :/
Hm, does S3 support trailing headers for chunked uploads? http://stackoverflow.com/a/11313254

Okay, it appears it does not: https://forums.aws.amazon.com/message.jspa?messageID=561616
Hi all, I think the best way to address this issue without a breaking change to the SDK's S3 PutObject and UploadPart APIs would be to use the s3manager package's Uploader utility. This utility will take an io.Reader. In addition, if you're looking to stream content from a client through your server to S3, I recommend your server application use presigned URLs that are sent to the client. This reduces the load on your service and allows the client application to upload the content without it being streamed through your service. I'll put together an example that highlights this idea.
See above for why that wouldn't solve my use case.
Thanks for the clarification @Redundancy. One way to work around this would be to use presigned URLs with S3's PutObject API. I actually just finished a PR which outlines how to use presigned URLs with S3. The example was written in the context of a client and server setup, but I think it's still useful for using presigned URLs in general; the presigned URLs are then executed with a standard http.Client. Do you think this would help address your use case?
It's a bit of a long workaround, and it's disappointing not to fix the loss of generality in the API. io.Reader is far more general than io.ReadSeeker, and the API could attempt to upgrade a Reader to a ReadSeeker internally to see if it can take a more efficient path (this sort of thing is sometimes done in the standard library). See the zero-copy IO in io.Copy (copyBuffer), where it tests for ReaderFrom and WriterTo.
I don't think we could change the API directly, as it would require a breaking change. In the general case the SDK needs an io.ReadSeeker for APIs such as PutObject, where a hash of the reader's content needs to be included in the request's signature.

Recently the AWS SDKs have started to allow S3 PutObject to use a magic value of UNSIGNED-PAYLOAD for the request's body SHA signature component when the request is made over HTTPS. This feature can be enabled in this SDK as well. With the above, we could consider creating a streaming-only version of PutObject that uses an io.Reader instead of a read seeker, with the magic unsigned-payload body SHA component. This customization may not be able to retry any request error.

When streaming content from a client, it's generally much better to provide the client with a presigned URL. An application could require its client to provide a SHA-256 hash of the body the client wants to upload. With that hash, the presigned URL can be generated to only allow a body with that hash to be uploaded to S3. Additional headers such as Content-MD5 can also be added and included in the signature for data verification.
So happy to see this question and its replies. Learned a lot :) Nice explanation by both.
Summary

I noticed the S3 calls are using io.ReadSeeker for Body behind the scenes, which seems to be contributing to a huge waste of memory for general use cases. The issue is that ReadSeeker needs to persist bytes so that it can seek backwards, while the io.Reader interface can garbage collect bytes once it has read them, allowing it to stream data from network or disk.

Observed

Using the S3 API for a web service which uses PutObject to place a file into S3, the file must be fully loaded into memory using io.ReadSeeker. For large objects (300MB) this meant a RAM use of >700MB.

Expected

The io.Reader interface should stream input bytes and allow them to be garbage collected once read (and presumably written to S3).

Additional Information

Replacing aws-sdk-go with mitchellh/goamz dropped my RAM use per 300MB request from 700MB to ~8MB. I have a minimal test case to demonstrate this, if desired.