Utilize io.Reader not io.ReadSeeker where possible #915

Closed
DavidJFelix opened this Issue Oct 29, 2016 · 23 comments

@DavidJFelix

DavidJFelix commented Oct 29, 2016

Summary

I noticed the S3 calls use io.ReadSeeker for Body behind the scenes, which seems to contribute to a huge waste of memory for common use cases. The issue is that a ReadSeeker must retain bytes so that it can seek backwards, while with a plain io.Reader the bytes can be garbage collected once they have been read, allowing data to be streamed from the network or disk.

Observed

Using the S3 API from a web service which uses PutObject to place a file into S3, the file must be fully loaded into memory to satisfy io.ReadSeeker. For large objects (300MB) this meant RAM usage of over 700MB.

Expected

The io.Reader interface should stream input bytes and allow them to be garbage collected once read (and presumably written to S3).

Additional Information

Replacing aws-sdk-go with mitchellh/goamz dropped my RAM use per 300MB request from 700MB to ~8MB. I have a minimal test case to demonstrate this, if desired.

@jasdel

Member

jasdel commented Oct 29, 2016

Thanks for contacting us @DavidJFelix, I'd like to learn more about your use case. In the web service's code, does it cache the contents it will be uploading to S3 as a file or in memory? In addition, are you referring to physical memory or virtual memory?

Since the SDK uses AWS signature version 4, a digest hash of the body needs to be computed. In order to do this, the SDK must read the contents of the io.ReadSeeker and seek back to the origin of the reader so that the http.Client can send the body in the request. In addition, using an io.ReadSeeker allows the SDK to retry failed uploads if a network connection error occurs. An io.Reader would make retries impossible without reading the full contents into memory.

One solution for streaming content is to use the s3manager.Uploader. This utility uses S3's multipart uploads to upload the content in chunks (aka parts) in parallel, and it can be used with an io.Reader. The uploader defaults to 5MB parts, but this can be adjusted via configuration.

uploader := s3manager.NewUploader(sess)
result, err := uploader.Upload(params /* *s3manager.UploadInput */)

An additional solution could be to use the SDK's Presigner with the S3 PutObject operation. Though, the onus for all retry functionality would then be on the code that sends the request to the URL. If just an io.Reader were used for the request's body, retries would be impossible without loading the full contents into memory, as mentioned earlier.

req, _ := svc.PutObjectRequest(params)
urlStr, err := req.Presign(15 * time.Minute /* URL expiry time */)

With that said, locally I tried to reproduce this with a 300MB file from disk being uploaded to S3, but the memory usage of the process stayed at about 6MB. The process's virtual memory did go up 300MB, but that's just because the OS creates this mapping for the file; the memory is not actually used.

I also tried an experiment where I switched to using ioutil.ReadAll to read the file's contents into memory and then wrap the bytes with a bytes.Reader. In this case I did see the memory usage go up significantly. Memory usage of about the size of the file is expected, but I did see a condition where something in the stack was causing the buffer to be duplicated.

@DavidJFelix

DavidJFelix commented Oct 29, 2016

So my use case right now is that I have a web service which manages file uploads and stores them on a user's behalf on S3. The http.Request object's Body is an io.Reader, and I simply validate that the request is authenticated and then save it to S3 on behalf of the user. I was using ioutil.ReadAll and bytes.Reader to turn the request into a PutObjectInput. I think the 2x is caused by some saving I'm doing in the intermediate step. The problem I have is that these files could, and likely will, be large, so concurrent requests will spike RAM to an unpredictable and unacceptable level.

I'll look into the io.Reader interfaces that you're mentioning and see if they make sense.

@jasdel

Member

jasdel commented Oct 29, 2016

In addition, something I didn't mention before: you can control the concurrency of the s3manager.Uploader's uploads. By default it is 5 parts at once per file, but you can change this to 1 or 2 in this context. Even if the uploader is uploading serially, it will handle retries of the parts.

@DavidJFelix

DavidJFelix commented Oct 30, 2016

Uploader seems to be working closer to how I'd expect with regards to the buffer. It's still a much higher use of memory than goamz. Using the uploader, the memory profile appears to rise to about 130MB and flatline there. I'm not sure how it looks with concurrent requests. Here's the code I'm using:

func handler(w http.ResponseWriter, r *http.Request) {
    basicPost(r.Body)
}

func basicPost(r io.Reader) {
    sess, err := session.NewSession(&aws.Config{Region: aws.String("us-east-1")})
    if err != nil {
        fmt.Println(err)
    }
    svc := s3manager.NewUploader(sess)
    input := &s3manager.UploadInput{
        Bucket: aws.String("bucket"),
        Key:    aws.String("whatever"),
        Body:   r,
    }

    _, err = svc.Upload(input)
    if err != nil {
        fmt.Println(err)
    }
}

I know that declaring the session per call is possibly not efficient for concurrent requests, but I wanted to ensure everything that could be cleaned up would be. All I'm doing right now is watching Activity Monitor in OSX and checking the memory (not virtual memory) field. goamz uses approx. 8MB.

@rfielding

rfielding commented Oct 30, 2016

Use case (download side): streaming the ciphertext of 4K video (mp4 files).

The ciphertext is in S3 in files that are about 4GB in size. Presume that you get the video in chunks. Chunks must download within a timing deadline to the EC2 instance that decrypts the stream for the user. So the chunks coming from S3 must be small, because you can't read the first byte of a chunk until all bytes of the chunk have come down. That means you end up making a huge number of requests to S3 for chunks.

The alternative is that you just pass in an io.Reader that only reads forward. Now the call to S3 should return immediately with the io.Reader still open. That means that when we get range requests from the browser, we just keep reading more out of the io.Reader.

We have a similar issue with uploads. If uploading allowed the multipart MIME part's io.Reader to be used as a parameter to the S3 manager, then once the browser had uploaded a quarter of the file, a quarter of the ciphertext would already be in S3. When the whole file is uploaded, the whole ciphertext file exists in S3.

What we are doing now as a workaround is to let the user just upload the file into the EC2, and promise to get the ciphertext of the file into S3 as soon as possible. When I was investigating this, I think I supplied my own io.ReadSeeker to log what it was doing... but what I discovered was that it seeked to the end to get a file size before it began, and it looked like it seeked one byte ahead while doing something.

With video, the stream is often unbounded. You won't know the content length or be able to compute a hash of it until you have an EOF; so you end up just having to upload it first.

Live-streaming is the act of downloading a file (video, logs) that is still uploading. If everything used forward-only IO, it would become straightforward, other than the issue of not knowing the content length or hash until the stream actually ends.

@c4milo

c4milo commented Nov 1, 2016

I'm hitting this issue too. I need to stream multipart requests coming from clients to my server and up to S3 without too much buffering. The SDK, as it currently stands, does not seem to be designed for this use case and requires buffering an entire part before being able to push it up to S3.

@c4milo

c4milo commented Nov 1, 2016

I ended up switching to https://github.com/rlmcpherson/s3gof3r which does this correctly.

c4milo added a commit to lift-plugins/registry that referenced this issue Nov 1, 2016

Switches away from AWS Go SDK for streaming files to S3
This is mainly because AWS official SDK does not support streaming
multipart requests without buffering too much.

More info can be found at: aws/aws-sdk-go#915
@DavidJFelix

DavidJFelix commented Nov 1, 2016

@c4milo I'm also testing s3gof3r, since goamz seems to have been unmaintained for the past 2 years. Goamz worked, using 30MB of RAM at peak and 8MB idle. s3gof3r seems to be a solid interface, and once I implement it I'll report back with an anecdotal memory profile.

@jasdel

Member

jasdel commented Nov 1, 2016

@DavidJFelix with the s3manager.Uploader you can control how many parallel part uploads happen at once. Configuring this sets how much memory is used during the upload. The following is an update to the code above that configures a single part upload at a time.

func handler(w http.ResponseWriter, r *http.Request) {
    basicPost(r.Body)
}

func basicPost(r io.Reader) {
    sess, err := session.NewSession(&aws.Config{Region: aws.String("us-east-1")})
    if err != nil {
        fmt.Println(err)
    }
    svc := s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
        u.Concurrency = 1
    })
    input := &s3manager.UploadInput{
        Bucket: aws.String("bucket"),
        Key:    aws.String("whatever"),
        Body:   r,
    }

    _, err = svc.Upload(input)
    if err != nil {
        fmt.Println(err)
    }
}

Setting Concurrency to 1 streams the upload one part at a time.

@DavidJFelix

DavidJFelix commented Nov 1, 2016

@jasdel I don't have a problem with it doing things concurrently; I have a problem with it not releasing the ~100MB it gains at some point during transmission.

@jasdel

Member

jasdel commented Nov 1, 2016

Reviewing the code, it looks like part of the memory issue with the S3 manager is that the buffer used for the reader is recreated for each part. I think if this were updated to use a pool of buffers it would reduce the overall memory allocations and frees.

@DavidJFelix

DavidJFelix commented Nov 2, 2016

@jasdel are you able to reproduce what I'm seeing with the above code retaining about 100MB of RAM even after the request is complete?

@jasdel

Member

jasdel commented Nov 2, 2016

@DavidJFelix I'm working to reproduce this. With a benchmark I can see that, using an io.Reader like bytes.Buffer, the uploader will allocate a bit over 100MB total over the life of the upload, split into 5MB chunks (the part size). I expected these parts to be GC'd once no longer used, but I'll do a focused test to see if I can reproduce leaking the memory.

@DavidJFelix

DavidJFelix commented Nov 2, 2016

Just to clarify, it doesn't seem to be leaking memory per request; it seems to hit 100MB and then stay there, even on subsequent requests (which should create a new manager). The issue is that it never returns to the starting RAM of ~3MB.

@jasdel

Member

jasdel commented Nov 11, 2016

Thanks for the update @DavidJFelix. The memory used by the process may not immediately drop, as the Go runtime holds onto memory allocated by the process. The s3manager, when not using an io.ReadSeeker, does allocate additional memory that could be optimized away, though.

It's best to share the uploader and session across uploads where possible, especially if the SDK were to add byte-array pools to optimize away some of the allocations with io.Readers. The uploader and sessions are safe to use concurrently.

In your use case, is the upload manager being used to upload multiple 100MB files, or is only a single file uploaded?

With the following bench, the uploader allocates about 25MB per upload, roughly 5MB per part. This highlights why the SDK should use a byte buffer pool when given an io.Reader.

go test -run NONE -bench=.* -benchmem

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "net/http"
    "sync"
    "testing"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/request"
    "github.com/aws/aws-sdk-go/awstesting/unit"
    "github.com/aws/aws-sdk-go/service/s3"
    "github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func BenchmarkS3Manager_Upload(b *testing.B) {
    buf := make([]byte, 1024*1024*20)

    mockSvc := mockedSvc()
    uploader := s3manager.NewUploaderWithClient(mockSvc, func(u *s3manager.Uploader) {
        //          u.PartSize = 64 * 1024 * 1024 // 64MB per part
    })

    b.ResetTimer()

    for i := 0; i < b.N; i++ {
        uploader.Upload(&s3manager.UploadInput{
            Bucket: aws.String("jasdel-bucket02"),
            Key:    aws.String("somekey"),
            Body:   bytes.NewBuffer(buf),
        })
    }
}

func mockedSvc() *s3.S3 {
    var m sync.Mutex
    partNum := 0

    svc := s3.New(unit.Session)
    svc.Handlers.Unmarshal.Clear()
    svc.Handlers.UnmarshalMeta.Clear()
    svc.Handlers.UnmarshalError.Clear()
    svc.Handlers.Send.Clear()
    svc.Handlers.Send.PushBack(func(r *request.Request) {
        m.Lock()
        defer m.Unlock()

        r.HTTPResponse = &http.Response{
            StatusCode: 200,
            Body:       ioutil.NopCloser(bytes.NewReader([]byte{})),
        }

        switch data := r.Data.(type) {
        case *s3.CreateMultipartUploadOutput:
            data.UploadId = aws.String("UPLOAD-ID")
        case *s3.UploadPartOutput:
            partNum++
            data.ETag = aws.String(fmt.Sprintf("ETAG%d", partNum))
        case *s3.CompleteMultipartUploadOutput:
            data.Location = aws.String("https://location")
            data.VersionId = aws.String("VERSION-ID")
        case *s3.PutObjectOutput:
            data.VersionId = aws.String("VERSION-ID")
        }
    })

    return svc
}
@Redundancy

Redundancy commented Jan 24, 2017

I also have a use case where I want to stream content from one online source, where I have an MD5 of the full content, and ensure that the content enters S3 with the same hash (end-to-end validation of the content copy).

PutObject would solve this, but the use of ReadSeeker makes it difficult to use an http.Response.Body. The Uploader helps with the ReadSeeker issue, but the downside is that I lose the end-to-end hashes on anything with multipart uploads.

I can solve that by putting an io.TeeReader in front of the uploader and writing to an MD5 hash, to verify by the end that the downloaded content's MD5 was correct, but I could still do with the UploadPart calls automatically validating the MD5s of the chunks.

Given an overall MD5 for the download, chunk validation on upload, and verifying the content has the right ETag after the upload, I can be pretty certain that the copy happened correctly.

@joonas-fi

joonas-fi commented Feb 25, 2017

I was thinking of a TeeReader as well for calculating the MD5. Unfortunately that wouldn't help, because the S3 REST docs (http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html) explain that the structure of the PUT request is:

PUT /.... HTTP/1.1
Auth header (includes signed MD5 header)
MD5 header

Body as binary

Now, if the structure looked like this, streaming + MD5 could be supported with TeeReader:

PUT /.... HTTP/1.1
Content-Type: multipart/form-data

... multipart boundary ...
body here, base64
... multipart boundary ...
signed auth value
... multipart boundary ...
md5
... multipart boundary ...

So essentially, as long as the S3 PutObject on-the-wire request requires the MD5 before the actual content, we can't support streaming, because we have to buffer the content somewhere. A streaming upload could probably be achieved with multipart, insofar as additional fields can follow the content, but of course that's not pretty, and it would increase bandwidth consumption because the content would have to be base64-encoded.

So, that's really a shame. The only way to support streaming would be to opt out of the MD5 mechanism (as a workaround, s3gof3r stores an additional file for just the MD5; goamz probably does not use MD5 at all). That's not cool either. Oh well.. :/

@joonas-fi

joonas-fi commented Feb 25, 2017

Hm, does S3 support trailing headers for chunked uploads? http://stackoverflow.com/a/11313254

Okay, it appears it does not: https://forums.aws.amazon.com/message.jspa?messageID=561616
This chunked upload + trailing headers idea seems to have been discussed before here as well: #81

@kiiadi kiiadi added this to the 2.0 milestone Feb 27, 2017

@jasdel

Member

jasdel commented May 9, 2017

Hi all, I think the best way to address this issue without a breaking change to the SDK's S3 PutObject and UploadPart APIs is to use the s3manager package's Uploader utility. This utility takes an io.Reader instead of an io.ReadSeeker and internally optimizes for the io.ReadSeeker case.

In addition, if you're looking to stream content from a client through your server to S3, I recommend your server application use presigned URLs that are sent to the client. This reduces the load on your service and allows the client application to upload the content without it being streamed through your service. I'll put together an example that highlights this idea.

@jasdel jasdel closed this May 9, 2017

@Redundancy

Redundancy commented May 12, 2017

See above why that wouldn't solve my use-case.

@jasdel

Member

jasdel commented May 12, 2017

Thanks for the clarification @Redundancy. One way to work around this would be to use presigned URLs with S3's PutObject API. I actually just finished a PR which outlines how to use presigned URLs with S3; the presigned URLs are then executed with a standard http.Client. The example was written in the context of a client and server setup, but I think it's still useful for presigned URLs in general, and the idea is similar to your use case. Do you think this would help address your use case?

#1260

@Redundancy

Redundancy commented May 13, 2017

It's a bit of a long workaround, and it's disappointing not to fix the loss of generality in the API.

io.Reader is far more general than io.ReadSeeker, and the API could attempt to upgrade a Reader to a ReadSeeker internally to see if it can take a more efficient path (this sort of thing is sometimes done in the standard library).

See the zero-copy I/O in io.Copy (copyBuffer), where it tests for ReaderFrom and WriterTo.

@jasdel

Member

jasdel commented May 13, 2017

I don't think we could change the API directly, as it would require a breaking change. In the general case the SDK needs an io.ReadSeeker for APIs such as PutObject, where the reader's content needs to be included in the AWS v4 signature. The SDK would otherwise be required to buffer the reader's full contents in memory in order to calculate the body SHA component of the signature. In addition, automatic retrying of requests by the SDK would be impossible without buffering the reader's content in memory as well.

Recently the AWS SDKs have started to allow S3 PutObject to use a magic value of UNSIGNED-PAYLOAD for the request's body SHA signature component when the request is made over HTTPS. This feature can be enabled in the SDK as well.

With the above, we could consider creating a streaming-only version of PutObject that uses an io.Reader instead of a read seeker, with the magic unsigned-payload body SHA component. This customization may not be able to retry any request error.

When streaming content from a client, it's generally much better to provide the client with a presigned URL. An application could require its client to provide a SHA-256 hash of the body the client wants to upload. With that hash, the presigned URL can be generated to only allow a body with that hash to be uploaded to S3. Additional headers such as Content-MD5 can also be added and included in the signature for data verification.
