
SeekableByteChannel that doesn't use a temp file #103

Open

sbeimin opened this issue Feb 28, 2018 · 4 comments

sbeimin commented Feb 28, 2018

The current implementation of S3SeekableByteChannel uses a temp file, which has a couple of drawbacks. One of the primary reasons to use a SeekableByteChannel is to avoid downloading an entire (say, 2 GB) file when you are only interested in a specific range of bytes within it.

Why do we have a separate S3SeekableByteChannel alongside S3FileChannel? S3FileChannel extends java.nio.channels.FileChannel, which in turn implements java.nio.channels.SeekableByteChannel.

Another problem is the use of Files.createTempFile(). This generally creates a file in /tmp (on Linux, for example), which can be a security risk if the file contains sensitive data.
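As long as the temp file remains, the exposure can at least be narrowed by creating it with owner-only permissions. A minimal sketch using only java.nio.file (note that Files.createTempFile already restricts permissions where the platform supports it; passing the attribute explicitly just makes the intent auditable, and this only works on POSIX file systems):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileAttribute;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.EnumSet;
import java.util.Set;

public class SecureTempFile {
    /** Creates a temp file readable and writable only by the owning user. */
    public static Path create() throws IOException {
        FileAttribute<Set<PosixFilePermission>> ownerOnly =
            PosixFilePermissions.asFileAttribute(
                PosixFilePermissions.fromString("rw-------"));
        return Files.createTempFile("s3fs-", ".tmp", ownerOnly);
    }

    public static void main(String[] args) throws IOException {
        Path p = create();
        System.out.println(Files.getPosixFilePermissions(p));
        Files.delete(p);
    }
}
```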

The com.amazonaws.services.s3.model.GetObjectRequest has a range field (private long[] range;, settable via setRange/withRange) that enables retrieval of a limited range of bytes. Perhaps this could be used to build a better SeekableByteChannel implementation.

However, I can't find any option to write a limited range of bytes to an S3 object... Perhaps we'll have to wait and see what the 2.0 version of the AWS SDK offers us.
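For the read side, a range-based channel could look roughly like the sketch below. LazySeekableByteChannel and RangeReader are hypothetical names, not part of the project: each read() fetches only the requested byte range, an S3-backed RangeReader would issue a GetObjectRequest with setRange(start, end) (inclusive, like an HTTP Range GET), and writes fail fast instead of buffering to a temp file. The demo uses an in-memory array rather than S3:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.NonWritableChannelException;
import java.nio.channels.SeekableByteChannel;
import java.util.Arrays;

/** Read-only channel that fetches only the requested byte range per read. */
public class LazySeekableByteChannel implements SeekableByteChannel {

    /** Returns the bytes in [start, end] inclusive, like an HTTP Range GET. */
    public interface RangeReader {
        byte[] read(long start, long end) throws IOException;
    }

    private final RangeReader reader;
    private final long size;
    private long position;
    private boolean open = true;

    public LazySeekableByteChannel(RangeReader reader, long size) {
        this.reader = reader;
        this.size = size;
    }

    @Override
    public int read(ByteBuffer dst) throws IOException {
        if (!open) throw new ClosedChannelException();
        if (position >= size) return -1;      // past EOF
        if (!dst.hasRemaining()) return 0;
        // Fetch no more than the buffer can hold, clamped to the object size.
        long end = Math.min(size - 1, position + dst.remaining() - 1);
        byte[] chunk = reader.read(position, end);
        dst.put(chunk);
        position += chunk.length;
        return chunk.length;
    }

    @Override
    public int write(ByteBuffer src) {
        // Per the discussion: fail fast rather than spool to a temp file.
        throw new NonWritableChannelException();
    }

    @Override public long position() { return position; }
    @Override public SeekableByteChannel position(long newPosition) {
        this.position = newPosition;
        return this;
    }
    @Override public long size() { return size; }
    @Override public SeekableByteChannel truncate(long s) {
        throw new NonWritableChannelException();
    }
    @Override public boolean isOpen() { return open; }
    @Override public void close() { open = false; }

    public static void main(String[] args) throws IOException {
        byte[] data = "0123456789".getBytes();
        SeekableByteChannel ch = new LazySeekableByteChannel(
            (start, end) -> Arrays.copyOfRange(data, (int) start, (int) end + 1),
            data.length);
        ch.position(4);
        ByteBuffer buf = ByteBuffer.allocate(3);
        ch.read(buf);
        System.out.println(new String(buf.array()));
    }
}
```

The sketch deliberately omits caching and read-ahead; a real implementation would want both to avoid issuing one request per small read.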

@magicDGS

I think that an implementation such as the S3SeekableStream in epam/htsjdk-s3-plugin is the way to go. It is similar to what I'm doing in my implementation of an HTTP/S filesystem...

@ryan-williams

I learned about this issue the hard way on a 6GB file just now!

IMO, throwing UnsupportedOperationException would be preferable to downloading the whole file (especially to /tmp, per the OP).

@tomwhite

I did some work to fix this a while back (which I forgot to post at the time). Sorry you suffered from this too @ryan-williams! Here's the branch: https://github.com/tomwhite/Amazon-S3-FileSystem-NIO2/tree/read-seeks

It lacks unit tests, but I did test it manually.

ryan-williams commented Jul 24, 2018

That's great @tomwhite, I got most of the way through my own implementation but yours is cleaner.

I patched a few more fixes (to #106 and #108) into it and released org.lasersonlab:s3fs:2.2.3 to Maven Central from this tag on the lasersonlab fork.

Lots of tests fail, as you mentioned, but it seems to be a mix of mock-based tests that I don't really care to unbreak and integration tests that aren't configured in a way I know how to run.
