Enable direct s3a:// file accesses #6698

takeshi-yoshimura · 2020-07-10T11:32:00Z

Currently, we need to copy files from S3 to local storage before using
them. This patch enables GATK local and Spark modes to access s3a://
files directly, which reduces copy overhead and local disk usage.

As with other Hadoop applications, s3a file access requires a
core-site.xml on the CLASSPATH with additional configuration. Spark
already has the Hadoop dependencies, but local mode needs the Hadoop
jars added to the classpath.

Example core-site.xml:

<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>{Your AWS_ACCESS_KEY_ID}</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>{Your AWS_SECRET_ACCESS_KEY}</value>
  </property>
</configuration>
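
Since the configuration only takes effect when core-site.xml is actually visible on the classpath, a quick check can save some debugging. The following is a minimal sketch, not part of this patch: the class name `CoreSiteCheck` and the helper `locate` are hypothetical, and it only uses the standard class loader lookup, which is also how Hadoop finds its default resources.

```java
import java.net.URL;

public class CoreSiteCheck {
    // Returns the classpath location of the named resource, or null if it is absent.
    static URL locate(String resourceName) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        return cl.getResource(resourceName);
    }

    public static void main(String[] args) {
        URL url = locate("core-site.xml");
        if (url == null) {
            System.out.println("core-site.xml not found on classpath");
        } else {
            System.out.println("core-site.xml found at " + url);
        }
    }
}
```

Running it with the same classpath that gatk will use should show whether the file is picked up.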

lbergelson · 2020-07-15T16:25:10Z

@takeshi-yoshimura Native access to s3a:// seems useful. To include this, though, we need some tests that show it works and demonstrate how to use it. I'm a bit concerned that the library is at version 0.0.1 (although it looks like there is a 0.0.2 out; should that be the one incorporated instead?) and that there doesn't seem to have been any activity on the library's GitHub in the last two years or so. I'm wondering how stable and supported it is.

Maybe @fnothaft can comment on it. What's the status of this library? Do you recommend incorporating it or is there a different solution you've moved on to?

takeshi-yoshimura · 2020-07-20T12:41:33Z

Thank you for your reply. I will try adding tests and documentation on how to use s3a.

lbergelson · 2020-07-21T18:34:41Z

@takeshi-yoshimura One more thing to note: you should be able to use this library with an existing version of gatk by including it in your classpath, since filesystem providers should be dynamically loaded. (Something along the lines of `--java-options "-cp '` should work.)
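
To confirm that a provider jar on the classpath was actually picked up, one option is to list the NIO filesystem providers the JVM has installed. This is a rough sketch using only `java.nio.file.spi.FileSystemProvider`; the class name `ProviderCheck` is hypothetical, and the scheme string `s3a` is an assumption about what the library registers, so it is taken as an argument rather than hard-coded.

```java
import java.nio.file.spi.FileSystemProvider;
import java.util.ArrayList;
import java.util.List;

public class ProviderCheck {
    // Collects the URI schemes of all filesystem providers visible to this JVM.
    static List<String> installedSchemes() {
        List<String> schemes = new ArrayList<>();
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            schemes.add(provider.getScheme());
        }
        return schemes;
    }

    // True if some installed provider handles the given scheme (case-insensitive).
    static boolean supports(String scheme) {
        for (String s : installedSchemes()) {
            if (s.equalsIgnoreCase(scheme)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumption: the provider under discussion registers the "s3a" scheme.
        String scheme = args.length > 0 ? args[0] : "s3a";
        System.out.println("installed schemes: " + installedSchemes());
        System.out.println(scheme + " supported: " + supports(scheme));
    }
}
```

The built-in `file` scheme is always listed; any other scheme appears only if its provider jar is on the classpath and declares itself through the standard service-provider mechanism.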

takeshi-yoshimura · 2020-07-29T07:23:29Z

@lbergelson Thank you for the comment, and sorry for the slightly late response. I removed the dependency on jsr203-s3a and verified that both local and Spark GATK can access s3a files by loading it dynamically. I also added a new directory, scripts/s3a, with documentation and simple tests that demonstrate s3a access.

add documentation and test scripts to enable s3a

2e91a34

lbergelson self-requested a review August 6, 2020 15:01
