Enable direct s3a:// file accesses #6698

Open
wants to merge 2 commits into
base: master

Conversation

takeshi-yoshimura

Currently, files on S3 must be copied to local storage before GATK can
use them. This patch enables GATK local and Spark modes to access s3a://
files directly, reducing copy overhead and local disk usage.

s3a file access requires additional configuration in a core-site.xml
located on the CLASSPATH, as with other Hadoop applications. Spark
already has Hadoop dependencies, but local mode needs the Hadoop
jars added to the classpath.

Example core-site.xml:

<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>{Your AWS_ACCESS_KEY_ID}</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>{Your AWS_SECRET_ACCESS_KEY}</value>
  </property>
</configuration>
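
Because a misplaced or incomplete core-site.xml is an easy mistake, a quick check before running GATK can help. The following is a minimal sketch, not part of this patch: it looks up core-site.xml on the classpath with the JDK's own XML parser and reports whether the two keys from the example above are defined. The class name `CoreSiteCheck` and its helper methods are hypothetical, and the check does not contact S3 or validate the credentials.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: checks that core-site.xml defines the s3a credential keys.
public class CoreSiteCheck {

    // Reads every <property><name>/<value> pair from a Hadoop-style configuration file.
    static Map<String, String> parseProperties(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        Map<String, String> props = new LinkedHashMap<>();
        NodeList nodes = doc.getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element property = (Element) nodes.item(i);
            Node name = property.getElementsByTagName("name").item(0);
            Node value = property.getElementsByTagName("value").item(0);
            if (name != null && value != null) {
                props.put(name.getTextContent().trim(), value.getTextContent().trim());
            }
        }
        return props;
    }

    // True when both keys from the example configuration are present.
    static boolean hasS3aCredentials(Map<String, String> props) {
        return props.containsKey("fs.s3a.access.key")
                && props.containsKey("fs.s3a.secret.key");
    }

    public static void main(String[] args) throws Exception {
        // Look the file up on the classpath, which is where it is expected to live.
        try (InputStream in =
                CoreSiteCheck.class.getClassLoader().getResourceAsStream("core-site.xml")) {
            if (in == null) {
                System.out.println("core-site.xml not found on the classpath");
                return;
            }
            // Only report presence; never print the secret values themselves.
            System.out.println(hasS3aCredentials(parseProperties(in))
                    ? "s3a credential keys are defined"
                    : "s3a credential keys are missing");
        }
    }
}
```

Run it with the same classpath you intend to pass to GATK, so that the lookup sees the same core-site.xml.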

@lbergelson
Member

@takeshi-yoshimura Native access to s3a:// seems useful. To include this, though, we need some tests that show it works and demonstrate how to use it. I'm a bit concerned that it's version 0.0.1 (although it looks like there is a 0.0.2 out; should that be the one incorporated instead?) and there doesn't seem to have been any activity on the library's GitHub in the last two years or so. I'm wondering how stable and supported it is.

Maybe @fnothaft can comment on it. What's the status of this library? Do you recommend incorporating it or is there a different solution you've moved on to?

@takeshi-yoshimura
Author

Thank you for your reply. I will try adding tests and documentation on how to use s3a.

@lbergelson
Member

@takeshi-yoshimura One more thing to note: you should be able to use this library with an existing version of GATK by including it in your classpath, since filesystem providers should be loaded dynamically. (Something along the lines of `--java-options "-cp ' should work.)
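
The dynamic loading described here relies on Java's service-provider mechanism: NIO filesystem providers found on the classpath at startup are returned by `FileSystemProvider.installedProviders()`. A minimal sketch for checking whether a provider was picked up follows. The class name `ListNioProviders` is hypothetical, and the assumption that the library registers itself under the scheme `s3a` is not confirmed in this thread.

```java
import java.nio.file.spi.FileSystemProvider;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the URI schemes of all installed NIO filesystem providers.
public class ListNioProviders {

    // Collects the scheme of every provider the JVM discovered at startup.
    static List<String> installedSchemes() {
        List<String> schemes = new ArrayList<>();
        for (FileSystemProvider provider : FileSystemProvider.installedProviders()) {
            schemes.add(provider.getScheme());
        }
        return schemes;
    }

    public static void main(String[] args) {
        List<String> schemes = installedSchemes();
        System.out.println("Installed NIO schemes: " + schemes);
        // Assumption: the s3a provider registers under the scheme "s3a".
        System.out.println("s3a available: " + schemes.contains("s3a"));
    }
}
```

Running this with and without the provider jar on the classpath shows whether the JVM discovers it without any change to GATK itself.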

@takeshi-yoshimura
Author

@lbergelson Thank you for the comment, and sorry for my slightly late response. I removed the dependency on jsr203-s3a and tested that both local and Spark GATK can access s3a files by loading it dynamically. I also added a new directory, scripts/s3a, with documentation and simple tests that demonstrate s3a.

@lbergelson lbergelson self-requested a review August 6, 2020 15:01

2 participants