
java.nio.file.ProviderNotFoundException (Provider "s3" not found) #1732

Closed
rstrahan opened this Issue Sep 17, 2017 · 14 comments

@rstrahan commented Sep 17, 2017

I'm trying to run transformAlignments on a BAM file in S3, e.g.:

adam-submit transformAlignments s3://1000genomes/phase3/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam s3://<mybucket>/1000genomes/adam/bam=HG00154/

It fails with:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 92 in stage 0.0 failed 60 times, most recent failure: Lost task 92.59 in stage 0.0 (TID 1266, ip-10-184-8-118.ec2.internal, executor 1): java.nio.file.ProviderNotFoundException: Provider "s3" not found
	at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
	at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
	...

If I stage the input BAM file on HDFS, the problem is resolved (the S3 output path works fine; only the S3 input path causes problems).

hadoop fs -cp s3://1000genomes/phase3/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam /adam/HG00154.bam  

adam-submit transformAlignments /adam/HG00154.bam s3://<mybucket>/1000genomes/adam/bam=HG00154/

Do you have any pointers or fixes to get transformAlignments to support S3 input BAM files?

@fnothaft (Member) commented Sep 17, 2017

Hi @rstrahan! Thanks for dropping in with the issue. Accessing BAM data from S3 is thornier than I'd like and we've got an action item to write this up for the 0.23.0 release (see #1643). I'll probably take this on tonight, since you're asking here. To get you unstuck, here's a set of pointers. From bigdatagenomics/mango#311, you're on EMR, so you should be on a pretty up-to-date version of Hadoop, and your IAM roles should be configured properly. What you'll need to do is:

  • Use the s3a scheme, instead of the s3 scheme. This is the latest version of the S3 file access code path in Hadoop.
  • Include the JAR from https://github.com/fnothaft/jsr203-s3a (on Maven Central: http://search.maven.org/#artifactdetails%7Cnet.fnothaft%7Cjsr203-s3a%7C0.0.1%7Cjar). It provides a java.nio file system provider for the s3a scheme. I'm not familiar with how EMR attaches JARs, but if you were using spark-submit/adam-submit, you could do --packages net.fnothaft:jsr203-s3a:0.0.1
  • You may also need com.amazonaws:aws-java-sdk-pom:1.10.34 and org.apache.hadoop:hadoop-aws:2.7.4. I don't think you'll need these on EMR, though.
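
For reference (a sketch only, not verified on EMR — it assumes adam-submit passes Spark arguments before the "--" separator, and <mybucket> is a placeholder), the full invocation would look roughly like:

# Attach the s3a NIO provider via --packages and use s3a:// URIs for both input and output.
adam-submit \
  --packages net.fnothaft:jsr203-s3a:0.0.1 \
  -- \
  transformAlignments \
  s3a://1000genomes/phase3/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam \
  s3a://<mybucket>/1000genomes/adam/bam=HG00154/
# Outside of EMR, you may also need org.apache.hadoop:hadoop-aws and the AWS SDK, as noted above.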

I'll clean this up further tonight and bundle this in our docs. Let me know if you run into any further issues!

@dmmiller612 commented Sep 18, 2017

Hey @fnothaft,

Thanks for taking the time to look at this issue. I followed your instructions above, except that I used Maven for the jar dependencies. I took these steps:

  • Set the s3a filesystem implementation in the Spark Hadoop configuration: sparkSession.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  • Added the Maven dependency listed above (which automatically included it in the jar-with-dependencies)
  • I already had the AWS SDK and hadoop-aws in the jar

However, it looks like I hit the same issue with s3a on EMR (I can't verify locally at this time).

java.nio.file.ProviderNotFoundException: Provider "s3a" not found
	at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
	at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
	at org.seqdoop.hadoop_bam.BAMRecordReader.initialize(BAMRecordReader.java:140)
	at org.seqdoop.hadoop_bam.BAMInputFormat.createRecordReader(BAMInputFormat.java:121)
	at org.seqdoop.hadoop_bam.AnySAMInputFormat.createRecordReader(AnySAMInputFormat.java:190)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
@heuermh (Member) commented Sep 18, 2017

@rstrahan @dmmiller612 I've been using the s3a protocol on various versions of EMR without any issues, and without any additional dependencies or configuration, although I'm using adam-shell via ssh and ADAM interactively in Zeppelin, not adam-submit.

@fnothaft (Member) commented Sep 18, 2017

@heuermh you've been doing s3a with Parquet, no? This issue is specific to loading BAM, CRAM, or VCF, where the nio provider is needed for Hadoop-BAM to access the BAI, CRAI, or TBI.

@fnothaft (Member) commented Sep 18, 2017

@dmmiller612 let me look into this a bit more. The TL;DR is that it has something to do with classpath/classloader configuration. I've run into this on other distros similar to EMR.

@heuermh (Member) commented Sep 18, 2017

> @heuermh you've been doing s3a with Parquet, no? This issue is specific to loading BAM, CRAM, or VCF, where the nio provider is needed for Hadoop-BAM to access the BAI, CRAI, or TBI.

I haven't been attempting to access the index directly (e.g. using ADAMContext.loadIndexedBam), so perhaps I've missed the issue. I have a working session this afternoon with the Be The Match folks and will look into it.

@fnothaft (Member) commented Sep 18, 2017

All Hadoop-BAM BAM reads create a nio provider to try to find the index, even if you aren't using the index to filter intervals.

@dmmiller612 commented Sep 18, 2017

@fnothaft I'll continue to look as well. I simplified my test case and am just using adam.loadAlignments("s3a://bucket-name/something.bam"), and it still reproduces the issue.

@dmmiller612 commented Sep 18, 2017

Also worth mentioning that I am using Spark 2.2. I don't think it matters, but I wanted to give a bit of context in case it does.

@heuermh (Member) commented Jan 11, 2018

@dmmiller612 Is this still an issue? We've since deployed the S3 doc (although it doesn't look any different from what @fnothaft described above) and released ADAM version 0.23.0.

@dmmiller612 commented Jan 11, 2018

@heuermh I can check later, but I still experience the same error in Hadoop-BAM. The doc above looks like it is specifically for ADAM files and not BAM files. The problem seems to be that Hadoop-BAM uses NIO to look for the .bai file, and the s3 provider isn't getting registered, even if I manually add an s3 NIO package.
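
(As a sanity check — sketch only, with an illustrative jar name — NIO discovers providers through a ServiceLoader manifest inside the jar, so you can at least confirm the entry is present:)

# List the registered java.nio FileSystemProvider implementations in the provider jar.
unzip -p jsr203-s3a-0.0.1.jar META-INF/services/java.nio.file.spi.FileSystemProvider
# If no s3a provider class is listed here, it can never be found, regardless of classpath.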

@heuermh (Member) commented Jan 11, 2018

Thanks, I'll be on EMR later today and will do some further investigation.

@fnothaft (Member) commented Jan 11, 2018

Hi @dmmiller612! How are you adding the JARs? I'm not familiar with EMR, but depending on how they attach dependencies, the s3 NIO JAR may not be visible to the correct classloader. As far as I can tell, the NIO system searches a specific classloader for the NIO FileSystem implementations. What's worked most reliably for me is to have the NIO library on the executor classpaths when the Spark executors boot.
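
In spark-submit/adam-submit terms, that looks something like the sketch below (paths are placeholders; the jar must already exist at that location on every node, and I'm assuming adam-submit forwards Spark arguments before the "--" separator):

# Pre-stage the NIO provider jar on each node, then put it on the driver and executor
# classpaths at JVM launch so the default provider lookup can see it.
adam-submit \
  --conf spark.driver.extraClassPath=/opt/jars/jsr203-s3a-0.0.1.jar \
  --conf spark.executor.extraClassPath=/opt/jars/jsr203-s3a-0.0.1.jar \
  -- \
  transformAlignments s3a://<bucket>/sample.bam s3a://<bucket>/sample.adam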

@fnothaft (Member) commented Mar 7, 2018

The approach for doing this was documented in 90572b5 or earlier.

@fnothaft closed this Mar 7, 2018

@heuermh added this to the 0.24.0 milestone Mar 7, 2018
