Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues loading BAM files in Google FS #1816

Closed
Georgehe4 opened this issue Dec 4, 2017 · 10 comments
Closed

Issues loading BAM files in Google FS #1816

Georgehe4 opened this issue Dec 4, 2017 · 10 comments
Milestone

Comments

@Georgehe4
Copy link
Contributor

@Georgehe4 Georgehe4 commented Dec 4, 2017

There are issues in ADAM when trying to load files from google file system:

ac = ADAMContext(sc)
alignmentFile = "gs://genomics-public-data/platinum-genomes/bam/NA12890_S1.bam"
reads = ac.loadAlignments(alignmentFile)
reads.toDF().take(1)
Py4JJavaErrorTraceback (most recent call last)
<ipython-input-45-77d92f596218> in <module>()
----> 1 reads.toDF().take(1)

/usr/lib/spark/python/pyspark/sql/dataframe.pyc in take(self, num)
    474         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    475         """
--> 476         return self.limit(num).collect()
    477 
    478     @since(1.3)

/usr/lib/spark/python/pyspark/sql/dataframe.pyc in collect(self)
    436         """
    437         with SCCallSiteSync(self._sc) as css:
--> 438             port = self._jdf.collectToPython()
    439         return list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
    440 

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

Py4JJavaError: An error occurred while calling o521.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 (TID 39, mango-2-w-0.c.mango-bdgenomics.internal, executor 1): java.nio.file.ProviderNotFoundException: Provider "gs" not found
	at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
	at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
	at org.seqdoop.hadoop_bam.BAMRecordReader.initialize(BAMRecordReader.java:140)
	at org.seqdoop.hadoop_bam.BAMInputFormat.createRecordReader(BAMInputFormat.java:121)
	at org.seqdoop.hadoop_bam.AnySAMInputFormat.createRecordReader(AnySAMInputFormat.java:190)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:245)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2803)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
	at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2823)
	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2800)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.file.ProviderNotFoundException: Provider "gs" not found
	at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
	at org.seqdoop.hadoop_bam.util.NIOFileUtil.asPath(NIOFileUtil.java:40)
	at org.seqdoop.hadoop_bam.BAMRecordReader.initialize(BAMRecordReader.java:140)
	at org.seqdoop.hadoop_bam.BAMInputFormat.createRecordReader(BAMInputFormat.java:121)
	at org.seqdoop.hadoop_bam.AnySAMInputFormat.createRecordReader(AnySAMInputFormat.java:190)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:180)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:179)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:134)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

For comparison, loading .vcf from google file system does not have issues.

@Georgehe4 Georgehe4 changed the title Issues loading BAM files in Google DFS Issues loading BAM files in Google FS Dec 4, 2017
@akmorrow13
Copy link
Contributor

@akmorrow13 akmorrow13 commented Dec 4, 2017

Similar issue to #1732 in s3

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Dec 4, 2017

Hi @Georgehe4! You need the Java NIO FileSystemProvider for the gs:// scheme on your classpaths. Google provides a NIO FileSystemProvider at https://github.com/GoogleCloudPlatform/google-cloud-java/tree/master/google-cloud-contrib/google-cloud-nio. If you need any guidance, @ryan-williams has experience using this and might be able to point you in the right direction.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Dec 4, 2017

@ryan-williams
Copy link
Member

@ryan-williams ryan-williams commented Dec 5, 2017

Haven't looked closely at this but step 1 is to include the NIO provider, as Frank said above. Here's a piece of documentation about that; you most likely want the "shaded" JAR, to avoid dep-version-conflicts with Spark or other things on your classpath. The latest version seems to be 0.30.0-alpha).

At that point, you may find that the NIO provider is still not being found; Scala does something to its classloaders that breaks custom-NIO-provider-detection (cf. scala/bug#10247). I've dealt with that by using my own Path class that wraps Java's, and which munges the NIO providers in various ways; relevant docs.

Hope that helps!

@Georgehe4
Copy link
Contributor Author

@Georgehe4 Georgehe4 commented Dec 5, 2017

The shaded jar seems to get the program closer to integrating with gs but there are some auth issues that need to be resolved first

https://pastebin.com/51YmLVHP

@Georgehe4
Copy link
Contributor Author

@Georgehe4 Georgehe4 commented Dec 22, 2017

Using the 0.22.0-alpha version of the shaded jar seems to work.

There are similar issues trying to pull from gs using gce vm's being tracked here: googleapis/google-cloud-java#2453

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Jan 21, 2018

Hi @Georgehe4! I know you've been working on this for bigdatagenomics/mango#340. When you're done downstream, would you mind pushing some of that info back upstream?

@Georgehe4
Copy link
Contributor Author

@Georgehe4 Georgehe4 commented Jan 21, 2018

Yep for sure 👍

@Georgehe4
Copy link
Contributor Author

@Georgehe4 Georgehe4 commented Feb 15, 2018

Got some time unblocked this week, I'll be working on a PR.

@fnothaft
Copy link
Member

@fnothaft fnothaft commented Mar 7, 2018

Resolved by #1918.

@fnothaft fnothaft closed this Mar 7, 2018
@fnothaft fnothaft added this to the 0.24.0 milestone Mar 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants