BAMHeader not set when running on a cluster #676

Closed
arahuja opened this Issue May 26, 2015 · 13 comments

@arahuja
Contributor

arahuja commented May 26, 2015

When running transform with a .bam file as the output, I see the following error:

Exception in thread "main" java.lang.AssertionError: assertion failed: Cannot return header if not attached.
        at scala.Predef$.assert(Predef.scala:179)
        at org.bdgenomics.adam.rdd.read.ADAMBAMOutputFormat$.getHeader(ADAMBAMOutputFormat.scala:60)
        at org.bdgenomics.adam.rdd.read.ADAMBAMOutputFormat.<init>(ADAMBAMOutputFormat.scala:68)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at java.lang.Class.newInstance(Class.java:379)
        at org.apache.spark.rdd.InstrumentedOutputFormat.<init>(InstrumentedOutputFormat.scala:33)
        at org.bdgenomics.adam.rdd.read.InstrumentedADAMBAMOutputFormat.<init>(ADAMBAMOutputFormat.scala:71)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at java.lang.Class.newInstance(Class.java:379)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:958)

I tried to fix this with a similar strategy to the one @fnothaft used for the VCF header (see https://github.com/bigdatagenomics/adam/compare/master...hammerlab:bam-save-header?expand=1), but it did not seem to take. Any ideas on this?
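
For context on what the assertion means: the ADAMBAMOutputFormat companion object keeps the SAM header in what is effectively JVM-local static state, and getHeader asserts that it has been populated. Setting it on the driver therefore does nothing for the executor JVMs, which are the ones that actually instantiate the output format during saveAsNewAPIHadoopDataset. Below is a minimal, hypothetical sketch (not ADAM's actual code) that reproduces this failure mode in isolation:

    import htsjdk.samtools.SAMFileHeader
    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative stand-in for the companion-object pattern: a header stashed
    // in a JVM-local static field. Names here are hypothetical.
    object StaticHeaderHolder {
      private var header: Option[SAMFileHeader] = None

      def setHeader(h: SAMFileHeader): Unit = { header = Some(h) }

      // Mirrors the failing assertion: throws if this JVM never saw setHeader().
      def getHeader: SAMFileHeader =
        header.getOrElse(throw new AssertionError("Cannot return header if not attached."))
    }

    object StaticHeaderDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("static-header-demo"))

        // Set on the driver JVM only...
        StaticHeaderHolder.setHeader(new SAMFileHeader())

        // ...but each executor runs its own JVM with its own copy of the
        // singleton, so this fails on a cluster (it happens to work in local
        // mode, where the driver and "executors" share one JVM).
        sc.parallelize(1 to 4, 4).foreach { _ =>
          StaticHeaderHolder.getHeader
        }
        sc.stop()
      }
    }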

@ryan-williams

Member

ryan-williams commented Jun 1, 2015

IIRC, @fnothaft and I could not convince ourselves that the VCF header impl worked when we looked at it about a year ago; he claimed it was working for him (on a cluster), but I couldn't get it to work.

@ryan-williams

Member

ryan-williams commented Jun 1, 2015

I just ran into this issue as well, btw.

@ryan-williams

Member

ryan-williams commented Jun 1, 2015

#353 has some discussion of the VCF version of this issue

@ryan-williams

Member

ryan-williams commented Jun 1, 2015

I'll try my hand at porting a similar fix today and see if I have any more luck

@ryan-williams

Member

ryan-williams commented Jun 1, 2015

Just tried with master...ryan-williams:sam-attempt; I still see the issue. I'll dig through the logs and see if I can verify whether the executors attempted to set the header.

@ryan-williams

Member

ryan-williams commented Jun 1, 2015

Heh, it seems the errors there came from double- and triple-setting the header on each executor (once per partition), e.g.:

java.lang.AssertionError: assertion failed: Cannot attach a new SAM header without first clearing the header.
        at scala.Predef$.assert(Predef.scala:179)
        at org.bdgenomics.adam.rdd.read.ADAMSAMOutputFormat$.addHeader(ADAMSAMOutputFormat.scala:41)
        at org.bdgenomics.adam.rdd.read.AlignmentRecordRDDFunctions$$anonfun$adamSAMSave$1$$anonfun$2.apply(AlignmentRecordRDDFunctions.scala:144)
        at org.bdgenomics.adam.rdd.read.AlignmentRecordRDDFunctions$$anonfun$adamSAMSave$1$$anonfun$2.apply(AlignmentRecordRDDFunctions.scala:134)

Forcibly clearing and then re-setting the header on each partition seems to do the trick, e.g.:

    val mp = rdd.mapPartitionsWithIndex((idx, iter) => {
      log.warn(s"Setting ${if (asSam) "SAM" else "BAM"} header for partition $idx")
      val header = bcastHeader.value
      synchronized {
        // perform a map-partition pass to ensure that the SAM/BAM header is set
        // on all nodes in the cluster; this mirrors the VCF fix discussed in:
        // https://github.com/bigdatagenomics/adam/issues/353

        asSam match {
          case true  =>
            // clear first: an executor that runs more than one partition would
            // otherwise trip the "Cannot attach a new SAM header" assertion
            ADAMSAMOutputFormat.clearHeader()
            ADAMSAMOutputFormat.addHeader(header)
            log.warn(s"Set SAM header for partition $idx")
          case false =>
            ADAMBAMOutputFormat.clearHeader()
            ADAMBAMOutputFormat.addHeader(header)
            log.warn(s"Set BAM header for partition $idx")
        }
      }
      // return an empty iterator; the count() below only exists to force this
      // side-effecting pass to run on every partition
      Iterator[Int]()
    }).count()

I'll open up a PR promptly.
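
For anyone adapting this: the snippet assumes a few names from the enclosing method (rdd, asSam, log, and bcastHeader). A hypothetical sketch of the driver-side piece it relies on, broadcasting the header once so every partition can read it locally, might look like the following. Header construction here is a stand-in, not ADAM's actual code, and it assumes the header type round-trips through the configured serializer (ADAM registers Kryo serializers):

    import htsjdk.samtools.SAMFileHeader
    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastHeaderSetup {
      // Stand-in for however the real save path assembles the header
      // (sequence dictionary, read groups, sort order, ...).
      def buildHeader(): SAMFileHeader = new SAMFileHeader()

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-header-demo"))

        // One broadcast on the driver; each partition in the snippet above then
        // reads bcastHeader.value from the executor-local broadcast cache
        // instead of shipping the header inside every task closure.
        val bcastHeader = sc.broadcast(buildHeader())

        // ... pass bcastHeader into the mapPartitionsWithIndex pass shown above ...
        sc.stop()
      }
    }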

ryan-williams added a commit to ryan-williams/adam that referenced this issue Jun 1, 2015

fix BAM/SAM header setting when writing on cluster
  fixes #676

- formatting / unnecessary `.toString` nits

- one more header fix

- whitespace nits

@fnothaft fnothaft closed this in #691 Jun 2, 2015

@ryan-williams

Member

ryan-williams commented Jun 4, 2015

Interestingly, I just saw a failure when writing a SAM file, so there may be some nondeterministic path by which this remains unfixed.

Here are the YARN container logs of all my executors.

This gist has the logs of the two executors where relevant things happened: IDs 161 and 69. The only log lines about headers come from these two executors, and they were running on the same host (out of 94 in the application); probably a coincidence?

I ran `rdd.coalesce(1).adamSAMSave(filename, asSam = true, asRegularFile = true)` on an ADAM build with #689 patched in.

Timeline:

Something for us to keep an eye on / think about possible explanations for!

@fnothaft

Member

fnothaft commented Jun 4, 2015

I'm not 100% sure (obviously) but my money is on Hadoop-BAM being borked.

@ryan-williams

Member

ryan-williams commented Aug 23, 2015

I thought I was seeing this again but decided the behavior is different enough that I'm going to file a new issue.

@ryan-williams

Member

ryan-williams commented May 16, 2016

Bad news: I ran into this again today.
Good news: I've figured out why it's (still!) happening!
Bad news: I don't know how to fix it.

In the no-op mapPartitions+count that we run to set the header on each executor, there is no guarantee that at least one task will be executed on each active executor. Furthermore, there is no guarantee that tasks will run on the same executors during the subsequent writing of the BAM.

Generally, tasks get placed to optimize locality, which is likely to be stable in some cases, but today I ran into trouble. The no-op count finishes in < 1s, while Spark waits on some (>1s?) timeout before it will schedule non-NODE_LOCAL tasks. That fallback did kick in during the latter (saveAsNewAPIHadoopFile) stage, so certain executors received tasks in the latter stage but not the former, resulting in quick task failures and then job failure.

Following are more details about my app, partitions, and executors that exposed this.

I was attempting to load a .bam that is stored on HDFS in 956 blocks spanning 91 nodes. My cluster has 94 nodes; two were offline when this BAM was stored on HDFS, and one's HDFS process is dead but it functions fine as a Spark node.

At the time of the no-op-count stage, I had 319 live (3-core) executors, 4 of which were on 2 of the nodes with no BAM blocks. The no-op-count ran its 956 tasks in < 1s, with NODE_LOCAL locality on all tasks.

Then, the saveAsNewAPIHadoopFile stage ran, scheduling 928 NODE_LOCAL tasks, close to the maximum possible of 945 (3 on each of the 315 executors with BAM blocks, minus the 17 executors that only got 2 tasks). 3-4s later, it scheduled the remaining 28 (956 - 928) tasks with RACK_LOCAL locality, on some of the executors that had only gotten 2 NODE_LOCAL tasks previously, as well as on the 4 executors on nodes that have no BAM blocks.

The latter 4 executors failed immediately due to not having the BAM header set, and then retried the failed tasks in such rapid succession that they hit my spark.task.maxFailures of 10 in under 1s and failed the job.

I haven't fully digested the implications of this or thought much about ways to fix it, but I'm going to reopen this as I think it's real and still live!
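
For reference, the ">1s?" timeout above is presumably Spark's delay-scheduling setting, spark.locality.wait (3s by default), which matches the 3-4s gap before the RACK_LOCAL tasks were scheduled. A hedged illustration of where that knob lives; this is context rather than a fix, since an executor that never runs a header-setting task will still fail:

    import org.apache.spark.SparkConf

    // spark.locality.wait is how long the scheduler waits for a NODE_LOCAL slot
    // before demoting a task to RACK_LOCAL and then ANY. It explains why the
    // sub-second no-op count never touched some executors while the longer save
    // stage did. (Older Spark releases expect the value in milliseconds, e.g. "3000".)
    val conf = new SparkConf()
      .setAppName("adam-transform")
      .set("spark.locality.wait", "3s")
      // per-level overrides also exist: spark.locality.wait.process / .node / .rack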

@ryan-williams ryan-williams reopened this May 16, 2016

@fnothaft

Member

fnothaft commented May 16, 2016

I think we have a 100% concrete, guaranteed fix in the single-file path. Specifically, see 7d4b409. It's an ugly fix: we write the header to HDFS as a file and then read it when creating a record reader. It should be straightforward to port over to the "normal" (write a BAM as many partitions) code path; I just haven't done it.
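
For readers following along, here is a minimal sketch of the approach described above, with hypothetical helper names (the real change is in the commit referenced above). It assumes htsjdk's SAMTextHeaderCodec for the text header and Hadoop's FileSystem API for the sidecar file:

    import java.io.OutputStreamWriter

    import htsjdk.samtools.{SAMFileHeader, SAMTextHeaderCodec}
    import htsjdk.samtools.util.BufferedLineReader
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Hypothetical helpers sketching the idea: the driver writes the header to a
    // sidecar file on HDFS, and the output format re-reads it on whichever
    // executor ends up needing it, so no JVM-local static state has to be
    // pre-populated on every executor.
    object HeaderSidecar {

      // Driver side: serialize the SAM header as text next to the output path.
      def writeHeader(header: SAMFileHeader, path: Path, conf: Configuration): Unit = {
        val fs = path.getFileSystem(conf)
        val writer = new OutputStreamWriter(fs.create(path, true))
        try {
          new SAMTextHeaderCodec().encode(writer, header)
        } finally {
          writer.close()
        }
      }

      // Executor side: called when the record writer/reader is created, instead
      // of consulting a static field that may never have been set in that JVM.
      def readHeader(path: Path, conf: Configuration): SAMFileHeader = {
        val fs = path.getFileSystem(conf)
        val in = fs.open(path)
        try {
          new SAMTextHeaderCodec().decode(new BufferedLineReader(in), path.toString)
        } finally {
          in.close()
        }
      }
    }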

@heuermh heuermh added this to the 0.20.0 milestone Jun 5, 2016

@heuermh heuermh referenced this issue Jun 7, 2016

Closed

Release ADAM version 0.20.0 #1048

@fnothaft

Member

fnothaft commented Jul 19, 2016

Resolved by #964.

@fnothaft fnothaft closed this Jul 19, 2016

@fnothaft fnothaft self-assigned this Aug 8, 2016

@fnothaft fnothaft reopened this Aug 8, 2016

@fnothaft

Member

fnothaft commented Aug 8, 2016

This isn't yet resolved for sharded files; I'll have a PR resolving it for sharded files this AM.

fnothaft added a commit to fnothaft/adam that referenced this issue Aug 8, 2016

[ADAM-676] Clean up header issues for sharded files.
Resolves #676. In #964, we resolved the "header not set" issues for single file
SAM/BAM output. This change propagates this fix to sharded SAM/BAM output, and
VCF.

fnothaft added a commit to fnothaft/adam that referenced this issue Aug 30, 2016

[ADAM-676] Clean up header issues for sharded files.
Resolves #676. In #964, we resolved the "header not set" issues for single file
SAM/BAM output. This change propagates this fix to sharded SAM/BAM output, and
VCF.

fnothaft added a commit to fnothaft/adam that referenced this issue Sep 6, 2016

[ADAM-676] Clean up header issues for sharded files.
Resolves #676. In #964, we resolved the "header not set" issues for single file
SAM/BAM output. This change propagates this fix to sharded SAM/BAM output, and
VCF.

@heuermh heuermh closed this in 48caa3f Sep 7, 2016
