
BaseRecalibratorSpark crash: too many open files #6578

Open
riederd opened this issue Apr 29, 2020 · 8 comments

@riederd

riederd commented Apr 29, 2020

Hi,

I'm trying to run BaseRecalibratorSpark (gatk-4.1.7.0), but I'm hitting a problem where the process crashes with a "too many open files" error.

Here is my command:

ulimit -n 4096
/usr/local/bioinf/gatk/gatk-4.1.7.0/gatk BaseRecalibratorSpark \
  --java-options '-Xmx64G' \
  --tmp-dir /local/scratch/rieder \
  -I Normal_fixed.bam \
  -R GRCh38.d1.vd1.fa \
  -L S07604514_Regions_merged_padded.interval_list \
  -O P45507_normal_bqsr.table \
  --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
  --known-sites Homo_sapiens_assembly38.known_indels.vcf \
  --known-sites Mills_and_1000G_gold_standard.indels.hg38.vcf \
  --spark-master local[8] \
  --conf 'spark.executor.cores=8' \
  --conf 'spark.local.dir=/local/scratch/rieder'
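
For anyone reproducing this, descriptor usage of the driver JVM can be watched while the job runs. A rough sketch, assuming a Linux /proc filesystem and that pgrep matches only the GATK driver process:

# hypothetical monitor: count open file descriptors of the driver every 10 s
PID=$(pgrep -f BaseRecalibratorSpark | head -n1)
while kill -0 "$PID" 2>/dev/null; do
  echo "$(date +%T) open fds: $(ls /proc/$PID/fd | wc -l)"
  sleep 10
done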

Here is the (hopefully) relevant log extract:

20/04/29 01:51:51 INFO TaskSetManager: Finished task 578.0 in stage 0.0 (TID 578) in 2720 ms on localhost (executor driver) (576/1585)
20/04/29 01:51:51 INFO NewHadoopRDD: Input split: file:/data/projects/2019/NeoAG/VCF-phasing/work/b8/d8dc550b7ba5f57d935d04e27b756a/Normal_fixed.bam:19562233856+33554432
01:51:51.374 INFO FeatureManager - Using codec VCFCodec to read file file:///local/scratch/rieder/spark-bb59423b-0368-4de5-85e0-e6641fb25380/userFiles-a91d5958-33f5-4685-bf9d-c8fc0924f7c6/Homo_sapiens_assembly38.dbsnp138.vcf
01:51:51.431 INFO FeatureManager - Using codec VCFCodec to read file file:///local/scratch/rieder/spark-bb59423b-0368-4de5-85e0-e6641fb25380/userFiles-a91d5958-33f5-4685-bf9d-c8fc0924f7c6/Homo_sapiens_assembly38.known_indels.vcf
01:51:51.451 INFO FeatureManager - Using codec VCFCodec to read file file:///local/scratch/rieder/spark-bb59423b-0368-4de5-85e0-e6641fb25380/userFiles-a91d5958-33f5-4685-bf9d-c8fc0924f7c6/Mills_and_1000G_gold_standard.indels.hg38.vcf
01:51:51.457 INFO FeatureManager - Using codec VCFCodec to read file file:///local/scratch/rieder/spark-bb59423b-0368-4de5-85e0-e6641fb25380/userFiles-a91d5958-33f5-4685-bf9d-c8fc0924f7c6/Homo_sapiens_assembly38.dbsnp138.vcf
01:51:51.507 INFO BaseRecalibrationEngine - The covariates being used here:
01:51:51.507 INFO BaseRecalibrationEngine - ReadGroupCovariate
01:51:51.507 INFO BaseRecalibrationEngine - QualityScoreCovariate
01:51:51.507 INFO BaseRecalibrationEngine - ContextCovariate
01:51:51.507 INFO BaseRecalibrationEngine - CycleCovariate
01:51:51.517 INFO FeatureManager - Using codec VCFCodec to read file file:///local/scratch/rieder/spark-bb59423b-0368-4de5-85e0-e6641fb25380/userFiles-a91d5958-33f5-4685-bf9d-c8fc0924f7c6/Homo_sapiens_assembly38.known_indels.vcf
20/04/29 01:51:51 ERROR Executor: Exception in task 581.0 in stage 0.0 (TID 581)
org.broadinstitute.hellbender.exceptions.GATKException: Error initializing feature reader for path /local/scratch/rieder/spark-bb59423b-0368-4de5-85e0-e6641fb25380/userFiles-a91d5958-33f5-4685-bf9d-c8fc0924f7c6/Homo_sapiens_assembly38.known_indels.vcf
at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:383)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getFeatureReader(FeatureDataSource.java:335)
at org.broadinstitute.hellbender.engine.FeatureDataSource.&lt;init&gt;(FeatureDataSource.java:282)
at org.broadinstitute.hellbender.engine.FeatureDataSource.&lt;init&gt;(FeatureDataSource.java:238)
at org.broadinstitute.hellbender.engine.FeatureDataSource.&lt;init&gt;(FeatureDataSource.java:222)
at org.broadinstitute.hellbender.utils.spark.JoinReadsWithVariants.openFeatureSource(JoinReadsWithVariants.java:63)
at org.broadinstitute.hellbender.utils.spark.JoinReadsWithVariants.lambda$null$0(JoinReadsWithVariants.java:44)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.broadinstitute.hellbender.utils.spark.JoinReadsWithVariants.lambda$join$60e5b476$1(JoinReadsWithVariants.java:44)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:186)
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:186)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: htsjdk.tribble.TribbleException$MalformedFeatureFile: Unable to parse header with error: /local/scratch/rieder/spark-bb59423b-0368-4de5-85e0-e6641fb25380/userFiles-a91d5958-33f5-4685-bf9d-c8fc0924f7c6/Homo_sapiens_assembly38.known_indels.vcf: Too many open files, for input source: /local/scratch/rieder/spark-bb59423b-0368-4de5-85e0-e6641fb25380/userFiles-a91d5958-33f5-4685-bf9d-c8fc0924f7c6/Homo_sapiens_assembly38.known_indels.vcf
at htsjdk.tribble.TribbleIndexedFeatureReader.readHeader(TribbleIndexedFeatureReader.java:263)
at htsjdk.tribble.TribbleIndexedFeatureReader.&lt;init&gt;(TribbleIndexedFeatureReader.java:102)
at htsjdk.tribble.TribbleIndexedFeatureReader.&lt;init&gt;(TribbleIndexedFeatureReader.java:127)
at htsjdk.tribble.AbstractFeatureReader.getFeatureReader(AbstractFeatureReader.java:121)
at org.broadinstitute.hellbender.engine.FeatureDataSource.getTribbleFeatureReader(FeatureDataSource.java:380)
... 39 more

How many files does it need to open? As a regular user I cannot raise the ulimit above 4096 files.
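
For reference, on Linux the soft limit can only be raised up to the hard limit without administrator help. A sketch of the usual checks (the limits.conf line is a hypothetical entry an admin would add, not something applied here):

ulimit -Sn       # current soft limit for open files
ulimit -Hn       # hard limit; a non-root user cannot go above this
ulimit -n 4096   # raises the soft limit, capped at the hard limit
# raising the hard limit itself needs root, e.g. a line in /etc/security/limits.conf:
#   rieder  hard  nofile  65536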

Best
Dietmar

@lbergelson
Member

Huh. It sounds like there must be a file handle leak we didn't know about. I wonder why you're seeing it now for the first time...
@cmnbroad Could changes to the Path/GATKPathSpecifier code possibly have any side effects of losing file handles somewhere?
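
One way to test the leak hypothesis while a run is failing: count how many descriptors the driver JVM holds on a single known-sites file. If that count climbs with the task number, readers are being opened but never closed. A rough sketch, with the VCF name taken from the log above and pgrep assumed to find the driver:

PID=$(pgrep -f BaseRecalibratorSpark | head -n1)
lsof -p "$PID" | grep -c known_indels.vcf
# or, without lsof, via /proc:
ls -l /proc/$PID/fd | grep -c known_indels.vcf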

@riederd
Author

riederd commented Apr 29, 2020

...I hadn't run the Spark version for quite some time, and it might also depend on the size of the BAM file being processed.

@lbergelson
Member

I meant more "why anyone is seeing it now for the first time" not you specifically :)

Are you running a particularly giant bam file?

@riederd
Author

riederd commented Apr 29, 2020

Not really, it's "just" 50697 MB.

@lbergelson
Member

That's big but not crazy.

@riederd
Author

riederd commented Apr 29, 2020

It's from WGS (whole-genome sequencing).

@cmnbroad
Collaborator

cmnbroad commented May 4, 2020

Not sure if/how GATKPathSpecifier could cause this. @riederd Have you ever run this same command with a previous GATK version (i.e., 4.1.6.0)? It would be super helpful to know if 4.1.6.0 doesn't have this problem. Also, the only GATKPathSpecifier changes in 4.1.7.0 were for reference files.

I've long suspected that we have a file handle leak somewhere, since I encounter it when running tests locally, but have never been able to track it down.

@riederd
Author

riederd commented May 4, 2020

4.1.4.1 didn't have the problem, at least I don't remember seeing it. I didn't use 4.1.6.0 much because of other issues with Mutect2.

HTH
