
HaplotypeCallerSpark crashes: 8 parallel GATK jobs (8 samples) using 5 Spark CPUs each on a node with 40 CPUs #5717

Open
bhanugandham opened this issue Feb 25, 2019 · 3 comments


@bhanugandham

commented Feb 25, 2019

User question: I'm trying to speed up the process of calling variants using Spark. I have access to a Slurm HPC cluster, so I guess it's not that straightforward to run GATK in a proper distributed master-slave architecture (if there is any tutorial on how to set up Slurm jobs to use GATK Spark tools on multiple nodes, I would appreciate it a lot).
Therefore, I run GATK in local mode with some Spark threads, and speed up the process further by parallelising the number of samples processed simultaneously with GNU parallel. But then I'm having trouble because some samples crash due to Spark errors. Perhaps you could send my logs to the developers? I'm trying to run 8 parallel GATK jobs (8 samples), using 5 Spark CPUs each, on a node with 40 CPUs.
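For reference, a single-node local-mode invocation along the lines the user describes might look as follows. This is a hedged sketch: the file names and heap size are placeholders, and `--spark-master 'local[5]'` is GATK's documented way of requesting 5 local Spark threads (Spark properties go after the `--` separator).

```shell
# Sketch of one of the 8 parallel jobs (paths and -Xmx value are placeholders).
# GATK passes Spark arguments after the "--" separator; 'local[5]' runs
# Spark in local mode with 5 worker threads.
gatk --java-options "-Xmx6g" HaplotypeCallerSpark \
    -R reference.fasta \
    -I sample1.bam \
    -O sample1.vcf.gz \
    -- --spark-master 'local[5]'
```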

Best,
Pedro

This issue was generated from your [forums] post.
[forums]: https://gatkforums.broadinstitute.org/gatk/discussion/comment/56193#Comment_56193

@bhanugandham

Author

commented Feb 25, 2019

gatk4_errorLines.log
parallel.log
hs_err_pid82632.log

Please find attached the error logs provided by the user.

@droazen

Collaborator

commented Feb 25, 2019

@tomwhite Can you please have a look?

@tomwhite

Collaborator

commented Feb 26, 2019

A few comments:

  • Spark job failures are usually memory related. What was the command run in this case?
  • Running HaplotypeCallerSpark is memory- and compute-intensive and should really be done on a cluster, such as a Google Cloud Dataproc cluster. There are some scripts to help with this process here: https://github.com/broadinstitute/gatk/tree/master/scripts/spark_eval. Even if you don't use the scripts, they contain settings for tuning memory, workers, etc., that might be helpful.
  • MarkDuplicatesSpark can be run effectively on a single multi-core machine, so it might be a good tool to start with to get into Spark.
  • Spark manages parallelism itself, so GNU parallel is not needed (and doesn't really work well with Spark).
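To illustrate the first point: with 8 concurrent JVMs on one node, the heap given to each job must fit within the node's physical memory, with headroom left for Spark's off-heap and native allocations. A back-of-the-envelope budget, with all numbers hypothetical (a 40-CPU node with 192 GB RAM is assumed; the user did not report their RAM):

```shell
# Hypothetical node: 40 CPUs, 192 GB RAM, 8 concurrent GATK Spark jobs.
NODE_MEM_GB=192
JOBS=8
OVERHEAD_GB=2   # per-job headroom for Spark off-heap and native memory

# Integer heap budget per job; this would become the -Xmx value
# passed via --java-options.
HEAP_GB=$(( NODE_MEM_GB / JOBS - OVERHEAD_GB ))
echo "per-job Java heap: ${HEAP_GB}g"
```

If the sum of the heaps plus overhead exceeds physical RAM, the kernel OOM killer or a JVM native-memory failure can produce exactly the kind of crash log attached above (hs_err_pid*.log).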