Unicycler 0.4.6.0 failing at PSC Bridges, Trinity runs fine #185

Closed
jennaj opened this issue Jan 9, 2019 · 10 comments
Assignees
Labels
bug functionality usegalaxy.org tool/dependency/function fix usegalaxy.org test/retest-pass passed retest

Comments

@jennaj
Member

jennaj commented Jan 9, 2019

Workaround for end-users: Use the Galaxy EU https://usegalaxy.eu server until the Galaxy Main https://usegalaxy.org server is fixed and this ticket closed out.


The important part of the error appears after the prep steps, when the assembly is actually starting:

libgomp: Thread creation failed: Resource temporarily unavailable


Tool ID: toolshed.g2.bx.psu.edu/repos/iuc/unicycler/unicycler/0.4.6.0

Test histories usegalaxy.ORG

Test history usegalaxy.EU (for comparison) -- Update: Tool works at EU

GUI "Bug" message

screen shot 2019-01-09 at 11 40 42 am

GUI "Info" message

screen shot 2019-01-09 at 11 41 20 am

test1 full error report

Job Execution and Failure Information
Command Line
None
stderr
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
Error: SPAdes crashed! Please view spades.log for more information.

stdout

Starting Unicycler (2019-01-09 11:28:19)
    Welcome to Unicycler, an assembly pipeline for bacterial genomes. Since you
provided both short and long reads, Unicycler will perform a hybrid assembly.
It will first use SPAdes to make a short-read assembly graph, and then it will
use various methods to scaffold that graph with the long reads.
    For more information, please see https://github.com/rrwick/Unicycler

Command: /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/unicycler -t 11 -o ./ --verbosity 3 --pilon_path /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/share/pilon-1.22-1/pilon-1.22.jar -1 fq1.fastq -2 fq2.fastq -l lr.fastq --mode normal --min_fasta_length 100 --linear_seqs 0 --min_kmer_frac 0.2 --max_kmer_frac 0.95 --kmer_count 10 --depth_filter 0.25 --start_gene_id 90.0 --start_gene_cov 95.0 --min_polish_size 1000 --min_component_size 1000 --min_dead_end_size 1000 --scores 

Unicycler version: v0.4.6
Using 11 threads

The output directory already exists and files may be reused or overwritten:
  /pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working

Bridging mode: normal
  using default normal bridge quality cutoff: 10.00

Dependencies:
  Program         Version             Status     Path                                                        
  spades.py       3.12.0              good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/sp
                                                 ades.py                                                     
  racon           1.3.1               good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/ra
                                                 con                                                         
  makeblastdb     2.5.0+              good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/ma
                                                 keblastdb                                                   
  tblastn         2.5.0+              good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/tb
                                                 lastn                                                       
  bowtie2-build   2.3.4.3             good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/bo
                                                 wtie2-build                                                 
  bowtie2         2.3.4.3             good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/bo
                                                 wtie2                                                       
  samtools        1.9                 good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/sa
                                                 mtools                                                      
  java            1.8.0_152-release   good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/ja
                                                 va                                                          
  pilon           1.22                good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/share/
                                                 pilon-1.22-1/pilon-1.22.jar                                 
  bcftools                            not used                                                               


SPAdes read error correction (2019-01-09 11:29:25)
    Unicycler uses the SPAdes read error correction module to reduce the number
of errors in the short read before SPAdes assembly. This can make the assembly
faster and simplify the assembly graph structure.

Command: /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/spades.py -1 /pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/fq1.fastq -2 /pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/fq2.fastq -o /pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/spades_assembly/read_correction --threads 11 --only-error-correction

System information:
  SPAdes version: 3.12.0
  Python version: 3.6.6
  OS: Linux-3.10.0-693.21.1.el7.x86_64-x86_64-with-centos-7.4.1708-Core
Output dir: /pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/spades_assembly/read_correction
Mode: ONLY read error correction (without assembling)
Debug mode is turned OFF
Dataset parameters:
  Multi-cell mode (you should set '--sc' flag if input data was obtained with MDA (single-cell) technology or --meta flag if processing metagenomic dataset)
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/fq1.fastq']
      right reads: ['/pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/fq2.fastq']
      interlaced reads: not specified
      single reads: not specified
      merged reads: not specified
Read error correction parameters:
  Iterations: 1
  PHRED offset will be auto-detected
  Corrected reads will be compressed
Other parameters:
  Dir for temp files: /pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/spades_assembly/read_correction/tmp
  Threads: 11
  Memory limit (in Gb): 250
======= SPAdes pipeline started. Log can be found here: /pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/spades_assembly/read_correction/spades.log
===== Read error correction started.
== Running read error correction tool: /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/share/spades-3.12.0-1/bin/spades-hammer /pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/spades_assembly/read_correction/corrected/configs/config.info
  0:00:00.113     4M / 4M    INFO    General                 (main.cpp                  :  75)   Starting BayesHammer, built from N/A, git revision N/A
  0:00:00.396     4M / 4M    INFO    General                 (main.cpp                  :  76)   Loading config from /pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/spades_assembly/read_correction/corrected/configs/config.info
  0:00:01.056     4M / 4M    INFO    General                 (main.cpp                  :  78)   Maximum # of threads to use (adjusted due to OMP capabilities): 11
  0:00:01.056     4M / 4M    INFO    General                 (memory_limit.cpp          :  49)   Memory limit set to 250 Gb
  0:00:01.056     4M / 4M    INFO    General                 (main.cpp                  :  86)   Trying to determine PHRED offset
  0:00:01.172     4M / 4M    INFO    General                 (main.cpp                  :  92)   Determined value is 33
  0:00:01.172     4M / 4M    INFO    General                 (hammer_tools.cpp          :  36)   Hamming graph threshold tau=1, k=21, subkmer positions = [ 0 10 ]
  0:00:01.172     4M / 4M    INFO    General                 (main.cpp                  : 113)   Size of aux. kmer data 24 bytes
     === ITERATION 0 begins ===
  0:00:01.225     4M / 4M    INFO   K-mer Index Building     (kmer_index_builder.hpp    : 301)   Building kmer index
  0:00:01.225     4M / 4M    INFO    General                 (kmer_index_builder.hpp    : 117)   Splitting kmer instances into 176 files using 11 threads. This might take a while.
  0:00:01.245     4M / 4M    INFO    General                 (file_limit.hpp            :  32)   Open file limit set to 51200
  0:00:01.245     4M / 4M    INFO    General                 (kmer_splitters.hpp        :  89)   Memory available for splitting buffers: 7.57564 Gb
  0:00:01.245     4M / 4M    INFO    General                 (kmer_splitters.hpp        :  97)   Using cell size of 381300
  0:00:02.292     6G / 6G    INFO   K-mer Splitting          (kmer_data.cpp             :  97)   Processing /pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/fq1.fastq
libgomp: Thread creation failed: Resource temporarily unavailable
== Error ==  system call for: "['/pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/share/spades-3.12.0-1/bin/spades-hammer', '/pylon5/mc48nsp/xcgalaxy/main/staging/21960093/working/spades_assembly/read_correction/corrected/configs/config.info']" finished abnormally, err code: 1
In case you have troubles running SPAdes, you can write to spades.support@cab.spbu.ru
or report an issue on our GitHub repository github.com/ablab/spades
Please provide us with params.txt and spades.log files from the output directory.


Job Information
Remote job server indicated a problem running or monitoring this job.
Job Traceback
None
This is an automated message. Do not reply to this address.

test2 full error report

Job Execution and Failure Information
Command Line
None
stderr
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
Error: SPAdes crashed! Please view spades.log for more information.

stdout

Starting Unicycler (2019-01-09 11:24:43)
    Welcome to Unicycler, an assembly pipeline for bacterial genomes. Since you
provided both short and long reads, Unicycler will perform a hybrid assembly.
It will first use SPAdes to make a short-read assembly graph, and then it will
use various methods to scaffold that graph with the long reads.
    For more information, please see https://github.com/rrwick/Unicycler

Command: /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/unicycler -t 11 -o ./ --verbosity 3 --pilon_path /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/share/pilon-1.22-1/pilon-1.22.jar -1 fq1.fastq -2 fq2.fastq -l lr.fastq --mode normal --min_fasta_length 100 --linear_seqs 0 --min_kmer_frac 0.2 --max_kmer_frac 0.95 --kmer_count 10 --depth_filter 0.25 --start_gene_id 90.0 --start_gene_cov 95.0 --min_polish_size 1000 --min_component_size 1000 --min_dead_end_size 1000 --scores 

Unicycler version: v0.4.6
Using 11 threads

The output directory already exists and files may be reused or overwritten:
  /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working

Bridging mode: normal
  using default normal bridge quality cutoff: 10.00

Dependencies:
  Program         Version             Status     Path                                                        
  spades.py       3.12.0              good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/sp
                                                 ades.py                                                     
  racon           1.3.1               good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/ra
                                                 con                                                         
  makeblastdb     2.5.0+              good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/ma
                                                 keblastdb                                                   
  tblastn         2.5.0+              good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/tb
                                                 lastn                                                       
  bowtie2-build   2.3.4.3             good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/bo
                                                 wtie2-build                                                 
  bowtie2         2.3.4.3             good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/bo
                                                 wtie2                                                       
  samtools        1.9                 good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/sa
                                                 mtools                                                      
  java            1.8.0_152-release   good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/ja
                                                 va                                                          
  pilon           1.22                good       /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/share/
                                                 pilon-1.22-1/pilon-1.22.jar                                 
  bcftools                            not used                                                               


SPAdes read error correction (2019-01-09 11:24:48)
    Unicycler uses the SPAdes read error correction module to reduce the number
of errors in the short read before SPAdes assembly. This can make the assembly
faster and simplify the assembly graph structure.

Command: /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/spades.py -1 /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/fq1.fastq -2 /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/fq2.fastq -o /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/spades_assembly/read_correction --threads 11 --only-error-correction

System information:
  SPAdes version: 3.12.0
  Python version: 3.6.6
  OS: Linux-3.10.0-693.21.1.el7.x86_64-x86_64-with-centos-7.4.1708-Core
Output dir: /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/spades_assembly/read_correction
Mode: ONLY read error correction (without assembling)
Debug mode is turned OFF
Dataset parameters:
  Multi-cell mode (you should set '--sc' flag if input data was obtained with MDA (single-cell) technology or --meta flag if processing metagenomic dataset)
  Reads:
    Library number: 1, library type: paired-end
      orientation: fr
      left reads: ['/pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/fq1.fastq']
      right reads: ['/pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/fq2.fastq']
      interlaced reads: not specified
      single reads: not specified
      merged reads: not specified
Read error correction parameters:
  Iterations: 1
  PHRED offset will be auto-detected
  Corrected reads will be compressed
Other parameters:
  Dir for temp files: /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/spades_assembly/read_correction/tmp
  Threads: 11
  Memory limit (in Gb): 250
======= SPAdes pipeline started. Log can be found here: /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/spades_assembly/read_correction/spades.log
===== Read error correction started.
== Running read error correction tool: /pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/share/spades-3.12.0-1/bin/spades-hammer /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/spades_assembly/read_correction/corrected/configs/config.info
  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  75)   Starting BayesHammer, built from N/A, git revision N/A
  0:00:00.000     4M / 4M    INFO    General                 (main.cpp                  :  76)   Loading config from /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/spades_assembly/read_correction/corrected/configs/config.info
  0:00:00.009     4M / 4M    INFO    General                 (main.cpp                  :  78)   Maximum # of threads to use (adjusted due to OMP capabilities): 11
  0:00:00.009     4M / 4M    INFO    General                 (memory_limit.cpp          :  49)   Memory limit set to 250 Gb
  0:00:00.009     4M / 4M    INFO    General                 (main.cpp                  :  86)   Trying to determine PHRED offset
  0:00:00.015     4M / 4M    INFO    General                 (main.cpp                  :  92)   Determined value is 33
  0:00:00.015     4M / 4M    INFO    General                 (hammer_tools.cpp          :  36)   Hamming graph threshold tau=1, k=21, subkmer positions = [ 0 10 ]
  0:00:00.015     4M / 4M    INFO    General                 (main.cpp                  : 113)   Size of aux. kmer data 24 bytes
     === ITERATION 0 begins ===
  0:00:00.030     4M / 4M    INFO   K-mer Index Building     (kmer_index_builder.hpp    : 301)   Building kmer index
  0:00:00.030     4M / 4M    INFO    General                 (kmer_index_builder.hpp    : 117)   Splitting kmer instances into 176 files using 11 threads. This might take a while.
  0:00:00.036     4M / 4M    INFO    General                 (file_limit.hpp            :  32)   Open file limit set to 51200
  0:00:00.036     4M / 4M    INFO    General                 (kmer_splitters.hpp        :  89)   Memory available for splitting buffers: 7.57564 Gb
  0:00:00.036     4M / 4M    INFO    General                 (kmer_splitters.hpp        :  97)   Using cell size of 381300
  0:00:01.068     6G / 6G    INFO   K-mer Splitting          (kmer_data.cpp             :  97)   Processing /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/fq1.fastq
libgomp: Thread creation failed: Resource temporarily unavailable
== Error ==  system call for: "['/pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/share/spades-3.12.0-1/bin/spades-hammer', '/pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/spades_assembly/read_correction/corrected/configs/config.info']" finished abnormally, err code: 1
In case you have troubles running SPAdes, you can write to spades.support@cab.spbu.ru
or report an issue on our GitHub repository github.com/ablab/spades
Please provide us with params.txt and spades.log files from the output directory.

Job Information
Remote job server indicated a problem running or monitoring this job.
Job Traceback
None
This is an automated message. Do not reply to this address.
@jennaj jennaj added bug functionality usegalaxy.org tool/dependency/function fix usegalaxy.org labels Jan 9, 2019
@jennaj
Member Author

jennaj commented Jan 14, 2019

ping @natefoo -- any updates about what may be going wrong?

@eringaf

eringaf commented Jan 21, 2019

Still getting a server error. Any updates?

@jennaj
Member Author

jennaj commented Jan 22, 2019

@eringaf We just added this issue to our next weekly list of priorities. I would expect a solution near-term, possibly even sometime next week. We'll post updates here. Thanks!

@jennaj
Member Author

jennaj commented Feb 4, 2019

Status: still failing

@jennaj
Member Author

jennaj commented Feb 11, 2019

xref prior issue ticket (PSC Bridges problems impacting both Trinity & Unicycler) to avoid end-user confusion about Unicycler's most current status: #176

@natefoo
Member

natefoo commented Feb 11, 2019

EDIT: struck some stuff that's incorrect lest I mislead future readers. See followup comments.

This is an issue with the resources configured for SPAdes by Unicycler, how SPAdes uses those limits, and Slurm on Bridges controlling memory usage with cgroups. Essentially, SPAdes is running out of memory, but the causes are very unintuitive, are specific to cgroup memory limits, and may be difficult to fix without some tool hacking. Because as it turns out, SPAdes is not actually running out of memory, but it's doing things that make the kernel think that it is.

The libgomp: Thread creation failed: Resource temporarily unavailable error occurs when the number of threads allocated for spades-hammer is too high but its memory limit is too low. In the default case, we are running Unicycler on Bridges with 480 GB of memory, which allocates 11 cores [1] to the job. However, Unicycler does not have a memory option and does not configure memory for SPAdes, so SPAdes uses its default limit of 250 GB. It then sets this for itself using setrlimit(2):

21059 setrlimit(RLIMIT_AS, {rlim_cur=262144000*1024, rlim_max=RLIM64_INFINITY}) = 0
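As a quick sanity check on that call: strace prints `rlim_cur` as a count of KiB multiplied by 1024, and the value works out to the 250 GB SPAdes default mentioned above. A minimal sketch of the conversion (the helper name is hypothetical, not part of SPAdes or Unicycler):

```python
KIB = 1024
GIB = 1024 ** 3


def rlimit_as_gib(kib_units: int) -> float:
    """Convert strace's `N*1024` notation (N counted in KiB) to GiB."""
    return kib_units * KIB / GIB


# Value taken from the setrlimit(RLIMIT_AS, ...) line in the trace.
spades_limit_gib = rlimit_as_gib(262144000)
print(spades_limit_gib)  # 250.0
```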

Shortly afterward, it mmap(2)s 30 GB of anonymous memory and then clone(2)s a new thread (note CLONE_THREAD in the flags), once for each of the 11 threads:

21059 mmap(NULL, 32212258816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fab837ff000
21059 mprotect(0x7fab837ff000, 4096, PROT_NONE) = 0
21059 clone(child_stack=0x7fb3037fedd0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fb3037ff9d0, tls=0x7fb3037ff700, child_tidptr=0x7fb3037ff9d0) = 21943

This works up until the kernel returns ENOMEM, and then spades-hammer exits.

21059 <... mmap resumed> )              = -1 ENOMEM (Cannot allocate memory)
21059 write(2, "\nlibgomp: ", 10)       = 10
21059 write(2, "Thread creation failed: Resource"..., 56 <unfinished ...>
21946 futex(0x7fb70403c404, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>

On a typical Linux system where cgroup memory limits are not enabled, the kernel would not care that the process has requested 330 GB of memory; it would simply say "ok" and then start OOM-killing once the processes actually tried to use more memory than exists in the system. With cgroup memory limits, the enforcement is done up front. The cgroup docs discuss the specific case of anonymous mmaps in section 4.1.
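The 330 GB figure is simply 11 mappings of roughly 30 GB each. Comparing it with the RLIMIT_AS value from the setrlimit call shown earlier gives a rough idea of how many of those mappings can fit. This is only a back-of-the-envelope sketch: it ignores the address space the process already uses for everything else, so the real ceiling can only be lower.

```python
GIB = 1024 ** 3

# Numbers copied from the strace output in this comment.
stack_mmap_bytes = 32212258816      # length of each anonymous mmap
threads = 11                        # threads requested via -t 11
rlimit_as_bytes = 262144000 * 1024  # rlim_cur passed to setrlimit(RLIMIT_AS)

total_requested = stack_mmap_bytes * threads
max_mappings = rlimit_as_bytes // stack_mmap_bytes

print(round(total_requested / GIB))  # 330
print(max_mappings)                  # 8
```

Under these assumptions at most 8 of the 11 mappings fit below the limit, which is consistent with thread creation working for a while and then failing with ENOMEM.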

Pulling apart Unicycler, you can run its call to SPAdes by hand and set --memory, e.g.:

/pylon5/mc48nsp/xcgalaxy/conda/envs/__unicycler@0.4.6/bin/spades.py -1 /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/fq1.fastq -2 /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/fq2.fastq -o /pylon5/mc48nsp/xcgalaxy/main/staging/21960054/working/spades_assembly/read_correction --threads 11 --only-error-correction --memory 400

Which gets it a little further, but it ends in a segfault:

  0:00:17.599     8G / 9G    INFO   K-mer Splitting          (kmer_data.cpp             : 107)   Processed 667035 reads
  0:00:17.603     8G / 9G    INFO   K-mer Splitting          (kmer_data.cpp             :  97)   Processing /pylon5/mc48nsp/xcgalaxy/main/staging/22315907/working/fq2.fastq.gz
  0:00:28.407     8G / 9G    INFO   K-mer Splitting          (kmer_data.cpp             : 107)   Processed 1334070 reads
  0:00:28.408     8G / 9G    INFO   K-mer Splitting          (kmer_data.cpp             : 112)   Total 1334070 reads processed
  0:00:29.427    64M / 9G    INFO    General                 (kmer_index_builder.hpp    : 120)   Starting k-mer counting.
  0:00:31.903    64M / 9G    INFO    General                 (kmer_index_builder.hpp    : 127)   K-mer counting done. There are 99273534 kmers in total. 
  0:00:31.903    64M / 9G    INFO    General                 (kmer_index_builder.hpp    : 133)   Merging temporary buckets.
  0:00:37.347    64M / 9G    INFO   K-mer Index Building     (kmer_index_builder.hpp    : 314)   Building perfect hash indices
The program was terminated by segmentation fault
=== Stack Trace ===
The program was terminated by segmentation fault
=== Stack Trace ===
[0x40a709]
[0x40a876]
[0x5b3220]
[0x5ac033]
[0x449e8a]
[0x44aefb]
[0x59a7ae]
[0x5ab8c4]
[0x60b969]
The program was terminated by segmentation fault
[0x40a709]
[0x40a876]
[0x5b3220]
[0x5ac033]
[0x449e8a]
[0x44aefb]=== Stack Trace ===

[0x59a7ae]
[0x5ab8c4]
[0x60b969]

Digging through the syscalls at this point shows it's roughly the same thing, this time the child processes are mmaping 30 GB until the kernel ENOMEMs, which is not handled, and spades-hammer segfaults and dies.

If you increase the amount of memory given to spades-hammer (which is not possible with Unicycler) and/or reduce the number of parallel SPAdes processes, it'll work, but in both cases, it's going to be incredibly inefficient - you'll have to run with fewer threads than you could be (making it slower) and/or request way more memory on Bridges than you will actually use (wasting SUs).

The real solution is to figure out why SPAdes is doing such large anonymous mmaps, and if/how to stop it. Perhaps it's already addressed, but I don't see anything in the changelog for the latest version. I'll have a deeper look into SPAdes tomorrow.

[1] Not sure why it's 11, as per the Bridges User Guide, the LM partition should be allocating 1 core per 48 GB of memory requested, so I expect it to be 10.

@suzukimicro

Thank you for your precious help and support. Really hope this problem could be fixed soon. I have a lot of sequences to be analyzed by Unicycler...

@jennaj
Member Author

jennaj commented Feb 12, 2019

thanks @natefoo for looking into this

@suzukimicro Please see the workaround at the very top of this ticket. In short, the tool is functional at Galaxy EU https://usegalaxy.eu based on tests with smaller sample data. Please consider running your jobs there for now.

@natefoo
Member

natefoo commented Feb 12, 2019

Update:

It looks like my previous conclusion was partially incorrect: cgroups are unrelated to the problem. With the cgroup memory limits as they are set on Bridges, the OS will happily allocate an amount of virtual memory larger than the limit. The issue is in fact with SPAdes' usage of setrlimit() itself. On Bridges LM nodes, the size of the mmaps is much larger than on other systems, such as a Bridges login node.

lm002 (3TB, 64 cores):

mmap(NULL, 32212258816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = -1 ENOMEM (Cannot allocate memory)

login005 (128GB, 28 cores):

mmap(NULL, 25169920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fd007bfe000

In both cases, RLIMIT_AS is set to the value of SPAdes' --memory option (or 250GB or the total amount of memory on the system, whichever is smaller), but that limit is not reached on login005, which is mmaping far smaller memory segments. In both cases, the size is right around 0.0001x the total system memory.

I thought maybe the issue was with SPAdes' use of jemalloc, since it bases the number of "arenas" on the number of cores (although not the total system memory, as far as I can tell), but after compiling SPAdes without jemalloc (-DSPADES_USE_JEMALLOC:BOOL=OFF), there's no change to the size of the mmaps.

So, I still need to figure out what is responsible for the size of those mmaps, which unfortunately my gdb-foo has failed me at so far. The threads are actually created via the use of BBHash aka BooPHF.h, so this is probably a place to look. Or maybe this is just the default behavior of pthreads?

@natefoo
Member

natefoo commented Feb 12, 2019

> Or maybe this is just the default behavior of pthreads?

It's RLIMIT_STACK (aka ulimit -s):

login005:

$ ulimit -s
24576
$ strace -e trace=getrlimit,mmap spades-hammer ...
getrlimit(RLIMIT_STACK, {rlim_cur=24576*1024, rlim_max=4194304*1024}) = 0
mmap(NULL, 25169920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fd007bfe000

lm002:

$ ulimit -s
31457280
$ strace -e trace=getrlimit,mmap spades-hammer ...
getrlimit(RLIMIT_STACK, {rlim_cur=31457280*1024, rlim_max=31457280*1024}) = 0
mmap(NULL, 32212258816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = -1 ENOMEM (Cannot allocate memory)
$ ulimit -s 24576
$ strace -e trace=getrlimit,mmap spades-hammer ...
getrlimit(RLIMIT_STACK, {rlim_cur=24576*1024, rlim_max=24576*1024}) = 0
mmap(NULL, 25169920, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fb226783000
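The numbers in these traces line up exactly: on both hosts the mmap length equals the soft RLIMIT_STACK (reported by `ulimit -s` in KiB) plus 4096 bytes. The extra 4096 bytes match the single page that the earlier trace marks PROT_NONE with mprotect; reading that as a guard page is an assumption drawn from the trace, not something confirmed in this thread. A small check of the arithmetic:

```python
KIB = 1024

# (ulimit -s value in KiB, mmap length in bytes), copied from the traces above.
observations = {
    "login005": (24576, 25169920),
    "lm002": (31457280, 32212258816),
}


def extra_bytes(stack_kib: int, mmap_len: int) -> int:
    """Bytes mapped beyond the soft stack limit."""
    return mmap_len - stack_kib * KIB


extras = {host: extra_bytes(*obs) for host, obs in observations.items()}
print(extras)  # {'login005': 4096, 'lm002': 4096}
```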

We can fix this by setting ulimit -s for the Bridges destination, but oddly enough, the PSC Trinity on Blacklight/Bridges documentation suggests setting ulimit -s unlimited. So we may need to create separate destinations with differing values of ulimit -s for Trinity and Unicycler.
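For anyone reproducing this outside Galaxy, the same effect as `ulimit -s 24576` can be sketched from Python by lowering the soft RLIMIT_STACK in the child just before it execs. This is only an illustration, not how the Bridges destination is configured: `run_with_stack_limit` is a hypothetical helper, and the 24576 KiB default is simply the login005 value from the traces above.

```python
import resource
import subprocess


def run_with_stack_limit(cmd, stack_kib=24576):
    """Run cmd with the soft RLIMIT_STACK set to stack_kib KiB (hypothetical helper)."""

    def set_stack_limit():
        # Runs in the child process, after fork and before exec.
        _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
        wanted = stack_kib * 1024
        if hard != resource.RLIM_INFINITY:
            wanted = min(wanted, hard)  # the soft limit cannot exceed the hard limit
        resource.setrlimit(resource.RLIMIT_STACK, (wanted, hard))

    return subprocess.run(
        cmd, preexec_fn=set_stack_limit, capture_output=True, text=True, check=True
    )


# Stand-in command: ask a shell what stack limit it inherited.
result = run_with_stack_limit(["sh", "-c", "ulimit -s"])
print(result.stdout.strip())
```

In practice the command would be the Unicycler or spades.py invocation rather than `ulimit -s`; the limit is inherited by every process and thread the child starts.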

I'll test this out and update once there's a fix in place.

6 participants