Incomplete pipeline and different errors when using nanopore reads files with different sizes (900 mb vs 11Gb) #52
Comments
Hi @SergioChile81, also, if you know of any public hummingbird nanopore data that I could give a try, please let me know 😄
I think the problem relates to the fact that the pipeline currently has settings for raw and corrected long reads. However, I have not yet implemented a parameter for high-quality long reads. I will do so this week: I will update the tools that have new versions and add a new parameter to the pipeline for when the input long reads are high quality. Then I will let you know so we can test this new version before releasing 🙂
Hi @SergioChile81 ,
Also, for Barcode06, I saw that Canu complained about read coverage. I believe this is happening because it applies algorithms for uncorrected reads to your dataset, which consists of high-quality long reads. For this, I am now updating the pipeline to add a parameter; I am currently developing the update with publicly available high-quality ONT R10.4 reads.
Hello @fmalmeida, thank you for improving the pipeline with the new --hq-longreads option. I wonder if later on you could include a QC of the reads (adapter and quality trimming with Porechop, quality filtering with Filtlong, lambda removal with NanoLyse, and evaluation with NanoPlot); see the beginning of this other great pipeline (https://nf-co.re/mag/2.5.0). I will get the folder contents and upload them here. Cheers, Sergio
Hi @SergioChile81, also, thanks for the enhancement suggestion. At first, it was a design choice to keep it separate: we have developed a dedicated pipeline for preprocessing and QC, https://github.com/fmalmeida/ngs-preprocess. Right now, due to limited resources, I would probably not be able to commit to adding it to this pipeline. However, I would invite you to try running that other pipeline and see if it fits your needs. If so, you can open an issue in its GitHub repository to provide feedback and enhancement suggestions; they are always very much welcome. About the present issue, I have now added the parameter and am assembling the Drosophila reads in the three modes (uncorrected, corrected, and high-quality) so I can compare how the outputs look and then suggest a new command line for your data 😃
Thank you @fmalmeida, I will give the pipelines a try. About the work files of barcode04: I had to rerun the pipeline because I was using the same folder for the output of barcode04 and 06, and I think the first run's results got lost. I am attaching the files for barcode04 from the new run in this link. I also included the directory that shows up in the first error (in the HTML pipeline report). https://drive.google.com/drive/folders/1mxBNmZ7g7clanqaeLvQ6Yy9irYK5Dp-t?usp=share_link Thanks,
Hi @SergioChile81, I am currently doing the benchmarking/comparison between the assembly algorithms. In the meantime, I would like to ask if you could try running the issue branch. The following command line should do the trick for testing:
nextflow \
run fmalmeida/mpgap \
-r 52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb \
-latest \
--input input.yml \
--output mpgap_results \
--tracedir mpgap_results/pipeline_info \
-profile docker \
--max_cpus 10 --max_memory '60.GB' \
--quast_additional_parameters ' --eukaryote --large ' \
--skip_unicycler --skip_canu --skip_shasta --skip_wtdbg2 --skip_raven \
    --high_quality_longreads
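For context, the input.yml passed via --input would follow the pipeline's YAML samplesheet format (the same format shown later in this thread); a minimal sketch, with placeholder id, path, and values:

```yaml
# Hypothetical samplesheet for one nanopore sample; adjust the id, read
# path, genome size, and medaka model to your own data.
samplesheet:
  - id: barcode04
    nanopore: /path/to/barcode04.fastq.gz
    genome_size: 1500m
    medaka_model: r1041_e82_400bps_sup_v4.2.0
```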
Also, @SergioChile81, about the error shown here: https://drive.google.com/drive/folders/1mxBNmZ7g7clanqaeLvQ6Yy9irYK5Dp-t?usp=share_link My best guess would be memory, as I saw that you only specified the --max_cpus parameter. By default, the pipeline does a first try using a small amount of resources, in order to run multiple assemblies at the same time; if a task fails, it then launches a second attempt using the maximum values set by the user. As you can see, this was happening: all your assemblies failed on the first attempt, most probably due to memory. I would advise setting a higher value for --max_memory. Finally, if you want the first attempt to request more, you can also increase the amount of resources the pipeline uses on the 1st attempt. Currently, this is the configuration the pipeline uses for assemblies:
process {
// Assemblies will first try to adjust themselves to a parallel execution
// If it is not possible, then it waits to use all the resources allowed
withLabel:process_assembly {
cpus = { if (task.attempt == 1) { check_max( 6 * task.attempt, 'cpus' ) } else { params.max_cpus } }
memory = { if (task.attempt == 1) { check_max( 20.GB * task.attempt, 'memory' ) } else { params.max_memory } }
time = { if (task.attempt == 1) { check_max( 24.h * task.attempt, 'time' ) } else { params.max_time } }
// retry at least once to try it with full resources
errorStrategy = { task.exitStatus in [1,21,143,137,104,134,139,247] ? 'retry' : 'finish' }
maxRetries = 1
maxErrors = '-1'
}
}
// Function to ensure that resource requirements don't go beyond
// a maximum limit
def check_max(obj, type) {
if(type == 'memory'){
try {
if(obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1)
return params.max_memory as nextflow.util.MemoryUnit
else
return obj
} catch (all) {
println " ### ERROR ### Max memory '${params.max_memory}' is not valid! Using default value: $obj"
return obj
}
} else if(type == 'time'){
try {
if(obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
return params.max_time as nextflow.util.Duration
else
return obj
} catch (all) {
println " ### ERROR ### Max time '${params.max_time}' is not valid! Using default value: $obj"
return obj
}
} else if(type == 'cpus'){
try {
return Math.min( obj, params.max_cpus as int )
} catch (all) {
println " ### ERROR ### Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
return obj
}
}
}
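For readers less familiar with Groovy, the capping behaviour of the check_max() helper above can be mirrored in a small Python sketch. This is a simplified model only: the real helper compares Nextflow MemoryUnit/Duration objects, while here the resources are plain numbers (GB, hours, or CPU counts).

```python
def check_max(requested, limit, kind):
    """Cap a requested resource at the configured maximum.

    Mirrors the Groovy check_max() shown above: an invalid limit
    triggers a warning and falls back to the requested value.
    """
    try:
        limit_value = float(limit)
    except (TypeError, ValueError):
        print(f" ### ERROR ### Max {kind} '{limit}' is not valid! "
              f"Using default value: {requested}")
        return requested
    return min(requested, limit_value)

# First attempt asks for 20 GB; with --max_memory '60.GB' it is granted.
assert check_max(20, 60, "memory") == 20
# A 6-CPU request under a 4-CPU limit is capped at 4.
assert check_max(6, 4, "cpus") == 4
# An unparseable limit falls back to the requested value.
assert check_max(24, "24.h", "time") == 24
```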
You can possibly change it to:
process {
// Assemblies will first try to adjust themselves to a parallel execution
// If it is not possible, then it waits to use all the resources allowed
withLabel:process_assembly {
cpus = { if (task.attempt == 1) { check_max( 12 * task.attempt, 'cpus' ) } else { params.max_cpus } }
memory = { if (task.attempt == 1) { check_max( 40.GB * task.attempt, 'memory' ) } else { params.max_memory } }
time = { if (task.attempt == 1) { check_max( 24.h * task.attempt, 'time' ) } else { params.max_time } }
// retry at least once to try it with full resources
errorStrategy = { task.exitStatus in [1,21,143,137,104,134,139,247] ? 'retry' : 'finish' }
maxRetries = 1
maxErrors = '-1'
}
}
// Function to ensure that resource requirements don't go beyond
// a maximum limit
def check_max(obj, type) {
if(type == 'memory'){
try {
if(obj.compareTo(params.max_memory as nextflow.util.MemoryUnit) == 1)
return params.max_memory as nextflow.util.MemoryUnit
else
return obj
} catch (all) {
println " ### ERROR ### Max memory '${params.max_memory}' is not valid! Using default value: $obj"
return obj
}
} else if(type == 'time'){
try {
if(obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
return params.max_time as nextflow.util.Duration
else
return obj
} catch (all) {
println " ### ERROR ### Max time '${params.max_time}' is not valid! Using default value: $obj"
return obj
}
} else if(type == 'cpus'){
try {
return Math.min( obj, params.max_cpus as int )
} catch (all) {
println " ### ERROR ### Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
return obj
}
}
}
You can save these lines to a file called custom.config and pass it to the pipeline with -c custom.config.
Hello Felipe, sorry for the late response. The computer where I do analyses is the same one where I do the Nanopore runs, and we have been busy. Here are the results, without the actual FASTA assembly, from your previous request for testing: https://drive.google.com/drive/folders/1qdlh7F14_CNg3HdNYeDync0JL40g6pd7?usp=sharing Our server has 64 CPUs and 512 GB of RAM. I will adjust the custom.config file and let you know. Thanks!
Hi @SergioChile81, no need to run it right now. I will most probably finish the parameter testing this weekend, so you can then launch a real run to test the enhancement code in the new branch, already using the parameters for high-quality reads like yours. I will let you know at the beginning of next week when you can launch the run to check the new parameter.
Hi @SergioChile81, now it seems to be properly passing the information to the assemblers. From what I saw, only some of the assemblers have special parameters for high-quality ONT reads. Finally, please make sure to use the configuration for increased memory usage. Your command line would look more like this:
nextflow \
run fmalmeida/mpgap \
-r 52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb \
-latest \
--input input.yml \
--output mpgap_results \
--tracedir mpgap_results/pipeline_info \
-profile docker \
--max_cpus 10 --max_memory '60.GB' \
--quast_additional_parameters ' --eukaryote --large ' \
--skip_unicycler --skip_shasta --skip_wtdbg2 --skip_raven \
--high_quality_longreads \
    -c custom.config
Where custom.config is the file with the increased resource configuration shown above. Finally, the selection of read quality can happen globally, via the --high_quality_longreads parameter, or per sample, in the samplesheet:
samplesheet:
  - id: highquality_algorithm
    nanopore: in_reads/final_output/nanopore/SRR23215008.filtered.fq.gz
    high_quality_longreads: true
    genome_size: 180m
    medaka_model: r1041_e82_400bps_sup_v4.2.0
Please let me know how it goes, because if things work, and the high-quality parameter really propagates the information to the assemblers, I will then start working on the documentation, to make sure the manual has full information about these features.
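The global-versus-per-sample selection described above can be sketched in Python. This is an illustrative model only: the key names mirror the samplesheet shown, but the per-sample-wins precedence is an assumption to verify against the pipeline documentation.

```python
def resolve_high_quality(sample: dict, global_flag: bool) -> bool:
    """Return whether a sample's long reads should be treated as
    high quality: the per-sample samplesheet key wins when present,
    otherwise the global --high_quality_longreads flag applies."""
    return bool(sample.get("high_quality_longreads", global_flag))

# A sample that sets the key explicitly, as in the samplesheet above.
sample = {
    "id": "highquality_algorithm",
    "nanopore": "in_reads/final_output/nanopore/SRR23215008.filtered.fq.gz",
    "high_quality_longreads": True,
}
assert resolve_high_quality(sample, global_flag=False) is True
# A sample without the key inherits the global setting.
assert resolve_high_quality({"id": "other"}, global_flag=True) is True
```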
Hi @fmalmeida, Thank you for the new version of the pipeline. I tried to follow your instructions but got this error: N E X T F L O W ~ version 23.10.0 Perhaps I am using a different version of Nextflow. Let me know how to proceed. |
Hi @SergioChile81 ,
I have updated the code here so it has the missing "". The idea is that we can test, first, that you can run the pipeline with more memory, and second, that the 'high-quality' parameters are being properly passed to the assemblers that have special params for it. I am currently working in another ticket to add more 😄
Thanks again for the fast response. I have updated the files and commands with the instructions you provided. However, I have a new error... sorry :/
(mpgap_nf) ubuntu@AGROSAVIA:~/mpgap$ nextflow run fmalmeida/mpgap -r 52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb -latest --input MPGAP_samplesheet_barcode04.yml --output mpgap_results_barcode04 --tracedir mpgap_results_barcode04/pipeline_info -profile docker --max_cpus 10 --max_memory '60.GB' --quast_additional_parameters ' --eukaryote --large ' --skip_unicycler --skip_shasta --skip_wtdbg2 --skip_raven --high_quality_longreads -c custom.config
N E X T F L O W ~ version 23.10.0
Pulling fmalmeida/mpgap ...
Already-up-to-date
Launching `https://github.com/fmalmeida/mpgap` [fabulous_boyd] DSL2 - revision: 19b743abbc [52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb]
WARN: Found unexpected parameters:
* --pilon_polish_rounds: 4
- Ignore this warning: params.schema_ignore_params = "pilon_polish_rounds"
------------------------------------------------------
fmalmeida/mpgap v3.2
------------------------------------------------------
Core Nextflow options
revision : 52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb
runName : fabulous_boyd
containerEngine : docker
container : fmalmeida/mpgap@sha256:0439466a52a3aef70c3e3b2b8ba5504bf167db2437a7fbb85d40f94c95a67fb9
launchDir : /data/disk1_SSD_8TB/scratch/Sergio_Marchant_sep_2023/Colibries-20Oct-23
workDir : /data/disk1_SSD_8TB/scratch/Sergio_Marchant_sep_2023/Colibries-20Oct-23/work
projectDir : /home/ubuntu/.nextflow/assets/fmalmeida/mpgap
userName : ubuntu
profile : docker
configFiles : /home/ubuntu/.nextflow/assets/fmalmeida/mpgap/nextflow.config, /data/disk1_SSD_8TB/scratch/Sergio_Marchant_sep_2023/Colibries-20Oct-23/custom.config
Input/output options
input : MPGAP_samplesheet_barcode04.yml
output : mpgap_results_barcode04
Computational options
max_cpus : 10
max_memory : 60.GB
Long reads assemblers parameters
high_quality_longreads : true
Turn assemblers and modules on/off
skip_unicycler : true
skip_raven : true
skip_wtdbg2 : true
skip_shasta : true
Software' additional parameters
quast_additional_parameters: --eukaryote --large
Generic options
tracedir : mpgap_results_barcode04/pipeline_info
!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use fmalmeida/mpgap for your analysis please cite:
* The pipeline
https://doi.org/10.5281/zenodo.3445485
* The nf-core framework
https://doi.org/10.1038/s41587-020-0439-x
* Software dependencies
https://github.com/fmalmeida/mpgap#citation
------------------------------------------------------
Launching defined workflows!
By default, all workflows will appear in the console "log" message.
However, the processes of each workflow will be launched based on the inputs received.
You can see that processes that were not launched have an empty [- ].
[- ] process > SHORTREADS_ONLY:spades -
[- ] process > SHORTREADS_ONLY:shovill -
[- ] process > SHORTREADS_ONLY:megahit -
[- ] process > LONGREADS_ONLY:canu -
[- ] process > LONGREADS_ONLY:flye -
[- ] process > LONGREADS_ONLY:medaka -
[- ] process > LONGREADS_ONLY:nanopolish -
[- ] process > LONGREADS_ONLY:gcpp -
[- ] process > SHORTREADS_ONLY:spades -
[- ] process > SHORTREADS_ONLY:shovill -
[- ] process > SHORTREADS_ONLY:megahit -
[- ] process > LONGREADS_ONLY:canu -
[- ] process > LONGREADS_ONLY:flye -
[- ] process > LONGREADS_ONLY:medaka -
[- ] process > LONGREADS_ONLY:nanopolish -
[- ] process > LONGREADS_ONLY:gcpp -
[- ] process > HYBRID:strategy_1_spades -
[- ] process > HYBRID:strategy_1_haslr -
[- ] process > HYBRID:strategy_2_canu -
[- ] process > HYBRID:strategy_2_flye -
[- ] process > HYBRID:strategy_2_medaka -
[- ] process > HYBRID:strategy_2_nanopolish -
[- ] process > HYBRID:strategy_2_gcpp -
[- ] process > HYBRID:strategy_2_pilon -
[- ] process > SHORTREADS_ONLY:spades -
[- ] process > SHORTREADS_ONLY:shovill -
[- ] process > SHORTREADS_ONLY:megahit -
[- ] process > LONGREADS_ONLY:canu -
[- ] process > LONGREADS_ONLY:flye -
[- ] process > LONGREADS_ONLY:medaka -
[- ] process > LONGREADS_ONLY:nanopolish -
[- ] process > LONGREADS_ONLY:gcpp -
[- ] process > HYBRID:strategy_1_spades -
[- ] process > HYBRID:strategy_1_haslr -
[- ] process > HYBRID:strategy_2_canu -
[- ] process > HYBRID:strategy_2_flye -
[- ] process > HYBRID:strategy_2_medaka -
[- ] process > HYBRID:strategy_2_nanopolish -
[- ] process > HYBRID:strategy_2_gcpp -
[- ] process > HYBRID:strategy_2_pilon -
[- ] process > HYBRID:strategy_2_polypolish -
[- ] process > ASSEMBLY_QC:quast -
[- ] process > ASSEMBLY_QC:multiqc -
Execution cancelled -- Finishing pending tasks before exit
Pipeline completed at: 2023-11-17T10:22:38.681112228-05:00
Execution status: failed
Execution duration: 2.4s
Do not give up, we can fix it!
ERROR ~ Error executing process > 'LONGREADS_ONLY:flye (highquality_algorithm)'
Caused by:
No signature of method: nextflow.script.ScriptBinding.check_max() is applicable for argument types: () values: [] -- Check script '/home/ubuntu/.nextflow/assets/fmalmeida/mpgap/./workflows/../modules/LongReads/flye.nf' at line: 26
Source block:
lr = (lr_type == 'nanopore') ? '--nano' : '--pacbio'
if (corrected_longreads.toBoolean()) { lrparam = lr + '-corr' }
else if (high_quality_longreads.toBoolean()) {
lrsuffix = (lr_type == 'nanopore') ? '-hq' : '-hifi'
lrparam = lr + lrsuffix
}
else { lrparam = lr + '-raw' }
gsize = (genome_size) ? "--genome-size ${genome_size}" : ""
additional_params = (params.flye_additional_parameters) ? params.flye_additional_parameters : ""
"""
# run flye
flye \\
${lrparam} $lreads \\
${gsize} \\
--out-dir flye \\
$additional_params \\
--threads $task.cpus &> flye.log ;
# rename results
mv flye/assembly.fasta flye/flye_assembly.fasta
"""
Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`
-- Check '.nextflow.log' file for details
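The read-type branching in the flye.nf source block quoted above maps directly to Flye's input-mode flags; here is a Python mirror of that Groovy logic, written only to make the branching easier to follow.

```python
def flye_read_flag(lr_type: str, corrected: bool, high_quality: bool) -> str:
    """Select the Flye input-mode flag, mirroring the quoted flye.nf logic."""
    lr = "--nano" if lr_type == "nanopore" else "--pacbio"
    if corrected:
        return lr + "-corr"
    if high_quality:
        # ONT high-quality reads use -hq; for PacBio the suffix is -hifi.
        suffix = "-hq" if lr_type == "nanopore" else "-hifi"
        return lr + suffix
    return lr + "-raw"

# --high_quality_longreads with nanopore input yields Flye's --nano-hq mode.
assert flye_read_flag("nanopore", corrected=False, high_quality=True) == "--nano-hq"
# Without any flag, raw mode is used, as in the original error report.
assert flye_read_flag("pacbio", corrected=False, high_quality=False) == "--pacbio-raw"
```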
What are the contents of your custom.config?
I think there might be a problem with your custom.config. Could you upload it, or show it here so we can check?
Merge reference: branch 52-incomplete-pipeline-and-different-errors-when-using-nanopore-reads-files-with-different-sizes-900-mb-vs-11gb. Relates to issue #52. Adds a new parameter for interpreting high-quality long reads.
Closing this ticket due to lack of activity. Results from the ticket: a new parameter to handle high-quality long reads and activate the corresponding settings in the assemblers that have them. Merged into the dev branch by #63 and shall come in the next release. Finally, a new issue was created to make it easier to modify the amount of memory that the pipeline requests from the start, so that datasets of bigger genomes that require more memory can run without having to first fail on a starting attempt.
Added some new parameters in the latest release to allow users to quickly modify the amount of memory of the starting assembly attempts, select different BUSCO DBs, and state whether long reads are corrected or high quality. https://github.com/fmalmeida/MpGAP/releases/tag/v3.2.0 Hope it helps. If the error persists, we can open a new ticket to tackle it.
Describe the bug
I encountered an issue while running the pipeline with two barcoded genome samples (Barcode04 and Barcode06). The sequencing runs produced read files of very different sizes: Barcode04 resulted in an 11GB file, while Barcode06 generated a 980MB file. Both runs also exhibited different errors. It's worth noting that the reference genome size for these samples, which are from hummingbirds, is approximately 1.5Gb. The sequencing was performed using the Nanopore PromethION 10.4.1 platform, and basecalling was done with the super-accurate algorithm.
To Reproduce
Steps to reproduce the behavior:
Run the following command line with the files in the respective folders
nextflow run fmalmeida/mpgap --output output_barcode04_20_oct_2023 --max_cpus 64 --input "MPGAP_samplesheet_barcode04.yml" -profile docker
or
nextflow run fmalmeida/mpgap --output output_barcode06_20_oct_2023 --max_cpus 64 --input "MPGAP_samplesheet_barcode06.yml" -profile docker
Expected behavior
Output folders with the results of the pipeline
Archive.zip