frameshift_deletion_checks, simBench & Haplotype callers #132
Dear Dimitris,

Regarding your first problem, frameshift_deletions_checks: we have only extensively tested frameshift_deletions_checks with SARS-CoV-2 viruses. It is possible that your work on a capsicum virus has unearthed a new, unknown corner case. Which version of V-pipe are you running? (What is the output of …?) Could you also please provide the two log files mentioned by snakemake?
This might help pinpoint the origin of the crash. For providing the GFF, the second is the correct syntax:

```yaml
frameshift_deletions_checks:
  genes_gff: gff_dir/[…].gff3
```

This GFF file is used to determine the positions of the genes on the virus's genome, in order to determine the open reading frames in which to scan all codons for premature stop codons introduced by mutations. In case switching to the latest V-pipe (with the latest smallgenomeutilities) doesn't solve your problem, would you be willing to send us the .bam, the consensus.bcftools.fasta and the reference .fasta?
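For readers facing the same confusion, here is a hedged sketch contrasting the placement of the two GFF-related keys discussed in this thread. The section and file names are illustrative assumptions pieced together from the thread, not taken from V-pipe's documentation:

```yaml
# Illustrative config.yaml excerpt (paths are placeholders).
input:
  gff_directory: gffs_rev        # per this thread, only used by the visualization module

frameshift_deletions_checks:
  genes_gff: gff_dir/genes.gff3  # this is the key the QA check actually reads
```

The key point is that a GFF supplied only under `input` never reaches frameshift_deletions_checks; the check needs its own `genes_gff` entry.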
Regarding your second question, simBench: indeed, the benchmarking infrastructure that was introduced with V-pipe 2.0 and documented in the wiki has been deprecated in the recent reengineering of V-pipe. A new benchmarking infrastructure is currently being developed; the subdirectory resources/auxiliary_workflows/benchmark contains more information. @LaraFuhrmann could probably help you get more information about this. Keep in mind that it is still considered experimental.
Regarding your third question, global haplotypes: well, global haplotype reconstruction is still an ongoing research field and far from solved, and the software is still buggy. Regarding HaploClique: the last time I experimented with it, on some samples the number of cliques kept increasing in the early iterations and the memory consumption exploded. Regarding PredictHaplo: there are still bugs in the code (in the past, most came from corner cases, e.g. an empty list, causing memory corruption and crashes). Debugging this requires skills that are in short supply in our small team.
Dear @DrYak, thank you very much for your swift and elaborate response.

Regarding Q1: for any user who may face the same issue, I will describe the error in detail. The following error log shows that the .gff file is not given as input to frameshift_deletions_checks:
My initial config.yaml file was the following:
Specifying the gff directory under the input tab as gff_directory: gffs_rev does not update the gff file given as input to frameshift_deletions_checks; it is only used by the visualization module of V-pipe. My second config.yaml file:
Logically, the error log was the same as before, as the typo prevented frameshift_deletions_checks from reading the .gff file. It is also worth noting that I did not find in the documentation that a .gff file is a prerequisite for the QA process; it would be nice if you printed an informative error message in such cases.

Regarding Q2: thank you very much for the information. I will probably resolve this by writing a script to generate the reads myself. It is a great idea to include such benchmarking functionality in your tool!

Regarding Q3: yes, my experience with haplotype callers is also frustrating. Because I am not the original owner of the datasets, I will have to ask for permission to share the .bam file. I think that HaploClique severely struggles when the coverage is high. One alternative that worked for me, at least in terms of consistently delivering output, is CliqueSNV (though the output haplotypes do not always make sense biologically). Again, thank you very much for your time and detailed responses. V-pipe looks very promising!

With kind regards,
Thank you for pointing out these pain points. Yes, indeed, we should add more sanity checks and refuse to run in such cases.

We should find a way to make both more straightforward. The validation is still lacking: we do rely on schemas to validate the configuration, but the official documentation of snakemake uses a validator that is tolerant of missing/unknown keywords (it will simply not enforce types and defaults on unknown keywords). (This is something which wasn't a problem with the legacy INI-like format.) I think we should consider:
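The tolerant-validator behaviour described above can be illustrated with a small sketch. This is a generic, hypothetical example (not V-pipe's actual validator or schema): a tolerant check simply ignores keys it does not know, so the exact typo from this thread slips through, whereas a strict check rejects unknown sections immediately.

```python
# Pure-Python sketch of tolerant vs. strict config validation
# (illustrative only; V-pipe's real schema validation differs).

SCHEMA_KEYS = {"input", "output", "frameshift_deletions_checks"}

# The typo from this thread: "deletion" instead of "deletions".
config = {"frameshift_deletion_checks": {"genes_gff": "gff_dir/genes.gff3"}}

def tolerant_validate(cfg):
    """Like a tolerant schema validator: unknown keys are silently ignored."""
    return [k for k in cfg if k in SCHEMA_KEYS]  # only known keys get checked

def strict_validate(cfg):
    """Strict variant: any unknown key raises, surfacing typos early."""
    unknown = set(cfg) - SCHEMA_KEYS
    if unknown:
        raise ValueError(f"unknown configuration section(s): {sorted(unknown)}")

tolerant_validate(config)  # passes silently; the misspelled section is never checked
try:
    strict_validate(config)
except ValueError as e:
    print(e)  # names the misspelled section instead of silently ignoring it
```

With the tolerant check, the misspelled section is skipped and defaults apply, which matches the silent failure reported above; the strict variant would have produced an informative error message up front.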
Small update: commit 58a9cccca3bf25021a016031f19f0816db748f44 slightly improves the user experience with multiple GFF files.
Dear V-pipe authors,
Thank you very much for your pipeline. I am attempting to use V-pipe to assess intra-host viral diversity in plant samples. However, I encountered the following issues:
The error is not fixed even if I provide the .gff file in the config.yaml with the following two alternative syntaxes:
or
I cannot find the benchmarking functionality in the latest documentation. In the config HTML file, there are no explanations or tags on how to generate reads and, overall, how to use the benchmarking modules. Downloading the benchmark branch and adjusting to the deprecated documentation was not successful. Are you planning to update the documentation for the newest version of your master branch, so that the benchmarking capabilities of the pipeline can be utilized? Can you please point me to the benchmarking documentation?
Considering the haplotype-calling modules, I consistently get segmentation faults when using PredictHaplo (switching between local and global analysis) and (core dumped) errors when using HaploClique. I tried these options because I want to perform reference-based reconstruction of haplotypes. Can you provide any ideas on why such errors pop up?
HaploClique error log:
The PredictHaplo error log remains empty, but after completing the local haplotype reconstruction it raises a segmentation fault.
It may be relevant that my datasets have very high coverage (50,000x–350,000x).
Please let me know if you need any more information to answer these issues. Thank you very much for your consideration.
Dimitris Karapliafis