Add a SNPsplit module #593
@FelixKrueger and @vivekbhr: Let me know if there's something you'd like changed in the output. |
Another PR added to the pile - thanks @dpryan79 😅 Trying to work through these today but as you're currently at your keyboard here is some speed feedback from looking at the report output:
Many thanks for working on this though @dpryan79 - it looks great! Phil |
Hi Devon, This looks pretty cool to me already, thanks for your efforts in that direction! Just generally, the allele-tagging and allele-sorting reports should be identical for single-end files which would make the two bar graphs indeed very similar to each other. An exception to this is BS-Seq which also has the useful information of how many C>T SNPs had to be ignored. For paired-end data the two reports may be substantially different because both reads are taken into account for the sorting step (while the tagging is based on individual reads). So one might argue that generally the sorting report is probably the most immediately useful one to plot, however there may be some useful data in the tagging report that may be good to present somehow as well. Maybe this could be accomplished in the form of a I also find a useful metric the number of N-containing reads that were present in the SNP file:
This is a good and quick metric to see if someone simply used the wrong SNP file for the allele sorting.
could also be included for record keeping? Thanks again, Felix |
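A minimal Python sketch of the sanity check described above: if most N-masked positions seen in the reads are not in the SNP annotation, the wrong SNP file was probably used. The two counts correspond to the `N_was_known_SNP` and `N_not_known` fields that appear in the YAML report later in this thread. The function names and the 0.5 cutoff are assumptions for illustration, not part of SNPsplit or MultiQC.

```python
from typing import Optional


def known_snp_fraction(n_known: int, n_not_known: int) -> Optional[float]:
    """Fraction of N positions in reads that matched a SNP in the annotation file.

    Returns None when no N positions were seen, since nothing can be judged then.
    """
    total = n_known + n_not_known
    if total == 0:
        return None
    return n_known / total


def snp_file_looks_wrong(n_known: int, n_not_known: int,
                         min_fraction: float = 0.5) -> bool:
    """Flag a sample whose N positions mostly do not match the SNP file.

    min_fraction is an arbitrary illustrative cutoff, not a SNPsplit default.
    """
    fraction = known_snp_fraction(n_known, n_not_known)
    if fraction is None:
        return False  # cannot tell either way
    return fraction < min_fraction


# Counts taken from the sample report in this thread:
# N_was_known_SNP = 2296, N_not_known = 0
print(known_snp_fraction(2296, 0))
print(snp_file_looks_wrong(2296, 0))
```

A report module could use such a flag to colour the "SNPs covered" bar red, as suggested elsewhere in the discussion.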
Quick thoughts in response to @FelixKrueger:
Phil |
Not quite sure I understand the first comment: pretty much every single module seems to have these general statistics underlaid with coloured bar graphs, do they not? I agree with the other suggestions though; maybe the SNPs covered stat could be a single stacked horizontal bar chart as well (with a shout-out red colour if the majority of SNPs were unaccounted for). |
As I understood your comment, I thought you were suggesting taking the categories currently in a bar graph and moving them into a table instead? I'm just pointing out that I often spend PRs asking people to move data out of tables and into plots where possible :) |
Thanks @dpryan79 for this useful PR :) Summarizing above discussion with my suggestions:
|
Hi @vivekbhr A good line to decide whether it is SE or PE could be the header of the Allele-Sorting step:
Thanks, Felix |
I think maybe we have a case of too many cooks here..? Are you happy with a conclusion from all of this @dpryan79? Phil |
@ewels Yeah, I just need to find some time to finish working on the module so people can have another look :) |
2488ce1 to a75c62a
Hi chaps, Just a little reminder of this ongoing PR. Let me know if I can do anything to help here. Phil |
Thanks for the reminder, hopefully I'll have a chance to look at this again early next week. |
Ah, sad times.. What are your thoughts on merging this as it currently stands @dpryan79 @FelixKrueger? I'm tempted to think that having something in MultiQC for SNPsplit is better than having nothing..? |
You're a better judge than me. I haven't had any chance to work on this in seemingly forever and don't foresee having a chance in the foreseeable future :( I'm happy to leave the branch and fork around if anyone wants to use it as a starting point of course :) |
I'd love to see SNPsplit supported in MultiQC, but having never written a module for it, it would probably take me a long while... I'm sure you could add the (mainly design) changes you had in mind in just a few minutes, what do you think, Phil? |
I'll see what I can do.. |
Seeing that you are currently working on a new version, would you mind including this as well? Many thanks! |
I have kindly requested that the developer behind SNPsplit has a read of my thoughts about log files at http://tallphil.co.uk/writing-good-log-files/#suggestion-2-use-nice-formats and considers adding support for a machine-readable metrics file, preferably YAML or JSON. 😜 |
I'm happy to write all metrics to a YAML file, but it will have to wait until today or tomorrow. Hope I won't be too late to the party.... |
I seem to have missed the party ever so slightly, but nevertheless there should now be YAML reports for SNPsplit. I have never produced any YAML files before, but I hope they are acceptable. The statistics can be slightly different for different data types, so I am attaching sample reports for
Many thanks, Felix |
Apologies @FelixKrueger - I know you've been asking me for this for ages and I cut you off from the release.. Just needed to get it out! I promise it won't be another year until the next one.. 😉 Thanks for the YAML support, looks great! Couple of minor suggestions (trying to use perl terminology):
|
I have now updated the YAML section to include lots more useful metadata. Please find the sample reports attached. More details here: FelixKrueger/SNPsplit#29. |
Yay! Super nice! Date string is kind of a tricky format (my example above uses the canonical YAML format), but that really doesn't matter.. Thank you for doing this! |
I could quickly re-factor the time format to be in canonical YAML format if that helps? What does the |
I have now changed it to produce the format: (I am just appending
---
Meta:
  tool: SNPsplit
  version: 0.3.4_dev
  infile: R1_CAST_EiJ_N-masked_GRCm38_bismark_bt2_pe.bam
  date_run: 2019-11-25T17:03:40.1Z
  mode: bisulfite
  library: paired-end
  command: SNPsplit --snp all_SNPs_CAST_EiJ_GRCm38.txt.gz R1_CAST_EiJ_N-masked_GRCm38_bismark_bt2_pe.bam
Tagging:
  total_reads: 3754
  unaligned: 0
  percent_unaligned: 0.00
  g1: 612
  percent_g1: 16.30
  g2: 476
  percent_g2: 12.68
  unassignable: 2665
  percent_unassignable: 70.99
  unassigned_but_ct: 379
  no_snp: 2
  percent_no_snp: 0.05
  bizarre: 1
  percent_bizarre: 0.03
  SNP_annotation: all_SNPs_CAST_EiJ_GRCm38.txt.gz
  SNPs_stored: 20668547
  N_containing_reads: 1469
  non_N_containing_reads: 2284
  N_deletion: 1
  percent_N_deletion: 0.03
  multi_N_deletion: 0
  N_was_known_SNP: 2296
  percent_N_was_known_SNP: 100.00
  CT_positions_skipped: 774
  N_not_known: 0
  percent_N_not_known: 0.00
Sorting:
  tagged_infile: R1_CAST_EiJ_N-masked_GRCm38_bismark_bt2_pe.allele_flagged.bam
  PE_total_reads: 1877
  PE_total_pairs: 1877
  PE_total_singletons: 0
  PE_unassignable: 1068
  PE_percent_unassignable: 56.90
  PE_unassignable_pairs: 1068
  PE_unassignable_singletons: 0
  PE_genome1: 455
  PE_percent_genome1: 24.24
  PE_genome1_pairs: 455
  PE_genome1_singletons: 0
  PE_genome2: 353
  PE_percent_genome2: 18.81
  PE_genome2_pairs: 353
  PE_genome2_singletons: 0
  PE_conflicting: 1
  PE_percent_conflicting: 0.05
  PE_conflicting_pairs: 1
  PE_conflicting_singletons: 0
...
|
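To illustrate why a report like the one above is convenient for downstream tools, here is a rough, standard-library-only Python sketch that reads it into nested dictionaries. It assumes the simple shape shown above (a section name followed by `key: value` lines); the helper names are made up for this example, and the embedded sample is a shortened copy of the report. A real consumer would more likely hand the file to a proper YAML library such as PyYAML.

```python
from typing import Dict, Optional, Union

Value = Union[int, float, str]

# Shortened copy of the report posted above.
SAMPLE = """\
---
Meta:
  tool: SNPsplit
  version: 0.3.4_dev
  library: paired-end
Tagging:
  total_reads: 3754
  g1: 612
  percent_g1: 16.30
Sorting:
  PE_total_pairs: 1877
  PE_genome1: 455
...
"""


def _convert(text: str) -> Value:
    """Return an int or float when the text is numeric, otherwise the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def parse_report(text: str) -> Dict[str, Dict[str, Value]]:
    """Parse the two-level 'Section: / key: value' layout into nested dicts."""
    sections: Dict[str, Dict[str, Value]] = {}
    current: Optional[Dict[str, Value]] = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line in ("---", "..."):
            continue  # skip blanks and YAML document markers
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key/value line
        value = value.strip()
        if value == "":
            # A key with no value opens a new section (Meta, Tagging, Sorting).
            current = sections.setdefault(key.strip(), {})
        elif current is not None:
            current[key.strip()] = _convert(value)
    return sections


report = parse_report(SAMPLE)
print(report["Meta"]["library"], report["Tagging"]["g1"])
```

The `library` entry under `Meta` gives a direct way to tell single-end from paired-end runs without inspecting the log text.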
Off the top of my head, I think the |
Yup - see https://stackoverflow.com/a/29951427/713980 for a perl snippet |
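The link above is a Perl snippet; on the consuming side, the `date_run` value from the sample report (`2019-11-25T17:03:40.1Z`) can be read with Python's standard library. This is a small sketch, assuming the timestamp always carries a fractional-second part and a trailing `Z` (the ISO 8601 marker for UTC). The function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_date_run(stamp: str) -> datetime:
    """Parse a timestamp such as 2019-11-25T17:03:40.1Z into an aware UTC datetime."""
    # %f accepts one to six fractional digits; the literal Z is matched as-is.
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def format_date_run(moment: datetime) -> str:
    """Format an aware datetime in the same style, with tenths of a second."""
    utc = moment.astimezone(timezone.utc)
    tenths = utc.microsecond // 100000
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + f".{tenths}Z"


when = parse_date_run("2019-11-25T17:03:40.1Z")
print(when.isoformat())
print(format_date_run(when))
```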
Thanks for that. So are you saying that the current format is acceptable? |
Hi all, Merging into a new branch Phil |
Excellent, thanks! |
This adds support for output from SNPsplit. Regular, bisulfite, and HiC outputs are all currently supported. Example output is currently available here.