Add a SNPsplit module #593

dpryan79 · 2017-09-21T12:35:37Z

This adds support for output from SNPsplit. Both regular, bisulfite, and HiC outputs are currently supported. Example output is currently available here.

dpryan79 · 2017-09-21T12:38:10Z

@FelixKrueger and @vivekbhr: Let me know if there's something you'd like changed in the output.

ewels · 2017-09-21T12:46:39Z

Another PR added to the pile - thanks @dpryan79 😅

Trying to work through these today but as you're currently at your keyboard here is some speed feedback from looking at the report output:

Is it worth adding a column or two to the General Statistics table?
- Perhaps % reads aligned to one genome as I guess this is a good single-figure "did it work or not" statistic..?
Bar plot design
- I usually try to put "good" categories on the left, with "failure" categories after that. I also sometimes try to colour code the bad categories to make them more obvious. See the STAR module for an example of this (and colour codes to steal).
Help text
- Love that you wrote this, super helpful 👍 Minor thing - could be nice to use backticks to give markdown code when quoting the category names, to make them stand out more.
I find the two plots kind of confusing as they're so similar
- You've tried to explain this in the help text I see, and I'm not sure that I can think of a better way to represent / explain the differences... hmm. @FelixKrueger?

Many thanks for working on this though @dpryan79 - it looks great!

Phil

FelixKrueger · 2017-09-21T14:02:50Z

Hi Devon,

This looks pretty cool to me already, thanks for your efforts in that direction!

Just generally, the allele-tagging and allele-sorting reports should be identical for single-end files which would make the two bar graphs indeed very similar to each other. An exception to this is BS-Seq which also has the useful information of how many C>T SNPs had to be ignored.

For paired-end data the two reports may be substantially different because both reads are taken into account for the sorting step (while the tagging is based on individual reads).

So one might argue that generally the sorting report is probably the most immediately useful one to plot, however there may be some useful data in the tagging report that may be good to present somehow as well. Maybe this could be accomplished in the form of a General Tagging Statistics table that also uses coloured percentages in its cells (which most modules have at their start)? So maybe the different categories of the tagging plot could go in there, also including the total number of reads, C>T SNPs etc.? (sorry I don't know how much work that would be, but you seem to have all the right data at hand already).

I also find a useful metric the number of N-containing reads that were present in the SNP file:

Of valid N containing reads,
N was present in the list of known SNPs:	25884707 (99.98%)
N was not present in the list of SNPs:		6080 (0.02%)

This is a good and quick metric to see if someone simply used the wrong SNP file for the allele sorting.
Maybe

SNP annotation file:	all_SNPs_CAST_EiJ_GRCm38.txt.gz
SNPs stored in total:	20668547

could also be included for record keeping?

Thanks again, Felix

ewels · 2017-09-21T14:21:46Z

Quick thoughts in response to @FelixKrueger:

Tables are a last resort and plots are always better if possible 😉
- A single summary column or two in the General Stats alongside results from other tools is an exception to this rule
Would it make sense to show samples in one plot or the other, depending on whether they're SE or PE? For example, the bowtie2 module has separate SE and PE plots as the two are incompatible. Or just drop one of the tables?
I prefer to avoid putting text into reports (such as the annotation file used) where possible. But it's certainly easy to parse basically everything and spit this into the multiqc_data files for record keeping.
- If the Ns present in list of SNPs data is interesting, could do this as a plot? This also has the added bonus of representing the total number of SNPs in the report.

Phil

FelixKrueger · 2017-09-21T14:29:27Z

Not quite sure if I understand the first comment, pretty much every single module seems to have a these general statistics underlaid with coloured bar graphs, do they not?

I agree with the other suggestions though, maybe SNPs covered stat could be a single stacked horizontal bar chart as well (with a shout-out red colour if the majority or SNPs were unaccounted for..) .

ewels · 2017-09-21T14:40:07Z

Every single module seems to have a these general statistics underlaid with coloured bar graphs

As I understood your comment, I thought you were suggesting taking the categories currently in a bar graph and moving them into a table instead? I'm just pointing out that I often spend PRs asking people to move data out of tables and into plots where possible :)

vivekbhr · 2017-09-21T15:23:19Z

Thanks @dpryan79 for this useful PR :)

Summarizing above discussion with my suggestions:

The colors in the tagging and sorting bar charts should be matched. I suggest the order (left to right) : genome1, genome2, conflictingSNPs, ambiguous, unassignable, skipped.
Allele-tagging report should always appear as bar charts. Allele-sorting report should appear additionally as bar chart for the samples that are paired-end ( @FelixKrueger which line from the report you suggest to detect paired-end sample?) .
Another stacked bar chart for the SNP coverage report could be added as an enhancement. This could also go into another PR.

FelixKrueger · 2017-09-21T15:30:58Z

Hi @vivekbhr

A good line to decide whether it is SE or PE could the header of the Allele-Sorting step:

$grep Allele-specific *r.txt

BS_SNPsplit_report.PE.txt:Allele-specific paired-end sorting report
BS_SNPsplit_report.SE.txt:Allele-specific single-end sorting report
ChIP.SNPsplit_report.PE.txt:Allele-specific paired-end sorting report
ChIP.SNPsplit_report.SE.txt:Allele-specific single-end sorting report
HiC.SNPsplit_report.txt:Allele-specific paired-end sorting report

Thanks, Felix

ewels · 2017-10-05T14:15:55Z

I think maybe we have a case of too many cooks here..? Are you happy with a conclusion from all of this @dpryan79?

Phil

dpryan79 · 2017-10-05T14:17:56Z

@ewels Yeah, I just need to find some time to finish working on the module so people can have another look :)

ewels · 2018-06-30T07:57:20Z

Hi chaps,

Just a little reminder of this ongoing PR. Let me know if I can do anything to help here.

Phil

dpryan79 · 2018-07-02T16:38:49Z

Thanks for the reminder, hopefully I'll have a chance to look at this again early next week.

ewels · 2019-03-11T10:13:36Z

Ah, sad times.. What are your thoughts on merging this as it currently stands @dpryan79 @FelixKrueger?

I'm tempted to think that having something in MultiQC for SNPsplit is better than having nothing..?

dpryan79 · 2019-03-11T10:15:47Z

You're a better judge than me. I haven't had any chance to work on this in seemingly forever and don't foresee having a chance in the foreseeable future :( I'm happy to leave the branch and fork around if anyone wants to use it as a starting point of course :)

FelixKrueger · 2019-03-11T10:50:29Z

I'd love to see SNPsplit supported in MultiQC, but having never written any module for it it would probably take me a long while... I'm sure you could add the (mainly design) changes you had in mind in just a few minutes, what yo you think, Phil?

ewels · 2019-03-11T16:54:02Z

I'll see what I can do..

FelixKrueger · 2019-10-04T21:22:37Z

I'll see what I can do..

Seeing that you are currently working on a new Version, would you mind including this as well? Many thanks!

ewels · 2019-11-19T23:31:05Z

I have kindly requested that the developer behind SNPsplit has a read of my thoughts about log files at http://tallphil.co.uk/writing-good-log-files/#suggestion-2-use-nice-formats and considers adding support for a machine-readable metrics file, preferably YAML or JSON. 😜

FelixKrueger · 2019-11-20T12:25:14Z

I'm happy to write all metrics to a YAML file, but it will have to wait until today or tomorrow. Hope I won't be too late to the party....

FelixKrueger · 2019-11-22T17:25:51Z

I seem to have missed the party ever so slightly, but nevertheless there should now be YAML reports for SNPsplit. I have never produced any YAML files before, but I hope they are acceptable.

The statistics can be slightly different for different data types, so I am attaching sample reports for

SE alignments [Bowtie2]
PE alignments [Bowtie2]
SE bisulfite alignments [Bismark]
PE bisulfite alignments [Bismark]
PE Hi-C alignments [HiCUP]

Many thanks, Felix
YAML_reports.zip

ewels · 2019-11-23T14:50:57Z

Apologies @FelixKrueger - I know you've been asking me for this for ages and I cut you off from the release.. Just needed to get it out! I promise it won't be another year until the next one.. 😉

Thanks for the YAML support, looks great! Couple of minor suggestions (trying to use perl terminology):

No need to use an array of hashes for the top level organisation. You can just use a hash.

eg. Instead of this:

- Tagging:
    infile: filename.bam
    total_reads: 345678
- Sorting:
    tagged_infile: filename.bam
    HiC_total_pairs: 345678

You could just do this:

Tagging:
    infile: filename.bam
    total_reads: 345678
Sorting:
    tagged_infile: filename.bam
    HiC_total_pairs: 345678

It would be nice to have some metadata about the tool and the run itself. For example, starting the file with the following:

Meta:
    tool: SNPsplit
    version: 1.23
    date_run: 2019-11-23T15:45:43.1Z
    mode: bisulfite
    command: snpsplit -x -i filename.bam -o output.tsv

FelixKrueger · 2019-11-25T15:55:46Z

I have now updated the YAML section to include lots more useful metadata. Please find the sample reports attached. More details here: FelixKrueger/SNPsplit#29.

YAML_reports.zip

ewels · 2019-11-25T16:37:56Z

Yay! Super nice!

Date string is kind of a tricky format (my example above uses the canonical YAML format), but that really doesn't matter.. Thank you for doing this!

FelixKrueger · 2019-11-25T16:49:36Z

I could quickly re-factor the time format to be in canonical YAMl format if that helps? What does the Z at the end signify?

FelixKrueger · 2019-11-25T17:05:52Z

I have now changed it to prodece the format:
2019-11-25T16:59:39.1Z

(I am just appending .1Z at the end, not exactly sure what this means though). Here is an example report:

---
Meta:
  tool: SNPsplit
  version: 0.3.4_dev
  infile: R1_CAST_EiJ_N-masked_GRCm38_bismark_bt2_pe.bam
  date_run: 2019-11-25T17:03:40.1Z
  mode: bisulfite
  library: paired-end
  command: SNPsplit --snp all_SNPs_CAST_EiJ_GRCm38.txt.gz R1_CAST_EiJ_N-masked_GRCm38_bismark_bt2_pe.bam
Tagging:
  total_reads: 3754
  unaligned: 0
  percent_unaligned: 0.00
  g1: 612
  percent_g1: 16.30
  g2: 476
  percent_g2: 12.68
  unassignable: 2665
  percent_unassignable: 70.99
  unassigned_but_ct: 379
  no_snp: 2
  percent_no_snp: 0.05
  bizarre: 1
  percent_bizarre: 0.03
  SNP_annotation: all_SNPs_CAST_EiJ_GRCm38.txt.gz
  SNPs_stored: 20668547
  N_containing_reads: 1469
  non_N_containing_reads: 2284
  N_deletion: 1
  percent_N_deletion: 0.03
  multi_N_deletion: 0
  N_was_known_SNP: 2296
  percent_N_was_known_SNP: 100.00
  CT_positions_skipped: 774
  N_not_known: 0
  percent_N_not_known: 0.00
Sorting:
  tagged_infile: R1_CAST_EiJ_N-masked_GRCm38_bismark_bt2_pe.allele_flagged.bam
  PE_total_reads: 1877
  PE_total_pairs: 1877
  PE_total_singletons: 0
  PE_unassignable: 1068
  PE_percent_unassignable: 56.90
  PE_unassignable_pairs: 1068
  PE_unassignable_singletons: 0
  PE_genome1: 455
  PE_percent_genome1: 24.24
  PE_genome1_pairs: 455
  PE_genome1_singletons: 0
  PE_genome2: 353
  PE_percent_genome2: 18.81
  PE_genome2_pairs: 353
  PE_genome2_singletons: 0
  PE_conflicting: 1
  PE_percent_conflicting: 0.05
  PE_conflicting_pairs: 1
  PE_conflicting_singletons: 0
...

ewels · 2019-11-25T19:07:23Z

Off the top of my head, I think the .1 is a fraction of a second, and the Z is something to do with having a UTC time zone?

ewels · 2019-11-26T09:05:50Z

Yup - see https://stackoverflow.com/a/29951427/713980 for a perl snippet

FelixKrueger · 2019-11-26T10:06:42Z

Thanks for that. So are you saying that the current format is acceptable?

ewels · 2020-04-02T14:55:08Z

Hi all,

Merging into a new branch snpsplit so that I can continue working on this code 👍

Phil

FelixKrueger · 2020-04-02T14:58:21Z

Excellent, thanks!

dpryan79 added 2 commits September 17, 2017 13:51

Initial support for SNPsplit (unfinished)

1e7cd98

Support all known SNPsplit output types

539247f

Backticks for highlighting

cbc5afd

ewels added the waiting: changes Issue / PR is on hold, waiting for requested changes label Oct 5, 2017

ewels force-pushed the master branch 2 times, most recently from 2488ce1 to a75c62a Compare March 15, 2018 14:59

dpryan79 closed this Mar 10, 2019

ewels removed the waiting: changes Issue / PR is on hold, waiting for requested changes label Mar 11, 2019

ewels reopened this Mar 11, 2019

ewels added this to the MultiQC v1.8 milestone Oct 7, 2019

ewels removed this from the MultiQC v1.8 milestone Nov 20, 2019

ewels added this to the MultiQC v1.9 milestone Nov 20, 2019

FelixKrueger mentioned this pull request Nov 22, 2019

Stats in YAML format to enable integration into MultiQC FelixKrueger/SNPsplit#29

Closed

Merge branch 'master' into SNPsplit

7fe30ae

ewels changed the base branch from master to snpsplit April 2, 2020 14:49

ewels merged commit 363aecf into MultiQC:snpsplit Apr 2, 2020

ewels mentioned this pull request Apr 2, 2020

SNPsplit #1150

Merged

ewels added a commit that referenced this pull request Apr 2, 2020

SNPsplit: Finish up PR #593

e46ca70

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a SNPsplit module #593

Add a SNPsplit module #593

dpryan79 commented Sep 21, 2017

dpryan79 commented Sep 21, 2017

ewels commented Sep 21, 2017

FelixKrueger commented Sep 21, 2017

ewels commented Sep 21, 2017

FelixKrueger commented Sep 21, 2017

ewels commented Sep 21, 2017

vivekbhr commented Sep 21, 2017 •

edited

Loading

FelixKrueger commented Sep 21, 2017

ewels commented Oct 5, 2017

dpryan79 commented Oct 5, 2017

ewels commented Jun 30, 2018

dpryan79 commented Jul 2, 2018

ewels commented Mar 11, 2019

dpryan79 commented Mar 11, 2019

FelixKrueger commented Mar 11, 2019

ewels commented Mar 11, 2019

FelixKrueger commented Oct 4, 2019

ewels commented Nov 19, 2019

FelixKrueger commented Nov 20, 2019

FelixKrueger commented Nov 22, 2019

ewels commented Nov 23, 2019

FelixKrueger commented Nov 25, 2019

ewels commented Nov 25, 2019

FelixKrueger commented Nov 25, 2019

FelixKrueger commented Nov 25, 2019 •

edited by ewels

Loading

ewels commented Nov 25, 2019

ewels commented Nov 26, 2019

FelixKrueger commented Nov 26, 2019

ewels commented Apr 2, 2020

FelixKrueger commented Apr 2, 2020

Add a SNPsplit module #593

Add a SNPsplit module #593

Conversation

dpryan79 commented Sep 21, 2017

dpryan79 commented Sep 21, 2017

ewels commented Sep 21, 2017

FelixKrueger commented Sep 21, 2017

ewels commented Sep 21, 2017

FelixKrueger commented Sep 21, 2017

ewels commented Sep 21, 2017

vivekbhr commented Sep 21, 2017 • edited Loading

FelixKrueger commented Sep 21, 2017

ewels commented Oct 5, 2017

dpryan79 commented Oct 5, 2017

ewels commented Jun 30, 2018

dpryan79 commented Jul 2, 2018

ewels commented Mar 11, 2019

dpryan79 commented Mar 11, 2019

FelixKrueger commented Mar 11, 2019

ewels commented Mar 11, 2019

FelixKrueger commented Oct 4, 2019

ewels commented Nov 19, 2019

FelixKrueger commented Nov 20, 2019

FelixKrueger commented Nov 22, 2019

ewels commented Nov 23, 2019

FelixKrueger commented Nov 25, 2019

ewels commented Nov 25, 2019

FelixKrueger commented Nov 25, 2019

FelixKrueger commented Nov 25, 2019 • edited by ewels Loading

ewels commented Nov 25, 2019

ewels commented Nov 26, 2019

FelixKrueger commented Nov 26, 2019

ewels commented Apr 2, 2020

FelixKrueger commented Apr 2, 2020

vivekbhr commented Sep 21, 2017 •

edited

Loading

FelixKrueger commented Nov 25, 2019 •

edited by ewels

Loading