Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a SNPsplit module #593

Merged
merged 4 commits into from
Apr 2, 2020
Merged

Add a SNPsplit module #593

merged 4 commits into from
Apr 2, 2020

Conversation

dpryan79
Copy link
Contributor

This adds support for output from SNPsplit. Both regular, bisulfite, and HiC outputs are currently supported. Example output is currently available here.

@dpryan79
Copy link
Contributor Author

@FelixKrueger and @vivekbhr: Let me know if there's something you'd like changed in the output.

@ewels
Copy link
Member

ewels commented Sep 21, 2017

Another PR added to the pile - thanks @dpryan79 😅

Trying to work through these today but as you're currently at your keyboard here is some speed feedback from looking at the report output:

  • Is it worth adding a column or two to the General Statistics table?
    • Perhaps % reads aligned to one genome as I guess this is a good single-figure "did it work or not" statistic..?
  • Bar plot design
    • I usually try to put "good" categories on the left, with "failure" categories after that. I also sometimes try to colour code the bad categories to make them more obvious. See the STAR module for an example of this (and colour codes to steal).
  • Help text
    • Love that you wrote this, super helpful 👍 Minor thing - could be nice to use backticks to give markdown code when quoting the category names, to make them stand out more.
  • I find the two plots kind of confusing as they're so similar
    • You've tried to explain this in the help text I see, and I'm not sure that I can think of a better way to represent / explain the differences... hmm. @FelixKrueger?

Many thanks for working on this though @dpryan79 - it looks great!

Phil

@FelixKrueger
Copy link

Hi Devon,

This looks pretty cool to me already, thanks for your efforts in that direction!

Just generally, the allele-tagging and allele-sorting reports should be identical for single-end files which would make the two bar graphs indeed very similar to each other. An exception to this is BS-Seq which also has the useful information of how many C>T SNPs had to be ignored.

For paired-end data the two reports may be substantially different because both reads are taken into account for the sorting step (while the tagging is based on individual reads).

So one might argue that generally the sorting report is probably the most immediately useful one to plot, however there may be some useful data in the tagging report that may be good to present somehow as well. Maybe this could be accomplished in the form of a General Tagging Statistics table that also uses coloured percentages in its cells (which most modules have at their start)? So maybe the different categories of the tagging plot could go in there, also including the total number of reads, C>T SNPs etc.? (sorry I don't know how much work that would be, but you seem to have all the right data at hand already).

I also find a useful metric the number of N-containing reads that were present in the SNP file:

Of valid N containing reads,
N was present in the list of known SNPs:	25884707 (99.98%)
N was not present in the list of SNPs:		6080 (0.02%)

This is a good and quick metric to see if someone simply used the wrong SNP file for the allele sorting.
Maybe

SNP annotation file:	all_SNPs_CAST_EiJ_GRCm38.txt.gz
SNPs stored in total:	20668547

could also be included for record keeping?

Thanks again, Felix

@ewels
Copy link
Member

ewels commented Sep 21, 2017

Quick thoughts in response to @FelixKrueger:

  • Tables are a last resort and plots are always better if possible 😉
    • A single summary column or two in the General Stats alongside results from other tools is an exception to this rule
  • Would it make sense to show samples in one plot or the other, depending on whether they're SE or PE? For example, the bowtie2 module has separate SE and PE plots as the two are incompatible. Or just drop one of the tables?
  • I prefer to avoid putting text into reports (such as the annotation file used) where possible. But it's certainly easy to parse basically everything and spit this into the multiqc_data files for record keeping.
    • If the Ns present in list of SNPs data is interesting, could do this as a plot? This also has the added bonus of representing the total number of SNPs in the report.

Phil

@FelixKrueger
Copy link

Not quite sure if I understand the first comment, pretty much every single module seems to have a these general statistics underlaid with coloured bar graphs, do they not?

I agree with the other suggestions though, maybe SNPs covered stat could be a single stacked horizontal bar chart as well (with a shout-out red colour if the majority or SNPs were unaccounted for..) .

@ewels
Copy link
Member

ewels commented Sep 21, 2017

Every single module seems to have a these general statistics underlaid with coloured bar graphs

As I understood your comment, I thought you were suggesting taking the categories currently in a bar graph and moving them into a table instead? I'm just pointing out that I often spend PRs asking people to move data out of tables and into plots where possible :)

@vivekbhr
Copy link

vivekbhr commented Sep 21, 2017

Thanks @dpryan79 for this useful PR :)

Summarizing above discussion with my suggestions:

  1. The colors in the tagging and sorting bar charts should be matched. I suggest the order (left to right) : genome1, genome2, conflictingSNPs, ambiguous, unassignable, skipped.

  2. Allele-tagging report should always appear as bar charts. Allele-sorting report should appear additionally as bar chart for the samples that are paired-end ( @FelixKrueger which line from the report you suggest to detect paired-end sample?) .

  3. Another stacked bar chart for the SNP coverage report could be added as an enhancement. This could also go into another PR.

@FelixKrueger
Copy link

Hi @vivekbhr

A good line to decide whether it is SE or PE could the header of the Allele-Sorting step:

$grep Allele-specific *r.txt

BS_SNPsplit_report.PE.txt:Allele-specific paired-end sorting report
BS_SNPsplit_report.SE.txt:Allele-specific single-end sorting report
ChIP.SNPsplit_report.PE.txt:Allele-specific paired-end sorting report
ChIP.SNPsplit_report.SE.txt:Allele-specific single-end sorting report
HiC.SNPsplit_report.txt:Allele-specific paired-end sorting report

Thanks, Felix

@ewels
Copy link
Member

ewels commented Oct 5, 2017

I think maybe we have a case of too many cooks here..? Are you happy with a conclusion from all of this @dpryan79?

Phil

@dpryan79
Copy link
Contributor Author

dpryan79 commented Oct 5, 2017

@ewels Yeah, I just need to find some time to finish working on the module so people can have another look :)

@ewels ewels added the waiting: changes Issue / PR is on hold, waiting for requested changes label Oct 5, 2017
@ewels ewels force-pushed the master branch 2 times, most recently from 2488ce1 to a75c62a Compare March 15, 2018 14:59
@ewels
Copy link
Member

ewels commented Jun 30, 2018

Hi chaps,

Just a little reminder of this ongoing PR. Let me know if I can do anything to help here.

Phil

@dpryan79
Copy link
Contributor Author

dpryan79 commented Jul 2, 2018

Thanks for the reminder, hopefully I'll have a chance to look at this again early next week.

@dpryan79 dpryan79 closed this Mar 10, 2019
@ewels
Copy link
Member

ewels commented Mar 11, 2019

Ah, sad times.. What are your thoughts on merging this as it currently stands @dpryan79 @FelixKrueger?

I'm tempted to think that having something in MultiQC for SNPsplit is better than having nothing..?

@dpryan79
Copy link
Contributor Author

You're a better judge than me. I haven't had any chance to work on this in seemingly forever and don't foresee having a chance in the foreseeable future :( I'm happy to leave the branch and fork around if anyone wants to use it as a starting point of course :)

@FelixKrueger
Copy link

I'd love to see SNPsplit supported in MultiQC, but having never written any module for it it would probably take me a long while... I'm sure you could add the (mainly design) changes you had in mind in just a few minutes, what yo you think, Phil?

@ewels ewels removed the waiting: changes Issue / PR is on hold, waiting for requested changes label Mar 11, 2019
@ewels
Copy link
Member

ewels commented Mar 11, 2019

I'll see what I can do..

@ewels ewels reopened this Mar 11, 2019
@FelixKrueger
Copy link

I'll see what I can do..

Seeing that you are currently working on a new Version, would you mind including this as well? Many thanks!

@ewels ewels added this to the MultiQC v1.8 milestone Oct 7, 2019
@ewels
Copy link
Member

ewels commented Nov 19, 2019

I have kindly requested that the developer behind SNPsplit has a read of my thoughts about log files at http://tallphil.co.uk/writing-good-log-files/#suggestion-2-use-nice-formats and considers adding support for a machine-readable metrics file, preferably YAML or JSON. 😜

@FelixKrueger
Copy link

I'm happy to write all metrics to a YAML file, but it will have to wait until today or tomorrow. Hope I won't be too late to the party....

@ewels ewels removed this from the MultiQC v1.8 milestone Nov 20, 2019
@FelixKrueger
Copy link

I seem to have missed the party ever so slightly, but nevertheless there should now be YAML reports for SNPsplit. I have never produced any YAML files before, but I hope they are acceptable.

The statistics can be slightly different for different data types, so I am attaching sample reports for

  • SE alignments [Bowtie2]
  • PE alignments [Bowtie2]
  • SE bisulfite alignments [Bismark]
  • PE bisulfite alignments [Bismark]
  • PE Hi-C alignments [HiCUP]

Many thanks, Felix
YAML_reports.zip

@ewels
Copy link
Member

ewels commented Nov 23, 2019

Apologies @FelixKrueger - I know you've been asking me for this for ages and I cut you off from the release.. Just needed to get it out! I promise it won't be another year until the next one.. 😉

Thanks for the YAML support, looks great! Couple of minor suggestions (trying to use perl terminology):

  • No need to use an array of hashes for the top level organisation. You can just use a hash.

    • eg. Instead of this:

      - Tagging:
          infile: filename.bam
          total_reads: 345678
      - Sorting:
          tagged_infile: filename.bam
          HiC_total_pairs: 345678

      You could just do this:

      Tagging:
          infile: filename.bam
          total_reads: 345678
      Sorting:
          tagged_infile: filename.bam
          HiC_total_pairs: 345678
  • It would be nice to have some metadata about the tool and the run itself. For example, starting the file with the following:

    Meta:
        tool: SNPsplit
        version: 1.23
        date_run: 2019-11-23T15:45:43.1Z
        mode: bisulfite
        command: snpsplit -x -i filename.bam -o output.tsv

@FelixKrueger
Copy link

I have now updated the YAML section to include lots more useful metadata. Please find the sample reports attached. More details here: FelixKrueger/SNPsplit#29.

YAML_reports.zip

@ewels
Copy link
Member

ewels commented Nov 25, 2019

Yay! Super nice!

Date string is kind of a tricky format (my example above uses the canonical YAML format), but that really doesn't matter.. Thank you for doing this!

@FelixKrueger
Copy link

I could quickly re-factor the time format to be in canonical YAMl format if that helps? What does the Z at the end signify?

@FelixKrueger
Copy link

FelixKrueger commented Nov 25, 2019

I have now changed it to prodece the format:
2019-11-25T16:59:39.1Z

(I am just appending .1Z at the end, not exactly sure what this means though). Here is an example report:

---
Meta:
  tool: SNPsplit
  version: 0.3.4_dev
  infile: R1_CAST_EiJ_N-masked_GRCm38_bismark_bt2_pe.bam
  date_run: 2019-11-25T17:03:40.1Z
  mode: bisulfite
  library: paired-end
  command: SNPsplit --snp all_SNPs_CAST_EiJ_GRCm38.txt.gz R1_CAST_EiJ_N-masked_GRCm38_bismark_bt2_pe.bam
Tagging:
  total_reads: 3754
  unaligned: 0
  percent_unaligned: 0.00
  g1: 612
  percent_g1: 16.30
  g2: 476
  percent_g2: 12.68
  unassignable: 2665
  percent_unassignable: 70.99
  unassigned_but_ct: 379
  no_snp: 2
  percent_no_snp: 0.05
  bizarre: 1
  percent_bizarre: 0.03
  SNP_annotation: all_SNPs_CAST_EiJ_GRCm38.txt.gz
  SNPs_stored: 20668547
  N_containing_reads: 1469
  non_N_containing_reads: 2284
  N_deletion: 1
  percent_N_deletion: 0.03
  multi_N_deletion: 0
  N_was_known_SNP: 2296
  percent_N_was_known_SNP: 100.00
  CT_positions_skipped: 774
  N_not_known: 0
  percent_N_not_known: 0.00
Sorting:
  tagged_infile: R1_CAST_EiJ_N-masked_GRCm38_bismark_bt2_pe.allele_flagged.bam
  PE_total_reads: 1877
  PE_total_pairs: 1877
  PE_total_singletons: 0
  PE_unassignable: 1068
  PE_percent_unassignable: 56.90
  PE_unassignable_pairs: 1068
  PE_unassignable_singletons: 0
  PE_genome1: 455
  PE_percent_genome1: 24.24
  PE_genome1_pairs: 455
  PE_genome1_singletons: 0
  PE_genome2: 353
  PE_percent_genome2: 18.81
  PE_genome2_pairs: 353
  PE_genome2_singletons: 0
  PE_conflicting: 1
  PE_percent_conflicting: 0.05
  PE_conflicting_pairs: 1
  PE_conflicting_singletons: 0
...

@ewels
Copy link
Member

ewels commented Nov 25, 2019

Off the top of my head, I think the .1 is a fraction of a second, and the Z is something to do with having a UTC time zone?

@ewels
Copy link
Member

ewels commented Nov 26, 2019

Yup - see https://stackoverflow.com/a/29951427/713980 for a perl snippet

@FelixKrueger
Copy link

Thanks for that. So are you saying that the current format is acceptable?

@ewels ewels changed the base branch from master to snpsplit April 2, 2020 14:49
@ewels
Copy link
Member

ewels commented Apr 2, 2020

Hi all,

Merging into a new branch snpsplit so that I can continue working on this code 👍

Phil

@ewels ewels merged commit 363aecf into MultiQC:snpsplit Apr 2, 2020
@ewels ewels mentioned this pull request Apr 2, 2020
@FelixKrueger
Copy link

Excellent, thanks!

ewels added a commit that referenced this pull request Apr 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants