binned coverage plot option in align_and_plot #957

lakras · 2019-06-03T20:35:47Z

Added binning for coverage plots of large reference genomes generated by align_and_plot:

Off by default; can be turned on with the option bin_large_plots = True.
Plots the maximum or minimum read depth in each bin (option binning_summary_statistic = "max" or "min"; the default is max).
Bin size is set to one bin per pixel (the bin size is displayed in the y-axis label; the Plasmodium bin size is 27 kb).
Binning produces Plasmodium coverage plots with a file size of 20 KB, compared to 140 MB without binning (a file that may not open at all, and is indecipherable if it does).

Also:

Made the sample name field in plot_coverage optional to allow running with the new batch tool runner on DNAnexus (the sample name is taken from reads_unmapped_bam).
Fixed indentation and repeated code in plot_coverage.

Here's a coverage plot for the first chromosome of P. falciparum without binning (left; 5.2 MB) and with binning (right; 21 KB).

And here's a coverage plot for the whole P. falciparum genome without binning (left; 136.4 MB) and with binning (right; 21 KB).

tomkinsc

This is great! It'll be nice to merge this in so we can make plots for giant genomes without giant file sizes. A few comments below. Do you have an example plot you can add to the PR so we can see how binned plots look (if they're noticeably different)?

tomkinsc · 2019-06-03T21:12:17Z

reports.py

+    parser.add_argument(
+        '--binLargePlots',
+        dest="bin_large_plots",
+        action="store_true",


Do you think we should turn on binned plots as the default option for genomes beyond a certain length (the axis xlim?)? It would be nice for users to simply get a plot regardless of genome size, and your y-axis labeling code makes it clear whether a given plot has been binned or not. If we do that we'd probably want the ability to override and turn off auto-binning (or set the bin size manually and auto-scale the width?).

Beyond some point, plots without binning are useless. But there is a middle ground where you could bin or not bin. We could have binning turned on by default with an override option, and bin a bit less aggressively (something like two to 20 bins per pixel, rather than one bin per pixel). If a plot doesn't really need it, then it won't be binned. As things currently are, GB virus C gets 11-bp bins when binning is turned on, but the non-binned plot is quite readable and reasonably sized, which makes me think I was a bit too aggressive.

As for setting the bin size manually: I would not make this the default, but I think it would make an excellent option, since I can see people wanting to do it. Auto-scaling the width would be a nice way to make sure that all coverage plots produced are actually readable, but I don't think it is absolutely necessary.
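To make the auto-binning proposal concrete, here is a minimal sketch of how a bin size could be chosen. It is not the code in this PR: `choose_bin_size`, `plot_width_px`, `bins_per_pixel`, `manual_bin_size`, and `auto_bin` are hypothetical names, and the example numbers are illustrative only. The idea is that a genome short enough to fit in the available bins is left unbinned, and a manual bin size always wins.

```python
def choose_bin_size(genome_length, plot_width_px, bins_per_pixel=10,
                    manual_bin_size=None, auto_bin=True):
    """Return a bin size in bp; 1 means the plot is not binned.

    All parameter names are hypothetical, for illustration only.
    """
    # A manually requested bin size overrides everything else.
    if manual_bin_size is not None:
        if manual_bin_size < 1:
            raise ValueError("manual_bin_size must be at least 1")
        return manual_bin_size

    # The user has turned auto-binning off: plot every position.
    if not auto_bin:
        return 1

    # Allow several bins per pixel so that mid-sized genomes are
    # binned less aggressively than with one bin per pixel.
    max_bins = plot_width_px * bins_per_pixel
    if genome_length <= max_bins:
        return 1  # short genome: no binning needed

    # Ceiling division, so the whole genome fits in at most max_bins bins.
    return -(-genome_length // max_bins)


# Illustrative numbers only: a ~9 kb genome stays unbinned,
# a ~23 Mb genome gets 2300-bp bins on a 1000-pixel-wide plot.
small = choose_bin_size(9_000, plot_width_px=1000)
large = choose_bin_size(23_000_000, plot_width_px=1000)
```

With a scheme like this, the y-axis label could still report the bin size whenever the returned value is greater than 1, so a reader can tell whether a given plot was binned.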

tomkinsc · 2019-06-03T21:33:53Z

reports.py

+        bin_size = 1 + int(domain_max/preferred_domain)
+        binned_segment_depths = OrderedDict()
+        for segment_num, (segment_name, position_depths) in enumerate(segment_depths.items()):
+            max_depths_in_bins = [max(position_depths[i:i + bin_size]) for i in range(0, len(position_depths), bin_size)]


This may be a question for others who routinely look at these plots, but since these plots are often used to spot regions of low coverage, I wonder if the max depth of positions within a bin is the stat we care about most, or if min may be more appropriate. Or maybe the value could be the number of positions within a bin exceeding some threshold value (the mean coverage within the bin?)? Perhaps it should be user-configurable?

Sure. When I was playing around with different options, I found that the max best represented the actual shape of the plot, which is what I wanted to see. I have also been more interested in the presence rather than absence of coverage. I don't see any harm in giving people options—but I would want max to be one of those options, since it has served me quite well.
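For reference, the options discussed in this thread can be compared side by side with a small sketch. It reuses the slicing pattern from the diff above, but `bin_depths` and `count_exceeding` are hypothetical helper names, and the depth values are made up for illustration.

```python
def bin_depths(position_depths, bin_size, summarize=max):
    """Collapse per-position depths into one summary value per bin."""
    return [
        summarize(position_depths[i:i + bin_size])
        for i in range(0, len(position_depths), bin_size)
    ]


def count_exceeding(threshold):
    """Build a summary function that counts positions with depth > threshold."""
    def summarize(depths):
        return sum(1 for depth in depths if depth > threshold)
    return summarize


# Made-up depths; with bin_size=3 the bins are [0, 0, 5], [9, 12, 3], [0, 7].
depths = [0, 0, 5, 9, 12, 3, 0, 7]

max_per_bin = bin_depths(depths, 3, max)                  # [5, 12, 7]
min_per_bin = bin_depths(depths, 3, min)                  # [0, 3, 0]
count_per_bin = bin_depths(depths, 3, count_exceeding(4)) # [1, 2, 1]
```

In this toy example max preserves the overall shape of the coverage profile, while min makes the zero-coverage positions in the first and last bins visible, which is the trade-off being discussed.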

pipes/WDL/workflows/tasks/tasks_reports.wdl

tomkinsc · 2019-06-05T00:54:35Z

reports.py

+                binning_summary_statistic = "min" # for y axis label
+                binning_action = min
+            else:
+                binning_summary_statistic = "max"


Using binning_action.__name__ where you want a string version of the binning function name would obviate the need to store it separately

Removed that whole block of code and replaced with

binning_action = eval(binning_summary_statistic)

since binning_summary_statistic is constrained by the choices now.

Using eval was my first instinct as well, despite the fact that it always feels a little wrong, even with constraints. It may be nice to preserve the ability to do it by name in case we ever add an inline function for another bin stat (threshold, q score filter, etc.)
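One way to keep lookup by name without eval is a small dictionary from statistic name to function, as sketched below. This is an assumption about how it could be done, not the code that was merged: `BINNING_STATISTICS`, `build_parser`, and `resolve_binning_action` are hypothetical names, and the help strings are my own. The flag names and dest values follow the ones mentioned in this PR.

```python
import argparse

# Hypothetical registry: new statistics (threshold, q score filter, etc.)
# could be added here as named functions without touching the lookup code.
BINNING_STATISTICS = {
    "max": max,
    "min": min,
}


def build_parser():
    """Build a parser with the two binning options discussed in this PR."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--binLargePlots',
        dest="bin_large_plots",
        action="store_true",
        help="Bin depths for large plots (illustrative help text).",
    )
    parser.add_argument(
        '--binningSummaryStatistic',
        dest="binning_summary_statistic",
        choices=sorted(BINNING_STATISTICS),
        default="max",
        help="Statistic plotted for each bin (illustrative help text).",
    )
    return parser


def resolve_binning_action(name):
    """Look the function up by name instead of calling eval on user input."""
    return BINNING_STATISTICS[name]


args = build_parser().parse_args(['--binLargePlots', '--binningSummaryStatistic', 'min'])
binning_action = resolve_binning_action(args.binning_summary_statistic)
label = binning_action.__name__  # string form for the y-axis label
```

Because the argparse choices are derived from the dictionary keys, the set of accepted names and the set of available functions cannot drift apart.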

lakras added 22 commits March 5, 2019 23:58

fixed indentation, repeated code in plot_coverage

079d175

added binning for large plots, on by default

4019bf1

added command line and pipe option for binning (now off by default)

a99ab22

moved binLargePlots argument to parser_plot_coverage_common

2b52b4a

makes sample name field in plot_coverage optional to allow running in…

5529f5c

… bulk (gets sample name from reads_unmapped_bam)

hopefully fixed syntax

999fad2

hopefully fixed syntax

9f346e1

syntax debugging: fixed string equality in if statement, commented ou…

a2b590b

…t code following if statement

moved wdl code out of the command block--hopefully this fixes it

4317d21

moved use of reads_unmapped_bam to AFTER it's initialized, oops

a820ab2

Revert "moved use of reads_unmapped_bam to AFTER it's initialized, oops"

b71f50a

This reverts commit a820ab2.

Revert "moved wdl code out of the command block--hopefully this fixes…

20f6392

… it" This reverts commit 4317d21.

Revert "syntax debugging: fixed string equality in if statement, comm…

6e89e82

…ented out code following if statement" This reverts commit a2b590b.

Revert "hopefully fixed syntax"

59f9eda

This reverts commit 9f346e1.

Revert "hopefully fixed syntax"

c98a449

This reverts commit 999fad2.

Revert "makes sample name field in plot_coverage optional to allow ru…

6c0e9f5

…nning in bulk (gets sample name from reads_unmapped_bam)" This reverts commit 5529f5c.

made sample name field in plot_coverage optional to allow running in …

c071dff

…bulk (gets sample name from reads_unmapped_bam)

fixed silly syntax error oops

f856602

removed question mark on sample_name initialization

a7fc758

moved sample_name initialization to test travis build upon commit

938f6c2

Revert "moved sample_name initialization to test travis build upon co…

6c19658

…mmit" This reverts commit 938f6c2.

Merge branch 'master' of https://github.com/broadinstitute/viral-ngs …

56dcfac

…into lk-binned-coverage-plots

tomkinsc reviewed Jun 3, 2019

tomkinsc requested a review from dpark01 June 4, 2019 20:48

lakras added 4 commits June 4, 2019 19:43

fixed inner plot width calculation and variable name, moved binning i…

93e0758

…nside plt.style block, rephrased string concatenation

merging in new changes to master

035eb28

fixed indentation error

8067b4d

added option to plot either max or min in each bin; added --binningSu…

db97775

…mmaryStatistic argument (default max); fixed read_length_threshold parser.add_argument indentation to match the others

tomkinsc reviewed Jun 5, 2019

tomkinsc reviewed Jun 5, 2019

tomkinsc reviewed Jun 5, 2019

lakras and others added 7 commits June 5, 2019 00:06

constrained binningSummaryStatistic choices to min and max

ccc3ba2

replaced evaluation of binning_summary_statistic with eval statement

b46cf00

Merge branch 'master' into lk-binned-coverage-plots

0ad9b6e

Merge branch 'master' into lk-binned-coverage-plots

25ef503

Merge branch 'master' into lk-binned-coverage-plots

489cb00

Merge branch 'master' into lk-binned-coverage-plots

6178118

Merge branch 'master' into lk-binned-coverage-plots

2234c42

tomkinsc approved these changes Jun 29, 2019

tomkinsc merged commit 0c51acb into master Jul 1, 2019

tomkinsc deleted the lk-binned-coverage-plots branch July 1, 2019 15:05
