Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

binned coverage plot option in align_and_plot #957

Merged
merged 33 commits into from Jul 1, 2019

Conversation

lakras
Copy link
Contributor

@lakras lakras commented Jun 3, 2019

added binning for coverage plots with large reference genomes generated by align_and_plot:

  • off by default, can be turned on with option bin_large_plots = True
  • plots maximum or minimum read depth in each bin (option binning_summary_statistic = "max" or "min", default is max)
  • bin size set to one bin per pixel (bin size displayed in y axis label--Plasmodium bin size is 27 kb)
  • binning results in Plasmodium coverage plots with file size 20 KB, compared to 140 MB without binning (unopenable and indecipherable if it does open)

also:

  • made sample name field in plot_coverage optional to allow running with new batch tool runner on DNA nexus (gets sample name from reads_unmapped_bam)
  • fixed indentation, repeated code in plot_coverage

Here's a coverage plot for the first chromosome of P. falciparum without binning (left; 5.2 MB) and with binning (right; 21 KB).

plasmodium_chr1_binning_comparison

And here's a coverage plot for the whole P. falciparum genome without binning (left; 136.4 MB) and with binning (right; 21 KB).

plasmodium_full_genome_binning_comparison

lakras added 22 commits March 5, 2019 23:58
… bulk (gets sample name from reads_unmapped_bam)
…ented out code following if statement"

This reverts commit a2b590b.
…nning in bulk (gets sample name from reads_unmapped_bam)"

This reverts commit 5529f5c.
…bulk (gets sample name from reads_unmapped_bam)
Copy link
Member

@tomkinsc tomkinsc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! It'll be nice to merge this in so we can make plots for giant genomes without giant file sizes. A few comments below. Do you have an example plot you can add to the PR so we can see how binned plots look (if they're noticeably different)?

reports.py Outdated Show resolved Hide resolved
reports.py Show resolved Hide resolved
parser.add_argument(
'--binLargePlots',
dest="bin_large_plots",
action="store_true",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should turn on binned plots as the default option for genomes beyond a certain length (the axis xlim?)? It would be nice for users to simply get a plot regardless of genome size, and your y-axis labeling code makes it clear whether a given plot has been binned or not. If we do that we'd probably want the ability to override and turn off auto-binning (or set the bin size manually and auto-scale the width?).

Copy link
Contributor Author

@lakras lakras Jun 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beyond some point, plots without binning are useless. But there is a middle ground where you could bin or not bin. We could have binning turned on by default with an override option, and bin a bit less aggressively (something like two to 20 bins per pixel, rather than one bin per pixel). If a plot doesn't really need it, then it won't be binned. As things currently are, GB virus C gets 11-bp bins when binning is turned on, but the non-binned plot is quite readable and reasonably sized, which makes me think I was a bit too aggressive.

Setting the bin size manually—I would not make this the default, but I think that would make an excellent option—I can see people wanting to be able to do it. Auto-scaling the width would be a nice way to make sure that all coverage plots produced are actually readable, but I don't think it is absolutely necessary.

reports.py Outdated Show resolved Hide resolved
reports.py Outdated
bin_size = 1 + int(domain_max/preferred_domain)
binned_segment_depths = OrderedDict()
for segment_num, (segment_name, position_depths) in enumerate(segment_depths.items()):
max_depths_in_bins = [max(position_depths[i:i + bin_size]) for i in range(0, len(position_depths), bin_size)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be a question for others who routinely look at these plots, but since these plots are often used to spot regions of low coverage, I wonder if the max depth of positions within a bin is the stat we care about most, or if min may be more appropriate. Or maybe the value could be the number of positions within a bin exceeding some threshold value (the mean coverage within the bin?)? Perhaps it should be user-configurable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. When I was playing around with different options, I found that the max best represented the actual shape of the plot, which is what I wanted to see. I have also been more interested in the presence rather than absence of coverage. I don't see any harm in giving people options—but I would want max to be one of those options, since it has served me quite well.

pipes/WDL/workflows/tasks/tasks_reports.wdl Show resolved Hide resolved
@tomkinsc tomkinsc requested a review from dpark01 June 4, 2019 20:48
…nside plt.style block, rephrased string concatenation
…mmaryStatistic argument (default max); fixed read_length_threshold parser.add_argument indentation to match the others
reports.py Show resolved Hide resolved
reports.py Outdated Show resolved Hide resolved
reports.py Outdated
binning_summary_statistic = "min" # for y axis label
binning_action = min
else:
binning_summary_statistic = "max"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using binning_action.__name__ where you want a string version of the binning function name would obviate the need to store it separately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed that whole block of code and replaced with

binning_action = eval(binning_summary_statistic)

since binning_summary_statistic is constrained by the choices now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using eval was my first instinct as well, despite the fact that it always feels a little wrong, even with constraints. It may be nice to preserve the ability to do it by name in case we ever add an inline function for another bin stat (threshold, q score filter, etc.)

@tomkinsc tomkinsc merged commit 0c51acb into master Jul 1, 2019
@tomkinsc tomkinsc deleted the lk-binned-coverage-plots branch July 1, 2019 15:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants