New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
binned coverage plot option in align_and_plot #957
Conversation
… bulk (gets sample name from reads_unmapped_bam)
…t code following if statement
This reverts commit a820ab2.
… it" This reverts commit 4317d21.
…ented out code following if statement" This reverts commit a2b590b.
This reverts commit 9f346e1.
This reverts commit 999fad2.
…nning in bulk (gets sample name from reads_unmapped_bam)" This reverts commit 5529f5c.
…bulk (gets sample name from reads_unmapped_bam)
…mmit" This reverts commit 938f6c2.
…into lk-binned-coverage-plots
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! It'll be nice to merge this in so we can make plots for giant genomes without giant file sizes. A few comments below. Do you have an example plot you can add to the PR so we can see how binned plots look (if they're noticeably different)?
parser.add_argument( | ||
'--binLargePlots', | ||
dest="bin_large_plots", | ||
action="store_true", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should turn on binned plots as the default option for genomes beyond a certain length (the axis xlim
?)? It would be nice for users to simply get a plot regardless of genome size, and your y-axis labeling code makes it clear whether a given plot has been binned or not. If we do that we'd probably want the ability to override and turn off auto-binning (or set the bin size manually and auto-scale the width?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beyond some point, plots without binning are useless. But there is a middle ground where you could bin or not bin. We could have binning turned on by default with an override option, and bin a bit less aggressively (something like two to 20 bins per pixel, rather than one bin per pixel). If a plot doesn't really need it, then it won't be binned. As things currently are, GB virus C gets 11-bp bins when binning is turned on, but the non-binned plot is quite readable and reasonably sized, which makes me think I was a bit too aggressive.
Setting the bin size manually—I would not make this the default, but I think that would make an excellent option—I can see people wanting to be able to do it. Auto-scaling the width would be a nice way to make sure that all coverage plots produced are actually readable, but I don't think it is absolutely necessary.
reports.py
Outdated
bin_size = 1 + int(domain_max/preferred_domain) | ||
binned_segment_depths = OrderedDict() | ||
for segment_num, (segment_name, position_depths) in enumerate(segment_depths.items()): | ||
max_depths_in_bins = [max(position_depths[i:i + bin_size]) for i in range(0, len(position_depths), bin_size)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be a question for others who routinely look at these plots, but since these plots are often used to spot regions of low coverage, I wonder if the max
depth of positions within a bin is the stat we care about most, or if min
may be more appropriate. Or maybe the value could be the number of positions within a bin exceeding some threshold value (the mean coverage within the bin?)? Perhaps it should be user-configurable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. When I was playing around with different options, I found that the max best represented the actual shape of the plot, which is what I wanted to see. I have also been more interested in the presence rather than absence of coverage. I don't see any harm in giving people options—but I would want max to be one of those options, since it has served me quite well.
…nside plt.style block, rephrased string concatenation
…mmaryStatistic argument (default max); fixed read_length_threshold parser.add_argument indentation to match the others
reports.py
Outdated
binning_summary_statistic = "min" # for y axis label | ||
binning_action = min | ||
else: | ||
binning_summary_statistic = "max" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using binning_action.__name__
where you want a string version of the binning function name would obviate the need to store it separately
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed that whole block of code and replaced with
binning_action = eval(binning_summary_statistic)
since binning_summary_statistic
is constrained by the choices
now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using eval
was my first instinct as well, despite the fact that it always feels a little wrong, even with constraints. It may be nice to preserve the ability to do it by name in case we ever add an inline function for another bin stat (threshold, q score filter, etc.)
added binning for coverage plots with large reference genomes generated by align_and_plot:
also:
Here's a coverage plot for the first chromosome of P. falciparum without binning (left; 5.2 MB) and with binning (right; 21 KB).
And here's a coverage plot for the whole P. falciparum genome without binning (left; 136.4 MB) and with binning (right; 21 KB).