How do I define which BAM tags contain the UMI / cell barcode? #23
Hi @mschilli87, you're correct about replacing CB -> XC. As for replacing "UB", I need to start with an explanation. UB is a cellranger-specific tag, and demuxalot also uses the alignment-score tag from cellranger to drop poor alignments. (Usually mapq is used for this purpose, but some terrible alignments can have maximal mapq.) All cellranger-specific logic is encapsulated in a single function here: You can pass an alternative implementation of parse_read to all functions. Or just edit existing |
@arogozhnikov: Thx for your response. I think a simple generic |
@mschilli87

    from pysam import AlignedRead
    from typing import Optional, Tuple
    from demuxalot.utils import hash_string

    class CustomReadFilter:
        def __init__(self, umi_tag='UB', min_mapq=20):
            self.umi_tag = umi_tag
            self.min_mapq = min_mapq

        def __call__(self, read: AlignedRead) -> Optional[Tuple[float, int]]:
            if not read.has_tag(self.umi_tag):
                # does not have molecule barcode
                return None
            if read.mapq < self.min_mapq:
                # this one should not be triggered because of NH, but just in case
                return None
            p_misaligned = 0.01  # default value
            ub = hash_string(read.get_tag("UB"))
            return p_misaligned, ub

    my_parse_read = CustomReadFilter(umi_tag="XM")  # pass this as a parse_read argument

PS. I'd still recommend checking if there are some available tags that can help to discard non-reliable alignments |
@arogozhnikov: Thx, but shouldn't it be

    - ub = hash_string(read.get_tag("UB"))
    + ub = hash_string(read.get_tag(self.umi_tag))

? I'll check the available BAM flags regarding Would you be up for including such a class in a demuxalot release, so other projects could make use of it without the need to (re-)implement it there? |
My files do have the

    from pysam import AlignedRead
    from typing import Optional, Tuple
    from demuxalot.utils import hash_string

    class CustomReadFilter:
        def __init__(self, umi_tag='UB', min_mapq=20):
            self.umi_tag = umi_tag
            self.min_mapq = min_mapq

        def __call__(self, read: AlignedRead) -> Optional[Tuple[float, int]]:
            if read.get_tag("AS") <= len(read.seq) - 8:
                # more than 2 edits
                return None
            if read.get_tag("NH") > 1:
                # multi-mapped
                return None
            if not read.has_tag(self.umi_tag):
                # does not have molecule barcode
                return None
            if read.mapq < self.min_mapq:
                # this one should not be triggered because of NH, but just in case
                return None
            p_misaligned = 0.01  # default value
            ub = hash_string(read.get_tag(self.umi_tag))
            return p_misaligned, ub

    my_parse_read = CustomReadFilter(umi_tag="XM")  # pass this as a parse_read argument

@arogozhnikov: Does this look good to you? If so, is there any reason not to have this be the default implementation rather than hard-coding the UMI tag to |
I guess what I am trying to pitch here is

    def parse_read(read: AlignedRead, umi_tag="UB") -> Optional[Tuple[float, int]]:
        """
        returns None if read should be ignored.
        Read still can be ignored if it is not in the barcode list
        """
        if read.get_tag("AS") <= len(read.seq) - 8:
            # more than 2 edits
            return None
        if read.get_tag("NH") > 1:
            # multi-mapped
            return None
        if not read.has_tag(umi_tag):
            # does not have molecule barcode
            return None
        if read.mapq < 20:
            # this one should not be triggered because of NH, but just in case
            return None
        p_misaligned = 0.01  # default value
        ub = hash_string(read.get_tag(umi_tag))
        return p_misaligned, ub

Wouldn't this keep working the same as the current implementation, with the added benefit of allowing the UMI tag to be changed without the need to copy/paste all the boilerplate code needed for the custom class? |
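With a keyword parameter like the one proposed above, a caller could bind a non-default tag using `functools.partial` instead of writing a callback class. The following is a minimal sketch under that assumption. `FakeRead` and the CRC-based `hash_string` are hypothetical stand-ins for `pysam.AlignedRead` and `demuxalot.utils.hash_string`, included only so the snippet runs without either library.

```python
from functools import partial
from typing import Optional, Tuple
import zlib


class FakeRead:
    """Hypothetical stand-in exposing only the read attributes used below."""

    def __init__(self, seq: str, mapq: int, tags: dict):
        self.seq = seq
        self.mapq = mapq
        self._tags = tags

    def has_tag(self, name: str) -> bool:
        return name in self._tags

    def get_tag(self, name: str):
        return self._tags[name]  # raises KeyError when the tag is absent


def hash_string(s: str) -> int:
    # Stand-in for demuxalot.utils.hash_string (assumption: any stable
    # string-to-int hash is enough for this illustration).
    return zlib.crc32(s.encode())


def parse_read(read, umi_tag: str = "UB") -> Optional[Tuple[float, int]]:
    """Same filters as the proposal above, with the UMI tag as a parameter."""
    if read.get_tag("AS") <= len(read.seq) - 8:
        return None  # alignment score too low
    if read.get_tag("NH") > 1:
        return None  # multi-mapped
    if not read.has_tag(umi_tag):
        return None  # no molecule barcode
    if read.mapq < 20:
        return None
    return 0.01, hash_string(read.get_tag(umi_tag))


# Bind the tag once; the result takes a single read, like the original callback.
parse_read_xm = partial(parse_read, umi_tag="XM")

read = FakeRead(seq="ACGT" * 25, mapq=255,
                tags={"AS": 98, "NH": 1, "XM": "AACCGGTT"})
print(parse_read(read))     # None, because the read has no UB tag
print(parse_read_xm(read))  # (0.01, <hash of the XM value>)
```

Existing callers that pass `parse_read` unchanged would keep the `UB` default.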
sure, you're correct
If you have Illumina-style reads, this should work. If you have long reads, the number of tolerated errors should be somewhat proportional to read length.
I guess for user it would be easier to select across available configs, rather than across flags, e.g. Your suggestion would work for STAR-aligned short reads, which is probably enough. |
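One way to read the long-read remark above is to let the tolerated drop in alignment score grow with read length instead of fixing it at 8. The sketch below only illustrates that idea: the function name, the `drop_per_100bp` parameter, and its default value are hypothetical and are not part of demuxalot.

```python
def passes_score_filter(alignment_score: int, read_length: int,
                        fixed_drop: int = 8, drop_per_100bp: int = 8) -> bool:
    """Accept a read if its alignment score is close enough to its length.

    For short reads this reduces to the fixed rule used in the thread
    (reject when AS <= len - 8); for longer reads the tolerated drop
    scales with read length (hypothetical rule, for illustration only).
    """
    scaled_drop = read_length * drop_per_100bp // 100
    tolerated_drop = max(fixed_drop, scaled_drop)
    return alignment_score > read_length - tolerated_drop


print(passes_score_filter(93, 100))    # True: same outcome as the fixed rule
print(passes_score_filter(92, 100))    # False: same outcome as the fixed rule
print(passes_score_filter(930, 1000))  # True: a long read tolerates a larger drop
```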
Yes, that's exactly my situation: Drop-seq derived short reads sequenced with Illumina and mapped with STAR. I'll fork the repo, make the changes, test them out and open a PR once I confirm it works on my data. I'll leave this issue open until the PR is in. |
@arogozhnikov: While working on (or rather testing, actually) my PR I encountered an issue not with the While As my Python is a bit rusty, I figured I'd check with you that I got this right before trying to hack around more. My best guess would be to add a Any feedback would be appreciated. |
good catch, that's why things need to be tested 😉
yes, that's the best approach |
While cellranger uses the `UB` SAM tag, other scRNA-seq analyses use different SAM tags (*e.g.* `XM`) for the same information. By parameterizing the existing `parse_read` function, this (and other read parsing/filtering options) can be adjusted by the caller without the need for boilerplate code required for a custom callback class. At the same time, by providing the previously hard-coded values as default parameters, compatibility with existing code is kept. All functions using this callback have been extended by optional keyword arguments to allow passing through the newly added parameters. This addresses one out of two issues raised in arogozhnikov#23.
While `BarcodeHandler.__init__` has (optional) parameters to adjust barcode-related options (*e.g.* using non-`CB` SAM tags for the cell barcodes, like `XC`), the static `from_file` function called it without any of those parameters, making it impossible to adjust them when using that function. This is resolved by adding an optional keyword argument dictionary parameter to the static function and passing all arguments provided therein, if any, down to the `__init__` call. This addresses one out of two issues raised in arogozhnikov#23.
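The forwarding pattern described in this commit message can be sketched with a toy class. `ToyBarcodeHandler` is hypothetical and much simpler than demuxalot's `BarcodeHandler`. It uses `**kwargs` for brevity, whereas the actual change may take an explicit dictionary instead.

```python
import os
import tempfile


class ToyBarcodeHandler:
    """Hypothetical, simplified handler; not demuxalot's actual class."""

    def __init__(self, barcodes, tag="CB"):
        self.barcodes = list(barcodes)
        self.tag = tag  # SAM tag that holds the cell barcode

    @staticmethod
    def from_file(path, **kwargs):
        with open(path) as f:
            barcodes = [line.strip() for line in f if line.strip()]
        # Forward any extra keyword arguments (e.g. tag="XC") to __init__,
        # so options available on the constructor are reachable from here too.
        return ToyBarcodeHandler(barcodes, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "barcodes.tsv")
    with open(path, "w") as f:
        f.write("AAACCTGA\nAAACGGGT\n")
    default_handler = ToyBarcodeHandler.from_file(path)
    xc_handler = ToyBarcodeHandler.from_file(path, tag="XC")

print(default_handler.tag, xc_handler.tag)  # CB XC
```

Calls that pass no extra arguments behave as before, which is what keeps existing code working.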
@arogozhnikov: I have implemented everything and opened three PRs:
I have tested #26
If you were to merge those changes and include them in a release, it would greatly simplify my workflow. Thank you for your guidance so far and your feedback yet to come. |
awesome, thank you! Please see my remark on #24 |
thank you @mschilli87 If you have time to make a separate version of this for custom tags, I think that would help others with non-cellranger tags |
@arogozhnikov: Thank you. I'll try to make some time to add an example as per your suggestion. |
sure, it should be already live on pypi: pip install demuxalot==0.4.1 |
@arogozhnikov: Perfect. Thank you! I opened PR #30 as per your suggestion. |
I understand that by default Demuxalot assumes the UMI to be in the `UB` BAM tag and the cell barcode in the `CB` BAM tag, as they are in cellranger output.

I have a BAM input file using `XM` and `XC` instead.

I guess that by passing `tag="XC"` when creating the `BarcodeHandler`, I can take care of one of those, but I have not found any way to specify an alternative to the `UB` default.

Could anyone please give me a pointer on how to make Demuxalot run on my BAM file?

Thank you in advance for your help.