How do I define which BAM tags contain the UMI / cell barcode? #23

mschilli87 · 2024-02-13T15:04:54Z

I understand the by default Demuxalot assumes the UMI to be in the UB BAM tag and the cell barcode in the CB BAM tag as they are in cellranger output.
I have a BAM input file using XM and XC instead.

I guess that by passing tag="XC" when creating the BarcodeHandler, I can take care of one of those but I have not found any way to specify an alternative to the UB default.

Could anyone please give me a pointer on how to make Demuxalot run on my BAM file?

Thank you in advance for your help.

The text was updated successfully, but these errors were encountered:

arogozhnikov · 2024-02-14T20:04:52Z

Hi @mschilli87

you're correct about replacing CB -> XC.

As for replacing "UB", I need to start with an explanation. UB is a cellranger-specific tag, and demuxalot also uses alignment scoring tag from cellranger (to drop poor alignments. Usually mapq is used for this purpose, but some terrible alignments can have maximal mapq).

All cellranger-specific logic is encapsulated in a single function here:
https://github.com/herophilus/demuxalot/blob/master/demuxalot/cellranger_specific.py

You can pass an alternative implementation of parse_read to all functions. Or just edit existing parse_read in-place. LMK if you need any guidance on this

mschilli87 · 2024-02-14T23:02:21Z

@arogozhnikov: Thx for your response. I think a simple generic parse_read implementation with a UMI tag parameter (if you want, defaulting to "UB") would be nice to have. Would you be willing to provide one as part of demuxalot? I could compose a PR if that helps but I'd appreciate any form or guidance/feedback.

arogozhnikov · 2024-02-15T00:41:07Z

@mschilli87
This is a good generic starting point:

from pysam import AlignedRead
from typing import Optional, Tuple

from demuxalot.utils import hash_string

class CustomReadFilter:
    def __init__(self, umi_tag='UB', min_mapq=20):
        self.umi_tag = umi_tag
        self.min_mapq = min_mapq

    def __call__(self, read: AlignedRead) -> Optional[Tuple[float, int]]:
        if not read.has_tag(self.umi_tag):
            # does not have molecule barcode
            return None
        if read.mapq < self.min_mapq:
            # this one should not be triggered because of NH, but just in case
            return None

        p_misaligned = 0.01  # default value
        ub = hash_string(read.get_tag("UB"))
        return p_misaligned, ub


my_parse_read = CustomReadFilter(umi_tag="XM") # pass this as a parse_read argument

PS. I'd still recommend checking if there are some available tags that can help to discard non-reliable alignments

mschilli87 · 2024-02-15T08:13:04Z

@arogozhnikov: Thx, but shouldn't it be

-        ub = hash_string(read.get_tag("UB"))
+        ub = hash_string(read.get_tag(self.umi_tag))

?

I'll check the available BAM flags regarding p_misaligned.

Would you be up for including such a class in a demuxalot release, so other projects could make use of it without the need to (re-)implement it there?

mschilli87 · 2024-02-15T09:39:07Z

My files do have the AS and NH tags. So I guess I could go with

from pysam import AlignedRead
from typing import Optional, Tuple

from demuxalot.utils import hash_string

class CustomReadFilter:
    def __init__(self, umi_tag='UB', min_mapq=20):
        self.umi_tag = umi_tag
        self.min_mapq = min_mapq

    def __call__(self, read: AlignedRead) -> Optional[Tuple[float, int]]:
        if read.get_tag("AS") <= len(read.seq) - 8:
            # more than 2 edits
            return None
        if read.get_tag("NH") > 1:
            # multi-mapped
            return None
        if not read.has_tag(self.umi_tag):
            # does not have molecule barcode
            return None
        if read.mapq < self.min_mapq:
            # this one should not be triggered because of NH, but just in case
            return None

        p_misaligned = 0.01  # default value
        ub = hash_string(read.get_tag(self.umi_tag))
        return p_misaligned, ub

my_parse_read = CustomReadFilter(umi_tag="XM") # pass this as a parse_read argument

@arogozhnikov: Does this look good to you? If so, is there any reason not to have this be the default implementation rather than hard-coding the UMI tag to UB?

mschilli87 · 2024-02-15T09:45:11Z

I guess what I am trying to pitch here is

def parse_read(read: AlignedRead, umi_tag="UB") -> Optional[Tuple[float, int]]:
    """
    returns None if read should be ignored.
    Read still can be ignored if it is not in the barcode list
    """
    if read.get_tag("AS") <= len(read.seq) - 8:
        # more than 2 edits
        return None
    if read.get_tag("NH") > 1:
        # multi-mapped
        return None
    if not read.has_tag(umi_tag):
        # does not have molecule barcode
        return None
    if read.mapq < 20:
        # this one should not be triggered because of NH, but just in case
        return None

    p_misaligned = 0.01  # default value
    ub = hash_string(read.get_tag(umi_tag))
    return p_misaligned, ub

Wouldn't this keep working the same as the current implementation with the added benefit of allowing to change the UMI tag without the need to copy/paste all that boilerplate code needed for the custom class?

arogozhnikov · 2024-02-15T22:23:41Z

shouldn't it be

sure, you're correct

My files do have the AS and NH tags ... Does this look good to you? ...

if you have illumina-style reads, this should work.
AS/NH are likely to be produced by STAR aligner internally, and I'd expect their meaning is the same as in cellranger.

If you have long reads, number of tolerated errors should be somewhat proportional to length.

I guess what I am trying to pitch here is ...

I guess for user it would be easier to select across available configs, rather than across flags, e.g. cellranger_short, cellranger_pacbio, pipseeker_short etc - so that parse_read could use right setting and the right tags... That's the perfect case, but this unlikely to happen.

Your suggestion would work for STAR-aligned short reads, which is probably enough.
So back to your question - PR with either setup is welcome, just make sure it actually works on your data.

mschilli87 · 2024-02-16T09:09:26Z

Yes, that's exactly my situtation: drop-seq derived short reads sequenced with Illumina and mapped with STAR. I'll fork the repo, make the changes, test them out and open a PR once I confirm it works on my data. I'll leave this issue open until the PR is in.

mschilli87 · 2024-02-22T16:27:46Z

@arogozhnikov: While working on (or rather testing, actually) my PR I encountered an issue not with the UB tag parameter I am adding, but with the already existing CB one:

While BarcodeHandler.__init__ has the tag parameter I need to adjust to run Demuxalot on my data, BarcodeHandler.from_file does not and returns BarcodeHandler(barcodes) without any chance to modify that flag.

As my Python is a bit rusty, I figured I check with you I got this right before trying to hack around more. My best guess would be to add a kwargs argument to the static function and pass it down to the __init__ function.

Any feedback would be appreciated.

arogozhnikov · 2024-02-22T18:05:53Z

good catch, that's why things need to be tested 😉

My best guess would be to add a kwargs argument to the static function and pass it down to the init function.

yes, that's the best approach

While cellranger uses the `UB` SAM tag, other scRNA-seq analyses use different SAM tags (*e.g.* `XM`) for the same information. By parameterizing the existing `parse_read` function, this (and other read parsing/filtering options) can be adjusted by the caller without the need for boilerplate code required for a custom callback class. At the same time, by providing the previously hard-coded values as default parameters, compatibility with existing code is kept. All functions using this callback have been extended by optional keyword arguments to allow passing through the newly added parameters. This addresses one out of two issues raised in arogozhnikov#23.

While `BarcodeHandler.__init__` has (optional) parameters to adjust barcode-related options (*e.g.* using non-`CB` SAM tags for the cell barcodes, like `XC`), the static `from_file` function called it without any of those parameters, making it impossible to adjust them when using that function. This is resolved by adding an optional keyword argument dictionary parameters to the static function and passing all argument provided therein, if any, down to the `__init__` call. This addresses one out of two issues raised in arogozhnikov#23.

mschilli87 · 2024-02-23T01:41:25Z

@arogozhnikov: I have implemented everything and opened three PRs:

Parameterize parse_read for custom UMI-tags etc. #24, addressing the UB part only,
BarcodeHandler.from_file: Pass down parameters #25, addressing the CB part only, and
Non cellranger tags #26, combining both of the above for your convenience

I have tested #26

with cellranger-style data and a script not passing any of the new arguments and verified that is works exactly as on current master, and
with my non-cellranger-style data and an adjusted script passing the corresponding parameters to BarcodeHandler.from_file and count_snps, respectively and verified that it runs and yield results in reasonable agreement with running vireo on the same dataset.

If you were to merge those changes and include them in a release, it would greatly simplify my workflow.

Thank you for your guidance so far and you feedback yet to come.

arogozhnikov · 2024-02-23T04:30:32Z

awesome, thank you! Please see my remark on #24

arogozhnikov · 2024-02-27T23:10:10Z

thank you @mschilli87
we just sorted out things with repo, and I've merged both your PRs 🎉

If you have time to make a separate version of this
https://github.com/arogozhnikov/demuxalot/blob/master/examples/1-plain_demultiplexing.py

for custom tags, I think that would help others with non-cellranger tags

mschilli87 · 2024-02-28T08:17:11Z

@arogozhnikov: Thank you. I'll try to make some time to add an example as per your suggestion.
Is there any chance once I have done so those changes will make it into a (Pypi) release anytime soon? I have drneavin/Demultiplexing_Doublet_Detecting_Docs#48 open which depends on that new API which would significantly simplify deployment for me.

arogozhnikov · 2024-02-28T08:21:28Z

sure, it should be already live on pypi:

pip install demuxalot==0.4.1

closes arogozhnikov#23

mschilli87 · 2024-02-28T13:37:27Z

@arogozhnikov: Perfect. Thank you! I opened PR #30 as per your suggestion.

closes arogozhnikov#23

While cellranger uses the `UB` SAM tag, other scRNA-seq analyses use different SAM tags (*e.g.* `XM`) for the same information. By parameterizing the existing `parse_read` function, this (and other read parsing/filtering options) can be adjusted by the caller without the need for boilerplate code required for a custom callback class. At the same time, by providing the previously hard-coded values as default parameters, compatibility with existing code is kept. All functions using this callback have been extended by optional keyword arguments to allow passing through the newly added parameters. This addresses one out of two issues raised in arogozhnikov/demuxalot#23.

While `BarcodeHandler.__init__` has (optional) parameters to adjust barcode-related options (*e.g.* using non-`CB` SAM tags for the cell barcodes, like `XC`), the static `from_file` function called it without any of those parameters, making it impossible to adjust them when using that function. This is resolved by adding an optional keyword argument dictionary parameters to the static function and passing all argument provided therein, if any, down to the `__init__` call. This addresses one out of two issues raised in arogozhnikov/demuxalot#23.

mschilli87 mentioned this issue Feb 23, 2024

Parameterize parse_read for custom UMI-tags etc. #24

Merged

This was referenced Feb 23, 2024

BarcodeHandler.from_file: Pass down parameters #25

Merged

Non cellranger tags #26

Closed

arogozhnikov added the question Further information is requested label Feb 27, 2024

mschilli87 added a commit to mschilli87/demuxalot that referenced this issue Feb 28, 2024

Examples: Add simply custom tag examples

9414a79

closes arogozhnikov#23

mschilli87 mentioned this issue Feb 28, 2024

Examples: Add simple custom tag example #30

Merged

mschilli87 added a commit to mschilli87/demuxalot that referenced this issue Feb 28, 2024

Examples: Add simple custom tag examples

f08417d

closes arogozhnikov#23

mschilli87 added a commit to mschilli87/demuxalot that referenced this issue Feb 28, 2024

Examples: Add simple custom tag example

0d44fac

closes arogozhnikov#23

mschilli87 added a commit to mschilli87/demuxalot that referenced this issue Feb 28, 2024

Examples: Add simple custom tag example

dbf2abc

closes arogozhnikov#23

arogozhnikov closed this as completed in #30 Feb 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How do I define which BAM tags contain the UMI / cell barcode? #23

How do I define which BAM tags contain the UMI / cell barcode? #23

mschilli87 commented Feb 13, 2024

arogozhnikov commented Feb 14, 2024

mschilli87 commented Feb 14, 2024

arogozhnikov commented Feb 15, 2024

mschilli87 commented Feb 15, 2024 •

edited

Loading

mschilli87 commented Feb 15, 2024

mschilli87 commented Feb 15, 2024

arogozhnikov commented Feb 15, 2024

mschilli87 commented Feb 16, 2024

mschilli87 commented Feb 22, 2024

arogozhnikov commented Feb 22, 2024

mschilli87 commented Feb 23, 2024

arogozhnikov commented Feb 23, 2024

arogozhnikov commented Feb 27, 2024

mschilli87 commented Feb 28, 2024 •

edited

Loading

arogozhnikov commented Feb 28, 2024

mschilli87 commented Feb 28, 2024

How do I define which BAM tags contain the UMI / cell barcode? #23

How do I define which BAM tags contain the UMI / cell barcode? #23

Comments

mschilli87 commented Feb 13, 2024

arogozhnikov commented Feb 14, 2024

mschilli87 commented Feb 14, 2024

arogozhnikov commented Feb 15, 2024

mschilli87 commented Feb 15, 2024 • edited Loading

mschilli87 commented Feb 15, 2024

mschilli87 commented Feb 15, 2024

arogozhnikov commented Feb 15, 2024

mschilli87 commented Feb 16, 2024

mschilli87 commented Feb 22, 2024

arogozhnikov commented Feb 22, 2024

mschilli87 commented Feb 23, 2024

arogozhnikov commented Feb 23, 2024

arogozhnikov commented Feb 27, 2024

mschilli87 commented Feb 28, 2024 • edited Loading

arogozhnikov commented Feb 28, 2024

mschilli87 commented Feb 28, 2024

mschilli87 commented Feb 15, 2024 •

edited

Loading

mschilli87 commented Feb 28, 2024 •

edited

Loading