Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How do I define which BAM tags contain the UMI / cell barcode? #23

Closed
mschilli87 opened this issue Feb 13, 2024 · 16 comments · Fixed by #30
Closed

How do I define which BAM tags contain the UMI / cell barcode? #23

mschilli87 opened this issue Feb 13, 2024 · 16 comments · Fixed by #30
Labels
question Further information is requested

Comments

@mschilli87
Copy link
Contributor

I understand the by default Demuxalot assumes the UMI to be in the UB BAM tag and the cell barcode in the CB BAM tag as they are in cellranger output.
I have a BAM input file using XM and XC instead.

I guess that by passing tag="XC" when creating the BarcodeHandler, I can take care of one of those but I have not found any way to specify an alternative to the UB default.

Could anyone please give me a pointer on how to make Demuxalot run on my BAM file?

Thank you in advance for your help.

@arogozhnikov
Copy link
Owner

Hi @mschilli87

you're correct about replacing CB -> XC.

As for replacing "UB", I need to start with an explanation. UB is a cellranger-specific tag, and demuxalot also uses alignment scoring tag from cellranger (to drop poor alignments. Usually mapq is used for this purpose, but some terrible alignments can have maximal mapq).

All cellranger-specific logic is encapsulated in a single function here:
https://github.com/herophilus/demuxalot/blob/master/demuxalot/cellranger_specific.py

You can pass an alternative implementation of parse_read to all functions. Or just edit existing parse_read in-place. LMK if you need any guidance on this

@mschilli87
Copy link
Contributor Author

@arogozhnikov: Thx for your response. I think a simple generic parse_read implementation with a UMI tag parameter (if you want, defaulting to "UB") would be nice to have. Would you be willing to provide one as part of demuxalot? I could compose a PR if that helps but I'd appreciate any form or guidance/feedback.

@arogozhnikov
Copy link
Owner

@mschilli87
This is a good generic starting point:

from pysam import AlignedRead
from typing import Optional, Tuple

from demuxalot.utils import hash_string

class CustomReadFilter:
    def __init__(self, umi_tag='UB', min_mapq=20):
        self.umi_tag = umi_tag
        self.min_mapq = min_mapq

    def __call__(self, read: AlignedRead) -> Optional[Tuple[float, int]]:
        if not read.has_tag(self.umi_tag):
            # does not have molecule barcode
            return None
        if read.mapq < self.min_mapq:
            # this one should not be triggered because of NH, but just in case
            return None

        p_misaligned = 0.01  # default value
        ub = hash_string(read.get_tag("UB"))
        return p_misaligned, ub


my_parse_read = CustomReadFilter(umi_tag="XM") # pass this as a parse_read argument

PS. I'd still recommend checking if there are some available tags that can help to discard non-reliable alignments

@mschilli87
Copy link
Contributor Author

mschilli87 commented Feb 15, 2024

@arogozhnikov: Thx, but shouldn't it be

-        ub = hash_string(read.get_tag("UB"))
+        ub = hash_string(read.get_tag(self.umi_tag))

?

I'll check the available BAM flags regarding p_misaligned.

Would you be up for including such a class in a demuxalot release, so other projects could make use of it without the need to (re-)implement it there?

@mschilli87
Copy link
Contributor Author

My files do have the AS and NH tags. So I guess I could go with

from pysam import AlignedRead
from typing import Optional, Tuple

from demuxalot.utils import hash_string

class CustomReadFilter:
    def __init__(self, umi_tag='UB', min_mapq=20):
        self.umi_tag = umi_tag
        self.min_mapq = min_mapq

    def __call__(self, read: AlignedRead) -> Optional[Tuple[float, int]]:
        if read.get_tag("AS") <= len(read.seq) - 8:
            # more than 2 edits
            return None
        if read.get_tag("NH") > 1:
            # multi-mapped
            return None
        if not read.has_tag(self.umi_tag):
            # does not have molecule barcode
            return None
        if read.mapq < self.min_mapq:
            # this one should not be triggered because of NH, but just in case
            return None

        p_misaligned = 0.01  # default value
        ub = hash_string(read.get_tag(self.umi_tag))
        return p_misaligned, ub

my_parse_read = CustomReadFilter(umi_tag="XM") # pass this as a parse_read argument

@arogozhnikov: Does this look good to you? If so, is there any reason not to have this be the default implementation rather than hard-coding the UMI tag to UB?

@mschilli87
Copy link
Contributor Author

I guess what I am trying to pitch here is

def parse_read(read: AlignedRead, umi_tag="UB") -> Optional[Tuple[float, int]]:
    """
    returns None if read should be ignored.
    Read still can be ignored if it is not in the barcode list
    """
    if read.get_tag("AS") <= len(read.seq) - 8:
        # more than 2 edits
        return None
    if read.get_tag("NH") > 1:
        # multi-mapped
        return None
    if not read.has_tag(umi_tag):
        # does not have molecule barcode
        return None
    if read.mapq < 20:
        # this one should not be triggered because of NH, but just in case
        return None

    p_misaligned = 0.01  # default value
    ub = hash_string(read.get_tag(umi_tag))
    return p_misaligned, ub

Wouldn't this keep working the same as the current implementation with the added benefit of allowing to change the UMI tag without the need to copy/paste all that boilerplate code needed for the custom class?

@arogozhnikov
Copy link
Owner

shouldn't it be

sure, you're correct

My files do have the AS and NH tags ... Does this look good to you? ...

if you have illumina-style reads, this should work.
AS/NH are likely to be produced by STAR aligner internally, and I'd expect their meaning is the same as in cellranger.

If you have long reads, number of tolerated errors should be somewhat proportional to length.

I guess what I am trying to pitch here is ...

I guess for user it would be easier to select across available configs, rather than across flags, e.g. cellranger_short, cellranger_pacbio, pipseeker_short etc - so that parse_read could use right setting and the right tags... That's the perfect case, but this unlikely to happen.

Your suggestion would work for STAR-aligned short reads, which is probably enough.
So back to your question - PR with either setup is welcome, just make sure it actually works on your data.

@mschilli87
Copy link
Contributor Author

Yes, that's exactly my situtation: drop-seq derived short reads sequenced with Illumina and mapped with STAR. I'll fork the repo, make the changes, test them out and open a PR once I confirm it works on my data. I'll leave this issue open until the PR is in.

@mschilli87
Copy link
Contributor Author

@arogozhnikov: While working on (or rather testing, actually) my PR I encountered an issue not with the UB tag parameter I am adding, but with the already existing CB one:

While BarcodeHandler.__init__ has the tag parameter I need to adjust to run Demuxalot on my data, BarcodeHandler.from_file does not and returns BarcodeHandler(barcodes) without any chance to modify that flag.

As my Python is a bit rusty, I figured I check with you I got this right before trying to hack around more. My best guess would be to add a kwargs argument to the static function and pass it down to the __init__ function.

Any feedback would be appreciated.

@arogozhnikov
Copy link
Owner

good catch, that's why things need to be tested 😉

My best guess would be to add a kwargs argument to the static function and pass it down to the init function.

yes, that's the best approach

mschilli87 added a commit to mschilli87/demuxalot that referenced this issue Feb 23, 2024
While cellranger uses the `UB` SAM tag, other scRNA-seq analyses use
different SAM tags (*e.g.* `XM`) for the same information.
By parameterizing the existing `parse_read` function, this (and other
read parsing/filtering options) can be adjusted by the caller without
the need for boilerplate code required for a custom callback class.
At the same time, by providing the previously hard-coded values as
default parameters, compatibility with existing code is kept.
All functions using this callback have been extended by optional keyword
arguments to allow passing through the newly added parameters.

This addresses one out of two issues raised in
arogozhnikov#23.
mschilli87 added a commit to mschilli87/demuxalot that referenced this issue Feb 23, 2024
While `BarcodeHandler.__init__` has (optional) parameters to adjust
barcode-related options (*e.g.* using non-`CB` SAM tags for the cell
barcodes, like `XC`), the static `from_file` function called it without
any of those parameters, making it impossible to adjust them when using
that function.
This is resolved by adding an optional keyword argument dictionary
parameters to the static function and passing all argument provided
therein, if any, down to the `__init__` call.

This addresses one out of two issues raised in
arogozhnikov#23.
@mschilli87
Copy link
Contributor Author

@arogozhnikov: I have implemented everything and opened three PRs:

  1. Parameterize parse_read for custom UMI-tags etc. #24, addressing the UB part only,
  2. BarcodeHandler.from_file: Pass down parameters #25, addressing the CB part only, and
  3. Non cellranger tags #26, combining both of the above for your convenience

I have tested #26

  1. with cellranger-style data and a script not passing any of the new arguments and verified that is works exactly as on current master, and
  2. with my non-cellranger-style data and an adjusted script passing the corresponding parameters to BarcodeHandler.from_file and count_snps, respectively and verified that it runs and yield results in reasonable agreement with running vireo on the same dataset.

If you were to merge those changes and include them in a release, it would greatly simplify my workflow.

Thank you for your guidance so far and you feedback yet to come.

@arogozhnikov
Copy link
Owner

awesome, thank you! Please see my remark on #24

@arogozhnikov
Copy link
Owner

thank you @mschilli87
we just sorted out things with repo, and I've merged both your PRs 🎉

If you have time to make a separate version of this
https://github.com/arogozhnikov/demuxalot/blob/master/examples/1-plain_demultiplexing.py

for custom tags, I think that would help others with non-cellranger tags

@arogozhnikov arogozhnikov added the question Further information is requested label Feb 27, 2024
@mschilli87
Copy link
Contributor Author

mschilli87 commented Feb 28, 2024

@arogozhnikov: Thank you. I'll try to make some time to add an example as per your suggestion.
Is there any chance once I have done so those changes will make it into a (Pypi) release anytime soon? I have drneavin/Demultiplexing_Doublet_Detecting_Docs#48 open which depends on that new API which would significantly simplify deployment for me.

@arogozhnikov
Copy link
Owner

sure, it should be already live on pypi:

pip install demuxalot==0.4.1

mschilli87 added a commit to mschilli87/demuxalot that referenced this issue Feb 28, 2024
@mschilli87
Copy link
Contributor Author

@arogozhnikov: Perfect. Thank you! I opened PR #30 as per your suggestion.

mschilli87 added a commit to mschilli87/demuxalot that referenced this issue Feb 28, 2024
mschilli87 added a commit to mschilli87/demuxalot that referenced this issue Feb 28, 2024
mschilli87 added a commit to mschilli87/demuxalot that referenced this issue Feb 28, 2024
bestsoftwareandcodereviews10 added a commit to bestsoftwareandcodereviews10/demuxalot that referenced this issue Aug 16, 2024
While cellranger uses the `UB` SAM tag, other scRNA-seq analyses use
different SAM tags (*e.g.* `XM`) for the same information.
By parameterizing the existing `parse_read` function, this (and other
read parsing/filtering options) can be adjusted by the caller without
the need for boilerplate code required for a custom callback class.
At the same time, by providing the previously hard-coded values as
default parameters, compatibility with existing code is kept.
All functions using this callback have been extended by optional keyword
arguments to allow passing through the newly added parameters.

This addresses one out of two issues raised in
arogozhnikov/demuxalot#23.
bestsoftwareandcodereviews10 added a commit to bestsoftwareandcodereviews10/demuxalot that referenced this issue Aug 16, 2024
While `BarcodeHandler.__init__` has (optional) parameters to adjust
barcode-related options (*e.g.* using non-`CB` SAM tags for the cell
barcodes, like `XC`), the static `from_file` function called it without
any of those parameters, making it impossible to adjust them when using
that function.
This is resolved by adding an optional keyword argument dictionary
parameters to the static function and passing all argument provided
therein, if any, down to the `__init__` call.

This addresses one out of two issues raised in
arogozhnikov/demuxalot#23.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
2 participants