Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does RSEM need HI:i:<n> tag to correctly process multimappers? #191

Open
apredeus opened this issue Mar 21, 2023 · 2 comments
Open

Does RSEM need HI:i:<n> tag to correctly process multimappers? #191

apredeus opened this issue Mar 21, 2023 · 2 comments

Comments

@apredeus
Copy link

Hi all,

I have a niche project that requires the use of Hisat2. The resulting BAM file has a huge number of multimappers. RSEM runs well on it, but the overall number of counts in the output tables seems to be the same as number of lines in the BAM file - which is like 5 times more than the actual number of reads (because they are multimappers!).

I know this was not the issue with STAR transcriptomic BAMs. I've looked at the tags, and the only difference I see that STAR has and Hisat2 doesn't is the HI:i:<n> tag. Is this required for correct processing by RSEM? If so, do you know of any ready-made solution that would add the tag?

Thank you in advance!

-- Alex

@aryan-9898
Copy link

Yes, RSEM does require the HI:i: tag in order to correctly process multimappers. The HI tag (HI:i:) is a field in SAM/BAM files that is used to indicate the index of a read within the input file. When processing multimapped reads, RSEM uses the HI tag to keep track of which reads are duplicates, so that they are not counted multiple times.

If the BAM files generated by Hisat2 do not contain the HI tag, you can add it using the samtools command-line tool. The samtools addreplacerg command allows you to add or modify the values of any field in the SAM/BAM header, including the HI tag.

Here's an example of how to add the HI tag to a BAM file using samtools:

samtools addreplacerg -r 'ID:1' -r 'SM:sample' -r 'PL:illumina' -r 'PU:unit' -r 'LB:library' -r 'DT:2023-05-03T12:34:56-0700' -R 'HiSeq2500' -i 'HI' -o output.bam input.bam

In this example, the ID, SM, PL, PU, LB, and DT tags are arbitrary and can be changed to match your specific sample information. The -R option specifies the sequencing platform used to generate the data, and the -i option specifies the name of the index tag to be added (in this case, "HI"). The input BAM file is "input.bam", and the output BAM file with the added HI tag is "output.bam".

@apredeus
Copy link
Author

apredeus commented May 9, 2023

Hi Aryan,

Thank you for your answer! I am confused, however. RSEM appears to support hisat2 as a mapping tool - yet the options used by RSEM do not allow for HI tag to be outputted.

The issue I was having was due to sorting of the BAM file; apparently, transcriptomic files are always read name-sorted by default, and both RSEM and Salmon expect this to be true in order to work correctly. Adding HI tag does not seem to influence the result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants