BUG: Isoquant is missing 1 and 2-exon genes from PacBio RNA-seq data #128

Closed
sarahcalvo opened this issue Dec 5, 2023 · 12 comments
Labels
weird results Something looks odd in the resulting files

Comments

@sarahcalvo

I'm working with Brian Haas to use Isoquant to reconstruct genes in an amoeba species. This species has genes tightly packed together with almost no intergenic space, and genes typically have many introns.

IsoQuant completely misses several obvious 1- and 2-exon genes, whether run with default parameters or with any of the following: --report_novel_unspliced "true"; --model_construction_strategy "sensitive_pacbio"; --fl_data

I created a tiny BAM file with a single region with 8 genes (~900 reads total), of which Isoquant only calls transcripts for 6 (using any of the above flags).

Here are some example files in directory https://personal.broadinstitute.org/scalvo/for_isoquant_debugging/

  • Isoquant.example_missing_genes.pptx : screenshot showing 2 missing genes
  • T1.r1.mini.bam : bam file with ~900 reads, that should have 8 transcripts
  • out_mini_default : directory with output of isoquant using default parameters (run on this mini.bam file)

Any suggestions?

@andrewprzh
Collaborator

Dear @sarahcalvo

In my experience, mono-exonic and single-intron alignments are incorrect significantly more often than alignments with 3 or more exons. IsoQuant therefore performs additional checks on these alignments; for example, it filters them out based on mapping quality or the presence of a polyA tail. I presume some of these filters may be affecting your results.

Thank you for the data. My schedule is tight at the moment, but I hope to get my hands on it ASAP.

Best
Andrey

@andrewprzh added the "weird results" label (Something looks odd in the resulting files) on Dec 8, 2023
@sarahcalvo
Author

Dear @andrewprzh, I know this is a super busy time of year, and I've been trying various workarounds. I just wanted to let you know that when you have a chance to look into this (in the new year), I'm super eager to follow up! Best, and happy holidays, Sarah

@andrewprzh
Collaborator

andrewprzh commented Dec 26, 2023

Dear @sarahcalvo

The answer turned out to be simpler than I thought. IsoQuant is extra careful about 1- and 2-exon alignments because they can be false positives (typically for ONT data). It therefore requires them to have a polyA tail before they are used for transcript construction. Your reads have no polyA tails, so 1- and 2-exon transcripts are missed entirely.

At least for 2-exon transcripts, this can be easily fixed with an option. For mono-exonic transcripts it might take some time, as the polyA position is essential for their detection.
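
The rule described above can be illustrated with a minimal sketch. This is not IsoQuant's actual code: the `Alignment` class, its fields, and the `max_exons_requiring_polya` parameter are hypothetical names invented for this example, and the threshold of 2 exons is taken only from the description in this comment.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical, simplified read alignment (not an IsoQuant class)."""
    read_id: str
    exon_count: int
    has_polya: bool


def usable_for_construction(aln: Alignment, max_exons_requiring_polya: int = 2) -> bool:
    """Return True if the alignment may be used to build transcripts.

    Assumption for illustration: alignments with at most
    `max_exons_requiring_polya` exons must carry a polyA tail,
    while alignments with more exons are accepted regardless.
    """
    if aln.exon_count <= max_exons_requiring_polya:
        return aln.has_polya
    return True


# Three reads without polyA tails: only the multi-exon one survives.
reads = [
    Alignment("read_1", exon_count=1, has_polya=False),
    Alignment("read_2", exon_count=2, has_polya=False),
    Alignment("read_3", exon_count=5, has_polya=False),
]
kept = [r.read_id for r in reads if usable_for_construction(r)]
print(kept)  # ['read_3']
```

Under this kind of rule, a data set in which no read carries a polyA tail loses every 1- and 2-exon alignment, which matches the behaviour reported in this issue.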

Anyway, I'll see what can be done and hopefully will improve that in the next release.

Best
Andrey

@sarahcalvo
Author

sarahcalvo commented Dec 26, 2023 via email

@andrewprzh
Collaborator

@sarahcalvo did you do any read cleaning/trimming before using IsoQuant?
I ask because polyA is detected in exactly 0 reads.
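
A quick way to sanity-check whether raw read sequences still carry tails is sketched below. It is a rough heuristic written for this thread, not IsoQuant's polyA detection: the function names, the 10-base window, and the one-mismatch tolerance are all assumptions chosen for illustration.

```python
def has_polya_tail(seq: str, min_len: int = 10, max_mismatches: int = 1) -> bool:
    """Heuristic: does the read end in an A-rich run or start with a T-rich run?

    A T-rich start is checked because a read sequenced from the opposite
    strand shows the tail as polyT at its 5' end.
    """
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    tail = seq[-min_len:]
    head = seq[:min_len]
    a_rich = min_len - tail.count("A") <= max_mismatches
    t_rich = min_len - head.count("T") <= max_mismatches
    return a_rich or t_rich


def count_polya_reads(seqs) -> int:
    """Count sequences that pass the heuristic above."""
    return sum(has_polya_tail(s) for s in seqs)


# Made-up example sequences, not taken from the shared BAM file.
reads = [
    "ACGTACGTAGCTAGCTAGGATCCA",      # trimmed read: no tail
    "ACGTACGTAGCTAGCTAAAAAAAAAAAA",  # polyA tail at the 3' end
    "TTTTTTTTTTTTGCTAGCTAGCATGCA",   # polyT at the 5' end
]
print(count_polya_reads(reads), "of", len(reads))  # 2 of 3
```

If a check like this reports zero on real data, the tails were most likely removed by an upstream cleaning or trimming step.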

Best
Andrey

@sarahcalvo
Author

sarahcalvo commented Dec 26, 2023 via email

@sarahcalvo
Author

sarahcalvo commented Dec 26, 2023 via email

@andrewprzh
Collaborator

@sarahcalvo

I know that the IsoSeq pipeline provides CCS headers that contain information about polyA tails detected in reads. I think I might implement support for them at some point if that would be useful.

Meanwhile, I have improved the reporting of novel mono-intronic transcripts and added new options that let the user tune polyA usage. This will come out in the next release.

Monoexonic transcripts are still in question as polyA positions are essential for clustering reads together.
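
To show why polyA positions matter here, the following sketch groups reads by the genomic coordinate of their polyA site. It is an assumed, simplified illustration and not IsoQuant's clustering algorithm: the function name, the `max_gap` window of 20 bases, and the coordinates are all made up.

```python
def cluster_by_polya_site(polya_positions, max_gap=20):
    """Group polyA site coordinates into clusters.

    Positions are sorted, and a new cluster starts whenever the distance
    to the previous position exceeds `max_gap` (an arbitrary window).
    """
    clusters = []
    for pos in sorted(polya_positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Made-up polyA coordinates for six mono-exonic reads.
sites = [1005, 1010, 998, 2500, 2512, 4000]
print(cluster_by_polya_site(sites))
# [[998, 1005, 1010], [2500, 2512], [4000]]
```

Without intron chains to compare, a shared 3' end is one of the few signals that separates neighbouring mono-exonic transcripts, which is especially relevant in a genome with almost no intergenic space.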

Best
Andrey

@sarahcalvo
Author

sarahcalvo commented Jan 3, 2024 via email

@andrewprzh
Collaborator

Dear @sarahcalvo

Would it be possible for you to share just a few lines from any CCS header files, if you happen to have any?

Best
Andrey

@sarahcalvo
Author

sarahcalvo commented Jan 8, 2024 via email

@andrewprzh
Collaborator

I'll close this issue for now, as the original problem should now be solved in IsoQuant 3.4.

Implementing CCS headers is on the roadmap for the next release.
