BUG: Isoquant is missing 1 and 2-exon genes from PacBio RNA-seq data #128

sarahcalvo · 2023-12-05T15:15:32Z

I'm working with Brian Haas to use Isoquant to reconstruct genes in an amoeba species. This species has genes tightly packed together with almost no intergenic space, and genes typically have many introns.

IsoQuant completely misses several obvious 1- and 2-exon genes, whether run with default parameters or with any of the following options: --report_novel_unspliced "true"; --model_construction_strategy "sensitive_pacbio"; --fl_data

I created a tiny BAM file with a single region with 8 genes (~900 reads total), of which Isoquant only calls transcripts for 6 (using any of the above flags).

Here are some example files in directory https://personal.broadinstitute.org/scalvo/for_isoquant_debugging/

Isoquant.example_missing_genes.pptx : screen shot showing 2 missing genes
T1.r1.mini.bam : bam file with ~900 reads, that should have 8 transcripts
out_mini_default : directory with output of isoquant using default parameters (run on this mini.bam file)

Any suggestions?

andrewprzh · 2023-12-08T00:35:06Z

Dear @sarahcalvo

In my experience, mono-exonic and single-intron alignments are incorrect significantly more often than alignments with 3 or more exons. IsoQuant therefore performs additional checks on these alignments, for example filtering them out based on mapping quality or the presence of a polyA tail. I presume some of these filters may affect your results.

Thank you for the data. My schedule is tight at the moment, but I hope to get my hands on it ASAP.

Best
Andrey
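
For readers who want to see how many of their alignments fall into the mono-exonic and single-intron classes mentioned above, here is a minimal sketch that counts exon blocks from a CIGAR string. It is not IsoQuant code: the function names `count_exons` and `exon_class` are hypothetical, and it assumes the aligner reports introns as `N` operations, as spliced aligners conventionally do.

```python
import re

# One (length, operation) pair per CIGAR element, e.g. "120M85N300M".
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def count_exons(cigar: str, min_intron: int = 1) -> int:
    """Return the number of exon blocks implied by a CIGAR string.

    Every 'N' (reference skip) of at least `min_intron` bases is counted
    as one intron, so the exon count is the intron count plus one.
    """
    introns = sum(
        1
        for length, op in CIGAR_RE.findall(cigar)
        if op == "N" and int(length) >= min_intron
    )
    return introns + 1


def exon_class(cigar: str) -> str:
    """Label an alignment with the classes discussed in this thread."""
    n = count_exons(cigar)
    if n == 1:
        return "mono-exonic"
    if n == 2:
        return "single-intron"
    return "multi-intron"


if __name__ == "__main__":
    for cigar in ("500M", "120M85N300M", "10S100M50N20M2I30M70N90M"):
        print(cigar, exon_class(cigar))
```

The CIGAR strings in the usage loop are made up for illustration; in practice they would come from the alignments in the BAM file.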

sarahcalvo · 2023-12-19T15:24:34Z

Dear @andrewprzh, I know this is a super busy time of year and I've been trying various work-arounds. But I just wanted to let you know that when you have a chance to look into this (in the new year) -- I'm super eager to follow-up! Best, and happy holidays -- Sarah

andrewprzh · 2023-12-26T10:09:04Z

Dear @sarahcalvo

The answer turned out to be simpler than I thought. IsoQuant is extra careful about 1- and 2-exon alignments, as they can be false positives (typically for ONT data). Thus, IsoQuant requires them to have a polyA tail in order to be used for transcript construction. Your reads have no polyA tails, and thus 1- and 2-exon transcripts are entirely missed.

At least for 2-exon transcripts that can be easily fixed with an option. For monoexonic it might take some time as polyA position is essential for the detection.

Anyway, I'll see what can be done and hopefully will improve that in the next release.

Best
Andrey
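
A quick way to check whether a read set still carries polyA tails is sketched below. This is only an illustrative heuristic, not IsoQuant's actual detection logic: the function names, the 20-base window and the 80% threshold are all assumptions, and the reads are plain sequence strings as stored in the input file.

```python
def has_polya_tail(seq: str, window: int = 20, min_fraction: float = 0.8) -> bool:
    """Heuristic polyA check on a raw read sequence.

    Returns True if the last `window` bases are mostly A, or the first
    `window` bases are mostly T (a read sequenced from the opposite strand).
    """
    seq = seq.upper()
    if len(seq) < window:
        return False
    needed = min_fraction * window
    tail, head = seq[-window:], seq[:window]
    return tail.count("A") >= needed or head.count("T") >= needed


def polya_fraction(reads) -> float:
    """Fraction of reads for which the heuristic finds a tail."""
    reads = list(reads)
    if not reads:
        return 0.0
    return sum(has_polya_tail(r) for r in reads) / len(reads)


if __name__ == "__main__":
    trimmed = ["ACGT" * 10] * 3                      # tails already removed
    untrimmed = ["ACGT" * 10 + "A" * 25, "ACGT" * 10]
    print(polya_fraction(trimmed), polya_fraction(untrimmed))
```

A fraction at or near zero across a whole library would suggest that tails were trimmed upstream rather than being absent biologically.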

sarahcalvo · 2023-12-26T11:42:37Z

Thanks so much Andrey! This raises some very interesting biological hypotheses too that we will look into, unless it's an artifact from the MAS-ISO-seq processing pipeline. I'll look into both options and let you know!

Sarah

andrewprzh · 2023-12-26T11:56:54Z

@sarahcalvo did you do some read cleaning/trimming before using IsoQuant?
Because polyA is detected in exactly 0 reads.

Best
Andrey

sarahcalvo · 2023-12-26T13:04:08Z

MAS-ISO-seq is a new experimental method where cDNA transcripts are concatenated together and then sequenced with PacBio. They have developed a software pipeline that processes the long concatenated consensus reads into the transcript reads. I hadn't realized none of the reads had polyA, but my guess is that the MAS-ISO-seq pipeline trims the polyA as part of its processing. I'll ask Brian Haas to confirm.

Sarah

sarahcalvo · 2023-12-26T22:04:29Z

Yes, just confirmed that polyA is trimmed from the ends of the reads as part of the standard/official MAS-ISO-seq processing pipeline, so it shouldn't show up in any of the reads that get aligned to the genome.

Sarah

andrewprzh · 2023-12-27T14:18:28Z

@sarahcalvo

I know that the IsoSeq pipeline provides CCS headers that contain information about polyA tails detected in reads. I think I might implement support for them at some point if that would be useful.

Meanwhile, I improved the reporting of novel mono-intronic transcripts and added new options that let the user tune polyA usage. This will come out in the next release.

Monoexonic transcripts are still an open question, as polyA positions are essential for clustering reads together.

Best
Andrey

sarahcalvo · 2024-01-03T18:46:28Z

Thanks so much Andrey! I look forward to the next release. Yes, it would definitely be helpful to have a version that supports IsoSeq (with info on polyA tails from the CCS headers). The technology seems to be working great. Sarah

andrewprzh · 2024-01-05T13:24:23Z

Dear @sarahcalvo

Would it be possible for you to share just a few lines from any CCS header files, if you happen to have any?

Best
Andrey

sarahcalvo · 2024-01-08T17:35:25Z

Hi Andrey,

Yes! Sorry for the delay. This file has the info for the region in the mini BAM I sent you before:
https://personal.broadinstitute.org/scalvo/for_isoquant_debugging/T1.r1.mini.read.refine.report.csv

Here are a few lines:

id,strand,fivelen,threelen,polyAlen,insertlen,primer
m84043_231012_201426_s1/160961840/ccs/2694_3482,+,9,7,33,788,asa04_5p--3p
m84043_231012_201426_s1/126490519/ccs/67_836,+,7,9,43,769,asa04_5p--3p
m84043_231012_201426_s1/60360594/ccs/4737_5506,+,22,51,48,769,asa04_5p--3p
m84043_231012_201426_s1/229184738/ccs/5806_6560,+,5,9,32,754,asa04_5p--3p
m84043_231012_201426_s1/187438050/ccs/1980_2399,+,1,9,58,419,asa04_5p--3p
m84043_231012_201426_s1/200545278/ccs/7144_7898,+,9,7,54,754,asa04_5p--3p
m84043_231012_201426_s1/146020224/ccs/35_789,+,9,7,26,754,asa04_5p--3p
m84043_231012_201426_s1/146085784/ccs/3011_3430,+,9,6,59,419,asa04_5p--3p
m84043_231012_201426_s1/163125161/ccs/2731_3045,+,7,9,35,314,asa04_5p--3p
m84043_231012_201426_s1/153490061/ccs/1383_2137,+,7,9,32,754,asa04_5p--3p

Sarah
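
The refine report lines quoted above can be read with Python's standard `csv` module. The sketch below is only an illustration: the column names are taken from the sample, while the helper name `load_polya_lengths` and the reading of `polyAlen > 0` as "a tail was detected before trimming" are assumptions, not documented IsoQuant or IsoSeq behavior.

```python
import csv
import io
from statistics import median

# First three data rows copied from the sample shared in this thread.
SAMPLE = """id,strand,fivelen,threelen,polyAlen,insertlen,primer
m84043_231012_201426_s1/160961840/ccs/2694_3482,+,9,7,33,788,asa04_5p--3p
m84043_231012_201426_s1/126490519/ccs/67_836,+,7,9,43,769,asa04_5p--3p
m84043_231012_201426_s1/60360594/ccs/4737_5506,+,22,51,48,769,asa04_5p--3p
"""


def load_polya_lengths(handle) -> dict:
    """Map read id -> polyA length from a refine-report-style CSV."""
    return {row["id"]: int(row["polyAlen"]) for row in csv.DictReader(handle)}


polya = load_polya_lengths(io.StringIO(SAMPLE))
# Assumption: a positive polyAlen means a tail was found before trimming.
with_tail = [read_id for read_id, length in polya.items() if length > 0]
print(len(polya), len(with_tail), median(polya.values()))  # prints: 3 3 43
```

For a real report, pass an open file handle instead of the `io.StringIO` wrapper, for example `open("T1.r1.mini.read.refine.report.csv", newline="")`.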

andrewprzh · 2024-05-09T09:30:25Z

I'll close this issue for now, as the original problem should now be solved in IsoQuant 3.4.

Implementing CCS headers is on the roadmap for the next release.

andrewprzh added the "weird results" label (Something looks odd in the resulting files) Dec 8, 2023

andrewprzh closed this as completed May 9, 2024
