BUG: Isoquant is missing 1 and 2-exon genes from PacBio RNA-seq data #128
Comments
Dear @sarahcalvo Based on my experience, mono-exonic and single-intron alignments are incorrect significantly more often than alignments with 3 or more exons. Thus, IsoQuant performs additional checks for these alignments; for example, it filters them out based on mapping quality or the presence of a polyA tail. I presume some of these filters may affect your results. Thank you for the data. I have a tight schedule at the moment, but hope to get my hands on them ASAP. Best |
Dear @andrewprzh, I know this is a super busy time of year and I've been trying various work-arounds. But I just wanted to let you know that when you have a chance to look into this (in the new year) -- I'm super eager to follow-up! Best, and happy holidays -- Sarah |
Dear @sarahcalvo The answer turned out to be simpler than I thought. IsoQuant is being extra careful about 1- and 2-exon alignments, as they can be false positives (typically for ONT data). Thus, IsoQuant requires them to have a polyA tail in order to be used for transcript construction. Your reads have no polyA tails, and thus 1- and 2-exon transcripts are entirely missed. At least for 2-exon transcripts, that can be easily fixed with an option. For monoexonic transcripts it might take some time, as the polyA position is essential for the detection. Anyway, I'll see what can be done and hopefully will improve that in the next release. Best |
Thanks so much Andrey! This raises some very interesting biological hypotheses too that we will look into, unless it's an artifact from the MAS-ISO-seq processing pipeline. I'll look into both options and let you know! Sarah
|
@sarahcalvo did you do some read cleaning/trimming before using IsoQuant? Because polyA is detected in exactly 0 reads. Best |
MAS-ISO-seq is a new experimental method where cDNA transcripts are concatenated together and then sequenced with PacBio. They have developed a software pipeline that processes the long concatenated consensus reads into the transcript reads. I hadn't realized none of the reads had polyA, but my guess is that the MAS-ISO-seq pipeline trims the polyA as part of its processing. I'll ask Brian Haas to confirm. Sarah
|
Yes, just confirmed that polyA is trimmed from the ends of the reads as part of the standard/official MAS-ISO-seq processing pipeline, and so shouldn't show up in any of the reads that get aligned to the genome. Sarah
|
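For anyone who wants to check their own reads before running IsoQuant, here is a rough sketch of a polyA sanity check. It is not IsoQuant's detector: the function names, the 16-base window, and the 75% threshold are all assumptions chosen for illustration. It looks for an A-rich 3' end, or a T-rich 5' end for reads sequenced from the opposite strand.

```python
def has_polya_tail(seq, window=16, min_fraction=0.75):
    """Heuristic check (illustrative only, not IsoQuant's algorithm).

    Returns True if the last `window` bases are mostly A, or the first
    `window` bases are mostly T (polyA seen from the reverse strand).
    """
    seq = seq.upper()
    if len(seq) < window:
        return False
    tail = seq[-window:]
    head = seq[:window]
    a_rich_tail = tail.count("A") / window >= min_fraction
    t_rich_head = head.count("T") / window >= min_fraction
    return a_rich_tail or t_rich_head


def count_polya_reads(seqs):
    """Count how many read sequences appear to carry a polyA tail."""
    return sum(1 for s in seqs if has_polya_tail(s))


# Hypothetical toy reads: untrimmed forward, untrimmed reverse, trimmed.
reads = [
    "ACGTACGTACGTACGTACGT" + "A" * 20,
    "T" * 20 + "ACGTACGTACGTACGTACGT",
    "ACGT" * 10,
]
print(count_polya_reads(reads), "of", len(reads), "reads look polyadenylated")
```

If this kind of count comes back as zero on real data, as it did here, the reads were most likely trimmed upstream.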
I know that the IsoSeq pipeline provides CCS headers that contain information about polyA tails detected in reads. I think I might implement support for them at some point if that would be useful. Meanwhile, I improved the reporting of novel mono-intronic transcripts and added new options that let the user tune polyA usage. This will come out in the next release. Monoexonic transcripts are still in question, as polyA positions are essential for clustering reads together. Best |
Thanks so much Andrey! I look forward to the next release.
Yes, it would definitely be helpful to have a version that supports IsoSeq
(with info from polyA tails in the CCS headers). The technology seems to
be working great.
Sarah
--
------------------------------------------
Sarah Calvo, Ph.D.
Sr. Computational Biologist
Broad Institute of MIT/Harvard
***@***.***
617-714-7687
-------------------------------------------
|
Dear @sarahcalvo Would it be possible for you to share just a few lines from any CCS header files, if you happen to have any? Best |
Hi Andrey,
Yes! Sorry for the delay. This file has the info for the region in the
mini bam I sent you before:
https://personal.broadinstitute.org/scalvo/for_isoquant_debugging/T1.r1.mini.read.refine.report.csv
Here are a few lines:
id,strand,fivelen,threelen,polyAlen,insertlen,primer
m84043_231012_201426_s1/160961840/ccs/2694_3482,+,9,7,33,788,asa04_5p--3p
m84043_231012_201426_s1/126490519/ccs/67_836,+,7,9,43,769,asa04_5p--3p
m84043_231012_201426_s1/60360594/ccs/4737_5506,+,22,51,48,769,asa04_5p--3p
m84043_231012_201426_s1/229184738/ccs/5806_6560,+,5,9,32,754,asa04_5p--3p
m84043_231012_201426_s1/187438050/ccs/1980_2399,+,1,9,58,419,asa04_5p--3p
m84043_231012_201426_s1/200545278/ccs/7144_7898,+,9,7,54,754,asa04_5p--3p
m84043_231012_201426_s1/146020224/ccs/35_789,+,9,7,26,754,asa04_5p--3p
m84043_231012_201426_s1/146085784/ccs/3011_3430,+,9,6,59,419,asa04_5p--3p
m84043_231012_201426_s1/163125161/ccs/2731_3045,+,7,9,35,314,asa04_5p--3p
m84043_231012_201426_s1/153490061/ccs/1383_2137,+,7,9,32,754,asa04_5p--3p
Sarah
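As a minimal sketch of how the polyA information in a report like this could be read back per read, the snippet below parses the CSV with Python's standard library. The column names come from the header line above; the function name `load_polya_lengths` is hypothetical, and nothing here reflects how IsoQuant itself consumes these files.

```python
import csv
import io

# First three data rows copied from the sample above.
SAMPLE = """id,strand,fivelen,threelen,polyAlen,insertlen,primer
m84043_231012_201426_s1/160961840/ccs/2694_3482,+,9,7,33,788,asa04_5p--3p
m84043_231012_201426_s1/126490519/ccs/67_836,+,7,9,43,769,asa04_5p--3p
m84043_231012_201426_s1/60360594/ccs/4737_5506,+,22,51,48,769,asa04_5p--3p
"""


def load_polya_lengths(handle):
    """Map read id -> trimmed polyA length from a refine-report CSV handle."""
    return {row["id"]: int(row["polyAlen"]) for row in csv.DictReader(handle)}


polya = load_polya_lengths(io.StringIO(SAMPLE))
# Reads whose polyA tail was detected (and trimmed) by the upstream pipeline.
with_tail = [read_id for read_id, length in polya.items() if length > 0]
print(len(with_tail), "of", len(polya), "reads had a polyA tail before trimming")
```

For a real file, pass an open file handle instead of the `io.StringIO` wrapper.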
|
I'll close this issue for now, as the original problem should now be solved in IsoQuant 3.4. Implementing CCS headers is on the roadmap for the next release. |
I'm working with Brian Haas to use IsoQuant to reconstruct genes in an amoeba species. This species has genes tightly packed together with almost no intergenic space, and genes typically have many introns.
IsoQuant is completely missing several obvious 1- and 2-exon genes, whether run with default parameters or with any of the following: --report_novel_unspliced "true"; --model_construction_strategy "sensitive_pacbio"; --fl_data
I created a tiny BAM file covering a single region with 8 genes (~900 reads total), of which IsoQuant only calls transcripts for 6 (using any of the above flags).
Here are some example files in directory https://personal.broadinstitute.org/scalvo/for_isoquant_debugging/
Any suggestions?
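For reference, the runs described above can be written out roughly as follows. This is a sketch that only builds and prints the command lines. The three extra flags are the ones listed in this report; the file names (`region.bam`, `genome.fa`), the output directories, and the choice of `--data_type pacbio_ccs` are assumptions and not the exact invocation used.

```python
import shlex


def isoquant_command(bam, reference, out_dir, extra):
    """Build an IsoQuant argument list (paths here are placeholders)."""
    base = [
        "isoquant.py",
        "--reference", reference,
        "--bam", bam,
        "--data_type", "pacbio_ccs",  # assumed data type for these reads
        "-o", out_dir,
    ]
    return base + list(extra)


# One entry per run described in the report, plus the default run.
VARIANTS = {
    "default": [],
    "novel_unspliced": ["--report_novel_unspliced", "true"],
    "sensitive_pacbio": ["--model_construction_strategy", "sensitive_pacbio"],
    "fl_data": ["--fl_data"],
}

for name, extra in VARIANTS.items():
    cmd = isoquant_command("region.bam", "genome.fa", f"out_{name}", extra)
    print(shlex.join(cmd))
```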