Understanding origin and significance of gene, transcript, and isoform MSTRG results #50
Comments
Hi, thanks for raising this. Could you give an example of the inconsistency? There is one issue we are aware of and are looking into fixing: a high number of gene_ids are MSTRG IDs because StringTie assigns gene_ids as unique IDs for its internal processes (as mentioned in a few GitHub issues, such as gpertea/stringtie#179 (comment)), and we then propagate those IDs throughout the workflow. We can hopefully fix this by using the ref_gene_id attribute from the StringTie output. Are you seeing a different inconsistency? Also, could you share your full cmd?
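To illustrate the ref_gene_id idea mentioned above, here is a minimal sketch (not the workflow's actual code) that prefers ref_gene_id over the StringTie-assigned gene_id when both are present on a GTF record. The function names `parse_attributes` and `preferred_gene_id` are hypothetical, and the sketch assumes standard tab-separated GTF lines with `key "value";` attributes in the ninth column.

```python
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def preferred_gene_id(gtf_line):
    """Return ref_gene_id if the record has one, otherwise the gene_id.

    Assumption: records that match the reference annotation carry a
    ref_gene_id attribute; records without one keep their MSTRG gene_id.
    """
    cols = gtf_line.rstrip("\n").split("\t")
    attrs = parse_attributes(cols[8])
    return attrs.get("ref_gene_id", attrs.get("gene_id"))
```

Records without a ref_gene_id fall back to the MSTRG ID, so nothing is silently dropped.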
Attached are DE analysis results and my full cmd. You can see that reads align very well, but most genes and transcripts are MSTRG. In some files the MSTRG IDs are associated with annotated features, but in others they are left on their own. The trivial issue is propagating this association. The issue I'm more interested in is what gives rise to these MSTRG tags, whether there is any way to recover the genes with which these reads are most likely associated, and which aspects of that correspondence (or lack thereof) give rise to the MSTRG label. full cmd: 15 ./nextflow -log /work/caitken/cea_nextflow.log run epi2me-labs/wf-transcriptomes -with-trace output:
There are a few reasons StringTie uses the MSTRG prefix. MSTRG IDs mostly arise when a transcript has no genomic overlap with any feature in the reference annotation GTF/GFF file you are using, when it is a novel transcript in a known gene, or when it is a novel transcript in a cluster of genes. I will try to add a thorough explanation to the README/documentation, as well as reduce the number of MSTRG IDs as mentioned above.
That is consistent with how I understood their origin. But I still have two questions:
1. Based on the workflow output, it looks like the vast majority of my reads align to the reference genome, so I’m surprised that I’m ending up with so many MSTRG results. Is there any way to parse which of these come from (1) reads with no overlap, (2) novel transcripts in a known gene, or (3) novel transcripts in a cluster of genes?
2. For those two latter cases, is it possible to preserve information about which known genes/gene clusters these MSTRG results belong to?
I’m currently trying to compare these results to previous experiments using short-read (Illumina) sequencing approaches. It would be really helpful to have a way to group all the novel transcripts by the genes from which they originate, both to enable more direct comparison and to take advantage of the ability to identify these transcripts by nanopore sequencing while also having access to previous data to strengthen the conclusions one can make.
Colin Echeverría Aitken
Assistant Professor
Biology Department
Biochemistry Program
Vassar College
845.437.7430
… On Jan 2, 2024, at 7:31 AM, Sarah Griffiths ***@***.***> wrote:
There are a few reasons StringTie uses the MSTRG prefix. MSTRG IDs mostly arise when a transcript has no genomic overlap with any feature in the reference annotation GTF/GFF file you are using, when it is a novel transcript in a known gene, or when it is a novel transcript in a cluster of genes. I will try to add a thorough explanation to the README/documentation, as well as reduce the number of MSTRG IDs as mentioned above.
Are you seeing more MSTRG IDs than expected for gene IDs or for transcript IDs? I think I need to update the workflow so that it outputs the final StringTie GTF file that is used downstream. In the meantime, you can find it in the work folder of the merge_transcriptomes process, if you are able to access that. It will be useful for distinguishing where the MSTRG IDs originate. For reads with no overlap with annotated features in the reference, you would expect an MSTRG gene ID and no gene name in the final GTF. For a novel transcript in a known gene, you should see the gene name as the ID (but this is the part that needs some work, as I mentioned above). For novel transcripts in a cluster of genes, you would see an MSTRG gene ID with more than one gene name, and you will notice that the loci overlap. Which reference_genome and reference_annotation files are you using? If they are from a publicly available source, perhaps I could try them out with my data set.
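The three cases described above can be sketched as a small classifier over the final StringTie GTF. This is only a rough illustration of the heuristic, not part of the workflow: the function name `classify_mstrg` and the category labels are hypothetical, and it assumes the gene name appears as a `gene_name` attribute and that every MSTRG gene_id with exactly one gene name is a novel transcript in a known gene.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def classify_mstrg(gtf_lines):
    """Label each MSTRG gene_id by how many distinct gene names it carries.

    0 names  -> "no_overlap"   (no annotated feature overlaps the locus)
    1 name   -> "known_gene"   (novel transcript in a known gene)
    >1 names -> "gene_cluster" (novel transcript spanning several genes)
    These labels are illustrative assumptions, not StringTie terminology.
    """
    names = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(cols[8]))
        gene_id = attrs.get("gene_id", "")
        if not gene_id.startswith("MSTRG"):
            continue  # gene already carries a reference-style ID
        found = names[gene_id]  # registers the ID even when no name is seen
        if "gene_name" in attrs:
            found.add(attrs["gene_name"])

    labels = {}
    for gene_id, found in names.items():
        if not found:
            labels[gene_id] = "no_overlap"
        elif len(found) == 1:
            labels[gene_id] = "known_gene"
        else:
            labels[gene_id] = "gene_cluster"
    return labels
```

With the GTF from the merge_transcriptomes work folder, something like `classify_mstrg(open(path))` would give a per-gene label that can then be tallied or joined to the DE tables.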
Closing through lack of response and new issue continuing discussion.
Apologies for the lack of response … I was trying to get a grant renewal submitted with the current analysis. Can you point me to the current version of this discussion? Have the intended updates you mentioned in your previous response been implemented, or are they still to-dos?
Hi, sorry for the late response. They are still on the to-do list but will hopefully be implemented soon.
Ask away!
Hello,
I've run wf-transcriptomes on paired wild-type and mutant samples from budding yeast (cerevisiae). There is plenty of transcriptional heterogeneity in yeast, but available reference transcriptomes span only coding sequences. I have assembled de novo transcriptomes using the workflow and then performed DE analysis. In my output, I see many "MSTRG" results (from StringTie, I believe) for genes, transcripts, and transcript isoforms. My questions are as follows:
1. There does not seem to be a consistent association between gene IDs, transcript IDs, and transcript isoforms (I think these are feature IDs) in the output .tsv files. That is, many of these outputs list one of these values (for example, transcript IDs) without associating it with its parent gene ID, and so on. It's relatively straightforward to associate these myself, but would it be possible to specify a uniform output style that preserves these associations in all files?
2. I would love to understand better how these "MSTRG" results originate, as I see many, many more than I would expect, particularly for a well-annotated genome such as cerevisiae. I am using the NCBI/GenBank reference genome.
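For anyone wanting to do the association described in the first question by hand, here is a minimal sketch that builds a transcript-to-gene lookup from a GTF and adds a gene column to a tab-separated table. The helper names and the `transcript_id` column name are assumptions for illustration; the actual column headers in the workflow's .tsv outputs may differ.

```python
import csv
import io
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene(gtf_lines):
    """Map each transcript_id to its parent gene_id using transcript records."""
    mapping = {}
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


def add_gene_column(tsv_text, mapping, id_column="transcript_id"):
    """Return the table rows as dicts with an extra gene_id field.

    id_column is a hypothetical header name; transcripts missing from
    the mapping are marked "NA" rather than dropped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        row["gene_id"] = mapping.get(row[id_column], "NA")
        rows.append(row)
    return rows
```

Keeping unmatched transcripts as "NA" makes it easy to see which rows lack a parent gene in the GTF.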