Understanding origin and significance of gene, transcript, and isoform MSTRG results #50

cea295933 · 2023-12-14T15:02:37Z

Ask away!

Hello,

I've run wf-transcriptomes on paired wild-type and mutant samples from budding yeast (cerevisiae). There is plenty of transcriptional heterogeneity in yeast, but the available reference transcriptomes span only coding sequences. I have assembled de novo transcriptomes using the workflow and then performed DE analysis. In my output, I see many "MSTRG" results (from StringTie, I believe) for genes, transcripts, and transcript isoforms. My questions are as follows:

1. There does not seem to be a consistent association between gene IDs, transcript IDs, and transcript isoforms (I think these are feature IDs) in the output .tsv files. That is, many of these outputs list one of these values (for example, transcript IDs) without associating them with their parent gene IDs, and so on. It's relatively straightforward to associate these myself, but would it be possible to specify a uniform output style that preserves these associations in all files?
2. I would love to understand better how these "MSTRG" results originate, as I see many more than I would expect, particularly for a genome as well annotated as cerevisiae. I am using the NCBI/GenBank reference genome.

sarahjeeeze · 2023-12-15T16:16:00Z

Hi, thanks for raising this. Could you give an example of the inconsistency? There is an issue we are aware of and are looking into fixing: a high number of gene_ids are MSTRG IDs because StringTie assigns each gene_id as a unique ID for its internal processes (as mentioned in a few GitHub issues, such as gpertea/stringtie#179 (comment)), and we then propagate these throughout the workflow. We can hopefully fix this by using the ref_gene_id attribute from the StringTie output. Are you seeing a different inconsistency? Also, could you share your full command?
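To illustrate the ref_gene_id idea, here is a minimal sketch rather than the workflow's actual code. It assumes the StringTie GTF has nine tab-separated columns with `key "value";` attributes, and that a transcript line carries a `ref_gene_id` attribute when it matches an annotated gene. The function name, the helper, and the sample rows (`GENE_A`, `gene-A`, `tx-A1`) are made up for the example.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def gtf_row(attributes):
    """Build one tab-separated GTF transcript line (coordinates are placeholders)."""
    return "\t".join(
        ["chrI", "StringTie", "transcript", "100", "900", "1000", "+", ".", attributes]
    )


def transcript_to_gene(gtf_lines):
    """Map each transcript_id to ref_gene_id when present, else to the StringTie gene_id."""
    mapping = {}
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip comments, malformed lines, and non-transcript features (e.g. exons).
        if line.startswith("#") or len(cols) != 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        transcript_id = attrs.get("transcript_id")
        if transcript_id is None:
            continue
        # Prefer the reference gene ID; fall back to the MSTRG gene_id.
        mapping[transcript_id] = attrs.get("ref_gene_id", attrs.get("gene_id"))
    return mapping


# Hypothetical example rows, not taken from any real output.
EXAMPLE_LINES = [
    gtf_row('gene_id "MSTRG.5"; transcript_id "tx-A1"; gene_name "GENE_A"; ref_gene_id "gene-A";'),
    gtf_row('gene_id "MSTRG.6"; transcript_id "MSTRG.6.1";'),
]

if __name__ == "__main__":
    for transcript_id, gene in transcript_to_gene(EXAMPLE_LINES).items():
        print(transcript_id, gene, sep="\t")
```

A lookup like this could then be joined onto any output table that lists transcript IDs without their parent gene.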

cea295933 · 2023-12-20T15:13:28Z

Attached are DE analysis results and my full cmd. You can see that reads align very well but most genes and transcripts are MSTRG. In some files, the MSTRG are associated with annotated features, but in others they're left on their own. The trivial issue is propagating this association. The issue I'm more interested in is what gives rise to these MSTRG tags, and whether there is any way to recover the genes to which these reads are most likely associated and what aspects of that correspondence/lack thereof gives rise to the MSTRG.

full cmd:

./nextflow -log /work/caitken/cea_nextflow.log run epi2me-labs/wf-transcriptomes -with-trace \
  --fastq /work/caitken/data/DegronNanoporeSequencing/Total/ \
  --ref_genome /work/caitken/data/DegronNanoporeSequencing/NCBI_S288C_R64/GenBank/GCA_000146045.2_R64_genomic.fna \
  --ref_annotation /work/caitken/data/DegronNanoporeSequencing/NCBI_S288C_R64/GenBank/genomic.gff \
  --pychopper_opts '-k PCS111' \
  --de_analysis \
  --sample_sheet /work/caitken/BarcodesTotal.csv \
  --out_dir /work/caitken/data/DegronNanoporeSequencing/outputTotal_112223full \
  -c /work/caitken/data/DegronNanoporeSequencing/my_config.cfg

output:

DEresults.zip

sarahjeeeze · 2024-01-02T12:31:25Z

There are a few reasons StringTie uses the MSTRG prefix. MSTRG IDs mostly arise when a transcript has no genomic overlap with any feature in the reference annotation GTF/GFF file you are using, when it is a novel transcript in a known gene, or when it is a novel transcript in a cluster of genes. I will try to add a thorough explanation to the README/documentation, as well as reducing the number present, as mentioned above.

cea295933 · 2024-01-02T13:50:04Z

That is consistent with how I understood their origin. But I still have two questions:

1. Based on the workflow output, it looks like the vast majority of my reads align to the reference genome, so I'm surprised that I'm ending up with so many MSTRG results. Is there any way to parse which of these come from (1) reads with no overlap, (2) novel transcripts in a known gene, or (3) novel transcripts in a cluster of genes?
2. For the latter two cases, is it possible to preserve information about which known genes/gene clusters these MSTRG results belong to? I'm currently trying to compare these results to previous experiments using short-read (Illumina) sequencing approaches. It would be really helpful to have a way to group all the novel transcripts by the genes from which they originate, both to enable more direct comparison and to take advantage of the ability to identify these by nanopore sequencing while also having access to previous data to strengthen the conclusions one can make.

Colin Echeverría Aitken
Assistant Professor, Biology Department, Biochemistry Program, Vassar College


sarahjeeeze · 2024-01-05T16:47:26Z

Are you seeing more MSTRG IDs than expected for gene IDs or for transcript IDs? I think I need to update the workflow so that it outputs the final StringTie GTF file that is used downstream. In the meantime, you can find it in the work folder of the merge_transcriptomes process, if you are able to access that. It will be useful for distinguishing where the MSTRG IDs originate.

For reads that do not overlap any feature in the reference annotation, you would expect an MSTRG gene ID and no gene name in the final GTF. For a novel transcript in a known gene, you should see the gene name as the ID (but this is the part that needs some work, based on what I mentioned above). For novel transcripts in a cluster of genes, you would see an MSTRG gene ID with more than one gene name, and you will notice that the loci overlap.
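The three cases above could be separated with a short script along these lines. This is a sketch under assumptions, not part of the workflow: it assumes the final GTF has nine tab-separated columns, that transcripts matching an annotated gene carry a `gene_name` attribute, and that counting distinct gene names per MSTRG gene ID is a fair proxy for the rule described. The function name, category labels, and sample rows are hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def gtf_row(attributes):
    """Build one tab-separated GTF transcript line (coordinates are placeholders)."""
    return "\t".join(
        ["chrI", "StringTie", "transcript", "100", "900", "1000", "+", ".", attributes]
    )


def classify_mstrg(gtf_lines):
    """Label each MSTRG gene_id by how many distinct gene names its transcripts carry."""
    names_by_gene = defaultdict(set)
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip comments, malformed lines, and non-transcript features (e.g. exons).
        if line.startswith("#") or len(cols) != 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        gene_id = attrs.get("gene_id", "")
        if not gene_id.startswith("MSTRG"):
            continue
        names = names_by_gene[gene_id]  # registers the gene even when it has no name
        if attrs.get("gene_name"):
            names.add(attrs["gene_name"])

    result = {}
    for gene_id, names in names_by_gene.items():
        if not names:
            label = "no_annotation_overlap"
        elif len(names) == 1:
            label = "known_gene"
        else:
            label = "gene_cluster"
        result[gene_id] = (label, sorted(names))
    return result


# Hypothetical example rows, not taken from any real output.
SAMPLE_GTF = [
    gtf_row('gene_id "MSTRG.1"; transcript_id "MSTRG.1.1";'),
    gtf_row('gene_id "MSTRG.2"; transcript_id "tx-A1"; gene_name "GENE_A";'),
    gtf_row('gene_id "MSTRG.2"; transcript_id "MSTRG.2.2";'),
    gtf_row('gene_id "MSTRG.3"; transcript_id "tx-B1"; gene_name "GENE_B";'),
    gtf_row('gene_id "MSTRG.3"; transcript_id "tx-C1"; gene_name "GENE_C";'),
]

if __name__ == "__main__":
    for gene_id, (label, names) in classify_mstrg(SAMPLE_GTF).items():
        print(gene_id, label, ",".join(names), sep="\t")
```

The sorted name list kept alongside each label is what would let novel transcripts be grouped by their gene of origin for comparison with earlier short-read results.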

Which reference_genome and reference_annotation files are you using? Is it from a publicly available source - perhaps I could try it out with my data set?

sarahjeeeze · 2024-02-16T09:18:53Z

Closing through lack of response and new issue continuing discussion.

cea295933 · 2024-02-16T12:22:02Z

Apologies for lack of response … trying to get grant renewal submitted with current analysis. Can you point me to the current version of this discussion? Have intended updates you mentioned in previous response been implemented or are they still to-do’s?

sarahjeeeze · 2024-05-08T10:00:48Z

Hi, sorry for late response. They are still in to-do's but hopefully will be implemented soon.

cea295933 added the question Further information is requested label Dec 14, 2023

pbuendia mentioned this issue Jan 30, 2024

Why did you remove the --transcriptome_source denovo option ? #63

Closed

sarahjeeeze closed this as completed Feb 16, 2024
