Understanding origin and significance of gene, transcript, and isoform MSTRG results #50

Closed
cea295933 opened this issue Dec 14, 2023 · 8 comments
Labels
question Further information is requested

Comments

@cea295933

Ask away!

Hello,

I've run wf-transcriptomes on paired wild-type and mutant samples from budding yeast (cerevisiae). There is plenty of transcriptional heterogeneity in yeast, but the available reference transcriptomes span only coding sequences. I have assembled de novo transcriptomes using the workflow and then performed DE analysis. In my output, I see many "MSTRG" results (from StringTie, I believe) for genes, transcripts, and transcript isoforms. My questions are as follows:

  1. There does not seem to be a consistent association between gene IDs, transcript IDs, and transcript isoforms (I think these are feature IDs) in the output .tsv files. That is, many of these outputs list one of these values (for example, transcript IDs) without associating them with their parent gene IDs, and so on. It's relatively straightforward to associate these myself, but would it be possible to specify a uniform output style that preserves these associations in all files?

  2. I would love to better understand how these "MSTRG" results originate, as I see many, many more than I would expect, particularly for a well-annotated genome such as cerevisiae. I am using the NCBI/GenBank reference genome.
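For the first point, one way to do the association by hand is sketched below. It is a minimal sketch, not something the workflow provides: it assumes a StringTie-style GTF whose `transcript` records carry `gene_id` and `transcript_id` attributes, and the helper names (`transcript_to_gene`, `add_gene_column`) and the example records are hypothetical.

```python
import re

# Matches GTF attribute pairs such as: gene_id "MSTRG.1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def transcript_to_gene(gtf_lines):
    """Map each transcript_id to its gene_id using 'transcript' records."""
    mapping = {}
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip comments, malformed lines, and non-transcript features.
        if line.startswith("#") or len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        if "transcript_id" in attrs and "gene_id" in attrs:
            mapping[attrs["transcript_id"]] = attrs["gene_id"]
    return mapping


def add_gene_column(rows, mapping, key="transcript_id"):
    """Return copies of rows (dicts) with a 'gene_id' column added."""
    return [dict(row, gene_id=mapping.get(row[key], "NA")) for row in rows]


# Hypothetical records, for illustration only.
example_gtf = [
    'chrI\tStringTie\ttranscript\t100\t900\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";',
    'chrI\tStringTie\texon\t100\t900\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "MSTRG.1.1";',
]
example_rows = [{"transcript_id": "MSTRG.1.1", "log2FC": "1.5"}]
annotated = add_gene_column(example_rows, transcript_to_gene(example_gtf))
```

In practice the rows would come from one of the output .tsv files (for example via `csv.DictReader` with a tab delimiter), and the column holding the transcript ID may be named differently.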

@cea295933 cea295933 added the question Further information is requested label Dec 14, 2023
@sarahjeeeze
Contributor

Hi, thanks for raising this. Could you give an example of the inconsistency? There is an issue we are aware of and are looking into fixing: a high number of gene_ids are MSTRGs because StringTie assigns gene_ids as unique IDs for its internal processes (as mentioned in a few git issues, such as gpertea/stringtie#179 (comment)), and we then propagate them throughout the workflow. We can hopefully fix this by using the ref_gene_id attribute from the StringTie output. Are you seeing a different inconsistency? Also, could you give an example of your full cmd?
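The ref_gene_id idea can be illustrated roughly as follows. This is only a sketch of the fallback logic, not the workflow's implementation: it assumes the attribute column of a StringTie GTF record carries `ref_gene_id` when the transcript matches a reference gene, and `preferred_gene_id` is a hypothetical helper name.

```python
import re

# Matches GTF attribute pairs such as: ref_gene_id "YFL039C";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def preferred_gene_id(attribute_field):
    """Return ref_gene_id in place of an MSTRG gene_id when one is present."""
    attrs = dict(ATTR_RE.findall(attribute_field))
    gene_id = attrs.get("gene_id", "")
    # Only replace StringTie's internal IDs; keep reference IDs untouched.
    if gene_id.startswith("MSTRG") and "ref_gene_id" in attrs:
        return attrs["ref_gene_id"]
    return gene_id
```

Records without a `ref_gene_id` keep their MSTRG identifier, so loci with no reference match remain distinguishable.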

@cea295933
Author

Attached are the DE analysis results and my full cmd. You can see that reads align very well, but most genes and transcripts are MSTRG. In some files the MSTRG IDs are associated with annotated features, but in others they are left on their own. The trivial issue is propagating this association. The issue I'm more interested in is what gives rise to these MSTRG tags, whether there is any way to recover the genes with which these reads are most likely associated, and what aspects of that correspondence (or lack thereof) give rise to the MSTRG.

full cmd:

./nextflow -log /work/caitken/cea_nextflow.log run epi2me-labs/wf-transcriptomes -with-trace \
  --fastq /work/caitken/data/DegronNanoporeSequencing/Total/ \
  --ref_genome /work/caitken/data/DegronNanoporeSequencing/NCBI_S288C_R64/GenBank/GCA_000146045.2_R64_genomic.fna \
  --ref_annotation /work/caitken/data/DegronNanoporeSequencing/NCBI_S288C_R64/GenBank/genomic.gff \
  --pychopper_opts '-k PCS111' \
  --de_analysis \
  --sample_sheet /work/caitken/BarcodesTotal.csv \
  --out_dir /work/caitken/data/DegronNanoporeSequencing/outputTotal_112223full \
  -c /work/caitken/data/DegronNanoporeSequencing/my_config.cfg

output:

DEresults.zip

@sarahjeeeze
Contributor

There are a few reasons the MSTRG prefix is used by StringTie. MSTRG IDs mostly originate when a transcript has no genomic overlap with any feature in the reference annotation GTF/GFF file you are using, when it is a novel transcript in a known gene, or when it is a novel transcript in a cluster of genes. I will try to add a thorough explanation to the README/documentation, as well as reduce the number present, as mentioned above.

@cea295933
Author

cea295933 commented Jan 2, 2024 via email

@sarahjeeeze
Contributor

Are you seeing more MSTRGs than expected for gene IDs or for transcript IDs? I think I need to update the workflow so that it outputs the final StringTie GTF file that is used downstream. In the meantime, you could find it in the work folder of the merge_transcriptomes process, if you are able to access that. It will be useful for distinguishing where the MSTRGs originate.

For reads that do not overlap any feature in the reference annotation, you would expect MSTRG and no gene name in the final GTF. For a novel transcript in a known gene, you should see the gene name as the ID (but this is the part that needs some work, based on what I mentioned above). To identify novel transcripts in a cluster of genes, look for an MSTRG gene ID with more than one gene name; you will notice that the loci overlap.
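The three cases above could be told apart with something like the sketch below. It is a rough illustration under stated assumptions, not part of the workflow: it assumes the final GTF has `transcript` records with `gene_id` and, where a reference gene matched, `gene_name`; the function name `classify_mstrg` and the category labels are hypothetical.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_name "ACT1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def classify_mstrg(gtf_lines):
    """Label each MSTRG gene_id by how many distinct gene names it carries."""
    names = defaultdict(set)
    for line in gtf_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        gene_id = attrs.get("gene_id", "")
        if not gene_id.startswith("MSTRG"):
            continue
        names[gene_id]  # register the ID even if it has no gene name
        if "gene_name" in attrs:
            names[gene_id].add(attrs["gene_name"])

    labels = {}
    for gene_id, gene_names in names.items():
        if not gene_names:
            labels[gene_id] = "no_annotated_overlap"
        elif len(gene_names) == 1:
            labels[gene_id] = "single_known_gene"
        else:
            labels[gene_id] = "gene_cluster"
    return labels
```

Counting the labels, for example with `collections.Counter(classify_mstrg(lines).values())`, gives a quick view of which case accounts for most of the MSTRG IDs.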

Which reference_genome and reference_annotation files are you using? Is it from a publicly available source - perhaps I could try it out with my data set?

@sarahjeeeze
Contributor

Closing due to lack of response; a new issue continues the discussion.

@cea295933
Author

Apologies for the lack of response … I have been trying to get a grant renewal submitted with the current analysis. Can you point me to the current version of this discussion? Have the intended updates you mentioned in your previous response been implemented, or are they still to-dos?

@sarahjeeeze
Contributor

Hi, sorry for the late response. They are still to-dos but will hopefully be implemented soon.
