-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generateEvents option returning empty .ioe files #178
Comments
Hi Arnav,
could you please send me the GTF here: ***@***.*** ?
In principle the format looks fine. But it was generated from a GFF, which
has a single ID in column 9.
So I am wondering whether your GTF contains multiple transcript IDs
associated to the same Gene ID.
Without that, SUPPA cannot calculate the AS events.
Thanks
Eduardo
…On Wed, 3 Jan 2024 at 14:26, Arnav Bharti ***@***.***> wrote:
For input:
1. I have got a GFF file from PlasmoDB. "PlasmoDB-66_PvivaxP01.gff"
2. I converted it to GTF file using agat agat_convert_sp_gff2gtf.pl
--gff ../PlasmoDB-66_PvivaxP01.gff -o PlasmoDB-66_PvivaxP01.gtf
3. suppa.py generateEvents -i ../PlasmoDB-66_PvivaxP01.gtf -o
AlternativeSplicing/localAS/local -f ioe -e {SE,SS,MX,RI,FL}
Output:
empty .ioe files
eg.
seqname gene_id event_id alternative_transcripts total_transcripts
GTF file:
PvP01_API_v2 VEuPathDB ncRNA_gene 9 63 . + . gene_id "PVP01_API00100"; ID "PVP01_API00100"; description "tRNA Threonine"; ebi_biotype "tRNA";
PvP01_API_v2 VEuPathDB tRNA 9 63 . + . gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "PVP01_API00100.1"; Parent "PVP01_API00100"; description "tRNA Threonine"; gene_ebi_biotype "tRNA";
PvP01_API_v2 VEuPathDB exon 9 63 . + . gene_id "PVP01_API00100"; transcript_id "PVP01_API00100.1"; ID "exon_PVP01_API00100.1-E1"; Parent "PVP01_API00100.1";
PvP01_API_v2 VEuPathDB protein_coding_gene 90 704 . + . gene_id "PVP01_API00200"; ID "PVP01_API00200"; Name "RPS4"; description "apicoplast ribosomal protein S4, putative"; ebi_biotype "protein_coding";
PvP01_API_v2 VEuPathDB mRNA 90 704 . + . gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1"; Parent "PVP01_API00200"; description "apicoplast ribosomal protein S4, putative"; gene_ebi_biotype "protein_coding";
PvP01_API_v2 VEuPathDB exon 90 704 . + . gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "exon_PVP01_API00200.1-E1"; Parent "PVP01_API00200.1";
PvP01_API_v2 VEuPathDB CDS 90 704 . + 0 gene_id "PVP01_API00200"; transcript_id "PVP01_API00200.1"; ID "PVP01_API00200.1-p1-CDS1"; Parent "PVP01_API00200.1"; protein_source_id "PVP01_API00200.1-p1";
PvP01_API_v2 VEuPathDB ncRNA_gene 716 787 . + . gene_id "PVP01_API00300"; ID "PVP01_API00300"; description "tRNA Histidine"; ebi_biotype "tRNA";
—
Reply to this email directly, view it on GitHub
<#178>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADCZKB4MYI243INRJNOFAHLYMTFU7AVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43ASLTON2WKOZSGA3DGMJZGI2TMOA>
.
You are receiving this because you are subscribed to this thread.Message
ID: ***@***.***>
|
I managed to get the GTF file (instead of converting from GFF). A few lines were like this: BUT the .ioe file generated is still empty.
|
Thanks for sending the GTF file.
The issue is what I suspected. The GTF does not contain
multiple transcripts per gene.
Using only the "exon" lines (used by SUPPA), I counted how many genes
"gene_id" contain more than 1 "transcript_id", but there are none:
awk '$3=="exon"' genomic_data.gtf | cut -f9 | cut -f1-4 -d" " | sort -u |
cut -f1,2 -d" " | sort | uniq -c | awk '$1>1' | wc -l
0
In these cases, you can use the option *-p* | *--pool-genes*
when running generateEvents
This option will first cluster the transcripts into genes according to exon
overlaps in the same strand.
It uses a simple definition of gene to be transcripts with exons that
overlap on the same strand and share at least one splice-site.
This will create new gene IDs based on the gene_id IDs that are overlapping
in genomic loci.
Please try this to see whether there are overlapping transcripts in the
annotated genomic loci.
I hope this helps
Best
Eduardo
…On Fri, 5 Jan 2024 at 03:04, Arnav Bharti ***@***.***> wrote:
I managed to get the GTF file (instead of converting from GFF).
A few lines were like this: NC_009911.1 RefSeq exon 441005 441169 . - .
gene_id "PVX_110843"; transcript_id "XR_003001228.1"; db_xref
"GeneID:5471288"; locus_tag "PVX_110843"; note "LSU 5.8S rRNA; O-type";
orig_transcript_id "gnl|WGS:AAKM|mrna.PVX_110843-RA"; product "5.8S
ribosomal RNA"; transcript_biotype "rRNA"; exon_number "1"; where the ; in
LSU 5.8S rRNA; O-type was split into two causing an IndexError while
parsing. I was wondering if replacing ; with ; (removing space) where it
occurs inside quotation marks would affect the output. 'Cause by doing this
IndexError goes away.
BUT the .ioe file generated is empty.
genomic_data.gtf.zip
<https://github.com/comprna/SUPPA/files/13832434/genomic_data.gtf.zip>
att_dict = dict(map(lambda x: (x[0], x[1].strip('"')), attributes)) <-
this line is causing the index error
—
Reply to this email directly, view it on GitHub
<#178 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADCZKB6VT4PGWOCB4PDTY43YM3HIZAVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZXGM2TCOBZHA>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
Even with pool genes flag the output is still empty.
|
Hi Arnav,
this means that the GTF annotation that you're using does not contain any
alternative transcripts in the annotated genes, or that the variation of
those transcripts cannot be expressed in terms of splicing events.
I checked, and there are no alternative transcripts; that is, the
annotation is 1-gene:1-transcript. SUPPA needs an annotation with
overlapping transcripts to calculate the events.
You could either enhance the annotation with novel transcripts using
RNA-seq reads with a tool like StringTie (or StringTie2 or FLAIR if you
have long reads), or directly use RNA-seq reads with e.g. rMATs or MAJIQ to
directly calculate alternative splicing events.
Another option is if Ensembl has other annotations that may be more
transcript rich (https://protists.ensembl.org/info/data/ftp/index.html)
I hope this helps
Eduardo
…On Fri, 5 Jan 2024 at 15:55, Arnav Bharti ***@***.***> wrote:
Even with pool genes flag the output is still empty.
$ suppa.py generateEvents -i ~/Downloads/genomic.gtf -o local -f ioe -e
{SE,SS,MX,RI,FL} --pool-genes
INFO:eventGenerator:Reading input data.
INFO:eventGenerator:Pooling genes
INFO:eventGenerator:Calculating events
INFO:eventGenerator:Done
$ cat local*.ioe > biglocal.ioe
$ cat biglocal.ioe
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
seqname gene_id event_id alternative_transcripts total_transcripts
—
Reply to this email directly, view it on GitHub
<#178 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ADCZKB6BPP6N7LEW5CVLSKTYM6BULAVCNFSM6AAAAABBK3FNF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZYGEZDGOBVG4>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
For input:
agat_convert_sp_gff2gtf.pl --gff ../PlasmoDB-66_PvivaxP01.gff -o PlasmoDB-66_PvivaxP01.gtf
suppa.py generateEvents -i ../PlasmoDB-66_PvivaxP01.gtf -o AlternativeSplicing/localAS/local -f ioe -e {SE,SS,MX,RI,FL}
Output:
empty .ioe files
eg.
GTF file:
The text was updated successfully, but these errors were encountered: