Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import of introns fails #181

Closed
Yu-jinKim opened this issue Mar 11, 2022 · 2 comments
Closed

Import of introns fails #181

Yu-jinKim opened this issue Mar 11, 2022 · 2 comments
Labels
v0.11 fixed in v0.11

Comments

@Yu-jinKim
Copy link

Hi,

I have been trying to pull out exons of a GFF but i'm getting an error that i don't understand.

Here is a head -14 of that GFF which is enough to get the error:

#OriSeqID=Chr1	Accession=GWHACDB00000001.1
GWHACDB00000001.1	.	gene	9738	14906	1	+	.	ID=Hc.01G000010;Accession=GWHGACDB000001.1;Augustus_transcriptSupport_percentage=100;Augustus_intronSupport=4/4;Source=augustus000002;Integrity=complete;;transl_table=1
GWHACDB00000001.1	.	mRNA	9738	14906	1	+	.	ID=Hc.01G000010.t1;Accession=GWHTACDB000001.1;Parent=Hc.01G000010;Parent_Accession=GWHGACDB000001.1;IntronSupport=4/4;Integrity=complete;;transl_table=1
GWHACDB00000001.1	.	exon	9738	11686	.	+	.	ID=Hc.01G000010.t1.exon1;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1	.	CDS	9738	11686	1	+	0	ID=Hc.01G000010.t1.CDS1;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;Protein_Accession=GWHPACDB000001.1;;transl_table=1
GWHACDB00000001.1	.	exon	11897	12671	.	+	.	ID=Hc.01G000010.t1.exon2;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1	.	CDS	11897	12671	1	+	1	ID=Hc.01G000010.t1.CDS2;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;Protein_Accession=GWHPACDB000001.1;;transl_table=1
GWHACDB00000001.1	.	exon	13090	13368	.	+	.	ID=Hc.01G000010.t1.exon3;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1	.	CDS	13090	13368	1	+	0	ID=Hc.01G000010.t1.CDS3;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;Protein_Accession=GWHPACDB000001.1;;transl_table=1
GWHACDB00000001.1	.	exon	13650	14396	.	+	.	ID=Hc.01G000010.t1.exon4;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1	.	CDS	13650	14396	1	+	0	ID=Hc.01G000010.t1.CDS4;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;Protein_Accession=GWHPACDB000001.1;;transl_table=1
GWHACDB00000001.1	.	exon	14601	14906	.	+	.	ID=Hc.01G000010.t1.exon5;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1	.	CDS	14601	14708	1	+	0	ID=Hc.01G000010.t1.CDS5;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;Protein_Accession=GWHPACDB000001.1;;transl_table=1
GWHACDB00000001.1	.	three_prime_UTR	14709	14906	.	+	.	ID=Hc.01G000010.t1.3UTR1;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1

Using this:

import gffutils

try:
    db = gffutils.create_db("test.gff", "gff.sqlite", verbose=True)
except Exception as e:
    db = gffutils.interface.FeatureDB("gff.sqlite")

db.update(db.create_introns(), verbose=True)

I get this

2022-03-11 14:24:50,576 - INFO - Populating features
Traceback (most recent call last):t-order relations: 0 features
  File "/home/kimy/NHS/envs/daz/lib/python3.9/site-packages/gffutils/create.py", line 589, in _populate_from_lines
    self._insert(f, c)
  File "/home/kimy/NHS/envs/daz/lib/python3.9/site-packages/gffutils/create.py", line 530, in _insert
    cursor.execute(constants._INSERT, feature.astuple())
sqlite3.IntegrityError: UNIQUE constraint failed: features.id

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/kimy/daz/parse_gff.py", line 11, in <module>
    db.update(db.create_introns(), verbose=True)
  File "/home/kimy/NHS/envs/daz/lib/python3.9/site-packages/gffutils/interface.py", line 938, in update
    db._populate_from_lines(data)
  File "/home/kimy/NHS/envs/daz/lib/python3.9/site-packages/gffutils/create.py", line 591, in _populate_from_lines
    fixed, final_strategy = self._do_merge(f, self.merge_strategy)
  File "/home/kimy/NHS/envs/daz/lib/python3.9/site-packages/gffutils/create.py", line 226, in _do_merge
    raise ValueError("Duplicate ID {0.id}".format(f))
ValueError: Duplicate ID Hc.01G000010.t1.exon1

Looking at the introns, they look fine:

import gffutils

try:
    db = gffutils.create_db("test.gff", "gff.sqlite", verbose=True)
except Exception as e:
    db = gffutils.interface.FeatureDB("gff.sqlite")


for intron in db.create_introns():
    print(intron)
GWHACDB00000001.1       gffutils_derived        intron  11687   11896   .       +       .       ID=Hc.01G000010.t1.exon1,Hc.01G000010.t1.exon2;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1       gffutils_derived        intron  12672   13089   .       +       .       ID=Hc.01G000010.t1.exon2,Hc.01G000010.t1.exon3;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1       gffutils_derived        intron  13369   13649   .       +       .       ID=Hc.01G000010.t1.exon3,Hc.01G000010.t1.exon4;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1
GWHACDB00000001.1       gffutils_derived        intron  14397   14600   .       +       .       ID=Hc.01G000010.t1.exon4,Hc.01G000010.t1.exon5;Parent=Hc.01G000010.t1;Parent_Accession=GWHTACDB000001.1;;transl_table=1

Doesn't the import support {exon1_id},{exon2_id} IDs? I have looked at the source code but i'm not sure why it fails.

daler added a commit that referenced this issue Jun 5, 2022
@daler daler mentioned this issue Jun 5, 2022
@daler
Copy link
Owner

daler commented Jun 5, 2022

Hmm. For all other attributes, a comma indicates multiple values for that attribute. I don't think it makes sense to have multiple values for the primary key, so this should raise an exception with a clearer description of the problem.

#193 does this.

In your case you can make it work by providing an id spec that will convert your list of IDs to a string, like this:

def intron_id(f):
    return ','.join(f['ID'])

db.update(introns, id_spec='intron': [intron_id]})

This test verifies it works with the GFF you provided.

@daler daler added the v0.11 fixed in v0.11 label Jun 5, 2022
daler added a commit that referenced this issue Jun 5, 2022
@daler
Copy link
Owner

daler commented Jun 5, 2022

Now addressed in v0.11.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v0.11 fixed in v0.11
Projects
None yet
Development

No branches or pull requests

2 participants