Genbank Parsing Problem? #52

Closed
mercutio22 opened this issue Mar 12, 2012 · 7 comments

Comments

@mercutio22

Hi Brad, AnnotationSketch is complaining about the parsed file again:

GenomeTools error: CDS feature on line 27 in file "../../mirna-django/src/scripts/tp53.gff3" has the wrong phase 0 (should be 1)

I don't know if the problem is with their GFF3 parser though. Can you tell me what you think?

http://paste.debian.net/159462/

@chapmanb
Owner

Hugo;
I think the phase is correct, but I'm happy to adjust it if the GenomeTools folks think otherwise. The GFF spec specifies the phase as 0, 1, or 2:

http://www.sequenceontology.org/gff3.shtml

while codon_start from the GenBank file is 1, 2 or 3:

http://www.ddbj.nig.ac.jp/FT/full_index.html#7.2

so I've made the adjustment from 1 to 0 in the GFF output when converting. Let me know if your interaction with the GenomeTools developers indicates I've missed something in the conversion.
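
A minimal sketch of the adjustment described above, assuming the GenBank codon_start qualifier arrives as an integer or a numeric string. The function name codon_start_to_phase is hypothetical and is not taken from the converter's source:

```python
def codon_start_to_phase(codon_start):
    """Convert a GenBank codon_start (1, 2 or 3) to a GFF3 phase (0, 1 or 2)."""
    value = int(codon_start)
    if value not in (1, 2, 3):
        raise ValueError("codon_start must be 1, 2 or 3, got %r" % (codon_start,))
    # codon_start is the 1-based offset of the first complete codon;
    # phase is the number of bases to skip to reach it.
    return value - 1


if __name__ == "__main__":
    for cs in ("1", "2", "3"):
        print(cs, "->", codon_start_to_phase(cs))
```

Rejecting values outside 1 to 3 is a design choice of this sketch, so a malformed qualifier fails loudly instead of producing an invalid phase.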

@mercutio22
Author

Thanks Brad. I will contact them and will let you know asap.

 .''.      Hugo A. M. Torres : :' : . '   “Talk is cheap,  -    show me the code. ”  -- L. Torvalds.


@mercutio22
Author

Hi Brad, perhaps this might be useful for testing your program:
http://modencode.oicr.on.ca/cgi-bin/validate_gff3_online

I tried it, and the tool pointed out, for instance, that the produced GFF3
file has a "source" field. IIRC, Peter Cock says in one of his blog posts
that GenBank has those but GFF3 does not.

Here is a sample report:

GFF3 File Validation Report

ontology_file(s):

http://song.cvs.sourceforge.net/*checkout*/song/ontology/so.obo

generated: 12-Mar-12 15:27:10

###############################################################################

THIS FILE HAS NOT BEEN VALIDATED, IT CONTAINS ERRORS, PLEASE REVIEW REPORT!

(NO WARNINGS HAVE BEEN ISSUED FOR THIS FILE)

###############################################################################

###############################################################################

THIS FILE HAS BEEN PROCESSED ENTIRELY AND ALL ERRORS/WARNINGS ARE REPORTED!

###############################################################################

First 10 lines of the analyzed GFF3 file follows:

[line 1]> ##gff-version 3
[line 2]> ##sequence-region NG_017013.1 1 26144
[line 3]> NG_017013.1 annotation remark 1 26144 .
[line 3]> . . comment=REVIEWED%20REFSEQ%3A%20This%20record%20has%20been%20curated%20by%20NCBI%20staff%20in%0Acollaboration%20with%20Graham%20Taylor.%20The%20reference%20sequence%20was%0Aderived%20from%20AC087388.9%20and%20AC007421.13.%0AThis%20sequence%20is%20a%20reference%20standard%20in%20the%20RefSeqGene%20project.%0APublication%20Note%3A%20%20This%20RefSeq%20record%20includes%20a%20subset%20of%20the%0Apublications%20that%20are%20available%20for%20this%20gene.%20Please%20see%20the%20Gene%0Arecord%20to%20access%20additional%20publications.%0ASummary%3A%20This%20gene%20encodes%20tumor%20protein%20p53%2C%20which%20responds%20to%0Adiverse%20cellular%20stresses%20to%20regulate%20target%20genes%20that%20induce%20cell%0Acycle%20arrest%2C%20apoptosis%2C%20senescence%2C%20DNA%20repair%2C%20or%20changes%20in%0Ametabolism.%20p53%20protein%20is%20expressed%20at%20low%20level%20in%20normal%20cells%0Aand%20at%20a%20high%20level%20in%20a%20variety%20of%20transformed%20cell%20lines%2C%20where%0Ait%27s%20believed%20to%20contribute%20to%20transformation%20and%20malignancy.%20p53%0Ais%20a%20DNA-binding%20protein%20containing%20transcription%20activation%2C%0ADNA-binding%2C%20and%20oligomerization%20domains.%20It%20is%20postulated%20to%20bind%0Ato%20a%20p53-binding%20site%20and%20activate%20expression%20of%20downstream%20genes%0Athat%20inhibit%20growth%20and/or%20invasion%2C%20and%20thus%20function%20as%20a%20tumor%0Asuppressor.%20Mutants%20of%20p53%20that%20frequently%20occur%20in%20a%20number%20of%0Adifferent%20human%20cancers%20fail%20to%20bind%20the%20consensus%20DNA%20binding%0Asite%2C%20and%20hence%20cause%20the%20loss%20of%20tumor%20suppressor%20activity.%0AAlterations%20of%20this%20gene%20occur%20not%20only%20as%20somatic%20mutations%20in%0Ahuman%20malignancies%2C%20but%20also%20as%20germline%20mutations%20in%20some%0Acancer-prone%20families%20with%20Li-Fraumeni%20syndrome.%20Multiple%20p53%0Avariants%20due%20to%20alternative%20promoters%20and%20multiple%20alternative%0Aspli
cing%20have%20been%20found.%20These%20variants%20encode%20distinct%20isoforms%2C%0Awhich%20can%20regulate%20p53%20transcriptional%20activity.%20%5Bprovided%20by%0ARefSeq%2C%20Jul%202008%5D.;
[line 3]> sequence_version=1;source=Homo%20sapiens%20%28human%29;
[line 3]> taxonomy=Eukaryota,Metazoa,Chordata,
[line 3]> Craniata,Vertebrata,Euteleostomi,
[line 3]> Mammalia,Eutheria,Euarchontoglires,
[line 3]> Primates,Haplorrhini,Catarrhini,
[line 3]> Hominidae,Homo;keywords=RefSeqGene;
[line 3]> references=location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Marcel%2CV.%2C%20Tran%2CP.L.%2C%20Sagne%2CC.%2C%20Martel-Planche%2CG.%2C%20Vaslin%2CL.%2C%20Teulade-Fichou%2CM.P.%2C%20Hall%2CJ.%2C%20Mergny%2CJ.L.%2C%20Hainaut%2CP.%20and%20Van%20Dyck%2CE.%0Atitle%3A%20G-quadruplex%20structures%20in%20TP53%20intron%203%3A%20role%20in%20alternative%20splicing%20and%20in%20production%20of%20p53%20mRNA%20isoforms%0Ajournal%3A%20Carcinogenesis%2032%20%283%29%2C%20271-278%20%282011%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%2021112961%0Acomment%3A,
[line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Naidu%2CS.R.%2C%20Love%2CI.M.%2C%20Imbalzano%2CA.N.%2C%20Grossman%2CS.R.%20and%20Androphy%2CE.J.%0Atitle%3A%20The%20SWI/SNF%20chromatin%20remodeling%20subunit%20BRG1%20is%20a%20critical%20regulator%20of%20p53%20necessary%20for%20proliferation%20of%20malignant%20cells%0Ajournal%3A%20Oncogene%2028%20%2827%29%2C%202492-2501%20%282009%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%2019448667%0Acomment%3A,
[line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Anczukow%2CO.%2C%20Ware%2CM.D.%2C%20Buisson%2CM.%2C%20Zetoune%2CA.B.%2C%20Stoppa-Lyonnet%2CD.%2C%20Sinilnikova%2CO.M.%20and%20Mazoyer%2CS.%0Atitle%3A%20Does%20the%20nonsense-mediated%20mRNA%20decay%20mechanism%20prevent%20the%20synthesis%20of%20truncated%20BRCA1%2C%20CHK2%2C%20and%20p53%20proteins%3F%0Ajournal%3A%20Hum.%20Mutat.%2029%20%281%29%2C%2065-73%20%282008%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%2017694537%0Acomment%3A,
[line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Bourdon%2CJ.C.%0Atitle%3A%20p53%20Family%20isoforms%0Ajournal%3A%20Curr%20Pharm%20Biotechnol%208%20%286%29%2C%20332-336%20%282007%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%2018289041%0Acomment%3A%20Review%20article,
[line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Murray-Zmijewski%2CF.%2C%20Lane%2CD.P.%20and%20Bourdon%2CJ.C.%0Atitle%3A%20p53/p63/p73%20isoforms%3A%20an%20orchestra%20of%20isoforms%20to%20harmonise%20cell%20differentiation%20and%20response%20to%20stress%0Ajournal%3A%20Cell%20Death%20Differ.%2013%20%286%29%2C%20962-972%20%282006%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%2016601753%0Acomment%3A%20Review%20article,
[line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Flaman%2CJ.M.%2C%20Waridel%2CF.%2C%20Estreicher%2CA.%2C%20Vannier%2CA.%2C%20Limacher%2CJ.M.%2C%20Gilbert%2CD.%2C%20Iggo%2CR.%20and%20Frebourg%2CT.%0Atitle%3A%20The%20human%20tumour%20suppressor%20gene%20p53%20is%20alternatively%20spliced%20in%20normal%20cells%0Ajournal%3A%20Oncogene%2012%20%284%29%2C%20813-818%20%281996%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%208632903%0Acomment%3A,
[line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Lamb%2CP.%20and%20Crawford%2CL.%0Atitle%3A%20Characterization%20of%20the%20human%20p53%20gene%0Ajournal%3A%20Mol.%20Cell.%20Biol.%206%20%285%29%2C%201379-1385%20%281986%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%202946935%0Acomment%3A,
[line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Harlow%2CE.%2C%20Williamson%2CN.M.%2C%20Ralston%2CR.%2C%20Helfman%2CD.M.%20and%20Adams%2CT.E.%0Atitle%3A%20Molecular%20cloning%20and%20in%20vitro%20expression%20of%20a%20cDNA%20clone%20for%20human%20cellular%20tumor%20antigen%20p53%0Ajournal%3A%20Mol.%20Cell.%20Biol.%205%20%287%29%2C%201601-1610%20%281985%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%203894933%0Acomment%3A,
[line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Zakut-Houri%2CR.%2C%20Bienz-Tadmor%2CB.%2C%20Givol%2CD.%20and%20Oren%2CM.%0Atitle%3A%20Human%20p53%20cellular%20tumor%20antigen%3A%20cDNA%20sequence%20and%20expression%20in%20COS%20cells%0Ajournal%3A%20EMBO%20J.%204%20%285%29%2C%201251-1255%20%281985%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%204006916%0Acomment%3A,
[line 3]> location%3A%20%5B0%3A26144%5D%0Aauthors%3A%20Matlashewski%2CG.%2C%20Lamb%2CP.%2C%20Pim%2CD.%2C%20Peacock%2CJ.%2C%20Crawford%2CL.%20and%20Benchimol%2CS.%0Atitle%3A%20Isolation%20and%20characterization%20of%20a%20human%20p53%20cDNA%20clone%3A%20expression%20of%20the%20human%20p53%20gene%0Ajournal%3A%20EMBO%20J.%203%20%2813%29%2C%203257-3262%20%281984%29%0Amedline%20id%3A%20%0Apubmed%20id%3A%206396087%0Acomment%3A;
[line 3]> accessions=NG_017013;data_file_division=PRI;
[line 3]> date=19-FEB-2012;organism=Homo%20sapiens;
[line 3]> gi=293651587
[line 4]> NG_017013.1 feature source 1 26144 . + .
[line 4]> db_xref=taxon%3A9606;mol_type=genomic%20DNA;
[line 4]> organism=Homo%20sapiens;chromosome=17;
[line 4]> map=17p13.1
[line 5]> NG_017013.1 feature gene 1 6475 . - .
[line 5]> note=WD%20repeat%20containing%2C%20antisense%20to%20TP53;
[line 5]> db_xref=GeneID%3A55135,HGNC%3A25522,
[line 5]> MIM%3A612661;gene=WRAP53;gene_synonym=DKCB3%3B%20TCAB1%3B%20WDR79
[line 6]> NG_017013.1 feature mRNA 2845 6475 . - .
[line 6]> db_xref=GI%3A221136857,GeneID%3A55135,
[line 6]> HGNC%3A25522,MIM%3A612661;product=WD%20repeat%20containing%2C%20antisense%20to%20TP53%2C%20transcript%20variant%202;
[line 6]> transcript_id=NM_001143990.1;inference=similar%20to%20RNA%20sequence%2C%20mRNA%20%28same%20species%29%3ARefSeq%3ANM_001143990.1;
[line 6]> exception=annotated%20by%20transcript%20or%20proteomic%20data;
[line 6]> gene=WRAP53;gene_synonym=DKCB3%3B%20TCAB1%3B%20WDR79;
[line 6]> ID=NM_001143990.1
[line 7]> NG_017013.1 feature mRNA 2845 2956 . - .
[line 7]> Parent=NM_001143990.1
[line 8]> NG_017013.1 feature mRNA 3224 3322 . - .
[line 8]> Parent=NM_001143990.1
[line 9]> NG_017013.1 feature mRNA 3467 3898 . - .
[line 9]> Parent=NM_001143990.1
[line 10]> NG_017013.1 feature mRNA 6322 6475 . - .
[line 10]> Parent=NM_001143990.1

...

Line Number Error/Warning


4 [ERROR] invalid type (type: source)
7 [ERROR] invalid type pair - check all parents (at line 6; mRNA to mRNA)
12 [ERROR] invalid type pair - check all parents (at line 11; mRNA to mRNA)
17 [ERROR] invalid type pair - check all parents (at line 16; mRNA to mRNA)
22 [ERROR] invalid type pair - check all parents (at line 21; mRNA to mRNA)
26 [ERROR] invalid type pair - check all parents (at line 25; CDS to CDS)
30 [ERROR] invalid type pair - check all parents (at line 29; CDS to CDS)
34 [ERROR] invalid type pair - check all parents (at line 33; CDS to CDS)
38 [ERROR] invalid type pair - check all parents (at line 37; CDS to CDS)
44 [ERROR] invalid type pair - check all parents (at line 43; mRNA to mRNA)
56 [ERROR] invalid type pair - check all parents (at line 55; mRNA to mRNA)
69 [ERROR] invalid type pair - check all parents (at line 68; mRNA to mRNA)
82 [ERROR] invalid type pair - check all parents (at line 81; mRNA to mRNA)
94 [ERROR] invalid type pair - check all parents (at line 93; mRNA to mRNA)
113 [ERROR] invalid type pair - check all parents (at line 112; CDS to CDS)
124 [ERROR] invalid type pair - check all parents (at line 123; CDS to CDS)
135 [ERROR] invalid type pair - check all parents (at line 134; CDS to CDS)
145 [ERROR] invalid type pair - check all parents (at line 144; CDS to CDS)
162 [ERROR] invalid type pair - check all parents (at line 161; CDS to CDS)
171 [ERROR] invalid type pair - check all parents (at line 170; mRNA to mRNA)
180 [ERROR] invalid type pair - check all parents (at line 179; mRNA to mRNA)
189 [ERROR] invalid type pair - check all parents (at line 188; mRNA to mRNA)
206 [ERROR] invalid type pair - check all parents (at line 205; CDS to CDS)
214 [ERROR] invalid type pair - check all parents (at line 213; CDS to CDS)
221 [ERROR] invalid type pair - check all parents (at line 220; CDS to CDS)


@chapmanb
Owner

Hugo;
Thanks for this. The validator is complaining about 'source' not being present in the Sequence Ontology. Mapping GenBank to SO is a fairly large problem. I tried to tackle this a few years back but it ended up being too much work. Here's the progress I made:

http://bcbio.wordpress.com/2008/12/14/standard-ontologies-in-biosql/

In practice, most tools do not enforce this requirement. Since I was unable to map the entire thing, I took the approach of keeping the output GFF similar to the input GenBank. If you want to take on a mapping of GenBank to Sequence Ontology, I'd be happy to incorporate it.

Is GenomeTools requiring the ontology matches, or just that online validator?
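
As a rough illustration of what a partial GenBank-to-SO mapping could look like, here is a small sketch. The table GENBANK_TO_SO and the helper to_so_type are hypothetical, cover only the feature keys that appear in the validation report, and are not part of the converter. Mapping "source" to the SO term "region" and renaming mRNA sub-parts to "exon" are assumptions made for illustration:

```python
# Hypothetical, deliberately incomplete mapping from GenBank feature keys
# to Sequence Ontology terms. "source" -> "region" is an assumption.
GENBANK_TO_SO = {
    "source": "region",
    "gene": "gene",
    "mRNA": "mRNA",
    "CDS": "CDS",
}


def to_so_type(feature_key, parent_key=None):
    """Return an SO term for a GenBank feature key, given its parent's key."""
    # Assumption: a sub-part of a spliced mRNA written under an mRNA parent
    # is an exon, which avoids an "mRNA to mRNA" parent/child pair.
    if feature_key == "mRNA" and parent_key == "mRNA":
        return "exon"
    if feature_key not in GENBANK_TO_SO:
        raise KeyError("no SO mapping defined for GenBank key %r" % (feature_key,))
    return GENBANK_TO_SO[feature_key]


if __name__ == "__main__":
    print(to_so_type("source"))
    print(to_so_type("mRNA", parent_key="mRNA"))
```

Unmapped keys raise an error here so that gaps in the table are visible; a real converter might instead pass them through unchanged, as the current output does.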

@mercutio22
Author

Hi Brad,

Is GenomeTools requiring the ontology matches, or just that online validator?


Hmm, it seems to be only the validator. GenomeTools seems to be
complaining only about that "phase" field.

I have already posted your considerations on their issue tracker. I
will let you know what they say when I get a reply. In any case,
thanks for taking the time to look at my problem.
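
One possible reading of the "wrong phase 0 (should be 1)" message, which is an assumption and not confirmed anywhere in this thread, is that the CDS is split over several segments. In the GFF3 spec, the phase is the number of bases to remove from the start of a segment to reach the next codon, so it depends on the lengths of the preceding segments. A minimal sketch of that calculation, with a hypothetical function name and made-up coordinates:

```python
def segment_phases(segments, strand=1, first_phase=0):
    """Return the GFF3 phase of each CDS segment, in input order.

    segments: list of (start, end) tuples, 1-based inclusive coordinates.
    strand: 1 for the plus strand, -1 for the minus strand.
    first_phase: phase of the first translated segment (codon_start - 1).
    """
    # Translation runs from low to high coordinates on the plus strand
    # and from high to low on the minus strand.
    order = sorted(range(len(segments)),
                   key=lambda i: segments[i][0],
                   reverse=(strand == -1))
    phases = [None] * len(segments)
    phase = first_phase
    for i in order:
        start, end = segments[i]
        length = end - start + 1
        phases[i] = phase
        # Bases left over after the last complete codon determine how many
        # bases the next segment must contribute to finish that codon.
        phase = (3 - (length - phase) % 3) % 3
    return phases


if __name__ == "__main__":
    # Made-up coordinates, not taken from tp53.gff3.
    print(segment_phases([(1, 10), (21, 30), (41, 52)], strand=1))
    print(segment_phases([(1, 10), (21, 30), (41, 52)], strand=-1))
```

With these made-up segments, only the first translated segment keeps the phase derived from codon_start; later segments get a phase computed from the running length.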

@chapmanb
Owner

Thanks Hugo -- let me know if there ends up being anything I can change on my end to improve the phase information. Hopefully that'll do it and get things working smoothly with GenomeTools. Thanks for your patience with this.

@chapmanb
Owner

Hugo;
I'm going to close this to clean up the issues. Hopefully everything was solved on the GenomeTools side. Thanks
