F-P GAF has redundant annotations #576

Closed
ValWood opened this issue Mar 22, 2018 · 13 comments

Comments

@ValWood
Contributor

ValWood commented Mar 22, 2018

There is a lot of redundancy with existing annotations in the inferred GAF.

Here are some examples:

1. thi1: 2 redundant annotations

i) redundant and identical (we made this annotation from this paper)
GO:0045944 | positive regulation of transcription by RNA polymerase II | IMP | Tang CS et al. (1994)

ii) redundant and less specific
GO:0006366 | transcription by RNA polymerase II | IMP | Tang CS et al. (1994)

2. vht1: 3 redundant, less specific parent terms

GO:0042886 | amide transport | IMP | Stolz J (2003)
GO:1905039 | carboxylic acid transmembrane transport | IMP | Stolz J (2003)
GO:0015718 | monocarboxylic acid transport | IMP | Stolz J (2003)

3. crm1: exact duplicate

GO:0006611 | protein export from nucleus | IMP | Takeda K et al. (2010)

So can we filter:
i) exact duplicates
ii) annotations less specific than the existing annotations

Non-experimental annotations are already filtered by our pipeline, but we have a rule not to filter any experimental annotation.

Having duplicated annotations isn't a showstopper, but it isn't useful, and it looks sloppy and confusing to users.
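
For illustration, here is a minimal sketch of the two filters requested above. Everything in it is an assumption made for the example: `Annotation`, `PARENTS`, `ancestors` and `filter_inferred` are hypothetical names, not part of any GO or PomBase pipeline, and the parent link is toy data that mirrors the thi1 example (where GO:0006366 is described as less specific than GO:0045944). A real implementation would take ancestor relations from the ontology itself, and might also compare evidence codes and references, which this sketch ignores.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    gene: str
    term: str
    evidence: str
    reference: str


# Toy parent links (assumption, for illustration only). A real run would
# derive these from the ontology rather than hard-coding them.
PARENTS = {
    "GO:0045944": {"GO:0006366"},
}


def ancestors(term, parents=PARENTS):
    """Return every ancestor of `term` by walking the parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def filter_inferred(existing, inferred):
    """Drop inferred annotations that are exact duplicates of, or less
    specific than, an existing annotation for the same gene."""
    terms_by_gene = {}
    for ann in existing:
        terms_by_gene.setdefault(ann.gene, set()).add(ann.term)

    kept = []
    for ann in inferred:
        direct = terms_by_gene.get(ann.gene, set())
        covered = set(direct)          # i) exact duplicates
        for term in direct:
            covered |= ancestors(term)  # ii) less specific terms
        if ann.term not in covered:
            kept.append(ann)
    return kept


existing = [Annotation("thi1", "GO:0045944", "IMP", "Tang CS et al. (1994)")]
inferred = [
    Annotation("thi1", "GO:0045944", "IMP", "Tang CS et al. (1994)"),  # exact duplicate
    Annotation("thi1", "GO:0006366", "IMP", "Tang CS et al. (1994)"),  # less specific
    Annotation("thi1", "GO:9999999", "IMP", "Tang CS et al. (1994)"),  # made-up ID, not covered
]

kept = filter_inferred(existing, inferred)
print([ann.term for ann in kept])
```

With the toy data, only the annotation to the made-up term survives, because the other two are either identical to the existing annotation or one of its ancestors.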

@pgaudet
Contributor

pgaudet commented Mar 22, 2018

Are they duplicates because they were assigned by different sources?

@ValWood
Contributor Author

ValWood commented Mar 22, 2018

Well, yes. PomBase made them from the paper, and the inference pipeline assigned the duplicates from our annotation!

@rachhuntley made the same point about a slightly different type of redundancy from the inference pipeline here:

geneontology/go-annotation#1487
3. An inferred annotation is created even when there is a pre-existing manual annotation from the same paper with better evidence (See UniProtKB:P46153 GO:0010666 PMID:25950484). With other automated pipelines, duplicated annotations are suppressed.

Although I do not understand this comment about the source of the annotation:

The annotation that is created in this instance is odd; it states that P46153 is inferred to be involved in
"positive regulation of cardiac muscle cell apoptotic process" with evidence from the annotation to "negative regulation of pri-miRNA transcription from RNA polymerase II promoter”, there is no indication that the evidence is actually coming from the extension statement associated with the annotation that the regulation of transcription is positively regulating apoptosis.

@ValWood
Contributor Author

ValWood commented Mar 22, 2018

If I understood correctly,
this ticket is blocked by
geneontology/go-annotation#1427

@ValWood
Contributor Author

ValWood commented Mar 22, 2018

Another example with the redundancy

Lots of annotations to "cell" when we have more specific annotations (over occurs_at links)
(Maybe we should block the term "cell" for direct annotation anyway? @pgaudet)

@ValWood
Contributor Author

ValWood commented Mar 22, 2018

Also the term GO:0005622 intracellular, should we block this one too? I don't think it makes useful inferences anyway.....

@cmungall
Member

We just need a single example of where this is not behaving as expected. A gene ID and an inferred term ID are enough.

@yy20716 will investigate this

Let me try to restate:

The PomBase source has direct annotations of ste11 to GO:0045944:

$ curl -L http://geneontology.org/gene-associations/submission/gene_association.pombase.gz | gzip -dc | grep GO:0045944 | grep SPBC32C12.02
PomBase SPBC32C12.02    ste11           GO:0045944      PMID:8196631    IMP             P       transcription factor Ste11      stex|aff1       protein taxon:4896      20120208        PomBase  has_regulation_target(PomBase:SPAPB8E5.05)      
PomBase SPBC32C12.02    ste11           GO:0045944      PMID:8196631    IMP             P       transcription factor Ste11      stex|aff1       protein taxon:4896      20120208        PomBase  has_regulation_target(PomBase:SPAC513.03)       
...

The prediction file includes an inference of the same gene to this term:

$ curl -L http://snapshot.geneontology.org/products/annotations/pombase-prediction.gaf | grep GO:0045944 | grep SPBC32C12.02
PomBase SPBC32C12.02    ste11           GO:0045944      PMID:10908327   IDA             P       transcription factor Ste11      stex|aff1       protein taxon:4896      20150519        GOC      

In this case it does not matter where the inference comes from. If we intend to filter out redundant annotations, this is a bug. If we are leaving it to the consumer to filter these out, this needs to be documented.

See my comments here on where this should be documented: #2226
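
The two `curl | grep` checks above can be generalised into a sketch that reports every prediction line whose gene and term already appear in the source file. It assumes the standard tab-separated GAF layout, with the DB object ID in column 2, the GO ID in column 5, and comment lines starting with `!`. The function names are hypothetical, and the sample rows are truncated to their first nine columns, with values copied from the ste11 output above plus one row using a made-up term ID.

```python
def gaf_key(line):
    """Return (DB object ID, GO ID) for a GAF annotation line, or None
    for blank lines and '!' comment lines.

    Assumes tab-separated GAF columns: column 2 = object ID, column 5 = GO ID.
    """
    if not line.strip() or line.startswith("!"):
        return None
    cols = line.rstrip("\n").split("\t")
    return cols[1], cols[4]


def redundant_predictions(source_lines, prediction_lines):
    """Return prediction lines whose (gene, term) pair is already
    present in the source annotations."""
    known = {key for key in map(gaf_key, source_lines) if key is not None}
    return [
        line
        for line in prediction_lines
        if gaf_key(line) is not None and gaf_key(line) in known
    ]


def row(*fields):
    """Build a tab-separated sample row (illustration helper)."""
    return "\t".join(fields)


# Sample rows, truncated to nine columns; values taken from the thread.
source = [
    "!gaf-version: 2.0",
    row("PomBase", "SPBC32C12.02", "ste11", "", "GO:0045944",
        "PMID:8196631", "IMP", "", "P"),
]
prediction = [
    row("PomBase", "SPBC32C12.02", "ste11", "", "GO:0045944",
        "PMID:10908327", "IDA", "", "P"),
    # Made-up term ID, included only to show a line that is not flagged.
    row("PomBase", "SPBC32C12.02", "ste11", "", "GO:9999999",
        "PMID:10908327", "IDA", "", "P"),
]

flagged = redundant_predictions(source, prediction)
for line in flagged:
    print(line)
```

This only catches exact gene and term matches. Flagging less specific terms would additionally need ancestor relations from the ontology.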

@ValWood
Contributor Author

ValWood commented Mar 22, 2018

Here is a single example: (number 2 above)

vht1
SPAC1B3.16c

We made the following process annotations from this paper: PMID:12557275

vht1 | GO:1905135 | biotin import across plasma membrane | IMP |
vht1 | GO:1905136 | dethiobiotin import across plasma membrane | IMP |

Then we get the following less specific inferences from the same source: PMID:12557275

  • | GO:0042886 | amide transport | IMP | Stolz J (2003)
  • | GO:1905039 | carboxylic acid transmembrane transport | IMP | Stolz J (2003)
  • | GO:0015718 | monocarboxylic acid transport | IMP | Stolz J (2003)

@ValWood
Contributor Author

ValWood commented Mar 22, 2018

They are quite annoying when we make great efforts to present non-redundant annotation ;)

We filter redundant non-experimental annotations from any source, but these show up because they are experimental. I guess we could change our pipeline to filter duplicates that have GOC as the assigner, but I think it would be better for everyone not to generate more duplicate annotations in the first place, so I haven't actioned this yet.

@ValWood
Copy link
Contributor Author

ValWood commented Mar 22, 2018

Your example for ste11 is slightly different than the examples I am using. It is from 2 independent sources, and it's a different evidence code.
I think there is a strong case to filter these too, but I am more concerned about the ones which come from an identical source.
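
The distinction drawn here, between a duplicate from the identical source and one from an independent source, could be sketched as follows. The function name and the two labels are hypothetical, and the records are plain `(gene, term, reference, evidence)` tuples built from the thi1 and ste11 examples in this thread; it is only an illustration of the idea, not how any existing pipeline classifies annotations.

```python
from collections import defaultdict


def classify_duplicates(existing, inferred):
    """Label each inferred (gene, term, reference, evidence) tuple that
    repeats an existing gene/term pair.

    'identical'   -> same reference and same evidence code
    'independent' -> differs in reference or evidence code
    Inferred tuples with no matching gene/term pair are left out.
    """
    support = defaultdict(set)
    for gene, term, ref, ev in existing:
        support[(gene, term)].add((ref, ev))

    labels = {}
    for rec in inferred:
        gene, term, ref, ev = rec
        if (gene, term) not in support:
            continue
        same = (ref, ev) in support[(gene, term)]
        labels[rec] = "identical" if same else "independent"
    return labels


existing = [
    ("thi1", "GO:0045944", "Tang CS et al. (1994)", "IMP"),
    ("ste11", "GO:0045944", "PMID:8196631", "IMP"),
]
inferred = [
    ("thi1", "GO:0045944", "Tang CS et al. (1994)", "IMP"),
    ("ste11", "GO:0045944", "PMID:10908327", "IDA"),
]

labels = classify_duplicates(existing, inferred)
for rec, label in labels.items():
    print(rec[0], rec[1], label)
```

Under these assumptions the thi1 inference is labelled `identical` and the ste11 inference `independent`, so a filter could start with the first group and treat the second as a separate decision.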

@rachhuntley
Contributor

Responding to Val's comment:
"Although I do not understand this comment about the source of the annotation:

The annotation that is created in this instance is odd; it states that P46153 is inferred to be involved in
"positive regulation of cardiac muscle cell apoptotic process" with evidence from the annotation to "negative regulation of pri-miRNA transcription from RNA polymerase II promoter”, there is no indication that the evidence is actually coming from the extension statement associated with the annotation that the regulation of transcription is positively regulating apoptosis.""

It looks like this inference is no longer made.

@ValWood
Contributor Author

ValWood commented May 11, 2018

closed duplicate
geneontology/go-annotation#1682

has more examples...

@ValWood
Contributor Author

ValWood commented Nov 8, 2018

@cmungall you can probably close this one?

@pgaudet
Contributor

pgaudet commented May 24, 2019

We will develop guidelines: geneontology/go-annotation#2060

@pgaudet pgaudet closed this as completed May 24, 2019