Merge terms transcription factor activity, RNA polymerase II distal enhancer (and proximal) sequence-specific binding and children #16152

Closed
6 tasks done
pgaudet opened this issue Jul 27, 2018 · 37 comments

Comments

@pgaudet
Contributor

pgaudet commented Jul 27, 2018

The transcription working group has agreed that these terms are too specific under the transcription factor branch, and we propose to merge as indicated below:

  • GO:0003705 transcription factor activity, RNA polymerase II distal enhancer sequence-specific binding | 161 manual annotations
  • GO:0000982 transcription factor activity, RNA polymerase II proximal promoter sequence-specific DNA binding | 63 manual annotations
    -> merge into GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific

  • GO:0001205 transcriptional activator activity, RNA polymerase II distal enhancer sequence-specific DNA binding | 44 manual annotations
  • GO:0001077 transcriptional activator activity, RNA polymerase II proximal promoter sequence-specific DNA binding | 651 manual annotations
    -> merge into GO:0001228 DNA-binding transcription activator activity, RNA polymerase II-specific

  • GO:0001206 transcriptional repressor activity, RNA polymerase II distal enhancer sequence-specific binding | 12 manual annotations
  • GO:0001078 transcriptional repressor activity, RNA polymerase II proximal promoter sequence-specific DNA binding | 275 manual annotations
    -> merge into GO:0001227 DNA-binding transcription repressor activity, RNA polymerase II-specific
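
The proposed merges can be written down as a simple lookup from each merged ID to its surviving term. The following is only a minimal sketch in Python, not the GO tooling itself; the `MERGE_TARGETS` name and the `resolve_term` helper are made up for illustration, and the IDs are taken from the list above.

```python
# Hypothetical lookup built from the merge list above: merged ID -> surviving ID.
MERGE_TARGETS = {
    "GO:0003705": "GO:0000981",  # TF activity, distal enhancer
    "GO:0000982": "GO:0000981",  # TF activity, proximal promoter
    "GO:0001205": "GO:0001228",  # activator, distal enhancer
    "GO:0001077": "GO:0001228",  # activator, proximal promoter
    "GO:0001206": "GO:0001227",  # repressor, distal enhancer
    "GO:0001078": "GO:0001227",  # repressor, proximal promoter
}


def resolve_term(go_id: str) -> str:
    """Return the surviving term for a merged ID, or the ID itself if unaffected."""
    return MERGE_TARGETS.get(go_id, go_id)


if __name__ == "__main__":
    for old in ("GO:0001077", "GO:0000981"):
        print(old, "->", resolve_term(old))
```

Terms that are not part of the merge pass through unchanged, so the same helper could be applied to a whole annotation file without special-casing.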

@pgaudet pgaudet self-assigned this Jul 27, 2018
@pgaudet pgaudet changed the title Merge terms transcription factor activity, RNA polymerase II distal enhancer sequence-specific binding and siblings Merge terms transcription factor activity, RNA polymerase II distal enhancer (and proximal) sequence-specific binding and children Jul 27, 2018
@pgaudet
Contributor Author

pgaudet commented Jul 27, 2018

Obviously this is a big change: it affects a lot of annotations, with over 1000 manual EXP annotations to these 6 terms (see table below). In the new version of the ontology these are covered by the terms they would be merged into; if a binding annotation is appropriate it should already be there, according to the previous guidelines.

@krchristie @ValWood @RLovering @thomaspd
Is this OK ?

Group | Number of EXP annotations
AgBase | 1
ARUK-UCL | 3
BHF-UCL | 109
CACAO | 1
CAFA | 4
CGD | 14
ComplexPortal | 8
FlyBase | 187
HGNC | 3
MGI | 185
NTNU_SB | 565
ParkinsonsUK-UCL | 1
PomBase | 192
RGD | 25
SGD | 147
UniProt | 102
WB | 10
ZFIN | 9

@krchristie
Contributor

I am curious as to why the transcription working group thought these were too specific. There are clearly a lot of experimental annotations, which indicates that in a significant number of papers it is possible to distinguish distal from proximal regulatory sequences, though I know there are times when it is not clear which one is being used. With so many direct experimental annotations made just since these terms were created, it seems there is utility for them. The last I had read, there are some very interesting processes in development that differ depending on whether the transcriptional regulation occurs at a more global level (i.e., a distal enhancer) or at the more individual level of the proximal promoters.

Please provide a more detailed explanation for removing this level of specificity.

Thanks,

-Karen

@ValWood
Contributor

ValWood commented Jul 29, 2018

GO:0000982 transcription factor activity, RNA polymerase II proximal promoter sequence-specific DNA binding

should be captured by

GO:0000978 | RNA polymerase II proximal promoter sequence-specific DNA binding
in the DNA binding branch

so currently it is effectively captured twice...

@RLovering

Hi Pascale
are you going to create a table for these annotations?

Also I am afraid not all of our annotations have the DNA binding annotations. I annotated PMID:15147242 in 2011, I am not sure that the guidelines were in place then, but in any case I had not made the separate DNA binding annotations (7 annotations required revisions).

The annotations associated with PMID:21632880 included the separate DNA binding annotation, but I changed GO:0003705 to GO:0000981 (only 1 annotation revised). I realise that this edit isn't required, but as I was checking these it seemed sensible.

However, what is a concern is something that Karen has raised. These dbTFs are binding different regulatory regions and may have a different function according to where they have bound. It looks like some of my annotations are now redundant, but I have left these along the recommended lines, in case there is a change of policy. How can we capture that a TF has a repressor activity when it binds the proximal promoter, but might have an activator activity when bound to a distal enhancer? I can't see how we can get this information linked; it may be possible in Noctua, but ideally it would also be captured somehow with the AE field.

Suggestions?

Ruth

@pgaudet
Contributor Author

pgaudet commented Aug 2, 2018

@krchristie @RLovering -
As @ValWood points out, it seems that the 'DNA binding' branch covers the location of the binding:

  • 'RNA polymerase II regulatory region sequence-specific DNA binding'
    • 'RNA polymerase II distal enhancer sequence-specific DNA binding'
    • 'RNA polymerase II proximal promoter sequence-specific DNA binding'

Moreover, as far as I know, most if not all transcription factors can bind both proximal and distal regions. There is no 'proximal-specific transcription factor' that I know of, so that's not really a function per se.

I'll check with Colin, Marcio and Astrid.
Thanks,

@RLovering

RLovering commented Aug 6, 2018

Hi Pascale

Following our discussion I think we worked out that AE users could capture this using:
'RNA polymerase II proximal promoter sequence-specific DNA binding' AE part_of GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific

and that Noctua users would probably capture with

GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific has_part 'RNA polymerase II proximal promoter sequence-specific DNA binding'

We can also use occurs_at with an SO ID to specify the region bound. Ideally it would be good to have genomic location information, but this would require updating with each new build.

Hopefully I remembered this right and people are happy with this idea. I think this would work. But we have a lot of edits to do and these will take time to complete

Ruth

@pgaudet
Contributor Author

pgaudet commented Aug 7, 2018

Hi @krchristie @krchristie
Thanks for the feedback. I have made a document showing the impact of this proposed merge on the annotations:
https://drive.google.com/file/d/1fn-QMXM3FsoIvn5b7R_KUlL_RK94Y7qk/view?usp=sharing

According to my query, 126 proteins annotated to transcription factor activity are missing a DNA binding annotation, so the merge doesn't make them 'less missing' (see details in the annotation ticket: geneontology/go-annotation#2046).

(Note that I haven't checked if the winning terms are missing annotations in the DNA binding branch; I could do that too but again, this proposal would not affect these missing annotations)
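
The kind of check described here amounts to a set difference: gene products with an annotation in the transcription factor branch, minus gene products with any annotation in the DNA binding branch. A rough sketch follows, assuming annotations are available as (gene product, GO ID) pairs and that both term sets have already been expanded to include descendants. The sample gene names and the helper name are invented for illustration, and this is not the query that was actually run.

```python
from typing import Iterable, Set, Tuple

Annotation = Tuple[str, str]  # (gene product ID, GO term ID)


def missing_dna_binding(
    annotations: Iterable[Annotation],
    tf_terms: Set[str],
    dna_binding_terms: Set[str],
) -> Set[str]:
    """Gene products annotated to a TF-activity term but to no DNA-binding term."""
    with_tf = set()
    with_binding = set()
    for gene, term in annotations:
        if term in tf_terms:
            with_tf.add(gene)
        if term in dna_binding_terms:
            with_binding.add(gene)
    return with_tf - with_binding


if __name__ == "__main__":
    # Made-up example data; real term sets would include all descendants.
    tf_terms = {"GO:0000981", "GO:0001228", "GO:0001227"}
    dna_binding_terms = {"GO:0000976", "GO:0000987"}
    annotations = [
        ("geneA", "GO:0000981"),
        ("geneA", "GO:0000987"),
        ("geneB", "GO:0001228"),
    ]
    print(sorted(missing_dna_binding(annotations, tf_terms, dna_binding_terms)))
```

Because the result depends only on which branch each term falls in, merging terms within the transcription factor branch leaves it unchanged, which is the point made above.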

Does that seem reasonable to you ? I'd like to bring that up at the next annotation call @vanaukenk

Thanks, Pascale

@ValWood
Contributor

ValWood commented Aug 8, 2018

I have a suggestion.

I see what Ruth is trying to do with
'RNA polymerase II proximal promoter sequence-specific DNA binding' AE part_of GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific

But it seems strange to use "part_of" to connect 2 molecular functions (when you are really trying to describe a single activity, and implies that the term should be instantiated).

A "DNA binding transcription factor" binds the DNA (MF) and connects to the transcription machinery to "regulate transcription" (BP). The elemental activity of a DNA binding transcription factor is as an adaptor between the bound DNA region and the polymerase.

This is what we have right now:

transcription_factor

(ignore the process and component parts of this graph; I x'd these out because I needed to lop off the root nodes to get the resolution)

We are removing the specific children of the DNA binding transcription factor, because it is a duplication. However, I think we could simplify curation here by implementing the following:

  1. Because GO:0003700 DNA-binding transcription factor activity is defined as:
    A protein or a member of a complex that interacts selectively and non-covalently with a specific DNA sequence (sometimes referred to as a motif) within the regulatory region of a gene to modulate transcription. Regulatory regions include promoters (proximal and distal) and enhancers. Genes are transcriptional units, and include bacterial operons.

I've never been able to see any reason why GO:0003700 cannot have the parent:
GO:0000976 transcription regulatory region sequence-specific DNA binding
Interacting selectively and non-covalently with a specific sequence of DNA that is part of a regulatory region that controls transcription of that section of the DNA. The transcribed region might be described as a gene, cistron, or operon.

in addition to the current parent, because the definition is describing a DNA binding activity.

GO:0000976 should NOT be used if you have no evidence for DNA binding (include author intent here & sequence similarity here). In fact, I would argue that "DNA binding" is absolutely what the term represents, because if you don't know this, then you can only annotate to the process "regulation of transcription". It should be made clear that if you use this term the evidence you use refers to the DNA binding activity (but combinatorial evidence might be helpful here....I usually default to EXP in these cases which use 2 lines of evidence).

Whatever, DNA binding IS ALWAYS TRUE.
Quoting @cmungall "The ontology is for generalizations that always hold, annotations for everything else"

Why put the burden on the curator to remember they need to annotate in two branches to capture this? (especially as we don't have other documented situations where this is necessary)

If you are making an annotation to "GO:0003700 DNA-binding transcription factor activity", DNA binding is implicit, whether you have evidence for DNA binding or not.

You can then capture (if you have evidence) the precise type of element, or even the specific motif your transcription factor binds to, with an extension.
DNA binding regions are not really the province of the GO and should be captured using SO (in extensions).

So, for example if your DNA-binding transcription factor binds to a
GO:0000987 proximal promoter sequence-specific DNA binding
you would annotate

GO:0003700 DNA-binding transcription factor activity (or child) occurs_at SO:0001668
proximal_promoter_element

if GO:0001158 enhancer sequence-specific DNA binding
the extension would be:
SO:0000165 enhancer
A cis-acting sequence that increases the utilization of (some) eukaryotic promoters, and can function in either orientation and in any location (upstream or downstream) relative to the promoter.

You can also request specific proximal promoter or enhancer motifs from SO.
A small set of extant examples includes:

sterol_regulatory_element (SO:0001861)
NDM2_motif (SO:0001167)
regulatory_promoter_element (SO:0001678)
DRE (SO:0001845)
STREP_motif (SO:0001859)
forkhead_motif (SO:0001847)
DMv4_motif (SO:0001157)
HSE (SO:0001850)
pheromone_response_element (SO:0002045)
AP_1_binding_site (SO:0001842)
homol_E_box (SO:0001849)
zinc_repressed_element (SO:0002006)
CArG_box (SO:0002156)
CDRE_motif (SO:0001865)

So you can be absolutely precise about the exact motif bound.
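
To make the extension idea concrete, here is a small sketch that renders an annotation with an `occurs_at` extension in the `relation(ID)` style used for annotation extensions. The `ExtendedAnnotation` class and its field names are illustrative assumptions, not part of any GO tool; the GO and SO IDs come from the examples above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ExtendedAnnotation:
    """Illustrative container: one GO term plus (relation, target ID) extensions."""
    go_id: str
    extensions: List[Tuple[str, str]] = field(default_factory=list)

    def add_extension(self, relation: str, target_id: str) -> None:
        self.extensions.append((relation, target_id))

    def extension_string(self) -> str:
        # Comma-joined relation(ID) pairs, e.g. occurs_at(SO:0001668)
        return ",".join(f"{rel}({target})" for rel, target in self.extensions)


if __name__ == "__main__":
    # DNA-binding TF activity at a proximal promoter element
    proximal = ExtendedAnnotation("GO:0003700")
    proximal.add_extension("occurs_at", "SO:0001668")

    # The same activity at an enhancer
    distal = ExtendedAnnotation("GO:0003700")
    distal.add_extension("occurs_at", "SO:0000165")

    for ann in (proximal, distal):
        print(ann.go_id, ann.extension_string())
```

The GO term stays the same in both cases; only the SO target changes, and a more specific motif ID from the list above could be substituted in the same slot.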

This would have a lot of advantages :

  • Easier for curators (especially new ones), who do not realize that you need to use 2 terms in different branches to describe a DNA binding transcription factor.
  • Annotation would become much more consistent (A single term to describe the activity of a DNA binding TF).
  • Annotation would also become more descriptive if the precise binding motif was captured. Much better for general GO users, but also for people trying to construct Txn regulatory networks etc.
  • Better definition between SO and GO.

@ValWood
Contributor

ValWood commented Aug 8, 2018

eg

so this annotation would simplify to:
DNA binding transcription factor activity, RNA polymerase II specific
at Ace2_UAS regulates x,y,z

@hattrill

hattrill commented Aug 8, 2018

Big YES vote from FlyBase - honestly, trying to explain to newbies that they have to add two terms for one MF, especially when that MF looks like a composite term, is tiresome. Having to remind them every few months is tiresome too!

@ValWood
Contributor

ValWood commented Aug 8, 2018

I'm tired of trying to remember myself. I run a query every couple of months to make everything equivalent in both branches!

@hattrill

hattrill commented Aug 8, 2018

Imagine the FTE saved!

@bmeldal

bmeldal commented Aug 8, 2018

Nice one, Val!
And here we are at CP - just winging it as everything is more complex for complexes! One term for Tx activities would be great!

And the new guide for DNA-bdg TF activity vs Tx coregulator activity vs general Tx initiation factor activity is working a treat for me!

@pgaudet
Contributor Author

pgaudet commented Aug 8, 2018

I created a new ticket to discuss the parents of 'GO:0003700 DNA-binding transcription factor activity': #16214

Here I just want to discuss the merge proposed above.

Thanks, Pascale

@ValWood
Contributor

ValWood commented Aug 8, 2018

Here I just want to discuss the merge proposed above.

The reason I mention it here is that fully implementing this would change how the annotations in the merge branch are treated. At present they would only be reannotated/transferred to the term in the DNA binding branch. In the proposed scenario these terms (region-specific DNA binding terms) would also go away and we would use SO extensions for this instead (which is what we should be doing already IMHO).

So here, imagine that Ruth moves all her promoter type terms under GO:0003700 to the DNA binding branch. Then afterwards we trim the DNA binding terms which mention specific binding regions to use SO extensions...Ruth would need to reannotate them all again to add the correct SO extension.

One solution would be for Ruth (or anyone) to add the 'occurs_at' SO:xxxx during this migration.
Then if all the future changes are dealt with by merges (which should be possible), the annotation will be preserved, and will be already present.

I just want to prevent doing anything twice.

@ValWood
Contributor

ValWood commented Aug 8, 2018

So instead of doing only

GO:0001078 proximal promoter DNA-binding transcription repressor activity, RNA polymerase II-specific
into
GO:0000987 proximal promoter sequence-specific DNA binding

It would be into
GO:0000987 proximal promoter sequence-specific DNA binding occurs_at SO:0001668
proximal_promoter_element (or a child)
(i.e adding the SO extension).

The extension would be preserved in any future merge into the GO:0000981 branch

Actually, if the proposal were implemented, no reannotation would be necessary except for the addition of extensions. Everything would be dealt with by the merges of the specific region binding terms and additional DNA binding parent for
GO:0000981 DNA-binding transcription factor activity, RNA polymerase II-specific

@srengel

srengel commented Aug 8, 2018

I support Val’s proposal.

@RLovering

If you add a 'DNA binding' parent to 'GO:0003700 DNA-binding transcription factor activity', then if you say a protein regulates DNA-binding transcription factor activity you are saying that the protein regulates DNA binding, which is not always true.

This is why we should NOT add 'DNA binding' parent to 'GO:0003700 DNA-binding transcription factor activity'

@ValWood
Contributor

ValWood commented Aug 8, 2018

comment moved to
#16214
I would say if you annotated a protein to
"regulates DNA-binding transcription factor activity"
then you ARE saying that protein regulates DNA binding.
there is no MF "DNA-binding transcription factor activity regulator"
but if there was this is what it would mean.
The process terms
GO:0051090 regulation of DNA-binding transcription factor activity
would imply that binding is regulated in some way, directly or indirectly.
but maybe we shouldn't be annotating these as a process anyways?
Most of these appear to be describing the TF at the end of a signalling pathway?
" x signalling" has_regulation target(specific transcription factor)
would seem to be the correct way to do this?
A lot of the annotations here don't make sense in terms of "DNA binding transcription factors".
For example:
KDM5A | Lysine-specific demethylase 5A |   | regulation of DNA-binding transcription factor activity | has_direct_input UniProtKB:P25440 | CAFA
but the input UniProtKB:P25440 is not a DNA binding TF.
Many others have "general" terms but do not specify which TF is regulated so are not really useful.
A lot are co-repressors etc
Lots of inconsistency here too.....

@ValWood
Contributor

ValWood commented Aug 8, 2018

comment moved to
#16214
For the record
GO:0051090 regulation of DNA-binding transcription factor activity
is defined
Any process that modulates the frequency, rate or extent of the activity of a transcription factor, any factor involved in the initiation or regulation of transcription. Source: GOC:ai
so this isn't even specific for "DNA-binding transcription factors" despite the term name, because it defines a "transcription factor" as anything involved in transcription regulation....

@srengel

srengel commented Aug 8, 2018

bleck..that GO:0051090 is a mess and needs attention!

@ValWood
Contributor

ValWood commented Aug 8, 2018

comment moved to
#16214
we had 7 annotations to this term and descendants. They are easy to rehouse
pombase/curation#2137
"regulation of DNA binding transcription factor activity" is really an example of a function term in the process ontology.
All of ours could be described as "function" has substrate "involved in process"
Some will be more complicated as they will be "has_regulation_target" of a signalling pathway
(i.e not direct substrate)

@RLovering

Regulation of DNA binding TFs does not always lead to regulation of DNA binding. I am not sure where you are going with this, but years ago (like 8 years ago) I requested that DNA binding be removed as a parent of DNA binding TF activity because I was unable to annotate a protein to regulation of DNA binding TF activity, and it took at least a year (or so it seemed) for this to be done. There are other places in the ontology where a binding parent has been removed for the same reason. Please think very carefully before reinstating this parent.

@ValWood
Contributor

ValWood commented Aug 8, 2018

bleck..that GO:0051090 is a mess and needs attention!

yup, I probably shouldn't have looked under that stone today....
but this is a good example of exactly why we should get rid of these types of term.

@bmeldal

bmeldal commented Aug 8, 2018

Opened another can of worms, have you Val ;-)
At first glance, we have 14 complexes annotated to GO:0051090 or children.

Btw, this latest discussion should be in #16214 (Proposal to add 'DNA binding' parent to 'GO:0003700 DNA-binding transcription factor activity') as we are talking about the pros and cons of adding this relationship...

@ValWood
Contributor

ValWood commented Aug 8, 2018

move to
#16214

Ruth, I don't know what else you mean by
"regulation of DNA binding TF activity"
Do you have a specific example of where you are regulating the "transcription factor activity" but you are not regulating the binding?
DNA-binding transcription factor activity (GO:0003700)
A protein or a member of a complex that interacts selectively and non-covalently with a specific DNA sequence (sometimes referred to as a motif) within the regulatory region of a gene to modulate transcription.
The first part of this def is the DNA binding activity. The second part is the process that is regulated by the binding.
So what can you be regulating apart from the DNA binding here?

@ValWood
Contributor

ValWood commented Aug 8, 2018

Btw, this latest discussion should be in #16214 (Proposal to add 'DNA binding' parent to 'GO:0003700 DNA-binding transcription factor activity') as we are talking about the pros and cons of adding this relationship...

Yes I'm in the wrong ticket. Will migrate some comments to #162ro

@RLovering

I would like to propose that before the merge Tony/Alex add to the AE field the specific DNA binding statements so that if someone wants to look for papers that provide specific location of the TF binding sites then these annotations can be used to initially triage for these papers.

So the suggestion I am making is:

Add the AE field occurs_at SO_0000165 (enhancer) to these terms

GO:0003705 transcription factor activity, RNA polymerase II distal enhancer sequence-specific binding
GO:0001205 transcriptional activator activity, RNA polymerase II distal enhancer sequence-specific DNA binding
GO:0001206 transcriptional repressor activity, RNA polymerase II distal enhancer sequence-specific binding

Add the AE field occurs_at SO_0001952 (promoter_flanking_region)

GO:0000982 transcription factor activity, RNA polymerase II proximal promoter sequence-specific DNA binding
GO:0001077 transcriptional activator activity, RNA polymerase II proximal promoter sequence-specific DNA binding
GO:0001078 transcriptional repressor activity, RNA polymerase II proximal promoter sequence-specific DNA binding

This will only be possible for the annotations submitted in Protein2GO, other groups can consider whether this is something they are interested in
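
The pre-merge step suggested above can be sketched as a lookup from each of the six terms to the SO region it implies. This is only an illustration, assuming an annotation is a GO ID plus a list of extension strings; the function and variable names are hypothetical, the SO IDs are written with a colon here, and this is not how Protein2GO implements anything.

```python
# SO region implied by each of the six terms, per the suggestion above.
ENHANCER = "SO:0000165"           # enhancer
PROMOTER_FLANKING = "SO:0001952"  # promoter_flanking_region

REGION_FOR_TERM = {
    "GO:0003705": ENHANCER,
    "GO:0001205": ENHANCER,
    "GO:0001206": ENHANCER,
    "GO:0000982": PROMOTER_FLANKING,
    "GO:0001077": PROMOTER_FLANKING,
    "GO:0001078": PROMOTER_FLANKING,
}


def add_region_extension(go_id, extensions):
    """Return a new extension list with occurs_at(<SO ID>) added if the term implies one."""
    region = REGION_FOR_TERM.get(go_id)
    if region is None:
        return list(extensions)  # term is not one of the six; nothing to add
    new_ext = f"occurs_at({region})"
    if new_ext in extensions:
        return list(extensions)  # already recorded, avoid duplicates
    return list(extensions) + [new_ext]


if __name__ == "__main__":
    print(add_region_extension("GO:0001077", []))
    print(add_region_extension("GO:0003705", []))
    print(add_region_extension("GO:0000981", []))
```

Running this before the terms are merged means the region information survives on the annotation even after the GO ID no longer carries it.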

Ruth

@tonysawfordebi

@tonysawfordebi
Contributor

Tagging @alexsign so he's aware of this.

@RLovering

Hi Tony and Alex

Pascale is now looking to make the merge described above. Before this merge takes place we need annotation extension information to be added to each of these annotations. Please could you contact all relevant Protein2GO users to confirm whether you may add this information to the AE field.

Herein I give you permission to do this for all UCL annotations.
Kimberly and Marcio/Astrid have verbally agreed to this, so hopefully these groups will confirm quickly.

Best

Ruth

@hattrill

hattrill commented Oct 9, 2018

I (where "I" may stand for FlyBase or the person known as H. Attrill), do not require the AE field populating. I hereby give permission for Tony or Alex (or other persons nominated by GOA) to merge the terms as specified by P. Gaudet in this ticket.

@pgaudet
Contributor Author

pgaudet commented Oct 10, 2018

For information: Annotations are here:

NOTE THAT THESE WERE DOWNLOADED A COUPLE OF WEEKS AGO

1. Annotations that already have extensions:
https://docs.google.com/spreadsheets/d/12_ywu5lnlUoiq6bEl9NAdj_UsH5sh_nst4C5gDLIG4Q/edit#gid=0

Contributor (Assigned by) | Count
BHF-UCL | 15
MGI | 61
NTNU_SB | 39
PomBase | 190
SGD | 10
UniProt | 13
2. Annotations that DO NOT have extensions:
https://docs.google.com/spreadsheets/d/1dW1EP1FfykiexseJngzCihmwTx6KwYPA5dkgq2B6HWc/edit#gid=0

Contributor (Assigned by) | Count
AgBase | 1
BHF-UCL | 72
CACAO | 1
CAFA | 2
CGD | 12
ComplexPortal | 1
FlyBase | 146
MGI | 77
NTNU_SB | 372
PINC | 8
PomBase | 51
RGD | 24
SGD | 140
UniProt | 71
UniProtKB | 8
WB | 4
ZFIN | 3
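
For a sense of scale, the two tables can be totalled directly. The sketch below only sums the counts listed above (the dictionary and function names are mine), and the caveat that the data were downloaded a couple of weeks earlier still applies.

```python
# Counts copied from the two tables above.
WITH_EXTENSIONS = {
    "BHF-UCL": 15, "MGI": 61, "NTNU_SB": 39,
    "PomBase": 190, "SGD": 10, "UniProt": 13,
}
WITHOUT_EXTENSIONS = {
    "AgBase": 1, "BHF-UCL": 72, "CACAO": 1, "CAFA": 2, "CGD": 12,
    "ComplexPortal": 1, "FlyBase": 146, "MGI": 77, "NTNU_SB": 372,
    "PINC": 8, "PomBase": 51, "RGD": 24, "SGD": 140, "UniProt": 71,
    "UniProtKB": 8, "WB": 4, "ZFIN": 3,
}


def totals():
    """Return (with, without, fraction without) for the listed annotations."""
    n_with = sum(WITH_EXTENSIONS.values())
    n_without = sum(WITHOUT_EXTENSIONS.values())
    return n_with, n_without, n_without / (n_with + n_without)


if __name__ == "__main__":
    n_with, n_without, frac = totals()
    print(f"with extensions: {n_with}")
    print(f"without extensions: {n_without}")
    print(f"fraction without: {frac:.2f}")
```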

Thanks, Pascale

@mlacencio

Hi @pgaudet and @RLovering !

The NTNU team will issue a final decision about the retention of regulatory information at the AE field on October 16th. We are ok with the merge, but unfortunately we have not yet reached an agreement on the retention of regulatory information.

Sorry for issuing a decision only next week, but the problem is that this current week is the autumn break week here in Norway.

Best regards!

@ValWood
Contributor

ValWood commented Oct 10, 2018

PomBase don't need any AE field populating. We have used the AE field specifically for the individual TF binding motif when known, like so:
https://www.pombase.org/browse-curation/dna-binding-sites

All of our promoters are proximal not distal, so there will be no loss of information for us from the merge. We don't need to state "proximal" explicitly.

@tonysawfordebi
Contributor

Just to be clear, we (GOA) would only be populating extensions in annotations that are managed by Protein2GO; groups that do not use P2G as their annotation tool will be left to their own devices...

@bmeldal

bmeldal commented Oct 10, 2018

As far as I can see, I have made the changes already as my 2 complex terms are now annotated to GO:0001228 DNA-binding transcription activator activity, RNA polymerase II-specific.

@suzialeksander
Contributor

@tonysawfordebi @alexsign SGD is fine with the merge and the addition of the SO field. Thanks!
