Availability of Sequence Ontology Terms in Noctua (go-lego) #20419
Comments
Can the priority of this ticket be increased? If we have the resources, of course. |
Hi @jesualdotomasfernandezbreis, were you there for Karen's presentation on MSO at the GREEKC meeting? We're currently discussing SO vs MSO (but don't want to block any progress or work you need to do). Do you and other GREEKC participants have any opinions or requirements for this decision? |
In GREEKC we are currently using SO but we will use MSO. We will migrate once GO integrates MSO. |
See also: geneontology/noctua#561 |
@jesualdotomasfernandezbreis Are the terms needed already in MSO? Perhaps we should directly load MSO? Pascale |
We are currently using the SO terms, but they will have to be replaced by the corresponding MSO terms. For such a migration we agreed to wait until the SO terms are replaced by the MSO ones in the GO. But if the MSO terms become available in Noctua we could start using them. Best, |
From discussion at 2020-11-17 GO-CAM Jamboree. The GREEKC group would like more SO terms available to them in Noctua. Is the MSO/SO distinction still important for GO-CAMs? If not, can we add all of SO to neo? If yes, does anyone know what the current status of the MSO vs SO work is? |
I'm going to move this issue to the ontology repo, since that's where this change would be made (in production of "go-lego"). |
Thank you @balhoff ! |
Hello all, SO should be sufficient for the purposes of GO and will contain the most recent updates. The member that created MSO has been in a new position for a couple of years and MSO has not really been maintained by the SO branch. All updates to SO have only been added to SO on the SO GitHub page and not to MSO. Best, Dave Sant |
Hi, I am working with Colin to add new SO terms as logical definitions for GO terms. Once this is done we'll re-evaluate the need to include SO in Noctua. Thanks, Pascale |
These are the terms of interest from SO (first I'll add them to the imports):
SO:0000727 CRM (cis-regulatory module) (is a)
SO:0000235 TF binding site (has part) binds GO:0003700 and GO:0000981, which bind GO:0003712
Pull request for this list: #20576 |
Note that PomBase uses a lot of SO IDs in extensions for RNA polymerase II cis-regulatory region sequence-specific DNA binding. We have used all of these. Since this is the only way we have to connect a transcription factor to a binding site on the transcription factor gene pages, and it is useful information for our users, we will still do this in PomBase and filter the extensions for submission to GO if it is disallowed. But for the list above, many seem likely to be redundant with the GO terms. |
Hi @ValWood @davidwsant: From PomBase's use of a constellation of SO terms, it looks like SO could also host the currently known specificities of the human dbTF monomeric or homomultimeric DNA binding sequence motifs. There are fewer than one thousand of these, mapping to more than one thousand human dbTFs, but they may be relevant to tens of thousands of dbTFs across the phylogenetic species trees. GREEKC has contacts with the researchers that could feed such an annotation into SO, namely the authors of the Catalogue manuscript https://www.biorxiv.org/content/10.1101/2020.10.28.359232v2. Hence proteins and their annotation in GO terms would include SO:term entries that can be linked to DNA position weight matrices. One other GREEKC authority to consult if SO wishes to do that is Philipp Bucher, perhaps? In the nitty gritty spirit, a mapping of the existing SO:motif and SO:element entries to the available human motifs could be performed? Forkhead, to name one example. But that is not strictly necessary if all the new entries are in the first instance specified as human? @davidwsant: How does SO envisage species? Just to be clear: the annotation exercise for GO-Noctua models would concern experimentally / biochemically determined chromosomal binding sites linking to one or multiple genes. For that, a placeholder for the DNA material entity is needed in the form genomebuild:chromosome:start-end. This is wholly independent of the notion of DNA sequence specificity that the concepts ‘motif’ and ‘element’ encompass. Nevertheless, some researchers have elucidated both the genomic position for a transcription regulator and the local chromosomal DNA sequence corresponding to a motif instance, and capturing this is an ambition for many GREEKC use cases. What strikes me is that PomBase is using SO terms to provide a non-amino acid encyclopedic definition of the genomic binding sites involved, simply linking gene entries to granular SO terms. 
Most of the human dbTFs could have something like this too, as the ‘motif’ / ‘element’ information is often known for the simplest biochemical interaction: pure DNA and 1 pure recombinant dbTF protein. Alongside this, other motif types derived from ChIP-seq experiments are also available for many human dbTFs (heteromeric complexes), and databases for these exist too. Epigenetically, accessibility, the DNA base methylation status of the sequences, and nucleosome modification are also biologically important; these have also been studied and documented and could enter Noctua-based high-throughput annotation, which is why the above small selection of SO terms was requested. Ultimately, SO:term instances of genomebuild:chromosome:start-end will enable inputting and reasoning computationally across GO’s universe of annotations when transcription regulator activities are considered. If, additionally, the SO: description includes DNA motif / element information like it currently does for Pombe, that would be a nice windfall for GO, wouldn’t it? |
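To make the genomebuild:chromosome:start-end placeholder mentioned above concrete, here is a minimal Python sketch of how such a string might be parsed and validated. The class name `GenomicInterval`, the function `parse_interval`, the example build and coordinates, and the rule that start must not exceed end are all assumptions made for illustration; the thread does not specify a coordinate convention (for example 0-based versus 1-based) or any implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicInterval:
    """Hypothetical container for a genomebuild:chromosome:start-end string."""
    build: str
    chromosome: str
    start: int
    end: int


# Three colon-separated fields; the last one is "start-end" with integer bounds.
_PATTERN = re.compile(r"^([^:\s]+):([^:\s]+):(\d+)-(\d+)$")


def parse_interval(text: str) -> GenomicInterval:
    """Parse 'build:chromosome:start-end' and reject malformed input."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not in build:chromosome:start-end form: {text!r}")
    build, chromosome, start, end = match.groups()
    start_pos, end_pos = int(start), int(end)
    if start_pos > end_pos:
        # Assumption: intervals are written with the smaller coordinate first.
        raise ValueError(f"start is greater than end: {text!r}")
    return GenomicInterval(build, chromosome, start_pos, end_pos)


# Made-up coordinates, used only to show the shape of the input.
site = parse_interval("GRCh38:chr6:151800000-151800020")
print(site.build, site.chromosome, site.start, site.end)
```

A structured form like this would keep the genomic position of a binding site separate from any motif or element term attached to it, which is the distinction drawn in the comment above.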
and it's 1000 terms for human... |
Hi guys, As TF binding sites are DNA motifs, I think it makes sense to add the annotations to SO. I think we would not separate terms by species, or at least I have not done this in the past. For example, bacterial rRNA terms and eukaryotic rRNA terms both fall under the same parent. I think this would hold true for the transcription factors as well. Currently I can find one yeast TF, pheromone_response_element, which is_a TF_binding_site. A sister term of this is retinoic_acid_responsive_element, which is present in humans. I looked at the link from the Catalogue manuscript. It looks like the link has the names of all of the transcription factors, but I do not see the consensus sequences. It has 1,429 TFs listed, and it appears as though they are all human. I will need to get the consensus sequences for these to include in the definitions. I also have a question about the names for the terms. How about something that is listed as ESR1? Would ESR1_binding_motif be a child of TF_binding_site, or would we include it in estrogen_response_element? I would like some input from others on that. Colin, what is your take? Thanks, |
Hi Dave I think that Arttu Jolma (https://www.sciencedirect.com/science/article/pii/S0092867412014961) would be a good person to discuss this with. Best Ruth |
This is no small enterprise. Perhaps a couple of months to get
everyone to agree (and to disagree in part). However, if there is SO
commitment we can explore with the GREEKC experts.
The particular data set I had in mind is the set of specificities for human
dbTF exposed individually to DNA, and is not as such a response element
like for example the estrogen response element. It may be better described
as the 'ESR1 motif', and is a biochemically defined object, namely 'the DNA
sequences this individual dbTF binds well'. In humans, many dbTFs form
heterodimers, however, and the natural chromosomal response elements
therefore tend to include two binding sites, one for each subunit of the
dimeric protein complex.
…On Wed, Dec 16, 2020 at 2:08 AM David Sant ***@***.***> wrote:
Hi guys,
As TF binding sites are DNA motifs, I think this makes sense to add the
annotations to SO.
I think we would not separate terms by species, or at least I have not
done this in the past. For example, bacterial rRNA terms and eukaryotic
rRNA terms both fall under the same parent. I think this would hold true
for the transcription factors as well. Currently I can find one yeast TF,
pheromone_response_element
<http://sequenceontology.org/browser/current_svn/term/SO:0002045>, which
is_a TF_binding_site
<http://sequenceontology.org/browser/current_svn/term/SO:0000235>. A
sister term of this is retinoic_acid_responsive_element
<http://sequenceontology.org/browser/current_svn/term/SO:0001653> which
is present in humans.
I looked at the link from the Catalogue manuscript
<https://www.ebi.ac.uk/QuickGO/targetset/dbTF>. It looks like the link
has the names of all of the transcription factors, but I do not see the
consensus sequences. It has 1,429 TFs listed, and it appears as though they
are all human. I will need to get the consensus sequences for these to
include in the definitions.
I also have a question about the names for the terms. How about something
that is listed as ESR1? Would ESR1_binding_motif be a child of
TF_binding_site, or would we include it in estrogen_response_element? I
would like some input from others on that.
Colin, what is your take?
Thanks,
Dave
|
I agree with what Ruth said about it being a very flat ontology under TF_binding_site. I would prefer that they not all be listed under a single term. I agree, this sounds like a very large undertaking. I do think getting some help from some experts would be a good idea. Colin, you mentioned getting help from the GREEKC experts. Do you think they would be willing to help even though GREEKC is ending? |
How many binding sites are currently known? I'm presuming that this info is only available currently for a subset of transcription factors? ( so even though a large undertaking, it will be sporadic once the known sites are included). |
I believe the latest version of the ENCODE project includes information from ChIP-seq experiments with hundreds of different transcription factors across several cell types. I have been looking through the data, and it looks like K562 cells have 628 different transcription factor experiments, but some of them are replicates (like POLR2A, MYC and JUN). Here is the link where you can see the different experiments studying DNA-binding proteins through ChIP-seq. I am not particularly familiar with the data here or how to access it, but I believe they have a way to download the locations of called peaks for each transcription factor for each cell type studied. I know the pipeline I have previously used for analyzing ChIP-seq was the pipeline they developed in this consortium (called the irreproducible discovery rate, IDR). While this is definitely still a subset, if it has even 500 transcription factors that would be a great deal. |
I believe one thing should be done at a time. Providing the dbTF intrinsic DNA binding motif is feasible. But the very many ChIP-seq datasets are a different matter altogether, because the chromosomal binding sites for dbTFs vary in their occupancy within one cell type as a function of environmental/cell culture conditions, and between cell lineages. Those are very much a matter of study. Can SO host the coordinates for the 1500 human dbTF genomic binding sites in all the different human cell types? Is such a thing desirable when ENCODE already has all this information available? I think not. What SO can provide is a controlled vocabulary in the form of terms that make precise operations possible. However, the motifs that are bound by dbTFs are protein-specific, and they can be stored as position weight matrices that are equivalent to a consensus DNA binding site. While for ChIP motifs there is still much dissent/discussion as to what exactly and how exactly these should be rendered, for the in vitro case (pure DNA + pure individual dbTF protein) this is not contentious/disputed, is 'absolute', and is available for more than 1000 dbTFs (I counted 1007 dbTFs from the current human dbTF Catalogue with such an associated motif). The group of GREEKC experts that can help are Philipp Bucher and the authors of the dbTF Catalogue paper. As for their concrete contribution, I would let them (Ivan, Oriol, Arttu + Philipp and other GREEKC experts they see fit to include) discuss whether they want to do this (in January 2021?) and how to go about it. What GO and SO can do at this point in time is to tell these experts that SO is committed to storing the resulting product and that their efforts will therefore not result in an ephemeral product. @ValWood @davidwsant @RLovering The instances of dbTF DNA binding to the genome can ultimately be captured by GO-CAM-type annotations, which is why GREEKC needs SO terms, be they adopted as GO terms or as SO stand-alone terms. 
One major conceptual hurdle is that by itself, the ChIP-seq experiment does not provide proof that there is causality for gene regulation. Hence, not every ChIP-seq site can be labeled as a 'response element' while they can all be labelled as genomic binding sites. In my mind, the annotation of genomic dbTF binding sites and response elements on the genome is therefore complementary but orthogonal to the creation of generic SO motifs for each human dbTF in SO. It is the latter that can be done in the short term and the product is independent of heteromeric dbTF-dbTF and dbTF-cofactor interactions at the genomic binding sites. Can everybody see the distinction? |
Hi Colin, I agree, trying to add the individual locations of binding motifs would not be consistent with SO. That is actually not what I had in mind. Adding the motifs would be a possibility, however. I don't know if we can use the position weight matrices because the definitions in SO can't hold multiple dimensions. For RARE, for example, we just put the consensus sequence as PuGGTCA. Do you think this is fine, or do you think it would be better to do something like this?
A [ 12  0  1  0  0 22  4 ]
C [  0  0  0  0 18  0  7 ]
G [ 11 23 13  1  3  1  5 ]
T [  0  0  9 22  2  0  7 ]
I agree that binding does not necessarily mean that it is regulating any gene. I think that labeling them as binding sites is probably a good call. Dave |
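To illustrate the trade-off between a one-line consensus and the full count matrix discussed in the comment above, here is a minimal Python sketch that collapses the quoted matrix into an IUPAC consensus string. The function name `consensus` and the `min_fraction=0.4` cut-off are assumptions chosen for illustration only; neither SO nor the thread prescribes a rule for deriving a consensus from counts.

```python
# Count matrix quoted in the comment above (rows A, C, G, T; seven positions).
COUNTS = {
    "A": [12, 0, 1, 0, 0, 22, 4],
    "C": [0, 0, 0, 0, 18, 0, 7],
    "G": [11, 23, 13, 1, 3, 1, 5],
    "T": [0, 0, 9, 22, 2, 0, 7],
}

# Standard IUPAC nucleotide ambiguity codes, keyed by the set of allowed bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(counts, min_fraction=0.4):
    """Collapse a count matrix into one IUPAC letter per position.

    A base is kept at a position when its share of that column is at least
    min_fraction (a hypothetical threshold). Positions where no base reaches
    the threshold are written as N.
    """
    length = len(counts["A"])
    letters = []
    for i in range(length):
        column = {base: counts[base][i] for base in "ACGT"}
        total = sum(column.values())
        kept = frozenset(
            base for base, n in column.items()
            if total and n / total >= min_fraction
        )
        letters.append(IUPAC.get(kept, "N"))
    return "".join(letters)


print(consensus(COUNTS))  # RGGTCAN
```

With this cut-off the result is RGGTCAN, where R is the IUPAC code for a purine, so the first six letters match the PuGGTCA consensus given for RARE. The collapsed string no longer shows, for example, that T is seen in 9 of 23 sequences at the third position, which is the kind of detail only the matrix form retains.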
Hi.
Would it be possible to have the Sequence Ontology terms in NOCTUA, so we could use them in the models?
Thanks,
Jesualdo