new to PheKnowLator #98
Replies: 5 comments 11 replies
-
@fmellomascarenhas - Thank you so much for your interest in PheKnowLator! I really appreciate your feedback and initial reactions to the repo. I'd also love to learn more about your use case; it sounds really interesting! 🙂 These are really great questions. I'm going to organize some information this evening to help me answer your questions and will be in touch tomorrow morning.
-
Hi, I was able to find the description of the data in the documentation (again, it's amazing to see how organized it is). Could you please explain the DisGeNet criteria when you have some time? I was also analyzing the data and found ~39k unique gene IDs, while the human genome (as far as I understand) has around 20k. I don't have a biology background, so could you give me some intuition about this number? Thanks!
-
Question 1:
Training leakage is really challenging, and I suspect it can get really precarious depending on the types of models you are interested in training and your training protocol (i.e., open versus closed world assumption, whether or not you will include inverse relations, sampling strategy, and loss function), all of which have been shown to matter in recent ablation studies (Ali et al., 2020; Ruffinelli et al., 2020). PheKnowLator, as a knowledge graph construction ecosystem, does not initially apply any control/solution for this. There are a few reasons for this, the biggest being that we are currently in the process of evaluating the different types of knowledge graphs that PheKnowLator can build, so we don't yet have empirical evidence to demonstrate which builds, if used for tasks like link prediction, may have leakage. This is something I am very interested in understanding further, though. There are definitely ways that this can be handled, the easiest being to generate several training/testing sets that control for leakage and then report the average performance. There are likely more elegant ways that this can be done by leveraging the implicit structure of the ontologies we have enhanced, for example, rolling disease identifiers up to their parent concepts, something that could be easily done with any of the PheKnowLator builds. This idea has not yet been formally tested, though. Does that help?
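To make the roll-up idea concrete, here is a minimal, hypothetical sketch: every disease node is collapsed to a mapped parent concept before edges are split into train/test, so near-duplicate subtype nodes cannot leak across splits. The `parent_map` and all identifiers below are invented for illustration; in practice the map would be derived from the ontology hierarchy (e.g., MONDO subClassOf axioms).

```python
# Hypothetical sketch of the "roll up" idea: collapse disease subtypes to a
# parent concept before splitting edges into train/test sets. The parent_map
# and identifiers are invented; a real map would come from the ontology
# hierarchy (e.g., MONDO subClassOf axioms).

def roll_up(edges, parent_map):
    """Replace each disease node with its mapped parent and deduplicate edges."""
    rolled = {(gene, parent_map.get(disease, disease)) for gene, disease in edges}
    return sorted(rolled)

edges = [
    ('GENE:1', 'MONDO:subtype_a'),  # two subtypes of the same disease
    ('GENE:1', 'MONDO:subtype_b'),
    ('GENE:2', 'MONDO:other'),
]
parent_map = {'MONDO:subtype_a': 'MONDO:parent',
              'MONDO:subtype_b': 'MONDO:parent'}

print(roll_up(edges, parent_map))
```

After the roll-up, the two subtype edges collapse into a single parent-level edge, which is exactly what prevents one subtype landing in training and its sibling in validation.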
-
Question 2:
Great question. PheKnowLator definitely takes advantage of sources of evidence (those that may be useful for generating edge weights) when provided. Currently, these types of evidence are used when generating an edge set. At this time, we are not also adding the evidence value explicitly to the knowledge graph, but we absolutely plan to do that in future builds (see issue #99 for one way we may make this type of information available). All is not lost! In the meantime, you can add this information post hoc. Information on when and how filtering/evidence criteria are used is described in two places:
From the snippet above, you get information for the
For both types, information will be presented in the following pattern:
Part Two of your question sent this morning:
Great question. The number of human genes (depending on the source you look at, they all differ slightly on the grand total) is estimated to be 80,000-100,000. This number is further broken down by gene type. The number you reference sounds closer to the number of protein-coding genes. Isn't biology fun?! 😄 The primary source we use for gene identifiers is the National Center for Biotechnology Information (NCBI) Gene. As of today, there are
Note: The current counts of each edge type can be found within each build's directory on GCS in the log document titled
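For intuition on why a unique-identifier count (like your ~39k) can far exceed the ~20k protein-coding genes, here is a toy illustration. The rows below are invented, not real NCBI data; a real tally would parse the `type_of_gene` column of NCBI's `gene_info` file.

```python
# Toy illustration only (invented rows): counting unique gene IDs by gene
# type, in the spirit of NCBI's gene_info file, shows how the grand total
# of identifiers exceeds the protein-coding subset alone.
from collections import Counter

genes = [  # (GeneID, type_of_gene) -- hypothetical values
    ('100', 'protein-coding'), ('101', 'protein-coding'),
    ('102', 'ncRNA'), ('103', 'pseudo'), ('104', 'ncRNA'),
]
by_type = Counter(gene_type for _, gene_type in genes)
total_unique = len({gene_id for gene_id, _ in genes})

print(total_unique, dict(by_type))
```

The point of the sketch: `total_unique` counts all identifiers (protein-coding plus ncRNA, pseudogenes, etc.), so it is always at least as large as any single gene-type bucket.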
-
Question 3:
Great question. The edge type information is currently not formally added to the knowledge graph builds, but it is accessible via a hash map that is output for each build, which is saved in a file called
An example of the dictionary content:

```python
master_edges = {'chemical-disease':
                    {'source_labels'    : ';MESH_;',
                     'data_type'        : 'class-class',
                     'edge_relation'    : 'RO_0002606',
                     'uri'              : ('http://purl.obolibrary.org/obo/',
                                           'http://purl.obolibrary.org/obo/'),
                     'delimiter'        : '#',
                     'column_idx'       : '1;4',
                     'identifier_maps'  : '0:./MESH_CHEBI_MAP.txt;1:disease-dbxref-map',
                     'evidence_criteria': "5;!=;' ",
                     'filter_criteria'  : 'None',
                     'edge_list'        : [['CHEBI_1234', 'MONDO_1234'],
                                           ['CHEBI_5678', 'MONDO_1234'], ...]}}
```

Example for using this:
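As one possible way to work with this dictionary, here is an illustrative sketch (not the PheKnowLator API). The dictionary literal is an abbreviated copy of the snippet above, since the output file's name and serialization format are not shown here; the `'column;operator;value'` reading of `evidence_criteria` is my assumption based on the pattern in that snippet.

```python
# Illustrative sketch built on an abbreviated copy of the dictionary above;
# in practice you would first load the build's output file (name and
# serialization format omitted here).
master_edges = {
    'chemical-disease': {
        'edge_relation'    : 'RO_0002606',
        'evidence_criteria': '5;!=;',  # assumed 'column;operator;value' pattern
        'edge_list'        : [['CHEBI_1234', 'MONDO_1234'],
                              ['CHEBI_5678', 'MONDO_1234']],
    }
}

def triples(edge_type, meta):
    """Attach the edge type's relation to each subject/object pair."""
    info = meta[edge_type]
    return [(s, info['edge_relation'], o) for s, o in info['edge_list']]

def passes_evidence(row, criterion):
    """Apply an assumed 'column;operator;value' evidence criterion to one raw row."""
    col, op, value = criterion.split(';')
    cell = row[int(col)]
    return cell != value if op == '!=' else cell == value

print(triples('chemical-disease', master_edges))
```

`triples` turns each pair into a subject-relation-object triple, and `passes_evidence` shows how you could re-apply an evidence criterion post hoc to the raw source rows.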
I recognize that this information is not as easy to access as it could be and plan on creating another file for each build that provides edge-type-specific information. I apologize that I don't yet have that created. Hopefully, the dictionary will help you in the meantime. I have made a new issue (issue #99) to document this.
-
Hello, my name is Felipe. Before I ask my questions, I wanted to say that I am very impressed by this repository. Awesome work!
A colleague and I are developing a Graph Neural Network to predict GENE-DISEASE-DRUG edges. We are currently working with the OpenBioLink dataset. I read the PheKnowLator docs, but couldn't "absorb" everything yet. I have three main questions:
1. In the dataset I am currently working with, many nodes are scattered, like schizophrenia, which has 20 subtypes. This not only scatters the data, but also leaks data into the validation set. Does PheKnowLator do some sort of aggregation on its own?
2. Usually edge associations have some sort of quality score. For DISEASE-GENE associations, we were using DisGeNet. It turns out that the scores didn't correlate with what we expected. For example, the gene CGRP has approved drugs for migraine and over 200 citations, but the association score is only 0.4. Situations like that are very frequent. Does PheKnowLator provide the scores, and does it use them in some way? I checked that your version 2.0 has 12,735 disease-gene triples. However, if I sample all associations from DisGeNet and consider as positive all those with >15 citations, I get 27,000 triples. It would be nice to learn more about your criteria.
3. Apparently the dataset doesn't contain edge subtypes. For example, DRKG and OpenBioLink have drug-gene edge types such as drug-activates-gene, drug-inhibits-gene, and drug-binds-gene. The DisGeNet source has edge subtypes such as gene-biomarker-disease and gene-drugtarget-disease. Is this type of information saved in the graph creation process?
Thanks in advance! And congrats again on building something so well organized.