new to PheKnowLator #98
Replies: 5 comments 11 replies
-
@fmellomascarenhas - Thank you so much for your interest in PheKnowLator! I really appreciate your feedback and initial reactions to the repo. I'd also love to learn more about your use case; it sounds really interesting! 🙂 These are really great questions. I'm going to organize some information this evening to help me answer your questions and will be in touch tomorrow morning.
-
Hi, I was able to find the description of the data in the documentation (again, it's amazing to see how organized it is). Could you please explain the DisGeNet criteria when you have some time? I was also analyzing the data and found ~39k unique gene IDs, while the human genome (as far as I understand) has around 20k. I don't have a biology background, so could you give me some intuition about this number? Thanks!
-
Question 1:
Training leakage is really challenging, and I suspect it can get really precarious depending on the types of models you are interested in training and your training protocol (i.e., open versus closed world assumption, whether or not you will include inverse relations, sampling strategy, and loss function), all of which have been shown to matter in recent ablation studies (Ali et al., 2020; Ruffinelli et al., 2020). PheKnowLator, as a knowledge graph construction ecosystem, does not initially apply any control/solution for this. There are a few reasons for this, the biggest being that we are currently in the process of evaluating the different types of knowledge graphs that PheKnowLator can build, so we don't yet have empirical evidence to demonstrate which builds, if used for tasks like link prediction, may have leakage. This is something I am very interested in understanding further, though. There are definitely ways that this can be handled, the easiest being to generate several training/testing sets that control for leakage and then report the average performance. There are likely more elegant ways that this can be done by leveraging the implicit structure of the ontologies we have enhanced, for example, rolling disease identifiers up to their parent concepts, something that could be easily done with any of the PheKnowLator builds. This idea has not yet been formally tested, though. Does that help?
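To make the roll-up idea concrete, here is a minimal, hypothetical sketch: every disease node is collapsed to a mapped parent concept before edges are split into train/test, so near-duplicate subtype nodes cannot leak across splits. The `parent_map` and all identifiers below are invented for illustration; in practice the map would be derived from the ontology hierarchy (e.g., MONDO subClassOf axioms).

```python
# Hypothetical sketch of the "roll up" idea: collapse disease subtypes to a
# parent concept before splitting edges into train/test sets. The parent_map
# and identifiers are invented; a real map would come from the ontology
# hierarchy (e.g., MONDO subClassOf axioms).

def roll_up(edges, parent_map):
    """Replace each disease node with its mapped parent and deduplicate edges."""
    rolled = {(gene, parent_map.get(disease, disease)) for gene, disease in edges}
    return sorted(rolled)

edges = [
    ('GENE:1', 'MONDO:subtype_a'),  # two subtypes of the same disease
    ('GENE:1', 'MONDO:subtype_b'),
    ('GENE:2', 'MONDO:other'),
]
parent_map = {'MONDO:subtype_a': 'MONDO:parent',
              'MONDO:subtype_b': 'MONDO:parent'}

print(roll_up(edges, parent_map))
```

After the roll-up, the two subtype edges collapse into a single parent-level edge, which is exactly what prevents one subtype landing in training and its sibling in validation.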
-
Question 2:
Great question. PheKnowLator definitely takes advantage of sources of evidence (those that may be useful for generating edge weights) when provided. Currently, these types of evidence are used when generating an edge set. At this time, we are not also adding the evidence value explicitly to the knowledge graph, but we absolutely plan to do that in future builds (see issue #99 for one way we may make this type of information available). All is not lost! In the meantime, you can add this information post hoc. Information on when and how filtering/evidence criteria are used is described in two places:
From the snippet above, you get information for the
For both types, information will be presented in the following pattern:
Part Two of your question sent this morning:
Great question. The number of human genes (depending on the source you look at, they all differ slightly on the grand total) is estimated to be 80,000-100,000. This number is further broken down by gene type. The number you reference sounds closer to the number of protein-coding genes. Isn't biology fun?! 😄 The primary source we use for gene identifiers is the National Center for Biotechnology Information (NCBI) Gene. As of today, there are
Note: The current counts of each edge type can be found within each build's directory on GCS in the log document titled
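For intuition on why a unique-identifier count (like your ~39k) can far exceed the ~20k protein-coding genes, here is a toy illustration. The rows below are invented, not real NCBI data; a real tally would parse the `type_of_gene` column of NCBI's `gene_info` file.

```python
# Toy illustration only (invented rows): counting unique gene IDs by gene
# type, in the spirit of NCBI's gene_info file, shows how the grand total
# of identifiers exceeds the protein-coding subset alone.
from collections import Counter

genes = [  # (GeneID, type_of_gene) -- hypothetical values
    ('100', 'protein-coding'), ('101', 'protein-coding'),
    ('102', 'ncRNA'), ('103', 'pseudo'), ('104', 'ncRNA'),
]
by_type = Counter(gene_type for _, gene_type in genes)
total_unique = len({gene_id for gene_id, _ in genes})

print(total_unique, dict(by_type))
```

The point of the sketch: `total_unique` counts all identifiers (protein-coding plus ncRNA, pseudogenes, etc.), so it is always at least as large as any single gene-type bucket.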
-
Question 3:
Great question. The edge type information is currently not formally added to the knowledge graph builds, but it is accessible via a hash map that is output for each build, which is saved in a file called
An example of the dictionary content:

```python
master_edges = {'chemical-disease':
                    {'source_labels'    : ';MESH_;',
                     'data_type'        : 'class-class',
                     'edge_relation'    : 'RO_0002606',
                     'uri'              : ('http://purl.obolibrary.org/obo/',
                                           'http://purl.obolibrary.org/obo/'),
                     'delimiter'        : '#',
                     'column_idx'       : '1;4',
                     'identifier_maps'  : '0:./MESH_CHEBI_MAP.txt;1:disease-dbxref-map',
                     'evidence_criteria': "5;!=;' ",
                     'filter_criteria'  : 'None',
                     'edge_list'        : [['CHEBI_1234', 'MONDO_1234'],
                                           ['CHEBI_5678', 'MONDO_1234'], ...]}}
```

Example for using this:
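As one possible way to work with this dictionary, here is an illustrative sketch (not the PheKnowLator API). The dictionary literal is an abbreviated copy of the snippet above, since the output file's name and serialization format are not shown here; the `'column;operator;value'` reading of `evidence_criteria` is my assumption based on the pattern in that snippet.

```python
# Illustrative sketch built on an abbreviated copy of the dictionary above;
# in practice you would first load the build's output file (name and
# serialization format omitted here).
master_edges = {
    'chemical-disease': {
        'edge_relation'    : 'RO_0002606',
        'evidence_criteria': '5;!=;',  # assumed 'column;operator;value' pattern
        'edge_list'        : [['CHEBI_1234', 'MONDO_1234'],
                              ['CHEBI_5678', 'MONDO_1234']],
    }
}

def triples(edge_type, meta):
    """Attach the edge type's relation to each subject/object pair."""
    info = meta[edge_type]
    return [(s, info['edge_relation'], o) for s, o in info['edge_list']]

def passes_evidence(row, criterion):
    """Apply an assumed 'column;operator;value' evidence criterion to one raw row."""
    col, op, value = criterion.split(';')
    cell = row[int(col)]
    return cell != value if op == '!=' else cell == value

print(triples('chemical-disease', master_edges))
```

`triples` turns each pair into a subject-relation-object triple, and `passes_evidence` shows how you could re-apply an evidence criterion post hoc to the raw source rows.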
I recognize that this information is not as easy to access as it could be and plan on creating another file for each build that provides edge-type-specific information. I apologize that I don't yet have that created. Hopefully, the dictionary will help you in the meantime. I have made a new issue (issue #99) to document this.
-
Hello, my name is Felipe. Before I ask my questions, I wanted to say that I am very impressed by this repository. Awesome work!
A colleague and I are developing a Graph Neural Network to predict GENE-DISEASE-DRUG edges. We are currently working with the OpenBioLink dataset. I read the PheKnowLator docs, but couldn't "absorb" everything yet. I have three main questions:
1. In the dataset I am currently working with, many nodes are scattered, like schizophrenia, which has 20 subtypes. This not only scatters the data, but also leaks data into the validation set. Does PheKnowLator do some sort of aggregation on its own?
2. Usually edge associations have some sort of quality score. For DISEASE-GENE associations, we were using DisGeNet. It turns out that the scores didn't correlate with what we expected. For example, the gene CGRP has approved drugs for migraine and over 200 citations, but the association score is only 0.4. Situations like that are very frequent. Does PheKnowLator provide the scores, and does it use them in some way? I checked that your version 2.0 has 12,735 disease-gene triples. However, if I sample all associations from DisGeNet and consider as positive all those with >15 citations, I get 27,000 triples. It would be nice to learn more about your criteria.
3. Apparently the dataset doesn't contain edge subtypes. For example, DRKG and OpenBioLink have drug-gene edge types such as drug-activates-gene, drug-inhibits-gene, and drug-binds-gene. The DisGeNet source has edge subtypes such as gene-biomarker-disease and gene-drugtarget-disease. Is this type of information saved in the graph creation process?
Thanks in advance! And congrats again on building something so well organized.