
Dependencies

Tiffany J. Callahan edited this page Nov 2, 2023 · 75 revisions

Project Dependencies

Successfully running the code in this repository requires preparing the following items:

  1. Resource Information
  2. Ontology Data
  3. Edge Data
  4. Construction Approach
  5. Mapping and Filtering Data
  6. Relations Data
  7. Metadata



Programmatic Assistance
Users who would like assistance with assembling the required input documents should run the generates_dependency_documents.py script from the command line:

python3 generates_dependency_documents.py




Resource Information


The figure below provides an overview of how the resources/resource_info.txt, resources/ontology_source_list.txt, and resources/edge_source_list.txt data sources are connected as well as how they work together.


GitHub Repository Location: resources/resource_info.txt

Purpose: This file is used as the master organizer for all project resources.

File Format: The program expects the information stored as a "|" delimited file:

  • edge_type: A string label for an edge (node1-node2; ex: 'gene-disease'). The label matches what is used in the edge_source_list.txt and ontology_source_list.txt files.
  • prefixes: A ";"-separated string where the first item is the final prefix for the subject node and the second is the final prefix for the object node. All prefixes should be the preferred prefix from the BioRegistry.
  • relation: An OBO Foundry ontology CURIE (e.g., RO_0000056).
  • delimiter: A character used to split input text rows into columns (e.g., t for tab-delimited data or , for comma-delimited data).
  • column_indexes: Two column indices separated by ; (e.g., 0;4 for the first and fifth columns in the input data source).
  • identifier_maps: A string of mapping information for each node in an edge. For example, the string "2:mapping_file_1.txt;4:mapping_file_2.txt" means that the first node requires data contained in the 2nd column of the mapping_file_1.txt and the second node requires data from the 4th column in the mapping_file_2.txt file.
  • evidence_criteria: Evidence criteria that can be used to filter an input data source (e.g., scores above a certain cut-off). An evidence set is composed of 3 pieces of ";"-separated information. Multiple evidence sets can be passed, where each set is separated by ::. Consider the following example: "4;!=;IEA::8;<;0.0001":
    1. The index of the column to apply the evidence criteria to (e.g., "4" and "8" in the example above).
    2. The operator (i.e., ==, !=, <, >, <=, >=, in, .startswith(), .endswith()) to use when filtering (e.g., != and < in the example above).
    3. The value (i.e., int, float, str, list) to filter on (e.g., "IEA" and "0.0001" in the example above)
  • filter_criteria: Filtering criteria that can be used to filter an input data source (e.g., human proteins). A filtering set is composed of 3 pieces of ";"-separated information. Multiple filtering sets can be passed, where each set is separated by ::. Consider the following example: "5;==;P::7;==;9606":
    1. The index of the column to apply the filtering criteria to (e.g., "5" and "7" in the example above).
    2. The operator (i.e., ==, !=, <, >, <=, >=, in, .startswith(), .endswith()) to use when filtering (e.g., == and == in the example above).
    3. The value (i.e., int, float, str, list) to filter on (e.g., "P" and "9606" in the example above).
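The criteria strings described above can be unpacked with a few lines of Python. The sketch below is only an illustration, not the project's implementation: the function names parse_criteria and row_passes are hypothetical, only the six comparison operators are handled, and it assumes that a row must satisfy every set to be kept.

```python
import operator

# Comparison operators from the list above, mapped to Python callables.
# The 'in', '.startswith()', and '.endswith()' operators are left out of this sketch.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}


def parse_criteria(criteria):
    """Split 'index;operator;value::index;operator;value' into tuples."""
    if criteria is None or criteria.strip() in ("", "None"):
        return []
    parsed = []
    for chunk in criteria.split("::"):
        index, op, value = chunk.split(";", 2)
        parsed.append((int(index), op, value))
    return parsed


def row_passes(row, criteria_sets):
    """Return True when the row satisfies every set (assumption: sets are ANDed)."""
    for index, op, value in criteria_sets:
        cell = row[index]
        try:
            # Compare numerically when both sides look like numbers.
            left, right = float(cell), float(value)
        except ValueError:
            left, right = str(cell), value
        if not OPERATORS[op](left, right):
            return False
    return True


criteria_sets = parse_criteria("4;!=;IEA::8;<;0.0001")
kept_row = ["a", "b", "c", "d", "EXP", "f", "g", "h", "0.00001"]
dropped_row = ["a", "b", "c", "d", "IEA", "f", "g", "h", "0.00001"]
print(row_passes(kept_row, criteria_sets), row_passes(dropped_row, criteria_sets))
```

The same helper works for filter_criteria strings such as "5;==;P::7;==;9606", since both fields share the index;operator;value layout.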

NOTE. You can also pass dedup as a Filtering Criteria (e.g. 2-0;dedup;desc):

  • The column index should be col1-col2:
    • col1 is the column you want to filter on
    • col2 is the primary identifier to deduplicate
  • The value should be asc or desc to indicate the direction to sort the pandas.DataFrame prior to deduplicating

TABLE: An example resource_info.txt file is provided in the table below.

edge_type | prefixes | relation | delimiter | column_indexes | identifier_maps | evidence_criteria | filter_criteria
chemical-gene | CHEBI;NCBIGene | RO_0002434 | t | 1;4 | 0:./resources/data_maps/MESH_CHEBI_MAP.txt | None | 7;==;9606
gene-gene | NCBIGene;NCBIGene | RO_0002434 | t | 0;1 | 0:./resources/data_maps/STRING_ENTREZ_MAP.txt;1:./resources/data_maps/STRING_ENTREZ_MAP.txt | 2;>=;700 | None
gene-gobp | NCBIGene;GO | BFO_0000056 | t | 1;4 | 0:./resources/edge_data/gene-go_goa_class_data.txt | 8;==;P | 12;==;taxon:9606
pathway-disease | reactome;MONDO | RO_0003302 | t | 1;0 | 1:disease-dbxref-map | None | 1;.startswith('R-HSA-');
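To make the "|" delimited layout concrete, here is a minimal sketch that turns one line of resource_info.txt into a dictionary. It assumes the eight fields appear in the order listed in the File Format description and that the file has no header row; parse_resource_line is a hypothetical helper, not a function from the repository.

```python
# Field order taken from the File Format list (assumed to match the file).
FIELDS = [
    "edge_type", "prefixes", "relation", "delimiter",
    "column_indexes", "identifier_maps", "evidence_criteria", "filter_criteria",
]


def parse_resource_line(line):
    """Split one '|' delimited line into a dictionary keyed by field name."""
    values = [value.strip() for value in line.rstrip("\n").split("|")]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, found %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    # Subject and object prefixes are ';' separated.
    record["prefixes"] = record["prefixes"].split(";")
    # Column indices are ';' separated integers.
    record["column_indexes"] = [int(i) for i in record["column_indexes"].split(";")]
    return record


line = ("gene-gene|NCBIGene;NCBIGene|RO_0002434|t|0;1|"
        "0:./resources/data_maps/STRING_ENTREZ_MAP.txt;"
        "1:./resources/data_maps/STRING_ENTREZ_MAP.txt|2;>=;700|None")
record = parse_resource_line(line)
print(record["edge_type"], record["column_indexes"], record["evidence_criteria"])
```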




Ontology Data


GitHub Repository Location: resources/ontology_source_list.txt

Purpose: This file is used to identify and download specific ontologies.

File Format: The program expects this information to be stored as a "," delimited file.


TABLE: An example ontology_source_list.txt file is provided in the table below.

Ontology URL
disease http://purl.obolibrary.org/obo/doid.owl
go http://purl.obolibrary.org/obo/go.owl
chemical ftp://ftp.ebi.ac.uk/pub/databases/chebi/ontology/chebi_lite.owl
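A "," delimited source list like the one above can be read into a dictionary keyed by label. The sketch below is an assumption-laden illustration: read_source_list is a hypothetical name, the file is assumed to have no header row, and each line is split on the first comma only so that a URL containing commas stays intact.

```python
import io


def read_source_list(handle):
    """Read 'label, url' lines into a {label: url} dictionary."""
    sources = {}
    for line in handle:
        line = line.strip()
        if not line or "," not in line:
            continue  # skip blank or malformed lines
        label, url = line.split(",", 1)
        sources[label.strip()] = url.strip()
    return sources


example = io.StringIO(
    "disease, http://purl.obolibrary.org/obo/doid.owl\n"
    "go, http://purl.obolibrary.org/obo/go.owl\n"
)
print(read_source_list(example))
```

The same helper applies to any two-column "," delimited list of labels and URLs.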




Edge Data


GitHub Repository Location: resources/edge_source_list.txt

Purpose: This file is used to identify and download specific publicly available data sources that will be used to derive edges between ontology classes and instances of ontology classes.

File Format: The program expects this information to be stored as a "," delimited file.


TABLE: An example edge_source_list.txt file is provided in the table below.

EdgeType URL
chemical-gene http://ctdbase.org/reports/CTD_chem_gene_ixns.tsv.gz
gene-gobp http://geneontology.org/gene-associations/goa_human.gaf.gz
gene-disease https://www.disgenet.org/static/disgenet_ap1/files/downloads/curated_gene_disease_associations.tsv.gz
gene-gene https://stringdb-static.org/download/protein.links.v11.0/9606.protein.links.v11.0.txt.gz




Construction Approach


Wiki: KG-Construction
GitHub Repository Location: resources/construction_approach

Purpose: New data can be added to the knowledge graph using 2 different construction approaches: (1) instance-based or (2) subclass-based. Each of these approaches is described further below. For additional information, please see resources/construction_approach/README.md.


🛑 CONSTRAINTS 🛑
The algorithm makes the following assumption:

  • The non-ontology node data to ontology class mapping dictionary has been created and added to the ./resources/construction_approach/ directory.

Construction Approach: Instance-Based


In this approach, each new edge is added as an instance of an existing class (via rdf:type) in the knowledge graph.

EXAMPLE: Adding the edge: Morphine ➞ isSubstanceThatTreats ➞ Migraine

Would require adding:

  • isSubstanceThatTreats(Morphine, x1)
  • Type(x1, Migraine)

In this example, Morphine is a non-ontology data node and Migraine is an HPO ontology term.
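The two statements in this example can be sketched as plain triples. This is a simplified illustration under stated assumptions, not the project's construction code: instance_based_edge is a hypothetical helper, the example.org IRIs are placeholders, and the anonymous node x1 is given a UUID-based IRI under the obo/ext namespace that appears in the project's output files.

```python
import uuid

EXT_NAMESPACE = "https://github.com/callahantiff/PheKnowLator/obo/ext/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def instance_based_edge(subject, relation, object_class):
    """Return the triples for one instance-based edge and the new instance IRI."""
    # x1: an anonymous instance of the ontology class, named with a fresh UUID.
    instance = EXT_NAMESPACE + str(uuid.uuid4())
    triples = [
        (subject, relation, instance),       # relation(subject, x1)
        (instance, RDF_TYPE, object_class),  # Type(x1, object_class)
    ]
    return triples, instance


# Placeholder IRIs (hypothetical) standing in for the entities in the example.
triples, x1 = instance_based_edge(
    "https://example.org/Morphine",
    "https://example.org/isSubstanceThatTreats",
    "https://example.org/Migraine",
)
for triple in triples:
    print(triple)
```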

Outputs: As mentioned above, a universally unique identifier (UUID) is created for each anonymous node representing an instance of a class. In order to fully utilize the knowledge graph, a .json file containing the mapping from each UUID instance to its ontology class is output to the ./resources/construction_approach/instance directory. For example,

{
"http://purl.obolibrary.org/obo/CHEBI_24505": "https://github.com/callahantiff/PheKnowLator/obo/ext/c2591241-8952-44ea-a313-e4b3c5fb6d35",
"http://purl.obolibrary.org/obo/PR_000013648": "https://github.com/callahantiff/PheKnowLator/obo/ext/0ea74deb-0002-4f48-b7e4-81a8fd947312",
"http://purl.obolibrary.org/obo/GO_0050031": "https://github.com/callahantiff/PheKnowLator/obo/ext/8f5c81d4-92dd-426e-a2d9-2be87edb1520"
}
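In the example above, the keys are ontology class IRIs and the values are UUID instance IRIs. To look up the class for a given instance, the mapping can be inverted after loading. The sketch below assumes each class maps to exactly one instance, as in the example; load_instance_lookup is a hypothetical helper name.

```python
import json


def load_instance_lookup(json_text):
    """Invert a {class IRI: instance IRI} mapping into {instance IRI: class IRI}."""
    class_to_instance = json.loads(json_text)
    return {instance: cls for cls, instance in class_to_instance.items()}


json_text = """{
  "http://purl.obolibrary.org/obo/CHEBI_24505":
    "https://github.com/callahantiff/PheKnowLator/obo/ext/c2591241-8952-44ea-a313-e4b3c5fb6d35"
}"""
lookup = load_instance_lookup(json_text)
print(lookup)
```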

Construction Approach: Subclass-Based


In this approach, each new edge is added as a subclass of an existing ontology class (via rdfs:subClassOf) in the knowledge graph.

EXAMPLE: Adding the edge: TGFB1 ➞ participatesIn ➞ Influenza Virus Induced Apoptosis

Would require adding:

  • participatesIn(TGFB1, Influenza Virus Induced Apoptosis)
  • subClassOf(Influenza Virus Induced Apoptosis, Influenza A pathway)
  • Type(Influenza Virus Induced Apoptosis, owl:Class)

Where TGFB1 is a PR ontology term and Influenza Virus Induced Apoptosis is a non-ontology data node. In this example, Influenza A pathway is an existing ontology class.
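The three statements in this example can likewise be written as plain triples. This is a hedged sketch rather than the project's code: subclass_based_edge is a hypothetical helper and the example.org IRIs are placeholders for the entities named above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"


def subclass_based_edge(subject, relation, new_node, parent_classes):
    """Return the triples for one subclass-based edge."""
    triples = [(subject, relation, new_node)]  # relation(subject, new_node)
    for parent in parent_classes:
        # subClassOf(new_node, parent) for each mapped ontology class
        triples.append((new_node, RDFS_SUBCLASS_OF, parent))
    triples.append((new_node, RDF_TYPE, OWL_CLASS))  # Type(new_node, owl:Class)
    return triples


# Placeholder IRIs (hypothetical) standing in for the entities in the example.
triples = subclass_based_edge(
    "https://example.org/TGFB1",
    "https://example.org/participatesIn",
    "https://example.org/InfluenzaVirusInducedApoptosis",
    ["https://example.org/InfluenzaAPathway"],
)
for triple in triples:
    print(triple)
```

Unlike the instance-based approach, no UUID node is created, which is consistent with there being no approach-specific output files.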

Outputs: There are no approach-specific output files generated.


Input Requirements for both Approaches: Both approaches require a pickled dictionary, added to the ./resources/construction_approach/ directory, whose keys are node identifiers (non-ontology node data) and whose values are lists of ontology class identifiers to subclass. An example of this dictionary is shown below:

{
  'R-HSA-168277'  : ['http://purl.obolibrary.org/obo/PW_0001054',
                     'http://purl.obolibrary.org/obo/GO_0046730'],
  'R-HSA-9026286' : ['http://purl.obolibrary.org/obo/PW_000000001',
                     'http://purl.obolibrary.org/obo/GO_0019372'],
  '100129357'     : ['SO_0000043'],
  '100129358'     : ['SO_0000336'],
}                  

Please see the Reactome Pathways - Pathway Ontology and Genomic Identifiers - Sequence Ontology sections of the Data_Preparation.ipynb Jupyter Notebook for examples of how to construct this document.
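For orientation, writing and reading such a dictionary with Python's pickle module might look like the sketch below. The file name subclass_map_example.pkl and the helper names are hypothetical, and a temporary directory is used here in place of ./resources/construction_approach/.

```python
import os
import pickle
import tempfile

# A small subset of the example dictionary shown above.
subclass_map = {
    "R-HSA-168277": ["http://purl.obolibrary.org/obo/PW_0001054",
                     "http://purl.obolibrary.org/obo/GO_0046730"],
    "100129357": ["SO_0000043"],
}


def write_subclass_map(mapping, path):
    """Pickle the node-to-ontology-class dictionary to disk."""
    with open(path, "wb") as handle:
        pickle.dump(mapping, handle, protocol=pickle.HIGHEST_PROTOCOL)


def read_subclass_map(path):
    """Load the pickled dictionary back into memory."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "subclass_map_example.pkl")  # hypothetical file name
    write_subclass_map(subclass_map, path)
    restored = read_subclass_map(path)

print(restored == subclass_map)
```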




Mapping and Filtering Data


Wiki: v2-Data-Sources

Purpose: Several other files are needed to create data used for the filtering and mapping during the creation of knowledge graph edges. For more details on what these data sources are and how they are created, please see the Data_Preparation.ipynb Jupyter Notebook.



Relations Data


GitHub Repository Location: resources/relations_data

Purpose: PheKnowLator can be built using a single set of provided relations (i.e. the owl:ObjectProperty or edge which is used to connect the nodes in the graph) with or without the inclusion of each relation's inverse.



🛑 CONSTRAINTS 🛑
If you would like the knowledge graph to include relations and their inverse relations, you must add the following to the ./resources/relations_data directory (an example of what should be included in each of these is included below):

  • A .txt file of all relations and their labels
  • A .txt file of the relations and their inverse relations

Filename: INVERSE_RELATIONS.txt

The owl:inverseOf property is used to identify each relation's inverse. To make it easier to look up the inverse relations when building the knowledge graph, each relation/inverse relation pair is listed twice, once in each direction.

The data in this file should look like:

  RO_0003000  RO_0003001
  RO_0003001  RO_0003000
  RO_0002233  RO_0002352
  RO_0002352  RO_0002233
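Because each pair is listed in both directions, a single dictionary is enough to find the inverse of any relation. The sketch below assumes whitespace-separated columns, as in the listing above; load_inverse_relations is a hypothetical helper name.

```python
def load_inverse_relations(lines):
    """Build a {relation: inverse relation} lookup from two-column lines."""
    inverses = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            relation, inverse = parts
            inverses[relation] = inverse
    return inverses


lines = [
    "RO_0003000  RO_0003001",
    "RO_0003001  RO_0003000",
    "RO_0002233  RO_0002352",
    "RO_0002352  RO_0002233",
]
inverses = load_inverse_relations(lines)
print(inverses["RO_0003000"], inverses["RO_0003001"])
```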

Filename: RELATIONS_LABELS.txt

Not all relations have an inverse (e.g. interactions). Even though an inverse relation might not exist, we still want to ensure that all interaction relations are symmetrically represented in the graph. To aid in this process, we need to be able to quickly look up an edge and determine if it is an interaction. To make this process more efficient, the algorithm expects a list of all relations and their labels in a .txt file.

The data in this file should look like:

  RO_0002285  developmentally replaces
  RO_0002287  part of developmental precursor of
  RO_0002490  existence overlaps
  RO_0002214  has prototype
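A label lookup built from this file could be used to flag interaction relations. The sketch below is illustrative only: it assumes the identifier and label are separated by whitespace, and it treats any label containing "interact" as an interaction, which is an assumption about how interactions are recognized rather than the project's documented rule. Both helper names are hypothetical.

```python
def load_relation_labels(lines):
    """Build a {relation: label} lookup; labels may contain spaces."""
    labels = {}
    for line in lines:
        # Split on the first run of whitespace only, keeping the label intact.
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            labels[parts[0]] = parts[1]
    return labels


def is_interaction(relation, labels):
    """Assumption: interaction relations have 'interact' in their label."""
    return "interact" in labels.get(relation, "").lower()


lines = [
    "RO_0002285  developmentally replaces",
    "RO_0002214  has prototype",
    "RO_0002434  interacts with",
]
labels = load_relation_labels(lines)
print(is_interaction("RO_0002434", labels), is_interaction("RO_0002214", labels))
```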

Please see the Data_Preparation.ipynb Jupyter Notebook for code on how to create these files.




Metadata


GitHub Repository Location: resources/metadata

A variety of metadata are pulled from the data sources that are used to support external edges added to enhance the core set of ontologies. For the monthly PheKnowLator builds, please see the pheknowlator_source_metadata.xlsx spreadsheet. This spreadsheet has two tabs, one for nodes and one for edges. For each entity (i.e., node or relation), there are several columns, including descriptions of the metadata, the variable type, and examples of values for each type of metadata.

Example Metadata Dictionary Output. The code snippet below is meant to provide a snapshot of how data are organized in the metadata dictionary. As demonstrated by this example, there are three high-level keys:

  • nodes: Nodes are keyed by CURIE. Every node has a Label, Description, Synonym, and Dbxref (whenever possible). Metadata that are obtained from specific sources that are not ontologies are added as a nested dictionary keyed by the filename.
  • edges: Edges are keyed by a label that represents the edge type (the same label that is used in resource_info.txt and edge_source_list.txt files). Metadata that are obtained from specific sources that are not ontologies are added as a nested dictionary keyed by the filename.
  • relations: Relations or owl:ObjectProperty objects are keyed by CURIE. Similar to nodes, every relation has a Label, Description, and Synonym (whenever possible). Metadata that are obtained from specific sources that are not ontologies are added as a nested dictionary keyed by the filename.
{
    'nodes': {
        'NCBIGene_2052': {
            'Label': 'EPHX1',
            'Description': "EPHX1 has locus group 'protein-coding' and is located on chromosome 1 (1q42.12).",
            'Synonym': 'epoxide hydrolase 1, microsomal (xenobiotic)|epoxide hydratase|EPHX|HYL1|MEHepoxide hydrolase 1|epoxide hydrolase 1 microsomal|EPOX',
            'Dbxref': 'MIM:132810|HGNC:HGNC:3401|Ensembl:ENSG00000143819', ... },
        'CHEBI_4592': {
            'Label': 'Dihydroxycarbazepine',
            'Description': "None",
            'Synonym': '10,11-Dihydro-10,11-dihydroxy-5H-dibenzazepine-5-carboxamide|10,11-Dihydroxycarbamazepine',
            'Dbxref': 'CAS:35079-97-1|KEGG:C07495',
            'CTD_chem_gene_ixns.tsv.gz': {  
                'CTD_ChemicalID': {'MESH:C004822'},
                'CTD_CasRN': {'35079-97-1'},
                'CTD_ChemicalName': {'10,11-dihydro-10,11-dihydroxy-5H-dibenzazepine-5-carboxamide'}}, ... }, ... },
    'edges': {
        'chemical-gene': {
            'CHEBI_4592-NCBIGene_2052': {
                'CTD_chem_gene_ixns.tsv': {
                    'CTD_Evidence': [{'CTD_Interaction': '[EPHX1 gene SNP affects the metabolism of carbamazepine epoxide] which affects the chemical synthesis of 10,11-dihydro-10,11-dihydroxy-5H-dibenzazepine-5-carboxamide',
                     'CTD_InteractionActions': 'affects^chemical synthesis|affects^metabolic processing',
                     'CTD_PubMedIDs': '15692831'}]}, ... }, ... }, ... },
    'relations': {
        'RO_0002434': {
            'Label': 'interacts with',
            'Description': 'A relationship that holds between two entities in which the processes executed by the two entities are causally connected.',
            'Synonym': 'in pairwise interaction with'}, ... }
}
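Given a dictionary organized this way, individual entries can be reached with ordinary nested lookups. The sketch below uses a trimmed copy of the example above; node_label and edge_sources are hypothetical helper names, and missing keys are assumed to mean that no metadata is available.

```python
def node_label(metadata, curie):
    """Return the Label for a node CURIE, or None when it is absent."""
    return metadata.get("nodes", {}).get(curie, {}).get("Label")


def edge_sources(metadata, edge_type, edge_key):
    """Return the source filenames that contributed metadata for one edge."""
    edge = metadata.get("edges", {}).get(edge_type, {}).get(edge_key, {})
    return sorted(edge)


# Trimmed copy of the example metadata dictionary.
metadata = {
    "nodes": {"NCBIGene_2052": {"Label": "EPHX1"}},
    "edges": {
        "chemical-gene": {
            "CHEBI_4592-NCBIGene_2052": {
                "CTD_chem_gene_ixns.tsv": {
                    "CTD_Evidence": [{"CTD_PubMedIDs": "15692831"}],
                },
            },
        },
    },
    "relations": {"RO_0002434": {"Label": "interacts with"}},
}

print(node_label(metadata, "NCBIGene_2052"))
print(edge_sources(metadata, "chemical-gene", "CHEBI_4592-NCBIGene_2052"))
```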

Purpose: The knowledge graph can be built with or without the inclusion of node and relation metadata (i.e. labels, descriptions or definitions, and synonyms). To create and use node metadata, please open the Data_Preparation.ipynb Jupyter Notebook and run the code chunks listed under the NODE AND RELATION METADATA section before the knowledge graph is constructed. For more details on what these data sources are and how they are created, please see the metadata README.md.


🛑 CONSTRAINTS 🛑
The algorithm makes the following assumptions:

  • If metadata is provided, only those edges with nodes that have metadata will be created; valid edges without metadata will be discarded.
  • Metadata will be divided into nodes, relations, and edges. For nodes and relations, entities will be keyed by CURIE. For edges, entities will be keyed by their edge type (i.e., the same label that is used in resource_info.txt and edge_source_list.txt files).
  • For each node and relation identifier, we try to obtain at least the following metadata: Label, Description, and Synonym.
  • Metadata that are obtained from specific sources that are not ontologies will be added as a nested dictionary that is keyed by the filename.


