# Metadata associated with Automat resources

In [1]:
## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import copy  ## for deepcopy

## MetaKG edge level

### Automat CORD19 Scigraph

API Endpoint (https://automat.renci.org/): /cord19-scigraph/{source_node_type}/{target_node_type}/{curie}

Notes
* biolink_predicate can be found in API Endpoint /cord19-scigraph/predicates or /cord19-scigraph/graph/summary
* traced_provenance / origin: should come from the /cord19-scigraph/about endpoint, but it's returning a blank template right now. **Want versioning from Ranking Agent.** 
* **numeric measures: want help from Ranking Agent.** These appear in the response, but I'm not sure about their range, direction, or what they actually measure.
* **context/relevance: want help from Ranking Agent.** The entities look like human genes/diseases, so human taxon? Not sure whether it should have a coronavirus infection context, since some of the relationships don't seem to be related to that.
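
A minimal sketch of filling in the endpoint pattern above. `automat_url` is a hypothetical helper of mine, not something Automat provides; it only builds the URL string and makes no request.

```python
from urllib.parse import quote

AUTOMAT_BASE = "https://automat.renci.org"  # base URL from the notes above


def automat_url(graph, source_node_type, target_node_type, curie):
    """Build /{graph}/{source_node_type}/{target_node_type}/{curie} (hypothetical helper)."""
    # Keep ":" unescaped so the result matches the URL form used in this notebook.
    parts = [AUTOMAT_BASE, graph, source_node_type, target_node_type, quote(curie, safe=":")]
    return "/".join(parts)


url = automat_url("cord19-scigraph", "disease", "gene", "MONDO:0016419")
print(url)
```

The same helper covers the other Automat graphs by changing the first argument (for example "cord19-scibite" or "hmdb").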

In [2]:
metakg_cord19scigraph = {      
    "biolink_predicate":"related_to",  ## from TRAPI predicate/edge_label field
    "ingested_predicate_label":"SEMMEDDB:ASSOCIATED_WITH",  ## seen in TRAPI "relation" field

    ## provenance
    "translator_group":["Ranking_Agent"],  ## assign this
    ## it's hard to tell what nodes may be conflated. Maybe gene/geneproduct?
    
    "traced_provenance":  
        [{"name":"Automat CORD19 Scigraph API",
         "type":"text_mined_database",  ## assign this
         "version":"v5"}],  ## made up, from previous /about/ endpoint, WE DON'T HAVE THIS
    "origin":  
        {"name":"CORD19",
         "type":"publications",    ## assign this
         "version":"2020-04-05",     ## made-up, WE DON'T HAVE THIS
         "method":"NLP_Scigraph"},        ## assign this

    ## measures                  
    "numeric_measures_present":True,      ## assign this  
    "numeric_measures":      ## assign this
        [{"name":"enrichment_p",    ## not sure what this is, or if range, direction are right...
          "standard_label":"enrichment_p",  ## being redundant to set this
          "range":"(0-1]", 
          "direction":{"more_specific":"smaller"}}, 
         {"name":"num_publications",  ## not sure what this is, or if range, direction are right... 
          "standard_label":"num_publications",  ## being redundant to set this
          "range":"[0-1]", 
          "direction":{"more_confident":"larger"}}
        ],
    "categorical_measures_present":False
}

In [3]:
metakg_cord19scigraph.keys()
metakg_cord19scigraph['origin']

dict_keys(['biolink_predicate', 'ingested_predicate_label', 'translator_group', 'traced_provenance', 'origin', 'numeric_measures_present', 'numeric_measures', 'categorical_measures_present'])

{'name': 'CORD19',
 'type': 'publications',
 'version': '2020-04-05',
 'method': 'NLP_Scigraph'}
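
Since I'm not sure the ranges are right, a quick check of observed values against the "range" strings could help. This is a sketch only: `in_range` is a hypothetical helper, and it assumes the strings follow interval notation ("(" or ")" excludes the bound, "[" or "]" includes it) with non-negative bounds.

```python
import re


def in_range(value, range_str):
    """Check a value against a range string such as "(0-1]" or "[0-1]"."""
    match = re.fullmatch(r"([\[(])\s*([\d.]+)\s*-\s*([\d.]+)\s*([\])])", range_str)
    if match is None:
        raise ValueError(f"unrecognized range string: {range_str!r}")
    left, low, high, right = match.groups()
    low, high = float(low), float(high)
    # "[" / "]" include the bound, "(" / ")" exclude it.
    above = value >= low if left == "[" else value > low
    below = value <= high if right == "]" else value < high
    return above and below


print(in_range(0.5, "(0-1]"))
print(in_range(0.0, "(0-1]"))
```

Any response value that fails this check would be a sign that the range I assigned is wrong.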

These are what I think the main ID prefixes are for the nodes inside the Automat CORD19 Scigraph API. 

In [None]:
scigraph_prefixes = {
    "ChemicalSubstance":"CHEBI",
    "Disease":"MONDO",
    "Gene":"NCBIGene",
    "PhenotypicFeature":"HP",
    "Cell":"CL",
    "CellularComponent":"GO",
    "BiologicalProcess":"GO",
    "MolecularActivity":"GO",
    "Food":"FOODON",
    "EnvironmentalFeature":"ENVO",
    "PopulationOfIndividualOrganisms":"HANCESTRO",
    "OrganismTaxon":"NCBITaxon",
    "OntologyClass":"NCBITaxon",
    "ExposureEvent":"ECTO",
    "LifeStage":"HsapDv"    
}

### Automat CORD19 Scibite

API Endpoint (https://automat.renci.org/): /cord19-scibite/{source_node_type}/{target_node_type}/{curie}

**I don't think the node count or biolink model hierarchy is correct...** (see graph/summary endpoint)

Notes: I basically made this the same as Automat CORD19 Scigraph, since the setup is almost identical.

* biolink_predicate can be found in API Endpoint /cord19-scibite/predicates or /cord19-scibite/graph/summary
* traced_provenance / origin: should come from the /cord19-scibite/about endpoint, but it's returning a blank template right now. **Want versioning from Ranking Agent.** 
* **numeric measures: want help from Ranking Agent.** These appear in the response, but I'm not sure about their range, direction, or what they actually measure.
* **context/relevance: want help from Ranking Agent.** The entities look like human genes/diseases, so human taxon? Not sure whether it should have a coronavirus infection context, since some of the relationships don't seem to be related to that.

In [4]:
metakg_cord19scibite = {      
    "biolink_predicate":"related_to",  ## from TRAPI predicate/edge_label field
    "ingested_predicate_label":"SEMMEDDB:ASSOCIATED_WITH",  ## seen in TRAPI "relation" field

    ## provenance
    "translator_group":["Ranking_Agent"],  ## assign this
    ## it's hard to tell what nodes may be conflated. 
    
    "traced_provenance":  
        [{"name":"Automat CORD19 Scibite API",
         "type":"text_mined_database",  ## assign this
         "version":"v5"}],  ## made up, from previous /about/ endpoint, WE DON'T HAVE THIS
    "origin":  
        {"name":"CORD19",
         "type":"publications",    ## assign this
         "version":"2020-04-05",     ## made-up, WE DON'T HAVE THIS
         "method":"NLP_Scibite"},        ## assign this

    ## measures                  
    "numeric_measures_present":True,      ## assign this  
    "numeric_measures":      ## assign this
        [{"name":"enrichment_p",    ## not sure what this is, or if range, direction are right...
          "standard_label":"enrichment_p",  ## being redundant to set this
          "range":"(0-1]", 
          "direction":{"more_specific":"smaller"}}, 
         {"name":"num_publications",  ## not sure what this is, or if range, direction are right... 
          "standard_label":"num_publications",  ## being redundant to set this
          "range":"[0-1]", 
          "direction":{"more_confident":"larger"}}
        ],
    "categorical_measures_present":False
}

In [5]:
metakg_cord19scibite['traced_provenance']

[{'name': 'Automat CORD19 Scibite API',
  'type': 'text_mined_database',
  'version': 'v5'}]

These are what I think the main ID prefixes are for the nodes inside the Automat CORD19 Scibite API. 

In [None]:
scibite_prefixes = {
    "ChemicalSubstance":"CHEBI",
    "Disease":"MONDO",
    "Gene":"NCBIGene",
    "PhenotypicFeature":("HP", "UMLS"),
    "CellularComponent":"GO",
    "BiologicalProcess":"GO",
    "MolecularActivity":"GO",    
    "OrganismTaxon":"NCBITaxon",  ## I don't think the node count is correct (see graph/summary endpoint)
    "OntologyClass":("NCBITaxon", "MESH"),
    "LifeStage":"IDO",    
    "GeneProduct":"UniProtKB"
}

### Automat HMDB

API Endpoint (https://automat.renci.org/): /hmdb/{source_node_type}/{target_node_type}/{curie}


Notes: 
* **edges have a publications field, but I think it's always an empty list. Edges also appear to have no measure-related data. Ask?**
* **this API does not flip the predicate when the subject and object are flipped. All relationships in this API are stored as chemical_substance -> other entity, so right now the metaKG level needs to flip the predicate for the reverse direction:**
    * ChemicalSubstance -> Gene
    * ChemicalSubstance -> Disease
    * ChemicalSubstance -> PhenotypicFeature
    * ChemicalSubstance -> Pathway
* there is some info in the /hmdb/about endpoint, but it's not helpful for versioning. **Want versioning from Ranking Agent.** 
* biolink_predicate / relationships currently in the graph can be found in /hmdb/graph/summary. The output is messy, though: the way it traces ancestor biolink model entity classes is hard to read. 
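
A rough sketch of what flipping the predicate at the metaKG level could look like. `INVERSE_PREDICATE` and `flip_metakg_entry` are hypothetical names, and the inverse for "participates_in" is my assumption based on the Biolink Model; it should be checked against the model version actually in use.

```python
import copy

# Assumption: inverse names follow the Biolink Model. Symmetric predicates map to themselves.
INVERSE_PREDICATE = {
    "interacts_with": "interacts_with",
    "related_to": "related_to",
    "participates_in": "has_participant",
}


def flip_metakg_entry(entry):
    """Return a copy of a metaKG entry for the reverse (object -> subject) direction."""
    flipped = copy.deepcopy(entry)  # leave the forward entry untouched
    flipped["biolink_predicate"] = INVERSE_PREDICATE[entry["biolink_predicate"]]
    # Other fields (provenance, ingested predicate) are copied as-is in this sketch.
    return flipped


forward = {"biolink_predicate": "participates_in", "origin": {"name": "HMDB"}}
reverse = flip_metakg_entry(forward)
print(reverse["biolink_predicate"])
```

For the symmetric predicates the flipped entry is identical to the forward one, so only the Pathway relationship would actually change.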

These are what I think the main ID prefixes are for the nodes inside the HMDB API, along with two things I noticed about node names: 
* sometimes, response holds the name in the synonyms slot (not the name slot...which has the CHEMBL.COMPOUND ID)
* I noticed here https://automat.renci.org/hmdb/disease/chemical_substance/MONDO:0011422 that there is a node with a Pubchem ID (no CHEBI ID) and an empty string for name. The correct name is Aldosteronum https://pubchem.ncbi.nlm.nih.gov/compound/24758425.

In [None]:
hmdb_prefixes = {
    "ChemicalSubstance":("CHEBI", "PUBCHEM"), 
    "Disease":("MONDO", "UMLS"),
    "Gene":"NCBIGene",
    "PhenotypicFeature":("HP", "UMLS"),
    "Pathway":"SMPDB"
}
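
Because some node types have one prefix (a string) and others have several (a tuple), a small helper makes it easier to check an ID against the table. This is only a sketch; `expected_prefixes` and `has_expected_prefix` are hypothetical names, and the dictionary is repeated so the block runs on its own.

```python
hmdb_prefixes = {
    "ChemicalSubstance": ("CHEBI", "PUBCHEM"),
    "Disease": ("MONDO", "UMLS"),
    "Gene": "NCBIGene",
    "PhenotypicFeature": ("HP", "UMLS"),
    "Pathway": "SMPDB",
}


def expected_prefixes(node_type, prefixes):
    """Return the prefixes for a node type as a tuple, whether stored as str or tuple."""
    value = prefixes[node_type]
    return (value,) if isinstance(value, str) else tuple(value)


def has_expected_prefix(curie, node_type, prefixes):
    """True if the CURIE's prefix is one of those listed for the node type."""
    prefix = curie.split(":", 1)[0]
    return prefix in expected_prefixes(node_type, prefixes)


print(has_expected_prefix("UMLS:C1827878", "Disease", hmdb_prefixes))
print(has_expected_prefix("UMLS:C1827878", "Gene", hmdb_prefixes))
```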

#### ChemicalSubstance -> Gene / Gene -> ChemicalSubstance

Symmetrical predicate: "interacts_with"     
node conflation: Automat conflates gene and gene product. 

Example: 
* Angiotensin I (CHEBI:2718) to Gene: only one answer REN, NCBIGene:5972 https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:2718
    * inverse: REN to Chemical_substance: https://automat.renci.org/hmdb/gene/chemical_substance/NCBIGene:5972


Notes:
* Need a new predicate? HMDB states that the small molecules have enzymes annotated to them where the enzyme is "any protein which catalyzes chemical reactions involving the small molecule". https://hmdb.ca/sources . So it may be better to say something about reactions? But I guess interacts_with seems to match the data. 

Issue with method field:
* the response edge says "hmdb.metabolite_to_enzyme" for edge_source of these edges. However...
    * cystine (CHEBI:35491) to Gene: gives SLC16A10 as one of multiple answers https://automat.renci.org/hmdb/chemical_substance/gene/CHEBI:35491 with the edge_source: "hmdb.metabolite_to_enzyme"
    * On the HMDB website for cystine, SLC16A10 is actually listed as a transporter, not an enzyme...https://hmdb.ca/metabolites/HMDB0000192#transporters 
    * **So Automat likely pulled all proteins (shown on the website together here https://hmdb.ca/metabolites/HMDB0000192/metabolite_protein_links) and then assigned the edge_source as enzyme even though not all of the original annots were enzyme...**
    * therefore, I'm choosing to set the method as "extract_metabolite_to_protein_annot"
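
One way to keep these decisions in one place is a lookup from the edge_source string in the response to the method I assign. This is a sketch: `EDGE_SOURCE_TO_METHOD` and `method_for` are hypothetical names, and the values are the method strings used in the metaKG entries in this notebook.

```python
# edge_source (as reported by the response) -> method assigned in this notebook
EDGE_SOURCE_TO_METHOD = {
    "hmdb.metabolite_to_enzyme": "extract_metabolite_to_protein_annot",
    "hmdb.disease_to_metabolite": "extract_metabolite_to_disease_annot",
    "hmdb.metabolite_to_disease": "extract_metabolite_to_disease_annot",
    "hmdb.metabolite_to_pathway": "extract_metabolite_to_pathway_annot",
}


def method_for(edge_source):
    """Look up the assigned method; fail loudly for an edge_source not yet reviewed."""
    try:
        return EDGE_SOURCE_TO_METHOD[edge_source]
    except KeyError:
        raise KeyError(f"no method assigned yet for edge_source {edge_source!r}") from None


print(method_for("hmdb.metabolite_to_enzyme"))
```

Raising on an unknown edge_source is a deliberate choice here, so a new value from the API gets looked at rather than silently passed through.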

In [13]:
metakg_hmdb_cggc = {      
    "biolink_predicate":"interacts_with",  ## from TRAPI predicate/edge_label field
    "ingested_ontology_predicate":"RO:0002434",  ## from TRAPI relation field

    ## provenance
    "translator_group":["Ranking_Agent"],  ## assign this
    "nodes_conflated": 
        {'Gene':
            {'conflated_type':'GeneProduct', 'where':'Automat HMDB API'}},
    ## HMDB actually links chemical substances -> enzymes/uniprot ID
    ## so it must be automat that is converting enzyme/protein -> gene. 
    "traced_provenance":  
        [{"name":"Automat HMDB API",
         "type":"knowledgebase",  ## assign this
         "version":"v5"}],  ## made up, WE DON'T HAVE THIS
    "origin":  
        {"name":"HMDB",    ## took from TRAPI source_database field
         "type":"knowledgebase",    ## assign this
         "version":"2020-03-03",     ## made-up, WE DON'T HAVE THIS
         ## TRAPI edge_source field says "hmdb.metabolite_to_enzyme", see issues with that in markdown chunk above
         "method":"extract_metabolite_to_protein_annot"},   

    ## measures                  
    "numeric_measures_present":False,      ## assign this  
    "categorical_measures_present":False
}

In [14]:
metakg_hmdb_cggc

{'biolink_predicate': 'interacts_with',
 'ingested_ontology_predicate': 'RO:0002434',
 'translator_group': ['Ranking_Agent'],
 'nodes_conflated': {'Gene': {'conflated_type': 'GeneProduct',
   'where': 'Automat HMDB API'}},
 'traced_provenance': [{'name': 'Automat HMDB API',
   'type': 'knowledgebase',
   'version': 'v5'}],
 'origin': {'name': 'HMDB',
  'type': 'knowledgebase',
  'version': '2020-03-03',
  'method': 'extract_metabolite_to_protein_annot'},
 'numeric_measures_present': False,
 'categorical_measures_present': False}

#### ChemicalSubstance -> Disease / Disease -> ChemicalSubstance

Symmetrical predicate: "related_to"    

Example: 
* Angiotensin I (CHEBI:2718) to Disease: only one answer autosomal recessive proximal renal tubular acidosis, MONDO:0011422   https://automat.renci.org/hmdb/chemical_substance/disease/CHEBI:2718
    * inverse: autosomal recessive proximal renal tubular acidosis, MONDO:0011422 to chemical substance (multiple results) https://automat.renci.org/hmdb/disease/chemical_substance/MONDO:0011422 

* cystine (CHEBI:35491) to Disease https://automat.renci.org/hmdb/chemical_substance/disease/CHEBI%3A35491
    * notice the results for this same chemical to phenotypic feature in the next section of this notebook, and the [likely source in HMDB](https://hmdb.ca/metabolites/HMDB0000192#biological_properties). It looks like Automat took the disease references here and tried to resolve the terms to either a MONDO term or HP term (no overlap between them)...
    * notice that one result is a disease with no MONDO id, so the main ID is the UMLS (UMLS:C1827878) and there is an empty string for the name... Using [this](https://www.ebi.ac.uk/spot/oxo/index), the term resolves to "Refractory localization-related epilepsy"

Issue with method field:
* the response edge says "hmdb.disease_to_metabolite" for edge_source of these edges. 
* However, the HMDB website's disease lookup for [proximal renal tubular acidosis](https://hmdb.ca/unearth/q?utf8=%E2%9C%93&query=+proximal+renal+tubular+acidosis&searcher=diseases&button=) does not give the metabolites described in the answer. So Automat likely did not pull from the disease entries to get metabolites. 
* Instead, the HMDB website's metabolite lookup for [cystine](https://hmdb.ca/metabolites/HMDB0000192#biological_properties) gives diseases that are in the ChemicalSubstance -> Disease edge. 
* **The ChemicalSubstance -> PhenotypicFeature response edges say "hmdb.metabolite_to_disease" instead. In other words, Automat took the metabolite entries and pulled the diseases from there. That matches what I see on the HMDB website, so I'm going to assume the same method for these edges.** 

In [11]:
metakg_hmdb_cddc = {      
    "biolink_predicate":"related_to",  ## from TRAPI predicate/edge_label field
    "ingested_predicate_label":"SEMMEDDB:ASSOCIATED_WITH",  ## from TRAPI relation field

    ## provenance
    "translator_group":["Ranking_Agent"],  ## assign this
    "traced_provenance":  
        [{"name":"Automat HMDB API",
         "type":"knowledgebase",  ## assign this
         "version":"v5"}],  ## made up, WE DON'T HAVE THIS
    "origin":  
        {"name":"HMDB",    ## took from TRAPI source_database field
         "type":"knowledgebase",    ## assign this
         "version":"2020-03-03",     ## made-up, WE DON'T HAVE THIS
         ## TRAPI edge_source field says "hmdb.disease_to_metabolite", see issues with that in markdown chunk above
         "method":"extract_metabolite_to_disease_annot"},   

    ## measures                  
    "numeric_measures_present":False,      ## assign this  
    "categorical_measures_present":False
}

In [12]:
metakg_hmdb_cddc['origin']

{'name': 'HMDB',
 'type': 'knowledgebase',
 'version': '2020-03-03',
 'method': 'extract_metabolite_to_disease_annot'}

#### ChemicalSubstance -> PhenotypicFeature / PhenotypicFeature -> ChemicalSubstance

Symmetrical predicate: "related_to"    
node conflation: Automat conflates phenotypicFeature and disease.

Example: 
* Cystine (CHEBI:35491) to PhenotypicFeature: https://automat.renci.org/hmdb/chemical_substance/phenotypic_feature/CHEBI:35491
    * notice the results for this same chemical to disease in the section above this one, and the [likely source in HMDB](https://hmdb.ca/metabolites/HMDB0000192#biological_properties). It looks like Automat took the disease references here and tried to resolve the terms to either a MONDO term or HP term (no overlap between them)...
    * notice that one result is a disease with no HP id, so the main ID is the UMLS (UMLS:C0393710) and there is an empty string for the name...  Using [this](https://www.ebi.ac.uk/spot/oxo/index), the term resolves to "Seizures in response to acute event"
* an example of inverse: Molybdenum cofactor deficiency, HP:0003570 to chemical substance (multiple results) https://automat.renci.org/hmdb/phenotypic_feature/chemical_substance/HP:0003570

Correct method field:
* The HMDB website's metabolite lookup for [cystine](https://hmdb.ca/metabolites/HMDB0000192#biological_properties) gives diseases that are in the ChemicalSubstance -> PhenotypicFeature edge. 
    * the HMDB website's disease lookup for [molybdenum cofactor](https://hmdb.ca/unearth/q?utf8=%E2%9C%93&query=Molybdenum+cofactor&searcher=diseases&button=) does not give the metabolites described in the answer. So Automat likely did not pull from the disease entries to get metabolites. 
* **ChemicalSubstance -> PhenotypicFeature response edges say "hmdb.metabolite_to_disease". In other words, Automat took the metabolite entries and pulled the diseases from there. This looks more accurate to the situation, so I'm going to make this assumption** 

In [15]:
metakg_hmdb_cppc = {      
    "biolink_predicate":"related_to",  ## from TRAPI predicate/edge_label field
    "ingested_predicate_label":"SEMMEDDB:ASSOCIATED_WITH",  ## from TRAPI relation field

    ## provenance
    "translator_group":["Ranking_Agent"],  ## assign this
    "nodes_conflated": 
        {'PhenotypicFeature':
            {'conflated_type':'Disease', 'where':'Automat HMDB API'}},
    ## HMDB actually links chemical substances -> diseases
    ## so Automat is the one resolving disease names to PhenotypicFeatures/HP terms 
    "traced_provenance":  
        [{"name":"Automat HMDB API",
         "type":"knowledgebase",  ## assign this
         "version":"v5"}],  ## made up, WE DON'T HAVE THIS
    "origin":  
        {"name":"HMDB",    ## took from TRAPI source_database field
         "type":"knowledgebase",    ## assign this
         "version":"2020-03-03",     ## made-up, WE DON'T HAVE THIS
         ## TRAPI edge_source field says "hmdb.metabolite_to_disease", seems good
         "method":"extract_metabolite_to_disease_annot"},   

    ## measures                  
    "numeric_measures_present":False,      ## assign this  
    "categorical_measures_present":False
}

In [17]:
metakg_hmdb_cppc['nodes_conflated']

{'PhenotypicFeature': {'conflated_type': 'Disease',
  'where': 'Automat HMDB API'}}

#### WIP: ChemicalSubstance -> Pathway

Non-symmetrical predicate: "participates_in"    

Example: 
* 3-Dechloroethylifosfamide (CHEBI:80558) to Pathway: https://automat.renci.org/hmdb/chemical_substance/pathway/CHEBI:80558
    * notice that this is associated with 4 pathways. However, I checked and the two Ifosfamide pathways (metabolism and action) look identical. Similarly, the two Cyclophosphamide pathways (metabolism and action) look identical. Click on the SMPDB/Pathwhiz diagrams [here](https://hmdb.ca/metabolites/HMDB0013858#biological_properties) to see it. That's an original DB issue, not something to do much about.  

Correct method field:
* **ChemicalSubstance -> Pathway response edges say "hmdb.metabolite_to_pathway". In other words, Automat took the metabolite entries and pulled the pathways from there. This looks more accurate to the situation, so I'm going to make this assumption** 

In [18]:
metakg_hmdb_cpath = {      
    "biolink_predicate":"participates_in",  ## from TRAPI predicate/edge_label field
    "ingested_ontology_predicate":"RO:0000056",  ## from TRAPI relation field

    ## provenance
    "translator_group":["Ranking_Agent"],  ## assign this
    "traced_provenance":  
        [{"name":"Automat HMDB API",
         "type":"knowledgebase",  ## assign this
         "version":"v5"}],  ## made up, WE DON'T HAVE THIS
    "origin":  
        {"name":"HMDB",    ## took from TRAPI source_database field
         "type":"knowledgebase",    ## assign this
         "version":"2020-03-03",     ## made-up, WE DON'T HAVE THIS
         ## TRAPI edge_source field says "hmdb.metabolite_to_pathway", seems good
         "method":"extract_metabolite_to_pathway_annot"},   

    ## measures                  
    "numeric_measures_present":False,      ## assign this  
    "categorical_measures_present":False
}

## Response edge level

### Automat CORD19 Scigraph

Breast cancer (disease, MONDO:0016419) -> Gene: https://automat.renci.org/cord19-scigraph/disease/gene/MONDO:0016419
has one result (EGF, NCBIGene:1950)

Edge-specific parts: 
- add the specific measure values (as a new "value" key in each measure dictionary)

In [None]:
results_cord19scigraph = copy.deepcopy(metakg_cord19scigraph)

results_cord19scigraph['numeric_measures'][0]['value'] = 8.931854891316576e-9   ## enrichment_p
results_cord19scigraph['numeric_measures'][1]['value'] = 0.03796023182396457   ## num_publications

In [None]:
results_cord19scigraph['numeric_measures']
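
Setting values by list position works, but it breaks if the order of numeric_measures ever changes. A sketch of doing the same thing by measure name follows; `with_measure_values` is a hypothetical helper, and `template` is a cut-down stand-in for the metaKG entry so the block runs on its own.

```python
import copy


def with_measure_values(metakg_entry, values):
    """Copy a metaKG entry and attach edge-specific values to measures by name."""
    result = copy.deepcopy(metakg_entry)  # keep the metaKG-level entry unchanged
    by_name = {measure["name"]: measure for measure in result["numeric_measures"]}
    for name, value in values.items():
        by_name[name]["value"] = value  # KeyError if the measure is not declared
    return result


template = {
    "numeric_measures": [
        {"name": "enrichment_p", "range": "(0-1]"},
        {"name": "num_publications", "range": "[0-1]"},
    ]
}
edge = with_measure_values(
    template,
    {"enrichment_p": 8.931854891316576e-9, "num_publications": 0.03796023182396457},
)
print(edge["numeric_measures"])
```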

### Automat CORD19 Scibite

ANGPT2 (Gene, NCBIGene:285) -> "chloride ion binding" (MolecularActivity, GO:0031404): https://automat.renci.org/cord19-scibite/gene/molecular_activity/NCBIGene:285

Edge-specific parts: 
- add the specific measure values (as a new "value" key in each measure dictionary)

In [None]:
results_cord19scibite = copy.deepcopy(metakg_cord19scibite)

results_cord19scibite['numeric_measures'][0]['value'] = 2.0377154960709163e-8   ## enrichment_p
results_cord19scibite['numeric_measures'][1]['value'] = 0.854319545770893   ## num_publications

In [None]:
results_cord19scibite['origin']