# Metadata associated with MyDisease.info, DisGeNET resource

In [1]:
## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import copy  ## for deepcopy

So a metaKG edge has:
* one subject and one object (the edge is directed; both are BioThings types/Biolink Model entities, and they can be the same type)
* one predicate
* one source
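
A minimal sketch of that shape follows. The class name `MetaKGEdge` and its field names are illustrative assumptions, not part of any BioThings or Translator schema; the example values are the types and predicates used later in this notebook.

```python
from typing import NamedTuple


class MetaKGEdge(NamedTuple):
    """Hypothetical container for the four parts of a metaKG edge."""
    subject: str    # a BioThings type / Biolink Model entity
    object: str     # may be the same type as the subject
    predicate: str
    source: str


# Edges are one-way, so the reverse direction is a separate edge.
disease_to_gene = MetaKGEdge("Disease", "Gene", "related_to", "DisGeNET")
gene_to_disease = MetaKGEdge("Gene", "Disease", "related_to", "DisGeNET")

# Subject and object can share a type.
disease_to_disease = MetaKGEdge("Disease", "Disease", "similar_to", "DisGeNET")
```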

## MetaKG edge level

### top-level (exist on all edges)

There are some slots that have the same value, no matter what edge they're on. These are described below.

In [2]:
metakg_top_level = {
    ## PROVENANCE
    "translator_group":["Service_Provider"],    
    "nodes_conflated": 
        {'Disease':     ## all edges have Disease as input or output
            {'conflated_type':'PhenotypicFeature', 'where':'DisGeNET'}
        },   
    
    ## based on input-type/source/output-type, the next member of traced_provenance differs
    ## the METHOD slot/METHOD REF for each dictionary differs between sources
    "traced_provenance":  ## can get most from endpoint http://mydisease.info/v1/metadata 
        [{"name":"MyDisease.info API",
         "type":"service",  ## assign this
         "version":"2020-10-26"}],  ## current as-of 2020-10-27 
    
    
    ## MEASURES
    ## all data has this, but just different measures depending on input-type/output-type
    "numeric_measures_present":True,
    
    ## currently, the data downloads from DisGeNET don't include categorical measures. 
    ## However, the DisGeNET browser does show categorical measures of the association (EL aka evidence level)
    "categorical_measures_present":False
}
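
The edge dictionaries in this notebook are built from this template with `copy.deepcopy` rather than a shallow copy. The sketch below shows why, using a small stand-in dictionary (`top_level` is a cut-down assumption, not the full template): a shallow copy shares the nested `nodes_conflated` dict, so editing the copy would silently change the template too.

```python
import copy

# Small stand-in for metakg_top_level (same nesting, fewer slots).
top_level = {
    "translator_group": ["Service_Provider"],
    "nodes_conflated": {
        "Disease": {"conflated_type": "PhenotypicFeature", "where": "DisGeNET"}
    },
}

gene_conflation = {"conflated_type": "GeneProduct", "where": "DisGeNET"}

# Deep copy: nested dicts are duplicated, so the template is untouched.
deep = copy.deepcopy(top_level)
deep["nodes_conflated"]["Gene"] = dict(gene_conflation)
deep_left_template_alone = "Gene" not in top_level["nodes_conflated"]

# Shallow copy: the nested dict is shared, so the template changes as well.
shallow = copy.copy(top_level)
shallow["nodes_conflated"]["Gene"] = dict(gene_conflation)
shallow_changed_template = "Gene" in top_level["nodes_conflated"]
```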

### Disease -> Disease

* Based on: https://www.disgenet.org/static/disgenet_ap1/files/downloads/disease_to_disease_CURATED.tsv.gz
    * notice that this appears to use the "curated" sources only (uniprot, ctd, orphanet, clingen, genomics england, cgi, psygenet for genes; then uniprot, clinvar, gwas catalog, gwasdb for variants). 
    * however, "source" doesn't explain what sources are used for what associations. There also isn't information on this file in the download README. 
    * see [website](https://www.disgenet.org/dbinfo#section13) for explanation of the jaccard measures
* Uses UMLS IDs (DisGeNET calls it the CUI)

<br>

predicate logic: 
* "is comparable to" (SIO:000736) has the root, "is related to"
* description sounds like what we have here: comparing shared genes/variants
    * description: "is a relation between two entities that share at least one feature whose value can be compared."
    * see [ontobee](http://www.ontobee.org/ontology/SIO?iri=http://semanticscience.org/resource/SIO_000736)
* other options?
    * RO:HOM0000000 "in similarity relationship with". Doesn't have a root. See [ontobee](http://www.ontobee.org/ontology/RO?iri=http://purl.obolibrary.org/obo/RO_HOM0000000)
  
<br>
  
for current "method" string descriptions, go to the metadata google doc. 
* ingest_consolidate = ingest (simple ETL)
* association_from_shared_annot = annotated to the same stuff (diseases annotated to same genes/variants)
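
To make the shared-annotation idea concrete, here is a small sketch of the count and Jaccard measures. It assumes the standard Jaccard index (size of the intersection divided by size of the union); the DisGeNET page linked above is the authority on how the published values are actually computed. The gene sets are invented toy data, and `jaccard_index` is a hypothetical helper.

```python
def jaccard_index(a, b):
    """Standard Jaccard index of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Toy annotation sets (placeholder names, not real DisGeNET data).
disease_a_genes = {"GENE1", "GENE2", "GENE3"}
disease_b_genes = {"GENE2", "GENE3", "GENE4", "GENE5"}

n_shared = len(disease_a_genes & disease_b_genes)        # 2 shared genes
score = jaccard_index(disease_a_genes, disease_b_genes)  # 2 shared / 5 total
```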

In [3]:
metakg_dd = copy.deepcopy(metakg_top_level)

## PREDICATES
## assign this, no predicate in datafile
metakg_dd["biolink_predicate"] = "similar_to"    
metakg_dd["ingested_ontology_predicate"] = "SIO:000736"    
metakg_dd["ingested_ontology_label"] = "SIO:is_comparable_to"    

## PROVENANCE
## the MyDisease.info dict
metakg_dd['traced_provenance'][0].update({"method":"ingest"})
## can get most from endpoint http://mydisease.info/v1/metadata
metakg_dd["origin"] =    \
    {"name":"DisGeNET",
     "type":"knowledgebase",    ## assign this
     "version":"2020-05-07",     ## current as-of 2020-10-27
     "method":"association_from_shared_annot"}        ## assign this

## MEASURES          
metakg_dd["numeric_measures"] =      \
        [{"name":"Ngenes",
          "standard_label":"num_shared_genes",  ## name Translator may want to use
          "range":"[0-inf)", 
          "direction":{"more_confident":"larger"}}, 
         {"name":"jaccard_genes",   
          "standard_label":"jaccard_shared_genes",  ## name Translator may want to use
          "range":"[0-1]", 
          "direction":{"more_confident":"larger"}, 
          "reference":{"url":"https://www.disgenet.org/dbinfo#section13"}}, 
         {"name":"Nvariants",     
          "standard_label":"num_shared_variants",  ## name Translator may want to use
          "range":"[0-inf)", 
          "direction":{"more_confident":"larger"}}, 
         {"name":"jaccard_variant",   ## rename from jaccard_variant
          "standard_label":"jaccard_shared_variants",  ## name Translator may want to use
          "range":"[0-1]", 
          "direction":{"more_confident":"larger"}, 
          "reference":{"url":"https://www.disgenet.org/dbinfo#section13"}} 
        ]
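
The `range` slots above use interval notation as plain strings (`"[0-inf)"`, `"[0-1]"`). A minimal sketch of how a consumer might check a value against such a string is below. `in_range` is a hypothetical helper, and it assumes non-negative bounds, since a leading minus sign would be confused with the separator.

```python
import re

# opening bracket, lower bound, "-", upper bound, closing bracket
_RANGE = re.compile(r"^([\[(])(.+?)-(.+?)([\])])$")


def in_range(value, range_str):
    """Return True if value lies inside an interval string such as "(0-1]"."""
    match = _RANGE.match(range_str.strip())
    if match is None:
        raise ValueError(f"unrecognised range string: {range_str!r}")
    open_br, low, high, close_br = match.groups()
    low, high = float(low), float(high)  # float("inf") is accepted
    above = value >= low if open_br == "[" else value > low
    below = value <= high if close_br == "]" else value < high
    return above and below


zero_ok_for_count = in_range(0, "[0-inf)")   # square bracket includes 0
zero_ok_for_score = in_range(0, "(0-1]")     # round bracket excludes 0
```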

In [4]:
metakg_dd.keys()
metakg_dd['origin']

[i['standard_label'] for i in metakg_dd['numeric_measures']]

dict_keys(['translator_group', 'nodes_conflated', 'traced_provenance', 'numeric_measures_present', 'categorical_measures_present', 'biolink_predicate', 'ingested_ontology_predicate', 'ingested_ontology_label', 'origin', 'numeric_measures'])

{'name': 'DisGeNET',
 'type': 'knowledgebase',
 'version': '2020-05-07',
 'method': 'association_from_shared_annot'}

['num_shared_genes',
 'jaccard_shared_genes',
 'num_shared_variants',
 'jaccard_shared_variants']

### Relationships with Genes and Diseases

- Based on: https://www.disgenet.org/static/disgenet_ap1/files/downloads/all_gene_disease_pmid_associations.tsv.gz 
- the Service Provider then merged rows that were identical in every column except pmid, so the pmid column value is a list
- Uses UMLS IDs (DisGeNET calls it the CUI) for diseases
- Uses NCBIGene IDs
- there are 17 underlying sources. sources can have different predicates and methods. 
    * there isn't information given to trace further into the source (like CTD's provenance for the assertion)
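
The row-merging step described above can be sketched as follows. This is not the Service Provider's actual ingest code; `merge_pmids` is a hypothetical helper, and the column names and values in `rows` are placeholders rather than real DisGeNET records.

```python
def merge_pmids(rows):
    """Collapse rows that match on every column except "pmid" into one row
    whose "pmid" value is a list."""
    merged = {}
    for row in rows:
        # Key on all non-pmid columns; keys are unique, so sorting is stable.
        key = tuple(sorted((k, v) for k, v in row.items() if k != "pmid"))
        if key not in merged:
            merged[key] = {k: v for k, v in row.items() if k != "pmid"}
            merged[key]["pmid"] = []
        merged[key]["pmid"].append(row["pmid"])
    return list(merged.values())


# Placeholder rows: two differ only in pmid, the third has another source.
rows = [
    {"geneId": 1, "diseaseId": "C0000001", "source": "BEFREE", "pmid": 111},
    {"geneId": 1, "diseaseId": "C0000001", "source": "BEFREE", "pmid": 222},
    {"geneId": 1, "diseaseId": "C0000001", "source": "LHGDN", "pmid": 111},
]
merged_rows = merge_pmids(rows)
```

Rows from different sources stay separate because `source` is part of the key, which matches the per-source provenance used in the rest of this section.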

Below, there's the information shared between all edges with this combo of input/output-types:
* ingested predicate choice: DisGeNET doesn't include an "association type" or a predicate in its data dump. The [homepage](https://www.disgenet.org/home/) describes the resource as "collections of genes and variants associated to human diseases". **This is why an "associated with" relation is used, rather than the more specific has_biomarker/biomarker_of relationships.**
    * the has_biomarker/biomarker_of predicates tend to carry a *causal* flavor and roughly map to "causes or contributes to condition" (an RO term). I'm not fully comfortable making that assertion for all DisGeNET data.
* the biolink and ingested predicate are symmetrical so it works with both Disease -> Gene and Gene -> Disease. 

In [5]:
basic_dggd_edge = copy.deepcopy(metakg_top_level)

## PREDICATE
## actually too narrow, but there isn't the SIO term
basic_dggd_edge["biolink_predicate"] = "related_to"  
basic_dggd_edge["ingested_ontology_predicate"] = "SIO:001403"
basic_dggd_edge["ingested_ontology_label"] = "SIO:is_associated_with"

## PROVENANCE
## add the gene/gene-product conflation
basic_dggd_edge['nodes_conflated'].update({'Gene':{'conflated_type':'GeneProduct', 'where':'DisGeNET'}})

## all disease-gene combo edges have this second entry in the trace
## the method differs
basic_dggd_edge['traced_provenance'].append(
    {"name":"DisGeNET",
     "type":"knowledgebase",    ## assign this
     "version":"2020-05-07",     ## current as of 2020-10-27
     "method_ref":{"url":"https://www.disgenet.org/dbinfo#section11"}    ## assign this
    }  
)

## MEASURES: 4 
basic_dggd_edge['numeric_measures'] = \
[
    {"name":"GDAscore",
     "standard_label":"association_score",  ## name Translator may want to use
     "range":"(0-1]", 
     "direction":{"more_confident":"larger"},
     "reference":{"url":"https://www.disgenet.org/dbinfo#section31"}}, 
    {"name":"EI",
     "standard_label":"evidence_index",  ## name Translator may want to use
     "range":"(0-1]", 
     "direction":{"more_confident":"larger"},
     "reference":{"url":"https://www.disgenet.org/dbinfo#section36"}}, 
    {"name":"DSI",
     "standard_label":"gene_specific_to_disease",  ## name Translator may want to use
     "range":"(0-1]",  ## ref claims min=0.25, but it should fluctuate based on db data
     "direction":{"more_specific":"larger"},
     "reference":{"url":"https://www.disgenet.org/dbinfo#section33"}}, 
    {"name":"DPI",
     "standard_label":"gene_specific_to_disease_class",  ## name Translator may want to use
     "range":"(0-1]",       ## ref claims min=1/29, but it would fluctuate if disease class changed
     "direction":{"more_specific":"smaller"},
     "reference":{"url":"https://www.disgenet.org/dbinfo#section34"}} 
]
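
The `direction` slot tells a consumer which end of a measure's range is the "better" one. A small sketch of how it might be used to order values is below; `best_first` is a hypothetical helper, the measure dicts are trimmed copies of the DSI and DPI entries, and the numbers are toy values.

```python
def best_first(values, measure):
    """Sort values so the preferred end of the measure's range comes first."""
    # Each "direction" dict in this notebook holds a single entry,
    # e.g. {"more_specific": "smaller"}.
    preferred = next(iter(measure["direction"].values()))
    if preferred not in ("larger", "smaller"):
        raise ValueError(f"unexpected direction: {preferred!r}")
    return sorted(values, reverse=(preferred == "larger"))


dsi = {"name": "DSI", "direction": {"more_specific": "larger"}}
dpi = {"name": "DPI", "direction": {"more_specific": "smaller"}}

toy_values = [0.3, 0.9, 0.5]
dsi_ranked = best_first(toy_values, dsi)   # largest first
dpi_ranked = best_first(toy_values, dpi)   # smallest first
```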

The provenance then differs by the value in the source column of the original data file. This provenance is the same between Disease -> Gene and Gene -> Disease.
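
The per-source cells in this section all follow the same pattern: deep-copy the shared edge, set the method on the MyDisease.info entry, set the method on the DisGeNET entry, and assign an origin. As a hedged sketch, that pattern could be wrapped in one function. `make_source_edge` is a hypothetical helper that the notebook does not define, and `base` is a cut-down stand-in for `basic_dggd_edge`.

```python
import copy


def make_source_edge(base_edge, service_method, disgenet_method, origin):
    """Build one per-source edge without mutating the shared base edge."""
    edge = copy.deepcopy(base_edge)
    edge["traced_provenance"][0]["method"] = service_method    # MyDisease.info dict
    edge["traced_provenance"][1]["method"] = disgenet_method   # DisGeNET dict
    edge["origin"] = dict(origin)
    return edge


# Stand-in for basic_dggd_edge: only the slots this helper touches.
base = {
    "traced_provenance": [
        {"name": "MyDisease.info API", "type": "service"},
        {"name": "DisGeNET", "type": "knowledgebase"},
    ]
}

hpo_edge = make_source_edge(
    base,
    service_method="ingest",
    disgenet_method="propagate_from_phenotype",
    origin={"name": "HPO_annotations", "type": "knowledgebase"},
)
```

The explicit cells are more verbose, but they keep each source's choices visible next to its comments.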

#### By source: LHGDN

In [6]:
metakg_dggd_LHGDN = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_LHGDN['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_LHGDN['traced_provenance'][1].update({"method":"NLP_LHGDN"})

## assign the origin
metakg_dggd_LHGDN['origin'] = \
    {"name":"GeneRIF",  
        "type":"text",
        "version":"2009-03-31",
        "taxon_subset":{'NCBITaxon:9606':'Homo_sapiens'}}  ## used human GeneRIFs 

In [7]:
metakg_dggd_LHGDN

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'NLP_LHGDN'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'standard_label': 'ev

#### By source: BEFREE

In [8]:
metakg_dggd_BEFREE = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_BEFREE['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_BEFREE['traced_provenance'][1].update({"method":"NLP_BEFREE"})

## assign the origin
metakg_dggd_BEFREE['origin'] = \
    {"name":"MEDLINE_abstracts",  
        "type":"publications",
        "version":"1970-01_to_2019-12"}  

In [9]:
metakg_dggd_BEFREE

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'NLP_BEFREE'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'standard_label': 'e

#### By source: HPO

In [10]:
metakg_dggd_HPO = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_HPO['traced_provenance'][0].update({"method":"ingest"})
## the DisGeNET dict
metakg_dggd_HPO['traced_provenance'][1].update({"method":"propagate_from_phenotype"})

## assign the origin
metakg_dggd_HPO['origin'] = \
    {"name":"HPO_annotations",  
     "type":"knowledgebase"}  

In [11]:
metakg_dggd_HPO

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'propagate_from_phenotype'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'standard_label': 

#### By source: UNIPROT

In [12]:
metakg_dggd_UNIPROT = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_UNIPROT['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_UNIPROT['traced_provenance'][1].update({"method":"propagate_from_protein_variant"})

## assign the origin
metakg_dggd_UNIPROT['origin'] = \
    {"name":"UniProt",  
     "type":"knowledgebase",
     "taxon_subset":{'NCBITaxon:9606':'Homo_sapiens'}}  ## used human file 

In [13]:
metakg_dggd_UNIPROT

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'propagate_from_protein_variant'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   

#### By source: CGI

In [14]:
metakg_dggd_CGI = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_CGI['traced_provenance'][0].update({"method":"ingest"})
## the DisGeNET dict
metakg_dggd_CGI['traced_provenance'][1].update({"method":"ingest"})

## assign the origin
metakg_dggd_CGI['origin'] = \
    {"name":"Cancer_Genome_Interpreter",  
     "type":"knowledgebase"}  

In [15]:
metakg_dggd_CGI

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'standard_label': 'evidence_index',


#### By source: CLINVAR

In [71]:
metakg_dggd_CLINVAR = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_CLINVAR['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_CLINVAR['traced_provenance'][1].update({"method":"propagate_from_sequence_variant"})

## assign the origin
metakg_dggd_CLINVAR['origin'] = \
    {"name":"ClinVar",  
     "type":"knowledgebase"}  

In [17]:
metakg_dggd_CLINVAR

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'propagate_from_sequence_variant'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'st

#### By source: GWASCAT

In [72]:
metakg_dggd_GWASCAT = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_GWASCAT['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_GWASCAT['traced_provenance'][1].update({"method":"propagate_from_sequence_variant"})

## assign the origin
metakg_dggd_GWASCAT['origin'] = \
    {"name":"NHGRI_EBI_GWAS_CATALOG",  
     "type":"knowledgebase"}  

In [19]:
metakg_dggd_GWASCAT

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'propagate_from_sequence_variant'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'st

#### By source: GWASDB

In [73]:
metakg_dggd_GWASDB = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_GWASDB['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_GWASDB['traced_provenance'][1].update({"method":"propagate_from_sequence_variant"})

## assign the origin
metakg_dggd_GWASDB['origin'] = \
    {"name":"GWASdb",  
     "type":"knowledgebase"}  

In [21]:
metakg_dggd_GWASDB

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'propagate_from_sequence_variant'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'st

#### By source: MGD

In [22]:
metakg_dggd_MGD = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_MGD['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_MGD['traced_provenance'][1].update({"method":"propagate_from_orthology"})

## assign the origin
metakg_dggd_MGD['origin'] = \
    {"name":"MGD",  
     "type":"knowledgebase",
     "taxon_subset":{"NCBITaxon:10090":"Mus_musculus"}}  ## mouse resource

In [23]:
metakg_dggd_MGD

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'propagate_from_orthology'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'stand

#### By source: RGD

In [24]:
metakg_dggd_RGD = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_RGD['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_RGD['traced_provenance'][1].update({"method":"propagate_from_orthology"})

## assign the origin
metakg_dggd_RGD['origin'] = \
    {"name":"RGD",  
     "type":"knowledgebase"}  
## I don't put a taxon_subset since RGD actually has 8 species, including rat and human, in it: 
##  https://rgd.mcw.edu/wg/about-us/

In [25]:
metakg_dggd_RGD

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'propagate_from_orthology'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'stand

#### By source: CTD_mouse

In [26]:
metakg_dggd_CTDmouse = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_CTDmouse['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_CTDmouse['traced_provenance'][1].update({"method":"propagate_from_orthology"})

## assign the origin
metakg_dggd_CTDmouse['origin'] = \
    {"name":"CTD",  
     "type":"knowledgebase",
     "taxon_subset":{"NCBITaxon:10090":"Mus_musculus"}}   ## mouse file

In [27]:
metakg_dggd_CTDmouse

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'propagate_from_orthology'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'stand

#### By source: CTD_rat

In [28]:
metakg_dggd_CTDrat = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_CTDrat['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_CTDrat['traced_provenance'][1].update({"method":"propagate_from_orthology"})

## assign the origin
metakg_dggd_CTDrat['origin'] = \
    {"name":"CTD",  
     "type":"knowledgebase",
     "taxon_subset":{"NCBITaxon:10116":"Rattus_norvegicus"}}   ## rat file

In [29]:
metakg_dggd_CTDrat

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'propagate_from_orthology'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'stand

#### By source: CLINGEN

In [30]:
metakg_dggd_CLINGEN = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_CLINGEN['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_CLINGEN['traced_provenance'][1].update({"method":"ingest"})

## assign the origin
metakg_dggd_CLINGEN['origin'] = \
    {"name":"ClinGen",  
     "type":"knowledgebase"}  

In [31]:
metakg_dggd_CLINGEN

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'standard_label': 'evide

#### By source: GENOMICS_ENGLAND (GE)

In [32]:
metakg_dggd_GE = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_GE['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_GE['traced_provenance'][1].update({"method":"ingest"})

## assign the origin
metakg_dggd_GE['origin'] = \
    {"name":"Genomics_England_PanelApp",  
     "type":"knowledgebase"}  

In [33]:
metakg_dggd_GE

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'standard_label': 'evide

#### By source: CTD_human

In [34]:
metakg_dggd_CTDhuman = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_CTDhuman['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_CTDhuman['traced_provenance'][1].update({"method":"ingest"})

## assign the origin
metakg_dggd_CTDhuman['origin'] = \
    {"name":"CTD",  
     "type":"knowledgebase",
     "taxon_subset":{'NCBITaxon:9606':'Homo_sapiens'}}  ## human file

In [35]:
metakg_dggd_CTDhuman

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'standard_label': 'evide

#### By source: PSYGENET

In [36]:
metakg_dggd_PSYGENET = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_PSYGENET['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_PSYGENET['traced_provenance'][1].update({"method":"ingest"})

## assign the origin
metakg_dggd_PSYGENET['origin'] = \
    {"name":"PsyGeNET",  
     "type":"knowledgebase"}  

In [37]:
metakg_dggd_PSYGENET

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'standard_label': 'evide

#### By source: ORPHANET

In [38]:
metakg_dggd_ORPHANET = copy.deepcopy(basic_dggd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dggd_ORPHANET['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dggd_ORPHANET['traced_provenance'][1].update({"method":"ingest"})

## assign the origin
metakg_dggd_ORPHANET['origin'] = \
    {"name":"Orphanet",  
     "type":"knowledgebase"}  

In [39]:
metakg_dggd_ORPHANET

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'standard_label': 'evide

#### Finding edges with common attributes

Sets with same traced provenance (methods):
* CLINGEN, Genomics England (GE), CTDhuman, PSYGENET, ORPHANET have "ingest_consolidate" on the MyDisease.info level and "ingest" on the DisGeNET level. 
* MGD, RGD, CTDmouse, CTDrat have "ingest_consolidate" on the MyDisease.info level and "propagate_from_orthology" on the DisGeNET level. 
* CLINVAR, GWASCAT, GWASDB have "ingest_consolidate" on the MyDisease.info level and "propagate_from_gene_variant" on the DisGeNET level. 

Edges with unique traced_provenance (methods): LHGDN, BEFREE, HPO, UNIPROT, CGI

In [40]:
set_of_dggd_edges = [metakg_dggd_LHGDN, metakg_dggd_BEFREE, metakg_dggd_HPO, metakg_dggd_UNIPROT, 
                     metakg_dggd_CGI, metakg_dggd_CLINVAR, metakg_dggd_GWASCAT, metakg_dggd_GWASDB, 
                     metakg_dggd_MGD, metakg_dggd_RGD, metakg_dggd_CTDmouse, metakg_dggd_CTDrat, 
                     metakg_dggd_CLINGEN, metakg_dggd_GE, metakg_dggd_CTDhuman, metakg_dggd_PSYGENET, 
                     metakg_dggd_ORPHANET]

In [41]:
## `and` is the boolean operator; no line continuations are needed inside brackets
[(i['origin']['name'], i['origin'].get('taxon_subset'))
 for i in set_of_dggd_edges
 if i['traced_provenance'][0]['method'] == "ingest_consolidate"
 and i['traced_provenance'][1]['method'] == "propagate_from_gene_variant"]

[('ClinVar', None), ('NHGRI_EBI_GWAS_CATALOG', None), ('GWASdb', None)]
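
Rather than writing one comprehension per method pair, the edges could be grouped by their full sequence of methods in a single pass. This is only a sketch: `group_by_methods` is a hypothetical helper, and `toy_edges` are stand-ins that carry just the keys the helper reads, not the real `metakg_dggd_*` dictionaries.

```python
from collections import defaultdict

def group_by_methods(edges):
    """Map each tuple of traced_provenance methods to the origin names that share it."""
    groups = defaultdict(list)
    for edge in edges:
        # one method per provenance step, in order (MyDisease.info first, then DisGeNET)
        methods = tuple(step.get("method") for step in edge["traced_provenance"])
        groups[methods].append(edge["origin"]["name"])
    return dict(groups)

# toy stand-ins (assumption: only the keys used above are present)
toy_edges = [
    {"traced_provenance": [{"method": "ingest_consolidate"}, {"method": "ingest"}],
     "origin": {"name": "Orphanet"}},
    {"traced_provenance": [{"method": "ingest_consolidate"}, {"method": "ingest"}],
     "origin": {"name": "PsyGeNET"}},
    {"traced_provenance": [{"method": "ingest_consolidate"},
                           {"method": "propagate_from_gene_variant"}],
     "origin": {"name": "ClinVar"}},
]
method_groups = group_by_methods(toy_edges)
```

Groups with a single member would correspond to the edges with unique traced_provenance.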

### Relationships with Variants and Diseases

- Based on: https://www.disgenet.org/static/disgenet_ap1/files/downloads/all_variant_disease_pmid_associations.tsv.gz 
- Uses UMLS IDs (DisGeNET calls it the CUI) for diseases
- Uses dbSNP rs numbers for variants (I think these are the only variant IDs used, but I'm not sure)
- There are 5 underlying sources, and sources can have different methods. 
    * there isn't information given to trace further into the source (like Uniprot's provenance for the assertion)

Below, there's the information shared between all edges with this combo of input/output-types:
* ingested predicate choice: ultimately, DisGeNET doesn't include "association type" or a predicate in their data dump. The [homepage](https://www.disgenet.org/home/) describes their resource as "collections of genes and variants associated to human diseases". **This is why an "associated with" relation is used, rather than the more specific has_biomarker/biomarker_of relationships.** 
    * has_biomarker/biomarker_of tend to carry a *causal* flavor and roughly map to "causes or contributes to condition" (an RO term). I'm not fully comfortable making that assertion for all DisGeNET data. 
* the biolink and ingested predicate are symmetrical so it works with both Disease -> Variant and Variant -> Disease. 

In [42]:
## stuff that is the same in all records 
basic_dvvd_edge = copy.deepcopy(metakg_top_level)

## PREDICATE
## actually too narrow, but there isn't the SIO term
basic_dvvd_edge["biolink_predicate"] = "related_to"  
basic_dvvd_edge["ingested_ontology_predicate"] = "SIO:001403"
basic_dvvd_edge["ingested_ontology_label"] = "SIO:is_associated_with"

## PROVENANCE
## all disease-variant combo edges have this second entry in the trace
## the method for the second entry differs
basic_dvvd_edge['traced_provenance'][0].update({"method":"ingest_consolidate"})
basic_dvvd_edge['traced_provenance'].append(
    {"name":"DisGeNET",
     "type":"knowledgebase",    ## assign this
     "version":"2020-05-07",     ## current as of 2020-10-27
     "method_ref":{"url":"https://www.disgenet.org/dbinfo#section12"}    ## assign this
    }  
)

## MEASURES: 4 
basic_dvvd_edge['numeric_measures'] = \
[
    {"name":"VDAscore",
     "standard_label":"association_score",  ## name Translator may want to use
     "range":"(0-1]", 
     "direction":{"more_confident":"larger"},
     "reference":{"url":"https://www.disgenet.org/dbinfo#section32"}}, 
    {"name":"EI",
     "standard_label":"evidence_index",  ## name Translator may want to use
     "range":"(0-1]", 
     "direction":{"more_confident":"larger"},
     "reference":{"url":"https://www.disgenet.org/dbinfo#section36"}}, 
    {"name":"DSI",
     "standard_label":"gene_specific_to_disease",  ## name Translator may want to use
     "range":"(0-1]",  ## ref claims min=0.25, but it should fluctuate based on db data
     "direction":{"more_specific":"larger"},
     "reference":{"url":"https://www.disgenet.org/dbinfo#section33"}}, 
    {"name":"DPI",
     "standard_label":"gene_specific_to_disease_class",  ## name Translator may want to use
     "range":"(0-1]",       ## ref claims min=1/29, but it would fluctuate if disease class changed
     "direction":{"more_specific":"smaller"},
     "reference":{"url":"https://www.disgenet.org/dbinfo#section34"}} 
]

The provenance then differs by the value in the source column of the original data file. For a given source, it is the same for Disease -> Variant and Variant -> Disease. 
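
The per-source cells all repeat the same steps: deep-copy the shared edge, set the DisGeNET-level method, and assign an origin. A small factory could capture that pattern. This is a sketch only: `make_source_edge` is a hypothetical helper, and `toy_base` is a stripped-down stand-in for `basic_dvvd_edge`.

```python
import copy

def make_source_edge(base_edge, disgenet_method, origin):
    """Return a new edge dict for one source without mutating base_edge."""
    edge = copy.deepcopy(base_edge)
    # index 1 is the DisGeNET entry in traced_provenance
    edge["traced_provenance"][1]["method"] = disgenet_method
    edge["origin"] = dict(origin)
    return edge

# toy stand-in for basic_dvvd_edge (assumption: only the provenance trace matters here)
toy_base = {
    "traced_provenance": [
        {"name": "MyDisease.info API", "method": "ingest_consolidate"},
        {"name": "DisGeNET"},
    ]
}
toy_edge = make_source_edge(toy_base, "ingest",
                            {"name": "GWASdb", "type": "knowledgebase"})
```

Because of the deep copy, `toy_base` is left unchanged and can be reused for the next source.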

#### By source: BEFREE

In [43]:
metakg_dvvd_BEFREE = copy.deepcopy(basic_dvvd_edge)

## PROVENANCE
## the DisGeNET dict
metakg_dvvd_BEFREE['traced_provenance'][1].update({"method":"NLP_BEFREE"})

## assign the origin
metakg_dvvd_BEFREE['origin'] = \
    {"name":"MEDLINE_abstracts",  
        "type":"publications",
        "version":"1970-01_to_2019-12"} 

In [44]:
metakg_dvvd_BEFREE

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section12'},
   'method': 'NLP_BEFREE'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'VDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section32'}},
  {'name': 'EI',
   'standard_label': 'evidence_index',
   'range': '(0-1]',
   'direction': {'more_confid

#### By source: CLINVAR

In [45]:
metakg_dvvd_CLINVAR = copy.deepcopy(basic_dvvd_edge)

## PROVENANCE
## the MyDisease.info dict
metakg_dvvd_CLINVAR['traced_provenance'][0].update({"method":"ingest_consolidate"})
## the DisGeNET dict
metakg_dvvd_CLINVAR['traced_provenance'][1].update({"method":"ingest"})

## assign the origin
metakg_dvvd_CLINVAR['origin'] = \
    {"name":"ClinVar",  
        "type":"knowledgebase"} 

In [46]:
metakg_dvvd_CLINVAR

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section12'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'VDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section32'}},
  {'name': 'EI',
   'standard_label': 'evidence_index',
   'range': '(0-1]',
   'direction': {'more_confident'

#### By source: GWASCAT

In [47]:
metakg_dvvd_GWASCAT = copy.deepcopy(basic_dvvd_edge)

## PROVENANCE
## the DisGeNET dict
metakg_dvvd_GWASCAT['traced_provenance'][1].update({"method":"ingest"})

## assign the origin
metakg_dvvd_GWASCAT['origin'] = \
    {"name":"NHGRI_EBI_GWAS_CATALOG",  
        "type":"knowledgebase"} 

In [48]:
metakg_dvvd_GWASCAT

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section12'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'VDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section32'}},
  {'name': 'EI',
   'standard_label': 'evidence_index',
   'range': '(0-1]',
   'direction': {'more_confident'

#### By source: GWASDB

In [49]:
metakg_dvvd_GWASDB = copy.deepcopy(basic_dvvd_edge)

## PROVENANCE
## the DisGeNET dict
metakg_dvvd_GWASDB['traced_provenance'][1].update({"method":"ingest"})

## assign the origin
metakg_dvvd_GWASDB['origin'] = \
    {"name":"GWASdb",  
        "type":"knowledgebase"} 

In [50]:
metakg_dvvd_GWASDB

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section12'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'VDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section32'}},
  {'name': 'EI',
   'standard_label': 'evidence_index',
   'range': '(0-1]',
   'direction': {'more_confident'

#### By source: UNIPROT

In [51]:
metakg_dvvd_UNIPROT = copy.deepcopy(basic_dvvd_edge)

## PROVENANCE
## the DisGeNET dict
metakg_dvvd_UNIPROT['traced_provenance'][1].update({"method":"ingest"})

## assign the origin
metakg_dvvd_UNIPROT['origin'] = \
    {"name":"UniProt",  
     "type":"knowledgebase",
     "taxon_subset":{'NCBITaxon:9606':'Homo_sapiens'}}  ## used human file 

In [52]:
metakg_dvvd_UNIPROT

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section12'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'VDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section32'}},
  {'name': 'EI',
   'standard_label': 'evidence_index',
   'range': '(0-1]',
   'direction': {'more_confident'

#### Finding edges with common attributes

Sets with same traced provenance (methods):
* CLINVAR, GWASCAT, GWASDB, UNIPROT have "ingest_consolidate" on the MyDisease.info level and "ingest" on the DisGeNET level. 

In [53]:
set_of_dvvd_edges = [metakg_dvvd_BEFREE, metakg_dvvd_CLINVAR, 
                     metakg_dvvd_GWASCAT, metakg_dvvd_GWASDB, 
                     metakg_dvvd_UNIPROT]

In [54]:
## `and` is the boolean operator; no line continuations are needed inside brackets
[(i['origin']['name'], i['origin'].get('taxon_subset'))
 for i in set_of_dvvd_edges
 if i['traced_provenance'][0]['method'] == "ingest_consolidate"
 and i['traced_provenance'][1]['method'] == "ingest"]

[('ClinVar', None),
 ('NHGRI_EBI_GWAS_CATALOG', None),
 ('GWASdb', None),
 ('UniProt', {'NCBITaxon:9606': 'Homo_sapiens'})]

## response edge examples and website url construction

### Note: the multiple-edge issue

If the hint module maps multiple DisGeNET IDs (UMLS diseases, RS numbers for variants) to one node, multiple edges will be returned from one source and the information from the edges won't be the same (ex: publications list, EI, website). 
* **EI is the problem since it wouldn't be accurate to just average them. See how it's calculated: https://www.disgenet.org/dbinfo#section36**
* the publication lists can be made into sets and merged
* the website can be a list for the multiple object IDs 
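
One way the merging described above might look: publications are unioned, websites are collected into a list, and the EI values are kept side by side instead of being averaged. This is a sketch under assumptions: `merge_edge_details` and the `EI_values` key are hypothetical, and the PMIDs, URLs and EI values in `toy_edges` are placeholders, not DisGeNET data.

```python
def merge_edge_details(edges):
    """Combine the per-ID edges from one source without averaging EI."""
    pmids = set()
    websites = []
    ei_values = []
    for edge in edges:
        # publication lists become one de-duplicated set
        pmids.update(edge.get("publications", {}).get("pmid", []))
        # websites stay a list, one entry per distinct url
        for url in edge.get("website", []):
            if url not in websites:
                websites.append(url)
        # EI is kept per edge, since an average would not be accurate
        ei_values.extend(m["value"] for m in edge.get("numeric_measures", [])
                         if m["name"] == "EI" and "value" in m)
    return {"publications": {"pmid": sorted(pmids)},
            "website": websites,
            "EI_values": ei_values}

# placeholder edges (assumption: values are made up for illustration)
toy_edges = [
    {"publications": {"pmid": ["111", "222"]},
     "website": ["https://example.org/edge-a"],
     "numeric_measures": [{"name": "EI", "value": 0.9}]},
    {"publications": {"pmid": ["222", "333"]},
     "website": ["https://example.org/edge-b"],
     "numeric_measures": [{"name": "EI", "value": 1.0}]},
]
merged = merge_edge_details(toy_edges)
```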

<br>

Ex: https://www.disgenet.org/browser/0/1/0/C0346153::C1861906/geneid__11200-source__BEFREE/_b./

This page shows the two IDs for breast cancer linked to the same gene (CHEK2) from the same source. However, the EI and publications list differ between the two IDs.   

- Breast cancer (MONDO:0016419) hint object mapped to two IDs C0346153 (familial breast cancer) and C1861906 (familial male breast cancer)
- the query for Disease -> Gene in BTE looks like this: API 6.1: https://mydisease.info/v1/query?fields=disgenet.genes_related_to_disease.gene_id (POST -d q=C0346153,C1861906&scopes=mondo.xrefs.umls, disgenet.xrefs.umls)
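
A minimal sketch of how that POST request could be assembled with the Python standard library. The URL, the `q` value and the scopes come from the note above; dropping the space after the comma in `scopes` is my assumption about what the API expects. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

umls_ids = ["C0346153", "C1861906"]  ## the two IDs from the hint object
url = "https://mydisease.info/v1/query?fields=disgenet.genes_related_to_disease.gene_id"

## form-encoded body, the equivalent of curl's -d option
body = urlencode({
    "q": ",".join(umls_ids),
    "scopes": "mondo.xrefs.umls,disgenet.xrefs.umls",
})
request = Request(url, data=body.encode("utf-8"), method="POST")

## urllib.request.urlopen(request) would actually send it; left out on purpose
```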

### Website url construction

- I can put multiple IDs into the subject field (this url has Alzheimer's Disease and Achondroplasia): https://www.disgenet.org/browser/0/1/2/C0002395::C0001080/
- **I can only put one ID into the object field** (this url is Alzheimer's Disease -> Diabetes Mellitus)
https://www.disgenet.org/browser/0/1/2/C0002395/diseaseid__C0011849-source__ALL/_b./ 

BUT, the BTE Hint module sometimes maps to multiple UMLS IDs for a disease:
- Breast cancer (MONDO:0016419) hint object mapped to two IDs: Malignant neoplasm of breast (C0006142) and familial breast cancer (C0346153)
- the query for Disease -> Gene in BTE looks like this: API 2.1: https://mydisease.info/v1/query?fields=disgenet.genes_related_to_disease.gene_id (POST -d q=C0006142,C0346153&scopes=mondo.xrefs.umls, disgenet.xrefs.umls)

So what if both the subject and object have multiple UMLS IDs in their Hint objects?

Kevin and I thought about returning a list of websites, one for each object UMLS ID. 

In [55]:
## website url construction: no node would actually have the two subject IDs, this is just an example
subject_ids = ['C0684249',  ## lung carcinoma
               'C0002395']  ## alzheimer's disease
object_ids = ['C0346153',  ## familial breast cancer 
              'C1861906']  ## familial male breast cancer

## list comprehension + string formatting + list -> string
website_urls = [
    "https://www.disgenet.org/browser/0/1/2/{0}/diseaseid__{1}-source__ALL/_b./".format(\
        "::".join(subject_ids), obj) \
    for obj in object_ids
]
website_urls

['https://www.disgenet.org/browser/0/1/2/C0684249::C0002395/diseaseid__C0346153-source__ALL/_b./',
 'https://www.disgenet.org/browser/0/1/2/C0684249::C0002395/diseaseid__C1861906-source__ALL/_b./']
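
The same construction is reused for every subject/object combination in this notebook, with only the browser path, the object key and the source changing. A sketch of a shared helper follows. `build_browser_urls` and the `VIEWS` table are hypothetical names; the path and key values are copied from the example URLs in this notebook and are not taken from any DisGeNET documentation.

```python
BROWSER_URL = ("https://www.disgenet.org/browser/{view}/{subjects}/"
               "{key}__{obj}-source__{source}/_b./")

## (subject type, object type) -> (browser path, object key), as seen in the example urls
VIEWS = {
    ("Disease", "Disease"): ("0/1/2", "diseaseid"),
    ("Disease", "Gene"):    ("0/1/1", "geneid"),
    ("Gene", "Disease"):    ("1/1/1", "diseaseid"),
    ("Disease", "Variant"): ("0/1/5", "snpid"),
    ("Variant", "Disease"): ("2/1/1", "diseaseid"),
}

def build_browser_urls(subject_type, object_type, subject_ids, object_ids, source="ALL"):
    """One url per object ID; all subject IDs are joined with '::'."""
    view, key = VIEWS[(subject_type, object_type)]
    subjects = "::".join(subject_ids)
    return [BROWSER_URL.format(view=view, subjects=subjects, key=key,
                               obj=obj, source=source)
            for obj in object_ids]

example_urls = build_browser_urls("Disease", "Gene",
                                  ["C0346153", "C1861906"], ["79728"],
                                  source="CLINGEN")
```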

### Disease -> Disease

Alzheimer's Disease -> Diabetes Mellitus: https://www.disgenet.org/browser/0/1/2/C0002395/diseaseid__C0011849-source__ALL/_b./ 

Edge-specific parts: 
* put specific website for provenance (put query disease into it)
* put the specific measure values (new key:value into dictionaries)

In [56]:
## first constructing the url
dd_subject_ids = ['C0002395']  ## alzheimer's disease
dd_object_ids = ['C0011849']  ## diabetes

## constructing website field: list comprehension + string formatting + list -> string
dd_website_urls = [
    "https://www.disgenet.org/browser/0/1/2/{0}/diseaseid__{1}-source__ALL/_b./".format(\
        "::".join(dd_subject_ids), obj) \
    for obj in dd_object_ids
]
dd_website_urls

['https://www.disgenet.org/browser/0/1/2/C0002395/diseaseid__C0011849-source__ALL/_b./']

In [57]:
## the entry
result_dd = copy.deepcopy(metakg_dd)

result_dd['website'] = dd_website_urls
result_dd['numeric_measures'][0]['value'] = 1418  ## this is Ngenes / num_shared_genes
result_dd['numeric_measures'][1]['value'] = 0.30  ## this is jaccard_genes / jaccard_shared_genes
result_dd['numeric_measures'][2]['value'] = 61  ## this is Nvariants / num_shared_variants
result_dd['numeric_measures'][3]['value'] = 0.023  ## this is jaccard_variant / jaccard_shared_variants
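
Assigning values by list position works only as long as the order of `numeric_measures` never changes. A possible alternative is to look each measure up by its `name`. This is a sketch: `set_measure_values` is a hypothetical helper, and `toy_edge` holds just two of the measures.

```python
def set_measure_values(edge, values_by_name):
    """Attach a 'value' to each named measure; fail loudly on unknown names."""
    by_name = {measure["name"]: measure for measure in edge["numeric_measures"]}
    for name, value in values_by_name.items():
        if name not in by_name:
            raise KeyError("no numeric measure named {!r}".format(name))
        by_name[name]["value"] = value
    return edge

## toy stand-in for a deep copy of the metaKG edge (assumption: two measures only)
toy_edge = {"numeric_measures": [{"name": "Ngenes"}, {"name": "jaccard_genes"}]}
set_measure_values(toy_edge, {"jaccard_genes": 0.30, "Ngenes": 1418})
```

The dictionaries in `by_name` are the same objects as in the list, so the edge is updated in place.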

In [58]:
result_dd

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'similar_to',
 'ingested_ontology_predicate': 'SIO:000736',
 'ingested_ontology_label': 'SIO:is_comparable_to',
 'origin': {'name': 'DisGeNET',
  'type': 'knowledgebase',
  'version': '2020-05-07',
  'method': 'association_from_shared_annot'},
 'numeric_measures': [{'name': 'Ngenes',
   'standard_label': 'num_shared_genes',
   'range': '[0-inf)',
   'direction': {'more_confident': 'larger'},
   'value': 1418},
  {'name': 'jaccard_genes',
   'standard_label': 'jaccard_shared_genes',
   'range': '[0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section13'},
   'value

### Disease -> Gene

Breast cancer (MONDO:0016419, C0346153 (familial breast cancer) and C1861906 (familial male breast cancer))     
-> PALB2 (gene)    
edge from CLINGEN: https://www.disgenet.org/browser/0/1/1/C0346153::C1861906/geneid__79728-source__CLINGEN/_b./

(for some measures, have to go to other tab in browser: https://www.disgenet.org/browser/0/1/0/C0346153::C1861906/geneid__79728-source__CLINGEN/_b./)

For the website url...
* I'm not really worried about one gene having multiple NCBIGene IDs...so I'm leaving it as a 1 or more disease IDs -> 1 gene ID     
* **to show all sources (multiple edges' provenance), put ALL into the source field of the url**. Ex: https://www.disgenet.org/browser/0/1/1/C0346153::C1861906/geneid__79728-source__ALL/_b./  

Edge-specific parts: 
- put specific publications for provenance
- put specific website for provenance (put query disease into it)
- put the specific measure values (new key:value into dictionaries)

In [59]:
## website url construction: this could be a real edge
dg_subject_ids = ['C0346153',  ## familial breast cancer 
              'C1861906']  ## familial male breast cancer
dg_object_ids = ['79728']  ## gene PALB2, can get from hint object's 'NCBIGene' value
source = 'CLINGEN'   ## this is the parameter value DisGeNET uses for query/ provenance 

dg_website_urls = [
    "https://www.disgenet.org/browser/0/1/1/{0}/geneid__{1}-source__{2}/_b./".format(\
        "::".join(dg_subject_ids), obj, source) \
    for obj in dg_object_ids
]
dg_website_urls

['https://www.disgenet.org/browser/0/1/1/C0346153::C1861906/geneid__79728-source__CLINGEN/_b./']

In [60]:
## the entry
result_dg = copy.deepcopy(metakg_dggd_CLINGEN)
result_dg['website'] = dg_website_urls
result_dg['publications'] = {'pmid':['28319063', '25959805', '25225577',
                                     '23657012', '24136930', '21285249',
                                     '19383810', '17287723', '16793542']}

result_dg['numeric_measures'][0]['value'] = 0.7  ## this is GDAscore / association_score
result_dg['numeric_measures'][1]['value'] = 0.944  ## this is EI / evidence_index
result_dg['numeric_measures'][2]['value'] = 0.485  ## this is DSI / gene_specific_to_disease
result_dg['numeric_measures'][3]['value'] = 0.769  ## this is DPI / gene_specific_to_disease_class

In [61]:
result_dg

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'traced_provenance': [{'name': 'MyDisease.info API',
   'type': 'service',
   'version': '2020-10-26',
   'method': 'ingest_consolidate'},
  {'name': 'DisGeNET',
   'type': 'knowledgebase',
   'version': '2020-05-07',
   'method_ref': {'url': 'https://www.disgenet.org/dbinfo#section11'},
   'method': 'ingest'}],
 'numeric_measures_present': True,
 'categorical_measures_present': False,
 'biolink_predicate': 'related_to',
 'ingested_ontology_predicate': 'SIO:001403',
 'ingested_ontology_label': 'SIO:is_associated_with',
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'},
   'value': 0.7},
  {'name': 'EI',
   'standa

### Gene -> Disease

APP (gene) -> Alzheimer's Disease (C0002395)     
edge from RGD: https://www.disgenet.org/browser/1/1/1/351/diseaseid__C0002395-source__RGD/_b./

(for some measures, have to click on magnifying glass next to gene name)

For the website url...
* **to show all sources (multiple edges' provenance), put ALL into the source field of the url**. Ex: https://www.disgenet.org/browser/1/1/1/351/diseaseid__C0002395-source__ALL/_b./  

Edge-specific parts: 
- put specific publications for provenance
- put specific website for provenance (put query gene into it)
- put the specific measure values (new key:value into dictionaries)

In [62]:
## website url construction: this could be a real edge
gd_subject_ids = ['351']  ## gene APP, can get from hint object's 'NCBIGene' value
gd_object_ids = ['C0002395']  ## Alzheimer's Disease
source = 'RGD'   ## this is the parameter value DisGeNET uses for query/ provenance 

gd_website_urls = [
    "https://www.disgenet.org/browser/1/1/1/{0}/diseaseid__{1}-source__{2}/_b./".format(\
        "::".join(gd_subject_ids), obj, source) \
    for obj in gd_object_ids
]
gd_website_urls

['https://www.disgenet.org/browser/1/1/1/351/diseaseid__C0002395-source__RGD/_b./']

In [63]:
## the entry
result_gd = copy.deepcopy(metakg_dggd_RGD)
result_gd['website'] = gd_website_urls
result_gd['publications'] = {'pmid':['30066400', '29174383', '29568075']}

result_gd['numeric_measures'][0]['value'] = 0.9  ## this is GDAscore / association_score
result_gd['numeric_measures'][1]['value'] = 0.981  ## this is EI / evidence_index
result_gd['numeric_measures'][2]['value'] = 0.422  ## this is DSI / gene_specific_to_disease
result_gd['numeric_measures'][3]['value'] = 0.846  ## this is DPI / gene_specific_to_disease_class

In [64]:
result_gd['numeric_measures']

[{'name': 'GDAscore',
  'standard_label': 'association_score',
  'range': '(0-1]',
  'direction': {'more_confident': 'larger'},
  'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'},
  'value': 0.9},
 {'name': 'EI',
  'standard_label': 'evidence_index',
  'range': '(0-1]',
  'direction': {'more_confident': 'larger'},
  'reference': {'url': 'https://www.disgenet.org/dbinfo#section36'},
  'value': 0.981},
 {'name': 'DSI',
  'standard_label': 'gene_specific_to_disease',
  'range': '(0-1]',
  'direction': {'more_specific': 'larger'},
  'reference': {'url': 'https://www.disgenet.org/dbinfo#section33'},
  'value': 0.422},
 {'name': 'DPI',
  'standard_label': 'gene_specific_to_disease_class',
  'range': '(0-1]',
  'direction': {'more_specific': 'smaller'},
  'reference': {'url': 'https://www.disgenet.org/dbinfo#section34'},
  'value': 0.846}]
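
Since each measure carries a `range` string such as `'(0-1]'` or `'[0-inf)'`, the filled-in values could be checked against it. This is a sketch: `value_in_range` is a hypothetical helper, and it assumes the range always has the form bracket, lower bound, `-`, upper bound, bracket with a non-negative lower bound.

```python
def value_in_range(range_text, value):
    """Check a value against an interval string like '(0-1]' or '[0-inf)'."""
    lower_open = range_text[0] == "("    ## '(' excludes the lower bound
    upper_open = range_text[-1] == ")"   ## ')' excludes the upper bound
    low_text, high_text = range_text[1:-1].split("-", 1)
    low, high = float(low_text), float(high_text)  ## float("inf") is valid
    above = value > low if lower_open else value >= low
    below = value < high if upper_open else value <= high
    return above and below

checks = [
    value_in_range("(0-1]", 0.9),      ## GDAscore from the example above
    value_in_range("(0-1]", 0),        ## 0 is excluded by '('
    value_in_range("[0-inf)", 1418),   ## a count-style measure
]
```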

### Disease -> Variant 

Alzheimer's Disease -> rs75932628 from GWASCAT: https://www.disgenet.org/browser/0/1/5/C0002395/snpid__rs75932628-source__GWASCAT/_b./ 

(for some measures, have to go to other tab in browser: https://www.disgenet.org/browser/0/1/4/C0002395/snpid__rs75932628-source__GWASCAT/_b./)


Edge-specific parts: 
- put specific publications info
- put specific website for provenance (put query disease into it)
- put the specific measure values (new key:value into dictionaries)

In [65]:
## website url construction: this could be a real edge
dv_subject_ids = ['C0002395']  ## Alzheimer's Disease
dv_object_ids = ['rs75932628']  ## variant, can get from hint object's 'DBSNP' value
source = 'GWASCAT'   ## this is the parameter value DisGeNET uses for query/ provenance 

dv_website_urls = [
    "https://www.disgenet.org/browser/0/1/5/{0}/snpid__{1}-source__{2}/_b./".format(\
        "::".join(dv_subject_ids), obj, source) \
    for obj in dv_object_ids
]
dv_website_urls

['https://www.disgenet.org/browser/0/1/5/C0002395/snpid__rs75932628-source__GWASCAT/_b./']

In [66]:
## the entry
result_dv = copy.deepcopy(metakg_dvvd_GWASCAT)
result_dv['website'] = dv_website_urls
result_dv['publications'] = {'pmid':['23150908']}


result_dv['numeric_measures'][0]['value'] = 0.9  ## this is VDAscore / association_score
result_dv['numeric_measures'][1]['value'] = 0.923  ## this is EI / evidence_index
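result_dv['numeric_measures'][2]['value'] = 0.662  ## this is DSI; its standard_label in basic_dvvd_edge is 'gene_specific_to_disease', not variant_specific_to_disease
result_dv['numeric_measures'][3]['value'] = 0.480  ## this is DPI; its standard_label in basic_dvvd_edge is 'gene_specific_to_disease_class', not variant_specific_to_disease_class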

In [67]:
result_dv['numeric_measures']

[{'name': 'VDAscore',
  'standard_label': 'association_score',
  'range': '(0-1]',
  'direction': {'more_confident': 'larger'},
  'reference': {'url': 'https://www.disgenet.org/dbinfo#section32'},
  'value': 0.9},
 {'name': 'EI',
  'standard_label': 'evidence_index',
  'range': '(0-1]',
  'direction': {'more_confident': 'larger'},
  'reference': {'url': 'https://www.disgenet.org/dbinfo#section36'},
  'value': 0.923},
 {'name': 'DSI',
  'standard_label': 'gene_specific_to_disease',
  'range': '(0-1]',
  'direction': {'more_specific': 'larger'},
  'reference': {'url': 'https://www.disgenet.org/dbinfo#section33'},
  'value': 0.662},
 {'name': 'DPI',
  'standard_label': 'gene_specific_to_disease_class',
  'range': '(0-1]',
  'direction': {'more_specific': 'smaller'},
  'reference': {'url': 'https://www.disgenet.org/dbinfo#section34'},
  'value': 0.48}]

### Variant -> Disease

rs75932628 -> Alzheimer's Disease from GWASDB: https://www.disgenet.org/browser/2/1/1/rs75932628/0/25/diseaseid__C0002395-source__GWASDB/_b./

(for some measures, have to go to other tab in browser: https://www.disgenet.org/browser/2/1/0/rs75932628/diseaseid__C0002395-source__GWASDB/_b./)


Edge-specific parts: 
- put specific publications info
- put specific website for provenance (put query variant into it)
- put the specific measure values (new key:value into dictionaries)

In [68]:
## website url construction: this could be a real edge
vd_subject_ids = ['rs75932628']  ## variant, can get from hint object's 'DBSNP' value
vd_object_ids = ['C0002395']  ## Alzheimer's Disease
source = 'GWASDB'   ## this is the parameter value DisGeNET uses for query/ provenance 

## the object here is a disease, so the key is diseaseid__ (not snpid__)
vd_website_urls = [
    "https://www.disgenet.org/browser/2/1/1/{0}/diseaseid__{1}-source__{2}/_b./".format(\
        "::".join(vd_subject_ids), obj, source) \
    for obj in vd_object_ids
]
vd_website_urls

['https://www.disgenet.org/browser/2/1/1/rs75932628/diseaseid__C0002395-source__GWASDB/_b./']

In [69]:
## the entry
result_vd = copy.deepcopy(metakg_dvvd_GWASDB)


result_vd['website'] = vd_website_urls
result_vd['publications'] = {'pmid':['23150908']}


result_vd['numeric_measures'][0]['value'] = 0.9  ## this is VDAscore / association_score
result_vd['numeric_measures'][1]['value'] = 0.923  ## this is EI / evidence_index
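result_vd['numeric_measures'][2]['value'] = 0.662  ## this is DSI; its standard_label in basic_dvvd_edge is 'gene_specific_to_disease', not variant_specific_to_disease
result_vd['numeric_measures'][3]['value'] = 0.480  ## this is DPI; its standard_label in basic_dvvd_edge is 'gene_specific_to_disease_class', not variant_specific_to_disease_class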

In [70]:
result_vd['origin']

{'name': 'GWASdb', 'type': 'knowledgebase'}