# Metadata Presentation

The goal: present the work done so far to...
* explore the current metadata situation for ARA -> KP association retrieval 
* develop a structure with augmented metadata to support "reasoning" and user needs:
    * whether two associations stem from the same "source": this involves tracing the information further back than just the API it was retrieved from
    * whether numeric/categorical measures of the association are present and info on them (range, direction, reference)
    * websites/publications to view the association and more info
* write examples of metaKG edges and response information augmented with the example above, and code showing how the structure can be used   

<br>

This notebook uses the MyDisease.info DisGeNET Disease -> Gene associations as the main example. 

In [8]:
## dependencies

## CX: allows multiple lines of code to print from one code block
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import copy  ## for deepcopy

from biothings_explorer.smartapi_kg import MetaKG

## Current metadata situation

First, we'll look at the current metadata available to a person using BTE, the SmartAPI Registry/metaKG, and the Translator API responses. 

Below, there's the MyDisease.info DisGeNET Disease -> Gene associations in the metaKG edge accessed via Python. Note that the entry currently isn't fully updated to match Kevin's 10/26 update of the underlying data/API. 

I notice:
* It has great info on how to query the API (request setup, support batch)
* **But there's information missing that could be helpful**:
    * is this predicate a biolink predicate, or something from an ontology? (PREDICATE-RELATED)
    * what is mydisease.info API and what did it do with the knowledge? (PROVENANCE-RELATED)
    * what is disgenet and what did it do with the knowledge? did it get its knowledge from somewhere else? (PROVENANCE-RELATED)

In [9]:
kg = MetaKG()
kg.constructMetaKG(source="remote")
current_metaKG = kg.filter({"api_name": "mydisease.info API",
           'source': 'disgenet',
           "input_type": "Disease", 
           "output_type": "Gene"})[0]
current_metaKG

{'inputSeparator': ',',
 'inputs': [{'id': 'UMLS', 'semantic': 'Disease'}],
 'outputs': [{'id': 'NCBIGene', 'semantic': 'Gene'}],
 'parameters': {'fields': 'disgenet.genes_related_to_disease.gene_id'},
 'predicate': 'related_to',
 'requestBody': {'body': {'q': '{inputs[0]}',
   'scopes': 'mondo.xrefs.umls, disgenet.xrefs.umls'},
  'header': 'application/x-www-form-urlencoded'},
 'response_mapping': {'related_to': {'NCBIGene': 'disgenet.genes_related_to_disease.gene_id'}},
 'source': 'disgenet',
 'supportBatch': True,
 'query_operation': {'server': 'http://mydisease.info/v1',
  'params': {'fields': 'disgenet.genes_related_to_disease.gene_id'},
  'request_body': {'body': {'q': '{inputs[0]}',
    'scopes': 'mondo.xrefs.umls, disgenet.xrefs.umls'},
   'header': 'application/x-www-form-urlencoded'},
  'path': '/query',
  'path_params': [],
  'method': 'post',
  'tags': ['disease', 'annotation', 'query', 'translator', 'biothings'],
  'supportBatch': True,
  'inputSeparator': ','},
 'associat

So...what about the information that is returned in a response? See http://mydisease.info/v1/disease/MONDO:0016419?fields=disgenet and the copy of the first genes_related_to_disease entry in the JSON/dictionary below. 

Note: the response is parsed automatically with the code in BTE (x-bte extensions?)     
MONDO:0016419 is "hereditary breast carcinoma"

In [10]:
## formatted to be easier to read
current_response_ex = \
{
    "DPI": 0.846,
    "DSI": 0.536,
    "EI": 0.917,
    "YearFinal": 2019,
    "YearInitial": 1998,
    "gene_id": 9,
    "gene_name": "NAT1",
    "pubmed": [9610785,10090301,10698485,12835615,12860276,
               14517345,15084249,15090724,15226672,16049806,
               17010218,17973251,18288399,22333393,24467436,
               25528056,27648926,28359264,29315819,29339455,
               29901116,29964355,29969986,31358821
              ],
    "score": 0.1,
    "source": "BEFREE"
}

So...Observation 1: just looking at the metaKG, one wouldn't know that all this info is available. Maybe some of it could be exposed on the metaKG level so users would know about it beforehand. 
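
To make Observation 1 concrete, here is a minimal sketch that compares the fields the metaKG edge advertises (through its `response_mapping`) with the fields that actually come back in a record. The dictionaries are trimmed copies of the examples above; the names `advertised_fields` and `unadvertised` are made up for illustration and are not part of BTE.

```python
## trimmed copies of the metaKG edge and the response record shown above
metakg_response_mapping = {
    'related_to': {'NCBIGene': 'disgenet.genes_related_to_disease.gene_id'}
}
response_record = {
    "DPI": 0.846, "DSI": 0.536, "EI": 0.917,
    "YearFinal": 2019, "YearInitial": 1998,
    "gene_id": 9, "gene_name": "NAT1",
    "pubmed": [9610785, 10090301],  ## shortened for the example
    "score": 0.1, "source": "BEFREE"
}

## assumption: the last part of each dotted path is the field name inside the record
advertised_fields = {
    path.split(".")[-1]
    for mapping in metakg_response_mapping.values()
    for path in mapping.values()
}

## fields that are returned but that the metaKG edge never mentions
unadvertised = sorted(set(response_record) - advertised_fields)
unadvertised
```

In this example only `gene_id` is advertised; every measure, the source tag and the publications are invisible from the metaKG side.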

Observation 2: I still have a lot of questions to be able to make use of this info...
* What are DPI, DSI, EI, and score? I have no idea what they measure, what their range is, how they were calculated, what a "good" number is, whether I can treat them as probabilities in some meta-measure, etc. [MEASURES]
* What is this source "BEFREE"? Is this a part of DisGeNET? And how can I tell as an ARA if another API response result is actually using the same underlying resource, when it may use a different field / spelling / name? [PROVENANCE-RELATED]
    * Spoiler: Turns out that BEFREE is a shorthand "source" tag for a large corpus of associations made by the team developing DisGeNET using the NLP method BeFree on MEDLINE abstract data. So the "real" underlying source is MEDLINE abstracts...

And there are further questions floating around from work done with other Translator teams and discussions within Translator:
* what if an association is specific to a particular context / is only relevant in a specific context? [CONTEXT/RELEVANCE]
    * An example is associations found in particular cell-lines (Multiomics KP), organisms, cohorts (multiple Provider teams)
    * support is needed to describe them, at least on the level needed for querying / reasoning.  
* what if there are categorical measures of the association (ex: "strong", "likely")? This was brought up in context of Text Mining Provider. [MEASURES]
* what if an ARA method requires numeric values on the edges (and what if they need specific measures like a probability)? [MEASURES]
* how was an association actually made? do we even know this? how do we describe it? [PROVENANCE-RELATED]
* are we conflating node types when we make this association? which ones? on what level (which resource)? [PROVENANCE-RELATED]
* can I visit a website to see the association / resource? It would really help the user "double-check" and learn more about the result, helping with confidence in the tool. [PROVENANCE, USER NEED]
* what if the publication IDs aren't PMIDs? [PROVENANCE]
* what Translator team contributed to what API? [PROVENANCE, INTERNAL USE]

## Potential solution setup

So above, a bunch of missing information was described and questions were posed. How could we address these?

Well, we already have the SmartAPI Registry, metaKG, x-BTE extensions / BTE tooling to query and parse...so maybe we could just augment the info that's there. 
* put more fields/slots in the metaKG edge JSON
* in the code to parse the response JSON, add the slots from the metaKG and perhaps more. 
    * Response-specific slots could include the publications and website information - to take the user right to the supporting information for a specific association. 
    * Website info could come from parsing the specific association, then building the specific website url. 

<br>

We could go with the following guidelines/considerations:
* one metaKG edge per unique input-type/output-type/path-of-provenance/biolink-predicate combo. 
    * one biolink-predicate per edge means simpler mapping to provenance/outside-predicates. 
    * Describing one path of provenance can be done in a relatively flat way. 
    * VS describing branched-chains of provenance for one edge sounds like building a graph on top of an edge...and that doesn't sound like the easiest to query / reason on. 
* as simple/flat as possible, while providing enough information 
    * easy to query / parse
    * describe provenance/measures on a surface/simple level - just enough to tell whether it is the same or different from stuff on another edge
    * doesn't describe all fields of the response JSON, just the ones that are likely to be important for ARAs querying or reasoning 
    * follow the provenance back as far as one reasonably can **computationally**. Don't go too far beyond what the API actually provides
* flexible enough to handle various kinds of knowledge sources and missing info
* keep in mind the costs of updating:
    * we don't want something so brittle/intricate that a lot needs to be updated manually with every little change...
    * if manual, ideally make it a "once-and-done" where it only changes if something major changes in data-modeling/structure/methods of underlying resource
    * ideally some of this can be done computationally (ex: setting up an endpoint that provides all the info for augmentation). But we don't control every API or underlying resource...

## New work: documentation

The Setup tab here: https://docs.google.com/spreadsheets/d/1Xnib2FOSSUIK6e22xO6MrGTopVIGJ8kiCCENIVZBIgo/edit?usp=sharing (anyone with link can comment, need to be added to edit)
* name, Python object type, whether it exists on metaKG / response level, description, examples

<br>

Metadata Notes: https://docs.google.com/document/d/1c_0o5YwGSaNaWFjcIelz_PPALsbECSLy3JFnlKoQUcg/edit?usp=sharing (anyone with link can comment, need to be added to edit)
* more detailed description
* notes on formatting (ex: names) and values put in certain fields so far (where we may want to later restrict the allowed values to a set)

## New work: examples

### metaKG edge 

Let's return to the MyDisease.info DisGeNET Disease -> Gene associations.   

I noticed that 
1. DisGeNET had **17** unique source values for this type of association and **4** unique measures of the association. 
2. Kevin had to update DisGeNET within MyDisease.info to expose the source info and provide separate associations for each source. 
3. The source and measure values alone were not enough to describe the provenance and other information. So I read up on this resource to fill the fields described in the documentation above. 

So...let's pick one path of provenance (since there would be one metaKG edge per unique input-type/output-type/path-of-provenance/biolink-predicate combo): the case where the DisGeNET data field named source has the value "BEFREE". This is the same source as in the example of the current response metadata above. 

First, there is metadata information that is the same, regardless of the source because it is common to all edges coming from DisGeNET:

In [11]:
## stuff that is the same in all records 
basic_dggd_edge = {
    ## provenance-related  
    "translator_group":["Service_Provider"],  ## assign this
    "nodes_conflated": 
        {'Disease':
            {'conflated_type':'PhenotypicFeature', 'where':'DisGeNET'},
        'Gene':
            {'conflated_type':'GeneProduct', 'where':'DisGeNET'}
        },    
    ## measure-related                  
    "numeric_measures_present":True,      ## assign this  
    "numeric_measures":      ## there are 4
        [{"name":"GDAscore",
          "standard_label":"association_score",  ## name Translator may want to use
          "range":"(0-1]", 
          "direction":{"more_confident":"larger"},
          "reference":{"url":"https://www.disgenet.org/dbinfo#section31"}}, 
        {"name":"EI",
          "standard_label":"evidence_index",  ## name Translator may want to use
          "range":"(0-1]", 
          "direction":{"more_confident":"larger"},
          "reference":{"url":"https://www.disgenet.org/dbinfo#section36"}}, 
        {"name":"DSI",
          "standard_label":"gene_specific_to_disease",  ## name Translator may want to use
          "range":"(0-1]",  ## ref claims min=0.25, but it should fluctuate based on db data
          "direction":{"more_specific":"larger"},
          "reference":{"url":"https://www.disgenet.org/dbinfo#section33"}}, 
        {"name":"DPI",
          "standard_label":"gene_specific_to_disease_class",  ## name Translator may want to use
          "range":"(0-1]",       ## ref claims min=1/29, but it would fluctuate if disease class changed
          "direction":{"more_specific":"smaller"},
          "reference":{"url":"https://www.disgenet.org/dbinfo#section34"}} 
        ],
    "categorical_measures_present":False         
}
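
Since the "range" slot is stored as a string like "(0-1]", a consumer would need to parse it before checking values against it. The following is only a sketch under assumptions: `parse_range` and `in_range` are hypothetical helper names, and the parser only handles non-negative bounds written in the "(low-high]" style used above.

```python
import re

def parse_range(range_str):
    """Parse an interval string like "(0-1]" into (low, high, low_inclusive, high_inclusive)."""
    match = re.fullmatch(r'([\[(])\s*([\d.]+)\s*-\s*([\d.]+)\s*([\])])', range_str)
    if match is None:
        raise ValueError("unrecognized range format: {0}".format(range_str))
    open_bracket, low, high, close_bracket = match.groups()
    ## square brackets mean the bound is included, round brackets mean it is excluded
    return float(low), float(high), open_bracket == '[', close_bracket == ']'

def in_range(value, range_str):
    """Check whether a value falls inside the interval described by range_str."""
    low, high, low_inclusive, high_inclusive = parse_range(range_str)
    above = value >= low if low_inclusive else value > low
    below = value <= high if high_inclusive else value < high
    return above and below

## the GDAscore from the earlier response example (0.1) is inside "(0-1]", but 0 is not
in_range(0.1, "(0-1]"), in_range(0, "(0-1]")
```

An alternative design would be to store the bounds as separate numeric fields, which avoids string parsing at the cost of a slightly less flat record.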

Then there is provenance and predicate info specific to the BEFREE edges. 

After reviewing some examples of the DisGeNET data, I decided that the edges labeled BEFREE fit the more general "related_to" biolink predicate. This predicate is symmetrical, so it can be used for both the Disease -> Gene and Gene -> Disease metaKG/response edges. 

In [12]:
metakg_dggd_BEFREE = copy.deepcopy(basic_dggd_edge)

## predicate-related
metakg_dggd_BEFREE['biolink_predicate'] = "related_to"  ## assign this, no predicate in datafile

## provenance-related
## can get most from endpoint http://mydisease.info/v1/metadata 
metakg_dggd_BEFREE['traced_provenance'] =  \
    [{"name":"MyDisease.info API",
        "type":"service",  ## assign this
        "version":"2020-11-01",    ## made-up, to reflect change
        "method":"ingest_consolidate"},  ## assign this
     {"name":"DisGeNET",
        "type":"knowledgebase",    ## assign this
        "version":"v7",     ## made-up, to reflect change
        "method":"NLP_BEFREE",
        "method_ref":{"url":"https://www.disgenet.org/dbinfo#section11"}}   ## assign this
    ]  

metakg_dggd_BEFREE['origin'] = \
    {"name":"MEDLINE_abstracts",  
        "type":"publications",
        "version":"1970-01_to_2019-12"}  

So the augmented data on the metaKG edge for MyDisease.info DisGeNET Disease -> Gene associations, DisGeNET source field = "BEFREE" (one path of provenance) is:    

In [13]:
metakg_dggd_BEFREE

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'numeric_measures_present': True,
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
  {'name': 'EI',
   'standard_label': 'evidence_index',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section36'}},
  {'name': 'DSI',
   'standard_label': 'gene_specific_to_disease',
   'range': '(0-1]',
   'direction': {'more_specific': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section33'}},
  {'name': 'DPI',
   'standard_label': 'gene_specific_to_disease_class',
   'range': '(0-1]',
   'direction': {'more_specific': 'smaller'

#### metaKG edge queries

With this record, an ARA could now query the metaKG to see if this edge has the info it wants for reasoning. A user could also find some basic provenance and measure information. 

In [20]:
## get the numeric measure info related to confidence in the association, if it exists
confident = []  ## stays empty if the edge has no numeric measures
if metakg_dggd_BEFREE['numeric_measures_present']:
    confident = [i
                 for i in metakg_dggd_BEFREE['numeric_measures']
                 if any('confident' in key for key in i['direction'])]
confident

[{'name': 'GDAscore',
  'standard_label': 'association_score',
  'range': '(0-1]',
  'direction': {'more_confident': 'larger'},
  'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'}},
 {'name': 'EI',
  'standard_label': 'evidence_index',
  'range': '(0-1]',
  'direction': {'more_confident': 'larger'},
  'reference': {'url': 'https://www.disgenet.org/dbinfo#section36'}}]
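
The same kind of query could be wrapped in a small helper so an ARA can ask for any kind of measure (confidence, specificity, ...). This is a sketch, not BTE code: `measures_by_direction` is a hypothetical name, and `example_edge` is a trimmed stand-in for the augmented metaKG edge above.

```python
def measures_by_direction(edge, keyword):
    """Return the numeric measures whose direction keys mention the keyword."""
    if not edge.get('numeric_measures_present'):
        return []
    return [m for m in edge.get('numeric_measures', [])
            if any(keyword in key for key in m['direction'])]

## trimmed stand-in for the augmented metaKG edge
example_edge = {
    'numeric_measures_present': True,
    'numeric_measures': [
        {'name': 'GDAscore', 'direction': {'more_confident': 'larger'}},
        {'name': 'EI', 'direction': {'more_confident': 'larger'}},
        {'name': 'DSI', 'direction': {'more_specific': 'larger'}},
        {'name': 'DPI', 'direction': {'more_specific': 'smaller'}},
    ],
}

## names of the measures that describe specificity
[m['name'] for m in measures_by_direction(example_edge, 'specific')]
```

Returning an empty list when no numeric measures are present keeps the calling code free of special cases.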

In [28]:
## did this info come from text/publications? and which ones?
## ideally, text/publication would only be in the origin slot

if metakg_dggd_BEFREE['origin']['type'] in ['text', 'publications']:
    print("{0}, {1}".format(metakg_dggd_BEFREE['origin']['name'], metakg_dggd_BEFREE['origin'].get('version')))

MEDLINE_abstracts, 1970-01_to_2019-12
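
For a user-facing view, the origin and traced_provenance slots could also be combined into one readable path. A minimal sketch, assuming the traced_provenance list is ordered from the queried API back toward the origin (as in the example edge); `provenance_path` is a hypothetical helper and the dictionaries are trimmed copies of the slots defined earlier.

```python
## trimmed copies of the provenance slots from the BEFREE edge
traced_provenance = [
    {"name": "MyDisease.info API", "type": "service", "method": "ingest_consolidate"},
    {"name": "DisGeNET", "type": "knowledgebase", "method": "NLP_BEFREE"},
]
origin = {"name": "MEDLINE_abstracts", "type": "publications"}

def provenance_path(traced_provenance, origin):
    """Return the path as a string, from the origin to the API that was queried."""
    ## assumption: the list runs from the API back toward the origin, so reverse it
    steps = ["{0} ({1})".format(step["name"], step["method"])
             for step in reversed(traced_provenance)]
    return " -> ".join([origin["name"]] + steps)

provenance_path(traced_provenance, origin)
```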


In [30]:
## get the methods (and method info if available) used to create the association for each resource
[(i['name'], i['method'], i.get('method_ref')) for i in metakg_dggd_BEFREE['traced_provenance']]

[('MyDisease.info API', 'ingest_consolidate', None),
 ('DisGeNET',
  'NLP_BEFREE',
  {'url': 'https://www.disgenet.org/dbinfo#section11'})]
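
One of the goals was to tell whether two associations stem from the same "source". With the origin and traced_provenance slots, a first-pass check could look like the sketch below. The helper names are hypothetical, the second edge is entirely made up for illustration, and exact string matching only works if resource names are standardized.

```python
def same_origin(edge_a, edge_b):
    """True if both edges name the same origin (name and type), ignoring version."""
    key_a = (edge_a['origin']['name'], edge_a['origin']['type'])
    key_b = (edge_b['origin']['name'], edge_b['origin']['type'])
    return key_a == key_b

def shared_resources(edge_a, edge_b):
    """Names of resources that appear in the traced provenance of both edges."""
    names_a = {step['name'] for step in edge_a['traced_provenance']}
    names_b = {step['name'] for step in edge_b['traced_provenance']}
    return sorted(names_a & names_b)

## trimmed copy of the BEFREE edge's provenance
befree_edge = {
    'traced_provenance': [{'name': 'MyDisease.info API'}, {'name': 'DisGeNET'}],
    'origin': {'name': 'MEDLINE_abstracts', 'type': 'publications'},
}
## made-up second edge: a different (hypothetical) API that also mined MEDLINE abstracts
other_edge = {
    'traced_provenance': [{'name': 'Hypothetical Text Mining API'}],
    'origin': {'name': 'MEDLINE_abstracts', 'type': 'publications'},
}

same_origin(befree_edge, other_edge), shared_resources(befree_edge, other_edge)
```

Here the two edges share no intermediate resource but do share an origin, which is the situation an ARA would want to detect before treating them as independent evidence.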

### specific response edge

So what about a specific edge from a query's response? Above, we saw an example of the following edge:

"hereditary breast carcinoma" (MONDO:0016419) -> NAT1 (NCBIGene:9)

As a reminder, this was the raw API response

In [31]:
current_response_ex

{'DPI': 0.846,
 'DSI': 0.536,
 'EI': 0.917,
 'YearFinal': 2019,
 'YearInitial': 1998,
 'gene_id': 9,
 'gene_name': 'NAT1',
 'pubmed': [9610785,
  10090301,
  10698485,
  12835615,
  12860276,
  14517345,
  15084249,
  15090724,
  15226672,
  16049806,
  17010218,
  17973251,
  18288399,
  22333393,
  24467436,
  25528056,
  27648926,
  28359264,
  29315819,
  29339455,
  29901116,
  29964355,
  29969986,
  31358821],
 'score': 0.1,
 'source': 'BEFREE'}

We can map the API response to the metaKG fields in the following way:
* the numeric measures (DPI, DSI, EI, score) are already annotated in the augmented metaKG edge info. Add a "value" field to put in the specific values for this specific edge
* the source is already annotated in more detail in the augmented metaKG edge info. Replace 'source' with the provenance-related fields from the metaKG. 
* the pubmed field can be mapped to a response-specific edge field -> publications. This is a dict keyed by ID type (here 'pmid'), with the list of IDs as the value, which makes clear that these are PMIDs. 
* DisGeNET has a browser interface, so we can link a website for the users. We need to parse the subject/object of the specific response that we're modeling to do this. 
    * the query for Disease -> Gene in BTE looks like this: API 2.1: https://mydisease.info/v1/query?fields=disgenet.genes_related_to_disease.gene_id (POST -d q=C0006142,C0346153&scopes=mondo.xrefs.umls, disgenet.xrefs.umls)
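
Keying the publications by ID type also leaves room for non-PMID identifiers. As a sketch of how a consumer might flatten that slot, the hypothetical helper below turns it into prefixed ID strings. Upper-casing the key to get the prefix is an assumption, and the DOI value is a made-up placeholder.

```python
def publication_curies(publications):
    """Turn {'pmid': [9610785, ...]} into ['PMID:9610785', ...]."""
    return ["{0}:{1}".format(id_type.upper(), pub_id)
            for id_type, ids in publications.items()
            for pub_id in ids]

## first two PMIDs from the response above, plus a placeholder DOI (not a real reference)
publications = {'pmid': [9610785, 10090301], 'doi': ['10.1000/example']}
publication_curies(publications)
```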

The final response edge would look like this:

In [33]:
## build the website url. I presume that BTE has resolved the IDs of the subject/object 
## website url construction: this could be a real edge

## BTE maps "hereditary breast carcinoma" (MONDO:0016419) to the following UMLS IDs (used by DisGeNET)
dg_subject_ids = ['C0346153',  ## familial breast cancer 
              'C0006142']  ## Malignant neoplasm of breast
dg_object_ids = ['9']  ## gene NAT1, can get from hint object's 'NCBIGene' value
source = 'BEFREE'   ## this is the parameter value DisGeNET uses for query/ provenance 

dg_website_urls = [
    "https://www.disgenet.org/browser/0/1/1/{0}/geneid__{1}-source__{2}/_b./".format(\
        "::".join(dg_subject_ids), obj, source) \
    for obj in dg_object_ids
]
dg_website_urls

['https://www.disgenet.org/browser/0/1/1/C0346153::C0006142/geneid__9-source__BEFREE/_b./']

In [36]:
## the entry
result_dg = copy.deepcopy(metakg_dggd_BEFREE)
result_dg['website'] = dg_website_urls
result_dg['publications'] = {'pmid':current_response_ex['pubmed']}

## map each measure name in the metaKG edge to its field in the raw response,
## so the assignment doesn't depend on the order of the numeric_measures list
response_field = {'GDAscore': 'score',  ## GDAscore / association_score
                  'EI': 'EI', 'DSI': 'DSI', 'DPI': 'DPI'}
for measure in result_dg['numeric_measures']:
    measure['value'] = current_response_ex[response_field[measure['name']]]

## no change to these parts of the response
result_dg['YearInitial'] = current_response_ex['YearInitial']
result_dg['YearFinal'] = current_response_ex['YearFinal']

In [37]:
result_dg

{'translator_group': ['Service_Provider'],
 'nodes_conflated': {'Disease': {'conflated_type': 'PhenotypicFeature',
   'where': 'DisGeNET'},
  'Gene': {'conflated_type': 'GeneProduct', 'where': 'DisGeNET'}},
 'numeric_measures_present': True,
 'numeric_measures': [{'name': 'GDAscore',
   'standard_label': 'association_score',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section31'},
   'value': 0.1},
  {'name': 'EI',
   'standard_label': 'evidence_index',
   'range': '(0-1]',
   'direction': {'more_confident': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section36'},
   'value': 0.917},
  {'name': 'DSI',
   'standard_label': 'gene_specific_to_disease',
   'range': '(0-1]',
   'direction': {'more_specific': 'larger'},
   'reference': {'url': 'https://www.disgenet.org/dbinfo#section33'},
   'value': 0.536},
  {'name': 'DPI',
   'standard_label': 'gene_specific_to_disease_class',
   'range