# BRAINWORKS - Generate Graph Data
[Mohammad M. Ghassemi](https://ghassemi.xyz), DATA Scholar, 2021

<hr>

## 0. Install Dependencies:
To begin, import the following external and internal Python libraries.

In [45]:
import re 
import pandas as pd
import os
import sys
from pprint import pprint

currentdir = os.getcwd()
parentdir  = os.path.dirname(currentdir)
sys.path.insert(0, parentdir)

from utils.database.database import database           
db = database()   

from utils.documentCollector.pubmed import pubmed
pm = pubmed()

from configuration.config import config


<br>

## 1. Extract Triples From Text
To begin, let's pull one abstract from the database.

In [8]:
data = db.query(f"""SELECT content, pmid, pub_date
                    FROM   documents
                    WHERE  content_type  = 'abstract'
                    LIMIT 1
                """)
pprint(data)

[{'content': 'Many diabetic individuals use prescription and non-prescription '
             'opioids and opiates. We aimed to investigate the joint effect of '
             'diabetes and opiate use on all-cause and cause-specific '
             'mortality. Golestan Cohort study is a prospective '
             'population-based study in Iran. A total of 50\xa0045 people-aged '
             '40-75, 28\xa0811 women, 8487 opiate users, 3548 diabetic '
             'patients-were followed during a median of 11.1\u2009years, with '
             'over 99% success follow-up. Hazard ratio and 95% confidence '
             'intervals (HRs, 95% CIs), and preventable death attributable to '
             'each risk factor, were calculated. After 533\xa0309 '
             'person-years, 7060 deaths occurred: 4178 (10.8%) of non-diabetic '
             'non-opiate users, 757 (25.3%) diabetic non-users, 1906 (24.0%) '
             'non-diabetic opiate users and 219 (39.8%) diabetic opiate users. '
  

<br><br>
Now let's pass this abstract to the information extraction pipelines. The results will be stored in the `triples` and `concepts` tables of the database. Note that this might take some time to run locally. We **strongly** encourage you to explore the `cluster/` directory for practical use of the information extraction pipelines on large volumes of data.

In [9]:
pm.extractInformation(paper_list     = data,   # A list of dicts containing the text, id, and date
                      db_insert      = True,   # Indicates if the results should be inserted into the database
                      batch_size     = 1000,   # How much of the data we want to process at any one time.
                      filter_results = True    # When True, filters the triples to return only the minimum spanning set.
                      )
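The `batch_size` argument caps how much of the data is processed at any one time. The internals of `extractInformation` are not shown in this notebook, so the snippet below is only a minimal sketch of what batching a `paper_list` might look like; the `batches` helper and the synthetic `papers` list are hypothetical and exist purely for illustration.

```python
def batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical stand-in for rows returned by db.query (same keys as above).
papers = [{'content': f'abstract {i}', 'pmid': i, 'pub_date': None} for i in range(2500)]

sizes = [len(batch) for batch in batches(papers, 1000)]
print(sizes)
```

With 2,500 papers and a batch size of 1,000, the last batch is simply shorter than the others.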

<br> 
We can query the database to collect the extracted triples, as well as their topics; let's extract one triple from the previous paper (along with the topics) to illustrate.

In [25]:
triples = db.query("""WITH data AS(
                                  SELECT *
                                  FROM  triples
                                  WHERE pmid = 32810213
                                  LIMIT 10 
                                  )
                      
                      , results AS( 
                                    SELECT  CONCAT(CONCAT('["',GROUP_CONCAT(DISTINCT cs.concept_name ORDER BY cs.concept_name ASC SEPARATOR '","' )),'"]') as subject_umls_topics,
                                            CONCAT(CONCAT('["',GROUP_CONCAT(DISTINCT co.concept_name ORDER BY co.concept_name ASC SEPARATOR '","' )),'"]') as object_umls_topics,    
                                            lower(d.subject)  as subject,
                                            lower(d.object)   as object,
                                            lower(d.relation) as relation,
                                            d.pmid            as pmid,
                                            d.pub_date        as pub_date,
                                            YEAR(d.pub_date)  as pub_year
                                    FROM data d
                                    JOIN concepts cs ON cs.triple_hash = d.subject_hash AND cs.concept_type = 'subject' AND  cs.concept_name NOT IN('Result','Cohort Studies','Combined','Mental Association','Conclusion','Consecutive','Author','findings aspects','evaluation','evidence','Publications','Lacking','Observational Study','Scientific Study','Potential','research','Country','Clinical Research','Patients','Cohort','week','Persons','Increase','inpatient','child','adult') AND cs.concept_name IS NOT NULL
                                    JOIN concepts co ON co.triple_hash = d.object_hash AND co.concept_type  = 'object' AND  co.concept_name NOT IN('Result','Cohort Studies','Combined','Mental Association','Conclusion','Consecutive','Author','findings aspects','evaluation','evidence','Publications','Lacking','Observational Study','Scientific Study','Potential','research','Country','Clinical Research','Patients','Cohort','week','Persons','Increase','inpatient','child','adult')  AND co.concept_name IS NOT NULL
                                    group by subject, object
                                    ) 
                      
                      SELECT * FROM results""")
pprint(triples[4])

{'object': 'health of diabetic patients',
 'object_umls_topics': '["diabetic","Health"]',
 'pmid': 32810213,
 'pub_date': datetime.date(2021, 3, 3),
 'pub_year': 2021,
 'relation': 'is detrimental to',
 'subject': 'opiates',
 'subject_umls_topics': '["Opiate Alkaloids"]'}
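Note that `subject_umls_topics` and `object_umls_topics` come back as JSON-formatted strings (assembled with `GROUP_CONCAT`), not as Python lists. A minimal sketch of decoding them is shown below; the `parse_topics` helper is hypothetical, and the empty-list fallback is an assumption for strings that are missing or not valid JSON (for instance, if a concept name contains a double quote).

```python
import json


def parse_topics(raw):
    """Decode a '["a","b"]' style string into a list; return [] if it cannot be decoded."""
    if not raw:
        return []
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return []


# The triple printed above, reduced to the fields used here.
triple = {'subject': 'opiates',
          'subject_umls_topics': '["Opiate Alkaloids"]',
          'object': 'health of diabetic patients',
          'object_umls_topics': '["diabetic","Health"]'}

subject_topics = parse_topics(triple['subject_umls_topics'])
object_topics  = parse_topics(triple['object_umls_topics'])
print(subject_topics, object_topics)
```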


<br>

## 2. Generate Graph Data

Below we provide a function that takes a set of parameters and generates a graph object that can be passed to the graph API at `graph.scigami.org`.

In [41]:
def getGraph(params):
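    from datetime import date
    from collections import Counter   # used below to count node occurrences
    from random import random         # used below for initial node positions
    import hashlib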
    def str2hex(input):
        hash_object = hashlib.sha256(input.encode('utf-8'))
        hex_dig = hash_object.hexdigest()
        return '#' + hex_dig[0:6]

    query = f"""WITH data AS( SELECT *
                      FROM {params['table']}
            """
    # Wrap the OR-ed concept filters in parentheses so the AND filters below apply to all of them.
    query +="""       WHERE ("""
    for concept in params['concepts']:
        query += f""" (subject     {" LIKE '%" if concept['type'] == 'LIKE' else " = '"}{concept['term']}{"%' " if concept['type'] == 'LIKE' else "' "}
                       OR object   {" LIKE '%" if concept['type'] == 'LIKE' else " = '"}{concept['term']}{"%' " if concept['type'] == 'LIKE' else "' "}
                      ) OR"""
    query = query[:-2] + ")"

    for exclude in params['exclude_concepts']:
        query += f""" AND subject not like '%{exclude}%' 
                      AND object  not like '%{exclude}%' """
    
    if params['include_relations'] != []:
        query += f""" AND relation IN ({'"' + '","'.join(params['include_relations']) + '"'}) """
    
    if params['exclude_relations'] != []:
        query += f""" AND relation NOT IN ({'"' + '","'.join(params['exclude_relations']) + '"'}) """
    
    if params['limit'] is not None:
        query += f""" LIMIT {params['limit']} """
    query += """)
             """

    query += """, results AS(
    SELECT
           CONCAT(CONCAT('["',GROUP_CONCAT(DISTINCT cs.concept_name ORDER BY cs.concept_name ASC SEPARATOR '","' )),'"]') as subject_umls_topics,
           CONCAT(CONCAT('["',GROUP_CONCAT(DISTINCT co.concept_name ORDER BY co.concept_name ASC SEPARATOR '","' )),'"]') as object_umls_topics,  
           lower(d.subject) as subject,
           lower(d.object) as object,
           lower(d.relation) as relation,
           d.pmid as pmid,
           d.pub_date as pub_date,
           YEAR(d.pub_date) as pub_year
      FROM data d
      JOIN concepts cs ON cs.triple_hash = d.subject_hash AND cs.concept_type = 'subject' AND  cs.concept_name NOT IN('Result','Cohort Studies','Combined','Mental Association','Conclusion','Consecutive','Author','findings aspects','evaluation','evidence','Publications','Lacking','Observational Study','Scientific Study','Potential','research','Country','Clinical Research','Patients','Cohort','week','Persons','Increase','inpatient','child','adult') AND cs.concept_name IS NOT NULL
      JOIN concepts co ON co.triple_hash = d.object_hash AND co.concept_type  = 'object'  AND  co.concept_name NOT IN('Result','Cohort Studies','Combined','Mental Association','Conclusion','Consecutive','Author','findings aspects','evaluation','evidence','Publications','Lacking','Observational Study','Scientific Study','Potential','research','Country','Clinical Research','Patients','Cohort','week','Persons','Increase','inpatient','child','adult')  AND co.concept_name IS NOT NULL
      GROUP BY subject, object
    ) 
    """

    query += """SELECT * FROM results """
    all_triples = db.query(query)
    
    # Pre-processing data
    df  = pd.DataFrame(all_triples)
    df1 = df.drop(columns=["object","object_umls_topics"])  
    df2 = df.drop(columns=["subject","subject_umls_topics"]).rename(columns={"object":"subject","object_umls_topics":"subject_umls_topics"})
    df3 = pd.concat([df1,df2])
    df3.reset_index(inplace=True)


    # getting earliest data appearance of subject and converting to dict
    min_df   = df3.groupby("subject")["pub_date"].min().to_frame().reset_index()
    min_dict = dict(zip(min_df["subject"], min_df["pub_date"]))


    #This gets the names of every node
    raw_nodes = dict(Counter([triple['subject'] for triple in all_triples]  + [triple['object'] for triple in all_triples]))
    nodes, edges = [], []
    for node, cnt in raw_nodes.items():

        ind    = df3.subject.isin([node])
        ii     = df3[ind].index.values[0]
        topics = df3.iloc[ii]['subject_umls_topics']

        nodes.append({ 'key'    : node, 'attributes':{
                                        'label' : node,
                                        'x'     : 100*random(),
                                        'y'     : 100*random(),
                                        'size'  : cnt*10,
                                        'color' : '#008cc2',
                                        'data':{'creation':min_dict[node].year + min_dict[node].month/12,
                                                'topics': topics}
                                        }
                      })


    # Map each publication year to a hex alpha suffix: recent edges are more opaque.
    this_year = date.today().year
    alphas    = ['CC','BB','AA','99','88','77','66','55','44','33','22']   # 0 to 10 years old
    opacity   = {this_year - age: alpha for age, alpha in enumerate(alphas)}

    for i,triple in enumerate(all_triples):
        edges.append({ 'key'     : str(i),
                       'source' : triple['subject'],
                       'target' : triple['object'],
                       'attributes' : { 'label'  : triple['relation'],
                                        'type'   : 'arrow',
                                        'size'   : 3,
                                        # Years outside the table fall back to the faintest alpha instead of raising a KeyError.
                                        'color'  : '#041E42' + opacity.get(triple['pub_date'].year, '11'),
                                        'data'   :{'time':triple['pub_date'].year + triple['pub_date'].month/12,
                                                   'pmid':triple['pmid']}
                                      }
                     })

    return {'nodes':nodes, 'edges':edges}
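`getGraph` builds its `WHERE` clause by interpolating search terms directly into the SQL string, so a term containing a quote character will break the query. The sketch below shows one possible alternative that keeps the values separate from the SQL text. It is an illustration only: `build_concept_filter` is a hypothetical helper, the `%s` placeholder style is the one used by common MySQL drivers, and whether `db.query` accepts a separate parameter list is an assumption not confirmed by this notebook.

```python
def build_concept_filter(concepts, exclude_concepts=()):
    """Return (where_sql, params) using %s placeholders instead of inlined values."""
    if not concepts:
        raise ValueError("at least one concept is required")

    clauses, params = [], []
    for concept in concepts:
        if concept['type'] == 'LIKE':
            clauses.append("(subject LIKE %s OR object LIKE %s)")
            value = f"%{concept['term']}%"
        else:
            clauses.append("(subject = %s OR object = %s)")
            value = concept['term']
        params += [value, value]

    # Parenthesise the OR group so the exclusions apply to every concept.
    where = "(" + " OR ".join(clauses) + ")"
    for term in exclude_concepts:
        where += " AND subject NOT LIKE %s AND object NOT LIKE %s"
        params += [f"%{term}%", f"%{term}%"]
    return where, params


where, params = build_concept_filter([{'term': 'covid', 'type': 'LIKE'}], ['patient'])
print(where)
print(params)
```

The returned pair would then be handed to whatever database call supports parameter binding.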

<br>
We may call this function, passing in a parameter set, and receive a formatted graph data object for the API.

In [42]:
from collections import Counter
from random      import random
from pprint      import pprint


#----------------------------------------------
# Get the graph data object from the database.
#----------------------------------------------
graph = getGraph({'table'             : 'triples',                           # The name of the table where the triples data is stored. 
                  'concepts'          : [{'term':'covid','type':'LIKE'}],    # The term we want to search for, for instance, `covid`
                  'limit'             : 5,                                   # The number of triples we want to return, e.g. `5`
                  'include_relations' : [],                                  # Any edges we are interested in, e.g. ['cause','caused','associated'],
                  'exclude_relations' : [],                                  # Any edges we want to exclude, e.g. ['is','will be'],                         
                  'exclude_concepts'  : []                                   # Any nodes we want to exclude, e.g. ['patient','patients','participants','participant','men','women']
                })


#-----------------------------------------------
# Adding configuration information to the graph
#-----------------------------------------------
json_data = {"graph":{}, "config":{}}
json_data["graph"]  = graph
json_data["config"] = {"maps" : [{"dimension": "cluster", 
                                 },
                                 {"dimension": "node_slider", 
                                  "data"     : "creation",
                                  "args"     : "node slider"
                                 },
                                 {"dimension": "node_size", 
                                  "data"     : "degree",
                                  "args"     : {"min":10, "max":40}
                                 },
                                 {"dimension": "edge_slider", 
                                  "data"     : "time",
                                  "args"     : "edge slider"
                                 }],
                       "settings":{}
                       } 
pprint(json_data)

{'config': {'maps': [{'dimension': 'cluster'},
                     {'args': 'node slider',
                      'data': 'creation',
                      'dimension': 'node_slider'},
                     {'args': {'max': 40, 'min': 10},
                      'data': 'degree',
                      'dimension': 'node_size'},
                     {'args': 'edge slider',
                      'data': 'time',
                      'dimension': 'edge_slider'}],
            'settings': {}},
 'graph': {'edges': [{'attributes': {'color': '#041E42CC',
                                     'data': {'pmid': 32597466,
                                              'time': 2021.3333333333333},
                                     'label': 'treatment with',
                                     'size': 3,
                                     'type': 'arrow'},
                      'key': '0',
                      'source': 'covid 19',
                      'target': 'baricitinib'},
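The edge `color` and `time` values in this output follow from how `getGraph` encodes the publication date: the colour is `#041E42` plus a two-digit hex alpha that fades with the age of the paper, and the time is the year plus `month/12`. The sketch below reproduces that encoding in isolation. The helper names are hypothetical, the example date is made up (only its year and month matter), and the `'11'` fallback for years outside the table is an assumption of this sketch.

```python
from datetime import date

# Alpha suffix by age in years: index 0 is the current year, index 10 is ten years old.
ALPHAS = ['CC', 'BB', 'AA', '99', '88', '77', '66', '55', '44', '33', '22']


def edge_alpha(pub_year, this_year):
    """Return the hex alpha for a publication year, '11' when outside the table."""
    age = this_year - pub_year
    if 0 <= age < len(ALPHAS):
        return ALPHAS[age]
    return '11'


def fractional_year(d):
    """Encode a date as year + month/12, as getGraph does."""
    return d.year + d.month / 12


pub_date = date(2021, 4, 1)   # hypothetical date
color = '#041E42' + edge_alpha(pub_date.year, this_year=2021)
time  = fractional_year(pub_date)
print(color, time)
```

One quirk of this encoding is that December evaluates to the next whole year, since `12/12` equals 1.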
                 

<br> We can now call the API to obtain the graph

In [61]:
import requests
url = requests.get("http://graph.scigami.org:5000/create_graph", json=json_data).content.decode()
url

'http://52.73.26.147:5000/graph/27afe0d6136a4083a072a0113e1eb1cf'
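If the graph object has been filtered or edited by hand after `getGraph` returns, it may be worth checking, before sending the request, that every edge still points at existing nodes. The snippet below is a minimal sketch of such a check; `dangling_edges` is a hypothetical helper, and the second edge in the example is invented to trigger the check.

```python
def dangling_edges(graph):
    """Return the keys of edges whose source or target is not a node key."""
    node_keys = {node['key'] for node in graph['nodes']}
    return [edge['key'] for edge in graph['edges']
            if edge['source'] not in node_keys or edge['target'] not in node_keys]


example_graph = {
    'nodes': [{'key': 'covid 19',    'attributes': {}},
              {'key': 'baricitinib', 'attributes': {}}],
    'edges': [{'key': '0', 'source': 'covid 19', 'target': 'baricitinib', 'attributes': {}},
              # Invented edge whose target has no matching node.
              {'key': '1', 'source': 'covid 19', 'target': 'missing node', 'attributes': {}}],
}

missing = dangling_edges(example_graph)
print(missing)
```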

<br>

## Appendix

Navigating Ontologies using NLM APIs. 

In [51]:
import requests
import json
from pprint import pprint

# Obtain a service ticket
r   = requests.post('https://utslogin.nlm.nih.gov/cas/v1/api-key', data={'apikey' : config['UMLS']['APIKey']})
tgt = 'TGT-' + r.text.split('TGT-')[1].split('-cas')[0] + '-cas'
r   = requests.post(f"""https://utslogin.nlm.nih.gov/cas/v1/tickets/{tgt}""", data={'service' : 'http://umlsks.nlm.nih.gov'})
service_ticket = r.text

# Let's search for the children of Neuronal Plasticity: https://meshb-prev.nlm.nih.gov/record/ui?ui=D009473
concept_id = 'D009473'
source     = 'MSH' 
base_url   = 'https://uts-ws.nlm.nih.gov/rest'
extension  = f"""/content/current/source/{source}/{concept_id}/children"""
search     = f"""{base_url}{extension}?ticket={service_ticket}"""
r          = json.loads(requests.get(search).text)
pprint(r)

{'pageCount': 1,
 'pageNumber': 1,
 'pageSize': 25,
 'result': [{'ancestors': 'https://uts-ws.nlm.nih.gov/rest/content/2021AB/source/MSH/D017774/ancestors',
             'atomCount': 6,
             'atoms': 'https://uts-ws.nlm.nih.gov/rest/content/2021AB/source/MSH/D017774/atoms',
             'attributes': 'https://uts-ws.nlm.nih.gov/rest/content/2021AB/source/MSH/D017774/attributes',
             'cVMemberCount': 0,
             'children': 'NONE',
             'classType': 'SourceAtomCluster',
             'concepts': 'https://uts-ws.nlm.nih.gov/rest/search/2021AB?string=D017774&sabs=MSH&searchType=exact&inputType=sourceUi',
             'contentViewMemberships': [],
             'defaultPreferredAtom': 'https://uts-ws.nlm.nih.gov/rest/content/2021AB/source/MSH/D017774/atoms/preferred',
             'definitions': 'NONE',
             'descendants': 'NONE',
             'name': 'Long-Term Potentiation',
             'obsolete': False,
             'parents': 'https://uts-ws.nlm.nih
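The response is a paged envelope (`pageCount`, `pageNumber`, `pageSize`) whose `result` list holds one record per child, and link fields with nothing to point at carry the string `'NONE'`. A minimal sketch of pulling the useful parts out of such a response follows; the helper names are hypothetical, and the sample dictionary is a trimmed copy of the output above rather than a live API call.

```python
def child_names(response):
    """Return the names of all non-obsolete records in a children response."""
    return [item['name'] for item in response.get('result', [])
            if not item.get('obsolete', False)]


def link_or_none(value):
    """Translate the API's 'NONE' marker into a Python None."""
    return None if value == 'NONE' else value


# Trimmed copy of the response printed above.
sample = {'pageCount': 1, 'pageNumber': 1, 'pageSize': 25,
          'result': [{'name': 'Long-Term Potentiation',
                      'obsolete': False,
                      'children': 'NONE',
                      'ancestors': 'https://uts-ws.nlm.nih.gov/rest/content/2021AB/source/MSH/D017774/ancestors'}]}

names = child_names(sample)
has_children = link_or_none(sample['result'][0]['children']) is not None
print(names, has_children)
```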

<br>
Alternative ways to navigate the ontologies

In [53]:
import requests
import json
from pprint import pprint
#from utils.generalPurpose                import generalPurpose as gp



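def flatten(my_dict, last_keys='', key_list=None, value_list=None):
    # Create fresh accumulators per top-level call; mutable defaults would leak results between calls.
    key_list   = [] if key_list   is None else key_list
    value_list = [] if value_list is None else value_list
    if isinstance(my_dict, dict):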
        for key, value in my_dict.items():
            this_key = last_keys + '.' + key
            if isinstance(value, dict):
                flatten(my_dict[key],this_key,key_list,value_list)
            elif isinstance(value,list):
                flatten(my_dict[key],this_key,key_list,value_list)
            elif value is None:
                key_list.append(this_key[1:])
                value_list.append('None')
            else:
                key_list.append(this_key[1:])
                value_list.append(value)
    
    if isinstance(my_dict, list):
        for i in range(len(my_dict)):
            this_key = last_keys + '_' + str(i) + '_'
            if isinstance(my_dict[i], dict):
                flatten(my_dict[i],this_key,key_list,value_list)
            elif isinstance(my_dict[i],list):
                flatten(my_dict[i],this_key,key_list,value_list)
            elif my_dict[i] is None:
                key_list.append(this_key[1:])
                value_list.append('None')
            else:
                key_list.append(this_key[1:])
                value_list.append(my_dict[i])
    
    return dict(zip(key_list, value_list))
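To make the key scheme concrete: nested dictionary keys are joined with `.`, list positions are written as `_i_`, and `None` leaves are stored as the string `'None'`. The snippet below is a compact sketch that follows the same scheme for nested dictionaries and lists; `flatten_sketch` is a hypothetical re-implementation for illustration, and the sample record is hand-written in the general shape of a MeSH JSON-LD document rather than fetched from the API.

```python
def flatten_sketch(node, prefix=''):
    """Return {path: leaf} using '.' for dict keys and '_i_' for list positions."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            flat.update(flatten_sketch(value, prefix + '.' + key))
    elif isinstance(node, list):
        for i, value in enumerate(node):
            flat.update(flatten_sketch(value, prefix + '_' + str(i) + '_'))
    else:
        # Drop the leading character of the path and store None as the string 'None'.
        flat[prefix[1:]] = 'None' if node is None else node
    return flat


# Hand-written sample record (illustrative only).
record = {'@id': 'http://id.nlm.nih.gov/mesh/M0002885',
          'label': {'@language': 'en', '@value': 'Brain Neoplasms'},
          'narrowerConcept': ['http://id.nlm.nih.gov/mesh/M0334228',
                              'http://id.nlm.nih.gov/mesh/M0334225'],
          'relatedConcept': None}

flat = flatten_sketch(record)
for key, value in flat.items():
    print(key, '->', value)
```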


def extractFromFlatJson(flat_data, key_has = [], value_has = [], fetch_part = None ):
    # Example: fetch the English label from a flattened JSON-LD record.
    #   label = extractFromFlatJson(flat_x, key_has    = ['label','@language'],
    #                                       value_has  = ['en'],
    #                                       fetch_part = '@value')

    data_elements = flat_data.keys()
    # See if this key matches the criteria 

    results = []
    valid_keys = {}
    for element in data_elements: 
        # Key Criteria
        valid_keys[element] = True
        for key in key_has:
            if key not in element:
                valid_keys[element] = False

    
    valid_values = {}
    for element in data_elements:    
        if valid_keys[element]:
            
            # Value Criteria  
            valid_values[element] = True
            for value in value_has:
                if value not in str(flat_data[element]):
                    valid_values[element] = False
                   
            if valid_values[element]:
                if fetch_part is not None:
                    results.append(flat_data['.'.join(element.split('.')[:-1] + [fetch_part])])
                else:
                    results.append(flat_data[element])

    return list(set(results))
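The `fetch_part` argument works by swapping the last segment of a matching key for another name, so a match on `label.@language` can return the value stored under `label.@value`. The sketch below isolates that step on a small, hand-written flat dictionary; `sibling_key` is a hypothetical helper named here for illustration.

```python
def sibling_key(flat_key, fetch_part):
    """Replace the last '.'-separated segment of `flat_key` with `fetch_part`."""
    return '.'.join(flat_key.split('.')[:-1] + [fetch_part])


# Hand-written flat record (illustrative only).
flat = {'@id': 'http://id.nlm.nih.gov/mesh/M0002885',
        'label.@language': 'en',
        'label.@value': 'Brain Neoplasms'}

# Same substring tests as key_has=['label','@language'] and value_has=['en'].
matches = [key for key in flat
           if 'label' in key and '@language' in key and 'en' in str(flat[key])]
labels = [flat[sibling_key(key, '@value')] for key in matches]
print(labels)
```

Because both the key and value tests are substring checks, a short pattern such as `'en'` also matches any longer value that happens to contain it.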




# Get the Descriptor Info.
def getMeshInfo(id):
    x       = json.loads(requests.get(f'https://id.nlm.nih.gov/mesh/{id}.json-ld').text)
    flat_x  = flatten(x)
    print(f'https://id.nlm.nih.gov/mesh/{id}.json-ld')
    r       = {}
    r['id'] = x.get('@id',None)
    
    
    # Descriptor  -----------------------------------------------------
    if id[0] == 'D':
        r['label']  = extractFromFlatJson(flat_x, key_has = ['label','@language'], value_has  = ['en'], fetch_part = '@value')[0]
        r['treeNumber']         = x.get('treeNumber'        ,None)
        r['broaderDescriptor']  = x.get('broaderDescriptor' ,None)
        r['concept']            = x.get('concept'           ,None)
        r['preferredConcept']   = x.get('preferredConcept'  ,None)
        r['allowableQualifier'] = x.get('allowableQualifier',None)

    # Concept  ------------------------------------------------------
    if id[0] == 'M':
        r['label']            = extractFromFlatJson(flat_x, key_has = ['label','@language'], value_has  = ['en'], fetch_part = '@value')[0]
        r['scopeNotes']       = extractFromFlatJson(flat_x, key_has = ['scopeNote','@language'], value_has  = ['en'], fetch_part = '@value')[0]
        r['preferredTerm']    = x.get('preferredTerm'   ,None)
        r['narrowerConcept']  = x.get('narrowerConcept' ,None)
        r['broaderConcept']   = x.get('broaderConcept'  ,None)
        r['relatedConcept']   = x.get('relatedConcept'  ,None)
              
    # Qualifier  ------------------------------------------------------
    if id[0] == 'Q':
        r['label']            = extractFromFlatJson(flat_x, key_has = ['label','@language'], value_has  = ['en'], fetch_part = '@value')[0]
        r['preferredConcept'] = x.get('preferredConcept' ,None)
        r['preferredTerm']    = x.get('preferredTerm'    ,None)
        r['treeNumber']       = x.get('treeNumber'       ,None)
        r['broaderQualifier'] = x.get('broaderQualifier' ,None)
        
    # Terms  ------------------------------------------------------
    if id[0] == 'T':
        r['label']  = extractFromFlatJson(flat_x, key_has = ['label','@language'], value_has  = ['en'], fetch_part = '@value')[0]
    
    for key, val in r.items():
        if val is None:
            r[key] = []
            continue
        
        r[key] = [val] if not isinstance(r[key], list) else val
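        # Reduce MeSH URIs to their trailing identifier; leave plain text (labels, scope notes) untouched.
        r[key] = [v.split('/')[-1] if str(v).startswith('http') else v for v in r[key]]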
                    
    return r
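`getMeshInfo` decides which fields to read from the first letter of the identifier: `D` for descriptors, `M` for concepts, `Q` for qualifiers, and `T` for terms. The sketch below expresses that dispatch as a lookup table alongside the URL pattern used above. The helper names are hypothetical, and returning `'Unknown'` for any other prefix is an assumption of this sketch.

```python
# Record types handled by getMeshInfo, keyed by the identifier's first letter.
MESH_RECORD_TYPES = {'D': 'Descriptor', 'M': 'Concept', 'Q': 'Qualifier', 'T': 'Term'}


def mesh_record_type(mesh_id):
    """Return the record type implied by the identifier prefix, or 'Unknown'."""
    return MESH_RECORD_TYPES.get(mesh_id[:1], 'Unknown')


def mesh_url(mesh_id):
    """Build the JSON-LD URL requested by getMeshInfo."""
    return f'https://id.nlm.nih.gov/mesh/{mesh_id}.json-ld'


for mesh_id in ['D009473', 'M0002885']:
    print(mesh_id, mesh_record_type(mesh_id), mesh_url(mesh_id))
```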

In [54]:
x = getMeshInfo('M0002885')
print('Concept')
print(x['label'], x['scopeNotes'], x['narrowerConcept'], x['broaderConcept'])


print('narrower')
for concept in x['narrowerConcept']:
    print(getMeshInfo(concept)['label'], getMeshInfo(concept)['id'])


print('broader')
for concept in x['broaderConcept']:
    print(getMeshInfo(concept)['label'], getMeshInfo(concept)['id'])

https://id.nlm.nih.gov/mesh/M0002885.json-ld
Concept
['Brain Neoplasms'] ['Neoplasms of the intracranial components of the central nervous system, including the cerebral hemispheres, basal ganglia, hypothalamus, thalamus, brain stem, and cerebellum. Brain neoplasms are subdivided into primary (originating from brain tissue) and secondary (i.e., metastatic) forms. Primary neoplasms are subdivided into benign and malignant forms. In general, brain tumors may also be classified by age of onset, histologic type, or presenting location in the brain.'] ['M0334228', 'M000677383', 'M0334225', 'M0334239', 'M0334226', 'M0334240'] ['M0334227']
narrower
https://id.nlm.nih.gov/mesh/M0334228.json-ld
https://id.nlm.nih.gov/mesh/M0334228.json-ld
['Benign Neoplasms, Brain'] ['M0334228']
https://id.nlm.nih.gov/mesh/M000677383.json-ld
https://id.nlm.nih.gov/mesh/M000677383.json-ld
['Brain Metastases'] ['M000677383']
https://id.nlm.nih.gov/mesh/M0334225.json-ld
https://id.nlm.nih.gov/mesh/M0334225.json-ld