# BTE metakg visualization

The goal of this notebook is to visualize the metaKG that BTE uses. The result is similar to the [subway diagram](https://raw.githubusercontent.com/biothings/BioThings_Explorer_TRAPI/main/diagrams/smartapi_metagraph.png) we've used before, but updated to the size and scale of the current metaKG.

This notebook takes as input an ndjson file with the SmartAPI metaKG (originally provided by Chunlei on 2023-03-01).

Optimizations
* remove less-commonly-used node types from subject/object
* only count in one direction (`A-treats-B` gets merged with `B-treated_by-A`)

In [1]:
import biothings_client
import json5
import networkx as nx
import pandas as pd
import re
import requests


## Read in data

### Option 1 -- Read in the SmartAPI ndjson file

In [2]:
df = pd.read_json('data/smartapi_metakg_03012023.ndjson.gz', lines=True)
df

Unnamed: 0,subject,object,predicate,api,provided_by
0,AnatomicalEntity,MolecularActivity,affects_activity_of,"{'name': 'CAM-KP API', 'smartapi': {'metadata'...",
1,AnatomicalEntity,MolecularEntity,affects_activity_of,"{'name': 'CAM-KP API', 'smartapi': {'metadata'...",
2,AnatomicalEntity,NamedThing,affects_activity_of,"{'name': 'CAM-KP API', 'smartapi': {'metadata'...",
3,AnatomicalEntity,NucleicAcidEntity,affects_activity_of,"{'name': 'CAM-KP API', 'smartapi': {'metadata'...",
4,AnatomicalEntity,Occurrent,affects_activity_of,"{'name': 'CAM-KP API', 'smartapi': {'metadata'...",
...,...,...,...,...,...
175543,Cell,Phenomenon,causes,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,
175544,Transcript,CellLine,physically_interacts_with,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,
175545,Disease,PhenotypicFeature,entity_positively_regulates_entity,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,
175546,Device,Vitamin,entity_negatively_regulates_entity,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,


### Option 2 (preferred) -- Query the SmartAPI API

This method takes longer, but it retrieves the most up-to-date data.

In [3]:
c = biothings_client.get_client('metakg', url='https://dev.smart-api.info/api/metakg')
c._query_endpoint=''
a = c.query('*', fetch_all=True)

In [4]:
df = pd.DataFrame(a)
df

Fetching 304687 metakg(s) . . .
No more results to return.


Unnamed: 0,api,object,predicate,subject,provided_by
0,"{'name': 'CAM-KP API', 'smartapi': {'id': '480...",Gene,entity_regulates_entity,MacromolecularMachineMixin,
1,"{'name': 'CAM-KP API', 'smartapi': {'id': '480...",GeneOrGeneProduct,entity_regulates_entity,MacromolecularMachineMixin,
2,"{'name': 'CAM-KP API', 'smartapi': {'id': '480...",GeneProductMixin,entity_regulates_entity,MacromolecularMachineMixin,
3,"{'name': 'CAM-KP API', 'smartapi': {'id': '480...",GenomicEntity,entity_regulates_entity,MacromolecularMachineMixin,
4,"{'name': 'CAM-KP API', 'smartapi': {'id': '480...",MacromolecularComplexMixin,entity_regulates_entity,MacromolecularMachineMixin,
...,...,...,...,...,...
304682,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,Phenomenon,causes,Cell,
304683,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,CellLine,physically_interacts_with,Transcript,
304684,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,PhenotypicFeature,entity_positively_regulates_entity,Disease,
304685,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,Vitamin,entity_negatively_regulates_entity,Device,


Parse the API name and SmartAPI ID out of the nested `api` column into their own columns.

In [5]:
api_flat = pd.json_normalize(df['api'])   # flatten the nested API metadata once
df = df.assign(api_name=api_flat['name'], api_id=api_flat['smartapi.id'])
df

Unnamed: 0,api,object,predicate,subject,provided_by,api_name,api_id
0,"{'name': 'CAM-KP API', 'smartapi': {'id': '480...",Gene,entity_regulates_entity,MacromolecularMachineMixin,,CAM-KP API,4803457bdb4bfeeb63a88244830ece2e
1,"{'name': 'CAM-KP API', 'smartapi': {'id': '480...",GeneOrGeneProduct,entity_regulates_entity,MacromolecularMachineMixin,,CAM-KP API,4803457bdb4bfeeb63a88244830ece2e
2,"{'name': 'CAM-KP API', 'smartapi': {'id': '480...",GeneProductMixin,entity_regulates_entity,MacromolecularMachineMixin,,CAM-KP API,4803457bdb4bfeeb63a88244830ece2e
3,"{'name': 'CAM-KP API', 'smartapi': {'id': '480...",GenomicEntity,entity_regulates_entity,MacromolecularMachineMixin,,CAM-KP API,4803457bdb4bfeeb63a88244830ece2e
4,"{'name': 'CAM-KP API', 'smartapi': {'id': '480...",MacromolecularComplexMixin,entity_regulates_entity,MacromolecularMachineMixin,,CAM-KP API,4803457bdb4bfeeb63a88244830ece2e
...,...,...,...,...,...,...,...
304682,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,Phenomenon,causes,Cell,,ARAX Translator Reasoner - TRAPI 1.3.0,e248aefca0f469229e82cca80fbabc89
304683,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,CellLine,physically_interacts_with,Transcript,,ARAX Translator Reasoner - TRAPI 1.3.0,e248aefca0f469229e82cca80fbabc89
304684,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,PhenotypicFeature,entity_positively_regulates_entity,Disease,,ARAX Translator Reasoner - TRAPI 1.3.0,e248aefca0f469229e82cca80fbabc89
304685,{'name': 'ARAX Translator Reasoner - TRAPI 1.3...,Vitamin,entity_negatively_regulates_entity,Device,,ARAX Translator Reasoner - TRAPI 1.3.0,e248aefca0f469229e82cca80fbabc89
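
As a minimal sketch of what the flattening step does, the toy example below runs `pd.json_normalize` on made-up records shaped like the `api` column. The record values are hypothetical. Nested keys come back as dot-separated column names, which is why the ID column is addressed as `smartapi.id`.

```python
import pandas as pd

# Toy records shaped like the metakg 'api' column (values are made up).
df = pd.DataFrame({
    "subject": ["Gene", "Disease"],
    "api": [
        {"name": "Example API", "smartapi": {"id": "abc123"}},
        {"name": "Other API", "smartapi": {"id": "def456"}},
    ],
})

# Flatten once; nested keys become dot-separated column names.
api_flat = pd.json_normalize(df["api"].tolist())

# Assign by position so the result does not depend on df's index labels.
df["api_name"] = api_flat["name"].to_numpy()
df["api_id"] = api_flat["smartapi.id"].to_numpy()

print(df[["subject", "api_name", "api_id"]])
```

Assigning by position is a design choice for the sketch: the flattened frame always has a fresh 0..n-1 index, so label-based assignment would only line up if `df` has the same default index.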


Read in the BTE config file that specifies currently-allowed APIs

In [6]:
bte_config_url = "https://raw.githubusercontent.com/biothings/BioThings_Explorer_TRAPI/main/src/config/apis.js"
r = requests.get(bte_config_url)
str_bte_config = r.text
#print(str_bte_config)
str_bte_config = re.sub(r"exports\.API_LIST = ", "", str_bte_config)                       # remove variable assignment step
str_bte_config = re.sub(r"\s*//.*",              "", str_bte_config)                       # remove comments (whole-line and trailing)
str_bte_config = re.sub(r'^$\n',                '', str_bte_config, flags=re.MULTILINE)   # remove blank lines
str_bte_config = re.sub(r',\s*exclude:[^\]]*]', '', str_bte_config, flags=re.MULTILINE)   # remove 'exclude' section
str_bte_config = re.sub(r';$',                  '', str_bte_config, flags=re.MULTILINE)   # remove trailing semicolon
#print(str_bte_config)

bte_config = json5.loads(str_bte_config)
bte_config

{'include': [{'id': 'd22b657426375a5295e7da8a303b9893', 'name': 'BioLink API'},
  {'id': '43af91b3d7cae43591083bff9d75c6dd', 'name': 'EBI Proteins API'},
  {'id': 'dca415f2d792976af9d642b7e73f7a41', 'name': 'LitVar API'},
  {'id': '1f277e1563fcfd124bfae2cc3c4bcdec', 'name': 'QuickGO API'},
  {'id': '1c056ffc7ed0dd1229e71c4752239465',
   'name': 'Ontology Lookup Service API'},
  {'id': '38e9e5169a72aee3659c9ddba956790d',
   'name': 'BioThings BindingDB API'},
  {'id': '55a223c6c6e0291dbd05f2faf27d16f4',
   'name': 'BioThings BioPlanet Pathway-Disease API'},
  {'id': 'b99c6dd64abcefe87dcd0a51c249ee6d',
   'name': 'BioThings BioPlanet Pathway-Gene API'},
  {'id': '00fb85fc776279163199e6c50f6ddfc6', 'name': 'BioThings DDInter API'},
  {'id': 'e3edd325c76f2992a111b43a907a4870', 'name': 'BioThings DGIdb API'},
  {'id': 'a7f784626a426d054885a5f33f17d3f8', 'name': 'BioThings DISEASES API'},
  {'id': '1f47552dabd67351d4c625adb0a10d00',
   'name': 'BioThings EBIgene2phenotype API'},
  {'id': 'cc
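
The cleanup steps can be checked offline on a small string. The sketch below applies the same five substitutions to a hypothetical snippet written in the general shape of `apis.js`. The IDs and names are invented, and the real file may differ in layout. To stay within the standard library it pulls the IDs out with a regex at the end, where the notebook uses `json5.loads`.

```python
import re

# Hypothetical snippet in the same general shape as apis.js (IDs and names are made up).
js_source = """\
exports.API_LIST = {
  include: [
    // a commented-out entry
    // { id: 'aaa', name: 'Disabled API' },
    { id: 'bbb', name: 'Example API' },  // trailing comment
    { id: 'ccc', name: 'Another API' },
  ],
  exclude: [
    { id: 'ddd', name: 'Excluded API' },
  ],
};
"""

cleaned = js_source
cleaned = re.sub(r"exports\.API_LIST = ", "", cleaned)                     # drop the assignment
cleaned = re.sub(r"\s*//.*", "", cleaned)                                  # drop // comments
cleaned = re.sub(r"^$\n", "", cleaned, flags=re.MULTILINE)                 # drop blank lines
cleaned = re.sub(r",\s*exclude:[^\]]*]", "", cleaned, flags=re.MULTILINE)  # drop 'exclude' list
cleaned = re.sub(r";$", "", cleaned, flags=re.MULTILINE)                   # drop trailing semicolon

# Stand-in for json5.loads(cleaned): collect the ids with a regex.
ids = re.findall(r"id:\s*'([^']+)'", cleaned)
print(cleaned)
print(ids)
```

One caveat with this approach: the comment pattern removes everything after any `//`, so a value containing a URL would be truncated. That is not an issue for a list of IDs and names, but it is worth keeping in mind if the config file changes shape.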

In [7]:
bte_config_ids = [ x['id'] for x in bte_config['include'] ]
bte_config_ids

['d22b657426375a5295e7da8a303b9893',
 '43af91b3d7cae43591083bff9d75c6dd',
 'dca415f2d792976af9d642b7e73f7a41',
 '1f277e1563fcfd124bfae2cc3c4bcdec',
 '1c056ffc7ed0dd1229e71c4752239465',
 '38e9e5169a72aee3659c9ddba956790d',
 '55a223c6c6e0291dbd05f2faf27d16f4',
 'b99c6dd64abcefe87dcd0a51c249ee6d',
 '00fb85fc776279163199e6c50f6ddfc6',
 'e3edd325c76f2992a111b43a907a4870',
 'a7f784626a426d054885a5f33f17d3f8',
 '1f47552dabd67351d4c625adb0a10d00',
 'cc857d5b7c8b7609b5bbb38ff990bfff',
 'f339b28426e7bf72028f60feefcd7465',
 '34bad236d77bea0a0ee6c6cba5be54a6',
 '316eab811fd9ef1097df98bcaa9f7361',
 'a5b0ec6bfde5008984d4b6cde402d61f',
 '32f36164fabed5d3abe6c2fd899c9418',
 '77ed27f111262d0289ed4f4071faa619',
 'edeb26858bd27d0322af93e7a9e08761',
 '03283cc2b21c077be6794e1704b1d230',
 '1d288b3a3caf75d541ffaae3aab386c8',
 'ec6d76016ef40f284359d17fbf78df20',
 '8f08d1446e0bb9c2b323713ce83e2bd3',
 '671b45c0301c8624abbd26ae78449ca2',
 '59dce17363dce279d389100834e43648',
 '09c8782d9f4027712e65b95424adba79',
 

## Join SmartAPI data with BTE config IDs

In [8]:
df_bte = df.query('api_id in @bte_config_ids').drop(columns=['api']).drop_duplicates()
df_bte

Unnamed: 0,object,predicate,subject,provided_by,api_name,api_id
37865,Gene,has_part,GeneFamily,,Automat-hgnc(Trapi v1.3.0),7382f0fabffce3cc7f7b8b6358c69259
43365,Disease,superclass_of,Disease,infores:disease-ontology,Ontology Lookup Service API,1c056ffc7ed0dd1229e71c4752239465
51379,ChemicalEntity,subclass_of,Protein,,Automat-uberongraph(Trapi v1.3.0),ef9027a7d2246c6540cc7b3ce202d89f
51380,ChemicalEntity,related_to,Protein,,Automat-uberongraph(Trapi v1.3.0),ef9027a7d2246c6540cc7b3ce202d89f
51381,ChemicalEntity,overlaps,Protein,,Automat-uberongraph(Trapi v1.3.0),ef9027a7d2246c6540cc7b3ce202d89f
...,...,...,...,...,...,...
259395,Gene,contribution_from,PhenotypicFeature,,Text Mining Targeted Association API,978fe380a147a8641caf72320862697b
259396,PhenotypicFeature,contributes_to,SmallMolecule,,Text Mining Targeted Association API,978fe380a147a8641caf72320862697b
259397,SmallMolecule,contribution_from,PhenotypicFeature,,Text Mining Targeted Association API,978fe380a147a8641caf72320862697b
259400,Disease,contributes_to,SmallMolecule,,Text Mining Targeted Association API,978fe380a147a8641caf72320862697b


In [9]:
df_bte[['subject','object','predicate','api_name','api_id']].drop_duplicates().to_csv("results/bte_operations.tsv", sep="\t", index=False)

In [10]:
df_bte['api_name'].value_counts()

Automat-uberongraph(Trapi v1.3.0)           1368
BioThings SEMMEDDB API                       932
Automat-ctd(Trapi v1.3.0)                    509
Automat-biolink(Trapi v1.3.0)                210
Automat-hetio(Trapi v1.3.0)                  143
Automat-ontology-hierarchy(Trapi v1.3.0)     134
Automat-drug-central(Trapi v1.3.0)            93
Automat-pharos(Trapi v1.3.0)                  79
Automat-hmdb(Trapi v1.3.0)                    66
Automat-human-goa(Trapi v1.3.0)               58
Automat-icees-kg(Trapi v1.3.0)                55
COHD TRAPI 1.3                                50
Multiomics EHR Risk KP API                    44
Automat-viral-proteome(Trapi v1.3.0)          35
Automat-gtopdb(Trapi v1.3.0)                  33
Automat-panther(Trapi v1.3.0)                 24
BioLink API                                   21
MyChem.info API                               21
Text Mining Targeted Association API          20
Automat-gwas-catalog(Trapi v1.3.0)            18
Automat-gtex(Trapi v

## Summarization

### by subject, object; count # of APIs

In [11]:
df1 = df_bte[["subject","object","api_name"]]
api_stats = df1.groupby(['subject','object'], group_keys=False)['api_name'].nunique().rename("count").to_frame()
api_stats['list'] = df1.groupby(['subject','object'], group_keys=False)['api_name'].unique().apply(list)
api_stats = api_stats.reset_index()

with pd.option_context('display.min_rows', 20,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(api_stats)


              subject                    object  count  \
0    AnatomicalEntity          AnatomicalEntity      4   
1    AnatomicalEntity         BiologicalProcess      1   
2    AnatomicalEntity                      Cell      3   
3    AnatomicalEntity         CellularComponent      6   
4    AnatomicalEntity            ChemicalEntity      1   
5    AnatomicalEntity                   Disease      7   
6    AnatomicalEntity                      Gene      2   
7    AnatomicalEntity  GrossAnatomicalStructure      3   
8    AnatomicalEntity         MolecularActivity      1   
9    AnatomicalEntity          MolecularMixture      1   
..                ...                       ...    ...   
384     SmallMolecule          MolecularMixture     10   
385     SmallMolecule       PathologicalProcess      1   
386     SmallMolecule                   Pathway      2   
387     SmallMolecule         PhenotypicFeature     11   
388     SmallMolecule      PhysiologicalProcess      1   
389     SmallM

In [12]:
api_stats.to_csv("results/api_stats.tsv", sep="\t", index=False)
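
The same count-plus-list summary can also be produced in a single grouped pass. The sketch below is one possible alternative using pandas named aggregation on toy data, not a replacement for the cell above. The frame `ops` and its values are made up for illustration.

```python
import pandas as pd

# Toy operations table (values are made up).
ops = pd.DataFrame({
    "subject":  ["Gene", "Gene", "Gene", "Disease"],
    "object":   ["Disease", "Disease", "Disease", "Gene"],
    "api_name": ["API A", "API B", "API A", "API A"],
})

# One groupby, two named outputs: number of distinct APIs and the APIs themselves.
stats = (
    ops.groupby(["subject", "object"])["api_name"]
       .agg(count="nunique", list=lambda s: list(s.unique()))
       .reset_index()
)

print(stats)
```

Swapping `api_name` for `predicate` gives the per-pair predicate summary in the same way.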


### by subject, object; count # of predicates

In [13]:
df1 = df_bte[["subject","object","predicate"]]
predicate_stats = df1.groupby(['subject','object'], group_keys=False)['predicate'].nunique().rename("count").to_frame()
predicate_stats['list'] = df1.groupby(['subject','object'], group_keys=False)['predicate'].unique().apply(list)
predicate_stats = predicate_stats.reset_index()

with pd.option_context('display.min_rows', 20,
                       'display.max_columns', None,
                       'display.precision', 3,
                       ):
    print(predicate_stats)


              subject                    object  count  \
0    AnatomicalEntity          AnatomicalEntity     17   
1    AnatomicalEntity         BiologicalProcess     16   
2    AnatomicalEntity                      Cell     15   
3    AnatomicalEntity         CellularComponent      7   
4    AnatomicalEntity            ChemicalEntity      4   
5    AnatomicalEntity                   Disease     13   
6    AnatomicalEntity                      Gene      4   
7    AnatomicalEntity  GrossAnatomicalStructure     17   
8    AnatomicalEntity         MolecularActivity      4   
9    AnatomicalEntity          MolecularMixture      3   
..                ...                       ...    ...   
384     SmallMolecule          MolecularMixture      8   
385     SmallMolecule       PathologicalProcess     10   
386     SmallMolecule                   Pathway      2   
387     SmallMolecule         PhenotypicFeature     20   
388     SmallMolecule      PhysiologicalProcess      8   
389     SmallM

In [14]:
predicate_stats.to_csv("results/predicate_stats.tsv", sep="\t", index=False)

## Filter by most common types

Filter to only include the most common entity types. Also, since we _mostly_ have the same info in both directions, keep only one direction to simplify the visualization.

In [15]:
pd.concat([df_bte['subject'], df_bte['object']]).value_counts().head(20)

Disease                     955
SmallMolecule               824
Gene                        805
ChemicalEntity              629
PhenotypicFeature           490
BiologicalProcess           463
Polypeptide                 463
MolecularMixture            391
Protein                     339
Cell                        339
CellularComponent           333
AnatomicalEntity            331
GrossAnatomicalStructure    316
MolecularActivity           291
Pathway                     203
OrganismTaxon               201
PathologicalProcess         182
ChemicalMixture             164
PhysiologicalProcess        138
ComplexMolecularMixture     125
dtype: int64

In [16]:
NUM_TYPES_TO_KEEP = 10

keep = set(pd.concat([df_bte['subject'], df_bte['object']]).value_counts().head(NUM_TYPES_TO_KEEP).keys())
keep

{'BiologicalProcess',
 'Cell',
 'ChemicalEntity',
 'Disease',
 'Gene',
 'MolecularMixture',
 'PhenotypicFeature',
 'Polypeptide',
 'Protein',
 'SmallMolecule'}
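
The selection logic is easy to check on a toy list of pairs. The sketch below uses only `collections.Counter` in place of pandas. The pairs are illustrative, and `NUM_TYPES_TO_KEEP` is set to 3 so the effect is visible on a small input.

```python
from collections import Counter

# Toy (subject, object) pairs standing in for df_bte rows.
pairs = [
    ("Gene", "Disease"),
    ("SmallMolecule", "Disease"),
    ("Gene", "SmallMolecule"),
    ("Cell", "Gene"),
    ("Pathway", "Disease"),
]

NUM_TYPES_TO_KEEP = 3

# Count every appearance of a type, whether as subject or as object.
type_counts = Counter()
for subject, obj in pairs:
    type_counts[subject] += 1
    type_counts[obj] += 1

keep = {t for t, _ in type_counts.most_common(NUM_TYPES_TO_KEEP)}

# A pair survives only if both of its ends are in the kept set.
filtered = [(s, o) for s, o in pairs if s in keep and o in keep]
print(keep)
print(filtered)
```

Because a pair needs both ends in `keep`, raising or lowering `NUM_TYPES_TO_KEEP` changes the number of edges in the final graph faster than it changes the number of nodes.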

In [17]:
predicate_stats_filt = predicate_stats.query("subject in @keep & object in @keep & subject <= object")
predicate_stats_filt.to_csv("results/predicate_stats_filt.tsv", sep="\t")
predicate_stats_filt

Unnamed: 0,subject,object,count,list
17,BiologicalProcess,BiologicalProcess,23,"[subclass_of, superclass_of, causes, caused_by..."
18,BiologicalProcess,Cell,13,"[related_to, has_participant, regulates, occur..."
20,BiologicalProcess,ChemicalEntity,12,"[has_input, has_participant, has_output, affec..."
23,BiologicalProcess,Disease,16,"[causes, has_participant, subclass_of, related..."
24,BiologicalProcess,Gene,2,"[has_part, has_participant]"
27,BiologicalProcess,MolecularMixture,7,"[affects_transport_of, related_to, has_partici..."
30,BiologicalProcess,PhenotypicFeature,4,"[affected_by, related_to, superclass_of, has_r..."
31,BiologicalProcess,Polypeptide,8,"[has_participant, related_to, affects_transpor..."
32,BiologicalProcess,Protein,20,"[is_output_of, caused_by, related_to, has_capa..."
33,BiologicalProcess,SmallMolecule,9,"[has_input, has_participant, has_output, affec..."
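
The `subject <= object` condition keeps a row only when the subject name sorts at or before the object name, so the reverse-direction row of each pair is dropped, not combined with its partner. The sketch below, on made-up rows, contrasts that filter with a hypothetical alternative that sorts each pair into a canonical order and pools the predicates. It is meant only to show the difference in behavior.

```python
from collections import defaultdict

# Toy (subject, object, predicate) rows; values are illustrative only.
rows = [
    ("SmallMolecule", "Disease", "treats"),
    ("Disease", "SmallMolecule", "treated_by"),
    ("Gene", "Disease", "causes"),           # no reverse row present
    ("Disease", "Disease", "subclass_of"),
]

# The filter used above: keep rows whose subject sorts at or before the object.
kept_by_filter = {(s, o) for s, o, _ in rows if s <= o}

# Alternative: put each pair in sorted order, then pool predicates from both directions.
merged = defaultdict(set)
for s, o, p in rows:
    merged[tuple(sorted((s, o)))].add(p)

print(sorted(kept_by_filter))
print({pair: sorted(preds) for pair, preds in merged.items()})
```

In this toy input the `Gene`/`Disease` pair exists in one direction only, so the filter loses it while the canonical-key version keeps it. The pooled version has its own cost: inverse predicates such as `treats` and `treated_by` are counted as two distinct predicates.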


In [18]:
api_stats_filt = api_stats.query("subject in @keep & object in @keep & subject <= object")
api_stats_filt.to_csv("results/api_stats_filt.tsv", sep="\t")
api_stats_filt

Unnamed: 0,subject,object,count,list
17,BiologicalProcess,BiologicalProcess,8,"[Automat-uberongraph(Trapi v1.3.0), Automat-bi..."
18,BiologicalProcess,Cell,1,[Automat-uberongraph(Trapi v1.3.0)]
20,BiologicalProcess,ChemicalEntity,2,"[Automat-uberongraph(Trapi v1.3.0), Automat-ic..."
23,BiologicalProcess,Disease,10,"[Automat-uberongraph(Trapi v1.3.0), Automat-dr..."
24,BiologicalProcess,Gene,2,"[Automat-hetio(Trapi v1.3.0), MyGene.info API]"
27,BiologicalProcess,MolecularMixture,2,"[Automat-uberongraph(Trapi v1.3.0), Automat-ic..."
30,BiologicalProcess,PhenotypicFeature,7,"[Automat-uberongraph(Trapi v1.3.0), Automat-bi..."
31,BiologicalProcess,Polypeptide,1,[Automat-uberongraph(Trapi v1.3.0)]
32,BiologicalProcess,Protein,3,"[Automat-uberongraph(Trapi v1.3.0), Automat-hu..."
33,BiologicalProcess,SmallMolecule,2,"[Automat-uberongraph(Trapi v1.3.0), Automat-ic..."


## Export to graphml

In [19]:
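def create_graph(df2, filename):
    """Write an undirected graph of subject/object pairs, weighted by 'count', to GraphML."""
    G = nx.Graph()

    # one node per entity type seen as either subject or object
    node_types = set(pd.concat([df2['subject'], df2['object']]))

    for node_type in node_types:
        G.add_node(node_type, label=add_spacing(node_type))

    # one edge per row; the row's 'count' becomes the edge weight
    for _, row in df2.iterrows():
        G.add_edge(row['subject'], row['object'], weight=row['count'])

    nx.write_graphml(G, filename, infer_numeric_types=True)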

In [20]:
def add_spacing(node_type):
    """Return a display label with line breaks for long type names; other names pass through unchanged."""
    key = {
        "BiologicalProcess":               "Biological\nProcess",
        "ChemicalEntity":                  "Chemical\nEntity",
        "MolecularMixture":                "Molecular\nMixture",
        "PhysiologicalProcess":            "Physiological\nProcess",
        "SmallMolecule":                   "Small\nMolecule",
        "PhenotypicFeature":               "Phenotypic\nFeature",
        'ChemicalExposure':                'Chemical\nExposure',
        'ClinicalAttribute':               'Clinical\nAttribute',
        'ClinicalIntervention':            'Clinical\nIntervention',
        'ComplexMolecularMixture':         'Complex\nMolecular\nMixture',
        'EnvironmentalExposure':           'Environmental\nExposure',
        'InformationContentEntity':        'Information\nContentEntity',
        'PopulationOfIndividualOrganisms': 'PopulationOf\nIndividualOrganisms'
    }
    return key.get(node_type, node_type)

In [21]:
create_graph(api_stats_filt, "results/api_stats_filt.graphml")
create_graph(predicate_stats_filt, "results/predicate_stats_filt.graphml")
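
A quick way to sanity-check an exported file is to read it back. The sketch below builds a two-node toy graph with the same attribute names used by `create_graph` (`label` on nodes, `weight` on edges), writes it to a temporary GraphML file, and reloads it with `nx.read_graphml`. The node names and weight are made up.

```python
import os
import tempfile

import networkx as nx

# Toy graph using the same attribute names as create_graph.
G = nx.Graph()
G.add_node("SmallMolecule", label="Small\nMolecule")
G.add_node("Disease", label="Disease")
G.add_edge("SmallMolecule", "Disease", weight=11)

# Round-trip through a temporary GraphML file.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "toy.graphml")
    nx.write_graphml(G, path, infer_numeric_types=True)
    G2 = nx.read_graphml(path)

edge_weight = G2["SmallMolecule"]["Disease"]["weight"]
node_count = G2.number_of_nodes()
print(node_count, edge_weight)
```

The same read-back can be pointed at `results/api_stats_filt.graphml` to confirm that the node and edge counts match the filtered tables before opening the file in a graph viewer.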

