### In this notebook we take care of:

- Merging and disentangling the `BioKG graph` with the `benchmark datasets` that are included in it.
    - Benchmarks include: `dpi_fda`, `dpi_fda_exp`, `ddi_efficacy`, `ddi_mineral`, `ppi_phosphorylation`
- Storing the new disentangled data

### For this notebook to run, you require a local installation of the BioKG data.


In [1]:
import pandas as pd

### Load all links in BioKG are being combined in one single large (2m rows) .tsv file


In [23]:
biokg_links_path = "data/biokg/biokg.links.tsv"

biokg_links = pd.read_csv(biokg_links_path,
                         sep='\t',
                         names=['left', 'property', 'right'])

len(biokg_links)

Check out the **unique** properties in the links file...

In [76]:
biokg_links['property'].unique()

array(['DISEASE_PATHWAY_ASSOCIATION', 'PROTEIN_PATHWAY_ASSOCIATION',
       'DRUG_DISEASE_ASSOCIATION', 'RELATED_GENETIC_DISORDER', 'PPI',
       'DRUG_TARGET', 'DRUG_CARRIER', 'DRUG_ENZYME', 'DRUG_TRANSPORTER',
       'COMPLEX_IN_PATHWAY', 'DRUG_PATHWAY_ASSOCIATION',
       'PROTEIN_DISEASE_ASSOCIATION', 'DISEASE_GENETIC_DISORDER',
       'MEMBER_OF_COMPLEX', 'DDI', 'DPI', 'COMPLEX_TOP_LEVEL_PATHWAY'],
      dtype=object)

#### As we can observe, the relations in the benchmarks are `normalised` before being merged in the graph.... 


Which implies that we should merge on right,left column ignoring the property name + order...

#### Alert: The phosphorylation benchmark is the only one with 4 columns as it includes not only the PPI relation but the specific substrate that is involved in the interaction... 

In [70]:
phospho_bk = pd.read_csv('data/output/benchmarks/' + 'phosphorylation.tsv', sep='\t', names=['left', 'property', 'right', 'substrate_site']
                         , index_col=None)


To solve this, we include the substrate column which **gets ignored** during the merging....

### Load & Merge all benchmark datasets

In [50]:
benchmark_path ='data/output/benchmarks/'


def load_and_merge_benchmark_datasets(path):
    
    dpi_fda_bk = pd.read_csv(path + 'dpi_fda.tsv', sep='\t', names=['left', 'property', 'right'])
    dpi_fda_exp_bk = pd.read_csv(path + 'dpi_fda.tsv', sep='\t', names=['left', 'property', 'right'])
    ddi_efficacy_bk = pd.read_csv(path + 'ddi_efficacy.tsv', sep='\t', names=['left', 'property', 'right'])
    ddi_mineral_bk = pd.read_csv(path + 'ddi_minerals.tsv', sep='\t', names=['left', 'property', 'right'])
    phospho_bk = pd.read_csv(path + 'phosphorylation.tsv', sep='\t', names=['left', 'property', 'right', 'substrate'])
    
    
    all_benchmarks = pd.concat([dpi_fda_bk, dpi_fda_exp_bk, ddi_efficacy_bk, ddi_mineral_bk, phospho_bk])
    
    return all_benchmarks

In [51]:
all_benchmarks = load_and_merge_benchmark_datasets(benchmark_path)
all_benchmarks

### Before merging we need to prepare and ignore the order between left and right columns... 


To do that:
- We create a new column with the set of columns `left` and `right`
- Due to the set being unordered we can then merge on the `combined` column.
- Then by picking out the `right_only` results we know we have the BioKG graph without the benchmarks.

In [58]:
### Let's try to merge ignoring the order...
all_benchmarks['combined'] = all_benchmarks.apply(lambda x: str(set([x['left'], x['right']])),axis=1)

In [71]:
## Check it out
all_benchmarks[:3]

Unnamed: 0,left,property,right,substrate,combined
0,DB01079,DPI,Q13639,,"{'DB01079', 'Q13639'}"
1,DB00114,DPI,P20711,,"{'DB00114', 'P20711'}"
2,DB01158,DPI,P13637,,"{'P13637', 'DB01158'}"


In [60]:
## Do the same for the LINKS of BioKG
biokg_links['combined'] = biokg_links.apply(lambda x: str(set([x['left'], x['right']])),axis=1)

In [72]:
# Check it out
biokg_links[:3]

Unnamed: 0,left,property,right,combined
0,C566487,DISEASE_PATHWAY_ASSOCIATION,hsa00071,"{'hsa00071', 'C566487'}"
1,C567839,DISEASE_PATHWAY_ASSOCIATION,map04810,"{'map04810', 'C567839'}"
2,C562476,DISEASE_PATHWAY_ASSOCIATION,hsa04512,"{'C562476', 'hsa04512'}"


### Final merge on `combined` column

In [73]:
merged = pd.merge(all_benchmarks, biokg_links, how='right', on=["combined"], indicator=True)

In [86]:
merged[merged['_merge']=='both'][:3]

Unnamed: 0,left_x,property_x,right_x,substrate,combined,left_y,property_y,right_y,_merge
328244,P28482,phosphorylates,O75582,S360,"{'O75582', 'P28482'}",O75582,PPI,P28482,both
328245,P28482,phosphorylates,O75582,T581,"{'O75582', 'P28482'}",O75582,PPI,P28482,both
328297,P06493,phosphorylates,P02545,T19,"{'P06493', 'P02545'}",P02545,PPI,P06493,both


In [79]:
biokg_without_benchmarks = merged[merged['_merge']=='right_only']

In [85]:
biokg_without_benchmarks_clean = biokg_without_benchmarks[['left_y', 'property_y', 'right_y']]

In [87]:
biokg_without_benchmarks_clean

Unnamed: 0,left_y,property_y,right_y
0,C566487,DISEASE_PATHWAY_ASSOCIATION,hsa00071
1,C567839,DISEASE_PATHWAY_ASSOCIATION,map04810
2,C562476,DISEASE_PATHWAY_ASSOCIATION,hsa04512
3,C567032,DISEASE_PATHWAY_ASSOCIATION,map00750
4,C562710,DISEASE_PATHWAY_ASSOCIATION,map04930
...,...,...,...
2110901,R-HSA-6801809,COMPLEX_TOP_LEVEL_PATHWAY,R-HSA-168256
2110902,R-HSA-2213211,COMPLEX_TOP_LEVEL_PATHWAY,R-HSA-168256
2110903,R-HSA-3730834,COMPLEX_TOP_LEVEL_PATHWAY,R-HSA-392499
2110904,R-HSA-5668765,COMPLEX_TOP_LEVEL_PATHWAY,R-HSA-168256


In [None]:
### Note: Figure out why the numbers don't exactly add up.... :/ 

In [64]:
250460 - 242483

7977

#### Store the new data.

In [94]:
biokg_without_benchmarks_clean.to_csv('data/biokg_no_benchmark.tsv', 
                                      sep='\t', 
                                      index=False,
                                      header=['left', 'property', 'right'])

In [95]:
biokg_without_benchmarks_clean

Unnamed: 0,left_y,property_y,right_y
0,C566487,DISEASE_PATHWAY_ASSOCIATION,hsa00071
1,C567839,DISEASE_PATHWAY_ASSOCIATION,map04810
2,C562476,DISEASE_PATHWAY_ASSOCIATION,hsa04512
3,C567032,DISEASE_PATHWAY_ASSOCIATION,map00750
4,C562710,DISEASE_PATHWAY_ASSOCIATION,map04930
...,...,...,...
2110901,R-HSA-6801809,COMPLEX_TOP_LEVEL_PATHWAY,R-HSA-168256
2110902,R-HSA-2213211,COMPLEX_TOP_LEVEL_PATHWAY,R-HSA-168256
2110903,R-HSA-3730834,COMPLEX_TOP_LEVEL_PATHWAY,R-HSA-392499
2110904,R-HSA-5668765,COMPLEX_TOP_LEVEL_PATHWAY,R-HSA-168256
