# Remove Outliers
This notebook filters the original DECAGON databases, together with the protein feature databases, to remove any unlinked node from the network. In short, it keeps only the elements present in all the databases so that the output databases are consistent.<br>
In addition, it normalizes the protein features that count the number of $\alpha$-helices, $\beta$-strands and turns.<br>
This code is partly a `pandas` adaptation of the script `remove_outliers.sh`.

Author: Juan Sebastian Diaz Boada, May 2020

## Python 3

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import pandas as pd
import numpy as np

Import DECAGON Data as `pandas` dataframes

In [4]:
PPI = pd.read_csv('../original_data/bio-decagon-ppi.csv',sep=',')
DTI = pd.read_csv('../original_data/bio-decagon-targets-all.csv',sep=',')
DDI = pd.read_csv('../original_data/bio-decagon-combo.csv',sep=',')
DSE = pd.read_csv('../original_data/bio-decagon-mono.csv',sep=',')

In [5]:
# Original number of interactions
orig_ppi = len(PPI.index)
orig_dti = len(DTI.index)
orig_ddi = len(DDI.index)
orig_dse = len(DSE.index)

### Common genes between PPI network and protein features

In [6]:
# PPI genes
PPI_genes = pd.unique(np.hstack((PPI['Gene 1'].values,PPI['Gene 2'].values))) # integer gene IDs
orig_genes_ppi = len(PPI_genes) # Original number of genes

### Common drugs between DDI network and drug single side effects

In [7]:
# DDI drugs
DDI_drugs = pd.unique(DDI[["STITCH 1", "STITCH 2"]].values.ravel())
orig_drugs_ddi = len(DDI_drugs) # Original number of drugs
orig_se_combo = len(pd.unique(DDI['Polypharmacy Side Effect'].values))
# Drugs with single side effects
DSE_drugs = pd.unique(DSE['STITCH'].values)
orig_drug_dse = len(DSE_drugs) # Original number of drugs
orig_se_mono = len(pd.unique(DSE['Side Effect Name']))

In [8]:
# Calculate the intersection of the DDI and DSE drugs
# (i.e., the drugs in the interaction network that have single side effects)
inter_drugs = np.intersect1d(DDI_drugs,DSE_drugs,assume_unique=True)
# Choose only the entries in DDI that are in the intersection
DDI = DDI[np.logical_and(DDI['STITCH 1'].isin(inter_drugs).values,
                     DDI['STITCH 2'].isin(inter_drugs).values)]
# Some drugs in the intersection may only interact with drugs outside it (outsiders),
# so they drop out of DDI after the first filter. That is why DSE is filtered a
# second time, using the drugs that remain in DDI.
DDI_drugs = pd.unique(DDI[["STITCH 1", "STITCH 2"]].values.ravel())
DSE = DSE[DSE['STITCH'].isin(DDI_drugs)]
new_drugs_ddi = len(pd.unique(DDI[['STITCH 1','STITCH 2']].values.ravel()))
new_drugs_dse = len(pd.unique(DSE['STITCH'].values))
new_se_combo = len(pd.unique(DDI['Polypharmacy Side Effect'].values))
new_se_mono = len(pd.unique(DSE['Side Effect Name']))
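
The two-pass filter above can be illustrated on a toy example. This is a minimal sketch with made-up drug identifiers (`A`, `B`, `C`, `X` are hypothetical and not taken from DECAGON); only the column names match the notebook. Drug `C` has a single side effect but its only partner, `X`, does not, so `C` disappears from the interaction table after the first pass and must then be removed from the single side effect table as well.

```python
import numpy as np
import pandas as pd

# Toy tables with hypothetical identifiers; column names follow the notebook.
toy_ddi = pd.DataFrame({
    'STITCH 1': ['A', 'A', 'C'],
    'STITCH 2': ['B', 'X', 'X'],
    'Polypharmacy Side Effect': ['s1', 's2', 's3'],
})
toy_dse = pd.DataFrame({
    'STITCH': ['A', 'B', 'C'],
    'Side Effect Name': ['nausea', 'rash', 'cough'],
})

# Drugs present in both tables.
ddi_drugs = pd.unique(toy_ddi[['STITCH 1', 'STITCH 2']].values.ravel())
dse_drugs = pd.unique(toy_dse['STITCH'].values)
common = np.intersect1d(ddi_drugs, dse_drugs)

# First pass: keep only interactions whose two drugs are both in the intersection.
toy_ddi = toy_ddi[toy_ddi['STITCH 1'].isin(common) & toy_ddi['STITCH 2'].isin(common)]

# Second pass: keep only side effect rows for drugs that still appear in an interaction.
remaining = pd.unique(toy_ddi[['STITCH 1', 'STITCH 2']].values.ravel())
toy_dse = toy_dse[toy_dse['STITCH'].isin(remaining)]

print(sorted(remaining))           # drugs left in the interaction table
print(toy_dse['STITCH'].tolist())  # drugs left in the single side effect table
```

In this toy case `C` is in the intersection but is dropped by the second pass, which is the situation the second filter is meant to handle.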

### Selection of entries of DTI database

In [9]:
orig_genes_dti = len(pd.unique(DTI['Gene'].values))
orig_drugs_dti = len(pd.unique(DTI['STITCH'].values))
DTI = DTI[np.logical_and(DTI['STITCH'].isin(DDI_drugs),DTI['Gene'].isin(PPI_genes))]
new_genes_dti = len(pd.unique(DTI['Gene'].values))
new_drugs_dti = len(pd.unique(DTI['STITCH'].values))
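
Once all tables are filtered, it may be useful to confirm that no unlinked node remains. The sketch below is not part of the original notebook: `check_consistency` is a hypothetical helper that assumes the column names used above and reports, for each pair of tables, whether one set of identifiers is contained in the other.

```python
import pandas as pd

def check_consistency(ppi, dti, ddi, dse):
    """Hypothetical helper: report whether the filtered tables reference
    only genes and drugs that exist in the PPI and DDI networks."""
    ppi_genes = set(ppi[['Gene 1', 'Gene 2']].values.ravel())
    ddi_drugs = set(ddi[['STITCH 1', 'STITCH 2']].values.ravel())
    dse_drugs = set(dse['STITCH'])
    return {
        'dti_genes_in_ppi': set(dti['Gene']) <= ppi_genes,
        'dti_drugs_in_ddi': set(dti['STITCH']) <= ddi_drugs,
        'dse_drugs_in_ddi': dse_drugs <= ddi_drugs,
        'ddi_drugs_in_dse': ddi_drugs <= dse_drugs,
    }

# Toy tables with made-up identifiers, for illustration only.
toy_ppi = pd.DataFrame({'Gene 1': [1, 2], 'Gene 2': [2, 3]})
toy_ddi = pd.DataFrame({'STITCH 1': ['A'], 'STITCH 2': ['B']})
toy_dse = pd.DataFrame({'STITCH': ['A', 'B']})
good_dti = pd.DataFrame({'STITCH': ['A'], 'Gene': [1]})
bad_dti = pd.DataFrame({'STITCH': ['A'], 'Gene': [9]})  # gene 9 is not in toy_ppi

good_report = check_consistency(toy_ppi, good_dti, toy_ddi, toy_dse)
bad_report = check_consistency(toy_ppi, bad_dti, toy_ddi, toy_dse)
print(good_report)
print(bad_report)
```

With the notebook's own variables, the call would be `check_consistency(PPI, DTI, DDI, DSE)`, and every entry is expected to be `True` if the filtering worked as intended.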

In [13]:
# Interactions (edges)
print('Interactions (edges)')
print('Original number of PPI interactions',orig_ppi)
print('New number of PPI interactions',len(PPI.index))
print('\n')
print('Original number of DTI interactions',orig_dti)
print('New number of DTI interactions',len(DTI.index))
print('\n')
print('Original number of DDI interactions',orig_ddi)
print('New number of DDI interactions', len(DDI.index))
print('\n')
print('Original number of DSE interactions',orig_dse)
print('New number of DSE interactions',len(DSE.index))
print('\n')
# Drugs and genes (nodes)
print('Drugs and genes (nodes)')
print("Original number of drugs in DSE:",orig_drug_dse)
print("New number of drugs in DSE:",new_drugs_dse)
print('\n')
print("Original number drugs in DTI",orig_drugs_dti)
print("New number of drugs in DTI",new_drugs_dti)
print('\n')
print('Original number of genes in DTI:',orig_genes_dti)
print('New number of genes in DTI:',new_genes_dti)
print('\n')
print('Original number of genes in PPI:',orig_genes_ppi)
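# Recount from the current PPI table instead of reusing the original count
# (PPI is not filtered in this notebook, so both values coincide)
print('New number of genes in PPI:',len(pd.unique(PPI[['Gene 1','Gene 2']].values.ravel())))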
print('\n')
print('Original number of drugs in DDI:',orig_drugs_ddi)
print('New number of drugs in DDI:',new_drugs_ddi)
print('\n')
# Side effects
print('Side effects')
print('Original number of joint side effects:',orig_se_combo)
print('New number of joint side effects:', new_se_combo)
print('\n')
print('Original number of single side effects:', orig_se_mono)
print('New number of single side effects:', new_se_mono)

Interactions (edges)
Original number of PPI interactions 715612
New number of PPI interactions 715612


Original number of DTI interactions 131034
New number of DTI interactions 18595


Original number of DDI interactions 4649441
New number of DDI interactions 4615522


Original number of DSE interactions 174977
New number of DSE interactions 174977


Drugs and genes (nodes)
Original number of drugs in DSE: 639
New number of drugs in DSE: 639


Original number drugs in DTI 1774
New number of drugs in DTI 283


Original number of genes in DTI: 7795
New number of genes in DTI: 3640


Original number of genes in PPI: 19081
New number of genes in PPI: 19081


Original number of drugs in DDI: 645
New number of drugs in DDI: 639


Side effects
Original number of joint side effects: 1317
New number of joint side effects: 1317


Original number of single side effects: 9702
New number of single side effects: 9702


## Export to csv

In [15]:
PPI.to_csv('small-decagon-ppi.csv',index=False,sep=',')
DTI.to_csv('small-decagon-targets.csv',index=False,sep=',')
DDI.to_csv('small-decagon-combo.csv',index=False,sep=',')
DSE.to_csv('small-decagon-mono.csv',index=False,sep=',')
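
As an optional sanity check on the export, a table can be written and read back to confirm that its shape and column names survive the round trip. The sketch below is an illustration rather than part of the original notebook: `roundtrip_matches` is a hypothetical helper, and it runs on a toy table in a temporary directory so that it does not touch the exported files.

```python
import os
import tempfile
import pandas as pd

def roundtrip_matches(df, path):
    """Hypothetical helper: write df to CSV, read it back, and compare
    the shape and the column names."""
    df.to_csv(path, index=False, sep=',')
    reloaded = pd.read_csv(path, sep=',')
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)

# Toy table with made-up identifiers; column names follow the DDI table.
toy = pd.DataFrame({
    'STITCH 1': ['A'],
    'STITCH 2': ['B'],
    'Polypharmacy Side Effect': ['s1'],
})

with tempfile.TemporaryDirectory() as tmp:
    ok = roundtrip_matches(toy, os.path.join(tmp, 'toy-combo.csv'))
print(ok)
```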