# Small dataset
This notebook selects a small, self-consistent subset of the preprocessed dataset so that it can be run on small machines. The generated dataset keeps only the **drug-drug interactions** associated with $N$ randomly chosen side effects, together with the drugs, targets and proteins linked to them.<br>
This code is a `pandas` adaptation of the script `drug_dataset.sh`.

Author: Juan Sebastian Diaz Boada, May 2020

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import numpy as np  # used below; otherwise only available through %pylab
import pandas as pd
# Only for Python 2
#from __future__ import print_function

In [3]:
# Number of side effects
N = 3

### Import databases

In [4]:
PPI = pd.read_csv('clean_data/new-decagon-ppi.csv',sep=',')
PF = pd.read_csv('clean_data/new-decagon-genes.csv',sep=',')
DTI = pd.read_csv('clean_data/new-decagon-targets.csv',sep=',')
DDI = pd.read_csv('clean_data/new-decagon-combo.csv',sep=',')
DSE = pd.read_csv('clean_data/new-decagon-mono.csv',sep=',')
SE = pd.read_csv('original_data/bio-decagon-effectcategories.csv',sep=',')

In [5]:
# Original number of interactions, drugs and genes (kept for the final comparison)
orig_ppi = len(PPI.index)
orig_pf = len(PF.index)
orig_dti = len(DTI.index)
orig_ddi = len(DDI.index)
orig_dse = len(DSE.index)
orig_drugs = len(pd.unique(DDI[['STITCH 1','STITCH 2']].values.ravel()))
orig_genes = len(pd.unique(PPI[['Gene 1','Gene 2']].values.ravel()))

### Choose Side effects
The $N$ side effects are sampled at random, without replacement. An equivalent alternative with NumPy is:
```
se = np.random.choice(SE['Side Effect'].values, size=N, replace=False)
```

In [6]:
se = SE.sample(n=N, axis=0)['Side Effect'].values
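
Neither call fixes a seed, so each run of the notebook picks different side effects. A minimal sketch of reproducible sampling, assuming a toy table with the same `Side Effect` column (the codes, `SE_toy`, `N_toy` and the seed value are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the SE table; the codes are made up for illustration
SE_toy = pd.DataFrame({'Side Effect': ['C01', 'C02', 'C03', 'C04', 'C05']})
N_toy = 3
SEED = 0  # hypothetical seed; any fixed integer works

# A fixed random_state makes the sample identical across runs
se_a = SE_toy.sample(n=N_toy, random_state=SEED)['Side Effect'].values
se_b = SE_toy.sample(n=N_toy, random_state=SEED)['Side Effect'].values
```

Passing the same `random_state` to `SE.sample` would make the exported subset repeatable between runs.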

### Select DDIs

In [7]:
DDI = DDI[DDI['Polypharmacy Side Effect'].isin(se)].reset_index(drop=True)
DDI_drugs = pd.unique(DDI[['STITCH 1','STITCH 2']].values.ravel())
new_drugs = len(DDI_drugs)
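
To illustrate the filtering step, here is a small sketch on made-up data (the drug and side effect codes are hypothetical). It keeps the rows whose side effect was selected and then collects the drugs that still appear in either column:

```python
import pandas as pd

# Toy DDI table with the same column names as the real one
DDI_toy = pd.DataFrame({
    'STITCH 1': ['D1', 'D1', 'D2', 'D3'],
    'STITCH 2': ['D2', 'D3', 'D4', 'D5'],
    'Polypharmacy Side Effect': ['C01', 'C02', 'C01', 'C02'],
})
se_toy = ['C01']  # pretend only this side effect was sampled

# Keep the interactions with a selected side effect
kept = DDI_toy[DDI_toy['Polypharmacy Side Effect'].isin(se_toy)].reset_index(drop=True)

# Drugs that appear in at least one remaining interaction
drugs = pd.unique(kept[['STITCH 1', 'STITCH 2']].values.ravel())
```

In this toy case `D3` and `D5` drop out because they only take part in interactions with the unselected side effect.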

### Select Drug side effects

In [8]:
DSE = DSE[DSE['STITCH'].isin(DDI_drugs)].reset_index(drop=True)

### Select DTIs

In [9]:
DTI = DTI[DTI['STITCH'].isin(DDI_drugs)].reset_index(drop=True)
DTI_genes = pd.unique(DTI['Gene'])

### Select PPIs

In [10]:
PPI = PPI[np.logical_or(PPI['Gene 1'].isin(DTI_genes),
                       PPI['Gene 2'].isin(DTI_genes))].reset_index(drop=True)
PPI_genes = pd.unique(PPI[['Gene 1','Gene 2']].values.ravel())
new_genes = len(PPI_genes)
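
The filter above keeps a PPI when either of its genes is a drug target, so the resulting gene set can be larger than `DTI_genes`. A small sketch on made-up gene IDs contrasts this with a stricter "both endpoints" filter, which is shown only for comparison and is not what this notebook does:

```python
import pandas as pd

# Toy PPI table and toy target genes (IDs are made up)
PPI_toy = pd.DataFrame({'Gene 1': [1, 2, 3, 5], 'Gene 2': [2, 3, 4, 6]})
targets = [1, 2]

in_1 = PPI_toy['Gene 1'].isin(targets)
in_2 = PPI_toy['Gene 2'].isin(targets)

# Same mask as np.logical_or(in_1, in_2): at least one endpoint is a target
either = PPI_toy[in_1 | in_2].reset_index(drop=True)
# Stricter alternative: both endpoints are targets
both = PPI_toy[in_1 & in_2].reset_index(drop=True)

genes_either = set(either[['Gene 1', 'Gene 2']].values.ravel())
```

Here gene `3` ends up in `genes_either` even though it is not a target, because it interacts with target gene `2`.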

### Select protein features

In [11]:
PF = PF[PF['GeneID'].isin(PPI_genes)].reset_index(drop=True)
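
Since the goal is a consistent subset, a quick sanity check can confirm that the filtered tables only refer to drugs and genes that survived the earlier steps. This is a sketch on made-up data with hypothetical variable names; in the notebook the same `isin(...).all()` pattern could be applied to `DTI`, `DSE` and `PF`:

```python
import pandas as pd

# Toy stand-ins for the filtered tables and ID sets (values are made up)
DDI_drugs_toy = ['D1', 'D2', 'D4']
PPI_genes_toy = [1, 2, 3]
DTI_toy = pd.DataFrame({'STITCH': ['D1', 'D2'], 'Gene': [1, 2]})
PF_toy = pd.DataFrame({'GeneID': [1, 2, 3]})

# Every drug with a target must be one of the selected drugs
drugs_ok = bool(DTI_toy['STITCH'].isin(DDI_drugs_toy).all())
# Every protein with features must appear in the selected PPI network
genes_ok = bool(PF_toy['GeneID'].isin(PPI_genes_toy).all())
```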

### Print

In [12]:
print('Original number of PPI interactions:', orig_ppi)
print('New number of PPI interactions:', len(PPI.index))
print('\n')
print('Original number of DDI interactions:', orig_ddi)
print('New number of DDI interactions:', len(DDI.index))
print('\n')
print('Original number of DTI interactions:', orig_dti)
print('New number of DTI interactions:', len(DTI.index))
print('New number of DTI genes:',len(pd.unique(DTI['Gene'].values)))
print('New number of DTI drugs:',len(pd.unique(DTI['STITCH'].values)))
print('\n')
print('Original number of DSE interactions:', orig_dse)
print('New number of DSE interactions:', len(DSE.index))
print('\n')
print('Original number of proteins with features:', orig_pf)
print('New number of proteins with features:', len(PF.index))
print('\n')
print('Original number of genes:',orig_genes)
print('New number of genes:', new_genes)
print('\n')
print('Original number of drugs:',orig_drugs)
print('New number of drugs:', new_drugs)

Original number of PPI interactions: 693353
New number of PPI interactions: 288280


Original number of DDI interactions: 4615522
New number of DDI interactions: 4745


Original number of DTI interactions: 18291
New number of DTI interactions: 13763
New number of DTI genes: 3464
New number of DTI drugs: 172


Original number of DSE interactions: 174977
New number of DSE interactions: 115295


Original number of proteins with features: 17929
New number of proteins with features: 16227


Original number of genes: 17929
New number of genes: 16227


Original number of drugs: 639
New number of drugs: 357


### Export to CSV

In [13]:
PPI.to_csv('./clean_data/ppi_mini.csv',header=False,index=False,sep=',')
DTI.to_csv('./clean_data/targets_mini.csv',header=False,index=False,sep=',')
DDI.to_csv('./clean_data/combo_mini.csv',header=False,index=False,sep=',')
DSE.to_csv('./clean_data/mono_mini.csv',header=False,index=False,sep=',')
PF.to_csv('./clean_data/genes_mini.csv',header=False,index=False,sep=',')
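
Because the files are written with `header=False`, they contain no column names. A minimal sketch of a round trip, using an in-memory buffer and a toy table instead of the real files, shows that a reader has to pass `header=None` and supply the names itself:

```python
import io
import pandas as pd

# Toy table standing in for PPI (gene IDs are made up)
PPI_toy = pd.DataFrame({'Gene 1': [1, 2], 'Gene 2': [2, 3]})

# Write without header or index, as in the export cell above
buf = io.StringIO()
PPI_toy.to_csv(buf, header=False, index=False, sep=',')
buf.seek(0)

# header=None stops pandas from treating the first data row as column names
loaded = pd.read_csv(buf, header=None, names=['Gene 1', 'Gene 2'])
```

Reading such a file with the default `header` setting would consume the first interaction as the header row.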