# Small dataset
This notebook selects a small, consistent subset of the preprocessed dataset so that it can be run on machines with limited resources. The generated dataset is restricted to the **drug-drug interactions** associated with $N$ randomly chosen side effects.<br>
This code is a `pandas` adaptation of the script `drug_dataset.sh`.

Author: Juan Sebastian Diaz Boada, May 2020

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
# Only needed for Python 2; __future__ imports have to come first in a module
from __future__ import print_function
import numpy as np
import pandas as pd

In [3]:
# Number of side effects
N = 3

### Import databases

In [4]:
PPI = pd.read_csv('modif_data/new-decagon-ppi.csv',sep=',')
PF = pd.read_csv('modif_data/new-decagon-genes.csv',sep=',')
DTI = pd.read_csv('modif_data/new-decagon-targets.csv',sep=',')
DDI = pd.read_csv('modif_data/new-decagon-combo.csv',sep=',')
DSE = pd.read_csv('modif_data/new-decagon-mono.csv',sep=',')
SE = pd.read_csv('orig_data/bio-decagon-effectcategories.csv')

In [5]:
# Row counts of each table before filtering
orig_ppi = len(PPI.index)
orig_pf = len(PF.index)
orig_dti = len(DTI.index)
orig_ddi = len(DDI.index)
orig_dse = len(DSE.index)

### Choose Side effects
The $N$ side effects are drawn at random without replacement. An equivalent alternative with NumPy would be:
```
se = np.random.choice(SE['Side Effect'].values, size=N, replace=False)
```

In [6]:
se = SE.sample(n=N, axis=0)['Side Effect'].values
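
Because `SE.sample` draws a different subset on every run, the resulting mini dataset changes between executions. The following is a minimal sketch of a repeatable draw, assuming a fixed seed is acceptable for the use case. The toy table `SE_toy` and the seed value `0` are illustrative and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the side-effect table (hypothetical values).
SE_toy = pd.DataFrame({'Side Effect': ['C01', 'C02', 'C03', 'C04', 'C05']})
N = 3

# Passing random_state makes the draw repeatable across runs.
se_a = SE_toy.sample(n=N, random_state=0)['Side Effect'].values
se_b = SE_toy.sample(n=N, random_state=0)['Side Effect'].values

# Both draws use the same seed, so they select the same side effects.
print(list(se_a) == list(se_b))
```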

### Select DDIs

In [7]:
DDI = DDI[DDI['Polypharmacy Side Effect'].isin(se)].reset_index(drop=True)
col_drugs = pd.unique(DDI[['STITCH 1','STITCH 2']].values.ravel())
n_drugs = len(col_drugs)
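
To make the filtering step concrete, here is a small sketch on toy data that reuses the column names of the cell above. The drug and side-effect identifiers are invented for illustration only.

```python
import pandas as pd

# Toy DDI table (hypothetical identifiers, same column names as above).
DDI_toy = pd.DataFrame({
    'STITCH 1': ['D1', 'D1', 'D2', 'D3'],
    'STITCH 2': ['D2', 'D3', 'D4', 'D4'],
    'Polypharmacy Side Effect': ['SE_A', 'SE_B', 'SE_A', 'SE_C'],
})
se_toy = ['SE_A']

# Keep only the rows whose side effect was selected.
mask = DDI_toy['Polypharmacy Side Effect'].isin(se_toy)
kept = DDI_toy[mask].reset_index(drop=True)

# Flatten both drug columns and deduplicate to get the drugs still in play.
drugs = pd.unique(kept[['STITCH 1', 'STITCH 2']].values.ravel())
print(len(kept), sorted(drugs))
```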

### Select Drug side effects

In [8]:
DSE = DSE[DSE['STITCH'].isin(col_drugs)].reset_index(drop=True)

### Select DTIs

In [9]:
DTI = DTI[DTI['STITCH'].isin(col_drugs)].reset_index(drop=True)
col_genes = pd.unique(DTI['Gene'])
n_genes = len(col_genes)

### Select PPIs

In [10]:
PPI = PPI[np.logical_or(PPI['Gene 1'].isin(col_genes),
                       PPI['Gene 2'].isin(col_genes))].reset_index(drop=True)
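
Note that the `np.logical_or` filter keeps every interaction with at least one endpoint in `col_genes`, so the reduced PPI table can still contain genes that are not drug targets. The sketch below contrasts this with a stricter filter that requires both endpoints, using toy data. Whether the stricter variant is preferable is an assumption that depends on how the dataset is used downstream; it is not what the notebook does.

```python
import pandas as pd

# Toy PPI table and target-gene list (hypothetical gene IDs).
PPI_toy = pd.DataFrame({'Gene 1': [1, 1, 3, 5], 'Gene 2': [2, 3, 4, 6]})
genes_toy = [1, 2]

in_1 = PPI_toy['Gene 1'].isin(genes_toy)
in_2 = PPI_toy['Gene 2'].isin(genes_toy)

# Same logic as the notebook: at least one endpoint is a selected gene.
either = PPI_toy[in_1 | in_2].reset_index(drop=True)
# Stricter alternative: both endpoints are selected genes.
both = PPI_toy[in_1 & in_2].reset_index(drop=True)

# Genes pulled in by the "either" filter that are outside genes_toy.
extra = (set(either['Gene 1']) | set(either['Gene 2'])) - set(genes_toy)
print(len(either), len(both), sorted(extra))
```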

### Select protein features

In [11]:
PF = PF[PF['GeneID'].isin(col_genes)].reset_index(drop=True)

In [12]:
print('Original number of PPI interactions:', orig_ppi)
print('New number of PPI interactions:', len(PPI.index))

print('Original number of DDI interactions:', orig_ddi)
print('New number of DDI interactions:', len(DDI.index))

print('Original number of DTI interactions:', orig_dti)
print('New number of DTI interactions:', len(DTI.index))

print('Original number of DSE interactions:', orig_dse)
print('New number of DSE interactions:', len(DSE.index))

print('Original number of genes:', orig_pf)
print('New number of genes:', n_genes)
print('New number of drugs:', n_drugs)

Original number of PPI interactions: 581429
New number of PPI interactions: 296151
Original number of DDI interactions: 3504271
New number of DDI interactions: 6830
Original number of DTI interactions: 18293
New number of DTI interactions: 14944
Original number of DSE interactions: 81286
New number of DSE interactions: 68852
Original number of genes: 7628
New number of genes: 3497
New number of drugs: 457
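
The counts above are easier to compare as retention percentages. A short sketch, with the numbers copied by hand from this particular run (they will differ whenever the random selection of side effects changes):

```python
# (original, reduced) row counts taken from the output above.
counts = {
    'PPI': (581429, 296151),
    'DDI': (3504271, 6830),
    'DTI': (18293, 14944),
    'DSE': (81286, 68852),
}

# Percentage of rows that survive the filtering, per table.
retained = {name: 100.0 * new / orig for name, (orig, new) in counts.items()}

for name, pct in retained.items():
    print('{}: {:.2f}% of rows kept'.format(name, pct))
```

The DDI table shrinks by far the most, while the other tables keep a large share of their rows because the drugs involved in the selected side effects cover much of the original drug and gene sets.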


## Export to CSV

In [13]:
PPI.to_csv('./modif_data/ppi_mini.csv',index=False,sep=',')
DTI.to_csv('./modif_data/targets_mini.csv',index=False,sep=',')
DDI.to_csv('./modif_data/combo_mini.csv',index=False,sep=',')
DSE.to_csv('./modif_data/mono_mini.csv',index=False,sep=',')
PF.to_csv('./modif_data/genes_mini.csv',index=False,sep=',')
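
Before using the exported files, it may be worth checking that the reduced tables agree with each other. The helper below is a hypothetical sketch (the name `check_subset` and the toy tables are not part of the original notebook); it only assumes the column names used in the cells above.

```python
import pandas as pd

def check_subset(DDI, DSE, DTI, PF):
    """Return booleans describing cross-table consistency of the subset."""
    drugs = set(DDI['STITCH 1']) | set(DDI['STITCH 2'])
    genes = set(DTI['Gene'])
    return {
        # Every drug with single-drug side effects appears in some DDI.
        'dse_drugs_ok': set(DSE['STITCH']) <= drugs,
        # Every drug with a target appears in some DDI.
        'dti_drugs_ok': set(DTI['STITCH']) <= drugs,
        # Every gene with features is targeted by some selected drug.
        'pf_genes_ok': set(PF['GeneID']) <= genes,
    }

# Toy tables with hypothetical identifiers.
DDI_toy = pd.DataFrame({'STITCH 1': ['D1'], 'STITCH 2': ['D2']})
DSE_toy = pd.DataFrame({'STITCH': ['D1', 'D2']})
DTI_toy = pd.DataFrame({'STITCH': ['D1'], 'Gene': [10]})
PF_toy = pd.DataFrame({'GeneID': [10]})

report = check_subset(DDI_toy, DSE_toy, DTI_toy, PF_toy)
print(report)
```

In the notebook, the same call could be made with the filtered `DDI`, `DSE`, `DTI` and `PF` frames right after the export cell.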