# GDSC Drug Sensitivity Data Importation
**Local Version**: 2
**Source Version**: 6.0

This notebook imports prepared drug sensitivity data from the [GDSC](http://www.cancerrxgene.org/downloads) portal, which hosts these files on the [Sanger FTP Server](ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/) (release-6.0 in this case).

Note that GDSC exposes three drug datasets, labeled as follows:

1. Raw - "compound sensitivity data for Cell lines"
2. Preprocessed - "log(IC50) and AUC values"
3. Preprocessed - "ANOVA results"

In this case option 2 will be used, but the others are worth future consideration.

In [46]:
%run -m ipy_startup
%run -m ipy_logging
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import io_utils as io
from py_utils import assertion_utils
pd.set_option('display.max_info_rows', 50000000)

## Load Raw Sensitivity Data

In [26]:
url = 'ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-6.0/v17_fitted_dose_response.xlsx'
filepath = db.raw_file(src.GDSC_v2, 'drug-sensitivity.xlsx')
filepath = io.download(url, filepath, check_exists=True)
filepath

2016-11-28 12:10:51,823:DEBUG:mgds.data_aggregation.io_utils: Returning previously downloaded path for "/Users/eczech/data/research/mgds/raw/gdsc_v2_drug-sensitivity.xlsx"


'/Users/eczech/data/research/mgds/raw/gdsc_v2_drug-sensitivity.xlsx'
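The log line above shows that `io.download(..., check_exists=True)` returned a previously downloaded path rather than fetching the file again. The sketch below illustrates that caching pattern in general terms. It is not the `mgds` implementation: `download_if_missing`, the `fetch` parameter, and the `example.invalid` URL are all hypothetical names introduced for illustration, and a stub fetcher is used so the example runs offline.

```python
import os
import tempfile
import urllib.request


def download_if_missing(url, filepath, fetch=urllib.request.urlretrieve):
    """Return filepath, fetching url into it only when the file is absent.

    Hypothetical stand-in for a cached download helper (not the mgds code).
    """
    if os.path.exists(filepath):
        # Cache hit: skip the network entirely
        return filepath
    parent = os.path.dirname(filepath)
    if parent:
        os.makedirs(parent, exist_ok=True)
    fetch(url, filepath)
    return filepath


# Stub fetcher that records calls and writes a placeholder file
calls = []


def fake_fetch(url, filepath):
    calls.append(url)
    with open(filepath, 'w') as f:
        f.write('placeholder')


tmp_dir = tempfile.mkdtemp()
target = os.path.join(tmp_dir, 'raw', 'example.xlsx')
first = download_if_missing('ftp://example.invalid/file.xlsx', target, fetch=fake_fetch)
second = download_if_missing('ftp://example.invalid/file.xlsx', target, fetch=fake_fetch)
```

Only the first call reaches the fetcher; the second returns the cached path, which keeps repeated notebook runs from re-downloading large files.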

In [27]:
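c_str = ['DATASET_VERSION', 'IC50_RESULTS_ID', 'COSMIC_ID', 'DRUG_ID']
# Read identifier columns as strings rather than numbers
# (pandas >= 0.21 names the `sheetname` argument `sheet_name`)
d = pd.read_excel(filepath, sheetname='Export Worksheet', converters={c:str for c in c_str})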
d = d.rename(columns={'COSMIC_ID': 'CELL_LINE_ID:COSMIC', 'DRUG_ID': 'DRUG_ID:COSMIC'})

# Ensure that all records share the same dataset version ("17"), then drop this field
assert np.all(d['DATASET_VERSION'] == '17')
d = d.drop('DATASET_VERSION', axis=1)

d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224510 entries, 0 to 224509
Data columns (total 7 columns):
IC50_RESULTS_ID        224510 non-null object
CELL_LINE_ID:COSMIC    224510 non-null object
DRUG_ID:COSMIC         224510 non-null object
MAX_CONC_MICROMOLAR    224510 non-null float64
LN_IC50                224510 non-null float64
AUC                    224510 non-null float64
RMSE                   224510 non-null float64
dtypes: float64(4), object(3)
memory usage: 12.0+ MB
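The `converters={c: str ...}` argument is what keeps the identifier columns as `object` (string) rather than integer in the output above. A minimal sketch of the effect follows, using `pd.read_csv`, which accepts the same `converters` argument, on an in-memory table so that no Excel file is needed. The column values here, including the leading zeros, are toy data chosen to make the difference visible and are not taken from the GDSC file.

```python
from io import StringIO

import pandas as pd

# Toy table; the leading zeros are illustrative only
csv_text = "COSMIC_ID,DRUG_ID,LN_IC50\n00123,1026,0.71\n924100,0005,2.66\n"
c_str = ['COSMIC_ID', 'DRUG_ID']

# Default parsing infers integers and silently drops leading zeros
parsed_default = pd.read_csv(StringIO(csv_text))

# With str converters the identifiers are kept exactly as written
parsed_as_str = pd.read_csv(StringIO(csv_text), converters={c: str for c in c_str})

default_ids = parsed_default['COSMIC_ID'].tolist()
string_ids = parsed_as_str['COSMIC_ID'].tolist()
```

Keeping identifiers as strings also avoids type mismatches when they are later used as join keys against other tables.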


In [28]:
d.head()

| | IC50_RESULTS_ID | CELL_LINE_ID:COSMIC | DRUG_ID:COSMIC | MAX_CONC_MICROMOLAR | LN_IC50 | AUC | RMSE |
|---|---|---|---|---|---|---|---|
| 0 | 335 | 924100 | 1026 | 1.0 | 0.717722 | 0.89941 | 0.105665 |
| 1 | 336 | 924100 | 1028 | 2.0 | 2.6641 | 0.957206 | 0.178243 |
| 2 | 337 | 924100 | 1029 | 2.0 | 3.336828 | 0.973893 | 0.079845 |
| 3 | 338 | 924100 | 1030 | 10.0 | 5.164909 | 0.977844 | 0.094228 |
| 4 | 339 | 924100 | 1031 | 0.2 | -4.325309 | 0.50818 | 0.090478 |
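For readers who want IC50 on its original scale, the rows shown above can be converted back from the log scale. The sketch below assumes that `LN_IC50` is the natural log of IC50 expressed in micromolar, the same unit as `MAX_CONC_MICROMOLAR`; that unit is an assumption here and should be checked against the GDSC documentation. The derived column names `IC50_MICROMOLAR` and `IC50_EXTRAPOLATED` are hypothetical and are not part of the imported data.

```python
import numpy as np
import pandas as pd

# Values copied from the first five rows displayed by d.head()
sample = pd.DataFrame({
    'DRUG_ID:COSMIC': ['1026', '1028', '1029', '1030', '1031'],
    'MAX_CONC_MICROMOLAR': [1.0, 2.0, 2.0, 10.0, 0.2],
    'LN_IC50': [0.717722, 2.6641, 3.336828, 5.164909, -4.325309],
})

# Assumption: LN_IC50 is ln(IC50) with IC50 in micromolar
sample['IC50_MICROMOLAR'] = np.exp(sample['LN_IC50'])

# Flag fits whose IC50 lies beyond the highest screened concentration,
# meaning the value was extrapolated from the fitted curve
sample['IC50_EXTRAPOLATED'] = sample['IC50_MICROMOLAR'] > sample['MAX_CONC_MICROMOLAR']

extrapolated_flags = sample['IC50_EXTRAPOLATED'].tolist()
```

Under that unit assumption, only the last of these five rows has an IC50 within the screened concentration range.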


## Load GDSC Drug Meta Data

In [29]:
url = 'ftp://ftp.sanger.ac.uk/pub/project/cancerrxgene/releases/release-6.0/Screened_Compounds.xlsx'
filepath = db.raw_file(src.GDSC_v2, 'drug-meta.xlsx')
filepath = io.download(url, filepath, check_exists=True)
filepath

2016-11-28 12:11:37,142:DEBUG:mgds.data_aggregation.io_utils: Returning previously downloaded path for "/Users/eczech/data/research/mgds/raw/gdsc_v2_drug-meta.xlsx"


'/Users/eczech/data/research/mgds/raw/gdsc_v2_drug-meta.xlsx'

In [32]:
c_str = ['DRUG ID', 'DRUG NAME', 'TARGET', 'TARGET PATHWAY']
d_meta = pd.read_excel(filepath, sheetname='Export Worksheet', converters={c:str for c in c_str})
d_meta = d_meta.rename(columns={'DRUG ID': 'DRUG_ID:COSMIC', 'DRUG NAME': 'DRUG_NAME', 'TARGET PATHWAY': 'TARGET_PATHWAY'})
d_meta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 265 entries, 0 to 264
Data columns (total 5 columns):
DRUG_ID:COSMIC    265 non-null object
DRUG_NAME         265 non-null object
SYNONYMS          149 non-null object
TARGET            264 non-null object
TARGET_PATHWAY    265 non-null object
dtypes: object(5)
memory usage: 10.4+ KB


In [33]:
d_meta.head()

| | DRUG_ID:COSMIC | DRUG_NAME | SYNONYMS | TARGET | TARGET_PATHWAY |
|---|---|---|---|---|---|
| 0 | 1 | Erlotinib | NaN | EGFR | EGFR signaling |
| 1 | 3 | Rapamycin | AY-22989,Sirolimus,WY-090217 | mTOR | TOR signaling |
| 2 | 5 | Sunitinib | Sutent | PDGFRA, PDGFRB, KDR, KIT, FLT3 | RTK signaling |
| 3 | 6 | PHA-665752 | NaN | MET | RTK signaling |
| 4 | 9 | MG-132 | zLLL | Proteasome | other |


## Load Cell Line Meta Data

In [36]:
d_cl = db.load(src.GDSC_v2, db.IMPORT, 'cellline-meta')[['CELL_LINE_ID', 'CELL_LINE_ID:COSMIC']]
d_cl.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1029 entries, 0 to 1033
Data columns (total 2 columns):
CELL_LINE_ID           1029 non-null object
CELL_LINE_ID:COSMIC    1029 non-null object
dtypes: object(2)
memory usage: 24.1+ KB


## Merge Drug Data

In [44]:
# Merge sensitivity data to drug and cell line metadata
d_rx = pd.merge(d, d_meta, on=['DRUG_ID:COSMIC'], how='left')
d_rx = pd.merge(d_rx, d_cl, on=['CELL_LINE_ID:COSMIC'], how='left')

# Make sure that every record matched a drug (DRUG_NAME will be null otherwise)
assert np.all(d_rx['DRUG_NAME'].notnull())

# Ensure some other key fields are never null
assert np.all(d_rx[['DRUG_ID:COSMIC', 'CELL_LINE_ID:COSMIC', 'AUC', 'RMSE', 'LN_IC50']].notnull())

# Make sure cell line + drug combinations are unique for both types of cell line identifiers
assert not np.any(d_rx[['CELL_LINE_ID:COSMIC', 'DRUG_ID:COSMIC']].duplicated())
assert not np.any(d_rx[['CELL_LINE_ID', 'DRUG_ID:COSMIC']].dropna().duplicated())

# Make sure cell line ids have no conflicts
assert d_rx.groupby('CELL_LINE_ID')['CELL_LINE_ID:COSMIC'].nunique().max() == 1
assert d_rx.groupby('CELL_LINE_ID:COSMIC')['CELL_LINE_ID'].nunique().max() == 1

d_rx.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 224510 entries, 0 to 224509
Data columns (total 12 columns):
IC50_RESULTS_ID        224510 non-null object
CELL_LINE_ID:COSMIC    224510 non-null object
DRUG_ID:COSMIC         224510 non-null object
MAX_CONC_MICROMOLAR    224510 non-null float64
LN_IC50                224510 non-null float64
AUC                    224510 non-null float64
RMSE                   224510 non-null float64
DRUG_NAME              224510 non-null object
SYNONYMS               121336 non-null object
TARGET                 223541 non-null object
TARGET_PATHWAY         224510 non-null object
CELL_LINE_ID           216576 non-null object
dtypes: float64(4), object(8)
memory usage: 22.3+ MB
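In the output above, `CELL_LINE_ID` is non-null for 216576 of 224510 rows, so 7934 sensitivity records did not match a cell line in the metadata. The sketch below shows one way such left joins might be checked and the unmatched keys listed, using the `validate` and `indicator` arguments of `pd.merge`. The frames are toy data: the COSMIC ID `999999` and the cell line name `EXAMPLE-LINE` are hypothetical placeholders, not values from GDSC.

```python
import pandas as pd

sens = pd.DataFrame({
    'CELL_LINE_ID:COSMIC': ['924100', '924100', '999999'],
    'DRUG_ID:COSMIC': ['1', '3', '1'],
    'LN_IC50': [0.72, 2.66, 3.34],
})
drugs = pd.DataFrame({
    'DRUG_ID:COSMIC': ['1', '3'],
    'DRUG_NAME': ['Erlotinib', 'Rapamycin'],
})
cell_lines = pd.DataFrame({
    'CELL_LINE_ID': ['EXAMPLE-LINE'],
    'CELL_LINE_ID:COSMIC': ['924100'],
})

# validate='many_to_one' raises MergeError if right-hand keys are not unique,
# which would otherwise silently duplicate sensitivity rows
merged = pd.merge(sens, drugs, on='DRUG_ID:COSMIC', how='left', validate='many_to_one')

# indicator=True adds a '_merge' column marking rows with no right-hand match
merged = pd.merge(merged, cell_lines, on='CELL_LINE_ID:COSMIC', how='left',
                  validate='many_to_one', indicator=True)

unmatched = merged.loc[merged['_merge'] == 'left_only', 'CELL_LINE_ID:COSMIC']
unmatched_ids = unmatched.unique().tolist()
```

Listing the unmatched identifiers this way makes it easier to decide whether the missing cell lines are expected gaps in the metadata or a sign of a key formatting problem.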


## Export

In [47]:
assertion_utils.assert_object_types(d_rx)
db.save(d_rx, src.GDSC_v2, db.IMPORT, 'drug-sensitivity')

'/Users/eczech/data/research/mgds/import/gdsc_v2_drug-sensitivity.pkl'
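The notebook does not show what `assertion_utils.assert_object_types` verifies. One plausible pre-export check, sketched below, is that no column mixes Python types among its non-null values (for example strings alongside integers in an identifier column). The function name `find_mixed_type_columns` and the toy frame are hypothetical and may not correspond to the actual helper.

```python
import pandas as pd


def find_mixed_type_columns(df):
    """Return names of columns whose non-null values have more than one Python type."""
    mixed = []
    for col in df.columns:
        kinds = {type(v) for v in df[col].dropna()}
        if len(kinds) > 1:
            mixed.append(col)
    return mixed


# Toy frame: 'bad' mixes str and int, the other columns are homogeneous
frame = pd.DataFrame({
    'ok': ['1', '2', None],
    'bad': ['1', 2, '3'],
    'num': [1.0, 2.0, 3.0],
})
mixed_columns = find_mixed_type_columns(frame)
```

A check of this kind can catch identifier columns that were only partially converted to strings before the frame is saved.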