# NCI60 Drug Sensitivity Data Importation
**Local Version**: 2
**Source Version**: NA

This notebook imports NCI60 drug sensitivity data from the [CellMiner Portal](https://discover.nci.nih.gov/cellminer/loadDownload.do).


Notes:

- NCI60 drug sensitivity data are expressed as GI50 values; see https://discover.nci.nih.gov/cellminer/drug_zscore.html for more details
- More details on IC50 vs. GI50 ([source](https://www.bioconductor.org/packages/release/bioc/vignettes/rcellminer/inst/doc/rcellminerUsage.html)): "GI50 values are similar to IC50 values, which are the concentrations that cause 50% growth inhibition, but have been renamed to emphasize the correction for the cell count at time zero"
- Clicking "Compound Activity" in the "Download Raw Data Set" section of the CellMiner Portal gives the following download link: https://discover.nci.nih.gov/cellminerdata/rawdata/DTP_NCI60_RAW.zip
- The above zip archive ships with a README that contains the following:

        PMID:	22802077
        Manufacturer:	"Developmental Therapeutics Program, NCI/NIH"
        Platform:	Sulforhodamine assay
        Platform Description:	"Negative log10 (GI50) values of sulforhodamine B assay for ~ 50K compounds, including more than 20,000 that passed quality control, 158 Food and Drug Administration approved and 79 clinical trial drugs. Higher values equate to higher sensitivity of cell lines.
        Molecular Target:	Drug
        Principal Collaborators:	"LMP, CCR, NCI (K Kohn)"
            "DPT, NCI (J Morris)"
            "Genomics and Bioinf Gp, LMP, CCR, NCI"

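The README describes the activity values as negative log10 (GI50), so a higher value corresponds to a lower concentration needed for 50% growth inhibition. A minimal sketch of the conversion follows. It assumes concentrations are in molar units, which the README excerpt does not state, and the function names are illustrative only.

```python
import math


def neg_log10_gi50_to_molar(value):
    """Convert a -log10(GI50) activity value to a GI50 concentration.

    Assumption: the underlying concentration is expressed in molar units.
    """
    return 10.0 ** (-value)


def molar_to_neg_log10_gi50(concentration_molar):
    """Inverse of the conversion above."""
    return -math.log10(concentration_molar)


# A value of 6.0 corresponds to 1e-6 (1 micromolar under the molar assumption)
gi50_sensitive = neg_log10_gi50_to_molar(6.0)
# A value of 5.0 corresponds to 1e-5, i.e. a ten-fold higher concentration
gi50_resistant = neg_log10_gi50_to_molar(5.0)
```

Because the scale is logarithmic, a difference of 1.0 between two cell lines means a ten-fold difference in the concentration required.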

In [34]:
%run -m ipy_startup
%run -m ipy_seaborn
%matplotlib inline
import io
import pandas as pd  # explicit import; presumably also provided by ipy_startup
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import io_utils
from py_utils import zip_utils
pd.set_option('display.max_info_rows', 50000000)

In [7]:
url = 'https://discover.nci.nih.gov/cellminerdata/rawdata/DTP_NCI60_RAW.zip'
filepath = db.raw_file(src.NCI60_v2, 'drug-sensitivity.zip')
filepath = io_utils.download(url, filepath, check_exists=True)
filepath

'/Users/eczech/data/research/mgds/raw/nci60_v2_drug-sensitivity.zip'
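`io_utils.download` is a project-specific helper whose implementation is not shown here. As a rough sketch of what a download with `check_exists=True` might do, the hypothetical function below skips the download when the target file is already present. The name `download_if_missing` and the injectable `fetch` parameter are assumptions made for illustration.

```python
import os
from urllib.request import urlretrieve


def download_if_missing(url, filepath, fetch=urlretrieve):
    """Download `url` to `filepath` unless the file already exists.

    `fetch` is any callable taking (url, filepath); it defaults to the
    standard library's urlretrieve and can be swapped out for testing.
    Returns the local file path in both cases.
    """
    if os.path.exists(filepath):
        # Cached copy found, so skip the network call entirely
        return filepath
    # Create the parent directory if needed before writing the file
    os.makedirs(os.path.dirname(filepath) or '.', exist_ok=True)
    fetch(url, filepath)
    return filepath
```

Caching the archive this way keeps repeated notebook runs from downloading the same file again.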

In [10]:
files = zip_utils.get_zip_archive_files(filepath)
files

['DTP_NCI60_RAW.xlsx',
 'documents/_README_Compound_Activity__DTP_NCI_60.txt',
 'DTP_NCI60_RAW_footnote.html',
 'DTP_NCI60_RAW.html']
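The `zip_utils` helpers are also project-specific. A sketch of equivalent behavior using only the standard library `zipfile` module might look like the following. The function names are hypothetical, and the demo runs against a small in-memory archive with made-up contents, not the real download.

```python
import io
import zipfile


def list_zip_members(source):
    """Return the member names of a zip archive (path or file-like object)."""
    with zipfile.ZipFile(source) as zf:
        return zf.namelist()


def read_zip_member(source, name):
    """Return the raw bytes of one archive member without extracting to disk."""
    with zipfile.ZipFile(source) as zf:
        return zf.read(name)


# Build a tiny in-memory archive to demonstrate both helpers
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('documents/readme.txt', 'hello')
    zf.writestr('data.csv', 'a,b\n1,2\n')

members = list_zip_members(buffer)
payload = read_zip_member(buffer, 'data.csv')
```

Reading a member into bytes and wrapping it in `io.BytesIO` is what lets a reader such as `pd.read_excel` consume it directly.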

In [None]:
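# Read the workbook bytes straight from the zip archive
data = zip_utils.get_zip_archive_file_data(filepath, 'DTP_NCI60_RAW.xlsx')
# Skip the first 8 rows and treat '-' as missing; `sheet_name` replaces the `sheetname` keyword of older pandas
d = pd.read_excel(io.BytesIO(data), sheet_name='all', skiprows=list(range(8)), na_values=['-'])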
d.head()

In [21]:
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74506 entries, 0 to 74505
Data columns (total 69 columns):
NSC #                          74506 non-null int64
Drug name                      74506 non-null object
FDA status                     74506 non-null object
Mechanism of action            74506 non-null object
PubChem SID                    74506 non-null object
Total probes                   74506 non-null int64
Total after quality control    74506 non-null int64
Failure reason                 74506 non-null object
Experiment name                74506 non-null object
BR:MCF7                        74506 non-null object
BR:MDA_MB_231                  74506 non-null object
BR:HS578T                      74506 non-null object
BR:BT_549                      74506 non-null object
BR:T47D                        74506 non-null object
CNS:SF_268                     74506 non-null object
CNS:SF_295                     74506 non-null object
CNS:SF_539                     74506 non-null 

## Melt to Long Format

In [27]:
d_tr = pd.melt(d, id_vars=list(d.columns[:9]), var_name='CELL_LINE_ID', value_name='VALUE')
c_m = {
    'Drug name': 'DRUG_NAME',
    'NSC #': 'DRUG_ID:NSC',
    'FDA status': 'FDA_STATUS',
    'PubChem SID': 'DRUG_ID:PUBCHEM',
    'Total probes': 'NUM_PROBES',
    'Total after quality control': 'NUM_PROBES_AFTER_QC',
    'Failure reason': 'FAILURE_REASON',
    'Experiment name': 'EXPERIMENT_ID',
    'Mechanism of action': 'ACTION'
}
d_tr = d_tr.rename(columns=c_m)
d_tr.head()

| | DRUG_ID:NSC | DRUG_NAME | FDA_STATUS | ACTION | DRUG_ID:PUBCHEM | NUM_PROBES | NUM_PROBES_AFTER_QC | FAILURE_REASON | EXPERIMENT_ID | CELL_LINE_ID | VALUE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | tolylquinone | - | - | 66954 | 3 | 2 | - | 0807NS58 | BR:MCF7 | 4.83 |
| 1 | 1 | tolylquinone | - | - | 66954 | 3 | 2 | - | 0809RS22 | BR:MCF7 | 4.76 |
| 2 | 1 | tolylquinone | - | - | 66954 | 3 | 2 | - | 0810RS52 | BR:MCF7 | 4.86 |
| 3 | 17 | 4-amino-3-pentadecylphenol | - | - | 66970 | 3 | 3 | - | 0904NS01 | BR:MCF7 | 5.18 |
| 4 | 17 | 4-amino-3-pentadecylphenol | - | - | 66970 | 3 | 3 | - | 0907RS71 | BR:MCF7 | 4.73 |
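To make the reshaping concrete, here is a small sketch of `pd.melt` on a toy wide frame. The experiment names and the second cell line's values are made up for illustration. Each identifier row is repeated once per cell line column, so the long frame has (rows × cell line columns) rows. For the real data that is 74506 rows × 60 cell line columns (69 columns minus the 9 identifier columns).

```python
import pandas as pd

# Toy wide frame: two identifier columns plus one column per cell line
wide = pd.DataFrame({
    'NSC #': [1, 17],
    'Experiment name': ['exp_a', 'exp_b'],
    'BR:MCF7': [4.83, 5.18],
    'CNS:SF_268': [4.70, 4.90],
})

id_cols = ['NSC #', 'Experiment name']
# Every non-identifier column becomes a (CELL_LINE_ID, VALUE) pair per row
long = pd.melt(wide, id_vars=id_cols, var_name='CELL_LINE_ID', value_name='VALUE')

n_value_cols = wide.shape[1] - len(id_cols)
expected_rows = len(wide) * n_value_cols
```

`pd.melt` stacks the value columns one after another, which is why the first rows of the real output all belong to `BR:MCF7`.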


In [35]:
d_tr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4470360 entries, 0 to 4470359
Data columns (total 11 columns):
DRUG_ID:NSC            4470360 non-null int64
DRUG_NAME              4470360 non-null object
FDA_STATUS             4470360 non-null object
ACTION                 4470360 non-null object
DRUG_ID:PUBCHEM        4470360 non-null object
NUM_PROBES             4470360 non-null int64
NUM_PROBES_AFTER_QC    4470360 non-null int64
FAILURE_REASON         4470360 non-null object
EXPERIMENT_ID          4470360 non-null object
CELL_LINE_ID           4470360 non-null object
VALUE                  4470360 non-null object
dtypes: int64(3), object(8)
memory usage: 375.2+ MB
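The `info()` output lists `VALUE` as `object` dtype, and the earlier preview still shows `-` placeholders in other columns, so the activity values may need an explicit numeric conversion before analysis. The sketch below uses made-up values and assumes that non-numeric entries should become missing values. Whether that is the right treatment for this dataset is not confirmed by the notebook.

```python
import pandas as pd

# Made-up mix of numeric strings, placeholders, floats, and missing values
values = pd.Series(['4.83', '-', 4.76, None], dtype=object)

# errors='coerce' turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(values, errors='coerce')

n_missing = int(numeric.isna().sum())
```

After conversion the column is `float64`, which also reduces memory use compared with an `object` column of Python values.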


In [32]:
d_tr['NUM_PROBES_AFTER_QC'].value_counts()

0      1979160
1       956460
2       755940
253     247320
3       141060
4        65880
5        36540
6        34980
121      23520
7        16080
97       15960
75       15480
8        14400
11       11340
10        9960
12        8880
114       8160
92        7920
123       7800
113       7800
102       7800
78        7740
9         7740
47        7680
117       7680
96        7560
80        5820
32        5760
15        5040
18        4980
16        4740
25        3780
39        3720
19        3300
14        3240
48        3240
42        3060
29        3000
43        3000
46        3000
13        2700
17        1140
Name: NUM_PROBES_AFTER_QC, dtype: int64
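These counts are taken over the melted frame, so each original row contributes once per cell line column and every count is a multiple of 60. A minimal sketch on toy data (hypothetical experiment and cell line names) shows how the counts can be rescaled to the number of original rows.

```python
import pandas as pd

# Toy long frame: 3 original rows, each repeated for 2 cell lines
long = pd.DataFrame({
    'EXPERIMENT_ID': ['e1', 'e1', 'e2', 'e2', 'e3', 'e3'],
    'CELL_LINE_ID': ['A', 'B', 'A', 'B', 'A', 'B'],
    'NUM_PROBES_AFTER_QC': [0, 0, 2, 2, 0, 0],
})

# Counting on the long frame inflates each count by the number of cell lines
row_counts = long['NUM_PROBES_AFTER_QC'].value_counts()

# Dividing by the number of distinct cell lines recovers per-row counts
n_cell_lines = long['CELL_LINE_ID'].nunique()
original_row_counts = row_counts // n_cell_lines
```

Applied to the output above, 1979160 / 60 = 32986 original rows have no probes remaining after quality control.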