# NCI60 Mutation Data Importation
**Local Version**: 1
**Source Version**: NA

This notebook imports the raw NCI60 mutation data from the CellMiner R data package, [rcellminerData](http://bioconductor.statistik.tu-dortmund.de/packages/3.4/data/experiment/manuals/rcellminerData/man/rcellminerData.pdf).

In [1]:
%run -m ipy_startup
%load_ext rpy2.ipython
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from py_utils.collection_utils import subset

In [2]:
%%R 
# Load CellMiner package containing necessary data
library('rcellminerData')

In [3]:
%%R -o d
# Load NCI60 mutation data
d <- as.data.frame(molData@eSetList$mut)

In [4]:
# All mutation values should be 0 or 1. Check for NA/NULL values first (a null check
# made after an integer cast could never fail), then convert the DataFrame to an
# integer type and ensure that every value is either 0 or 1
assert not d.isnull().values.any()
d = d.astype(np.int64)
assert d.isin([0, 1]).values.all()
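
The checks in the cell above can be tried on a small frame. This is a minimal sketch, not the notebook's data: the frame `toy` and the helper `validate_binary` are hypothetical names introduced only for illustration. The null check runs before the integer cast, because casting NaN to an integer type raises an error.

```python
import numpy as np
import pandas as pd


def validate_binary(df):
    """Return an int64 copy of df after checking that every value is 0 or 1.

    Hypothetical helper for illustration; it is not part of mgds or py_utils.
    """
    # Check for missing values first; NaN cannot be cast to int64
    assert not df.isnull().values.any(), 'Found null values'
    # isin is vectorized, so no per-element Python function is needed
    assert df.isin([0, 1]).values.all(), 'Found values other than 0 or 1'
    return df.astype(np.int64)


# Toy stand-in for the wide cell line x gene matrix
toy = pd.DataFrame(
    {'A1BG': [0.0, 1.0], 'ZZEF1': [1.0, 0.0]},
    index=['BR:MCF7', 'BR:HS578T'],
)
toy_int = validate_binary(toy)
```

A frame containing any other value, such as 2, fails the second assertion instead of being silently cast.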

In [5]:
d.index.name = 'CELL_LINE_ID'
d = d.reset_index()
d.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Columns: 10665 entries, CELL_LINE_ID to ZZEF1
dtypes: int64(10664), object(1)
memory usage: 4.9+ MB


In [6]:
d.head()

,CELL_LINE_ID,A1BG,A1CF,A2ML1,A4GALT,A4GNT,AAAS,AACS,AADACL2,AADACL3,...,ZSWIM4,ZSWIM5,ZSWIM7,ZW10,ZWILCH,ZWINT,ZXDC,ZYG11B,ZYX,ZZEF1
0,BR:MCF7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,BR:MDA_MB_231,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,BR:HS578T,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,BR:BT_549,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,BR:T47D,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [7]:
# Ensure that each cell line has at least one mutated gene and that
# each gene is mutated in at least one cell line
assert d.set_index('CELL_LINE_ID').sum(axis=1).min() > 0
assert d.set_index('CELL_LINE_ID').sum(axis=0).min() > 0
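
If either assertion above were to fail, it would help to know which rows or columns are empty. The following is a minimal sketch on a toy frame (the data and the variable names `per_line`, `per_gene`, `empty_lines`, and `empty_genes` are assumptions for illustration) that lists the offending cell lines and genes rather than only checking the minimum.

```python
import pandas as pd

# Toy wide-format frame: one row per cell line, one column per gene
toy = pd.DataFrame({
    'CELL_LINE_ID': ['BR:MCF7', 'BR:HS578T'],
    'A1BG': [0, 0],
    'ZZEF1': [0, 1],
})
m = toy.set_index('CELL_LINE_ID')

# axis=1 sums across genes, giving the mutation count per cell line;
# axis=0 sums across cell lines, giving the mutation count per gene
per_line = m.sum(axis=1)
per_gene = m.sum(axis=0)

# Names of cell lines and genes with no mutations at all
empty_lines = per_line[per_line == 0].index.tolist()
empty_genes = per_gene[per_gene == 0].index.tolist()
```

In this toy data both lists are non-empty, so the equivalent of the assertions above would fail for it.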

In [8]:
d = pd.melt(d, id_vars='CELL_LINE_ID', var_name='GENE_ID', value_name='VALUE')
assert d.groupby(['CELL_LINE_ID', 'GENE_ID']).size().max() == 1
d = subset(d, lambda df: df[df['VALUE'].notnull()], subset_op='Remove null values for column "VALUE"')
d.head()

[Remove null values for column "VALUE"] Records before = 639840, Records after = 639840, Records removed = 0 (%0.00)


,CELL_LINE_ID,GENE_ID,VALUE
0,BR:MCF7,A1BG,0
1,BR:MDA_MB_231,A1BG,0
2,BR:HS578T,A1BG,0
3,BR:BT_549,A1BG,0
4,BR:T47D,A1BG,0
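
The wide-to-long reshape above produces one record per (cell line, gene) pair, which is why the record count is 60 × 10,664 = 639,840. A minimal sketch of the same reshape on a toy frame follows; the frame and the names `wide` and `long_df` are hypothetical, and `DataFrame.duplicated` is used here as one possible way to check pair uniqueness.

```python
import pandas as pd

# Toy wide-format frame: 2 cell lines x 2 genes
wide = pd.DataFrame({
    'CELL_LINE_ID': ['BR:MCF7', 'BR:HS578T'],
    'A1BG': [0, 0],
    'ZZEF1': [0, 1],
})

# Melt every gene column into (GENE_ID, VALUE) pairs, keeping the cell line ID
long_df = pd.melt(wide, id_vars='CELL_LINE_ID', var_name='GENE_ID', value_name='VALUE')

# The long frame should have rows x gene-columns records ...
n_lines = len(wide)
n_genes = wide.shape[1] - 1  # exclude the CELL_LINE_ID column
assert len(long_df) == n_lines * n_genes

# ... and each (cell line, gene) pair should appear exactly once
assert not long_df.duplicated(['CELL_LINE_ID', 'GENE_ID']).any()
```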


In [9]:
d.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 639840 entries, 0 to 639839
Data columns (total 3 columns):
CELL_LINE_ID    639840 non-null object
GENE_ID         639840 non-null object
VALUE           639840 non-null int64
dtypes: int64(1), object(2)
memory usage: 19.5+ MB


In [10]:
assert d['CELL_LINE_ID'].nunique() == 60, 'Did not find data for exactly 60 cell lines'
d['CELL_LINE_ID'].nunique()

60

In [11]:
assert np.all(pd.notnull(d))
db.save(d, src.NCI60_v1, db.RAW, 'gene-mutations')

'/Users/eczech/data/research/mgds/raw/nci60_v1_gene-mutations.pkl'
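
The saved file has a `.pkl` extension. Assuming it is an ordinary pandas pickle (an assumption; the internals of `db.save` are not shown in this notebook), a round trip of a long-format frame can be sketched as follows. The temporary directory and the file name are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy long-format frame with the same columns as the saved data
long_df = pd.DataFrame({
    'CELL_LINE_ID': ['BR:MCF7', 'BR:HS578T'],
    'GENE_ID': ['A1BG', 'ZZEF1'],
    'VALUE': [0, 1],
})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this sketch
    path = os.path.join(tmp, 'example_gene-mutations.pkl')
    long_df.to_pickle(path)
    restored = pd.read_pickle(path)
```

Comparing `restored` with the original via `DataFrame.equals` is a quick way to confirm that values and dtypes survive the round trip.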