# NCI60 Exome Sequencing Data Importation
**Local Version**: 2
**Source Version**: NA

This notebook imports raw NCI60 exome sequencing data through the [CGDS](http://www.cbioportal.org/cgds_r.jsp) (Cancer Genomics Data Server) portal.

In [5]:
%run -m ipy_startup
%run -m ipy_logging
from mgds.data_aggregation import database as db
from mgds.data_aggregation import source as src
from mgds.data_aggregation import api
from mgds.data_aggregation.import_lib import cgds
from mgds.data_aggregation.import_lib import nci60
pd.set_option('display.max_info_rows', 25000000)

In [8]:
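# Case list and genetic profile that identify the NCI60 mutation data in CGDS
case_list_id = nci60.CASE_LIST_ID
genetic_profile_id = nci60.PROF_MUTATION
# Number of gene IDs requested per CGDS call
batch_size = 50

# Deferred download; with overwrite=False a cached result is presumably reused
op = lambda: cgds.get_mutation_data(
    case_list_id, genetic_profile_id,
    api.get_hugo_gene_ids(), gene_id_batch_size=batch_size
)
d = db.cache_raw_operation(op, src.NCI60_v2, 'gene-exome-seq', overwrite=False)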

2016-11-19 20:04:18,238:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 1 of 789
2016-11-19 20:05:35,853:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 79 of 789
2016-11-19 20:06:47,541:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 157 of 789
2016-11-19 20:08:07,588:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 235 of 789
2016-11-19 20:09:22,516:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 313 of 789
2016-11-19 20:10:31,484:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 391 of 789
2016-11-19 20:11:45,746:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 469 of 789
2016-11-19 20:12:44,772:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 547 of 789
2016-11-19 20:15:25,405:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 625 of 789
2016-11-19 20:16:47,429:INFO:mgds.data_aggregation.import_lib.cgds: Processing batch 703 of 789
2016-11-19 20:18:13,467:INFO:mgds.data_aggr
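
The log above shows the gene list being processed in 789 batches with `gene_id_batch_size=50`. As a rough illustration of that batching pattern (not the actual `cgds.get_mutation_data` implementation, which is not shown in this notebook), a list of IDs can be split into fixed-size chunks as sketched below. The helper `iter_batches` and the sample gene IDs are hypothetical names introduced only for this example.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError('batch_size must be positive')
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical gene identifiers; the real list comes from api.get_hugo_gene_ids()
gene_ids = ['GENE%d' % i for i in range(120)]
batches = list(iter_batches(gene_ids, 50))
batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # [50, 50, 20]
```

Requesting genes in small batches keeps each request short, at the cost of many round trips, which is consistent with the run time of roughly fifteen minutes visible in the timestamps above.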

In [9]:
d.head()

Unnamed: 0,entrez_gene_id,gene_symbol,case_id,sequencing_center,mutation_status,mutation_type,validation_status,amino_acid_change,functional_impact_score,xvar_link,...,chr,start_position,end_position,reference_allele,variant_allele,reference_read_count_tumor,variant_read_count_tumor,reference_read_count_normal,variant_read_count_normal,genetic_profile_id
0,1.0,A1BG,HCT_15,discover.nci.nih.gov,,Splice_Region,,MUTATED,,,...,19.0,58858398.0,58858398.0,A,G,,,,,cellline_nci60_mutations
1,1.0,A1BG,HCC_2998,discover.nci.nih.gov,,Missense_Mutation,,T46M,M,"getma.org/?cm=var&var=hg19,19,58864497,G,A&fts...",...,19.0,58864497.0,58864497.0,G,A,,,,,cellline_nci60_mutations
2,1.0,A1BG,KM12,discover.nci.nih.gov,,Missense_Mutation,,T257N,M,"getma.org/?cm=var&var=hg19,19,58862897,G,T&fts...",...,19.0,58862897.0,58862897.0,G,T,,,,,cellline_nci60_mutations
3,29974.0,A1CF,MOLT_4,discover.nci.nih.gov,,Missense_Mutation,,G320V,L,"getma.org/?cm=var&var=hg19,10,52575948,C,A&fts...",...,10.0,52575948.0,52575948.0,C,A,,,,,cellline_nci60_mutations
4,29974.0,A1CF,DU_145,discover.nci.nih.gov,,Missense_Mutation,,N275D,N,"getma.org/?cm=var&var=hg19,10,52580356,T,C&fts...",...,10.0,52580356.0,52580356.0,T,C,,,,,cellline_nci60_mutations


In [10]:
d.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 34073 entries, 0 to 14
Data columns (total 22 columns):
entrez_gene_id                 34073 non-null float64
gene_symbol                    34073 non-null object
case_id                        34073 non-null object
sequencing_center              34052 non-null object
mutation_status                0 non-null object
mutation_type                  34073 non-null object
validation_status              0 non-null object
amino_acid_change              34073 non-null object
functional_impact_score        24795 non-null object
xvar_link                      26561 non-null object
xvar_link_pdb                  9710 non-null object
xvar_link_msa                  24830 non-null object
chr                            34073 non-null float64
start_position                 34073 non-null float64
end_position                   34073 non-null float64
reference_allele               34073 non-null object
variant_allele                 34073 non-null objec

In [17]:
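# Columns to drop: the library's default ignorable set plus the read-count
# fields, which are empty in the rows shown by d.head() above
c_rm = cgds.DEFAULT_IGNORABLE_MUTATION_COLS + [
    'reference_read_count_normal',
    'variant_read_count_normal',
    'reference_read_count_tumor',
    'variant_read_count_tumor'
]
d_exp = cgds.prep_mutation_data(d, c_rm)
# Label missing values explicitly so that the exported frame contains no nulls
d_exp['SEQUENCING_CENTER'] = d_exp['SEQUENCING_CENTER'].fillna('Unknown')
d_exp['FUNCTIONAL_IMPACT_SCORE'] = d_exp['FUNCTIONAL_IMPACT_SCORE'].fillna('Unknown')
d_exp.info()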

[Remove duplicate records] Records before = 34073, Records after = 33995, Records removed = 78 (%0.23)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 33995 entries, 0 to 14
Data columns (total 13 columns):
GENE_ID:ENTREZ             33995 non-null int64
GENE_ID:HGNC               33995 non-null object
CELL_LINE_ID               33995 non-null object
SEQUENCING_CENTER          33995 non-null object
MUTATION_TYPE              33995 non-null object
AMINO_ACID_CHANGE          33995 non-null object
FUNCTIONAL_IMPACT_SCORE    33995 non-null object
CHR                        33995 non-null float64
START_POSITION             33995 non-null float64
END_POSITION               33995 non-null float64
REFERENCE_ALLELE           33995 non-null object
VARIANT_ALLELE             33995 non-null object
GENETIC_PROFILE_ID         33995 non-null object
dtypes: float64(3), int64(1), object(9)
memory usage: 3.6+ MB
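
The `[Remove duplicate records]` line above reports 78 of 34073 records removed (about 0.23%). The internals of `cgds.prep_mutation_data` are not shown here, so the following is only a minimal sketch of how such a step could be written with pandas, assuming that "duplicate" means rows identical in every column. The function `drop_duplicates_with_report` and the toy frame are hypothetical.

```python
import pandas as pd

def drop_duplicates_with_report(df):
    """Drop fully duplicated rows and return the result with a summary string."""
    n_before = len(df)
    out = df.drop_duplicates()
    n_removed = n_before - len(out)
    # Guard against division by zero on an empty frame
    pct = 100.0 * n_removed / n_before if n_before else 0.0
    report = 'Records before = {}, Records after = {}, Records removed = {} ({:.2f}%)'.format(
        n_before, len(out), n_removed, pct)
    return out, report

# Made-up records for illustration only (not taken from the NCI60 data)
toy = pd.DataFrame({
    'CELL_LINE_ID': ['A', 'A', 'B', 'B'],
    'MUTATION_TYPE': ['x', 'x', 'y', 'z'],
})
deduped, report = drop_duplicates_with_report(toy)
print(report)
```

Logging the before and after counts, as the notebook output does, makes it easy to notice when an upstream change suddenly produces far more duplicates than expected.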


In [19]:
# Note: records are not necessarily unique per cell line, gene, and amino acid change;
# the counts below show how many records share each combination
c_unique = ['CELL_LINE_ID', 'GENE_ID:HGNC', 'AMINO_ACID_CHANGE']
cts = d_exp.groupby(c_unique).size()
cts.value_counts()

1    33766
2      113
3        1
dtype: int64
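
The counts above show 113 key combinations with two records and one with three. One way to pull out such rows for inspection is sketched below on a small made-up frame; the data and the names `toy` and `dupes` are illustrative assumptions, and the same pattern should carry over to `d_exp` with `c_unique`.

```python
import pandas as pd

c_unique = ['CELL_LINE_ID', 'GENE_ID:HGNC', 'AMINO_ACID_CHANGE']

# Made-up records for illustration only (not taken from the NCI60 data)
toy = pd.DataFrame({
    'CELL_LINE_ID': ['A', 'A', 'B', 'C'],
    'GENE_ID:HGNC': ['G1', 'G1', 'G1', 'G2'],
    'AMINO_ACID_CHANGE': ['T46M', 'T46M', 'T46M', 'N275D'],
    'START_POSITION': [100.0, 101.0, 100.0, 200.0],
})

# keep=False marks every member of a repeated key, not just the later copies
dupes = toy[toy.duplicated(subset=c_unique, keep=False)]
print(dupes.sort_values(c_unique))
```

In this toy example the two rows for cell line `A` share a key but differ in `START_POSITION`, which is the kind of difference worth checking for in the real data before treating the key as an identifier.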

In [20]:
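# Sanity check: no null values should remain after the fillna calls above
assert np.all(pd.notnull(d_exp))
# Persist the prepared frame to the import stage of the local database
db.save(d_exp, src.NCI60_v2, db.IMPORT, 'gene-exome-seq')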

'/Users/eczech/data/research/mgds/import/nci60_v2_gene-exome-seq.pkl'