# Chapter 1: Set up data

Before running the later chapters, you need to download one dataset and make small modifications to some of the data. Follow the steps below.

## 1. Get gene dependencies dataset from the Achilles project website

Watch [this quick video](https://www.youtube.com/watch?v=wj0cJC9-XYw&feature=youtu.be) or follow these steps:

1. Go [here](https://portals.broadinstitute.org/achilles/users/sign_up) to make an account with Project Achilles.
2. Confirm your email address using the link in the confirmation email from Project Achilles.
3. Go [here](https://portals.broadinstitute.org/achilles/datasets/15/download) and click the "ExpandedGeneZSolsCleaned.csv" link to download the gene dependencies dataset.
4. Move the downloaded "ExpandedGeneZSolsCleaned.csv" file to the `onco-gps-paper-analysis/data` directory.
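
Step 4 can also be done from Python. The sketch below is only an illustration: `move_download` and its directory arguments are hypothetical names, not part of this notebook or of CCAL, and where your browser saves files is an assumption you need to fill in.

```python
import shutil
from pathlib import Path


def move_download(download_dir, data_dir,
                  filename="ExpandedGeneZSolsCleaned.csv"):
    """Move a downloaded file into the data directory and return its new path."""
    source = Path(download_dir) / filename
    if not source.is_file():
        raise FileNotFoundError("Expected download not found: {}".format(source))

    # Create the data directory if it does not exist yet
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / filename
    shutil.move(str(source), str(target))
    return target
```

For example, if your browser saves to `~/Downloads`, you could call `move_download(Path.home() / "Downloads", "onco-gps-paper-analysis/data")`.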


## 2. Prepare data

Run all the cells below.

### Set up notebook and import [CCAL](https://github.com/KwatME/ccal)

In [1]:
from notebook_environment import *


%load_ext autoreload
%autoreload 2
%matplotlib inline

### Unzip and prepare datasets
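
Part of the cell below renames the CTRP cell lines to CCLE identifiers. It first compares each name, ignoring case, with the part of the CCLE identifier before the first underscore, and falls back to a substring search when that finds nothing. The sketch below isolates that rule; `match_ccle_names` is a hypothetical helper and the short identifier list is for illustration only.

```python
def match_ccle_names(name, ccle_names):
    """Return the CCLE identifiers that match a CTRP cell-line name.

    Pass 1: case-insensitive comparison with the text before the first "_".
    Pass 2 (only if pass 1 finds nothing): case-insensitive substring search.
    """
    key = name.lower()
    matches = [c for c in ccle_names if c.lower().split("_")[0] == key]
    if not matches:
        matches = [c for c in ccle_names if key in c.lower()]
    return matches


# Illustrative identifiers in the "<NAME>_<TISSUE>" style
ccle_names = ["A549_LUNG", "NCIH460_LUNG", "HS578T_BREAST"]

print(match_ccle_names("a549", ccle_names))  # ['A549_LUNG']
print(match_ccle_names("578T", ccle_names))  # ['HS578T_BREAST']
```

The notebook cell renames a column only when exactly one identifier matches; with zero or several matches it keeps the original name.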

In [None]:
# Unzip data
ccal.unzip('../data/CTRPv2.2_2015_pub_CancerDisc_5_1210.zip')
ccal.unzip('../data/gene_set__gene_set_x_ccle_cellline.gct.zip')
ccal.unzip('../data/CCLE_MUT_EXPR_RPPA_OncoGPS.zip')
for fn in [
        'mutation__gene_x_ccle_cellline.gct',
        'rpkm__gene_x_ccle_cellline.gct',
        'rppa__protein_x_ccle_cellline.gct',
        'annotation__feature_x_ccle_cellline.gct',
]:
    shutil.move('../data/CCLE_datasets/{0}'.format(fn), '../data/{0}'.format(fn))

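# Convert the Achilles RNAi dataset from step 1 to .gct. Uncomment the two lines
# below if ../data/achilles__gene_x_ccle_cellline.gct has not been created yet.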
# df = pd.read_csv('../data/ExpandedGeneZSolsCleaned.csv', index_col=0)
# ccal.write_gct(df, '../data/achilles__gene_x_ccle_cellline.gct')

# Read compound data
auc = pd.read_table('../data/v22.data.auc_sensitivities.txt')
print(auc.shape)

cpd = pd.read_table('../data/v22.meta.per_compound.txt', index_col=0)
print(cpd.shape)

ccl = pd.read_table('../data/v22.meta.per_cell_line.txt', index_col=0)
print(ccl.shape)

# Make dict for faster ID-to-name look up
cpd_d = cpd['cpd_name'].to_dict()
ccl_d = ccl['ccl_name'].to_dict()

# Make empty compound-x-cellline matrix
compound_x_cellline = pd.DataFrame(
    index=sorted(set(cpd['cpd_name'])), columns=sorted(set(ccl['ccl_name'])))
print(compound_x_cellline.shape)

# Populate compound-x-cellline matrix
for i, (i_cpd, i_ccl, a) in auc.iterrows():

    # Get compound name
    cpd_n = cpd_d[i_cpd]

    # Get cellline name
    ccl_n = ccl_d[i_ccl]

    # Get current AUC
    a_ = compound_x_cellline.loc[cpd_n, ccl_n]

    # If the current AUC is not set, set with this AUC
    if pd.isnull(a_):
        compound_x_cellline.loc[cpd_n, ccl_n] = a

    # If this AUC is smaller than the current AUC, set with this AUC
    elif a < a_:

        print('Updating AUC of compound {} on cellline {}: {:.3f} ==> {:.3f}'.
              format(cpd_n, ccl_n, a_, a))

        compound_x_cellline.loc[cpd_n, ccl_n] = a

# Update cellline names to match CCLE cellline names
columns = list(compound_x_cellline.columns)

# Read CCLE cellline annotations
a = pd.read_table('../data/CCLE_sample_info_file_2012-10-18.txt', index_col=0)

# Get CCLE cellline names
for i, ccl_n in enumerate(compound_x_cellline.columns):

    matches = []

    for ccle_n in a.index:
        if ccl_n.lower() == ccle_n.lower().split('_')[0]:
            matches.append(ccle_n)

    if 0 == len(matches):
        print('0 match: {}; matching substring ...'.format(ccl_n))

        for ccle_n in a.index:

            if ccl_n.lower() in ccle_n.lower():

                print('\t{} ==> {}.'.format(ccl_n, ccle_n))
                matches.append(ccle_n)

    if 1 == len(matches):

        print('{} ==> {}.'.format(ccl_n, matches[0]))
        columns[i] = matches[0]

    else:
        # Zero or multiple matches: keep the original name
        print('{} matches; keeping name: {} ==> {}'.format(
            len(matches), ccl_n, matches))

# Update with CCLE cellline names
compound_x_cellline.columns = columns

# Write .gct file
ccal.write_gct(compound_x_cellline,
               '../data/ctd2__compound_x_ccle_cellline.gct')

compound_x_cellline

(260496, 3)
(481, 9)
(664, 8)
(481, 645)
Updating AUC of compound ML311 on cellline CCFSTTG1: 15.000 ==> 14.352
Updating AUC of compound ML311 on cellline NCIH460: 13.004 ==> 12.245
Updating AUC of compound ML311 on cellline IGROV1: 13.147 ==> 12.092
Updating AUC of compound ML311 on cellline OAW28: 13.205 ==> 11.837
Updating AUC of compound ML311 on cellline ASPC1: 13.483 ==> 13.200
Updating AUC of compound ML311 on cellline CAL51: 12.028 ==> 11.655
Updating AUC of compound ML311 on cellline NCIH1869: 14.149 ==> 13.606
Updating AUC of compound ML311 on cellline LOXIMVI: 13.386 ==> 12.332
Updating AUC of compound ML311 on cellline MKN74: 13.969 ==> 12.234
Updating AUC of compound zebularine on cellline CCFSTTG1: 14.999 ==> 14.634
Updating AUC of compound zebularine on cellline A549: 14.507 ==> 13.312
Updating AUC of compound zebularine on cellline NCIH460: 14.581 ==> 12.742
Updating AUC of compound zebularine on cellline ASPC1: 15.699 ==> 13.944
...
Updating AUC of compound paclitaxel on cellline CAL51: 9.707 ==> 5.014
Updating AUC of compound paclitaxel on cellline A375: 11.572 ==> 7.637
Updating AUC of compound paclitaxel on cellline LOXIMVI: 14.997 ==> 7.762
Updating AUC of compound paclitaxel on cellline MKN74: 9.152 ==> 6.965
Updating AUC of compound hyperforin on cellline CCFSTTG1: 15.447 ==> 14.839
Updating AUC of compound hyperforin on cellline NCIH1299: 15.580 ==> 14.640
Updating AUC of compound hyperforin on cellline A375: 18.677 ==> 14.182
Updating AUC of compound hyperforin on cellline NCIH1869: 14.920 ==> 14.371
Updating AUC of compound hyperforin on cellline MKN74: 14.038 ==> 12.211
Updating AUC of compound brefeldin A on cellline CCFSTTG1: 14.679 ==> 12.948
Updating AUC of compound brefeldin A on cellline OAW28: 9.371 ==> 7.765
Updating AUC of compound brefeldin A on cellline SUIT2: 9.336 ==> 9.333
Updating AUC of compound brefeldin A on cellline CAL51: 7.300 ==> 5.649
...
Updating AUC of compound fluvastatin on cellline A375: 14.413 ==> 13.805
Updating AUC of compound fluvastatin on cellline KE39: 13.867 ==> 13.123
Updating AUC of compound PX-12 on cellline CCFSTTG1: 15.116 ==> 14.770
Updating AUC of compound PX-12 on cellline A549: 14.621 ==> 14.587
Updating AUC of compound PX-12 on cellline NCIH520: 14.270 ==> 13.087
Updating AUC of compound PX-12 on cellline IGROV1: 14.234 ==> 13.212
Updating AUC of compound PX-12 on cellline OAW28: 14.310 ==> 12.950
Updating AUC of compound PX-12 on cellline LOXIMVI: 14.278 ==> 13.498
Updating AUC of compound PX-12 on cellline KE39: 14.126 ==> 13.446
Updating AUC of compound PX-12 on cellline MKN74: 14.744 ==> 14.205
Updating AUC of compound PD318088 on cellline CCFSTTG1: 14.840 ==> 14.451
Updating AUC of compound PD318088 on cellline A549: 11.127 ==> 9.559
Updating AUC of compound PD318088 on cellline NCIH460: 14.742 ==> 14.265
Updating AUC of compound PD318088 on cellline IGROV1: 11.728 ==> 9.794
...
Updating AUC of compound ML050 on cellline IGROV1: 16.151 ==> 14.990
Updating AUC of compound ML050 on cellline ASPC1: 15.009 ==> 14.631
Updating AUC of compound ML050 on cellline SUIT2: 15.174 ==> 14.821
Updating AUC of compound ML050 on cellline A375: 14.681 ==> 13.675
Updating AUC of compound fulvestrant on cellline SUIT2: 14.637 ==> 14.606
Updating AUC of compound BRD-A86708339 on cellline CCFSTTG1: 15.157 ==> 13.818
Updating AUC of compound BRD-A86708339 on cellline NCIH1869: 11.099 ==> 9.681
Updating AUC of compound BRD-A86708339 on cellline SKUT1: 10.673 ==> 8.740
Updating AUC of compound CID-5951923 on cellline NCIH1299: 15.854 ==> 15.721
Updating AUC of compound CID-5951923 on cellline NCIH460: 14.711 ==> 14.060
Updating AUC of compound CID-5951923 on cellline ASPC1: 14.094 ==> 13.453
Updating AUC of compound CID-5951923 on cellline CAL51: 16.163 ==> 15.276
Updating AUC of compound CID-5951923 on cellline NCIH1869: 14.176 ==> 13.611
...
Updating AUC of compound BRD-K02492147 on cellline IGROV1: 14.794 ==> 14.248
Updating AUC of compound BRD-K02492147 on cellline CAL51: 15.714 ==> 15.211
Updating AUC of compound BRD-K02492147 on cellline NCIH1869: 14.582 ==> 14.040
Updating AUC of compound BRD-K02492147 on cellline LOXIMVI: 14.593 ==> 14.292
Updating AUC of compound BRD-K02492147 on cellline DU145: 15.752 ==> 15.000
Updating AUC of compound QS-11 on cellline AGS: 16.464 ==> 14.627
Updating AUC of compound AT-406 on cellline CCFSTTG1: 14.762 ==> 14.484
Updating AUC of compound AT-406 on cellline A549: 14.299 ==> 13.299
Updating AUC of compound AT-406 on cellline NCIH520: 14.112 ==> 11.507
Updating AUC of compound AT-406 on cellline IGROV1: 16.035 ==> 14.843
Updating AUC of compound AT-406 on cellline LOXIMVI: 15.256 ==> 11.978
Updating AUC of compound SU11274 on cellline A549: 14.070 ==> 13.611
Updating AUC of compound SU11274 on cellline IGROV1: 14.074 ==> 12.604
...
Updating AUC of compound PDMP on cellline IGROV1: 13.329 ==> 13.286
Updating AUC of compound PDMP on cellline CAL51: 14.299 ==> 13.751
Updating AUC of compound PDMP on cellline MKN74: 13.518 ==> 13.278
Updating AUC of compound LE-135 on cellline CCFSTTG1: 14.876 ==> 13.952
Updating AUC of compound LE-135 on cellline A549: 14.107 ==> 13.828
Updating AUC of compound LE-135 on cellline NCIH460: 15.263 ==> 14.868
Updating AUC of compound LE-135 on cellline IGROV1: 13.428 ==> 13.398
Updating AUC of compound LE-135 on cellline ASPC1: 14.788 ==> 14.587
Updating AUC of compound LE-135 on cellline LOXIMVI: 12.985 ==> 12.246
Updating AUC of compound GSK1059615 on cellline IGROV1: 10.670 ==> 9.815
Updating AUC of compound GSK1059615 on cellline OAW28: 11.180 ==> 10.260
Updating AUC of compound GSK1059615 on cellline NCIH1869: 11.295 ==> 10.489
Updating AUC of compound narciclasine on cellline CCFSTTG1: 11.376 ==> 10.191
Updating AUC of compound narciclasine on cellline NCIH460: 9.691 ==> 9.513
...
Updating AUC of compound GW-405833 on cellline KE39: 12.844 ==> 12.409
Updating AUC of compound GW-405833 on cellline MKN74: 13.417 ==> 13.008
Updating AUC of compound B02 on cellline CCFSTTG1: 14.596 ==> 14.118
Updating AUC of compound B02 on cellline NCIH460: 14.838 ==> 14.379
Updating AUC of compound B02 on cellline IGROV1: 13.643 ==> 13.183
Updating AUC of compound B02 on cellline OAW28: 13.339 ==> 13.233
Updating AUC of compound B02 on cellline ASPC1: 13.534 ==> 12.003
Updating AUC of compound B02 on cellline CAL51: 13.598 ==> 13.332
Updating AUC of compound B02 on cellline A375: 14.016 ==> 13.921
Updating AUC of compound B02 on cellline NCIH1869: 13.219 ==> 11.947
Updating AUC of compound B02 on cellline LOXIMVI: 13.907 ==> 13.592
Updating AUC of compound B02 on cellline MKN74: 14.283 ==> 13.368
Updating AUC of compound SR-II-138A on cellline CCFSTTG1: 10.609 ==> 9.448
Updating AUC of compound SR-II-138A on cellline NCIH460: 11.374 ==> 10.987
...
Updating AUC of compound BRD-K13999467 on cellline SKUT1: 14.614 ==> 14.576
Updating AUC of compound BRD-K14844214 on cellline A549: 14.932 ==> 14.802
Updating AUC of compound BRD-K14844214 on cellline ASPC1: 14.199 ==> 14.005
Updating AUC of compound BRD-K14844214 on cellline CAL51: 14.917 ==> 14.840
Updating AUC of compound BRD-K14844214 on cellline LOXIMVI: 14.195 ==> 13.279
Updating AUC of compound BRD-K14844214 on cellline DU145: 14.820 ==> 14.459
Updating AUC of compound R428 on cellline CCFSTTG1: 15.029 ==> 12.009
Updating AUC of compound R428 on cellline A549: 12.433 ==> 11.936
Updating AUC of compound R428 on cellline IGROV1: 12.611 ==> 11.868
Updating AUC of compound R428 on cellline OAW28: 12.185 ==> 12.039
Updating AUC of compound R428 on cellline SKUT1: 11.473 ==> 11.255
Updating AUC of compound gemcitabine on cellline CCFSTTG1: 13.918 ==> 13.509
Updating AUC of compound gemcitabine on cellline CAL51: 7.693 ==> 6.288
...
Updating AUC of compound indisulam on cellline NCIH460: 16.377 ==> 14.654
Updating AUC of compound indisulam on cellline NCIH520: 13.786 ==> 13.749
Updating AUC of compound indisulam on cellline IGROV1: 10.997 ==> 10.187
Updating AUC of compound indisulam on cellline OAW28: 14.592 ==> 13.949
Updating AUC of compound indisulam on cellline SUIT2: 16.013 ==> 13.994
Updating AUC of compound indisulam on cellline A375: 13.747 ==> 13.741
Updating AUC of compound indisulam on cellline NCIH1869: 12.906 ==> 12.415
Updating AUC of compound indisulam on cellline DU145: 13.544 ==> 13.518
Updating AUC of compound indisulam on cellline MKN74: 14.031 ==> 12.660
Updating AUC of compound belinostat on cellline CCFSTTG1: 12.675 ==> 12.488
Updating AUC of compound birinapant on cellline CCFSTTG1: 15.741 ==> 15.000
Updating AUC of compound birinapant on cellline A549: 14.967 ==> 14.498
Updating AUC of compound birinapant on cellline NCIH520: 15.570 ==> 15.030
...
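
The log above comes from a row-by-row loop that keeps the smallest AUC for each compound and cell-line pair. The same reduction can be sketched with a pandas group-by. The column names `cpd_id`, `ccl_id`, and `auc` and the tiny tables here are made up for illustration; they are not the headers of the CTRP files.

```python
import pandas as pd

# Toy stand-ins for the AUC table and the ID-to-name dictionaries
auc = pd.DataFrame({
    "cpd_id": [1, 1, 1, 2],
    "ccl_id": [10, 10, 11, 10],
    "auc": [15.0, 14.352, 13.0, 9.5],
})
cpd_d = {1: "ML311", 2: "paclitaxel"}
ccl_d = {10: "CCFSTTG1", 11: "NCIH460"}

# Translate IDs to names
named = auc.assign(
    compound=auc["cpd_id"].map(cpd_d),
    cellline=auc["ccl_id"].map(ccl_d),
)

# Keep the minimum AUC per (compound, cellline) pair, then pivot to a matrix
compound_x_cellline = (
    named.groupby(["compound", "cellline"])["auc"].min().unstack()
)
print(compound_x_cellline)
```

Unlike the loop, this produces rows and columns only for names that appear in the AUC table. Calling `reindex` with the full sorted name lists would restore the empty rows and columns if they are needed.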

### Check that all datasets exist

If the cell below raises an error, obtain the dataset named in the error message and run the cell again.

In [None]:
for fn in [
        'gene_x_kras_isogenic_and_imortalized_celllines.gct',
        'mutation__gene_x_ccle_cellline.gct',
        'rpkm__gene_x_ccle_cellline.gct',
        'gene_set__gene_set_x_ccle_cellline.gct',
        'regulator__gene_set_x_ccle_cellline.gct',
        'rppa__protein_x_ccle_cellline.gct',
        'achilles__gene_x_ccle_cellline.gct',
        'ctd2__compound_x_ccle_cellline.gct',
        'annotation__feature_x_ccle_cellline.gct',
]:
    assert fn in os.listdir('../data'), 'Missing {}!'.format(fn)
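
The assertion above stops at the first missing file. If you would rather see every missing dataset at once, a variant along these lines may help. It is a sketch: `find_missing` is a hypothetical helper, and the demo uses a temporary directory rather than `../data`.

```python
import os
import tempfile


def find_missing(filenames, directory):
    """Return the names in `filenames` that are not present in `directory`."""
    present = set(os.listdir(directory))
    return [fn for fn in filenames if fn not in present]


# Demo: one expected file exists, the other does not
with tempfile.TemporaryDirectory() as data_dir:
    open(os.path.join(data_dir, "rpkm__gene_x_ccle_cellline.gct"), "w").close()
    missing = find_missing(
        [
            "rpkm__gene_x_ccle_cellline.gct",
            "achilles__gene_x_ccle_cellline.gct",
        ],
        data_dir,
    )

print(missing)  # ['achilles__gene_x_ccle_cellline.gct']
```

In the notebook you would pass the full list of file names and `'../data'`, then assert that the returned list is empty.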

### Make the CCLE data object

In [None]:
# Make the CCLE data object used in coming chapters.

ccle = {
    'Mutation': {
        'df': ccal.read_gct('../data/mutation__gene_x_ccle_cellline.gct'),
        'emphasis': 'high',
        'data_type': 'binary'
    },
    'Gene Expression': {
        'df': ccal.read_gct('../data/rpkm__gene_x_ccle_cellline.gct'),
        'emphasis': 'high',
        'data_type': 'continuous'
    },
    'Gene Set': {
        'df': ccal.read_gct('../data/gene_set__gene_set_x_ccle_cellline.gct'),
        'emphasis': 'high',
        'data_type': 'continuous'
    },
    'Regulator Gene Set': {
        'df': ccal.read_gct('../data/regulator__gene_set_x_ccle_cellline.gct'),
        'emphasis': 'high',
        'data_type': 'continuous'
    },
    'Protein Expression': {
        'df': ccal.read_gct('../data/rppa__protein_x_ccle_cellline.gct'),
        'emphasis': 'high',
        'data_type': 'continuous'
    },
    'Gene Dependency (Achilles)': {
        'df': ccal.read_gct('../data/achilles__gene_x_ccle_cellline.gct'),
        'emphasis': 'low',
        'data_type': 'continuous'
    },
    'Drug Sensitivity (CTD^2)': {
        'df': ccal.read_gct('../data/ctd2__compound_x_ccle_cellline.gct'),
        'emphasis': 'low',
        'data_type': 'continuous'
    },
    'Primary Site': {
        'df': ccal.make_membership_df_from_categorical_series(
            ccal.read_gct('../data/annotation__feature_x_ccle_cellline.gct')
            .loc['Site Primary']),
        'emphasis': 'high',
        'data_type': 'binary'
    }
}

with gzip.open('../data/ccle.pickle.gz', 'wb') as f:

    pickle.dump(ccle, f)
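
To read the object back, the write can be mirrored with `gzip.open` in read mode and `pickle.load`. The sketch below uses a toy dictionary and a temporary file so it runs on its own. `load_pickle_gz` is a hypothetical helper name, and that later chapters read `../data/ccle.pickle.gz` this way is an assumption.

```python
import gzip
import os
import pickle
import tempfile


def load_pickle_gz(path):
    """Load a Python object from a gzip-compressed pickle file."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)


# Round trip with a toy object shaped like one entry of the CCLE dictionary
toy = {"Mutation": {"emphasis": "high", "data_type": "binary"}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pickle.gz")
    with gzip.open(path, "wb") as f:
        pickle.dump(toy, f)
    restored = load_pickle_gz(path)

print(restored == toy)  # True
```

Only unpickle files you created yourself or trust, since loading a pickle can execute arbitrary code.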

### [Next chapter (2)](2 Generate oncogenic-activation signature.ipynb)