In [1]:
import pandas as pd

<p style="text-align:justify">The LINCS-L1000 perturbation gene expression profiles used in our study are available on [GEO](https://www.ncbi.nlm.nih.gov/geo/) as [GSE92742](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742) (Phase I data, as described in [Subramanian A et al; Cell; 2017](https://www.ncbi.nlm.nih.gov/pubmed/29195078)) and [GSE70138](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE70138) (Phase II, ongoing, last updated March 30, 2017). You should manually download the **Level5** data (replicate-consensus signatures) and all metadata to *'../data/LINCS/GSE92742/'* and *'../data/LINCS/GSE70138/'*, respectively. Once this is done, the following cell opens the *GSE92742_Broad_LINCS_sig_info.txt* file, which contains most of the experiment-related metadata.
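Before loading anything, it can help to confirm that the manual download ended up in the expected folder. The snippet below is a minimal sketch: `missing_files` is a hypothetical helper (not part of the original notebook), and only the `sig_info` file name is taken from this notebook, so the list should be extended with whichever other files you downloaded.

```python
from pathlib import Path


def missing_files(directory, filenames):
    """Return the names in `filenames` that are not regular files in `directory`."""
    directory = Path(directory)
    return [name for name in filenames if not (directory / name).is_file()]


# Only this file name appears in the notebook; add your other Level5 and
# metadata file names to the list as needed.
missing = missing_files('../data/LINCS/GSE92742/',
                        ['GSE92742_Broad_LINCS_sig_info.txt'])
print('missing files:', missing)
```

An empty list means every listed file was found; any name that is printed still needs to be downloaded or moved.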

In [20]:
lincs=pd.read_table('../data/LINCS/GSE92742/GSE92742_Broad_LINCS_sig_info.txt',
                   sep='\t',header=0,index_col=None,low_memory=False)
print(lincs.head())

                                  sig_id        pert_id     pert_iname  \
0                    AML001_CD34_24H:A05           DMSO           DMSO   
1                    AML001_CD34_24H:A06           DMSO           DMSO   
2                    AML001_CD34_24H:B05           DMSO           DMSO   
3                    AML001_CD34_24H:B06           DMSO           DMSO   
4  AML001_CD34_24H:BRD-A03772856:0.37037  BRD-A03772856  BRD-A03772856   

     pert_type cell_id pert_dose pert_dose_unit pert_idose  pert_time  \
0  ctl_vehicle    CD34       0.1              %      0.1 %         24   
1  ctl_vehicle    CD34       0.1              %      0.1 %         24   
2  ctl_vehicle    CD34       0.1              %      0.1 %         24   
3  ctl_vehicle    CD34       0.1              %      0.1 %         24   
4       trt_cp    CD34   0.37037             µM     500 nM         24   

  pert_time_unit pert_itime                                          distil_id  
0              h       24 h        

Our model is built on cell viability values from the [CTRPv2]() and [Achilles]() screens. For CTRP you should download the **Cancer Therapeutics Response Portal (CTRP v2, 2015) Dataset** from the [CTD<sup>2</sup> Data Portal](https://ocg.cancer.gov/programs/ctd2/data-portal) to *'../data/CTRP/'*. For Achilles you should download **Achilles_v2.4.6.rnai.gct** and **achilles-v2-19-2-mapped-rnai_v1-data.gct** from [DepMap](https://depmap.org/portal/download/all/) to *'../data/Achilles/'*. Once this is done, the following code just opens two files to check that everything is fine.

In [21]:
ctrp=pd.read_table('../data/CTRP/v20.meta.per_compound.txt',sep='\t',
                   header=0,index_col=None)
print(ctrp.head())

   master_cpd_id             cpd_name   broad_cpd_id  top_test_conc_umol  \
0           1788                CIL55  BRD-K46556387                10.0   
1           3588              BRD4132  BRD-K86574132               160.0   
2          12877              BRD6340  BRD-K35716340                33.0   
3          17712                ML006  BRD-K89692698               530.0   
4          18311  Bax channel blocker  BRD-A18763547                33.0   

  cpd_status  inclusion_rationale gene_symbol_of_protein_target  \
0      probe            pilot-set                           NaN   
1      probe  chromatin;pilot-set                           NaN   
2      probe  chromatin;pilot-set                           NaN   
3      probe            pilot-set                         S1PR3   
4      probe            pilot-set                           BAX   

                      target_or_activity_of_compound          source_name  \
0                                      screening hit  Columbia 

In [22]:
achilles=pd.read_table('../data/Achilles/Achilles_v2.4.6.rnai.gct',sep='\t',
                       header=0,index_col=None,skiprows=2)
print(achilles.head())

                            Name Description  22RV1_PROSTATE  \
0  AAAAATGGCATCAACCACCAT_RPS6KA1     RPS6KA1        0.745271   
1    AAACACATTTGGGATGTTCCT_IGF1R       IGF1R        1.547808   
2     AAAGAAGAAGCTGCAATATCT_TSC1        TSC1        1.802590   
3    AAGCGTGCCGTAGACTGTCCA_CHEK1       CHEK1             NaN   
4    AATCTAAGAGAGCTGCCATCG_XRCC5       XRCC5       -2.809293   

   697_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE  786O_KIDNEY  \
0                               -0.041123    -0.242875   
1                                2.190488     1.813222   
2                                1.536374     1.671655   
3                                     NaN          NaN   
4                               -1.330285    -2.912347   

   A1207_CENTRAL_NERVOUS_SYSTEM  A172_CENTRAL_NERVOUS_SYSTEM  \
0                     -0.720484                     1.114986   
1                      1.055123                     1.711305   
2                      1.749983                     2.372121   
3         
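The `skiprows=2` argument in the Achilles cell above is presumably there because of the GCT layout: in a standard GCT 1.2 file the first line is a version tag and the second line gives the matrix dimensions, so the column header sits on the third line. The sketch below illustrates this on a made-up GCT string. The hairpin and cell line names are hypothetical, and it assumes the Achilles file follows the standard layout.

```python
import io

import pandas as pd

# Made-up GCT content: version line, dimensions line, header, two data rows.
toy_gct = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tCELL_A\tCELL_B\n"
    "SEQ1_GENE1\tGENE1\t0.5\t-1.0\n"
    "SEQ2_GENE2\tGENE2\t1.5\t\n"
)

# The second line holds the number of data rows and of sample columns.
n_rows, n_cols = (int(x) for x in toy_gct.splitlines()[1].split("\t"))

# Skipping the first two lines leaves the header as the first parsed line.
toy = pd.read_csv(io.StringIO(toy_gct), sep="\t", header=0, skiprows=2)

# The table has two annotation columns (Name, Description) plus the samples.
assert toy.shape == (n_rows, n_cols + 2)
print(toy)
```

Empty fields, like the last value of the second row here, are read as `NaN`.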

As you can see, both CTRP and LINCS-L1000 use Broad compound IDs for their small molecules, which makes it easy to match compounds between these datasets. However, Achilles and LINCS-L1000 use different IDs for their shRNAs, so we need some metadata to translate between them. This can be downloaded from the [Genetic Perturbation Platform](https://portals.broadinstitute.org/gpp/public/pool/index). We need **CP0001_reference_20150109.csv** and **CP0003_reference_20150109.csv**, so you should download them to *'../data/Achilles/'*. Opening the first of these files shows how we can later translate between the different IDs.

In [23]:
shrna=pd.read_table('../data/Achilles/CP0001_reference_20150109.csv',sep=',',
                    header=None,index_col=None)
print(shrna.head())

                       0               1                      2
0  AAAAATGGCATCAACCACCAT  TRCN0000010427  AAAAATGGCATCAACCACCAT
1  AAAAGGATAACCCAGGTGTTT  TRCN0000018361  AAAAGGATAACCCAGGTGTTT
2  AAAATGGCATCAACCACCATC  TRCN0000221515  AAAATGGCATCAACCACCATC
3  AAACACATTTGGGATGTTCCT  TRCN0000010361  AAACACATTTGGGATGTTCCT
4  AAACAGCTGCCTTAGCTTCAC  TRCN0000010475  GAAACAGCTGCCTTAGCTTCA
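
The two matching steps described above can be sketched on a few toy rows. Most values are copied from the printed outputs, but the LINCS `sig_id` values and the overlap between the two compound lists are invented for illustration. The sketch also assumes that column 0 of the reference file is the hairpin sequence used as the prefix of the Achilles `Name` column and that column 1 is the ID to translate to. This is not necessarily how the later notebooks do it.

```python
import pandas as pd

# Compounds: CTRP and LINCS-L1000 both carry Broad compound IDs.
ctrp_toy = pd.DataFrame({
    "cpd_name": ["CIL55", "BRD4132"],
    "broad_cpd_id": ["BRD-K46556387", "BRD-K86574132"],
})
# Hypothetical LINCS rows: sig_id values are invented, and the first pert_id
# is reused from CTRP only to produce a match.
lincs_toy = pd.DataFrame({
    "sig_id": ["TOY_SIG_1", "TOY_SIG_2"],
    "pert_id": ["BRD-K46556387", "BRD-A03772856"],
})
# An inner join keeps only signatures whose compound is also in CTRP.
shared = lincs_toy.merge(ctrp_toy, left_on="pert_id",
                         right_on="broad_cpd_id", how="inner")
print(shared[["sig_id", "pert_id", "cpd_name"]])

# shRNAs: translate Achilles hairpin names through the reference table.
achilles_toy = pd.DataFrame({
    "Name": ["AAAAATGGCATCAACCACCAT_RPS6KA1", "AAACACATTTGGGATGTTCCT_IGF1R"],
    "Description": ["RPS6KA1", "IGF1R"],
})
shrna_toy = pd.DataFrame([
    ["AAAAATGGCATCAACCACCAT", "TRCN0000010427", "AAAAATGGCATCAACCACCAT"],
    ["AAACACATTTGGGATGTTCCT", "TRCN0000010361", "AAACACATTTGGGATGTTCCT"],
])
# Assumption: column 0 is the sequence, column 1 the ID to translate to.
seq_to_id = shrna_toy.set_index(0)[1]
# The sequence is the part of the Achilles name before the first underscore.
achilles_toy["sequence"] = achilles_toy["Name"].str.split("_", n=1).str[0]
achilles_toy["trcn_id"] = achilles_toy["sequence"].map(seq_to_id)
print(achilles_toy[["Name", "trcn_id"]])
```

Hairpins without an entry in the reference table would get `NaN` in `trcn_id`, which makes unmatched rows easy to spot.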


Later we will need cell line gene expression data and drug sensitivity data from [GDSC](https://www.cancerrxgene.org/downloads). You should download **RMA normalised expression data for Cell lines** and **Annotated list of Cell lines** to *'../data/GDSC/'*.

In [2]:
expression=pd.read_table('../data/GDSC/sanger1018_brainarray_ensemblgene_rma.txt',sep='\t',header=0,index_col=None)

In [6]:
expression.head()
expression['ensembl_gene'].to_csv('genes.csv')

In [4]:
annotation=pd.read_excel('../data/GDSC/Cell_Lines_Details.xlsx')

In [5]:
annotation.head()

| | Sample Name | COSMIC identifier | Whole Exome Sequencing (WES) | Copy Number Alterations (CNA) | Gene Expression | Methylation | Drug Response | GDSC Tissue descriptor 1 | GDSC Tissue descriptor 2 | Cancer Type (matching TCGA label) | Microsatellite instability Status (MSI) | Screen Medium | Growth Properties |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A253 | 906794.0 | Y | Y | Y | Y | Y | aero_dig_tract | head and neck | | MSS/MSI-L | D/F12 | Adherent |
| 1 | BB30-HNC | 753531.0 | Y | Y | Y | Y | Y | aero_dig_tract | head and neck | HNSC | MSS/MSI-L | D/F12 | Adherent |
| 2 | BB49-HNC | 753532.0 | Y | Y | Y | Y | Y | aero_dig_tract | head and neck | HNSC | MSS/MSI-L | D/F12 | Adherent |
| 3 | BHY | 753535.0 | Y | Y | Y | Y | Y | aero_dig_tract | head and neck | HNSC | MSS/MSI-L | D/F12 | Adherent |
| 4 | BICR10 | 1290724.0 | Y | Y | Y | Y | Y | aero_dig_tract | head and neck | HNSC | MSS/MSI-L | D/F12 | Adherent |
