<a href="https://colab.research.google.com/github/cappelchi/Modeling-Precision-Medicine/blob/master/Modeling_Precision_Medicine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Associated Publication and Dataset
The course uses data presented by Daemen et al. (Modeling precision treatment of breast cancer, Genome Biol. 2013). The researchers tested 90 therapeutic compounds on 70 breast cancer cell lines (out of the 84 lines in their collection) and determined GI50, the concentration required to inhibit cell growth by 50%. GI50 can be treated as a measure of the efficacy of a given compound for a given breast cancer cell line. In addition to a table of GI50 values, Daemen et al. deposited RNA-Seq results (GSE4821) for 56 cell lines, including 52 for which GI50 data were also available. Finally, the authors specified an associated breast cancer subtype for each cell line. For this course, we will use data from the 52 cell lines that have both RNA-Seq and GI50 data.
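Restricting the analysis to cell lines that have both expression and GI50 data amounts to intersecting the two sets of cell line names. The sketch below illustrates this on toy tables with made-up values; it assumes the expression table has one column per cell line and the response table has one row per cell line. The names `expr`, `gi50` and `shared_lines` are hypothetical and are not used elsewhere in the notebook.

```python
import pandas as pd

# Toy stand-ins with made-up values (assumption: expression has one column
# per cell line, the GI50 table has one row per cell line).
expr = pd.DataFrame(
    {"184A1": [5.0, 3.1], "184B5": [4.8, 2.9], "ZR75B": [4.9, 3.3]},
    index=["GENE_A", "GENE_B"],
)
gi50 = pd.DataFrame(
    {"17-AAG": [7.2, 6.5], "5-FU": [4.1, 3.8]},
    index=["184A1", "ZR75B"],
)

# Cell lines present in both tables, kept in the expression table's column order.
shared_lines = [line for line in expr.columns if line in gi50.index]

# Subset both tables so that they describe the same cell lines in the same order.
expr_shared = expr[shared_lines]
gi50_shared = gi50.loc[shared_lines]

print(shared_lines)
```

Keeping both tables in the same cell line order makes it safe to pair an expression column with a response row by position later on.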

ABSTRACT:

Background

First-generation molecular profiles for human breast cancers have enabled the identification of features that can predict therapeutic response; however, little is known about how the various data types can best be combined to yield optimal predictors. Collections of breast cancer cell lines mirror many aspects of breast cancer molecular pathobiology, and measurements of their omic and biological therapeutic responses are well-suited for development of strategies to identify the most predictive molecular feature sets.

Results

We used least squares-support vector machines and random forest algorithms to identify molecular features associated with responses of a collection of 70 breast cancer cell lines to 90 experimental or approved therapeutic agents. The datasets analyzed included measurements of copy number aberrations, mutations, gene and isoform expression, promoter methylation and protein expression. Transcriptional subtype contributed strongly to response predictors for 25% of compounds, and adding other molecular data types improved prediction for 65%. No single molecular dataset consistently outperformed the others, suggesting that therapeutic response is mediated at multiple levels in the genome. Response predictors were developed and applied to TCGA data, and were found to be present in subsets of those patient samples.

Conclusions

These results suggest that matching patients to treatments based on transcriptional subtype will improve response rates, and inclusion of additional features from other profiling data types may provide additional benefit. Further, we suggest a systems biology strategy for guiding clinical trials so that patient cohorts most likely to respond to new therapies may be more efficiently identified.

FIGURE:

<img src="https://raw.githubusercontent.com/cappelchi/Modeling-Precision-Medicine/master/img/CellLines_Pub_Fig.jpg" width="600" height="400" />

Figure 1: Cell line-based response prediction strategy.

(A) We assembled a collection of 84 breast cancer cell lines composed of 35 luminal, 27 basal, 10 claudin-low, 7 normal-like, 2 matched normal and 3 of unknown subtype. Fourteen luminal and 7 basal cell lines were also ERBB2-amplified.

(B) Seventy lines were tested for response to 138 compounds by growth inhibition assays. Compounds with low variation in response in the cell line panel were eliminated, leaving a response data set of 90 compounds. Cell lines were divided into a sensitive and resistant group for each compound using the mean GI50 value for that compound.
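The two steps in panel (B), dropping compounds with little variation and splitting cell lines at the per-compound mean, can be sketched with pandas as follows. This is an illustration on made-up numbers, not the authors' code. The cut-off `MIN_STD` is a hypothetical value, and the sketch assumes responses are stored so that a larger value means a more sensitive line (for example a -log10 GI50 scale); with raw concentrations the comparison would be reversed.

```python
import pandas as pd

# Made-up response table: rows are cell lines, columns are compounds.
gi50 = pd.DataFrame(
    {
        "drug_A": [4.0, 6.0, 8.0, 6.0],
        "drug_B": [5.0, 5.0, 5.1, 5.0],  # nearly constant response
    },
    index=["line_1", "line_2", "line_3", "line_4"],
)

MIN_STD = 0.5  # hypothetical cut-off; the source does not state the value used

# Step 1: keep only compounds whose response varies enough across the panel.
variable = gi50.loc[:, gi50.std() > MIN_STD]

# Step 2: compare each value with the mean of its own compound (column).
# Assumption: larger value = more sensitive.
sensitive = variable.gt(variable.mean(), axis="columns")

print(sensitive)
```

Here `drug_B` is removed by the variation filter, and only `line_3` lies above the mean for `drug_A`.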

(C) Seven pretreatment molecular profiling data sets were analyzed to identify molecular features associated with response. Exome-seq data were available for 75 cell lines, followed by SNP6 data for 74 cell lines, RNAseq for 56, exon array for 56, RPPA for 49, methylation for 47, and U133A expression array data for 46 cell lines. All 70 lines were used in development of at least some predictors depending on data type availability.

(D) Classification signatures were developed using the molecular feature data (after filtering) and with response status as the target. Two methods, weighted least squares support vector machine (LS-SVM) and random forests (RF), were utilized. The best performing signature was chosen for each drug and data type combination. This allows prediction of response for additional cell lines or tumors with any given combination of input data types.

(E) Cell line-based response predictors were applied to 306 TCGA breast tumors for which expression (Exp), copy number (CNV) and methylation (Meth) measurements were all available.

(F) This identified 22 compounds with a model AUC >0.7 for which at least some patients were predicted to be responsive with a probability >0.65. Thresholds for considering a tumor responsive were objectively chosen for each compound from the distribution of predicted probabilities and each patient was assigned to a status of resistant, intermediate or sensitive. WPMV, weighted percent of model variables.
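To make the quantities in panel (F) concrete, the sketch below computes an AUC from predicted probabilities and assigns each sample one of the three statuses. It is a minimal illustration in plain Python, not the authors' procedure: in the paper the thresholds were chosen per compound from the distribution of predicted probabilities, whereas here `low` and `high` are fixed values. The upper value 0.65 echoes the probability mentioned above; the lower value 0.35 is purely an assumption.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (sensitive, resistant) pairs ranked correctly.

    Ties between a sensitive and a resistant sample count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def assign_status(prob, low=0.35, high=0.65):
    """Map a predicted response probability to one of three statuses.

    `low` and `high` are hypothetical fixed cut-offs used for illustration.
    """
    if prob > high:
        return "sensitive"
    if prob < low:
        return "resistant"
    return "intermediate"


# Made-up example: 1 = sensitive, 0 = resistant.
labels = [1, 1, 0, 1, 0, 0]
probs = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]

auc = roc_auc(labels, probs)
statuses = [assign_status(p) for p in probs]
print(round(auc, 3), statuses)
```

In this toy example 8 of the 9 sensitive/resistant pairs are ordered correctly, so the AUC is 8/9, which would pass a 0.7 cut-off.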

In [0]:
mypath = '/content/Modeling-Precision-Medicine/'

In [0]:
!wget https://raw.githubusercontent.com/cappelchi/Modeling-Precision-Medicine/master/CellLines_Student_ProjectData.zip
!unzip CellLines_Student_ProjectData.zip -d {mypath}

--2020-03-09 22:23:48--  https://raw.githubusercontent.com/cappelchi/Modeling-Precision-Medicine/master/CellLines_Student_ProjectData.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 839357 (820K) [application/zip]
Saving to: ‘CellLines_Student_ProjectData.zip’


2020-03-09 22:23:48 (16.3 MB/s) - ‘CellLines_Student_ProjectData.zip’ saved [839357/839357]

Archive:  CellLines_Student_ProjectData.zip
   creating: /content/Modeling-Precision-Medicine/CellLines_Student_ProjectData/
  inflating: /content/Modeling-Precision-Medicine/CellLines_Student_ProjectData/CellLines_ClinSubtypes.txt  
   creating: /content/Modeling-Precision-Medicine/__MACOSX/
   creating: /content/Modeling-Precision-Medicine/__MACOSX/CellLines_Student_ProjectData/
  inflating: /content/Modelin

In [0]:
import pandas as pd
from os import listdir
from os.path import isfile, join

In [0]:
filespath = '/content/Modeling-Precision-Medicine/CellLines_Student_ProjectData/'

In [0]:
results = [f for f in listdir(filespath) if isfile(join(filespath, f))]

In [0]:
results

['CellLines_ExprData.txt',
 'CellLines_DrugsInit.txt',
 'CellLines_DrugsRes.txt',
 'CellLines_ClinSubtypes.txt']
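The file names listed above share a `CellLines_` prefix and a `.txt` extension. As an alternative sketch to creating one variable per file, the tables can be collected in a dictionary keyed by the shortened name. The helper `load_tables` is hypothetical and is not part of the original notebook. The demo writes a tiny made-up file to a temporary directory so that it runs without the course data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_tables(folder, prefix="CellLines_"):
    """Read every .txt file in `folder` into a dict keyed by the shortened file name."""
    tables = {}
    for path in sorted(Path(folder).glob("*.txt")):
        name = path.stem  # file name without the '.txt' extension
        if name.startswith(prefix):
            name = name[len(prefix):]
        # Whitespace-separated values; keep the 'id' column as text.
        tables[name] = pd.read_csv(path, sep=r"\s+", dtype={"id": "str"})
    return tables


# Demo on a small made-up file (not the course data).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "CellLines_Demo.txt").write_text("id x y\na 1 2\nb 3 4\n")
    tables = load_tables(tmp)

print(tables["Demo"])
```

With the course data, `load_tables(filespath)` would be expected to return keys such as `ExprData` and `DrugsInit`; a dictionary avoids building and executing code strings.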

In [0]:
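for result in results:
    # Derive a variable name: drop the 'CellLines_' prefix (if present) and the '.txt' extension
    if result.startswith('CellLines_'):
        df_name = result[10:-4]
    else:
        df_name = result[:-4]
    # Build the read command as text so it can be printed and then executed;
    # this creates one global DataFrame per file (ExprData, DrugsInit, ...)
    command = rf'{df_name} = pd.read_csv("{filespath + result}", sep = "\s+", dtype={{"id":"str"}})'
    print(command)
    exec(command)
    # Print the table name followed by its summary statistics
    print(df_name)
    exec(f'print({df_name}.describe())')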

ExprData = pd.read_csv("/content/Modeling-Precision-Medicine/CellLines_Student_ProjectData/CellLines_ExprData.txt", sep = "\s+", dtype={"id":"str"})
ExprData
             184A1        184B5  ...       ZR7530        ZR75B
count  6916.000000  6916.000000  ...  6916.000000  6916.000000
mean      4.347667     4.182436  ...     3.995236     4.227894
std       3.167484     3.118536  ...     3.039926     3.179839
min      -3.000000    -3.000000  ...    -3.000000    -3.000000
25%       3.438750     3.269500  ...     3.006000     3.288750
50%       5.027500     4.766000  ...     4.627000     4.920500
75%       6.186250     5.931000  ...     5.820000     6.106000
max      13.977000    13.831000  ...    14.992000    14.537000

[8 rows x 52 columns]
DrugsInit = pd.read_csv("/content/Modeling-Precision-Medicine/CellLines_Student_ProjectData/CellLines_DrugsInit.txt", sep = "\s+", dtype={"id":"str"})
DrugsInit
          17-AAG      5-FU     5-FdUR  ...  Vinorelbine     XRP44X   ZM447439
count  51.000