In [5]:
%load_ext rpy2.ipython
import rpy2
import warnings
def rmagic_warning(
    message,
    category = rpy2.rinterface.RRuntimeWarning,
    filename = '',
    lineno = -1,
    file=None, 
    line=None):
    print(message)
default_showwarning = warnings.showwarning

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython


In [12]:
import os
import sys  # used below for sys.path.append

# Process Affy data for PDXs and 7 primaries
These are the expression data used throughout the paper.

## Data provenance
On 3/28/17, Sebastian Brabetz (s.brabetz@dkfz-heidelberg.de) sent me a zip archive via WeTransfer containing the tumor CEL files, CEL files for 7 primaries (originally 8, but Med1612 was excluded because of contamination), and the most recent annotation Excel file, 170328_PDXScreen_Anno_v5.xlsx.

Most, if not all, of these files had been sent previously over several occasions, and the annotation changed a couple of times as it was decided that, e.g., one sample of a tumor was better than another (fewer passages, etc.). So I asked Sebastian to resend everything, to start with a clean slate and make sure we were all using the same data.

## Process CEL files to probewise abundance using hspeccdf

[Isella et al. 2015](https://www.ncbi.nlm.nih.gov/pubmed/25706627) showed that removing probes that cross-hybridize to mouse transcripts helps reduce the influence of infiltrating mouse stroma on gene expression measurements in PDXs. This is done using a custom chip description file (CDF) rather than the default hgu133plus2 CDF.

In [6]:
warnings.showwarning = rmagic_warning # to only print the warning text, not text + returned warning object
from rpy2.robjects import numpy2ri
numpy2ri.activate()

In [7]:
%%R
pkg.is.installed <- function(mypkg)
{
    return(mypkg %in% rownames(installed.packages()))
}

Note that some of the installation and configuration of these R packages may have to be done in the terminal. If the following cell does not work, try installing the packages manually and re-running the notebook.

'/pdx-hts/Notebooks'

In [13]:
BASE_DIR = os.getcwd()
sys.path.append(BASE_DIR)
DATA_DIR = os.path.join(BASE_DIR, 'data')
RAW_DATA_DIR = os.path.join(DATA_DIR, 'raw')
RAW_EXP_DIR = os.path.join(RAW_DATA_DIR, 'exp')
pdx_affy_cel_dir = os.path.join(RAW_EXP_DIR, 'sebastian_3.28.17_cels/')

NameError: name '__file__' is not defined

In [9]:
pdx_affy_probe_exp_file

NameError: name 'pdx_affy_probe_exp_file' is not defined

In [4]:
%%R -i pdx_affy_cel_dir,pdx_affy_probe_exp_file
source("http://bioconductor.org/biocLite.R")
if(!pkg.is.installed('affy'))
{
    biocLite('affy')
}
if(!pkg.is.installed('hspeccdf'))
{
    biocLite('hspeccdf')
}
if(!pkg.is.installed('annotate'))
{
    biocLite('annotate')
}
library('affy')
library('annotate')
library('hspeccdf')
print('Reading in cels')
data <- ReadAffy(celfile.path=pdx_affy_cel_dir, cdfname='hspeccdf')
print('Performing RMA')
result <- affy::rma(data)
write.exprs(result, pdx_affy_probe_exp_file)

Bioconductor version 3.6 (BiocInstaller 1.28.0), ?biocLite for help

Loading required package: BiocGenerics

Loading required package: parallel


Attaching package: ‘BiocGenerics’


The following objects are masked from ‘package:parallel’:

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB


The following objects are masked from ‘package:stats’:

    IQR, mad, sd, var, xtabs


The following objects are masked from ‘package:base’:

    anyDuplicated, append, as.data.frame, cbind, colMeans, colnames,
    colSums, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, lengths, Map, mapply, match,
    mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
    table, tapply, union, unique, unsplit, which, which.max, which.min


Loading req

[1] "Reading in cels"
[1] "Performing RMA"
Background correcting
Normalizing
Calculating Expression


In [5]:
numpy2ri.deactivate()
# Return to default warning handling.
warnings.showwarning = default_showwarning  

## Convert to genewise abundance

In [2]:
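import re

def fix_dkfz_tumor_name(name):
    # Normalize DKFZ sample names: drop the 'DKFZ-' prefix, upper-case,
    # then remove 'FH', hyphens, and a trailing 'MB'.
    name = name.replace('DKFZ-', '')
    name = name.upper()
    name = name.replace("FH", '')
    name = name.replace('-', '')
    name = re.sub('MB$', '', name)
    return name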

In [3]:
# Map each PDX CEL filename to its normalized tumor name.
pdx_annot_df = pd.read_excel(PDX_AFFY_ANNOT_FILE, index_col=0).astype(str)
cel_to_tumor = dict(zip(['{}.cel'.format(cel) for cel in pdx_annot_df.CEL_file],
                        [fix_dkfz_tumor_name(name) for name in pdx_annot_df.index]))

# Rows with CEL_primary == '0' are treated as having no primary CEL file.
have_primary = pdx_annot_df[pdx_annot_df.CEL_primary != '0']
cel_to_primary = dict(zip(['{}.cel'.format(cel) for cel in have_primary.CEL_primary],
                          [fix_dkfz_tumor_name(name) for name in have_primary.index]))

Rather than using the hgu133plus2 annotation, hspeccdf maps probes directly to Entrez genes, so for example the probe set 171484_at corresponds to the gene with Entrez ID 171484. I use gene ID information from [HUGO](https://www.genenames.org/cgi-bin/download) (last downloaded 4.13.18) to map Entrez IDs to HUGO gene symbols.
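
As a toy illustration of this naming convention, the sketch below strips the suffix from a few probe set IDs and looks up a symbol. The symbols in `toy_entrez2hugo` are made-up placeholders rather than real HUGO entries, and `probeset_to_symbol` is a hypothetical helper, not part of this notebook's `utils` module.

```python
def probeset_to_symbol(probeset, entrez2hugo):
    """Return the symbol for an hspeccdf-style probe set ID, or None if unmapped."""
    entrez = probeset.split('_')[0]  # '171484_at' -> '171484'
    return entrez2hugo.get(entrez)

# Placeholder mapping for illustration only (not real gene symbols).
toy_entrez2hugo = {'171484': 'GENE_A', '999': 'GENE_B'}

toy_symbols = {probeset: probeset_to_symbol(probeset, toy_entrez2hugo)
               for probeset in ['171484_at', '999_at', '12345_at']}
print(toy_symbols)
```

Probe sets whose prefix is missing from the mapping come back as `None`, which is the case the filtering step below has to drop.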

In [4]:
entrez2hugo = utils.make_col2col_dict(GENE_ID_CONVERSION_FILE, 'entrez', 'hugo')

In [None]:
probe_exp = pd.read_table(pdx_affy_probe_exp_file, index_col=0)

# hspeccdf probe set IDs look like '<entrez>_at'; map each one to its HUGO symbol,
# skipping probe sets with no (or a null) symbol.
p2g = {}
for probe in probe_exp.index:
    entrez = probe.split('_')[0]
    hugo = entrez2hugo.get(entrez)
    if not pd.isnull(hugo):
        p2g[probe] = hugo

# Keep only the mapped probe sets and relabel the rows by gene symbol.
gene_df = probe_exp.loc[list(p2g)].copy()
gene_df.index = [p2g[probe] for probe in gene_df.index]

In [7]:
gene_df = utils.merge_redundant_series(gene_df, axis=0).T.sort_index(axis=1)
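
The implementation of `utils.merge_redundant_series` is not shown in this notebook. As a rough sketch of what collapsing redundant labels can look like, the block below averages rows (or columns) that share a label. Averaging is an assumption made for illustration, and `merge_duplicate_labels` is a hypothetical stand-in, so the actual helper may combine duplicates differently.

```python
import pandas as pd

def merge_duplicate_labels(df, axis=0):
    """Average rows (axis=0) or columns (axis=1) that share the same label."""
    if axis == 0:
        return df.groupby(level=0).mean()
    # For columns, transpose, group the (now row) labels, and transpose back.
    return df.T.groupby(level=0).mean().T

# Toy data: two probe sets map to the placeholder symbol 'GENE_A'.
toy = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                   index=['GENE_A', 'GENE_A', 'GENE_B'],
                   columns=['s1', 's2'])
merged = merge_duplicate_labels(toy, axis=0)
print(merged)
```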

In [8]:
#gene_df = quantile_normalize(gene_df) # was done initially, but later considered unnecessary.

In [9]:
primaries_df = gene_df.loc[list(cel_to_primary.keys())]
primaries_df.index = [cel_to_primary[cel] for cel in primaries_df.index]
tumors_df = gene_df.loc[list(cel_to_tumor.keys())]
tumors_df.index = [cel_to_tumor[cel] for cel in tumors_df.index]

In [11]:
tumors_df.sort_index(axis=0).to_csv(PREPROC_PDX_AFFY_EXP_FILE)
primaries_df.sort_index(axis=0).to_csv(PREPROC_PRIMARY_AFFY_EXP_FILE)

In [12]:
tumors_df.shape

(20, 16247)

# Project expression onto genesets using ssGSEA
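
The `gsea` module used below is not shown here. For orientation, this is a minimal pure-Python sketch of a single-sample enrichment score in the style commonly attributed to Barbie et al. (2009): genes are ranked within one sample, and the score sums the difference between a rank-weighted running sum over gene set members and a uniform running sum over the remaining genes. The function name `ssgsea_score`, its dict-based input, and the reading of `alpha` as the rank-weight exponent are assumptions; the real `ssgsea` function may differ in normalization and other details.

```python
def ssgsea_score(expression, geneset, alpha=0.75):
    """Enrichment score of one gene set in one sample.

    expression: dict mapping gene -> expression value for a single sample.
    geneset: collection of gene names.
    """
    # Order genes from highest to lowest expression; the top gene gets rank n.
    genes = sorted(expression, key=expression.get, reverse=True)
    n = len(genes)
    ranks = {gene: n - i for i, gene in enumerate(genes)}
    hits = [gene in geneset for gene in genes]
    n_hit = sum(hits)
    if n_hit == 0 or n_hit == n:
        raise ValueError('gene set must contain some, but not all, measured genes')

    hit_total = sum(ranks[g] ** alpha for g, h in zip(genes, hits) if h)
    miss_step = 1.0 / (n - n_hit)

    p_hit = p_miss = score = 0.0
    for gene, is_hit in zip(genes, hits):
        if is_hit:
            p_hit += ranks[gene] ** alpha / hit_total  # weighted step for members
        else:
            p_miss += miss_step                        # uniform step for non-members
        score += p_hit - p_miss                        # accumulate the running difference
    return score

toy_sample = {'a': 4.0, 'b': 3.0, 'c': 2.0, 'd': 1.0}
high = ssgsea_score(toy_sample, {'a', 'b'})  # set made of the top-ranked genes
low = ssgsea_score(toy_sample, {'c', 'd'})   # set made of the bottom-ranked genes
```

With a samples-by-genes DataFrame such as `pdx_exp`, a function like this could be applied per row via `row.to_dict()`.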

In [None]:
from genesets import load_genesets
from gsea import ssgsea

In [14]:
pdx_exp = pd.read_csv(PREPROC_PDX_AFFY_EXP_FILE, index_col=0)
primary_exp = pd.read_csv(PREPROC_PRIMARY_AFFY_EXP_FILE, index_col=0)

In [16]:
genesets = load_genesets(GENESETS_DIR, which='all')

In [8]:
%%time
pdx_ssgsea = ssgsea(pdx_exp, genesets, alpha=0.75)

CPU times: user 20.8 s, sys: 1.64 s, total: 22.5 s
Wall time: 9min 33s


In [11]:
pdx_ssgsea.to_csv(PDX_SSGSEA_FILE)

In [12]:
%%time
primary_ssgsea = ssgsea(primary_exp, genesets, alpha=0.75)

CPU times: user 6.89 s, sys: 416 ms, total: 7.31 s
Wall time: 3min 25s


In [13]:
primary_ssgsea.to_csv(PRIMARY_SSGSEA_FILE)

# Process Illumina Beadchip data for 3 PDXs and primaries

These data (available [here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE28192)) were not used in the paper but are included in the supplement for reference.

In [None]:
from utils import merge_redundant_series

In [12]:
exp_matrix = pd.read_table(PDX_PRIMARY_BEADCHIP_SERIES_MATRIX_FILE, comment='!', index_col=0)
exp_matrix.head()

ID_REF,GSM697986,GSM697987,GSM697988,GSM697989,GSM697990,GSM697991,GSM697992,GSM697993,GSM697994,GSM697995,...,GSM698068,GSM698069,GSM698070,GSM698071,GSM698072,GSM698073,GSM698074,GSM698075,GSM698076,GSM698077
ILMN_1343289,8873.765,9019.939,8321.701,8597.173,7972.136,8454.363,7222.372,7160.046,5800.359,5789.193,...,3966.823,4510.685,7933.08,7655.048,3969.261,4202.289,6584.224,6251.0,7800.057,7169.415
ILMN_1343290,10581.03,11269.9,3117.435,3319.712,8648.661,8912.334,16107.97,16039.77,14342.82,15226.37,...,6941.944,6465.565,5421.817,5232.795,8435.935,8016.704,10082.87,9275.18,4842.793,5161.231
ILMN_1343291,27723.54,26947.17,25108.23,24472.99,23876.24,25107.19,30136.33,31310.06,24931.08,26643.79,...,13985.14,15141.92,21393.43,23346.1,19563.94,20737.94,21042.82,20138.19,22116.22,24224.9
ILMN_1343292,3720.148,3716.892,3045.436,3093.125,3431.755,3183.872,6163.287,5984.875,6520.189,6664.529,...,7127.288,7485.877,10175.23,11400.91,6385.931,6290.254,7631.206,7654.023,10857.41,10923.24
ILMN_1343293,8413.422,8265.768,5738.696,5639.104,9057.703,9153.695,4510.629,4578.813,5262.731,5183.555,...,3803.817,3709.52,3953.159,4444.244,2922.038,2886.281,2843.153,2871.426,3956.004,4251.552


In [13]:
ilmn_id_to_tumor = {"GSM698018": "Pt-1299",
"GSM698019": "Pt-1299",
"GSM698020": "ICb1299",
"GSM698021": "ICb1299",
"GSM698032": "Pt-1487",
"GSM698033": "Pt-1487",
"GSM698034": "ICb-1487",
"GSM698035": "ICb-1487",
"GSM698046": "Pt-1572",
"GSM698047": "Pt-1572",
"GSM698048": "ICb-1572",
"GSM698049": "ICb-1572"}

In [14]:
exp_df = exp_matrix.loc[:, list(ilmn_id_to_tumor.keys())]
exp_df.columns = [ilmn_id_to_tumor[ilmn_id] for ilmn_id in exp_df.columns]
exp_df = merge_redundant_series(exp_df, axis=1)

In [15]:
# downloaded from GEO's page for the platform
ilmn_annot = pd.read_table(PDX_PRIMARY_BEADCHIP_ILMN_ANNOT_FILE, comment='#', index_col=0).Symbol
ilmn_annot.head()

ID
ILMN_1725881     LOC23117
ILMN_1910180          NaN
ILMN_1804174       FCGR2B
ILMN_1810835        SPRR3
ILMN_1758197    LOC653895
Name: Symbol, dtype: object

In [16]:
nonan = ilmn_annot.dropna()
nonan.head()

ID
ILMN_1725881     LOC23117
ILMN_1804174       FCGR2B
ILMN_1810835        SPRR3
ILMN_1758197    LOC653895
ILMN_1668162      DGAT2L3
Name: Symbol, dtype: object

In [17]:
exp_df = exp_df.loc[nonan.index]
exp_df.index = [nonan.loc[idx] for idx in exp_df.index]
exp_df = merge_redundant_series(exp_df, axis=0).T.sort_index(axis=1)
exp_df.shape

(6, 24614)

In [18]:
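def rename_icb_samples(df):
    # Normalize every label to 'ICB<number>', e.g. 'Pt-1299', 'ICb1299' and
    # 'ICb-1487' become 'ICB1299', 'ICB1299' and 'ICB1487'.
    newdf = df.copy()
    newdf.index = [re.sub(r'.*?([0-9].+).*?', r"ICB\1", s) for s in df.index]
    return newdf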

In [19]:
primaries = [idx for idx in exp_df.index if idx.startswith('Pt')]
primary_exp = rename_icb_samples(exp_df.loc[primaries].sort_index(axis=0))
tumor_exp = rename_icb_samples(exp_df.drop(primaries, axis=0).sort_index(axis=0))

In [20]:
tumor_exp.to_csv(PREPROC_PDX_BEADCHIP_EXP_FILE)
primary_exp.to_csv(PREPROC_PRIMARY_BEADCHIP_EXP_FILE)

# Write expression to supplementary file

I am not sure we will need this supplement, because the expression data will be deposited in GEO.

In [None]:
bc_pdx = pd.read_csv(PREPROC_PDX_BEADCHIP_EXP_FILE, index_col=0).sort_index(axis=1)
bc_pt = pd.read_csv(PREPROC_PRIMARY_BEADCHIP_EXP_FILE, index_col=0).sort_index(axis=1)
pdx_exp = pd.read_csv(PREPROC_PDX_AFFY_EXP_FILE, index_col=0)
pdx_primary_exp = pd.read_csv(PREPROC_PRIMARY_AFFY_EXP_FILE, index_col=0)

In [None]:
from modules.utils import save_xls
sheetnames = 'affy_pdx affy_primary illumina_pdx illumina_primary'.split()
dfs = [pdx_exp, pdx_primary_exp, bc_pdx, bc_pt]
save_xls(dfs, sheetnames, SUPPL_PREPROC_EXP_FILE)  # pass the path constant, not the literal string