CUPLR is a classifier for identifying the primary tumor location of metastatic tumors based on features derived from whole genome sequencing data. The model itself is an ensemble of binary random forest classifiers which each predict the probability of one cancer type. The final predicted cancer type is the one with the highest probability.
The features used by CUPLR are extracted from the output of the HMF pipeline 5, specifically from: (i) the somatic VCF file containing SBS, DBS and indel mutations, (ii) PURPLE (v2.53) output, and (iii) LINX (v1.17) output. Below is a summary of the features.
Feature type | Source | Description |
---|---|---|
sigs | VCF | Relative contribution of SBS, DBS and indel mutational signatures |
mut_load | VCF | Total number of SBSs, DBSs and indels |
rmd | VCF | Regional mutational density (RMD) profiles as extracted by NMF |
chrom_arm | PURPLE | Chromosome arm gains/losses compared to the overall genome ploidy |
gender | PURPLE | Gender as derived from copy number data |
genome | PURPLE | Genome properties, including overall ploidy, diploid proportion, and presence of whole genome duplication |
gene | LINX | Deep deletions, amplifications, biallelic losses and mutations of cancer associated genes |
sv | LINX | Simple and complex structural variants |
fusion | LINX | Presence of gene fusions |
viral_ins | LINX | Presence of viral sequence insertions |
CUPLR was trained with tumor samples from 6756 patients from the Hartwig Medical Foundation (HMF) and the Pan-Cancer Analysis of Whole Genomes (PCAWG) consortium. The model can predict the primary tumor location with an overall recall and precision of ~90% based on cross-validation and predictions on a held out test set.
For more details about the model, please see the CUPLR paper:
Machine learning-based tissue of origin classification for cancer of
unknown primary diagnostics using genome-wide mutation features
Luan Nguyen, Arne Van Hoeck, Edwin Cuppen. Nature Communications, 2022
Link: https://www.nature.com/articles/s41467-022-31666-w
CUPLR depends on several R packages for basic functionality. These can be installed as follows:
install.packages(c('randomForest','reshape2','ggplot2','ggrepel'))
## For calculating random forest feature contributions:
install.packages('rfFC', repos='http://R-Forge.R-project.org')
## For reading VCFs and extracting mutational signatures:
install.packages('devtools')
devtools::install_github('https://github.com/UMCUGenetics/mutSigExtractor/')
Certain functions require other R packages. For details on optional dependencies, please see the DESCRIPTION files.
CUPLR is composed of the following R packages:
cuplr
: random forest training, prediction, and performancefeatureExtractor
: extract the features used by CUPLRstatsExtra
: statistics used for univariate feature selection for training CUPLRnmf
: wrapper around the NNLM package for non-negative matrix factorization for generating RMD profiles.
To download CUPLR, run the following commands in the terminal:
cd /working/dir/
git clone https://github.com/UMCUGenetics/cuplr/
Then load the featureExtractor
, cuplr
, and mutSigExtractor
packages. The remaining CUPLR packages (statsExtra
and nmf
) are only
needed for training CUPLR.
cuplr_dir <- '/path/to/cuplr/'
devtools::load_all(paste0(cuplr_dir,'/featureExtractor/'))
devtools::load_all(paste0(cuplr_dir,'/cuplr/'))
This tutorial will demonstrate how cancer type prediction can be
performed using example input files located at doc/data/
. These are 3
primary tumor samples from the PCAWG consortium, whose BAM files were
analyzed with the HMF pipeline.
Extraction of features per sample is performed using
extractFeaturesCuplr()
. This typically takes about 15-30 seconds per
sample.
Paths to the required input files can be provided to input.paths
as a
named character vector. Make sure that this vector has the names shown
below. The below code will return a 1-row dataframe with the values for
each feature.
sample_dir <- paste0(cuplr_dir,'/doc/data/DO35886/')
features <- extractFeaturesCuplr(
input.paths=c(
purple.smnv= paste0(sample_dir,'/DO35886T.purple.somatic.vcf.gz'), ## *.purple.somatic.vcf.gz
purple.cnv= paste0(sample_dir,'/DO35886T.purple.cnv.somatic.tsv'), ## *.purple.cnv.somatic.tsv
purple.purity= paste0(sample_dir,'/DO35886T.purple.purity.tsv'), ## *.purple.purity.tsv
linx.drivers= paste0(sample_dir,'/DO35886T.linx.driver.catalog.tsv'), ## *.linx.driver.catalog.tsv
linx.fusions= paste0(sample_dir,'/DO35886T.linx.fusion.tsv'), ## *.linx.fusion.tsv
linx.viral.inserts= paste0(sample_dir,'/DO35886T.linx.viral_inserts.tsv'), ## *.linx.viral_inserts.tsv
linx.vis.sv.data= paste0(sample_dir,'/DO35886T.linx.vis_sv_data.tsv') ## *.linx.vis_sv_data.tsv
)
)
features[,1:5]
## sigs.SBS1 sigs.SBS2 sigs.SBS3 sigs.SBS4 sigs.SBS5
## 1 0.04792343 0.01030294 0.009803203 0 0
Alternatively, the path to a folder with all the required input files
can be provided to in.dir
. The filenames should end with those shown
in the comments and the above example (e.g. *.purple.somatic.vcf.gz).
An output folder can optionally be specified to out.dir
. With this
approach, intermediate files are also written and allows resuming the
feature extraction if a crash occurs (instead of starting the extraction
from the beginning).
For this tutorial, we will use in.dir
and out.dir
to load the
example input data and write the outputs. The below code extracts the
features for all 3 example samples.
sample_dirs <- list.dirs(paste0(cuplr_dir,'doc/data/'), recursive=F, full.names=T)
for(in_dir in sample_dirs){
message('\nExtracting features at: ', in_dir)
out_dir <- paste0(in_dir,'/output/')
dir.create(out_dir, showWarnings=F)
extractFeaturesCuplr(
in.dir=in_dir, ## Provide `input` to in.dir instead of `input.paths`
out.dir=out_dir,
verbose=1
)
}
We can then read the all_features.txt.gz
files which are the 1-row
dataframes (as mentioned above). It is important to specify
check.names=F
to read.delim()
as the column names have some illegal
characters which would otherwise be modified, and we don’t want this to
happen.
out_paths <- paste0(sample_dirs,'/output/all_features.txt.gz')
## Read txt files
features <- lapply(out_paths, function(i){
read.delim(i, check.names=F)
})
## Merge 1-row dataframes into a single dataframe
features <- do.call(rbind, features)
## Assign sample names to rownames
sample_names <- basename(sample_dirs)
rownames(features) <- sample_names
features[,1:5]
## sigs.SBS1 sigs.SBS2 sigs.SBS3 sigs.SBS4 sigs.SBS5
## DO220844 0.07132016 0.02757172 0.000000000 0 0.06116417
## DO35886 0.04792343 0.01030294 0.009803203 0 0.00000000
## HMF000185A 0.11327875 0.00000000 0.000000000 0 0.27081490
There are 4000+ features; too many to print on screen. However, with below code we can see the names of the first couple of features in each group.
feature_groups <- groupFeaturesByTag(colnames(features), rm.tags=F)
lapply(feature_groups, head)
## $sigs
## [1] "sigs.SBS1" "sigs.SBS2" "sigs.SBS3" "sigs.SBS4" "sigs.SBS5" "sigs.SBS6"
##
## $mut_load
## [1] "mut_load.snv" "mut_load.indel" "mut_load.dbs"
##
## $rmd
## [1] "rmd.1p_1" "rmd.1p_2" "rmd.1p_3" "rmd.1p_4" "rmd.1p_5" "rmd.1p_6"
##
## $sv
## [1] "sv.sv_load" "sv.LINEs" "sv.DEL_[0,300)"
## [4] "sv.DEL_[300,1e+03)" "sv.DEL_[1e+03,1e+04)" "sv.DEL_[1e+04,1e+05)"
##
## $chrom_arm
## [1] "chrom_arm.gain.1p" "chrom_arm.gain.1q" "chrom_arm.gain.2p"
## [4] "chrom_arm.gain.2q" "chrom_arm.gain.3p" "chrom_arm.gain.3q"
##
## $gene
## [1] "gene.ACVR1.amp" "gene.ACVR1.deep_del" "gene.ACVR1.biall"
## [4] "gene.ACVR1.monoall" "gene.ACVR2A.amp" "gene.ACVR2A.deep_del"
##
## $fusion
## [1] "fusion.@IGH_*" "fusion.@IGL_MYC" "fusion.*_ETV1/4/5"
## [4] "fusion.*_RET" "fusion.*_TFE3" "fusion.*_TFEB"
##
## $viral_ins
## [1] "viral_ins.AAV" "viral_ins.EBV" "viral_ins.HBV" "viral_ins.HCV"
## [5] "viral_ins.HIV" "viral_ins.HPV"
##
## $genome
## [1] "genome.ploidy" "genome.diploid_proportion"
## [3] "genome.has_wgd"
##
## $gender
## [1] "gender.gender"
Now that we have the features, we can predict the cancer type. We first need to load CUPLR itself as well as the probability calibration curves.
The scores outputted by a random forest need to be calibrated to yield true probabilities. This means that a probability of 0.9 should mean that for example: amongst 100 hypothetical samples predicted as breast cancer, 90 of them are actually breast cancer. For more info, see the scikit-learn documentation about this topic.
model <- readRDS(MODEL_PATH)
prob_cal_curves <- read.delim(PROB_CAL_CURVES_PATH)
We can then use predict()
to get the probabilities of each cancer type
for each sample. The output of predict()
is a list containing several
objects. Here we have specified calc.feat.contrib=T
which will
calculate the contribution of each feature to each cancer type
prediction. This is required to generate the patient report (see next
section).
pred_report <- predict(
object=model,
newdata=features,
prob.cal.curves=prob_cal_curves,
calc.feat.contrib=T
)
pred_report
## Objects in list:
## $prob $prob_scaled $class_pred $feat_contrib
##
## Calibrated probabilities:
## $prob_scaled
## Biliary Breast Cervix CNS_Glioma CNS_Medullo CNS_PiloAstro
## DO220844 0.0005773561 0.213022 0 0.000000000 0.000000000 0.0000000
## DO35886 0.0000000000 0.000000 0 0.001242909 0.005507125 0.9135719
## HMF000185A 0.0000000000 0.000000 0 0.000000000 0.000000000 0.0000000
## Colorectal Gastroesophageal HeadAndNeck_Other
## DO220844 0.0003462293 0.0002115484 0.001378098
## DO35886 0.0004806873 0.0016663170 0.000000000
## HMF000185A 0.0000000000 0.0000000000 0.000000000
## HeadAndNeck_SalivaryGland Kidney_Chromophobe Kidney_ClearCell
## DO220844 0.0000000000 0.0000000000 0.0000000000
## DO35886 0.0000000000 0.0007175474 0.0001646324
## HMF000185A 0.0004121185 0.0000000000 0.0001097549
## Kidney_Papillary Liver Lung_NonSmallCell Lung_SmallCell
## DO220844 0.000000e+00 0.000000000 0.003274019 0
## DO35886 0.000000e+00 0.001093323 0.004961213 0
## HMF000185A 9.299103e-05 0.000000000 0.001241159 0
## Lymphoid Mesothelium Myeloid NET_Gastrointestinal NET_Lung
## DO220844 0 0 0.0000000000 1.507647e-04 0.000000e+00
## DO35886 0 0 0.0004481103 4.186282e-05 0.000000e+00
## HMF000185A 0 0 0.0000000000 1.116342e-04 6.116457e-05
## NET_Pancreas Ovarian Pancreas Prostate Sarcoma_GIST
## DO220844 0.0004615711 0.0015382461 0.0011009668 0 0
## DO35886 0.0383353453 0.0001209926 0.0002253971 0 0
## HMF000185A 0.0392612195 0.0000000000 0.0004507942 0 0
## Sarcoma_Leiomyo Sarcoma_Lipo Sarcoma_Osteo Sarcoma_Other
## DO220844 0.0003592354 0.0000000 0.0022210274 0.0004005747
## DO35886 0.0007545717 0.0000000 0.0031881031 0.0010086661
## HMF000185A 0.0023550295 0.5667544 0.0006195843 0.4688011203
## Skin_Carcinoma Skin_Melanoma Thyroid Urothelial Uterus
## DO220844 0.000000000 0.8560547 0.0000000000 0.0003200161 0
## DO35886 0.000000000 0.0000000 0.0037040469 0.0004442942 0
## HMF000185A 0.003430203 0.0000000 0.0003721351 0.0007926048 0
However, the above raw prediction output is not informative at a glance.
We can use summary()
to show the prediction in a neat table.
summary(
pred_report,
top.n.classes=3, ## Optional: Shows the top 3 classes and their probabilities
top.n.feat=3 ## Optional: Shows the top 3 features contributing to the top class prediction
)
## sample pred_class.1 pred_class.2 pred_class.3 pred_prob.1
## 1 DO220844 Skin_Melanoma Breast Lung_NonSmallCell 0.856
## 2 DO35886 CNS_PiloAstro NET_Pancreas CNS_Medullo 0.914
## 3 HMF000185A Sarcoma_Lipo Sarcoma_Other NET_Pancreas 0.567
## pred_prob.2 pred_prob.3 feat_contrib.1.1 feat_contrib.1.2
## 1 0.213 0.003 rmd.Skin_Melanoma.1=0.512 sigs.SBS38=0.267
## 2 0.038 0.006 fusion.KIAA1549_BRAF=0.214 mut_load.indel=0.165
## 3 0.469 0.039 fusion.FUS_DDIT3=0.325 rmd.Sarcoma_Lipo.2=0.319
## feat_contrib.1.3
## 1 sv.COMPLEX.largest_cluster=0.021
## 2 sigs.ID12=0.082
## 3 sigs.SBS8=0.017
Using patientReport()
we show plot the output of CUPLR graphically for
one patient. The left panel of the patient report shows the cancer type
probabilities. The right panels show the values of the most important
features contributing to each of the top predicted cancer types.
In the right panels, the feature value averages in patients with the respective predicted cancer type and patients with other cancer types are also shown to provide context to the feature values of the patient. For numeric features, (e.g. SBS mutational load) the average refers to the interquartile mean. For boolean features, (e.g. presence of a gene fusion) the average refers to the proportion of patients with the feature, with the patient feature values of 0% and 100% indicating absence/presence respectively.
patient_report <- patientReport(
probs=pred_report$prob_scaled,
feat.contrib=pred_report$feat_contrib,
sample.name='HMF000185A',
## Shows all 3 panels. Set to 2 to show only the first 2, or 1 to show only the first
level.of.detail=3
)
plot(patient_report)
In the first panel we can see that this sample was predicted as a liposarcoma (this sample was indeed labelled as a liposarcoma).
In the middle panel, we can see that the prediction is primarily supported by this sample having a FUS-DDIT3 fusion (fusion.FUS_DDIT3), as well as having the RMD profile of liposarcoma (rmd.Sarcoma_Lipo.2).
The third panel shows the feature values compared to the average feature values in liposarcoma (pink) and in other cancer types (red). We can see that the liposarcoma prediction for this sample was supported by this sample having a higher proportion of SBSs contributing (33.5% of SBSs) to the RMD profile of liposarcoma compared to in other cancer types (0%). We can also see that FUS-DDIT3 fusions occur in 18.4% of liposarcomas, but in 0% of other cancer types. The sample value is 1 (100% occurrence) indicating that this feature is present in this sample.
We can also see in the first panel that Sarcoma_Other also has a high probability. In cases where the probabilities are more uncertain (like as shown here), more feature contribution panels corresponding to the top predicted classes will be shown. This should aid in determining the cancer type, since it’s possible that e.g. the 2nd predicted cancer type better matches with evidence from pathology.