Daniel Huebschmann and Lea Jopp-Saile 20/05/2019
The package is currently being submitted to Bioconductor. Please use it once it is accepted there and a suitable citation is provided.
The concept of mutational signatures was introduced in a series of papers by Ludmil Alexandrov, Serena Nik-Zainal, Michael Stratton and others (for precise citations please refer to the vignette of the package). The concept was initialy only valid for SNV signatures and in 2018 extedended to allow also the analysis of INDEL mutational sigantures. The general approach is as follows:
*The SNVs are categorized by their nucleotide exchange. In total there are
4 x 3 = 12
different nucleotide exchanges, but if summing over reverse
complements only 12 / 2 = 6
different categories are left. For every SNV
detected, the motif context around the position of the SNV is extracted. This
may be a trinucleotide context if taking one base upstream and one base
downstream of the position of the SNV, but larger motifs may be taken as well
(e.g. pentamers). Taking into account the motif context increases combinatorial
complexity: in the case of the trinucleotide context, there are
4 x 6 x 4 = 96
different variant categories. These categories are
called features in the following text. The number of features will be
called n
.
*The INDELs are categorized bt their insertion and deletion and the immediate sequence context. In total there are 83 differnet categories, features. In total there are 47 delition freatures and 36 inserion freatures. For each INDEL the sequence context is extracted and the number of motive repetition determined. In the case of microhomology delition only a partial repetition of the sequence context is considert.
- A cohort consists of different samples with the number of samples denoted by
m
. For each sample we can count the occurences of each feature, yielding ann
-dimensional vector (n
being the number of features) per sample. For a cohort, we thus get ann x m
-dimensional matrix, called the mutational catalogueV
. It can be understood as a summary indicating which sample has how many variants of which category, but omitting the information of the genomic coordinates of the variants. - The mutational catalogue
V
is quite big and still carries a lot of complexity. For many analyses a reduction of complexity is desirable. One way to achieve such a complexity reduction is a matrix decomposition: we would like to find two smaller matricesW
andH
which if multiplied would span a high fraction of the complexity of the big matrixV
(the mutational catalogue). Remember thatV
is ann x m
-dimensional matrix,n
being the number of features andm
being the number of samples.W
in this setting is ann x l
-dimensional matrix andH
is anl x m
-dimensional matrix. The columns ofW
are called the mutational signatures and the columns ofH
are called exposures.l
denotes the number of mutational signatures. Hence the signatures aren
-dimensional vectors (withn
being the number of features), while the exposures arel
-dimensional vectors (l
being the number of signatures). Note that as we are dealing with count data, we would like to have only positive entries inW
andH
. A mathematical method which is able to do such a decomposition is the NMF (nonnegative matrix factorization). It basically solves the problem as illustrated in the following figure:
Of course the whole concept only leads to a reduction in complexity if l < n
,
i.e. if the number of signatures is smaller than the number of features, as
indicated in the above figure. Note that the NMF itself solves the above
problem for a given number of signatures l
. Addinional criteria exist to
evaluate the true number of signatures.
In a context where mutational signatures W
are already known (because they
were decribed and published or they are available in a
database as under http://cancer.sanger.ac.uk/cosmic/signatures), we might
want to just find the exposures H
for these known signatures in the
mutational catalogue V
of a given cohort.
The YAPSA-package (Yet Another Package for Signature Analysis) presented
here provides the function LCD
(linear combination decomposition)
to perform this task. The advantage of this method is that there are no
constraints on the cohort size, so LCD
can be run for as little as one
sample and thus be used e.g. for signature analysis in personalized oncology.
In contrast to NMF, LCD
is very fast and requires very little computational
resources. The YAPSA package provides additional functions for signature
analysis, e.g. for stratifying the mutational catalogue to determine signature
exposures in different strata, part of which is discussed in the vignette of
the package.
As long as YAPSA
is not yet accepted on
Bioconductor it may be downloaded and installed
from github:
library(devtools)
install_github("huebschm/YAPSA")
Of course, devtools
has to be installed:
install.packages("devtools")
YAPSA
already has preparations to use the newest versions of the pacakges
circlize
and ComplexHeatmap
by Zuguang Gu. These are currently not in the
release branch of Bioconductor. If you want your system to be ready for the
next coming update of YAPSA
you may already now install the newest versions of
these packages from github as well:
install_github("jokergoo/circlize")
install_github("jokergoo/ComplexHeatmap")
If you ran into dependency conflicts before, try rerunning
install_github("huebschm/YAPSA")
now.
library(YAPSA)
library(knitr)
opts_chunk$set(echo=TRUE)
opts_chunk$set(fig.show='asis')
Load data in a vcf-like format:
data("lymphoma_Nature2013_raw")
Adapt the data structure:
names(lymphoma_Nature2013_raw_df) <- c("PID","TYPE","CHROM","START",
"STOP","REF","ALT","FLAG")
lymphoma_Nature2013_df <- subset(lymphoma_Nature2013_raw_df,TYPE=="subs",
select=c(CHROM,START,REF,ALT,PID))
names(lymphoma_Nature2013_df)[2] <- "POS"
head(lymphoma_Nature2013_df)
## CHROM POS REF ALT PID
## 1 1 183502381 G A 07-35482
## 2 18 60985506 T A 07-35482
## 3 18 60985748 G T 07-35482
## 4 18 60985799 T C 07-35482
## 5 2 242077457 A G 07-35482
## 6 6 13470412 C T 07-35482
Note that there are 48 different samples:
unique(lymphoma_Nature2013_df$PID)
## [1] 07-35482 1060 1061 1065
## [5] 1093 1096 1102 4101316
## [9] 4105105 4108101 4112512 4116738
## [13] 4119027 4121361 4125240 4133511
## [17] 4135350 4142267 4158726 4159170
## [21] 4163639 4175837 4177856 4182393
## [25] 4189200 4189998 4190495 4193278
## [29] 4194218 4194891 515 DLBCL-PatientA
## [33] DLBCL-PatientB DLBCL-PatientC DLBCL-PatientD DLBCL-PatientE
## [37] DLBCL-PatientF DLBCL-PatientG DLBCL-PatientH DLBCL-PatientI
## [41] DLBCL-PatientJ DLBCL-PatientK DLBCL-PatientL DLBCL-PatientM
## [45] EB2 FL009 FL-PatientA G1
## 48 Levels: 07-35482 1060 1061 1065 1093 1096 1102 4101316 ... G1
For convenience later on, we annotate subgroup information to every variant (indirectly through the sample it occurs in). For reasons of simplicity, we also restrict the analysis to the Whole Genome Sequencing (WGS) datasets:
lymphoma_Nature2013_df$SUBGROUP <- "unknown"
DLBCL_ind <- grep("^DLBCL.*",lymphoma_Nature2013_df$PID)
lymphoma_Nature2013_df$SUBGROUP[DLBCL_ind] <- "DLBCL_other"
MMML_ind <- grep("^41[0-9]+$",lymphoma_Nature2013_df$PID)
lymphoma_Nature2013_df <- lymphoma_Nature2013_df[MMML_ind,]
data(lymphoma_PID)
for(my_PID in rownames(lymphoma_PID_df)) {
PID_ind <- which(as.character(lymphoma_Nature2013_df$PID)==my_PID)
lymphoma_Nature2013_df$SUBGROUP[PID_ind] <-
lymphoma_PID_df$subgroup[which(rownames(lymphoma_PID_df)==my_PID)]
}
lymphoma_Nature2013_df$SUBGROUP <- factor(lymphoma_Nature2013_df$SUBGROUP)
unique(lymphoma_Nature2013_df$SUBGROUP)
## [1] WGS_D WGS_F WGS_B WGS_I
## Levels: WGS_B WGS_D WGS_F WGS_I
As stated above, one of the functions in the YAPSA package (LCD
) is
designed to do mutational signatures analysis with known signatures. There are
(at least) two possible sources for signature data: i) the ones published
initially by Alexandrov, and ii) an updated and curated
current set of mutational signatures is maintained by Ludmil Alexandrov at http://cancer.sanger.ac.uk/cosmic/signatures.
data(sigs)
current_sig_df <- AlexCosmicValid_sig_df
current_sigInd_df <- AlexCosmicValid_sigInd_df
Now we can start using main functions of the YAPSA package: LCD
and
LCD_complex_cutoff
.
Prepare.
library(BSgenome.Hsapiens.UCSC.hg19)
lymphoma_Nature2013_df <- translate_to_hg19(lymphoma_Nature2013_df,"CHROM")
lymphoma_Nature2013_df$change <-
attribute_nucleotide_exchanges(lymphoma_Nature2013_df)
lymphoma_Nature2013_df <-
lymphoma_Nature2013_df[order(lymphoma_Nature2013_df$PID,
lymphoma_Nature2013_df$CHROM,
lymphoma_Nature2013_df$POS),]
This section uses functions which are to a large extent wrappers for functions
in the package SomaticSignatures
by Julian Gehring.
word_length <- 3
lymphomaNature2013_mutCat_list <-
create_mutation_catalogue_from_df(
lymphoma_Nature2013_df,
this_seqnames.field = "CHROM", this_start.field = "POS",
this_end.field = "POS", this_PID.field = "PID",
this_subgroup.field = "SUBGROUP",
this_refGenome = BSgenome.Hsapiens.UCSC.hg19,
this_wordLength = word_length)
The function create_mutation_catalogue_from_df
returns a list object with
several entries. We will use the one called matrix
.
lymphomaNature2013_mutCat_df <- as.data.frame(
lymphomaNature2013_mutCat_list$matrix)
When using LCD_complex_cutoff
, we have to supply a vector of cutoffs with as
many entries as there are signatures. It may make sense to provide different
cutoffs for different signatures.
my_cutoff <- 0.06
general_cutoff_vector <- rep(my_cutoff,dim(current_sig_df)[2])
specific_cutoff_vector <- general_cutoff_vector
specific_cutoff_vector[c(1,5)] <- 0
specific_cutoff_vector
## [1] 0.00 0.06 0.06 0.06 0.00 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06
## [15] 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06 0.06
## [29] 0.06 0.06
In this example, the cutoff for signatures AC1 and AC5 is thus set to 0, whereas
the cutoffs for all other signatures remains at 0.06. Running the function
LCD_complex_cutoff
:
CosmicValid_cutoffSpec_LCDlist <- LCD_complex_cutoff(
in_mutation_catalogue_df = lymphomaNature2013_mutCat_df,
in_signatures_df = current_sig_df,
in_cutoff_vector = specific_cutoff_vector,
in_sig_ind_df = current_sigInd_df)
Some adaptation (extracting and reformatting the information which sample belongs to which subgroup):
COSMIC_subgroups_df <-
make_subgroups_df(CosmicValid_cutoffSpec_LCDlist$exposures,
lymphoma_Nature2013_df)
Plotting absolute exposures for visualization:
exposures_barplot(
in_exposures_df = CosmicValid_cutoffSpec_LCDlist$exposures,
in_signatures_ind_df = CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df,
in_subgroups_df = COSMIC_subgroups_df)
And relative exposures:
exposures_barplot(
in_exposures_df = CosmicValid_cutoffSpec_LCDlist$norm_exposures,
in_signatures_ind_df = CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df,
in_subgroups_df = COSMIC_subgroups_df)
Note that the signatures extracted with the signature-specific cutoffs are the same in the example displayed here. Depending on the analyzed cohort and the choice of cutoffs, the extracted signatures may vary considerably.
To identify groups of samples which were exposed to similar mutational
processes, the exposure vectors of the samples can be compared. The YAPSA
package provides a custom function for this task: complex_heatmap_exposures
,
which uses the package ComplexHeatmap by Zuguang Gu. It produces
output as follows:
complex_heatmap_exposures(CosmicValid_cutoffSpec_LCDlist$norm_exposures,
COSMIC_subgroups_df,
CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df,
in_data_type="norm exposures",
in_subgroup_colour_column="col",
in_method="manhattan",
in_subgroup_column="subgroup")
If you are interested only in the clustering and not in the heatmap
information, you could also use hclust_exposures
:
hclust_list <-
hclust_exposures(CosmicValid_cutoffSpec_LCDlist$norm_exposures,
COSMIC_subgroups_df,
in_method="manhattan",
in_subgroup_column="subgroup")
The dendrogram produced by either the function complex_heatmap_exposures
or
the function hclust_exposures
can be cut to yield signature exposure specific
subgroups of the PIDs.
cluster_vector <- cutree(hclust_list$hclust,k=4)
COSMIC_subgroups_df$cluster <- cluster_vector
subgroup_colour_vector <- rainbow(length(unique(COSMIC_subgroups_df$cluster)))
COSMIC_subgroups_df$cluster_col <-
subgroup_colour_vector[factor(COSMIC_subgroups_df$cluster)]
complex_heatmap_exposures(CosmicValid_cutoffSpec_LCDlist$norm_exposures,
COSMIC_subgroups_df,
CosmicValid_cutoffSpec_LCDlist$out_sig_ind_df,
in_data_type="norm exposures",
in_subgroup_colour_column="cluster_col",
in_method="manhattan",
in_subgroup_column="cluster")
This type of analysis is performed using the function run_SMC
where SMC stands
for stratification of the mutational catalogue. For details on this
function please consult the vignette.