MCFA is a method for analysis of multimodal, high-dimensional data that combines principals from multiset canonical correlation analysis and factor analysis to jointly model shared and private features across the datasets. It was designed with multi-omic analysis in mind, but may be useful for any dataset where two or more data types are gathered for the same samples. MPCFA is implemented in Python 3 using pytorch on the backend for speed and GPU compatibility.
MCFA is available on PyPI and installable via pip
. Just run
pip install multicor_fa
Dependencies are listed in pyproject.toml
and should be automatically installed.
These include torch
, numpy
, pandas
, scipy
, and sklearn
.
The primary interface to the fitting routines is the function mcfa_model.fit()
which
takes a list of pandas DataFrames with equal numbers of rows (samples by features) and a set
of analysis options. By default MPCFA learns the relevant hyperparameters from the
data and thus the default options will suffice for many cases.
This interface is intended to be used via a Jupyter or
Google Colab notebook.
Given data in the matrices Y_1, Y_2, Y_3
, the primary input to MCFA is an iterable
containing the datasets. If it is a named iterable (dictionary), those names
will be used in the returned results for clarity.
from multicor_fa import mcfa_model
Y = {"Mode1": Y_1, "Mode2": Y_2, "Mode3": Y_3}
mcfa_res = mcfa_model.fit(Y)
Calling mcfa_model.fit(Y)
will calculate the dataset PCs, infer the hyperparameters, and fit
the model. If you'd prefer to set these yourself, you can use the argument n_pcs
to pass
a list of number of components of each dataset to keep, d
to pass an integer specifying
the shared dimensionality, and k
to pass a list specifying the private space dimensionality.
mcfa_res = mcfa_model.fit(Y, n_pcs = [24, 11, 57], d = 5, k = [12, 3, 22])
The boolean arguments center
and scale
(default True
) will standardize the data
if it is not already standardized. This is generally recommended and simplifies interpretaiton
of the variance explained.
If you don't need to do analysis of the factor loadings matrix, you can set the parameter
result_space = 'pc'
which avoids transforming the loadings matrix back into the full data
space. This can save some time and memory if you have very large hidden dimensionalities and
feature spaces for your input matrices.
mcfa_model.fit()
returns an MCFARes
object that can be used in downstream analysis.
mcfa_res.Z
contains a samples x d
pd.DataFrame
with the location of the points in
the shared space, mcfa_res.X
contains an iterable of samples x k_i
pd.DataFrame
's
with the corresponding private spaces. W
is likewise an iterable of shared space
factor loadings dataframes d x p_i
, and L
is an iterable of private space
factor loadings matrices k_i x p_i
. var_exp_Z
and var_exp_X
contain the
variance in each dataset explained by the shared and private spaces, respectively.
See the dosctring of the function mcfa_model.fit()
for complete details, and
analysis_notebooks/analyze_mesa.ipynb
to see how we used the functionality of
this package in our analysis of the MESA/TOPMed multi-omic pilot data.
Once you have fit your model, you may consider doing cross-validation to assure that you
have not over-fit. The function mcfa_model.cv()
provides this functionality.
Z_cv, X_cv, nrmse_tr, nrmse_te = mcfa_model.cv(Y, mcfa_res, folds = 'loo')
Z_cv
and X_cv
will contain a pd.DataFrame
and iterable of such where each sample
in the DataFrame corresponds to the location of that sample in the respective space
in the model fit without that sample included. The nrmse_tr
and nrmse_te
contain
normalized root mean square error for each fold in each dataset. The folds
argument
specifies the number of folds, or use folds='loo'
for leave-one-out cross-validation.
Keep in mind that proper choice of the hidden dimensionality is directly related to the sample size - with fewer samples, you have to model fewer hidden dimensions. Thus, if you use too few folds, you may observe over-fitting just because each fold doesn't have enough samples to model the same number of hidden dimensions as the full dataset. In the manuscript we use leave-one-out to get around this, but you can probably get away with less.
Generally speaking, we recommend analying variance-normalized feature loadings matrices or Z-transformed correlations of dataset features with the hidden space of interest to find highly-weighted features in downstream analysis. This is generally less-prone to mis-specification than "set"-type enrichment analyses. The primary reason for this is that the feature loadings matrix is correlated, and dealing with this correlation is not straightforward.
You can use the function mcfa_model.score
to calculate correlations between data features
inferred dimensions
cor1 = mcfa_model.score(Y['Y_1'], mcfa_res.Z, transfrom = True
This returns a features x factors
pd.DataFrame
with entries the corelation between
the feature and factor. With transform = True
these will use Fisher's Z-transformation
to turn these into roughly normally-distributed Z-scores.
However, we do also provide functionality for
gene set enrichment analysis using the gsea.parametric
and gsea.permutation
functions. These functions implement the proposed approach in Frost et al. BioDataMining
2015.
The simplest and most problematic approach is parametric -
from multicor_fa import gsea
scores, pvals = gsea.parametric(mcfa_res.W['Y_1'], annot, ref_sample = None)
This will compute a score and p-value for each annotation in annot
against each dimesion
in the factor loadings matrix for dataset Y_1
. Here, annot
is a dictionary mapping
annotation names to feature names. You can use the function gsea.load_annot()
to load
a GMT annotation file into the proper format. ref_sample
gives a reference sample for
correcting for the correlation present in the factor loadings matrix. ref_sample = None
will
perform no correction, and will likely result in many false positives. On the other hand,
you could use in-sample correction ref_sample = Y['Y_1']
, which will likely result
in very low power. An independent reference panel may provide better correction.
If you have one, you can also specify that here.
The other approach is to do permutations -
scores, pvals = gsea.permutation(Y['Y_1'], mcfa_res.Z, annot, n_perm=100000, threads=10)
This looks for (transformed) correlations of the input data (Y['Y_1']
) with the
reduced space (mcfa_res.Z
) and uses permutations to correct for the correlations
across features. This generally works quite well, with the trouble being that you
need to do a lot of permutations. For a standard gene set enrichment analysis, you
likely need to do at least 100k permutations to get somewhat accurate p-value estimates.
If you have multiple cores available, you can speed things up with the threads
argument.
Finally, you may be interested in doing feature loadings analysis on sets of less-interpretable
features, such as methylation markers. For this, you can use the funcion gsea.gene_score()
.
scores = gsea.gene_score(mcfa_res.W['Y_2'], mapping, method = 'mean')
Where mapping is a dictionary mapping gene names to columns of W
. This calculates the
mean (or optionally method = meansq
) value of the features and returns a pd.DataFrame
of gene scores.
Please note that MCFA does no preprocessing or data input checking (eg for sample alignment)
other than optional feature centering and scaling at this time.
All additional preprocessing such as normalization, filtering of problematic samples,
imputation of missing values, and alignment of samples across datasets needs to be done
prior to calling mcfa_model.fit()
. For an example of preprocessing steps that were taken
for the MESA cohort, see MCFA/analysis_notebooks/preprocess_mesa.ipynb
This software is under active development. Please report any issues encountered using the issues tab. For the exact code used in the manuscript "Multiset correlation and factor analysis enables exploration of multi-omic data", Brown, Wang et al Cell Genomics forthcoming, see the tagged first release. An archive of that repository is also available on Zenodo - https://doi.org/10.5281/zenodo.7951370.