ViCCo-Group/frrsa

🌟 About the project

frrsa is a Python package to conduct Feature-Reweighted Representational Similarity Analysis (FR-RSA). The classical approach of Representational Similarity Analysis (RSA) is to correlate two representational matrices, in which each cell gives a measure of how (dis-)similarly two conditions are represented by a given system (e.g., the human brain or a model such as a deep neural network (DNN)). However, this might underestimate the true correspondence between the systems' representational spaces, since it assumes that all features (e.g., fMRI voxels or DNN units) contribute equally to the (dis-)similarity of each condition pair and, in turn, to the correspondence between representational matrices. FR-RSA deploys regularized regression techniques (currently: L2-regularization) to maximize the fit between two representational matrices. The core idea behind FR-RSA is to recover a subspace of the predicting matrix that best fits the target matrix. To do so, the cells of the target system's matrix are explained by a linearly reweighted combination of the feature-specific (dis-)similarities of the respective conditions in the predicting system. Importantly, the representational matrix of each feature of the predicting system receives its own weight. All of this is implemented in a nested cross-validation, which avoids overfitting on the level of (hyper-)parameters.
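
The core idea can be illustrated with a minimal NumPy sketch. It is not the package's implementation: it uses toy data, a fixed regularization strength (`alpha`, a hypothetical value), squared differences as the feature-specific dissimilarity, and an in-sample fit without the nested cross-validation that frrsa performs. All variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_conditions = 4, 12  # p and k (toy sizes)

# Predicting system: p x k array (features x conditions).
predictor = rng.normal(size=(n_features, n_conditions))

# Indices of all unique condition pairs (upper triangle without the diagonal).
rows, cols = np.triu_indices(n_conditions, k=1)

# Feature-specific squared differences: one row per pair, one column per feature.
X = (predictor[:, rows] - predictor[:, cols]).T ** 2  # shape (n_pairs, p)

# Toy target dissimilarities, driven mostly by the first feature.
true_weights = np.array([3.0, 0.5, 0.0, 0.0])
y = X @ true_weights + rng.normal(scale=0.01, size=X.shape[0])

# Classical RSA: every feature contributes equally (unweighted sum), then correlate.
classical_score = np.corrcoef(X.sum(axis=1), y)[0, 1]

# Feature reweighting: closed-form ridge (L2) regression on centred data,
# which is equivalent to fitting an unpenalized offset.
alpha = 1.0  # hypothetical regularization strength
Xc = X - X.mean(axis=0)
yc = y - y.mean()
weights = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
reweighted_score = np.corrcoef(X @ weights, y)[0, 1]

print(f"classical: {classical_score:.3f}, reweighted: {reweighted_score:.3f}")
```

Because the toy target depends unequally on the features, the reweighted prediction correlates more strongly with it than the unweighted sum does. In a real analysis this comparison is only meaningful on held-out condition pairs.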

🚨 Please also see the published article accompanying this repository. To use this package successfully, follow this README. 🚨

Getting started

Installing

The package is written in Python 3.8. Installation expects a working conda installation on your system (e.g. via miniconda). If you already have pip available, you can skip the conda env create part.

Execute the following lines from a terminal to clone this repository and install it as a local package using pip.

cd [directory on your system where you want to download this repo to]
git clone https://github.com/ViCCo-Group/frrsa
conda env create --file=./frrsa/environment.yml
conda activate frrsa
cd frrsa
pip install -e .

How to use

There is only one user-facing function in frrsa. To use it, activate the conda environment, import and then call frrsa with your data:

from frrsa import frrsa

# load your "target" RDM or RSM.
# load your "predictor" data.
# set the necessary flags ("preprocess", "nonnegative", "measures", ...)

scores, predicted_matrix, betas, predictions = frrsa(target,
                                                     predictor,
                                                     preprocess,
                                                     nonnegative,
                                                     measures,
                                                     cv=[5, 10],
                                                     hyperparams=None,
                                                     score_type='pearson',
                                                     wanted=['predicted_matrix', 'betas', 'predictions'],
                                                     parallel='1',
                                                     random_state=None)

See frrsa/test.py for another simple demonstration.

Parameters and returned objects

Parameters.

There are default values for all parameters, which we partly assessed (see our paper). However, you can input custom parameters as you wish. For an explanation of all parameters please see the docstring.

Returned objects.

  1. scores: Holds the representational correspondence scores between each target and the predictor. These scores can sensibly be used in downstream analyses.
  2. predicted_matrix: The reweighted predicted representational matrix averaged across outer folds, with shape (n_conditions, n_conditions, n_targets). The value 9999 denotes condition pairs for which no (dis-)similarity was predicted (the FAQ explains why this happens). This matrix should only be used for visualization purposes.
  3. betas: Holds the weights for each target's measurement channel with the shape (n_conditions, n_targets). Note that the first weight for each target is not a channel-weight but an offset. These betas are currently computed suboptimally and should only be used for informational purposes. Do not use them to recreate the reweighted_matrix or to reweight something else (see #43).
  4. predictions: Holds (dis-)similarities for the target and for the predictor, and the condition pairs to which they belong, for all cross-validations and targets separately. This is a potentially very large object. Only request it if you really need it. For an explanation of the columns see the docstring.
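
Since 9999 is a sentinel rather than a real (dis-)similarity, it is usually worth masking it before plotting predicted_matrix. The following is a minimal sketch on a hand-made toy array standing in for one target's k x k slice. The array values and variable names are assumptions for illustration, not output of frrsa.

```python
import numpy as np

# Toy stand-in for one target's k x k slice of predicted_matrix.
predicted = np.array([
    [0.0, 0.8, 9999.0],
    [0.8, 0.0, 0.3],
    [9999.0, 0.3, 0.0],
])

# Replace the 9999 sentinel with NaN so it does not distort colour scales
# or summary statistics. np.where works elementwise, so the same line also
# applies to the full (n_conditions, n_conditions, n_targets) array.
masked = np.where(predicted == 9999, np.nan, predicted)

# Count condition pairs without a prediction (upper triangle only).
n_missing_pairs = int(np.isnan(np.triu(masked, k=1)).sum())

# NaN-aware statistics ignore the masked cells.
largest_predicted = np.nanmax(masked)
```

Most plotting libraries leave NaN cells blank, which makes the pairs without a prediction visible instead of dominating the colour scale.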

❓ FAQ

What does my data have to look like to use the FR-RSA package?

At present, the package expects data from two systems (e.g., a specific DNN layer and a brain region measured with fMRI) whose representational spaces are to be compared. The predicting system, that is, the one whose feature-specific (dis-)similarities will be reweighted, is expected to be a p x k numpy array. The target system contributes its full representational matrix in the form of a k x k numpy array (where p := number of measurement channels, aka features, and k := number of conditions; see Diedrichsen & Kriegeskorte, 2017). There are no hard-coded upper limits on the size of each dimension; however, the bigger k and p become, the larger the computational problem to solve becomes. See Known issues for a lower limit of k.
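
As a rough illustration of these shapes, the sketch below builds a synthetic predictor and a synthetic target for a single target system. The random data, the sizes, and the choice of squared Euclidean distances for the target are assumptions made for the example; real data would be loaded from your own files.

```python
import numpy as np

rng = np.random.default_rng(42)
p, k = 50, 20  # p measurement channels (features), k conditions

# Predicting system: p x k array, e.g. unit activations of one DNN layer.
predictor = rng.normal(size=(p, k))

# Target system: simulated here as q x k responses (e.g. voxels x conditions)
# and turned into a k x k matrix of squared Euclidean distances.
q = 30
target_responses = rng.normal(size=(q, k))
diffs = target_responses[:, :, None] - target_responses[:, None, :]  # (q, k, k)
target = (diffs ** 2).sum(axis=0)  # (k, k)

# Both arrays must agree on the number of conditions k.
print(predictor.shape, target.shape)
```

Note that the predictor is passed as raw features by conditions, whereas the target is passed as an already computed representational matrix.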

You say that every feature gets its own weight - can those weights take on any value or are they restricted to be non-negative?

The function's parameter nonnegative can be set to either True or False and forces weights to be nonnegative (or not), accordingly.

What about the covariances / interactive effects between predicting features?

One may argue that the interaction of (dis-)similarities in two or more features in one system could help in the prediction of overall (dis-)similarity in another system. Currently, though, feature reweighting does not take these interaction terms into account (nor does classical RSA); doing so would probably also be computationally too expensive for predicting systems with many features (e.g. early DNN layers).

FR-RSA uses regularization. Which kinds of regularization regimes are implemented?

As of now, only L2-regularization aka Ridge Regression.

You say ridge regression; which hyperparameter space should I check?

If you set the parameter nonnegative to False, L2-regularization is implemented using Fractional Ridge Regression (FRR; Rokem & Kay, 2020). One advantage of FRR is that the hyperparameter to be optimized is the fraction between ordinary least squares and L2-regularized regression coefficients, which ranges between 0 and 1. Hence, FRR allows assessing the full range of possible regularization parameters. In the context of FR-RSA, twenty evenly spaced values between 0.1 and 1 are pre-set. If you want to assess custom regularization values, you can provide a list of candidate values as the hyperparams argument of the frrsa function.
If you set the parameter nonnegative to True, L2-regularization is currently implemented using Scikit-Learn functions. They have the disadvantage that one has to define the hyperparameter space oneself, which can be tricky. If you do not provide hyperparameter candidates yourself, 14 pre-set values will be used, which might be sufficient (see our paper).
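
For the nonnegative=False case, a grid matching the description above can be sketched as follows. This assumes both endpoints are included, which the text does not state explicitly, and the coarser custom grid is purely illustrative.

```python
import numpy as np

# Twenty evenly spaced fractions from 0.1 to 1 (both endpoints assumed included).
default_like_fractions = np.linspace(0.1, 1.0, 20)

# A coarser, hypothetical custom grid as a plain list, the form in which
# candidates would be handed over via the `hyperparams` argument.
custom_fractions = np.linspace(0.2, 1.0, 5).round(2).tolist()

print(default_like_fractions[:3], custom_fractions)
```

Fractions only make sense for FRR. With nonnegative=True the candidates are regularization strengths for the Scikit-Learn routines and live on a different scale.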

Which (dis-)similarity measures can/should be used?

Use the parameter measures to indicate which (dis-)similarity measures to use. See the docstring for possible arguments. Which measure you should choose for the predictor depends on your data. Additionally, if your target holds similarities, it is likely more intuitive to select 'dot' (to have similarities on both sides of the equation). If, though, your target holds dissimilarities, it might conversely be more intuitive to select 'sqeuclidean'.

In the returned predicted_matrix, why are there some condition pairs for which there are no predicted (dis-)similarities?

To conduct a proper cross-validation that does not lead to leakage, one needs to split the data based on conditions, not pairs. However, if conditions A, B, C, D are in the training set and E, F, G are in the test set, then, e.g., the pair (A, E) is used neither for fitting nor for testing the statistical model in that split. Therefore, even if one repeats such a k-fold cross-validation a few times, a few pairs might never receive a predicted (dis-)similarity.
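
The example with conditions A to G can be spelled out in a few lines of plain Python. This is only a counting sketch of a single split, not the package's splitting code.

```python
from itertools import combinations

conditions = ["A", "B", "C", "D", "E", "F", "G"]
train, test = {"A", "B", "C", "D"}, {"E", "F", "G"}

# All unique condition pairs.
all_pairs = list(combinations(conditions, 2))

# Pairs whose two conditions both lie in the training set (used for fitting).
train_pairs = [pair for pair in all_pairs if set(pair) <= train]

# Pairs whose two conditions both lie in the test set (receive a prediction).
test_pairs = [pair for pair in all_pairs if set(pair) <= test]

# Pairs that straddle the split are used for neither fitting nor testing.
unused_pairs = [
    pair for pair in all_pairs
    if pair not in train_pairs and pair not in test_pairs
]

print(len(all_pairs), len(train_pairs), len(test_pairs), len(unused_pairs))
```

In this single split, 12 of the 21 pairs straddle the train/test boundary, which is why repeated splits are needed and why some pairs may still end up without a prediction.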

Known issues

  1. If your data has fewer than 9 conditions, frrsa cannot be executed successfully. This won't be fixed (see #28).

    The data (i.e. target and predictor) are split along the condition dimension to conduct the nested cross-validation. Therefore, there exists an absolute lower bound for the number of conditions below which inner test folds will occur that contain data from just two conditions, which would lead to just one predicted (dis-)similarity (for one condition pair): this absolute lower bound is 9. However, to determine the goodness-of-fit, currently the predicted (dis-)similarities of each cross-validation are correlated with the respective target (dis-)similarities. This does not work with vectors that have a length < 2.

  2. The default number of folds, outer_k, for the outer cross-validation is 5 (denoted by the first element of cv). In that case, the minimum number of conditions needed is 14.

    This is due to the inner workings of data_splitter. It uses sklearn.model_selection.ShuffleSplit, which allows one to explicitly set the proportion of the current dataset to be included in the test fold. For historical reasons, this proportion is set to 1/outer_k so that it is comparable to splitter='kfold' (see #26). Therefore, when there are 14 conditions, this leads to an outer test fold size of 2.8 ≈ 3 and to an outer training fold size of (14 - 3) = 11. This in turn guarantees an inner test fold size of 2.2 ≈ 3 (note that sklearn's ShuffleSplit rounds up).

    However, if there are only 13 or fewer conditions and outer_k is set to 5, 2.6 ≈ 3 conditions are allocated to an outer test fold, which leads to an outer training fold size of (13 - 3) = 10 and thus to inner test fold sizes of only 2, which wouldn't work (as explained in 1.). Therefore:

  3. If your data has between 9 and 13 conditions, frrsa will run. However, the default outer_k and the hard-coded inner_k will be adapted automatically (see #22).

  4. There are other combinations of outer_k and the number of conditions (also when the number of conditions is bigger than 14) that would yield too few (inner or outer) test conditions if unchanged, but could be executed successfully otherwise. Therefore, in these cases, outer_k and inner_k will be adapted automatically (see #17).

  5. The optionally returned betas are currently computed suboptimally and should only be used for informational purposes. Do not use them to recreate the reweighted_matrix or to reweight something else (see #43).
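
The fold-size arithmetic behind issues 1 to 3 can be checked with a small sketch. It assumes that the inner number of folds is 5 (inferred from the "11 conditions give 2.2" step above, not confirmed elsewhere in this README), that test fold sizes are rounded up as described, and that an inner test fold needs at least 3 conditions to yield more than one condition pair. The function name fold_sizes is hypothetical and not part of frrsa.

```python
import math


def fold_sizes(n_conditions, outer_k=5, inner_k=5):
    """Return (outer test size, outer training size, inner test size).

    Mirrors the README's arithmetic: each test fold holds the proportion
    1/k of the current data, rounded up. inner_k=5 is an assumption.
    """
    outer_test = math.ceil(n_conditions / outer_k)
    outer_train = n_conditions - outer_test
    inner_test = math.ceil(outer_train / inner_k)
    return outer_test, outer_train, inner_test


for n in (13, 14):
    outer_test, outer_train, inner_test = fold_sizes(n)
    # Two conditions give a single pair, which cannot be correlated.
    workable = inner_test >= 3
    print(f"n={n}: outer test={outer_test}, outer train={outer_train}, "
          f"inner test={inner_test}, workable={workable}")
```

With 14 conditions every inner test fold holds 3 conditions, whereas with 13 it holds only 2, which is the case that triggers the automatic adaptation of outer_k and inner_k.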

How to contribute

If you come across problems or have suggestions please submit an issue!

⚠️ License

This GitHub repository is licensed under the GNU AFFERO GENERAL PUBLIC LICENSE Version 3 - see the LICENSE.md file for details.

Citation

If you use frrsa (or any of its modules), please cite our associated paper as follows:

@article{KANIUTH2022119294,
         author = {Philipp Kaniuth and Martin N. Hebart},
         title = {Feature-reweighted representational similarity analysis: A method for improving the fit between computational models, brains, and behavior},
         journal = {NeuroImage},
         pages = {119294},
         year = {2022},
         issn = {1053-8119},
         doi = {https://doi.org/10.1016/j.neuroimage.2022.119294},
         url = {https://www.sciencedirect.com/science/article/pii/S105381192200413X}
}

Contributions

The Python package itself was mostly written by Philipp Kaniuth (GitHub, MPI CBS), with key contributions and amazing help as well as guidance provided by Martin Hebart (Personal Homepage, MPI CBS) and Hannes Hansen (GitHub, Personal Homepage).

Further thanks go to Katja Seliger (GitHub, Personal Homepage), Lukas Muttenthaler (GitHub, Personal Homepage), and Oliver Contier (GitHub, Personal Homepage) for valuable discussions and hints.

Check our lab home page for more information on the cool work we do!