
Understanding permutations


Permutations are an effective way of thresholding statistical results. However, the word "permutation" is used in different scenarios and can sometimes be confusing. Here is how permutations are used in LESYMAP in two different scenarios.


Voxel-based permutations

In this case, permutations are used to obtain a p-value for each voxel separately. The rationale behind this approach is that, if a voxel has a true relationship with behavior, permutations will rarely produce statistical scores that reach or exceed the score obtained with the original data. To compute the p-value of a voxel, the order of the behavioral values is permuted many times. The permuted p-value is derived from the proportion of permutations in which the statistical value reaches or exceeds the original statistical value. With 100 permutations, the smallest p-value that can be obtained is 1/101, or about 0.0099. Smaller p-values can be obtained with a larger number of permutations.
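As an illustration, here is a minimal sketch of a single-voxel permutation p-value in R, using simulated data. The variable names (`lesion`, `behavior`, `nperm`) are illustrative and do not reflect the LESYMAP internals.

```r
# Minimal sketch of a single-voxel permutation p-value (simulated data).
set.seed(1)
n <- 40
lesion   <- rbinom(n, 1, 0.5)           # 1 = voxel lesioned in this subject
behavior <- rnorm(n) - 0.8 * lesion     # lesion lowers the behavioral score

orig.stat <- abs(t.test(behavior ~ factor(lesion))$statistic)

nperm <- 1000
perm.stats <- replicate(nperm, {
  perm.behavior <- sample(behavior)     # permute behavioral scores only
  abs(t.test(perm.behavior ~ factor(lesion))$statistic)
})

# Proportion of permutations reaching/exceeding the original statistic.
# The (+1)/(+1) convention gives a minimum p of 1/(nperm+1), matching
# the 1/101 minimum described above for 100 permutations.
pval <- (sum(perm.stats >= orig.stat) + 1) / (nperm + 1)
pval
```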

Voxel-based permutations are used when data violate the assumptions of standard tests (e.g., homogeneity of variance, normality of the distribution). They are also used as the default approach to find significant weights in Support Vector Regression approaches (see Zhang 2015 and DeMarco 2018). It is important to understand that voxel-based permutations are not corrected for multiple comparisons. There is a good chance that, among the thousands of voxels in the brain, some might form random relationships with behavior that lead to low p-values. For this reason, multiple comparison correction is still required after voxel-based permutations. LESYMAP offers voxel-based permutations with the "regresPerm" and "chisqPerm" methods. Due to limitations of the Brunner-Munzel method, voxel-based permutations are also required with that method for voxels lesioned in fewer than 9 patients (see Medina 2010); LESYMAP performs this automatically.
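For reference, a call selecting one of these methods might look like the following, using the example dataset shipped with the package; check `?lesymap` for the exact interface.

```r
library(LESYMAP)

# Example data shipped with the package
lesydata  <- file.path(find.package('LESYMAP'), 'extdata')
filenames <- Sys.glob(file.path(lesydata, 'lesions', 'Subject*.nii.gz'))
behavior  <- Sys.glob(file.path(lesydata, 'behavior', 'behavior.txt'))

# Voxel-based permutation test at each voxel; remember that this is
# not corrected for multiple comparisons by itself.
lsm <- lesymap(filenames, behavior, method = 'regresPerm')
```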


Whole brain statistical permutations

In this case, permutations are used to achieve multiple comparison correction. They do not involve a single voxel but the entire ensemble of voxels. The idea is that a random permutation of the brain-behavior relationship should produce statistical peaks or cluster sizes that are typically smaller than those obtained from the true, non-scrambled analysis. To assess this, the behavioral data are scrambled and a full statistical map is obtained each time. Note that the voxel locations are not permuted per se; only the behavioral scores are permuted. The statistical maps obtained from permuted behavioral scores can be used in different ways. For example, FWERperm correction relies on the distribution of peak voxel values to establish the whole-brain voxel threshold at the 95th percentile of this distribution. clusterPerm correction relies on the distribution of maximal cluster sizes to establish the cluster threshold at the 95th percentile of this distribution.
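A toy sketch of the FWERperm logic on simulated lesion data is shown below; this illustrates the max-statistic idea, not the LESYMAP implementation.

```r
# FWER-style thresholding via the permutation max-statistic distribution.
set.seed(2)
n <- 40; nvox <- 500; nperm <- 1000
lesions  <- matrix(rbinom(n * nvox, 1, 0.4), n, nvox)  # subjects x voxels
behavior <- rnorm(n)

# Statistic map: |correlation| of behavior with lesion status at each voxel
stat.map <- function(y) abs(cor(y, lesions))
orig.map <- stat.map(behavior)

# One full statistical map per permutation; keep only its peak value
max.dist <- replicate(nperm, max(stat.map(sample(behavior))))

# Whole-brain threshold: 95th percentile of the peak distribution
thresh <- quantile(max.dist, 0.95)
which(orig.map >= thresh)   # voxels surviving FWER correction
```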

I realize that the link between permutations and multiple comparison correction may not be obvious at first, but this solution has received careful consideration in the literature. Detailed explanations can be found in Nichols 2002, Winkler 2014, Winkler 2016, and Kimberg 2007.


Assumptions of permutation testing

Permutation testing is based on a single important assumption: the items being permuted must be exchangeable. A thorough explanation of this principle is beyond the scope of this documentation, and I invite you to read the nice descriptions in Nichols 2002 and Winkler 2014. A naive example would be that you can exchange apples with apples, but not apples with bananas. When permuting values at a single voxel, we assume that patients can be exchanged with each other without changing the distribution of the error. This is what allows us to obtain a distribution of statistical values. However, when we use permutations to correct for multiple comparisons, we assume that voxels can be exchanged. In my opinion, this is a more delicate situation, which I consider further below.

In lesion studies, voxels are not all the same. Some have high statistical power (i.e., half of the subjects are lesioned there), others have low statistical power (only a few subjects are lesioned there). A voxel sitting in an area of high statistical power is more likely to produce high statistical scores when one of the permutations randomly hits it with a good correlation with behavior. Conversely, random correlations in an area with low statistical power will generally produce lower statistical values. As a result, true results located in an area with low statistical power will be overshadowed more easily by random results derived from areas with high statistical power.

The problem extends to cluster thresholding methods. Some voxels sit in an area with high spatial correlation, where a random correlation with behavior at one voxel will likely be accompanied by correlations at all surrounding voxels, producing a large cluster. If the true functional area is small, it might well be overshadowed by the large clusters found randomly during permutations. This means that clusterPerm thresholding may favor findings in areas with high spatial correlation and overlook small functional areas.

The degree to which these biases affect real data is currently unknown; empirical research is required. In my experience, permutation-based thresholding is a robust method that produces among the most accurate results on simulated data. However, it is worth knowing the above-mentioned limitations. The variations in power and spatial correlation present in lesion data create a unique problem that deserves dedicated research. I have expressed some of my concerns to Dr. Tom Nichols, and he agreed that these are legitimate concerns.

If you have any thoughts on the matter, please open a GitHub issue or contact me privately.

UPDATE: One of the reviewers of our submitted paper noted that the variation of statistical power is not an issue in itself; once the sample size reaches a certain number, the power reaches 1 and does not change anymore. We think this comment is correct and requires clarifying our misstatement above. The problem is not the variation of power per se but the variation of statistical T-scores due to the variation of sample size at each voxel. A simple simulation shows that an identical strength of brain-behavior relationship leads to different T-scores depending on whether the voxel splits the patients 50%:50% (higher T-scores) or 20%:80% (lower T-scores). So, technically speaking, the problem is not the variation of power but the variation of T-scores. Note also that the simulation shows the problem becomes more pronounced with larger sample sizes: the gap between T-scores widens as the sample grows.
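A toy version of such a simulation, assuming a simple two-group t-test at a single voxel (illustrative only; the original simulation may have differed in its details):

```r
# Identical lesion-behavior effect, different lesion splits at the voxel.
set.seed(3)
tscore <- function(n, p.lesioned, effect = 1, nsim = 2000) {
  n1 <- round(p.lesioned * n)
  g  <- rep(c(1, 0), c(n1, n - n1))     # lesion status at the voxel
  mean(replicate(nsim, {
    y <- rnorm(n) - effect * g          # same effect size in all scenarios
    abs(t.test(y ~ factor(g))$statistic)
  }))
}
tscore(50, 0.5)    # 50%:50% split -> higher average T-score
tscore(50, 0.2)    # 20%:80% split -> lower average T-score
tscore(200, 0.5)   # the gap between the two splits widens
tscore(200, 0.2)   #   as the sample size grows
```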