
Ridge regression manual


Ridge regression applies L2 regularization to least-squares regression, which leads to better generalization (Cui and Gong, 2018, NeuroImage). Built on scikit-learn, ridge regression is also faster than the elastic net or LASSO.
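For orientation, here is a minimal sketch of ridge regression in scikit-learn (synthetic data, not the toolbox's own code):

```python
# Minimal sketch: ridge regression with scikit-learn on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(100, 50)                      # 100 subjects x 50 features
y = X[:, 0] * 2.0 + rng.randn(100) * 0.5    # synthetic scores

model = Ridge(alpha=1.0)                    # alpha = L2 regularization strength
model.fit(X, y)
print(model.coef_[:5])                      # first few feature weights
```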

Here, we provide three choices: Ridge_CZ_Sort.py, Ridge_CZ_RandomCV.py, and Ridge_CZ_LOOCV.py. With enough samples, I generally use Ridge_CZ_Sort.py. It performs K-fold cross-validation; 5- or 10-fold cross-validation is recommended, and a smaller K is generally better accepted by the community. Because splitting the samples into K subsets is normally random, we instead first sort all subjects according to their scores and then split them according to rank (Cui and Gong, 2018, NeuroImage; Cui et al., 2018, Cerebral Cortex), so that the score distribution is the same across subsets. However, some may argue that we cannot guarantee an unseen individual comes from the same distribution as the training samples, and that repeated random K-fold cross-validation is therefore better. For this reason, we also provide Ridge_CZ_RandomCV.py. I generally use Ridge_CZ_Sort.py for the main results and Ridge_CZ_RandomCV.py in the supplementary material, because using repeated random split cross-validation as the main analysis produces many results, which are hard to present in figures.

We also implemented leave-one-out cross-validation (LOOCV) in Ridge_CZ_LOOCV.py. However, LOOCV is not recommended, as it often leads to biased and unstable results (Varoquaux et al., 2017, NeuroImage). If you have a very small sample size, you may have to use LOOCV, but to make the results more robust, it is better to also validate with K-fold cross-validation (e.g., K = 10).
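A sketch of the rank-based split idea described above (illustrative names; the toolbox's own implementation may differ in detail):

```python
import numpy as np

def sorted_kfold_indices(scores, k=10):
    """Assign subjects to folds by the rank of their scores, so every
    fold covers the full score range (a sketch of the sorting idea)."""
    order = np.argsort(scores)              # subject indices sorted by score
    folds = [[] for _ in range(k)]
    for rank, subject in enumerate(order):
        folds[rank % k].append(subject)     # deal subjects round-robin by rank
    return [np.array(f) for f in folds]

scores = np.random.randn(95)
folds = sorted_kfold_indices(scores, k=5)
print([len(f) for f in folds])              # roughly equal fold sizes
```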

For usage, see the help text inside each function for a description of every variable.

Ridge_CZ_Sort.py

Performs K-fold cross-validation (samples are split into K subsets by the rank of subjects' scores).

First, run the function Ridge_KFold_Sort with Permutation_Flag set to 0. Prediction accuracy is reported as the Pearson correlation and the mean absolute error between the actual and predicted scores.
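A sketch of these two accuracy measures, using standard scipy/scikit-learn calls rather than the toolbox's code:

```python
# Sketch: the two accuracy measures, given actual and predicted scores.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.2, 1.9, 3.3, 3.8, 5.1])

r, _ = pearsonr(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"Pearson r = {r:.3f}, MAE = {mae:.3f}")
```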

Second, run the function Ridge_KFold_Sort_Permutation to generate a null distribution of prediction accuracy. Based on these permutation results, we can determine whether the prediction accuracy obtained above is significant.
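A sketch of how a p-value is derived from such a permutation distribution (observed_r and permuted_rs are hypothetical names standing in for the real accuracy and the accuracies obtained with shuffled scores):

```python
# Sketch: p-value from a permutation null distribution.
import numpy as np

observed_r = 0.35
permuted_rs = np.random.randn(1000) * 0.1   # stand-in for real permutation results

# Proportion of permuted accuracies at least as large as the observed one,
# with the conservative +1 correction.
p_value = (np.sum(permuted_rs >= observed_r) + 1) / (len(permuted_rs) + 1)
print(f"permutation p = {p_value:.4f}")
```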

Third, run the function Ridge_Weight to obtain the contribution of each feature. Here, all subjects are used to calculate the contribution weights (Cui et al., 2018, Cerebral Cortex; Cui and Gong, 2018, NeuroImage).
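A sketch of the weight idea, assuming the contributions correspond to the ridge coefficients of a model fitted on all subjects (illustrative data and names):

```python
# Sketch: fit ridge on all subjects and read off per-feature weights.
import numpy as np
from sklearn.linear_model import Ridge

X = np.random.randn(120, 30)    # all subjects x features
y = np.random.randn(120)        # all subjects' scores

model = Ridge(alpha=1.0).fit(X, y)
weights = model.coef_           # one contribution weight per feature
print(weights.shape)
```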

Using one group to predict another group

First, run the function Ridge_APredictB with Permutation_Flag set to 0. Prediction accuracy will be reported.
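A sketch of the train-on-A, test-on-B idea (synthetic data, illustrative names, not the toolbox's code):

```python
# Sketch: train ridge on group A, evaluate on group B.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

X_A, y_A = np.random.randn(80, 30), np.random.randn(80)   # group A (training)
X_B, y_B = np.random.randn(60, 30), np.random.randn(60)   # group B (testing)

model = Ridge(alpha=1.0).fit(X_A, y_A)
y_pred = model.predict(X_B)
r, _ = pearsonr(y_B, y_pred)
print(f"cross-group Pearson r = {r:.3f}")
```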

Second, run the function Ridge_APredictB_Permutation to generate the null distribution of prediction accuracy for permutation testing.

Third, run the function Ridge_Weight to obtain the contribution of each feature. Here, all subjects are used to calculate the contribution weights (Cui et al., 2018, Cerebral Cortex; Cui and Gong, 2018, NeuroImage).

Ridge_CZ_RandomCV.py

Run the function Ridge_KFold_Random_MultiTimes to obtain the prediction accuracy of repeated random K-fold cross-validation. This function calls the function Ridge_KFold_RandomCV several times.
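A sketch of repeated random K-fold cross-validation using scikit-learn's RepeatedKFold, a standard equivalent of this repeat-the-split loop (synthetic data, illustrative names):

```python
# Sketch: repeated random K-fold CV, averaging accuracy over all splits.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold

X, y = np.random.randn(100, 30), np.random.randn(100)
rkf = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)

rs = []
for train_idx, test_idx in rkf.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    r, _ = pearsonr(y[test_idx], model.predict(X[test_idx]))
    rs.append(r)
print(f"mean r over {len(rs)} splits = {np.mean(rs):.3f}")
```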

We generally use repeated random K-fold cross-validation only as a validation, to show that its prediction accuracy is similar to that of Ridge_CZ_Sort.py. Therefore, we did not include a permutation test for repeated random K-fold cross-validation.

Ridge_CZ_LOOCV.py

LOOCV is not recommended, but if you have a very small sample size, you may have to use it.

First, run the function Ridge_LOOCV with Permutation_Flag set to 0. Prediction accuracy will be reported.
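A sketch of LOOCV with scikit-learn's LeaveOneOut (illustrative, not the toolbox's code). Since each split yields one held-out prediction, accuracy is computed across all held-out predictions at the end:

```python
# Sketch: leave-one-out CV, one held-out prediction per subject.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

X, y = np.random.randn(40, 20), np.random.randn(40)
y_pred = np.empty_like(y)

for train_idx, test_idx in LeaveOneOut().split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = model.predict(X[test_idx])

r, _ = pearsonr(y, y_pred)
print(f"LOOCV Pearson r = {r:.3f}")
```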

Second, run the function Ridge_LOOCV_Permutation to generate the null distribution of prediction accuracy for permutation testing.

Third, run the function Ridge_Weight to obtain the contribution of each feature.