Possible bug: the random value generator used in svm_binary_svc_probability() function will not work well when training data size is large. #103

yangguangfd · 2017-09-04T05:13:59Z

In the svm_binary_svc_probability() function, the training data is randomly shuffled before it is used in the 5-fold cross-validation. The shuffle is implemented by the following code:

    for(i=0;i<prob->l;i++) perm[i]=i;
    for(i=0;i<prob->l;i++)
    {
        int j = i+rand()%(prob->l-i);
        swap(perm[i],perm[j]);
    }

The C++ rand() function in this code returns a random number between 0 and RAND_MAX. On Windows, RAND_MAX is 32767 (that is its value on my PC: Windows, x64-based processor). So if prob->l-i is larger than RAND_MAX, j can never exceed i+RAND_MAX, and each index can only be swapped with one at most RAND_MAX positions ahead of it. I noticed that the training data passed to svm_binary_svc_probability() in svm_problem *prob has already been sorted by label (+1, -1 for binary classification), so the first part of prob->y[i] holds the +1 labels. If the number of training instances with label +1 is above RAND_MAX, then in the 5-fold cross-validation the first "predicting data set" will probably consist entirely of instances with label +1. This produces weird results when estimating probA and probB.
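
The bound described above can be checked with a small sketch. It assumes a stand-in generator, small_rand(), that mimics a platform where rand() tops out at 32767; that helper and max_swap_distance() are hypothetical names for illustration and are not part of libsvm.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for a platform whose rand() tops out at 32767
// (the value reported above for Windows). Not part of libsvm.
const int kSmallRandMax = 32767;

int small_rand()
{
    return std::rand() & kSmallRandMax;   // keep only the low 15 bits
}

// Walks the same index selection as the shuffle above, but only records
// the largest distance j - i that is ever chosen as a swap partner.
int max_swap_distance(int l)
{
    assert(l > 0);
    int max_dist = 0;
    for (int i = 0; i < l; i++)
    {
        int j = i + small_rand() % (l - i);
        if (j - i > max_dist) max_dist = j - i;
    }
    return max_dist;
}
```

For l = 200000, max_swap_distance() can never return more than 32767, however the generator is seeded. Because position i is final once step i has run, no position can end up holding an element that started more than 32767 places after it.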

So I suggest using the random number generator from William H. Press et al., Numerical Recipes in C, which returns a random float between 0 and 1. Another question: why doesn't svm_binary_svc_probability() use a stratified shuffle, as svm_cross_validation() does?
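
To make the stratified idea concrete, here is a minimal sketch of one way to assign folds so that each fold receives a near-equal share of each label. It is not the code in svm_cross_validation(); the function name stratified_folds(), its parameters, and the use of std::mt19937 are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Illustrative only: returns, for each sample, the fold it belongs to.
// Samples are split by label (+1 vs. everything else), shuffled within
// each class, then dealt to the folds in round-robin order.
std::vector<int> stratified_folds(const std::vector<int>& y, int nr_fold, unsigned seed)
{
    assert(nr_fold > 0);
    std::vector<int> pos, neg;
    for (int i = 0; i < (int)y.size(); i++)
        (y[i] > 0 ? pos : neg).push_back(i);

    std::mt19937 gen(seed);
    std::shuffle(pos.begin(), pos.end(), gen);
    std::shuffle(neg.begin(), neg.end(), gen);

    std::vector<int> fold(y.size());
    int next = 0;   // keeps counting across classes so fold sizes stay balanced
    for (int idx : pos) fold[idx] = next++ % nr_fold;
    for (int idx : neg) fold[idx] = next++ % nr_fold;
    return fold;
}
```

With this assignment, a label-sorted input no longer matters: the per-class counts in any two folds differ by at most one, so no fold can consist of a single label unless the data does.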

cjlin1 · 2017-09-04T05:21:30Z

This is a known issue. See, for example, the following liblinear FAQ:

…

--------------------

Q: When using the default solver on large data, why is the number of iterations on Windows much larger than on Linux?

In linear.cpp, for the implementation of coordinate descent methods, we use rand() to permute data instances. Unfortunately, on MS Windows rand() returns a value in [0, 32767]. This is too small to ensure the randomness of the data permutation, so convergence becomes slow. In contrast, on Linux rand() returns a value in a much larger range, so this problem does not occur. A quick solution is to replace rand() with (rand()*32768+rand()) and rebuild the code.
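
The quick solution can be sketched as a small helper. The FAQ's expression assumes rand() returns at most 32767; where RAND_MAX is larger, rand()*32768 can overflow an int, so the variant below masks each call to 15 bits first. The names rand30() and pick_swap_index() are hypothetical and do not appear in liblinear or libsvm.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: builds a 30-bit value from two rand() calls,
// using only the low 15 bits of each so the result has the same width
// whether RAND_MAX is 32767 or larger.
int rand30()
{
    int hi = std::rand() & 0x7fff;
    int lo = std::rand() & 0x7fff;
    return (hi << 15) | lo;          // range [0, 2^30 - 1]
}

// Index in [i, l-1], as in the shuffle loop, but drawn from the 30-bit range.
int pick_swap_index(int i, int l)
{
    assert(0 <= i && i < l);
    return i + rand30() % (l - i);
}
```

This only widens the range to about 10^9; it does not improve the statistical quality of the underlying rand().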

-----------

At this moment we still use rand() to conform to C89, but later we may use a different random number generator.
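
As one illustration of what a different generator could look like, the sketch below runs the same shuffle with the C++11 <random> facilities. It is an assumption made for illustration, not the maintainers' plan or the content of any pull request, and it requires C++11, so it does not satisfy the C89 goal. The function name shuffled_indices() is hypothetical.

```cpp
#include <cassert>
#include <random>
#include <utility>
#include <vector>

// Illustrative sketch: returns a shuffled permutation of 0..l-1.
std::vector<int> shuffled_indices(int l, unsigned seed)
{
    assert(l >= 0);
    std::vector<int> perm(l);
    for (int i = 0; i < l; i++) perm[i] = i;

    std::mt19937 gen(seed);
    for (int i = 0; i < l; i++)
    {
        // Draw j uniformly from [i, l-1]; the range is not capped by RAND_MAX.
        std::uniform_int_distribution<int> dist(i, l - 1);
        std::swap(perm[i], perm[dist(gen)]);
    }
    return perm;
}
```

Using uniform_int_distribution also avoids the slight bias that the % operator introduces when the range does not evenly divide the generator's output range.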

smarie · 2019-03-25T16:33:17Z

FYI if you're still interested :) I submitted a fix in PR #140

This was referenced Mar 25, 2019

Liblinear convergence failure everywhere? scikit-learn/scikit-learn#11536

Open

[MRG] Libsvm and liblinear rand() fix for convergence on windows targets (and improvement on all targets) scikit-learn/scikit-learn#13511

Merged

smarie pushed a commit to smarie/libsvm that referenced this issue Mar 25, 2019

Fixes random number generator on windows. Fixes cjlin1#103

c221e96

smarie linked a pull request Mar 25, 2019 that will close this issue

Fixes random number generator on windows. Fixes #103 #140

Open
