Possible bug: the random value generator used in svm_binary_svc_probability() function will not work well when training data size is large. #103
Comments
This is a known issue. See, for example, the following liblinear
FAQ:
…--------------------
Q: When using the default solver on large data, why is the number of
iterations on Windows much larger than on Linux?
In linear.cpp, for the implementation of coordinate descent
methods we use rand() to permute data instances. Unfortunately, on
MS Windows, rand() returns a value in [0, 32767]. This range is too
small to ensure the randomness of the data permutation, so the
convergence becomes slow. In contrast, on Linux rand() returns
a value in a much larger range, so this problem does not occur.
A quick solution is to replace
rand()
with
(rand()*32768+rand())
and rebuild the code.
-----------
At this moment we still use rand() to conform to C89,
but later we may use a different random number generator.
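The FAQ's quick fix is written for a platform where rand() returns 15 bits; where RAND_MAX is larger, rand()*32768 can overflow int. The following is only a sketch of the same idea with each call masked to 15 bits. The name wide_rand is hypothetical and is not part of liblinear or LIBSVM.

```cpp
#include <cstdlib>

// Hypothetical helper, not part of liblinear or LIBSVM.
// Combines two rand() calls into one value in [0, 2^30 - 1].
// Only the low 15 bits of each call are used, which is safe because
// the C standard guarantees RAND_MAX >= 32767.
inline unsigned long wide_rand()
{
    unsigned long hi = static_cast<unsigned long>(std::rand()) & 0x7FFFUL;
    unsigned long lo = static_cast<unsigned long>(std::rand()) & 0x7FFFUL;
    return (hi << 15) | lo;
}
```

With such a helper, the permutation index could be drawn as i + wide_rand() % (l - i), which covers data sets of up to about 2^30 instances.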
Guang Yang writes:
In the svm_binary_svc_probability() function, a random shuffle is
applied to the training data before it is used in the 5-fold
cross-validation process. The shuffle is implemented by the
following code:
for(i=0;i<prob->l;i++) perm[i]=i;
for(i=0;i<prob->l;i++)
{
    int j = i+rand()%(prob->l-i);
    swap(perm[i],perm[j]);
}
The C++ rand() function in this code returns a random number in
the range between 0 and RAND_MAX. On some platforms RAND_MAX is
only 32767 (it is on my PC: Windows, x64-based processor). So if
prob->l-i is larger than RAND_MAX, j can be at most i+RAND_MAX,
and the code above can only swap each index with one at most
RAND_MAX positions ahead of it. I noticed that the training data
passed to svm_binary_svc_probability() as svm_problem *prob has
already been sorted by label (+1, -1 for binary classification),
so the first part of prob->y[i] holds the +1 labels. If the number
of training samples labeled +1 is well above RAND_MAX, then in the
5-fold cross-validation the first "predicting data set" will
probably consist entirely of samples labeled +1. This produces
weird results when estimating probA and probB.
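To make the limit concrete, here is a small sketch that repeats the shuffle with a 15-bit generator standing in for rand() on a platform where RAND_MAX is 32767. The function name max_index_in_first_fold is hypothetical, and the sketch assumes the first fold is the first l/5 entries of perm and that l is at least 5.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Hypothetical demonstration, not LIBSVM code.
// Runs the shuffle from the issue with a 15-bit random source and returns
// the largest original index that ends up in the first fold.
// Assumes l >= 5 so that the first fold is not empty.
int max_index_in_first_fold(int l, unsigned seed)
{
    std::mt19937 gen(seed);
    std::vector<int> perm(l);
    std::iota(perm.begin(), perm.end(), 0);
    for (int i = 0; i < l; i++)
    {
        // Stand-in for rand() with RAND_MAX == 32767.
        int r = static_cast<int>(gen() & 0x7FFFu);
        int j = i + r % (l - i);
        std::swap(perm[i], perm[j]);
    }
    int fold_size = l / 5;
    return *std::max_element(perm.begin(), perm.begin() + fold_size);
}
```

Because position i can only receive an element from at most 32767 positions ahead, the returned value never exceeds l/5 - 1 + 32767. With l = 200000, for instance, the first fold can only draw from roughly the first 72767 samples, so it holds a single class whenever at least that many +1 samples are sorted to the front.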
So I suggest using the random number generator from William H.
Press et al., Numerical Recipes in C, which returns a random float
value between 0 and 1. Another question: why does
svm_binary_svc_probability() not use a stratified shuffle, as
svm_cross_validation() does?
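As one possible alternative to the generator suggested above, the shuffle could draw its indices from the C++11 <random> facilities, whose range does not depend on RAND_MAX. This is only a sketch: shuffled_indices is a hypothetical name, it is not the fix adopted in LIBSVM, and it does not meet the C89 constraint mentioned earlier in the thread.

```cpp
#include <random>
#include <utility>
#include <vector>

// Hypothetical sketch, not LIBSVM code.
// Fisher-Yates shuffle of the indices 0..l-1 in which j is drawn
// uniformly from [i, l-1], independent of RAND_MAX.
std::vector<int> shuffled_indices(int l, unsigned seed)
{
    std::vector<int> perm(l);
    for (int i = 0; i < l; i++) perm[i] = i;

    std::mt19937 gen(seed);
    for (int i = 0; i < l; i++)
    {
        std::uniform_int_distribution<int> pick(i, l - 1);
        std::swap(perm[i], perm[pick(gen)]);
    }
    return perm;
}
```

A stratified split, as asked about above, would go further and keep the class ratio in every fold; this sketch only removes the range limit on j.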
FYI if you're still interested :) I submitted a fix in PR #140