[jvm-packages] XGBoost4j prediction performance differs between Windows and Linux? #4562
I see your concern. I am a bit surprised that your benchmark performs so poorly on Linux (I expected it the other way around). Would you be able to publish your full benchmark somewhere so we could try to reproduce it? Another question: you used xgboost4j-0.82-criteo-20190412_2.11-win64.jar for your Windows inference, but did you use the corresponding Linux jar (xgboost4j-0.82-criteo-20190412_2.11-linux.jar) for your Linux run? I hope we will be able to understand what is going on.
Hi @trams, that also surprised me, and I have been trying to work this out for more than a week. I have uploaded a pre-trained model and the Java file I used to make the predictions. The outputs for Windows and Linux are in duration_windows.txt and duration_linux.txt. Could you try to reproduce those on your end? I tried using different versions of the jar on Linux, including xgboost4j-0.82-criteo-20190412_2.11-linux.jar and the official ones, but that did not change or improve the performance on Linux. So I think it is probably not a problem with the jar version.
The Linux CPU is a Broadwell architecture chip with a max turbo speed of 3.4GHz, and the Windows CPU is a Kaby Lake chip with a max clock speed of 4.2GHz. The Kaby Lake architecture is faster, and I'm unclear how many threads your demo is using (it's a parameter of the model). If it's single threaded then I'm not really surprised it's slower on Linux (though I am surprised it's that much slower). Your demo script might also be confusing things, as it keeps opening and closing a file inside the loop, which isn't good practice (and might induce different behaviours depending on the OS).
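Since the thread count comes from the model's nthread parameter, one way to rule that variable out is to pin it explicitly after loading the booster. A minimal sketch, assuming the xgboost4j 0.82 Java API (XGBoost.loadModel and Booster.setParam) and a hypothetical model path; this is not code from the thread:

import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.XGBoost;
import ml.dmlc.xgboost4j.java.XGBoostError;

public class PinThreads {
    public static void main(String[] args) throws XGBoostError {
        // Load the trained model (path is hypothetical).
        Booster booster = XGBoost.loadModel("./model/demo.model");
        // Pin the prediction thread count so both OSes use the same value.
        booster.setParam("nthread", "4");
    }
}

Pinning the value on both machines removes one source of cross-OS variation before comparing timings.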
Hi @Craigacp, thanks for your reply. The file opening and closing is just there to record the time each prediction takes in the loop, and that time isn't included in the measurement, so I think the output should still reflect the time difference between Windows and CentOS. CPU speed might be one reason, but it's much slower than I would expect even accounting for the CPU difference. I use four threads on both Windows and CentOS, but I will double check to make sure I did not miss anything. Could you run the demo on your end and see whether you can reproduce a similar result?
The time isn't counted, but file opening and closing will perturb the kernel state and other things you probably don't want inside a benchmarking loop, polluting the caches and branch predictors. I doubt it's a major effect, but it's not good benchmarking practice. I'll try to replicate it.
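For a more rigorous measurement, a harness such as JMH handles JIT warm-up, forking, and statistics automatically. A minimal sketch, assuming JMH is on the classpath and reusing the thread's hypothetical model path; this is an editorial example, not code from the thread:

import java.util.Random;
import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;
import ml.dmlc.xgboost4j.java.XGBoostError;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class PredictBenchmark {
    private Booster booster;
    private DMatrix dmat;

    @Setup
    public void setup() throws XGBoostError {
        booster = XGBoost.loadModel("./model/demo.model");
        // Build one 1x30 row of random features, mirroring the demo code.
        Random rand = new Random(100);
        float[] row = new float[30];
        for (int i = 0; i < 30; i++) {
            row[i] = (float) ((rand.nextFloat() - 0.5) / 5);
        }
        dmat = new DMatrix(row, 1, 30, Float.NaN);
    }

    @Benchmark
    public float predict() throws XGBoostError {
        // JMH times only this call, after warm-up iterations.
        return booster.predict(dmat)[0][0];
    }
}

Running this on both machines would separate steady-state prediction latency from OS-dependent startup and I/O effects.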
Running on Windows 10 x64 1809 (on a 2012 rMBP) I get an average of around 400 microseconds; with a few small modifications to the file (moving the file open and print-out into a separate loop after the timing) I get an average of 165 microseconds. Running on Ubuntu 18.04 via WSL on the same machine (which is not a good test case for Linux) I get an average of around 380 microseconds, and with the modifications I get an average of 93 microseconds. Everything used Java 8u212. I haven't had a chance to run it on macOS yet, but I'm not seeing the same behaviour. This is the modified main I used, which ran faster:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;
import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;
import ml.dmlc.xgboost4j.java.XGBoostError;

public class PredictTiming {
    public static void main(String[] args) throws XGBoostError {
        // Load the pre-trained demo model.
        Booster booster = XGBoost.loadModel("./model/demo.model");
        int iterations = 3000;
        float[] predictions = new float[iterations];
        long[] durations = new long[iterations];
        long durationSum = 0L;
        for (int k = 0; k < iterations; k++) {
            // The fixed seed inside the loop means every iteration predicts
            // on the same 30 random features, as in the original demo.
            Random rand = new Random(100);
            float[] falist = new float[30];
            for (int i = 0; i < 30; i++) {
                falist[i] = (float) ((rand.nextFloat() - 0.5) / 5);
            }
            DMatrix dmat = new DMatrix(falist, 1, 30, Float.NaN);
            // Time only the predict call; durations are in microseconds.
            long startTime = System.nanoTime();
            float[][] predict = booster.predict(dmat);
            predictions[k] = predict[0][0];
            long endTime = System.nanoTime();
            durations[k] = (endTime - startTime) / 1000;
            durationSum += durations[k];
        }
        // Sample mean and variance of the per-call durations.
        double durationMean = durationSum / ((double) iterations);
        double durationVariance = 0.0;
        for (int i = 0; i < iterations; i++) {
            double diff = durations[i] - durationMean;
            durationVariance += diff * diff;
        }
        durationVariance /= (iterations - 1);
        System.out.println("Mean duration = " + durationMean + ", variance = " + durationVariance);
        // Write the raw durations out once, after the timed loop.
        File file = new File("./duration-mod" + System.getProperty("os.name") + ".txt");
        try (PrintWriter pr = new PrintWriter(new BufferedWriter(new FileWriter(file)))) {
            for (int i = 0; i < iterations; i++) {
                pr.println(durations[i]);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Ran on macOS and got some odd malloc errors which I'll try to look into later, but when it did run it was somewhere between the Windows and WSL numbers.
Closing as the performance has changed significantly since then.
I am using XGBoost4j to get predictions, but I found that the performance on Windows and Linux is very different, and it is quite confusing.
My code is below. I used the same code on both systems to evaluate the prediction time.
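The posted code block did not survive this capture. Based on the modified version quoted above, the flow was to load the model, build a 1x30 DMatrix, and time each Booster.predict call. A minimal sketch of that core call sequence (a hypothetical reconstruction, not the poster's exact code):

import ml.dmlc.xgboost4j.java.Booster;
import ml.dmlc.xgboost4j.java.DMatrix;
import ml.dmlc.xgboost4j.java.XGBoost;
import ml.dmlc.xgboost4j.java.XGBoostError;

public class PredictOnce {
    public static void main(String[] args) throws XGBoostError {
        // Hypothetical model path, matching the demo in the comments above.
        Booster booster = XGBoost.loadModel("./model/demo.model");
        float[] features = new float[30]; // one row of 30 features
        DMatrix dmat = new DMatrix(features, 1, 30, Float.NaN);
        long start = System.nanoTime();
        float[][] prediction = booster.predict(dmat);
        long micros = (System.nanoTime() - start) / 1000;
        System.out.println(prediction[0][0] + " predicted in " + micros + "us");
    }
}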
On Windows, the output looks like:
3339
2518
1456
2531
4241
1873
2738
1591
1464
while on Linux, the output looks like:
31257
43467
33659
26732
32744
30719
28504
28195
39985
The Linux timings are almost 10 times those on Windows.
Both Java versions are:
java --version
java 11.0.1 2018-10-16 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.1+13-LTS)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.1+13-LTS, mixed mode)
Both xgboost4j versions are 0.82.
On Windows, I used xgboost4j-0.82-criteo-20190412_2.11-win64.jar.
The Windows CPU is:
i7-7700 @ 3.6GHz
The Linux CPU is:
Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz