
XGBoost repeatedly copying data across machines - slowing down computation #33

Closed
ankurd28 opened this issue Oct 5, 2015 · 5 comments

ankurd28 commented Oct 5, 2015

Fellow XGBoost Users,

I am facing a strange problem and am hoping to get some help from you!
It seems that multi-machine, multi-threaded XGBoost takes more time to finish the task than the multi-threaded version on a single machine!

Initially, XGBoost kept complaining that it was compiled in local mode, but I solved that by following the advice in another user's report: xgboost is compiled in local mode #31.

However, the job now completes in 17 seconds when run on a single machine with two threads, whereas the same job on two machines with three threads total (two threads on one machine and one on the other) takes ~90 seconds. I am running these jobs on AWS t2.medium and t2.micro instances.
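
For context, a single-machine timing like the 17-second run could be taken with a sketch along these lines (the data file and parameters below are placeholders, not the exact job config; only nthread=2 matches the setup described):

```python
import time
import xgboost as xgb

# Placeholder training data and parameters -- not the actual job config.
dtrain = xgb.DMatrix("train.libsvm")
params = {"objective": "binary:logistic", "nthread": 2}

start = time.time()
xgb.train(params, dtrain, num_boost_round=50)
print("wall time: %.1f s" % (time.time() - start))
```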

Does anyone know why this might be happening? At this point it seems to me that either something is wrong with my MPI setup (though I am not sure what), or the way distributed XGBoost was compiled in issue #31 is not the correct way.

Thanks,
Ankur

@ankurd28 ankurd28 changed the title Multi-threaded XGBoost taking more time than single-threaded version Multi-machine multi-threaded XGBoost taking more time than single-machine multi-threaded version Oct 5, 2015
@ankurd28 ankurd28 changed the title Multi-machine multi-threaded XGBoost taking more time than single-machine multi-threaded version Multi-machine XGBoost taking more time than single-machine version Oct 5, 2015
@ankurd28 ankurd28 closed this as completed Oct 5, 2015
@ankurd28 ankurd28 reopened this Oct 5, 2015
ankurd28 (Author) commented Oct 5, 2015

So, after some digging, we found out the reason it was slow:
distributed XGBoost with MPI is copying the data back and forth across the two machines, and that is slowing the whole computation down.

Does anybody have any ideas on how to fix this data-copying issue?

Thanks,
Ankur

@ankurd28 ankurd28 changed the title Multi-machine XGBoost taking more time than single-machine version XGBoost repeatedly copying data across machines - slowing down computation Oct 5, 2015
tqchen (Member) commented Oct 5, 2015

The data is indeed loaded from the distributed data store, but only at startup. So the difference should shrink as you run a larger number of rounds.

The major goal of distributed xgboost is to scale up to data sizes that cannot be handled by the single-machine version. So it is entirely possible for the distributed version to run slower than the single-node version when the data fits on a single node.
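
A rough way to see the amortization tqchen describes, using the Python API (the data file and parameters are placeholders): the one-time load cost stays fixed while the training cost grows with the number of rounds.

```python
import time
import xgboost as xgb

t0 = time.time()
dtrain = xgb.DMatrix("train.libsvm")  # placeholder file; loaded once, at startup
load_time = time.time() - t0

for rounds in (10, 100, 1000):
    t0 = time.time()
    xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=rounds)
    train_time = time.time() - t0
    # The fixed load cost becomes a smaller share as rounds increase.
    share = 100.0 * load_time / (load_time + train_time)
    print("rounds=%4d  load share=%.0f%%" % (rounds, share))
```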

ankurd28 (Author) commented Oct 6, 2015

Hi Tianqi,

Thank you for your response!

So, if I understand you correctly, speed is a secondary concern as long as distributed xgboost can scale across machines. It is good to understand the design goal, since it makes clear the trade-offs made during development.

Having said that, do you have any ideas on how the distributed implementation of xgboost might be sped up? In your opinion, would moving to the Hadoop framework be beneficial for speed compared to the MPI framework? In other words, does the xgboost implementation on top of Hadoop also load data from a distributed data store over the network?

Thanks,
Ankur

tqchen (Member) commented Oct 6, 2015

Hi @ankurd28, speed is definitely important to us.

In our experience, as the data scales up, the cost of loading data over the network is minor compared to the cost of training. (This is different from data-processing problems like MapReduce, where little computation is done on each example and data locality is crucial.)

Because more computation kicks in as we get more data, this is unlikely to be a problem for larger datasets. For a small dataset, however, the training cost is already low, so data loading surfaces as the bottleneck.
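
A back-of-envelope model makes the trade-off concrete (the numbers below are invented for illustration):

```python
def load_fraction(load_s, per_round_s, rounds):
    """Fraction of total wall time spent on the one-time data load."""
    return load_s / (load_s + per_round_s * rounds)

# Small dataset: each round is cheap, so loading dominates.
print(load_fraction(load_s=60, per_round_s=0.05, rounds=50))   # ~0.96
# Large dataset: per-round compute dwarfs the startup load.
print(load_fraction(load_s=600, per_round_s=30.0, rounds=50))  # ~0.29
```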

ankurd28 (Author) commented Oct 6, 2015

Hi Tianqi,

Thanks a lot for your response!
I completely understand your point!

Best,
Ankur

@ankurd28 ankurd28 closed this as completed Oct 6, 2015