
Extending External Memory with Iterator-based Interface #7022

Closed

bridgream opened this issue Jun 4, 2021 · 7 comments

Comments

@bridgream (Contributor) commented Jun 4, 2021

I am currently working on extending the external memory feature to allow the construction of DMatrix from multiple binary data files. The current implementation only allows reading data from a single text file (CSV or LIBSVM), which can be impractical with large datasets. Early experiments show that the performance impact is insignificant when data is cached to an SSD and training runs on CPUs.

This is a two-step objective. The first step is to allow reading from multiple files. Prior discussion in #6336 and #6719 suggested providing an iterator interface; under this solution, the single-file case reduces to an iterator with only one batch. While I believe there is already progress with DeviceQuantileDMatrix (DQD) (thanks to @trivialfis), extra work is needed for DMatrix to support this as well.

The second step is to allow construction from a more general source of data. This might involve an extension to the XGBoost parser library, but alternative approaches could emerge after we implement reading from multiple files.

I will mainly focus on the Python implementation but will also try to assess the feasibility of modifying the C++ core. Any advice, thoughts, and help are very much appreciated. A rough sketch of the proposed usage is below.
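
For concreteness, here is a minimal sketch of the kind of iterator-based construction being proposed, loosely following the `DataIter` callback pattern that XGBoost subsequently shipped. `load_batch` and the file names are hypothetical placeholders, not existing APIs.

```python
import xgboost


class MultiFileIter(xgboost.DataIter):
    """Yield one batch per binary data file."""

    def __init__(self, file_paths):
        self._paths = file_paths
        self._it = 0
        # cache_prefix tells XGBoost where to write its external-memory cache.
        super().__init__(cache_prefix="cache")

    def next(self, input_data):
        if self._it == len(self._paths):
            return 0  # no more batches
        X, y = load_batch(self._paths[self._it])  # hypothetical loader
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self):
        self._it = 0


it = MultiFileIter(["part-0.bin", "part-1.bin"])
Xy = xgboost.DMatrix(it)  # external-memory DMatrix built from the iterator
```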

@trivialfis (Member)

Thanks for submitting the issue. I plan to introduce a similar feature to XGBoost in this work-in-progress branch: https://github.com/trivialfis/xgboost/tree/external-iterative-dmatrix . I have also talked to @RAMitchell about this feature and about possibly replacing the current implementation of external memory (which is rusty and limiting).

Right now my branch is just a proof of concept; I have a simple test in Python. For an actual implementation, we need a cache for processed data. We might also need to investigate the efficiency of the hist-based tree method, as it concatenates all the histogram indices.
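
Assuming the iterator-built DMatrix from the sketch above, training with the hist tree method would then look roughly like this; the parameter values are illustrative only.

```python
# Train with the hist tree method on the iterator-built external-memory
# DMatrix from the sketch above; parameters are illustrative.
booster = xgboost.train(
    {"tree_method": "hist", "max_depth": 6},
    Xy,
    num_boost_round=100,
)
```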

@bridgream (Contributor, Author)

It would be ideal if we could read from and cache into multiple files. The goal is to enable external memory for large datasets that cannot fit into physical memory.

For the reading part, I notice there are prototypes of iterators in dmlc-core, yet their docstrings state they are in-memory based. I also cannot find any references to them in XGBoost (perhaps they are used in other dmlc projects). Using a wrapped proxy for an iterator-based DMatrix is also promising, but it seems to complicate the design, not to mention the difficulty of porting it to other language bindings.

For the caching part, there is related code that uses more than one file (shard) as the cache. However, I cannot enable this on a single machine (it might work under rabit, but I haven't tested that).

From a practical perspective, we may also need to solve #4093, as with large datasets it is usually too time-consuming to run the exact or approximate algorithms.
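
For context, a sketch of the single-file external-memory interface being generalized here, where the on-disk cache location is encoded in the URI fragment; the file name is a placeholder.

```python
import xgboost

# The pre-iterator external-memory interface: a single LIBSVM text file,
# where everything after '#' becomes the on-disk cache file prefix.
dtrain = xgboost.DMatrix("train.libsvm#dtrain.cache")
```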

@trivialfis (Member)

> Using a wrapped proxy for an iterator-based DMatrix is also promising, but it seems to complicate the design, not to mention the difficulty of porting it to other language bindings.

Right now it's used in Python, and also in an internal JVM branch.

@trivialfis (Member)

Closing this issue, as the interface is here now. But we still need to change the internal algorithms to make it useful for training.

@talipini

This is very exciting. Is the gpu_hist tree method supported with the iterator interface? The demo code seems to suggest that it is, but my test throws memory errors, so I wanted to check. I appreciate all the work that went into this.

@trivialfis (Member) commented Jul 26, 2021

For GPU hist, it's just like DeviceQuantileDMatrix with a cached ELLPACK matrix. Internally all batches are still concatenated, but in a compressed format, so it saves memory, though only up to a bound. For that case, one might simply use DeviceQuantileDMatrix with an iterator instead, which is more efficient. I tried to build a real implementation for the GPU (actual batching), but failed to obtain reasonable performance even with all the optimizations I could come up with. Here is a screenshot of the ratio between the time spent copying memory from host to device and the time spent on actual computation:

[Screenshot from 2021-07-22: ratio of host-to-device memory copy time to actual computation time]
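
A sketch of the DeviceQuantileDMatrix suggestion, reusing the hypothetical MultiFileIter from above; it assumes `load_batch` returns device arrays (e.g. cupy) and that the iterator is constructed without a cache prefix, since batches are quantized directly.

```python
import xgboost

# Reusing the hypothetical MultiFileIter from the sketch above, with
# load_batch yielding device (e.g. cupy) arrays so that quantization
# happens on the GPU; no cache_prefix is needed for this path.
it = MultiFileIter(["part-0.bin", "part-1.bin"])
Xy = xgboost.DeviceQuantileDMatrix(it, max_bin=256)
booster = xgboost.train({"tree_method": "gpu_hist"}, Xy, num_boost_round=100)
```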

@talipini

@trivialfis Thank you! Appreciate the details.
