Extending External Memory with Iterator-based Interface #7022
Thanks for submitting the issue. I plan to introduce a similar feature to XGBoost via this work-in-progress branch: https://github.com/trivialfis/xgboost/tree/external-iterative-dmatrix I have also talked to @RAMitchell about this feature and about possibly replacing the current implementation of external memory (which is rusty and limiting). Right now my branch is just a proof of concept, with a simple test in Python. For an actual implementation, we need a cache for processed data. We might also need to investigate the efficiency of the hist-based tree method, as it concatenates all the histogram indices.
It would be ideal if we could read from, and cache into, multiple files. The goal is to enable external memory for large datasets that cannot fit into physical memory. For the reading part, I notice there are prototypes of iterators in dmlc-core, yet their docstrings claim they are in-memory based. I also cannot find any references to them in xgboost (perhaps they are used in other dmlc projects). Using a wrapped proxy for an iterator-based DMatrix is also promising, but this seems to complicate the design, not to mention the difficulty of porting it to other languages. For the caching part, there is related code for using more than one file (shard) as cache. However, I cannot enable this on a single machine (it might work in rabit, but I haven't tested that). From a practical perspective, we may also need to solve #4093, as with large datasets it's usually too time-consuming to run the exact or approximate algorithms.
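The batch-iterator pattern being discussed can be sketched in plain Python without any XGBoost dependency. This is only an illustration of the protocol shape (a `next` method that hands one batch to a consumer callback and signals exhaustion, plus a `reset` for subsequent passes); the class and method names here are assumptions, not the final API:

```python
class MultiFileIterSketch:
    """Illustrative batch iterator over several data files.

    Each call to ``next`` delivers one file's contents to the consumer
    via a callback and returns 1; once all files are consumed it
    returns 0.  ``reset`` rewinds for the next training pass.
    This is a sketch of the discussed design, not XGBoost's real API.
    """

    def __init__(self, file_paths):
        self._file_paths = list(file_paths)
        self._it = 0  # index of the next file to read

    def next(self, input_data):
        if self._it == len(self._file_paths):
            return 0  # exhausted: signal end of iteration
        with open(self._file_paths[self._it]) as fh:
            input_data(fh.read())  # hand one batch to the consumer
        self._it += 1
        return 1  # more batches may remain

    def reset(self):
        self._it = 0
```

A consumer drives the iterator with `while it.next(callback): pass`, calling `it.reset()` before each new pass, so the same loop works whether the data is split across two files or two hundred.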
Right now it's used in Python, and also in an internal JVM branch.
Closing this issue as the interface is here now. But we still need to change internal algorithms to make it useful for training. |
This is very exciting. Is gpu_hist mode supported with the iterator interface? The demo code seems to suggest that it is, but my test throws memory errors and I wanted to check. I appreciate all the work that went into this.
For GPU Hist, it's just like |
@trivialfis Thank you! Appreciate the details. |
I am currently working on extending the external memory feature to allow the construction of a DMatrix from multiple binary data files. The current implementation only allows reading data from a single text file (CSV or LIBSVM), which can be impractical with large datasets. Early experiments show that the impact on performance is insignificant when caching data to SSD and training on CPUs.
This is a two-step objective. The first step is to allow reading from multiple files. Prior discussion in #6336 and #6719 suggested providing an iterator interface. Under this solution, the single-file case reduces to an iterator with only one batch. While I believe there is already progress with DQD (thanks to @trivialfis), extra work is needed for DMatrix to support this as well.
The second step is to allow construction from more general sources of data. This might involve an extension to the xgboost parser library, but alternative approaches could emerge after we implement the feature to read from multiple files.
I will mainly focus on the implementation in Python but will also try to assess the feasibility of modifying the C++ core. Any advice, thoughts, and help are very much appreciated.
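The claim that the single-file case reduces to a one-batch iterator can be made concrete with a small pure-Python sketch. The `next`/`reset` callback protocol and all names below are assumptions for illustration, not XGBoost's actual interface; the point is that the consumer's loop is identical whether the iterator wraps one file or many:

```python
class SingleFileIterSketch:
    """Degenerate iterator yielding exactly one batch: the whole file.

    Illustrative sketch only; not XGBoost's real API.
    """

    def __init__(self, path):
        self._path = path
        self._done = False

    def next(self, input_data):
        if self._done:
            return 0  # the single batch was already delivered
        with open(self._path) as fh:
            input_data(fh.read())
        self._done = True
        return 1

    def reset(self):
        self._done = False


def total_rows(iterator):
    """Hypothetical consumer: count newline-separated rows across all
    batches.  It only relies on the next/reset protocol, so the same
    code would drain a multi-file iterator unchanged."""
    count = 0

    def on_batch(text):
        nonlocal count
        count += len(text.splitlines())

    iterator.reset()
    while iterator.next(on_batch):
        pass
    return count
```

Because `total_rows` calls `reset()` up front, repeated passes over the same iterator (as a training loop would make) work without the consumer knowing how many underlying files exist.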