
Extending External Memory with Iterator-based Interface #7022

Closed

bridgream opened this issue Jun 4, 2021 · 7 comments

Comments

@bridgream (Contributor) commented Jun 4, 2021

I am currently working on extending the external memory feature to allow the construction of DMatrix from multiple binary data files. The current implementation only allows reading data from a single text file (CSV or LIBSVM), which can be impractical with large datasets. Early experiments show that the performance impact is insignificant when data is cached to an SSD and training runs on CPUs.

This is a two-step objective. The first step is to allow reading from multiple files. Prior discussion in #6336 and #6719 suggested providing an iterator interface; under this solution, the single-file case reduces to an iterator with only one batch. While I believe there is already progress with DeviceQuantileDMatrix (DQD) (thanks to @trivialfis), extra work is needed for DMatrix to support this as well.

The second step is to allow construction from a more general source of data. This might involve an extension to the XGBoost parser library, but alternative approaches could emerge after we implement reading from multiple files.

I will mainly focus on the Python implementation but will also try to assess the feasibility of modifying the C++ core. Any advice, thoughts, and help are very much appreciated. A rough sketch of the proposed usage is below.
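
For concreteness, here is a minimal sketch of the kind of iterator-based construction being proposed, loosely following the `DataIter` callback pattern that XGBoost subsequently shipped. `load_batch` and the file names are hypothetical placeholders, not existing APIs.

```python
import xgboost


class MultiFileIter(xgboost.DataIter):
    """Yield one batch per binary data file."""

    def __init__(self, file_paths):
        self._paths = file_paths
        self._it = 0
        # cache_prefix tells XGBoost where to write its external-memory cache.
        super().__init__(cache_prefix="cache")

    def next(self, input_data):
        if self._it == len(self._paths):
            return 0  # no more batches
        X, y = load_batch(self._paths[self._it])  # hypothetical loader
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self):
        self._it = 0


it = MultiFileIter(["part-0.bin", "part-1.bin"])
Xy = xgboost.DMatrix(it)  # external-memory DMatrix built from the iterator
```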

@trivialfis (Member)

Thanks for submitting the issue. I plan to introduce a similar feature to XGBoost in this work-in-progress branch: https://github.com/trivialfis/xgboost/tree/external-iterative-dmatrix . I have also talked to @RAMitchell about this feature and about possibly replacing the current implementation of external memory (which is rusty and limiting).

Right now my branch is just a proof of concept; I have a simple test in Python. For an actual implementation, we need a cache for processed data. We might also need to investigate the efficiency of the hist-based tree method, as it concatenates all the histogram indices.
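
Assuming the iterator-built DMatrix from the sketch above, training with the hist tree method would then look roughly like this; the parameter values are illustrative only.

```python
# Train with the hist tree method on the iterator-built external-memory
# DMatrix from the sketch above; parameters are illustrative.
booster = xgboost.train(
    {"tree_method": "hist", "max_depth": 6},
    Xy,
    num_boost_round=100,
)
```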

@bridgream (Contributor, Author)

It would be ideal if we could read from and cache into multiple files. The goal is to enable external memory for large datasets that cannot fit into physical memory.

For the reading part, I notice there are prototypes of iterators in dmlc-core, yet their docstrings state they are in-memory based. I also cannot find any references to them in XGBoost (perhaps they are used in other dmlc projects). Using a wrapped proxy for an iterator-based DMatrix is also promising, but it seems to complicate the design, not to mention the difficulty of porting it to other language bindings.

For the caching part, there is related code that uses more than one file (shard) as the cache. However, I cannot enable this on a single machine (it might work under rabit, but I haven't tested that).

From a practical perspective, we may also need to solve #4093, as with large datasets it is usually too time-consuming to run the exact or approximate algorithms.
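
For context, a sketch of the single-file external-memory interface being generalized here, where the on-disk cache location is encoded in the URI fragment; the file name is a placeholder.

```python
import xgboost

# The pre-iterator external-memory interface: a single LIBSVM text file,
# where everything after '#' becomes the on-disk cache file prefix.
dtrain = xgboost.DMatrix("train.libsvm#dtrain.cache")
```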

@trivialfis (Member)

> Using a wrapped proxy for an iterator-based DMatrix is also promising, but it seems to complicate the design, not to mention the difficulty of porting it to other language bindings.

Right now it's used in Python, and also in an internal JVM branch.

@trivialfis (Member)

Closing this issue, as the interface is here now. But we still need to change the internal algorithms to make it useful for training.

@talipini

This is very exciting. Is the gpu_hist tree method supported with the iterator interface? The demo code seems to suggest that it is, but my test throws memory errors, so I wanted to check. I appreciate all the work that went into this.

@trivialfis (Member) commented Jul 26, 2021

For GPU hist, it's just like DeviceQuantileDMatrix with a cached ELLPACK matrix. Internally all batches are still concatenated, but in a compressed format, so it saves memory, though only up to a bound. For that case, one might simply use DeviceQuantileDMatrix with an iterator instead, which is more efficient. I tried to build a real implementation for the GPU (actual batching), but failed to obtain reasonable performance even with all the optimizations I could come up with. Here is a screenshot of the ratio between the time spent copying memory from host to device and the time spent on actual computation:

[Screenshot from 2021-07-22: ratio of host-to-device memory copy time to actual computation time]
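
A sketch of the DeviceQuantileDMatrix suggestion, reusing the hypothetical MultiFileIter from above; it assumes `load_batch` returns device arrays (e.g. cupy) and that the iterator is constructed without a cache prefix, since batches are quantized directly.

```python
import xgboost

# Reusing the hypothetical MultiFileIter from the sketch above, with
# load_batch yielding device (e.g. cupy) arrays so that quantization
# happens on the GPU; no cache_prefix is needed for this path.
it = MultiFileIter(["part-0.bin", "part-1.bin"])
Xy = xgboost.DeviceQuantileDMatrix(it, max_bin=256)
booster = xgboost.train({"tree_method": "gpu_hist"}, Xy, num_boost_round=100)
```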

@talipini

@trivialfis Thank you! Appreciate the details.
