Using XGBoost External Memory Version (beta)

Using the external memory version is almost the same as using the in-memory version. The only difference is the filename format.

The external memory version takes in the following filename format:

filename#cacheprefix

Here, filename is the normal path to the libsvm file you want to load, and cacheprefix is the path to a cache file that XGBoost will use as its external memory cache.
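The format described above can be sketched with two small string helpers. These helpers (with_cache_prefix and split_cache_prefix) are hypothetical names used only for illustration and are not part of XGBoost; they assume that the first "#" in the string separates the data path from the cache prefix.

```python
def with_cache_prefix(data_path, cache_prefix):
    """Join a data path and a cache prefix into 'filename#cacheprefix'.

    Hypothetical helper for illustration; not an XGBoost function.
    """
    if "#" in data_path:
        raise ValueError("data path already contains '#'")
    if not cache_prefix:
        raise ValueError("cache prefix must be non-empty")
    return f"{data_path}#{cache_prefix}"


def split_cache_prefix(uri):
    """Split 'filename#cacheprefix' into (filename, cacheprefix).

    Returns None as the cache prefix when no '#' is present,
    which corresponds to the ordinary in-memory filename format.
    """
    path, sep, prefix = uri.partition("#")
    return path, (prefix if sep else None)


uri = with_cache_prefix("../data/agaricus.txt.train", "dtrain.cache")
print(uri)  # ../data/agaricus.txt.train#dtrain.cache
```

The resulting string is what would be passed wherever XGBoost expects a data filename.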

Note

External memory is not available with GPU algorithms

External memory is not available when tree_method is set to gpu_exact or gpu_hist.

The following code was extracted from demo/guide-python/external_memory.py:

dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')

Note the additional #dtrain.cache following the libsvm file path; this is the name of the cache file. For the CLI version, simply add the cache suffix to the data path, e.g. "../data/agaricus.txt.train#dtrain.cache".

Performance Note

The parameter nthread should be set to the number of physical cores.
- Most modern CPUs use hyperthreading, which means a 4-core CPU may run 8 threads.
- Set nthread to 4 for maximum performance in that case.
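As a rough sketch of the advice above, the snippet below estimates a value for nthread from the logical CPU count. The helper suggested_nthread is a hypothetical name, and the default of two hardware threads per core is an assumption that does not hold on every machine; os.cpu_count() reports logical CPUs, not physical cores.

```python
import os


def suggested_nthread(logical_cpus=None, threads_per_core=2):
    """Estimate the number of physical cores from the logical CPU count.

    Hypothetical helper for illustration. threads_per_core=2 is an
    assumption (typical for hyperthreading); adjust it for your hardware.
    """
    if logical_cpus is None:
        # os.cpu_count() counts logical CPUs and may return None.
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // threads_per_core)


# A 4-core CPU exposing 8 threads gives nthread = 4.
print(suggested_nthread(logical_cpus=8))  # 4

# Parameter dictionary with nthread set from the estimate.
params = {"nthread": suggested_nthread()}
```

If the machine does not use hyperthreading, passing threads_per_core=1 keeps nthread equal to the logical CPU count.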

Distributed Version

The external memory mode works naturally with the distributed version. You can simply set the path like this:

data = "hdfs://path-to-data/#dtrain.cache"

XGBoost will cache the data to a local location. When you run on YARN, the current folder is temporary, so you can directly use dtrain.cache to cache to the current folder.

Usage Note

This is an experimental version.
Currently, only importing from libsvm format is supported.
- Contributions of ingestion from other common external memory data sources are welcome.
