
[WIP] cuDF integration into XGBoost #3997

Open · wants to merge 36 commits into base: master
Conversation

7 participants
@mt-jones (Contributor) commented Dec 13, 2018

Extend XGBoost with native support for cuDF: https://github.com/rapidsai/cudf

Note: this PR is currently a work in progress. We are publishing it to acquire early feedback, so that we can ensure smooth integration with DMLC/XGBoost.

This PR provides the following:

  • Support for cudf.DataFrame at data ingest in Python
  • Support for DMatrix construction from gdf_column types
  • Updates to the build infrastructure for compiling with cuDF support
@RAMitchell (Member) left a comment

In general, the impact of this PR on the rest of the code base is too high. The conversion from cuDF should occur in c_api.h/c_api.cu, and the rest of the code base should have no idea what cuDF is.

We should consider putting <cudf/types.h> in src/c_api/external/cudf_types.h. This would mean XGBoost can be built with cuDF support without any external dependencies (although you will obviously need cuDF to actually use the functionality via Python), and the official release can support this.

Resolved review threads (outdated):
  • cmake/modules/FindGDF.cmake
  • include/xgboost/data.h
  • src/data/gdf.cuh
  • src/tree/updater_gpu_hist.cu
  • tests/python-gpu/test_gpu_gdf.py (3 threads)
@@ -16,6 +16,10 @@
#include <stdint.h>
#endif

#ifdef XGBOOST_USE_CUDF
#include <cudf/types.h>

@trivialfis (Member) commented Jan 3, 2019

Just a suggestion: is it possible to use an opaque type in c_api.h? I have been rewriting the CMake scripts and added an installation target, and bringing an external header into the public API would cause some trouble.
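The opaque-type idea can be illustrated from the Python/ctypes side, where an opaque handle is simply a void pointer. This is a minimal sketch; `CudfColumnsHandle` and `make_handle` are hypothetical names, not part of any real XGBoost or cuDF header.

```python
import ctypes

# Hypothetical sketch of the opaque-handle idea: the public C header
# would only declare something like `typedef void *CudfColumnsHandle;`,
# so including it never pulls in <cudf/types.h>.  From Python/ctypes,
# an opaque handle is just a void pointer.
CudfColumnsHandle = ctypes.c_void_p  # opaque: layout unknown to callers

def make_handle(raw_address):
    """Wrap a raw address in an opaque handle (illustrative only)."""
    return CudfColumnsHandle(raw_address)

handle = make_handle(0xDEADBEEF)
print(hex(handle.value))  # → 0xdeadbeef
```

The benefit is that callers can pass the handle around without ever seeing the struct layout behind it, which keeps external types out of the installed headers.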

@mt-jones (Author) commented Jan 30, 2019

I'm not sure what you mean. Could you clarify and provide an example?

@mt-jones (Author) left a comment

Some of the Python code will likely need to change to match the new function definitions and the removal of cuDF as a dependency.

In general, the specification you outlined for the interchange format looks solid to me. It is not likely to change in the near future, making it a viable long-term solution.

Resolved review threads (outdated): python-package/xgboost/core.py (2 threads)
INT64,
FLOAT32,
FLOAT64,
} cudf_interchange_dtype;

@mt-jones (Author) commented Feb 27, 2019

Restricting to numeric types is a good idea. Those are not likely to change any time soon.
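For illustration, a Python mirror of such a numeric-only interchange enum might look like the sketch below. The member set and values are assumptions for illustration, not the PR's actual definition (only INT64, FLOAT32, and FLOAT64 are visible in the excerpt above).

```python
from enum import IntEnum

# Hypothetical Python mirror of a numeric-only interchange enum; the
# member set and values are assumptions, not the PR's definition.
class CudfInterchangeDtype(IntEnum):
    INT8 = 0
    INT16 = 1
    INT32 = 2
    INT64 = 3
    FLOAT32 = 4
    FLOAT64 = 5

# Only numeric dtypes are accepted; everything else is rejected up front.
_NUMPY_TO_INTERCHANGE = {
    "int8": CudfInterchangeDtype.INT8,
    "int16": CudfInterchangeDtype.INT16,
    "int32": CudfInterchangeDtype.INT32,
    "int64": CudfInterchangeDtype.INT64,
    "float32": CudfInterchangeDtype.FLOAT32,
    "float64": CudfInterchangeDtype.FLOAT64,
}

def to_interchange_dtype(numpy_name):
    """Map a numpy dtype name to the interchange enum, or fail loudly."""
    try:
        return _NUMPY_TO_INTERCHANGE[numpy_name]
    except KeyError:
        raise TypeError(f"unsupported non-numeric dtype: {numpy_name}")
```

Failing loudly on strings, categoricals, and other non-numeric types keeps the format stable, which is the point being made above.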

@mt-jones referenced this pull request Feb 28, 2019: [WIP] CUDF Interchange Format #1065 (closed, 1 of 6 tasks complete)
@RAMitchell (Member) commented Feb 28, 2019

@mt-jones In general there is no dependency on cuDF in Python, because the compat.py module creates dummy classes when cuDF is absent. Obviously you will need cuDF installed to actually pass a cuDF dataframe.
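The compat.py pattern described here is the standard optional-dependency idiom, which can be sketched as follows. These names are illustrative and are not the exact contents of xgboost's compat.py.

```python
# Illustrative sketch of the optional-dependency idiom described above;
# these names are not the exact contents of xgboost's compat.py.
try:
    from cudf import DataFrame as CudfDataFrame  # real class if installed
    CUDF_INSTALLED = True
except ImportError:
    class CudfDataFrame:  # dummy stand-in so isinstance() checks still work
        pass
    CUDF_INSTALLED = False

def is_cudf_df(data):
    """True only when cuDF is installed and `data` is a cuDF DataFrame."""
    return CUDF_INSTALLED and isinstance(data, CudfDataFrame)
```

The dummy class keeps isinstance checks valid everywhere, so the rest of the package never has to guard imports itself.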

I will tidy up the python code when I see what the cudf side looks like.

@hcho3 Can I get a review, please? The Python code is in flux, but the native code is ready.

@hcho3 hcho3 self-requested a review Feb 28, 2019

@RAMitchell (Member) commented Mar 1, 2019

@hcho3 I just noticed we have some strange methods for DMatrix in the Python API such as set_weight_npy2d. Shouldn't set_float_info automatically detect the input type and call the appropriate c_api function? All of these _npy2d functions seem redundant.
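The dispatch being suggested could look roughly like the sketch below. This is hypothetical Python: the two private helpers stand in for distinct C API entry points and are not real XGBoost functions.

```python
import numpy as np

# Hypothetical sketch of the suggested dispatch: a single set_float_info
# that inspects its input, instead of parallel *_npy2d methods.  The two
# private helpers stand in for distinct C API entry points.
def _set_info_from_1d(name, array):
    return ("1d", name, array.shape)

def _set_info_from_2d(name, array):
    return ("2d", name, array.shape)

def set_float_info(name, data):
    array = np.asarray(data, dtype=np.float32)
    if array.ndim == 1:
        return _set_info_from_1d(name, array)
    if array.ndim == 2:
        return _set_info_from_2d(name, array)
    raise ValueError(f"{name}: expected 1- or 2-dimensional input")
```

With detection centralized like this, callers never need to know which backing C function handles their array shape.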

@RAMitchell RAMitchell force-pushed the rapidsai:cudf branch from 6685954 to b326237 Mar 1, 2019

@RAMitchell RAMitchell force-pushed the rapidsai:cudf branch from b326237 to 63c9f8b Mar 1, 2019

@hcho3 (Collaborator) commented Mar 1, 2019

@RAMitchell Will review this after 0.82 release.

@mt-jones (Author) commented Mar 6, 2019

@hcho3 @trivialfis @RAMitchell Given that this will not be integrated into the 0.82 release, we're thinking of generalizing this PR, which would involve the following changes:

  1. Remove the interchange format altogether
  2. Enable XGBoost ingest of Apache Arrow by passing opaque pointers to the data structure (device side)

I believe this is a more robust solution that gives a large community access to XGBoost, but I wanted to probe for thoughts on the matter.

Thanks.

@trivialfis (Member) commented Mar 6, 2019

@mt-jones Sounds interesting. :)

@hcho3 (Collaborator) commented Mar 6, 2019

@mt-jones Interesting. I particularly like the idea of integrating Apache Arrow.

@CodingCat (Member) commented Mar 7, 2019

@mt-jones Interesting idea. I'd like to share some prior experience with Arrow and XGBoost.

We tried to integrate with Apache Arrow, but partway through found that it doesn't provide enough benefit to outweigh the added code complexity.

Our original concern was that the memory footprint of copying data from the JVM to the native layer was too high in XGBoost-Spark, limiting scalability. So we chose Arrow as an intermediate layer to avoid the data copy.

However, after we fixed the memory leak caused by iterator.duplicate() in Scala, we found the savings would only be the size of 32K records in memory (the number of records we copy from the JVM to native code per batch). With that fix, XGBoost 0.81 can scale to 10+ TB training datasets with XGBoost-Spark.

More details, with a back-of-envelope calculation:

With Arrow: we translate training data from Spark's LabeledPoint to Arrow format.

Without Arrow: we translate Spark's LabeledPoints to XGBoost's LabeledPoints and then copy them to native memory in 32K-record batches.

So the additional memory footprint in the without-Arrow case is just those 32K records.
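The size of that staging batch can be checked with a quick calculation. The feature count and 4-byte float values below are assumptions for illustration, not figures from the comment.

```python
# Back-of-envelope check of the claim above: the extra footprint of the
# non-Arrow path is a single 32K-record staging batch.  The feature
# count and 4-byte float values are assumptions for illustration.
BATCH_RECORDS = 32 * 1024
FEATURES = 100           # assumed
BYTES_PER_VALUE = 4      # float32

batch_bytes = BATCH_RECORDS * FEATURES * BYTES_PER_VALUE
print(f"staging batch = {batch_bytes / 2**20:.1f} MiB")  # → 12.5 MiB
```

Even with generous assumptions, the batch is tiny compared to a multi-terabyte training set, which is why the Arrow copy-avoidance saved so little in practice.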

By the way, I think a potentially large change and roadmap item like this deserves an RFC for discussion, especially on what we want to get from Arrow integration. (For example, Spark's Arrow work was driven by wanting to interact with Pandas more efficiently; maybe we can tell a similar story and pursue Arrow integration plus better Pandas performance in another PR.)

@CodingCat (Member) commented Mar 7, 2019

Another concern that made us abandon the Arrow experiment midway is that Arrow essentially puts everything in memory. On most shared Hadoop clusters, the memory available to each container is strictly limited, so Arrow brought more user complaints about containers being killed (actually because they used too much memory).

To align with other ongoing work, maybe we should focus on one thing in this PR: integrate cuDF, and work with others on a co-design of how the data layer of XGBoost should look, e.g. as @trivialfis mentioned in the thread of #4193.

@mt-jones (Author) commented Mar 7, 2019

@CodingCat these are excellent considerations, and I think there are a lot of moving parts to this. I would say they can be broken down into three components.

  1. Arrow ingest from the host
  2. Arrow ingest from the device
  3. Defining a data specification

Regarding (1), I think your points are well made. Probing how data migrates through different ecosystems helps us better understand the path of greatest performance, and of least code impact. I’m not sure I’m sold on a good solution for handling data with Arrow on the host.

Regarding (2) the proposition is as follows: cuDF’s specification is itself in flux as developments are made and features are added. One thing we work diligently to achieve is cohesion with the Arrow specification of the data structure. We are always compliant. So rather than building an interchange format which tries to encapsulate numerical types as floats, and building a matching structure in cuDF, we thought it made more sense to explicitly match the Arrow specification for the device side struct. That way, we guarantee some consistency moving forward. Materially, this will place a definition of the Arrow data struct device side into XGB. This is what we’d use to perform instantiation.
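As a rough illustration of what a device-side, Arrow-style column definition might look like, here is a hypothetical layout sketched with ctypes. The field names are loosely modelled on Arrow's columnar format (values buffer, validity bitmap, length, null count) and are not taken from a real cuDF or Arrow header.

```python
import ctypes

# Hypothetical layout for an Arrow-style device column, sketched with
# ctypes; field names are loosely modelled on Arrow's columnar format
# and are not taken from a real cuDF or Arrow header.
class DeviceColumn(ctypes.Structure):
    _fields_ = [
        ("data", ctypes.c_void_p),        # device pointer to values
        ("valid", ctypes.c_void_p),       # device pointer to validity bitmap
        ("size", ctypes.c_size_t),        # number of rows
        ("null_count", ctypes.c_size_t),  # rows masked out by the bitmap
        ("dtype", ctypes.c_int),          # numeric type tag
    ]

col = DeviceColumn(data=None, valid=None, size=1000, null_count=3, dtype=4)
```

Because the layout tracks the Arrow specification rather than a library-specific interchange format, any producer that emits Arrow-compliant device buffers could hand columns to XGBoost without conversion.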

Ideally, there would be a matching feature addition for Arrow host-side. I don’t think it belongs in this PR because it would touch a lot more code, and delay this PR unnecessarily.

Regarding (3) I think co-designing a specification for a data structure will help us create matching interoperability in cuDF, making it easier to maintain a functional workflow that involves both libraries.

It would be awesome if we could perform a conversion internally, and pass a pointer to XGB and have it build a distributed DMatrix for training, etc.

Let me know what you’re thinking is with the above. Again, if you think it’s better to build in direct support for cuDF, we’re open to the idea; though, it would be predicated on (3), I think.

@CodingCat (Member) commented Mar 7, 2019

@mt-jones thanks for the reply!

IIUC, Arrow is chosen as a standardized specification for the intermediate data layer between XGBoost and any framework running on hardware other than the host.

In the future, if XGBoost supports other devices beyond GPUs, say an XPU, the frameworks operating on those devices would transform data to Arrow first so that XGBoost can consume it directly.

@mt-jones (Author) commented Mar 7, 2019

@CodingCat thanks for the insight. Let me make sure I understand.

  1. Arrow is the standard for device interoperability.
  2. If cuDF outputs an Arrow-compliant data structure, XGBoost will handle it in the future.
  3. As far as I can tell, cuDF interoperability is the first implementable undertaking that would require defining infrastructural components for ingest from GPU DataFrame (or similar) libraries.

Is this correct?

If so, (3) prompts us to go a bit further by defining some of the infrastructural components in XGBoost in conjunction with co-designing the DMatrix specification.

Looking forward to your thoughts.

@hcho3 (Collaborator) commented Mar 7, 2019

@CodingCat @mt-jones So we should review this PR as it is? Can I review it now?

@RAMitchell (Member) commented Mar 7, 2019

@hcho3 I think we may go in a different direction, so there is no need right now.

@CodingCat (Member) commented Mar 8, 2019

@mt-jones Yes, that's what I meant, and what I took from your first reply. Did I bring up something beyond what you originally scoped?

@mt-jones (Author) commented Mar 8, 2019

> @mt-jones yes, that's what I meant and what I got from your first reply to me? Did I bring something beyond what you originally scoped?

Not at all. This is entirely consistent with our plans for this PR. Just making sure we’re aligned :)

I was surprised to hear about the host-side issues with Arrow, but it makes sense.
