Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jvm-packages] Allow for bypassing spark missing value check #4805

Merged
merged 2 commits into from Dec 18, 2019

Conversation

@cpfarrell
Copy link
Contributor

cpfarrell commented Aug 23, 2019

This pull request adds a new parameter in Spark XGBoost that allows for bypassing assertion added as part of https://github.com/dmlc/xgboost/pull/4349/files (related comment #4349 (comment))

The reason for this change is that I have my own VectorAssembler implementation that allows for creating sparse vectors where any value can be encoded as missing. I had been using this assembler to create feature vectors with NaN encoded as missing which integrated well with XGBoost (both in Spark and saving the models in binary form for use on other platforms). This change would let me continue doing that by providing this special flag, essentially its a way of addressing the "Note" section at https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html#dealing-with-missing-values. I plan on providing my VectorAssembler implementation here in case any one else might be interested in it but will take just a little bit of time to do that.

For the actual code here: I mostly just copied what was happening for the missing parameter and passed my new variable around the same way. I tried to follow the style of the rest of the code base but I've never written Scala before so likely have a few style issues would be happy to hear about.

Probably the biggest thing here is how to make it clear why this exists, I am not at all sure on my naming so welcome feedback. Also should maybe document more what it does and how it should be used (including when it shouldn't be used).

For testing, I ran jvm-packages/dev/build-linux.sh and it succeeded. Would like to try actually running this inside a Spark job to verify it works how I think it does.

Let me know thoughts on this.

@cpfarrell

This comment has been minimized.

Copy link
Contributor Author

cpfarrell commented Aug 23, 2019

@CodingCat FYI since talked with you about this in #4349 (comment)

@CodingCat CodingCat changed the title Allow for bypassing spark missing value check [jvm-packages] Allow for bypassing spark missing value check Aug 23, 2019
@CodingCat

This comment has been minimized.

Copy link
Member

CodingCat commented Aug 27, 2019

@cpfarrell

This comment has been minimized.

Copy link
Contributor Author

cpfarrell commented Aug 30, 2019

Thaks, and yea will do on updating docs.

Mind if I do #4727 at the same time since would be modifying the same place in documentation? Or could do it in separate branch if you want.

@CodingCat

This comment has been minimized.

Copy link
Member

CodingCat commented Aug 31, 2019

@cpfarrell

This comment has been minimized.

Copy link
Contributor Author

cpfarrell commented Sep 11, 2019

Due to various things I probably won't be able to get back to this for 3-4 weeks but am still planning on updating the docs so can push this

@cpfarrell cpfarrell force-pushed the cpfarrell:allow_bypassing_spark_missing_check branch from e91adf9 to 2054ea9 Nov 19, 2019
@cpfarrell

This comment has been minimized.

Copy link
Contributor Author

cpfarrell commented Nov 19, 2019

I just updated review with a new version after rebasing on master and updating documentation for how to handle missing values. I ended up rewriting the documentation a decent amount as I've seen this issue cause lots of confusion with people trying to use spark on xgboost and I'm not sure what was there before was exactly right. Open to reverting those changes if desired though (or changed in any way). Also open to adding more explanation about what makes this so tricky (I've found it hard to explain to folks) but that might become kind of long in the documentation.

Also, any thought of having an implementation of a VectorAssembler that would keep zeros checked into this project so could directly point people at it to use and thus make integration easier

@hcho3

This comment has been minimized.

Copy link
Collaborator

hcho3 commented Dec 10, 2019

@CodingCat Is this ready to be merged?

@CodingCat

This comment has been minimized.

Copy link
Member

CodingCat commented Dec 10, 2019

@CodingCat CodingCat merged commit bc9d882 into dmlc:master Dec 18, 2019
7 of 10 checks passed
7 of 10 checks passed
Jenkins Linux: Build Failed to build stage
Details
continuous-integration/jenkins/pr-merge This commit cannot be built
Details
continuous-integration/travis-ci/pr The Travis CI build failed
Details
Jenkins Linux: Formatting Check Stage built successfully
Details
Jenkins Linux: Get sources Stage built successfully
Details
Jenkins Linux: Test Stage did not run due to earlier failure(s)
Details
Jenkins Win64: Build Stage built successfully
Details
Jenkins Win64: Get sources Stage built successfully
Details
Jenkins Win64: Test Stage built successfully
Details
continuous-integration/appveyor/pr AppVeyor build succeeded
Details
@CodingCat

This comment has been minimized.

Copy link
Member

CodingCat commented Dec 18, 2019

merged, thanks for the contribution!

@sriramch

This comment has been minimized.

Copy link
Contributor

sriramch commented Dec 18, 2019

@cpfarrell @hcho3 this is failing builds. can someone please help looking into this?

@hcho3

This comment has been minimized.

Copy link
Collaborator

hcho3 commented Dec 18, 2019

@sriramch It appears a minor infraction in Scala style check.

sriramch added a commit to sriramch/xgboost that referenced this pull request Dec 18, 2019
@hcho3

This comment has been minimized.

Copy link
Collaborator

hcho3 commented Dec 18, 2019

See #5134

@lucagiovagnoli lucagiovagnoli mentioned this pull request Jan 24, 2020
8 of 9 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

4 participants
You can’t perform that action at this time.