
[BREAKING][jvm-packages] fix the non-zero missing value handling #4349

Open · wants to merge 5 commits into master
Conversation

CodingCat (Member) commented Apr 9, 2019

This is a follow-up PR to ad4de0d.

Spark's VectorAssembler transformer effectively supports only 0 as the missing value. This creates problems when the user treats 0 as a meaningful value and there are enough 0-valued features that the VectorAssembler represents the feature vector as a SparseVector: those 0-valued features are then filtered out by Spark.
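To illustrate the issue, here is a minimal, hypothetical sketch (not from the PR; the column names and toy data are made up):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

object SparseVectorDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    // A row where most features are 0 and 0 is a meaningful value.
    val df = Seq((0.0, 0.0, 0.0, 5.0)).toDF("f0", "f1", "f2", "f3")

    val assembled = new VectorAssembler()
      .setInputCols(Array("f0", "f1", "f2", "f3"))
      .setOutputCol("features")
      .transform(df)

    // VectorAssembler may compress the result into a SparseVector, which only
    // stores the non-zero entries -- the 0-valued features are no longer
    // materialized and can end up being treated as missing.
    val v = assembled.select("features").head.getAs[Vector](0)
    println(v) // e.g. (4,[3],[5.0])
    spark.stop()
  }
}
```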

This is fine if the user's DataFrame contains only DenseVectors, since the 0 values are kept.

This PR changes the behavior of XGBoost-Spark as follows:

when the user sets a non-zero missing value and we detect a sparse vector in the input, we stop the application to prevent errors
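A minimal sketch of the kind of guard this implies (not the actual PR code; the method name, error message, and eager scan are hypothetical):

```scala
import org.apache.spark.ml.linalg.{SparseVector, Vector}
import org.apache.spark.sql.{Dataset, Row}

// Hypothetical check: if the caller configures a non-zero, non-NaN missing
// value, fail fast as soon as a SparseVector is found in the features column,
// because its implicit zeros cannot be distinguished from genuine 0 values.
def validateMissingSetting(dataset: Dataset[Row], featuresCol: String, missing: Float): Unit = {
  if (missing != 0.0f && !missing.isNaN) {
    val containsSparse = dataset
      .select(featuresCol)
      .rdd
      .map(_.getAs[Vector](0))
      .filter(_.isInstanceOf[SparseVector])
      .take(1)
      .nonEmpty
    if (containsSparse) {
      throw new IllegalArgumentException(
        s"missing is set to $missing, but the input contains SparseVectors whose " +
          "implicit zeros would be silently treated as real feature values")
    }
  }
}
```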

@CodingCat CodingCat changed the title [jvm-packages] fix the nan and non-zero missing value handling [jvm-packages] fix the non-zero missing value handling Apr 9, 2019

@CodingCat CodingCat requested a review from superbobry Apr 9, 2019

CodingCat (Member, Author) commented Apr 9, 2019

@yanboliang the PR has been changed to this

@CodingCat CodingCat changed the title [jvm-packages] fix the non-zero missing value handling [breaking][jvm-packages] fix the non-zero missing value handling Apr 9, 2019

@CodingCat CodingCat changed the title [breaking][jvm-packages] fix the non-zero missing value handling [BREAKING][jvm-packages] fix the non-zero missing value handling Apr 9, 2019

codecov-io commented Apr 9, 2019

Codecov Report

Merging #4349 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4349   +/-   ##
=======================================
  Coverage   67.84%   67.84%           
=======================================
  Files         132      132           
  Lines       12206    12206           
=======================================
  Hits         8281     8281           
  Misses       3925     3925

Powered by Codecov. Last update b72eab3...00a70c7.

superbobry (Contributor) commented Apr 18, 2019

I'm sorry, I'm not an active user of XGBoost any more, so I don't feel confident reviewing this. @alois-bissuel, do you want to have a look?

@superbobry superbobry removed their request for review Apr 18, 2019

alois-bissuel commented Apr 18, 2019

A small comment on the aim of the review, as I did not have enough time to look at the code in detail (and won't have time before next Tuesday).
I might be mistaken, but I think using SparseVector is a way to correctly handle missing values without designating a special number to signal a missing value.
This is especially useful when one has only integer (and obviously ordinal) features, since there is no NaN for ints.
I also looked at the code of VectorAssembler in Spark MLlib, and it seems that it correctly handles null features only since Spark v2.4 (see VectorAssembler.scala#L282 in Spark 2.4).
So after this change, there would be no other way of handling missing values than using a special value (either NaN or one set through the "missing" parameter).
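For context, a minimal sketch of the Spark 2.4 behavior mentioned above (a hypothetical example using the handleInvalid parameter added to VectorAssembler in 2.4; column names and data are made up):

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object NullFeatureDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[Double], Option[Double])]((Some(1.0), None), (Some(2.0), Some(3.0)))
      .toDF("f0", "f1")

    // Spark >= 2.4: handleInvalid = "keep" turns a null feature into NaN in
    // the assembled vector instead of throwing, so it can later be treated
    // as a missing value by XGBoost.
    val assembled = new VectorAssembler()
      .setInputCols(Array("f0", "f1"))
      .setOutputCol("features")
      .setHandleInvalid("keep")
      .transform(df)

    assembled.select("features").show(truncate = false) // e.g. [1.0,NaN]
    spark.stop()
  }
}
```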

CodingCat (Member, Author) commented Apr 18, 2019

@alois-bissuel thanks for the review

To clarify the point of the PR and help further review: this PR is meant to prevent the following case from happening
