[BREAKING][jvm-packages] fix the non-zero missing value handling #4349
This is a follow-up PR to ad4de0d.
Spark's VectorAssembler transformer treats only 0 as the missing value. This causes problems when the user treats 0 as a meaningful value and a row has enough 0-valued features for VectorAssembler to represent the feature vector as a SparseVector: those 0-valued features have already been filtered out by Spark.
It's fine if the user's DataFrame contains only DenseVector rows, since the 0s are kept.
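To make the difference concrete, here is a minimal sketch in plain Python. It does not use Spark; `to_sparse` and `explicit_features` are hypothetical helpers that only model the idea of a sparse encoding that stores non-zero entries, and are not Spark's actual implementation.

```python
from typing import Dict, List, Tuple

# Toy stand-in for a sparse vector: (vector size, {index: stored value}).
SparseRow = Tuple[int, Dict[int, float]]


def to_sparse(values: List[float]) -> SparseRow:
    """Toy sparse encoding: store only the non-zero entries."""
    return len(values), {i: v for i, v in enumerate(values) if v != 0.0}


def explicit_features(row: SparseRow) -> List[Tuple[int, float]]:
    """What a consumer sees if it reads only the stored entries."""
    _, entries = row
    return sorted(entries.items())


# A row in which 0.0 is a real measurement, not a placeholder.
dense_row = [0.0, 3.5, 0.0, 0.0, 1.0]
sparse_row = to_sparse(dense_row)

# The dense form keeps all five values, zeros included.
dense_features = list(enumerate(dense_row))

# The sparse form keeps two; the three zeros can no longer be told apart
# from features that were never observed.
sparse_features = explicit_features(sparse_row)
print(len(dense_features), len(sparse_features))
```

Under this model, a consumer that takes "not stored" to mean "missing" silently turns the three meaningful zeros into missing values, which is the situation described above.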
This PR changes the behavior of XGBoost-Spark as follows:
when the user sets a non-zero missing value and we detect a sparse vector, we stop the application to prevent errors.
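The rule can be sketched as a small guard. This is an illustration in plain Python, not the code in this PR: `DenseRow`, `SparseRow` and `check_missing_value` are hypothetical names, and treating NaN as a non-zero missing value is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Iterable, List, Union


@dataclass
class DenseRow:
    values: List[float]


@dataclass
class SparseRow:
    size: int
    indices: List[int]
    values: List[float]


Row = Union[DenseRow, SparseRow]


def check_missing_value(rows: Iterable[Row], missing: float) -> None:
    """Raise if a non-zero missing value is combined with any sparse row."""
    # Zero as the missing value matches what the sparse encoding already assumes.
    # Assumption: NaN compares unequal to 0.0, so it counts as non-zero here.
    if missing == 0.0:
        return
    for row in rows:
        if isinstance(row, SparseRow):
            raise ValueError(
                f"missing={missing} is non-zero but the data contains a sparse "
                "vector, whose zero entries were already dropped"
            )


rows = [DenseRow([0.0, 1.0]), SparseRow(3, [1], [2.0])]
check_missing_value(rows, missing=0.0)  # passes: zero is the missing value
try:
    check_missing_value(rows, missing=-999.0)
except ValueError as err:
    print(err)
```

Failing fast is the design choice here: once the zeros have been dropped they cannot be recovered, so stopping is safer than training on silently altered data.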
@@           Coverage Diff           @@
##           master    #4349   +/-   ##
=======================================
  Coverage   67.84%   67.84%
=======================================
  Files         132      132
  Lines       12206    12206
=======================================
  Hits         8281     8281
  Misses       3925     3925
referenced this pull request
Apr 16, 2019
A small comment on the aim of the review, as I have not had enough time to look at the code in detail (and won't have time before next Tuesday).
@alois-bissuel thanks for the review.
To clarify the point of the PR and to help with further review, this PR is meant to prevent the following case: