Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jvm-packages] Tutorial on handling missing values #4425

Merged
merged 9 commits into from May 6, 2019

Conversation

Projects
None yet
3 participants
@Daniel8hen
Copy link
Contributor

commented Apr 30, 2019

add tutorial on missing values and how to handle those within XGBoost.

Update XGBoost Tutorial
add tutorial on missing values and how to handle those within XGBoost.

@hcho3 hcho3 requested a review from CodingCat Apr 30, 2019

Daniel8hen referenced this pull request Apr 30, 2019

[BREAKING][jvm-packages] fix the non-zero missing value handling (#4349)
* fix the nan and non-zero missing value handling

* fix nan handling part

* add missing value

* Update MissingValueHandlingSuite.scala

* Update MissingValueHandlingSuite.scala

* stylistic fix
@hcho3

hcho3 approved these changes Apr 30, 2019

@hcho3 hcho3 self-requested a review May 1, 2019

@hcho3
Copy link
Collaborator

left a comment

We also need to mention the case where 0's are meaningful values, i.e. 0 != missing

@Daniel8hen

This comment has been minimized.

Copy link
Contributor Author

commented May 1, 2019

Hey guys, I'm updating per your chnages and will ping once all is done for the 2nd review.
Thanks,
Daniel

update tutorial both Null case and 0 values
update tutorial both Null case and 0 values
@Daniel8hen

This comment has been minimized.

Copy link
Contributor Author

commented May 1, 2019

@hcho3 , @CodingCat I've updated the tutorial accordingly. Waiting for your input once again.

@Daniel8hen

This comment has been minimized.

Copy link
Contributor Author

commented May 1, 2019

We also need to mention the case where 0's are meaningful values, i.e. 0 != missing

@hcho3 note it was fixed and added.

@Daniel8hen

This comment has been minimized.

Copy link
Contributor Author

commented May 2, 2019

@hcho3 @CodingCat guys any update? What do you think about the last version? Thank you!

@hcho3

hcho3 approved these changes May 3, 2019

Copy link
Collaborator

left a comment

LGTM. But I'm not an expert in JVM packages, so I'd like to ask another person to look at this as well.

@CodingCat @alois-bissuel Can you look at this tutorial and see if the explanation is reasonable? I'd like to see this merged, because I see many people tripping over missing values.

@hcho3 hcho3 changed the title Update XGBoost Tutorial [jvm-packages] Tutorial on handling missing values May 3, 2019

Show resolved Hide resolved doc/jvm/xgboost4j_spark_tutorial.rst Outdated
Show resolved Hide resolved doc/jvm/xgboost4j_spark_tutorial.rst Outdated
@Daniel8hen

This comment has been minimized.

Copy link
Contributor Author

commented May 4, 2019

Thank you @hcho3 .

Show resolved Hide resolved doc/jvm/xgboost4j_spark_tutorial.rst Outdated
Show resolved Hide resolved doc/jvm/xgboost4j_spark_tutorial.rst Outdated
@CodingCat
Copy link
Member

left a comment

I left some suggestions inlined

Update dealing with missing values tutorial
Updated Per @CodingCat's notes the tutorial that will be relevant for dealing with NULL values.
(*) Skip VectorAssembler (using setHandleInvalid = "skip") directly. Used in (2), (3).
(*) Keep it (using setHandleInvalid = "keep"), and set the "missing" parameter in XGBClassifier/XGBRegressor as the value representing missing. Used in (2) and (4).
(*) Keep it (using setHandleInvalid = "keep") and transform to other irregular values. Used in (3).
(*) Transform it to other values. Used in (1).

This comment has been minimized.

Copy link
@CodingCat

CodingCat May 6, 2019

Member

you can keep your original description about why it is tricky with (1)

This comment has been minimized.

Copy link
@Daniel8hen

Daniel8hen May 6, 2019

Author Contributor

Never mind, already updated it and the message to the user is basically the same.

Daniel8hen added some commits May 6, 2019

final update - dealing with null values
final update - dealing with null values
update header for Dealing with missing values
update header for Dealing with missing values
@Daniel8hen

This comment has been minimized.

Copy link
Contributor Author

commented May 6, 2019

@CodingCat I'm done. Let me know if you see anything else, will be happy to fix that as well. Reviewd both header and content few times and saw it fits plus users understand. Thank you.

1. Skip VectorAssembler (using setHandleInvalid = "skip") directly. Used in (2), (3).
2. Keep it (using setHandleInvalid = "keep"), and set the "missing" parameter in XGBClassifier/XGBRegressor as the value representing missing. Used in (2) and (4).
3. Keep it (using setHandleInvalid = "keep") and transform to other irregular values. Used in (3).
4. Transform it to other values. Used in (1).

This comment has been minimized.

Copy link
@CodingCat

CodingCat May 6, 2019

Member

"due to the fact that Spark's VectorAssembler transformer only accepts 0 as the missing values. This issue creates a problem when the user has 0 as meaningful value, plus there are enough 0's to use SparseVector.

In case the dataset is represented by a DenseVector, the 0 is kept."

I mean this paragraph from your original commits is informative, you can keep it

This comment has been minimized.

Copy link
@Daniel8hen

Daniel8hen May 6, 2019

Author Contributor

ah I see. I will add it as well then.

This comment has been minimized.

Copy link
@CodingCat

CodingCat May 6, 2019

Member

I changed accordingly based on your current structure

This comment has been minimized.

Copy link
@Daniel8hen

Daniel8hen May 6, 2019

Author Contributor

OK, thanks for the changes from your side as well. When expected to see it in tutorial, @CodingCat ?
Thank you

This comment has been minimized.

Copy link
@hcho3

hcho3 May 6, 2019

Collaborator

The tutorial will be available right after this PR is merged.

This comment has been minimized.

Copy link
@Daniel8hen

Daniel8hen May 6, 2019

Author Contributor

Thank u @hcho3 . I saw failures on Travis CI - should I ignore it for now @CodingCat ?

This comment has been minimized.

Copy link
@hcho3

hcho3 May 6, 2019

Collaborator

@Daniel8hen The failed test is most likely due to flaky internet connection. I re-started the test.

This comment has been minimized.

Copy link
@Daniel8hen

Daniel8hen May 6, 2019

Author Contributor

Thx @hcho3 . So merge will be automatically once all tests and builds are done?

This comment has been minimized.

Copy link
@hcho3

hcho3 May 6, 2019

Collaborator

Yes, I'll merge it. @CodingCat Are we good with what we have here now?

This comment has been minimized.

Copy link
@CodingCat

CodingCat May 6, 2019

Member

yes, thanks

Daniel8hen and others added some commits May 6, 2019

update theory and background - missing values
update theory and background - missing values

@hcho3 hcho3 merged commit eabcc0e into dmlc:master May 6, 2019

6 of 7 checks passed

continuous-integration/appveyor/pr Waiting for AppVeyor build to complete
Details
Build Stage built successfully
Details
Formatting Check Stage built successfully
Details
Get sources Stage built successfully
Details
Test Stage built successfully
Details
continuous-integration/jenkins/pr-merge This commit looks good
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@hcho3

This comment has been minimized.

Copy link
Collaborator

commented May 6, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.