Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DMatrix is now an AbstractMatrix #136

Merged
merged 4 commits into from
Nov 18, 2022
Merged

Conversation

ExpandingMan
Copy link
Collaborator

This is added by virtue of the new XGDMatrixGetDataCSR method in libxgboost 1.7 which is now required. Unit tests for DMatrix constructors have been vastly improved as they now check that the matrices were actually constructed correctly rather than merely having the same shape.

The behavior I'm seeing in the presence of null values doesn't make sense to me, so this issue is going to have to be resolved before this gets merged.

@ExpandingMan
Copy link
Collaborator Author

I'm admittedly still somewhat confused here, so confirmation of what I'm about to say would be appreciated.

Following the discussion here, it seems that our existing sparse matrix DMatrix constructor was invalid because, while the Julia SparseMatrixCSC necessarily uses 0 as a default value, libxgboost effectively considers these values to be missing.

Furthermore, I've had to "hack" the getindex(::SparseMatrixCSR, ::Any) method to return a missing instead of a 0 as a default value. This allows for the possibility that if a user sets their own missing_value argument (as opposed to the default I've provided) they may construct a DMatrix with distinct values from the matrix they used to construct it. This does not seem ideal. The only alternatives to this I can think of would be:

  1. Do nothing other than add a documentation warning that if users set missing_value this can happen.
  2. Disallow users to set missing_value.
  3. Have the Julia DMatrix struct store a settable default value.

Of these I favor 1 as it seems like by far the simplest approach and does not compromise the utility of the DMatrix object. At this point I'm not even sure 3 is valid, I think probably not as it seems increasingly to me like libxgboost considers those values missing.

cc @ablaom @trivialfis

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there were discussions on XGBoost's side to restore 0s from sparse matrix input, but we weren't quite sure whether it's necessary for a ML library to perform these types of data manipulation (which is not trivial in the existing code base). It seems the argument is getting stronger.

@ExpandingMan
Copy link
Collaborator Author

I can't really say how necessary it is, in most cases whatever data was used to construct the DMatrix is probably still around somewhere since the DMatrix is acting very much like an internal libxgboost library object. Indeed, in the Julia wrapper it is not necessary to pass a DMatrix anywhere, it will in all cases be automatically constructed from the provided AbstractMatrix.

My goal for this PR was to ensure that the DMatrix at least has correct values. Therefore, my latest changes in which I "hacked" the SparseMatrixCSR return to give missing instead of 0 seems to be correct.

I'll clean up docs and stuff and then this should be ready to merge, thanks for the clarification @trivialfis .

@ExpandingMan
Copy link
Collaborator Author

This PR should now be complete. I'll wait approximately 24 hours to merge it to give people a chance to object or comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants