DataFrame to DMatrix Conversion Spec #874

Closed
tqchen opened this issue Feb 25, 2016 · 1 comment

@tqchen
Member

tqchen commented Feb 25, 2016

This is a centralized issue specifying how a dataframe (pandas, R's data.frame) can be converted into a DMatrix. Dataframes are a useful data source. Having such a specification would allow direct data ingestion from a dataframe, avoid memory-copy issues, and possibly ease external-memory integration.

Currently this is straightforward for continuous features. It is less obvious how to do it for categorical features and sparse input.

Goal

Let us not aim to do complicated things, such as automatically indexing all the factors (categorical features) or accepting string input types.

Instead, have a _minimum_ specification of how to represent sparse input and categorical features, so that they can be quickly converted to a sparse matrix type. Let the dataframe solutions do jobs such as feature engineering.

Example Proposal 1

All the categorical columns must already be mapped to unique integers, with a disjoint range per column. So column C1 will be in [0, n) and column C2 will be in [n, n+m), where n is the number of unique categories in C1 and m is the number of unique categories in C2.
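A minimal sketch of what Proposal 1 could look like on the conversion side, assuming the categorical codes are meant to be used directly as one-hot feature indices in a CSR layout. That reading is an assumption, not something the proposal states, and `one_hot_csr` is a hypothetical helper rather than an existing XGBoost function. Plain Python lists stand in for dataframe columns.

```python
def one_hot_csr(encoded_columns):
    """Build CSR arrays (indptr, indices, data) for a one-hot matrix.

    encoded_columns: list of equal-length integer lists, one per categorical
    column. Under Proposal 1 the value ranges of different columns are
    assumed to be disjoint, so each code is already a global feature index.
    """
    num_rows = len(encoded_columns[0])
    indptr, indices, data = [0], [], []
    for r in range(num_rows):
        for col in encoded_columns:
            indices.append(col[r])  # the code is used as the feature index as-is
            data.append(1.0)
        indptr.append(len(indices))
    return indptr, indices, data


# C1 has n = 3 categories, so its codes lie in [0, 3).
c1 = [0, 2, 1, 0]
# C2 has m = 2 categories, so its codes lie in [3, 5).
c2 = [3, 4, 3, 4]

indptr, indices, data = one_hot_csr([c1, c2])
```

Under this reading the converter needs no per-column metadata, but the caller has to keep the offsets consistent across every dataframe it converts.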

Example Proposal 2

Map existing categorical columns into unique integers, with each column indexed independently: C1 will be in [0, n) and C2 will be in [0, m). When constructing the DMatrix, also pass the size of each column, [n, m], to the constructor.
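A minimal sketch of Proposal 2 under the same one-hot assumption: the caller passes locally indexed columns plus the per-column sizes, and the converter derives the offsets itself. The names `offsets_from_sizes` and `one_hot_csr_from_local` are hypothetical and do not correspond to any existing DMatrix constructor argument.

```python
from itertools import accumulate


def offsets_from_sizes(sizes):
    """Return the starting feature index of each column: [0, n, n + m, ...]."""
    return [0] + list(accumulate(sizes))[:-1]


def one_hot_csr_from_local(local_columns, sizes):
    """Build CSR arrays from per-column codes in [0, size) and their sizes."""
    offsets = offsets_from_sizes(sizes)
    # Reject codes that fall outside the declared size of their column.
    for col, size in zip(local_columns, sizes):
        if any(code < 0 or code >= size for code in col):
            raise ValueError("category code outside [0, size)")
    num_rows = len(local_columns[0])
    indptr, indices, data = [0], [], []
    for r in range(num_rows):
        for col, off in zip(local_columns, offsets):
            indices.append(off + col[r])  # shift the local code into its global range
            data.append(1.0)
        indptr.append(len(indices))
    return indptr, indices, data


# C1 in [0, 3) and C2 in [0, 2); sizes [n, m] = [3, 2] are passed alongside.
local_c1 = [0, 2, 1, 0]
local_c2 = [0, 1, 0, 1]

p2_indptr, p2_indices, p2_data = one_hot_csr_from_local([local_c1, local_c2], [3, 2])
```

Compared with Proposal 1, the explicit sizes let the converter validate the codes and fix the total feature count even when some categories are absent from a given batch.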

@tqchen tqchen mentioned this issue Feb 25, 2016
@khotilov
Member

Could Feather/Arrow be of any use here?
https://blog.rstudio.org/2016/03/29/feather/
It's supposed to be a light, language-agnostic, fast, and data frame-friendly format.

@tqchen tqchen closed this as completed Jul 4, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Oct 25, 2018