Wrong target! #54

Closed
daniel0710goldberg opened this issue Jun 8, 2020 · 3 comments

@daniel0710goldberg
Collaborator

daniel0710goldberg commented Jun 8, 2020

The current target is a train/test selector, not a dependent variable.
According to OpenML, most analyses that used this dataset properly treat the 6th feature, "Drinks", as the target.

The PMLB dataset should be changed to reflect this: treat the 6th feature, currently "Drinks", as the new target, and remove the 7th feature.

P.S. The metadata.yaml file already reflects this change, apart from a TODO that needs to be removed.
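
A minimal sketch of the proposed change in pandas, assuming the columns arrive in the original order with the selector last and currently labelled `target`. The column names and the three toy rows are illustrative assumptions, not the actual PMLB file, and `retarget` is a hypothetical helper.

```python
import pandas as pd

# Toy frame standing in for the dataset; names and values are made up.
df = pd.DataFrame({
    "mcv": [85, 85, 86],
    "alkphos": [92, 64, 54],
    "sgpt": [45, 59, 33],
    "sgot": [27, 32, 16],
    "gammagt": [31, 23, 54],
    "drinks": [0.0, 0.0, 0.5],
    "target": [1, 2, 2],  # currently the train/test selector
})


def retarget(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop the selector (7th column) and make the 6th column the target."""
    selector_col = frame.columns[6]
    drinks_col = frame.columns[5]
    # Drop first so the rename does not collide with the old "target" column.
    return frame.drop(columns=[selector_col]).rename(columns={drinks_col: "target"})


fixed = retarget(df)
```

Selecting the columns by position rather than by name keeps the sketch independent of how the file happens to label them.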

@weixuanfu
Contributor

Thank you for catching this error!
We need to discuss whether we should keep this train/test selector in the dataset or not. If we keep the selector, the metafeatures need to annotate it, and pmlb should ignore the column based on that annotation.

Besides that, I think it should still be a classification task rather than a regression task because:

  1. drinks has only 16 unique values in 345 samples.
  2. It was considered a classification benchmark from the beginning. From the important note on OpenML: Researchers who wish to use this dataset as a classification benchmark should follow the method used in experiments by the donor (Forsyth & Rada, 1986, Machine learning: applications in expert systems and information retrieval).
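
The first point is easy to check with pandas. This is a rough sketch on a made-up `drinks` column of seven values, not the real 345-row data, so the numbers it produces are only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for the "drinks" column; values are invented.
drinks = pd.Series([0.0, 0.5, 0.5, 1.0, 3.0, 3.0, 6.0], name="drinks")

n_unique = drinks.nunique()                      # number of distinct values
n_samples = len(drinks)                          # number of rows
counts = drinks.value_counts().sort_index()      # rows per distinct value
```

A low ratio of `n_unique` to `n_samples` suggests the column can be handled as a set of classes, although it does not by itself settle classification versus regression.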

@trangdata
Collaborator

I think we should remove the train/test selector column completely to avoid serious downstream errors. Plus, this column (despite being "not entirely random") does not add meaningful information to the data.

This survey (pdf) discusses the papers listed on the UCI page that cite the dataset. Many studies used it incorrectly. A few other studies (Turney, Tang et al.) dichotomized the 6th variable (number of alcoholic drinks) using 3 as the numeric threshold (x6 < 3 vs x6 >= 3). If we want to keep the task as classification, I think we should use this threshold and note in the metadata description of the target how it is dichotomized.

(Note that the original study used different threshold values in different experiments.)

Alternatively, we can have both datasets bupa_class and bupa_reg for two different tasks.
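
A sketch of how both variants could be derived from the same features, assuming the threshold of 3 mentioned above. The helper `make_variants`, the feature names, and the toy values are hypothetical; only the names `bupa_class` and `bupa_reg` come from the suggestion itself.

```python
import pandas as pd

THRESHOLD = 3  # cut-off discussed above: x6 < 3 vs x6 >= 3


def make_variants(features: pd.DataFrame, drinks: pd.Series):
    """Return (classification, regression) frames sharing the same features."""
    # Regression variant keeps the raw number of drinks as the target.
    reg = features.assign(target=drinks)
    # Classification variant encodes drinks >= THRESHOLD as 1, otherwise 0.
    cls = features.assign(target=(drinks >= THRESHOLD).astype(int))
    return cls, reg


# Toy inputs; the real dataset has five feature columns and 345 rows.
features = pd.DataFrame({"mcv": [85, 86, 91], "sgpt": [45, 33, 78]})
drinks = pd.Series([0.5, 3.0, 6.0])
bupa_class, bupa_reg = make_variants(features, drinks)
```

Deriving both frames from one source keeps the features identical across the two tasks, so only the target definition differs.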

@weixuanfu
Contributor

I fixed this issue with the commits above.

weixuanfu pushed a commit that referenced this issue Jul 6, 2020