Wrong target! #54

daniel0710goldberg · 2020-06-08T15:48:23Z

The current target is a train/test selector, not a dependent variable.
According to OpenML, most analyses that were done properly on this dataset use the 6th feature, "Drinks", as the target.

The PMLB dataset should be changed to reflect this: treat the 6th feature, currently "Drinks", as the new target, and remove the 7th feature.
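A minimal sketch of the proposed change, using only the standard library. The column names below are assumptions based on the UCI description of the dataset, the `retarget` helper is hypothetical, and the rows are made-up values for illustration only. The actual PMLB file layout may differ.

```python
# Assumed column order, following the UCI description of the BUPA data.
COLUMNS = ["mcv", "alkphos", "sgpt", "sgot", "gammagt", "drinks", "selector"]


def retarget(rows):
    """Use the 6th feature as the target and drop the 7th (train/test selector)."""
    fixed = []
    for row in rows:
        record = dict(zip(COLUMNS, row))
        record.pop("selector")                   # remove the train/test selector
        record["target"] = record.pop("drinks")  # 6th feature becomes the target
        fixed.append(record)
    return fixed


# Made-up rows for illustration only, not real records from the dataset.
toy_rows = [
    [85, 92, 45, 27, 31, 0.0, 1],
    [87, 71, 32, 19, 27, 4.0, 2],
]
fixed_rows = retarget(toy_rows)
print(fixed_rows[0])
```

After this step each record keeps the five remaining features plus a `target` field, and the selector no longer appears anywhere in the data.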

P.S. The metadata.yaml file already reflects this change, apart from a TODO that needs to be removed.


weixuanfu · 2020-06-09T13:49:42Z

Thank you for catching this error!
We need to discuss whether we should keep this train/test selector in the dataset or not. If we keep the selector, then the metafeatures need to annotate that, and pmlb should ignore the column based on it.

Besides that, I think it should still be a classification task instead of a regression one because:

drinks has 16 unique values in 345 samples
It was considered a classification benchmark from the beginning (from the important note in OpenML: "Researchers who wish to use this dataset as a classification benchmark should follow the method used in experiments by the donor (Forsyth & Rada, 1986, Machine learning: applications in expert systems and information retrieval)")

trangdata · 2020-06-10T14:12:32Z

I think we should remove the train/test selector column completely to avoid serious downstream errors. Plus, this column (despite being "not entirely random") does not add meaningful information to the data.

This survey (pdf) discusses the papers listed on the UCI page that cited the dataset. Many studies used it incorrectly. A few other studies (Turney, Tang et al.) dichotomize the 6th variable (number of alcoholic drinks) using 3 as the numeric threshold (x6 < 3 vs x6 >= 3). If we want to keep the task as classification, I think we should use this threshold and note in the metadata description of the target how it's dichotomized.

(Note that the original study used different threshold values in different experiments.)
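For concreteness, the thresholding described above could look roughly like this. The `dichotomize` helper, the 0/1 class encoding, and the sample values are assumptions for illustration and are not taken from the dataset or from the cited studies.

```python
def dichotomize(drinks, threshold=3):
    """Map each drinks value to class 0 (x6 < threshold) or class 1 (x6 >= threshold)."""
    return [0 if x < threshold else 1 for x in drinks]


# Made-up values for illustration only.
toy_drinks = [0.0, 0.5, 2.0, 3.0, 6.0]
labels = dichotomize(toy_drinks)
print(labels)
```

Keeping the threshold as a parameter makes it easy to record the chosen value in the metadata and to reproduce experiments that used a different cut-off.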

Alternatively, we can have both datasets bupa_class and bupa_reg for two different tasks.

weixuanfu · 2020-07-06T20:03:42Z

I fixed this issue with commits above.

trangdata assigned weixuanfu Jun 8, 2020

trangdata mentioned this issue Jul 6, 2020

Add metadata to bupa, read description! #53

Merged

weixuanfu pushed a commit that referenced this issue Jul 6, 2020

edit bupa to classificatin dataset #54

0e8929b

weixuanfu pushed a commit that referenced this issue Jul 6, 2020

update README.nd #54

20c1a73

weixuanfu closed this as completed Jul 6, 2020

weixuanfu pushed a commit that referenced this issue Jul 6, 2020

fix encoding #54

6801562
