Include target column name and class labels in predict/predict_proba output #645

Closed
dsherry opened this issue Apr 15, 2020 · 4 comments · Fixed by #951
Labels
enhancement An improvement to an existing feature.


dsherry commented Apr 15, 2020

#236 covers standardizing our predict methods to return pd.Series or pd.DataFrame.

A note on this from the usability blitz: Once #236 is done, we need to populate the name field on pd.Series or column name on pd.DataFrame with the users' target column name.

And, for predict_proba output, which can have n columns, one for each class value in the target, we should label each of those with the corresponding class value.
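A minimal sketch of what the labeled output could look like, using plain pandas. The helper `label_outputs`, the target name `"diagnosis"`, and the class labels are hypothetical and only for illustration; this is not the library's implementation.

```python
import numpy as np
import pandas as pd


def label_outputs(pred_codes, proba, target_name, class_labels):
    """Attach the target name and class labels to raw model outputs.

    Hypothetical helper for illustration; not part of any library API.
    """
    class_labels = list(class_labels)
    # Map integer class codes back to the original class values and
    # carry the target column name on the Series.
    predictions = pd.Series(
        [class_labels[code] for code in pred_codes], name=target_name
    )
    # One probability column per class, labeled with the class value.
    probabilities = pd.DataFrame(proba, columns=class_labels)
    return predictions, probabilities


# Made-up raw outputs: integer class codes and per-class probabilities.
pred_codes = np.array([0, 1, 1])
proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]])

predictions, probabilities = label_outputs(
    pred_codes,
    proba,
    target_name="diagnosis",
    class_labels=["benign", "malignant"],
)
```

With this shape, `predictions.name` carries the target column name and `probabilities.columns` carries the class values, so downstream code can select a class's probability by label rather than by position.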


dsherry commented Jun 25, 2020

First question here: do we support categorical-type inputs for classification today? Or do we expect classification inputs to be int?

If we don't support categorical-type, then this is blocked on #215 (supporting string-typed targets) and the scope of #215 should be expanded to include supporting categoricals.

@dsherry dsherry changed the title Include target column name and class labels in returned pd.Series/pd.DataFrame from predict/predict_proba Include target column name and class labels in predict/predict_proba output Jun 25, 2020
@dsherry dsherry changed the title Include target column name and class labels in predict/predict_proba output Include target column name / class labels in predict/predict_proba output Jun 25, 2020
@dsherry dsherry changed the title Include target column name / class labels in predict/predict_proba output Include target column name and class labels in predict/predict_proba output Jun 25, 2020
angela97lin commented

Did some testing, and it looks like we don't support categorical dtypes, and expect numerical target values. More specifically:

  • We don't support categorical columns with string-like values (e.g., y has a categorical dtype but the categories are “benign” and “malignant”; I think if the categories are [0, 1] this is fine)
  • For binary classification, it seems like any numerical target values other than 0/1 will error (e.g., [0, 5] or even [False, True])

Here are some snippets of code I used to test, primarily using sklearn's datasets:

For binary classification:

import pandas as pd
from evalml import AutoClassificationSearch
from sklearn.datasets import load_breast_cancer

# Not the most efficient, but an easy way to grab X as a DataFrame
X, _ = load_breast_cancer(return_X_y=True, as_frame=True)
cancer = load_breast_cancer()
# Map the integer target codes to their string class names, as a categorical
y = pd.Categorical(pd.Series(cancer.target).map(lambda x: cancer.target_names[x]))
automl = AutoClassificationSearch(max_pipelines=5)
automl.search(X, y)

For multiclass classification:

import pandas as pd
from evalml import AutoClassificationSearch
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True, as_frame=True)
iris = load_iris()
# Map the integer target codes to their string class names, as a categorical
y = pd.Categorical(pd.Series(iris.target).map(lambda x: iris.target_names[x]))
automl = AutoClassificationSearch(max_pipelines=5, objective='log_loss_multi', multiclass=True)
automl.search(X, y)

In both cases, we get the same error as reported in #828, because our label leakage check can't handle non-numeric datatypes. But even when we skip data checks using automl.search(X, y, data_checks=EmptyDataChecks()), errors are still logged for scoring objectives:

Error in PipelineBase.score while scoring objective F1 Weighted: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
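The `isnan` message above is what NumPy raises when `np.isnan` is applied to non-numeric values. The following sketch reproduces that failure outside of any pipeline and shows one possible workaround, encoding the categorical target as integer codes while keeping the labels for later. The sample labels are made up, and this is an illustration under those assumptions, not the fix adopted in the library.

```python
import numpy as np
import pandas as pd

# A string-valued categorical target, similar to the test cases above.
y = pd.Categorical(["benign", "malignant", "benign"])

# np.isnan is only defined for numeric dtypes, so string labels raise TypeError.
try:
    np.isnan(np.asarray(y))
    isnan_failed = False
except TypeError:
    isnan_failed = True

# Possible workaround: use the integer category codes as the target
# and keep the category labels to map predictions back afterwards.
y_encoded = pd.Series(y.codes)
labels = list(y.categories)

# The encoded target is numeric, so the NaN check now runs without error.
has_nan = bool(np.isnan(y_encoded.to_numpy()).any())
```

Keeping `labels` alongside the encoded target is what would later allow predictions to be reported with the original class values.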


dsherry commented Jun 25, 2020

@angela97lin I just added this to the epic #886 for the July milestone. Should we move this back to Development Backlog? And I'd say please either unassign yourself from this issue, or assign yourself to the epic #886 if you want to tackle it!

@dsherry dsherry modified the milestones: June 2020, July 2020 Jun 25, 2020

dsherry commented Jun 25, 2020

(forgot to post from this morning:)

@angela97lin thank you for the thorough explanation!

So, for classification, we don't support:

In the good news category, we may currently support boolean type, but we need to verify and unit-test that.

Well, it seems like we should fix this! 😂

I just filed epic #886 for this, and moved this and #215 into that epic. Let's get this done for the July release.
