Include target column name and class labels in predict/predict_proba output #645

dsherry · 2020-04-15T04:19:57Z

#236 covers standardizing our predict methods to return pd.Series or pd.DataFrame.

A note on this from the usability blitz: Once #236 is done, we need to populate the name field on pd.Series or column name on pd.DataFrame with the users' target column name.

And, for predict_proba output, which can have n columns, one for each class value in the target, we should label each of those with the corresponding class value.
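
To make the ask concrete, here is a minimal sketch of the labeled output described above. It is not evalml's implementation: `label_predictions`, the class list, and the target name `"diagnosis"` are hypothetical and used only for illustration, and the raw predictions are assumed to be integer class indices.

```python
import numpy as np
import pandas as pd


def label_predictions(raw_preds, raw_probas, classes, target_name):
    """Wrap raw model output in pandas objects that carry the target's labels.

    Hypothetical helper for illustration only; not part of evalml.
    """
    classes = np.asarray(classes)
    # predict: map integer class indices back to class values and
    # set the Series name to the user's target column name.
    preds = pd.Series(classes[np.asarray(raw_preds)], name=target_name)
    # predict_proba: one column per class, labeled with the class value.
    probas = pd.DataFrame(np.asarray(raw_probas), columns=classes.tolist())
    return preds, probas


preds, probas = label_predictions(
    raw_preds=[1, 0, 1],
    raw_probas=[[0.2, 0.8], [0.9, 0.1], [0.4, 0.6]],
    classes=["benign", "malignant"],
    target_name="diagnosis",
)
print(preds.name)
print(list(probas.columns))
```

With this shape, a user can read `probas["malignant"]` directly instead of having to remember which column index corresponds to which class.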

dsherry · 2020-06-25T14:45:27Z

First question here: do we support categorical-type inputs for classification today? Or do we expect classification inputs to be int?

If we don't support categorical-type, then this is blocked on #215 (supporting string-typed targets) and the scope of #215 should be expanded to include supporting categoricals.

angela97lin · 2020-06-25T16:08:44Z

Did some testing, and it looks like we don't support categorical dtypes, and expect numerical target values. More specifically:

- We don't support categorical columns whose categories are strings (e.g. y has a categorical dtype and the categories are “benign” and “malignant”). I think it's fine if the categories are [0, 1].
- For binary classification, it seems like any target values other than 0/1 will error (e.g. [0, 5] or even [False, True]).

Here's some snippets of code I used to test, primarily using sklearn's datasets:

For binary classification:

import pandas as pd
from evalml import AutoClassificationSearch
from sklearn.datasets import load_breast_cancer

X, _ = load_breast_cancer(return_X_y=True, as_frame=True) # lol yes not the most efficient but wanted to grab X
cancer = load_breast_cancer()
# Map the integer target to its string names and make it a categorical
y = pd.Categorical(pd.Series(cancer.target).map(lambda x: cancer.target_names[x]))
automl = AutoClassificationSearch(max_pipelines=5)
automl.search(X, y)

For multiclass classification:

import pandas as pd
from evalml import AutoClassificationSearch
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True, as_frame=True)
iris = load_iris()
# Map the integer target to its string names and make it a categorical
y = pd.Categorical(pd.Series(iris.target).map(lambda x: iris.target_names[x]))
automl = AutoClassificationSearch(max_pipelines=5, objective='log_loss_multi', multiclass=True)
automl.search(X, y)

In both cases, we get the same error as reported in #828, because our check for label leakage can't handle non-numeric datatypes. But even when we skip data checks using automl.search(X, y, data_checks=EmptyDataChecks()), we still log errors when scoring objectives:

Error in PipelineBase.score while scoring objective F1 Weighted: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
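
The same message can be reproduced outside evalml. The sketch below assumes, without having confirmed it in the scoring code, that string labels reach a `np.isnan` call somewhere in the scoring path. It only shows that `np.isnan` rejects object arrays of strings, whereas `pd.isnull` accepts them.

```python
import numpy as np
import pandas as pd

# String labels stored in an object array, as a string/categorical target would be.
y_str = np.array(["benign", "malignant", "benign"], dtype=object)

try:
    np.isnan(y_str)
    isnan_failed = False
except TypeError as err:
    # NumPy has no isnan loop for object/string inputs, so this raises.
    isnan_failed = True
    print(err)

# pd.isnull is dtype-agnostic and works on the same array.
missing = pd.isnull(y_str)
print(missing)
```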

dsherry · 2020-06-25T20:11:27Z

@angela97lin I just added this to the epic #886 for the July milestone. Should we move this back to Development Backlog? And I'd say please either unassign yourself from this issue, or assign yourself to the epic #886 if you want to tackle it!

dsherry · 2020-06-25T20:22:27Z

(forgot to post from this morning:)

@angela97lin thank you for the thorough explanation!

So, for classification, we don't support:

- targets of string type (#215, "Support string targets for binary and multiclass problems")
- targets of categorical type
- targets which are anything other than ordered integers ranging from 0 to n-1

In the good news category, we may currently support boolean type, but we need to verify and unit-test that.
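
For reference, the "ordered integers ranging from 0 to n-1" constraint and one possible way around it can be sketched as follows. Both helpers are hypothetical names for illustration, not evalml functions. The encoding uses `pd.factorize`, which is just one option; the thread does not say which approach the fix will take.

```python
import pandas as pd


def is_zero_based_int_target(y):
    """Return True if y is integer-typed and its unique values are exactly 0..n-1."""
    y = pd.Series(y)
    if not pd.api.types.is_integer_dtype(y):
        return False
    uniques = sorted(y.unique())
    return uniques == list(range(len(uniques)))


def encode_target(y):
    """Map arbitrary labels to codes 0..n-1; return the codes and the class values.

    classes[i] is the original label for code i, so predictions can be decoded later.
    """
    codes, classes = pd.factorize(pd.Series(y), sort=True)
    return pd.Series(codes), list(classes)


print(is_zero_based_int_target([0, 1, 2, 1]))   # already in the supported form
print(is_zero_based_int_target([0, 5]))         # integers, but not 0..n-1
print(is_zero_based_int_target([False, True]))  # boolean dtype, not integer

codes, classes = encode_target(["malignant", "benign", "malignant"])
print(codes.tolist(), classes)
```

Keeping `classes` around is what would let predict return the original labels and predict_proba label its columns with them.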

Well, it seems like we should fix this! 😂

I just filed epic #886 for this, and moved this and #215 into that epic. Let's get this done for the July release.

dsherry added the enhancement label Apr 15, 2020

This was referenced May 27, 2020

Add baseline models for a given dataset #746

Merged

Standardize component outputs to be either dataframes or series #236

Closed

dsherry added this to the June 2020 milestone May 27, 2020

dsherry mentioned this issue Jun 3, 2020

Adds ROC multi-class plotting capability #832

Merged

dsherry mentioned this issue Jun 17, 2020

Standardize component predict/predict_proba in/out to pandas series/dataframe #853

Merged

angela97lin self-assigned this Jun 24, 2020

dsherry changed the title ~~Include target column name and class labels in returned pd.Series/pd.DataFrame from predict/predict_proba~~ Include target column name and class labels in predict/predict_proba output Jun 25, 2020

dsherry changed the title ~~Include target column name and class labels in predict/predict_proba output~~ Include target column name / class labels in predict/predict_proba output Jun 25, 2020

dsherry changed the title ~~Include target column name / class labels in predict/predict_proba output~~ Include target column name and class labels in predict/predict_proba output Jun 25, 2020

dsherry mentioned this issue Jun 25, 2020

TypeError in detect_label_leakage #828

Closed

dsherry modified the milestones: June 2020, July 2020 Jun 25, 2020

angela97lin removed their assignment Jun 26, 2020

dsherry assigned angela97lin Jun 30, 2020

angela97lin mentioned this issue Jul 8, 2020

Support targets of any datatype #886

Closed

dsherry mentioned this issue Jul 15, 2020

Support string / categorical targets for binary and multiclass problems #932

Merged

This was referenced Jul 20, 2020

Include target column name and class labels in predict/predict_proba output #950

Closed

Include target column name and class labels in predict/predict_proba output #951

Merged

dsherry mentioned this issue Jul 23, 2020

Running AutoML on Iris Dataset Fails #966

Closed

angela97lin closed this as completed in #951 Jul 23, 2020
