Improve inference of booleans to handle string representations #153

gsheni · 2020-09-25T21:50:38Z

To improve the boolean inference, we can say that if rows value counts fall into:
- [1, True, "true", "True", "yes", "t", "T"]
- [0, False, "false", "False", "no", "f", "F"]
- --> we should infer the Logical Type to be Boolean.
This would prevent the following weird inference, where both columns are inferred to be categorical.

The text was updated successfully, but these errors were encountered:

gsheni · 2020-09-25T21:59:14Z

Relates to

Add better boolean inference #52

dsherry · 2020-11-05T15:00:22Z

@gsheni can you explain why the example you gave is currently inferred as categorical? I wonder if this should get classified as a bug instead, and the fix is simply to set np.nan/pd.NA values aside when we do type inference. Because if you did that, the only remaining value in your example is True.

Yeah, if I follow this right, my suggestion is:

Close this issue
File a bug to track fixing the specific example you gave and attach it to the type inference epic
File a new feature issue to track "Use sampling during inference" and attach it to the type inference epic

gsheni · 2020-11-05T21:58:07Z

@dsherry My example wasn't clear enough. Let say we had we had some Data Columns like this:

[1, 0, 1, 1]
["true", "false", "true", "true"]
["True", "False", "True", "True"]
["yes", "no", "yes", "yes"]
["t", "f", "t", "t"]
["T", "F", "T", "T"]

All of DataColumns should be inferred with the Boolean Logical Type, and converted to the following representation (pd.BooleanDtype).

[True, False, True, True]

If there is np.nan/pd.NA in the column, it should be ignored when inferring the Logical Type.

dsherry · 2020-11-06T23:28:33Z

Got it. So these

[True, False, True, True, np.nan]
[True, False, True, True, pd.NA]

would also end up as boolean logical type, converted to pd.BooleanDtype resulting in

[True, False, True, True, pd.NA]

yes?

It occurs to me we'll want the same nan-tolerant behavior when we infer any type, not just booleans, right? Are there other types which we need to address right now? Whoever picks this up, please look into that / add test coverage to look into that :)

gsheni · 2020-11-08T16:24:25Z

@dsherry Yes, we want the NaNs converted properly for Boolean Logical Types.

Though, this issue is more about converting string representations of boolean:

["true", "false", "true", "true"]
["True", "False", "True", "True"]
["yes", "no", "yes", "yes"]
["t", "f", "t", "t"]

thehomebrewnerd · 2021-04-16T20:03:53Z

If we update the inference of booleans to identify series such as [1, 0, 1, 1] a Boolean logical type, this series will also be a match for the Integer logical type. Both of these inference functions will call series.isnull().any() separately, which could be inefficient for large datasets. As part of the implementation, it would be good to update so this call only needs to happen once, if possible.

See PR #830 for additional context.

gsheni mentioned this issue Sep 25, 2020

Add tests to confirm dtypes are as expected after dt creation #152

Merged

gsheni added the new feature suggestions for new functionality label Sep 29, 2020

dsherry changed the title ~~Add sampling logic to inference code, and look at first X unique values to determine Logical Type~~ Type inference: use sampling for performance speedup Nov 5, 2020

dsherry changed the title ~~Type inference: use sampling for performance speedup~~ Logical type inference: use sampling for performance speedup Nov 5, 2020

dsherry mentioned this issue Nov 5, 2020

Add better boolean inference #52

Closed

gsheni changed the title ~~Logical type inference: use sampling for performance speedup~~ Add better boolean inference Nov 5, 2020

dsherry changed the title ~~Add better boolean inference~~ Improve inference of booleans to handle nans Nov 6, 2020

dsherry mentioned this issue Nov 6, 2020

Ensure nans in input data don't break type inference #53

Closed

gsheni mentioned this issue Mar 16, 2021

Support converting to Boolean type if column contains yes/no representation #714

Closed

thehomebrewnerd mentioned this issue Apr 15, 2021

Add BooleanNullable and IntegerNullable logical types #830

Merged

gsheni changed the title ~~Improve inference of booleans to handle nans~~ Improve inference of booleans to handle string representations May 21, 2021

gsheni mentioned this issue Oct 27, 2022

Add Boolean Inference #1549

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve inference of booleans to handle string representations #153

Improve inference of booleans to handle string representations #153

gsheni commented Sep 25, 2020 •

edited

Loading

gsheni commented Sep 25, 2020

dsherry commented Nov 5, 2020

gsheni commented Nov 5, 2020

dsherry commented Nov 6, 2020 •

edited

Loading

gsheni commented Nov 8, 2020

thehomebrewnerd commented Apr 16, 2021

Improve inference of booleans to handle string representations #153

Improve inference of booleans to handle string representations #153

Comments

gsheni commented Sep 25, 2020 • edited Loading

gsheni commented Sep 25, 2020

dsherry commented Nov 5, 2020

gsheni commented Nov 5, 2020

dsherry commented Nov 6, 2020 • edited Loading

gsheni commented Nov 8, 2020

thehomebrewnerd commented Apr 16, 2021

gsheni commented Sep 25, 2020 •

edited

Loading

dsherry commented Nov 6, 2020 •

edited

Loading