Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve inference of booleans to handle string representations #153

Open
gsheni opened this issue Sep 25, 2020 · 6 comments
Open

Improve inference of booleans to handle string representations #153

gsheni opened this issue Sep 25, 2020 · 6 comments
Labels
new feature suggestions for new functionality

Comments

@gsheni
Copy link
Contributor

gsheni commented Sep 25, 2020

  • To improve the boolean inference, we can say that if rows value counts fall into:
    • [1, True, "true", "True", "yes", "t", "T"]
    • [0, False, "false", "False", "no", "f", "F"]
    • --> we should infer the Logical Type to be Boolean.
  • This would prevent the following weird inference, where both columns are inferred to be categorical.
@gsheni
Copy link
Contributor Author

gsheni commented Sep 25, 2020

Relates to

@gsheni gsheni added the new feature suggestions for new functionality label Sep 29, 2020
@dsherry dsherry changed the title Add sampling logic to inference code, and look at first X unique values to determine Logical Type Type inference: use sampling for performance speedup Nov 5, 2020
@dsherry dsherry changed the title Type inference: use sampling for performance speedup Logical type inference: use sampling for performance speedup Nov 5, 2020
@dsherry
Copy link

dsherry commented Nov 5, 2020

@gsheni can you explain why the example you gave is currently inferred as categorical? I wonder if this should get classified as a bug instead, and the fix is simply to set np.nan/pd.NA values aside when we do type inference. Because if you did that, the only remaining value in your example is True.

Yeah, if I follow this right, my suggestion is:

  • Close this issue
  • File a bug to track fixing the specific example you gave and attach it to the type inference epic
  • File a new feature issue to track "Use sampling during inference" and attach it to the type inference epic

@gsheni
Copy link
Contributor Author

gsheni commented Nov 5, 2020

@dsherry My example wasn't clear enough. Let say we had we had some Data Columns like this:

[1, 0, 1, 1]
["true", "false", "true", "true"]
["True", "False", "True", "True"]
["yes", "no", "yes", "yes"]
["t", "f", "t", "t"]
["T", "F", "T", "T"]

All of DataColumns should be inferred with the Boolean Logical Type, and converted to the following representation (pd.BooleanDtype).

[True, False, True, True]

If there is np.nan/pd.NA in the column, it should be ignored when inferring the Logical Type.

@gsheni gsheni changed the title Logical type inference: use sampling for performance speedup Add better boolean inference Nov 5, 2020
@dsherry
Copy link

dsherry commented Nov 6, 2020

Got it. So these

[True, False, True, True, np.nan]
[True, False, True, True, pd.NA]

would also end up as boolean logical type, converted to pd.BooleanDtype resulting in

[True, False, True, True, pd.NA]

yes?

It occurs to me we'll want the same nan-tolerant behavior when we infer any type, not just booleans, right? Are there other types which we need to address right now? Whoever picks this up, please look into that / add test coverage to look into that :)

@dsherry dsherry changed the title Add better boolean inference Improve inference of booleans to handle nans Nov 6, 2020
@gsheni
Copy link
Contributor Author

gsheni commented Nov 8, 2020

@dsherry Yes, we want the NaNs converted properly for Boolean Logical Types.

Though, this issue is more about converting string representations of boolean:

["true", "false", "true", "true"]
["True", "False", "True", "True"]
["yes", "no", "yes", "yes"]
["t", "f", "t", "t"]

@thehomebrewnerd
Copy link
Contributor

If we update the inference of booleans to identify series such as [1, 0, 1, 1] a Boolean logical type, this series will also be a match for the Integer logical type. Both of these inference functions will call series.isnull().any() separately, which could be inefficient for large datasets. As part of the implementation, it would be good to update so this call only needs to happen once, if possible.

See PR #830 for additional context.

@gsheni gsheni changed the title Improve inference of booleans to handle nans Improve inference of booleans to handle string representations May 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
new feature suggestions for new functionality
Projects
None yet
Development

No branches or pull requests

3 participants