[Feature] Infer string feature type from pandas 'object' dtype #3505

@athewsey

Describe the feature you'd like

FeatureGroup.load_feature_definitions() should default to the String feature type for columns with pandas dtype object, instead of raising ValueError: Failed to infer Feature type based on dtype object for column .... (Incidentally, why doesn't this method show up in the FeatureGroup API doc? It has a docstring.)

How would this feature be used? Please describe.

Per the Pandas doc, although Pandas does now have a string dtype and it's the preferred way to handle text data:

  1. It wasn't available before pandas v1.0
  2. For backward-compatibility, object remains the default dtype inferred when parsing lists of strings or reading CSVs.

For both of these reasons, users will commonly have dataframes with object columns that represent strings. IMO the process for explicitly converting columns to the new string dtype is non-obvious (see "Additional context" below), so it's a bit of a pain that this function can't just infer String by default for object columns.

Describe alternatives you've considered

  1. Leave as-is (users must figure out how to explicitly convert all their object dtype fields to a different dtype)
  2. Map object dtype to SMFS String during type inference
  3. (Preferred?) Map object dtype to SMFS String, but raise a warning (because theoretically an object dataframe field could be any Python object - just that it's very likely in practice to be text strings)
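
To make option 3 concrete, here is a minimal sketch of what "infer String with a warning" could look like. It is not the SDK's actual implementation: infer_feature_type is a hypothetical helper, and the returned names ("Integral", "Fractional", "String") are assumed here to mirror the Feature Store feature types. Only the pandas calls are real.

```python
import warnings

import pandas as pd


def infer_feature_type(column: pd.Series) -> str:
    """Hypothetical helper: map a pandas column to a feature type name."""
    dtype = column.dtype
    if pd.api.types.is_integer_dtype(dtype):
        return "Integral"
    if pd.api.types.is_float_dtype(dtype):
        return "Fractional"
    if isinstance(dtype, pd.StringDtype):
        # Dedicated string dtype: unambiguous, so no warning is needed.
        return "String"
    if pd.api.types.is_object_dtype(dtype):
        # object columns can hold any Python object, so warn rather than
        # silently assuming text.
        warnings.warn(
            f"Column {column.name!r} has dtype object; assuming String.",
            UserWarning,
        )
        return "String"
    raise ValueError(
        f"Failed to infer Feature type based on dtype {dtype} "
        f"for column {column.name}."
    )
```

With this sketch, an object column resolves to "String" and emits a UserWarning, while unsupported dtypes (datetimes, for example) still raise the existing ValueError.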

Additional context

My current workaround for explicitly setting object->str dtypes in Pandas 1.0+ is (as per here):

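import pandas as pd

# Convert every object-dtype column of an existing DataFrame `df` to the dedicated string dtype
for col in df:
    if pd.api.types.is_object_dtype(df[col].dtype):
        df[col] = df[col].astype(pd.StringDtype())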
