Description
Describe the feature you'd like
`FeatureGroup.load_feature_definitions()` (incidentally, why doesn't this method show on the FeatureGroup API doc? It has a docstring...) should default to the `String` data type for columns with pandas dtype `object`, instead of raising `ValueError: Failed to infer Feature type based on dtype object for column ...`.
How would this feature be used? Please describe.
Per the Pandas doc, although Pandas does now have a string dtype and it's the preferred way to handle text data:
- It wasn't available before pandas v1.0
- For backward compatibility, `object` remains the default dtype inferred when parsing lists of strings or reading CSVs.
For both of these reasons, users commonly have dataframes with `object` columns representing strings. IMO the process for explicitly converting columns to the new string dtype is non-obvious (see "Additional context" below), so it's a bit of a pain that this function can't just infer string by default for `object` columns.
Describe alternatives you've considered
- Leave as-is (users must figure out how to explicitly convert all their `object` dtype fields to a different dtype)
- Map `object` dtype to SMFS `String` during type inference
- (Preferred?) Map `object` dtype to SMFS `String`, but raise a warning (because theoretically an `object` dataframe field could hold any Python object; it's just very likely in practice to be text strings)
Additional context
My current workaround for explicitly setting object->str dtypes in Pandas 1.0+ is (as per here):
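import pandas as pd

# Convert every object-dtype column of an existing DataFrame `df` to the string dtype
for col in df:
    if pd.api.types.is_object_dtype(df[col].dtype):
        df[col] = df[col].astype(pd.StringDtype())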
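A slightly more cautious variant of the workaround, sketched here as an assumption rather than a recommendation from the SDK, converts only those `object` columns whose non-null values are all `str`. The helper name `convert_text_columns` is hypothetical; it relies on `pd.api.types.infer_dtype`, which inspects the values themselves.

```python
import pandas as pd


def convert_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with text-only object columns cast to the string dtype."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if not pd.api.types.is_object_dtype(series.dtype):
            continue
        # infer_dtype returns "string" only when every non-null value is a str
        if pd.api.types.infer_dtype(series, skipna=True) == "string":
            out[col] = series.astype(pd.StringDtype())
    return out


# Example frame: one text column, one genuinely mixed column, one integer column
df = pd.DataFrame({
    "name": pd.Series(["a", None, "c"], dtype=object),
    "mixed": pd.Series(["a", 1, 2.5], dtype=object),
    "count": [1, 2, 3],
})
converted = convert_text_columns(df)
```

Columns that mix strings with other Python objects keep their `object` dtype, so they would still need handling by hand before feature definitions can be inferred.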