Update variable type inference to better check for string values #683
@@ -39,20 +39,28 @@ def infer_variable_types(df, link_vars, variable_types, time_index, secondary_ti
else:
    inferred_type = vtypes.Numeric

elif variable in link_vars:
    inferred_type = vtypes.Categorical
link_vars should be categorical, not ordinal.
# heuristics to predict this as something other than categorical
sample = df[variable].sample(min(10000, len(df[variable])))

# catch cases where object dtype cannot be interpreted as a string
this is the core of the fix
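For readers following along, here is a minimal sketch of the kind of check this comment refers to, assuming the goal is to confirm that the sampled values of an object-dtype column really are Python strings. The helper name `looks_like_strings` and its `max_sample` parameter are hypothetical and not taken from the PR; only the 10000-row sample cap mirrors the diff above.

```python
import pandas as pd


def looks_like_strings(series, max_sample=10000):
    """Hypothetical helper: True if every sampled non-null value is a str.

    An object-dtype column can hold lists, dicts, or mixed values, so the
    dtype alone does not tell us the column can be treated as text.
    """
    values = series.dropna()
    if len(values) == 0:
        # Nothing to inspect, so do not claim the column is string-like.
        return False
    # Cap the work on large columns, as the diff does with min(10000, len(...)).
    sample = values.sample(min(max_sample, len(values)), random_state=0)
    return all(isinstance(v, str) for v in sample)
```

With this sketch, a column such as `pd.Series([["a"], {"b": 1}], dtype=object)` would be rejected even though its dtype is `object`, which is the situation the comment in the diff describes.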
    return False
else:
    return True
# if it can be cast to numeric, it's not a datetime
Addressed the TODO: columns that can be cast to numeric are now explicitly rejected as datetimes.
featuretools/utils/entity_utils.py (outdated)
# finally, try to cast to datetime
if col.dtype.name.find('str') > -1 or col.dtype.name.find('object') > -1:
    try:
        pd.to_datetime(col.dropna(), errors='raise')
Updated to try the datetime cast on all columns, not only str/object dtypes.
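As a rough illustration of the order of checks discussed in these comments (reject anything castable to numeric first, then attempt the datetime cast on any column), a sketch might look like the following. The function name `can_be_datetime`, the early return for columns that already have a datetime64 dtype, and the handling of empty columns are assumptions made for this example; the actual code in entity_utils.py may differ.

```python
import pandas as pd


def can_be_datetime(col):
    """Hypothetical helper: could this column be interpreted as datetimes?"""
    # Assumption: columns that already have a datetime64 dtype count as datetimes.
    if pd.api.types.is_datetime64_any_dtype(col):
        return True

    values = col.dropna()
    if values.empty:
        return False

    # If it can be cast to numeric, it's not a datetime.
    try:
        pd.to_numeric(values, errors='raise')
        return False
    except (ValueError, TypeError):
        pass

    # Finally, try to cast to datetime, regardless of the column's dtype.
    try:
        pd.to_datetime(values, errors='raise')
        return True
    except (ValueError, TypeError, OverflowError):
        return False
```

Running the numeric check first matters because `pd.to_datetime` will happily parse plain integers as timestamps, so numeric columns would otherwise be misclassified.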
Codecov Report
@@ Coverage Diff @@
## master #683 +/- ##
==========================================
+ Coverage 97.44% 97.47% +0.02%
==========================================
Files 118 118
Lines 9643 9670 +27
==========================================
+ Hits 9397 9426 +29
+ Misses 246 244 -2
Looks good!
Fixes a small issue that came up in pandas 0.25.0 testing.
While fixing it, I also did a small refactor of how we infer variable types.