Update detect_problem_types implementation #1476
Codecov Report
@@ Coverage Diff @@
## main #1476 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 223 223
Lines 15019 15024 +5
=========================================
+ Hits 15012 15017 +5
Misses 7 7
good catch! LGTM
@bchen1116 This looks good to me!
@@ -46,7 +45,7 @@ def detect_problem_type(y):
         raise ValueError("Less than 2 classes detected! Target unusable for modeling")
     if num_classes == 2:
         return ProblemTypes.BINARY
-    if y.dtype in numeric_dtypes:
+    if is_numeric_dtype(y.dtype):
Got it, thanks @bchen1116!
I think the ultimate goal is to update this (and all our utilities) to standardize on woodwork (#1229) and then check whether the "numeric" semantic tag has been applied to the target. (@angela97lin FYI)
fix #1469
Updated this implementation to catch `Int64` dtypes. Previously, using `analyze_metadata` in looking_glass, if the target data was `Int64`, the problem_type would be classified as multiclass as long as there were >2 unique values.

After using `is_numeric_dtype`: since we drop NaN data, classifying nullable Boolean as a numeric dtype is OK, because we catch the binary case before determining whether the problem is regression or multiclass.
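The behavior described here can be illustrated with a minimal sketch. It is not the library's implementation: `OLD_NUMERIC_DTYPES` is a hypothetical stand-in for the old allow-list (its real contents are not shown in this PR), `sketch_detect_problem_type` is an invented name, and returning "regression" for every numeric target is a simplifying assumption. Only `pandas.api.types.is_numeric_dtype` and the order of the checks come from the diff.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the old allow-list of dtype names (assumption).
OLD_NUMERIC_DTYPES = ["int16", "int32", "int64", "float16", "float32", "float64"]


def sketch_detect_problem_type(y, use_is_numeric_dtype=True):
    """Simplified illustration of the check order; not the real function."""
    y = pd.Series(y).dropna()  # NaN / <NA> rows are dropped before any check
    num_classes = y.nunique()
    if num_classes < 2:
        raise ValueError("Less than 2 classes detected! Target unusable for modeling")
    if num_classes == 2:
        # Binary is decided first, so nullable Boolean never reaches the numeric check.
        return "binary"
    if use_is_numeric_dtype:
        numeric = is_numeric_dtype(y.dtype)
    else:
        # The nullable "Int64" dtype does not compare equal to the name "int64".
        numeric = y.dtype in OLD_NUMERIC_DTYPES
    # Assumption: every numeric target with more than 2 values is treated as regression.
    return "regression" if numeric else "multiclass"


y_int = pd.Series([1, 2, 3, 4, None], dtype="Int64")
print(sketch_detect_problem_type(y_int, use_is_numeric_dtype=False))
print(sketch_detect_problem_type(y_int))

y_bool = pd.Series([True, False, None, True], dtype="boolean")
print(is_numeric_dtype(y_bool.dtype), sketch_detect_problem_type(y_bool))
```

Under these assumptions, the membership test misses the nullable `Int64` target while `is_numeric_dtype` accepts it, and the nullable Boolean target is reported as binary even though its dtype also counts as numeric.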