Data Health: Uniqueness #1785
cfa642d to a6c6df6
Codecov Report
@@            Coverage Diff            @@
##             main    #1785    +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%
=========================================
  Files         250      252      +2
  Lines       19972    20047     +75
=========================================
+ Hits        19964    20039     +75
  Misses          8        8
Left a few comments, but LGTM!
Cool stuff, love the examples! 🚀
@@ -15,3 +15,4 @@
 from .class_imbalance_data_check import ClassImbalanceDataCheck
 from .high_variance_cv_data_check import HighVarianceCVDataCheck
 from .multicollinearity_data_check import MulticollinearityDataCheck
+from .uniqueness_data_check import UniquenessDataCheck
Wonderful, let's add this to the API ref too! 😁
res = X.apply(UniquenessDataCheck.uniqueness_score)

if is_regression(self.problem_type):
Interesting, what happens in the case of a binary classification problem? Is nothing flagged? :o Maybe we should make explicit that this is only used for multiclass/regression problems (rather than classification)?
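To make the question concrete, here is a rough standalone sketch of how a per-column uniqueness score and the problem-type branching could fit together. It is not the code in this PR. It assumes the score is one minus the sum of squared relative value frequencies, that regression flags columns that are not unique enough, that multiclass flags columns that are too unique, and that binary flags nothing. The `flag_columns` helper, the plain-string problem types, the `not_unique` column, and the 0.5 threshold are all hypothetical.

```python
import pandas as pd


def uniqueness_score(col: pd.Series) -> float:
    """Assumed definition: 1 - sum of squared relative value frequencies."""
    freqs = col.value_counts(normalize=True)
    return float(1 - (freqs ** 2).sum())


def flag_columns(X: pd.DataFrame, problem_type: str, threshold: float = 0.5) -> list:
    """Hypothetical helper: return the column names a check like this might warn about."""
    scores = X.apply(uniqueness_score)  # one score per column
    if problem_type == "regression":
        # Assumption: regression inputs should be reasonably unique.
        return scores[scores < threshold].index.tolist()
    if problem_type == "multiclass":
        # Assumption: multiclass inputs should not be too unique.
        return scores[scores > threshold].index.tolist()
    # Assumption: nothing is flagged for other problem types, e.g. binary.
    return []


X = pd.DataFrame({
    "most_unique": [float(x) for x in range(10)],  # ten distinct values
    "not_unique": [1.0] * 10,                      # hypothetical constant column
})
scores = X.apply(uniqueness_score)
cats_score = uniqueness_score(pd.Series(["Cats", "Are", "Absolutely", "The", "Best"] * 20))
```

Under these assumptions a binary problem falls through both branches and returns an empty list, which is the case the docstring would need to spell out. Five distinct values repeated 20 times score roughly 0.8, so a column like that would be flagged for multiclass at a 0.5 threshold.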
"import pandas as pd\n",
"from evalml.data_checks import UniquenessDataCheck\n",
"\n",
"X = pd.DataFrame({'most_unique': [float(x) for x in range(10)], # [0,1,2,3,4,5,6,7,8,9]\n",
Nice, super clear example!!
"errors": []
}

data = pd.DataFrame({'multiclass_too_unique': ["Cats", "Are", "Absolutely", "The", "Best"] * 20,
Yes.
b85482a to 24f79ba
addresses #1744