Data Health: Uniqueness #1785
cfa642d to a6c6df6
Codecov Report
@@            Coverage Diff            @@
##             main    #1785     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%
=========================================
  Files         250      252      +2
  Lines       19972    20047     +75
=========================================
+ Hits        19964    20039     +75
  Misses          8        8
bchen1116 left a comment
Left a few comments, but LGTM!
angela97lin left a comment
Cool stuff, love the examples! 🚀
from .class_imbalance_data_check import ClassImbalanceDataCheck
from .high_variance_cv_data_check import HighVarianceCVDataCheck
from .multicollinearity_data_check import MulticollinearityDataCheck
from .uniqueness_data_check import UniquenessDataCheck
Wonderful, let's add this to the API ref too! 😁
res = X.apply(UniquenessDataCheck.uniqueness_score)

if is_regression(self.problem_type):
Interesting, what happens in the case of a binary classification problem? Is nothing flagged? :o Maybe we should make explicit that this is only used for multiclass/regression problems (rather than classification)?
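To make the question above concrete, here is a small sketch of the branching being discussed. It is not the PR's implementation: the helper name `flag_columns`, the string problem-type labels, the 0.5 threshold, and the direction of each comparison are all assumptions for illustration. It takes precomputed per-column scores in [0, 1], flags low scores for regression and high scores for multiclass, and returns nothing for binary, which is the silent case the comment asks to have documented.

```python
import pandas as pd


def flag_columns(scores: pd.Series, problem_type: str, threshold: float = 0.5) -> list:
    """Hypothetical helper: return names of columns with a suspicious score."""
    if problem_type == "regression":
        # Assumed rule: a regression feature with a low score is flagged.
        return list(scores.index[scores < threshold])
    if problem_type == "multiclass":
        # Assumed rule: a multiclass feature with a high score is flagged.
        return list(scores.index[scores > threshold])
    # Binary (and any other label) falls through: nothing is flagged.
    return []


# Made-up scores for two made-up columns.
scores = pd.Series({"low": 0.1, "high": 0.9})
print(flag_columns(scores, "regression"))  # ['low']
print(flag_columns(scores, "multiclass"))  # ['high']
print(flag_columns(scores, "binary"))      # []
```

If the check really is a no-op for binary problems, stating that in the docstring (or raising for unsupported problem types) would remove the ambiguity.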
"import pandas as pd\n",
"from evalml.data_checks import UniquenessDataCheck\n",
"\n",
"X = pd.DataFrame({'most_unique': [float(x) for x in range(10)], # [0,1,2,3,4,5,6,7,8,9]\n",
Nice, super clear example!!
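For readers following the notebook example without EvalML installed, here is one way a per-column score could be computed with pandas alone. This is a sketch under an assumption: it uses one minus the sum of squared value frequencies (a Gini-Simpson-style index), which the visible diff does not confirm as the formula behind `UniquenessDataCheck.uniqueness_score`. The `constant` column is hypothetical and added only for contrast.

```python
import pandas as pd


def uniqueness_score(col: pd.Series) -> float:
    """Assumed score: 1 - sum of squared relative frequencies of the values."""
    freqs = col.value_counts(normalize=True)
    return 1.0 - float((freqs ** 2).sum())


X = pd.DataFrame({
    "most_unique": [float(x) for x in range(10)],  # ten distinct values
    "constant": [1.0] * 10,                        # hypothetical: a single value
})

# Apply the score column by column, mirroring the X.apply(...) call in the diff.
scores = X.apply(uniqueness_score)
print(scores)
```

Under this assumed formula, a column of ten distinct values scores close to 0.9 and a constant column scores 0, so higher means more unique.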
"errors": []
}

data = pd.DataFrame({'multiclass_too_unique': ["Cats", "Are", "Absolutely", "The", "Best"] * 20,
b85482a to 24f79ba
Addresses #1744