Data Health: Uniqueness #1785

Merged
chukarsten merged 13 commits into main from 1744-uniqueness_score
Feb 9, 2021

Conversation

chukarsten
Contributor

addresses #1744

@chukarsten added the enhancement and new feature labels Feb 4, 2021
@chukarsten force-pushed the 1744-uniqueness_score branch from cfa642d to a6c6df6 February 8, 2021 16:01

codecov bot commented Feb 8, 2021

Codecov Report

Merging #1785 (24f79ba) into main (3e8b5a1) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1785     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         250      252      +2     
  Lines       19972    20047     +75     
=========================================
+ Hits        19964    20039     +75     
  Misses          8        8             
Impacted Files                                          Coverage Δ
evalml/data_checks/__init__.py                          100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py           100.0% <100.0%> (ø)
evalml/data_checks/uniqueness_data_check.py             100.0% <100.0%> (ø)
...ts/data_checks_tests/test_uniqueness_data_check.py   100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@bchen1116 left a comment

Left a few comments, but LGTM!

docs/source/user_guide/data_checks.ipynb (outdated)
evalml/data_checks/uniqueness_data_check.py (outdated)
evalml/data_checks/uniqueness_data_check.py
evalml/data_checks/uniqueness_data_check.py (outdated)
evalml/data_checks/uniqueness_data_check.py (outdated)
evalml/data_checks/uniqueness_data_check.py
Contributor

@angela97lin left a comment

Cool stuff, love the examples! 🚀

@@ -15,3 +15,4 @@
from .class_imbalance_data_check import ClassImbalanceDataCheck
from .high_variance_cv_data_check import HighVarianceCVDataCheck
from .multicollinearity_data_check import MulticollinearityDataCheck
from .uniqueness_data_check import UniquenessDataCheck

Wonderful, let's add this to the API ref too! 😁


res = X.apply(UniquenessDataCheck.uniqueness_score)

if is_regression(self.problem_type):

Interesting, what happens in the case of a binary classification problem? Is nothing flagged? :o Maybe we should make explicit that this is only used for multiclass/regression problems (rather than classification)?
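
For readers following the question above, here is a minimal sketch of how a uniqueness score and a problem-type branch could fit together. It assumes the score is one minus the sum of squared value frequencies, that regression flags columns scoring below a threshold, and that multiclass flags columns scoring above it. The helper `flag_column`, the string problem types, and the 0.5 default threshold are hypothetical and are not taken from the PR's code. Under these assumptions a binary problem falls through both branches, so nothing is flagged.

```python
from collections import Counter


def uniqueness_score(values):
    """Assumed definition: 1 - sum of squared relative frequencies.

    Returns 0.0 when every value is identical and approaches 1.0 as
    the values become more evenly spread over many distinct entries.
    """
    values = list(values)
    if not values:
        raise ValueError("cannot score an empty column")
    total = len(values)
    counts = Counter(values)
    return 1.0 - sum((count / total) ** 2 for count in counts.values())


def flag_column(values, problem_type, threshold=0.5):
    """Hypothetical helper mirroring the branch discussed in the review.

    The threshold semantics here are an assumption for illustration.
    """
    score = uniqueness_score(values)
    if problem_type == "regression" and score < threshold:
        return "not unique enough"
    if problem_type == "multiclass" and score > threshold:
        return "too unique"
    # Any other problem type (for example binary) is never flagged here.
    return None


print(flag_column([1, 1, 1, 1], "regression"))
print(flag_column(range(10), "multiclass"))
print(flag_column(range(10), "binary"))
```

Keeping the score separate from the flagging rule makes the gap raised in the review easy to see: the score is defined for any column, but only two problem types attach a meaning to it.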

"import pandas as pd\n",
"from evalml.data_checks import UniquenessDataCheck\n",
"\n",
"X = pd.DataFrame({'most_unique': [float(x) for x in range(10)], # [0,1,2,3,4,5,6,7,8,9]\n",

Nice, super clear example!!

"errors": []
}

data = pd.DataFrame({'multiclass_too_unique': ["Cats", "Are", "Absolutely", "The", "Best"] * 20,

Yes.
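
As a rough worked example for the column in the test snippet above, the following sketch computes a score for the five words repeated 20 times. It assumes the same definition as before (one minus the sum of squared value frequencies), which is not confirmed by the lines quoted in this conversation, and it uses only the standard library rather than the evalml API.

```python
from collections import Counter

# The single column visible in the quoted test data: 100 rows, 5 distinct values.
column = ["Cats", "Are", "Absolutely", "The", "Best"] * 20

# Relative frequency of each distinct value (each is 20 / 100 = 0.2).
frequencies = {value: count / len(column) for value, count in Counter(column).items()}

# Assumed score: 1 - sum(p^2) = 1 - 5 * 0.04 = 0.8.
score = 1.0 - sum(p ** 2 for p in frequencies.values())
print(round(score, 3))  # 0.8
```

With a threshold of 0.5 (again an assumption), a score of 0.8 would sit above the cutoff, which is consistent with the column name `multiclass_too_unique`.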

@chukarsten force-pushed the 1744-uniqueness_score branch from b85482a to 24f79ba February 9, 2021 21:01
@chukarsten merged commit fde0e01 into main Feb 9, 2021
@ParthivNaresh mentioned this pull request Feb 9, 2021
@freddyaboulton deleted the 1744-uniqueness_score branch May 13, 2022 15:01