
Data Health: Uniqueness #1785

Merged: chukarsten merged 13 commits into main from 1744-uniqueness_score on Feb 9, 2021

@chukarsten (Contributor)

Addresses #1744.

@chukarsten added the enhancement (An improvement to an existing feature.) and new feature (Features which don't yet exist.) labels on Feb 4, 2021
@chukarsten force-pushed the 1744-uniqueness_score branch from cfa642d to a6c6df6 on February 8, 2021 16:01
@codecov (bot) commented on Feb 8, 2021

Codecov Report

Merging #1785 (24f79ba) into main (3e8b5a1) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1785     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         250      252      +2     
  Lines       19972    20047     +75     
=========================================
+ Hits        19964    20039     +75     
  Misses          8        8             
Impacted Files Coverage Δ
evalml/data_checks/__init__.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/uniqueness_data_check.py 100.0% <100.0%> (ø)
...ts/data_checks_tests/test_uniqueness_data_check.py 100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 3e8b5a1...24f79ba.

Comment thread evalml/data_checks/uniqueness_data_check.py Outdated
@bchen1116 (Contributor) left a comment:

Left a few comments, but LGTM!

Comment thread docs/source/user_guide/data_checks.ipynb Outdated
Comment thread evalml/data_checks/uniqueness_data_check.py Outdated
Comment thread evalml/data_checks/uniqueness_data_check.py
Comment thread evalml/data_checks/uniqueness_data_check.py Outdated
Comment thread evalml/data_checks/uniqueness_data_check.py Outdated
Comment thread evalml/data_checks/uniqueness_data_check.py
@angela97lin (Contributor) left a comment:

Cool stuff, love the examples! 🚀

from .class_imbalance_data_check import ClassImbalanceDataCheck
from .high_variance_cv_data_check import HighVarianceCVDataCheck
from .multicollinearity_data_check import MulticollinearityDataCheck
from .uniqueness_data_check import UniquenessDataCheck
Contributor

Wonderful, let's add this to the API ref too! 😁


res = X.apply(UniquenessDataCheck.uniqueness_score)

if is_regression(self.problem_type):
Contributor

Interesting, what happens in the case of a binary classification problem? Is nothing flagged? :o Maybe we should make explicit that this is only used for multiclass/regression problems (rather than classification)?
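
To make the question above concrete, here is a minimal pandas-only sketch of how the check might branch on problem type. It is not the code from this PR: the score formula (one minus the sum of squared value frequencies), the direction of each comparison, the default threshold of 0.5, and the helper name `flag_columns` are all assumptions made for illustration. In this sketch any problem type other than regression or multiclass, including binary, flags nothing.

```python
import pandas as pd


def uniqueness_score(col: pd.Series) -> float:
    """Assumed score: 1 minus the sum of squared value frequencies."""
    freqs = col.value_counts(normalize=True)
    return float(1.0 - (freqs ** 2).sum())


def flag_columns(X: pd.DataFrame, problem_type: str, threshold: float = 0.5) -> list:
    """Hypothetical helper: return the column names a uniqueness check might flag."""
    scores = X.apply(uniqueness_score)  # one score per column
    if problem_type == "regression":
        # Assumption: regression flags columns that are not unique enough.
        return list(scores.index[scores < threshold])
    if problem_type == "multiclass":
        # Assumption: multiclass flags columns that are too unique.
        return list(scores.index[scores > threshold])
    # Assumption: every other problem type (e.g. binary) flags nothing.
    return []


X = pd.DataFrame({
    "all_distinct": [float(x) for x in range(10)],  # every value differs
    "constant": [1.0] * 10,                         # a single repeated value
})

print(flag_columns(X, "regression"))
print(flag_columns(X, "multiclass"))
print(flag_columns(X, "binary"))
```

If the real check behaves like this sketch, stating in the docstring that only regression and multiclass problems are evaluated would answer the question directly.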

"import pandas as pd\n",
"from evalml.data_checks import UniquenessDataCheck\n",
"\n",
"X = pd.DataFrame({'most_unique': [float(x) for x in range(10)], # [0,1,2,3,4,5,6,7,8,9]\n",
Contributor

Nice, super clear example!!

"errors": []
}

data = pd.DataFrame({'multiclass_too_unique': ["Cats", "Are", "Absolutely", "The", "Best"] * 20,
Contributor

Yes.
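
As a worked example for the `multiclass_too_unique` column quoted above: five distinct strings, each repeated 20 times, give 100 rows with a frequency of 0.2 per value. Assuming the score is one minus the sum of squared value frequencies (an assumption, since the thread does not show the formula), the arithmetic is:

```latex
\[
S = 1 - \sum_{i=1}^{k} p_i^{2},
\qquad k = 5,\quad p_i = \tfrac{20}{100} = 0.2
\;\Longrightarrow\;
S = 1 - 5 \times 0.2^{2} = 0.8
\]
```

Whether 0.8 counts as too unique depends on the threshold the check is configured with, which this excerpt does not show.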

@chukarsten force-pushed the 1744-uniqueness_score branch from b85482a to 24f79ba on February 9, 2021 21:01
@chukarsten merged commit fde0e01 into main on Feb 9, 2021
@ParthivNaresh mentioned this pull request on Feb 9, 2021
@freddyaboulton deleted the 1744-uniqueness_score branch on May 13, 2022 15:01