
197 no variance data check #893

Merged: 14 commits merged into master from 197-no-variance-data-check on Jul 1, 2020

Conversation

@freddyaboulton (Contributor) commented Jun 26, 2020

Pull Request Description

Addresses #197 and #889 by adding NoVarianceDataCheck to DefaultDataChecks


After creating the pull request: in order to pass the changelog_updated check, you will need to update the "Future Releases" section of docs/source/changelog.rst to include this pull request by adding :pr:`123`.
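For orientation, here is a minimal usage sketch of the new check (illustrative only: the import path, toy data, and the exact contents of the returned messages are assumptions, not part of this PR's diff):

import pandas as pd
from evalml.data_checks import NoVarianceDataCheck

# "constant" has a single unique value, so the check should flag it;
# "ok" has real variance and should pass untouched.
X = pd.DataFrame({"constant": [1, 1, 1, 1], "ok": [1, 2, 3, 4]})
y = pd.Series([0, 1, 0, 1])

check = NoVarianceDataCheck()
results = check.validate(X, y)  # expected to report the "constant" column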

codecov bot commented Jun 26, 2020

Codecov Report

Merging #893 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #893   +/-   ##
=======================================
  Coverage   99.79%   99.79%           
=======================================
  Files         195      197    +2     
  Lines        9142     9204   +62     
=======================================
+ Hits         9123     9185   +62     
  Misses         19       19           
Impacted Files Coverage Δ
evalml/data_checks/__init__.py 100.00% <100.00%> (ø)
evalml/data_checks/default_data_checks.py 100.00% <100.00%> (ø)
evalml/data_checks/no_variance_data_check.py 100.00% <100.00%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.00% <100.00%> (ø)
...s/data_checks_tests/test_no_variance_data_check.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8be871b...72243c4.

@@ -311,7 +311,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.4"
+"version": "3.8.3"
@ctduffy (Contributor) commented Jun 29, 2020

Might want to back out this change in python version here!

for name in unique_counts:

    message = self._check_for_errors(f"Column {name} has", unique_counts[name], any_nulls[name])

@ctduffy (Contributor) commented Jun 29, 2020

Just in terms of style, these lines (like 67-77) seem oddly spaced out to me relative to the way other functions are written in this library. @dsherry can probably give more definitive guidance on this, but from quickly looking at other data check files and from working with other evalml files, it seems like there should not be extra lines between if statements/for loops and their contents.

@freddyaboulton (Contributor, Author) commented Jun 29, 2020

Good point! It would be nice if our linter could help enforce this. I'll look into it!

@ctduffy (Contributor) commented Jun 29, 2020

Great idea! I bet you could file an issue about this and we could put it into a future milestone

message = f"{column_name} {int(count_unique)} unique value."

if count_unique <= 1:

@ctduffy (Contributor) commented Jun 29, 2020

Also here, there probably shouldn't be an empty line between lines 39 and 41.



@pytest.mark.parametrize("X,y,countna,answer", cases)
def test_no_variance_data_check_warnings(X, y, countna, answer):
@ctduffy (Contributor) commented Jun 29, 2020

Woah, this is a cool testing infrastructure that I haven't seen before! Also, you probably want to delete the blank line on line 47 as it seems unnecessary.

@dsherry (Collaborator) commented Jul 1, 2020

@freddyaboulton yeah this is pretty slick!
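For readers new to this pattern: cases is a list of argument tuples that pytest expands into one test invocation each. A hypothetical sketch of what it might contain (the actual entries and expected results used in the PR are elided here):

import pandas as pd

# Each tuple supplies (X, y, countna, answer) for one parametrized run;
# the expected validation results ("answer") are left as placeholders in this sketch.
cases = [
    (pd.DataFrame({"a": [1, 1, 1]}), pd.Series([0, 1, 0]), False, ...),
    (pd.DataFrame({"a": [1.0, None, None]}), pd.Series([0, 1, 0]), True, ...),
]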


Arguments:
    count_nan_as_value (bool): If True, missing values will be counted as their own unique value.
        If set to True, a feature that has one unique value and all other data is missing, a
@angela97lin (Contributor) commented Jun 29, 2020

"If set to True, a feature with one unique value and all other data missing will return a DataCheckWarning instead of an error." or something along those lines?

def validate(self, X, y):
    """Check if any of the features or if the labels have no variance (1 unique value).

    Args:
@angela97lin (Contributor) commented Jun 29, 2020

Args --> Arguments


Args:
    X (pd.DataFrame): The input features.
    y (pd.Series): The labes.
@angela97lin (Contributor) commented Jun 29, 2020

labes --> labels

"""Checks if a column has no variance.

Arguments:
    column_name (str): Name of column we are currently checking.
@angela97lin (Contributor) commented Jun 29, 2020

Hmm, this argument is called column_name but it's not exactly used that way; for example, we pass in "The Labels have" and "Column {name} has", which are really just the message prefixes we want, not the actual target label column name. Could we update this so it's more representative of the argument's value, or change the values we're passing into the argument?

@freddyaboulton (Contributor, Author) commented Jun 29, 2020

What are your thoughts on column_name_representation as the new name? I'll clarify some more in the docstring as well.

@angela97lin (Contributor) commented Jul 1, 2020

Hmmmmm, is it possible to remove the "has"/"have" from the argument and do the formatting inside the string instead? Rather than self._check_for_errors(f"Column {name} has", unique_counts[name], any_nulls[name]) and self._check_for_errors("The Labels have", y.nunique(dropna=self.dropnan), y.isnull().any()), we could just pass in {name} and "labels" (or, if y already has a name, as pd.Series sometimes do, it might be better to use that name) and then format the string as message = f"{column_name} has {int(count_unique)} unique values."

@angela97lin (Contributor) commented Jul 1, 2020

Then using column_name as the parameter name would make more sense!

@freddyaboulton (Contributor, Author) commented Jul 1, 2020

I was trying to be grammatically correct 😅 Ok, how does the following sound?

  • Change the message to f"{column_name} has {int(count_unique)} unique values." for both labels and feature columns. Don't worry about grammar.
  • The column_name for the labels will be the series name if it is present and "Labels" if not.

@angela97lin (Contributor) commented Jul 1, 2020

That sounds good! Hahaha, we could also use "y" instead of "labels" if we want to be grammatically correct but I don't have a strong opinion on this 😅

@freddyaboulton (Contributor, Author) commented Jul 1, 2020

Let's keep it simple. Correct grammar may be premature optimization 😬
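To summarize the agreed approach, a rough sketch (the DataCheckError class name, the helper's exact body, and the call sites are assumptions based on the discussion above, not the merged diff):

def _check_for_errors(self, column_name, count_unique, any_nulls):
    # column_name is only the name used in the message: a feature column name,
    # the label series name, or "Labels" as a fallback.
    message = f"{column_name} has {int(count_unique)} unique values."
    if count_unique <= 1:
        return DataCheckError(message, self.name)
    elif count_unique == 2 and not self.dropnan and any_nulls:
        return DataCheckWarning(f"{column_name} has two unique values including nulls. "
                                "Consider encoding the nulls for "
                                "this column to be useful for machine learning.", self.name)

# Call sites: feature columns pass their own names; labels fall back to "Labels".
message = self._check_for_errors(name, unique_counts[name], any_nulls[name])
label_name = y.name if y.name else "Labels"
label_message = self._check_for_errors(label_name, y.nunique(dropna=self.dropnan), y.isnull().any())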

@freddyaboulton force-pushed the 197-no-variance-data-check branch from b5a75bd to 38897c5 on Jun 29, 2020
@freddyaboulton added this to the June 2020 milestone on Jun 29, 2020
@freddyaboulton (Contributor, Author) commented Jun 30, 2020

@dsherry Please let me know if you'd like me to make any changes!

@freddyaboulton removed this from the June 2020 milestone on Jun 30, 2020
@freddyaboulton force-pushed the 197-no-variance-data-check branch from 2cd8d6a to a960047 on Jun 30, 2020
@freddyaboulton added this to the July 2020 milestone on Jun 30, 2020
@@ -25,6 +25,7 @@ Changelog
 * Added ability to save and load AutoML objects to file :pr:`888`
 * Updated `AutoSearchBase.get_pipelines` to return an untrained pipeline instance :pr:`876`
 * Saved learned binary classification thresholds in automl results cv data dict :pr:`876`
+* Added `NoVarianceDataCheck` to `DefaultDataChecks` :pr:`893`
@angela97lin (Contributor) commented Jul 1, 2020

Ooo just FYI with the release, you'll want to move this to the new "Future Releases" :D

@freddyaboulton force-pushed the 197-no-variance-data-check branch from a960047 to 93549a2 on Jul 1, 2020
@@ -76,7 +76,7 @@
 "\n",
 "X = X.drop(['customer_id', 'date_registered', 'birthday','phone', 'email',\n",
 " 'owner', 'company', 'id', 'time_x',\n",
-" 'session', 'referrer', 'time_y', 'label'], axis=1)\n",
+" 'session', 'referrer', 'time_y', 'label', 'country'], axis=1)\n",
@angela97lin (Contributor) commented Jul 1, 2020

Is this a necessary change? What was the reason for it?

@freddyaboulton (Contributor, Author) commented Jul 1, 2020

The country column only has one value (USA), so it fails the default data checks.

@dsherry (Collaborator) approved these changes on Jul 1, 2020 and left a comment

Looks great. I love the test!




messages = []

for name in unique_counts:
    message = self._check_for_errors(f"Column {name} has", unique_counts[name], any_nulls[name])
@dsherry (Collaborator) commented Jul 1, 2020

@freddyaboulton just an impl note, we could just pass in name and unique_counts/any_nulls, and then have _check_for_errors do the string formatting, right?

@freddyaboulton (Contributor, Author) commented Jul 1, 2020

Yes, you're right. The latest version of the code does this (per @angela97lin 's suggestion!)

If set to True, a feature that has one unique value and all other data is missing, a
DataCheckWarning will be returned instead of an error. Defaults to False.
"""
self.dropnan = not count_nan_as_value
@dsherry (Collaborator) commented Jul 1, 2020

Cool. Suggest self._drop_nans. And if we're gonna make this public can we save self.count_nan_as_value instead since that's what was provided to init?

@freddyaboulton (Contributor, Author) commented Jul 1, 2020

I'll make it private!
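i.e., roughly this shape (a sketch; the final attribute name in the merged code may differ):

def __init__(self, count_nan_as_value=False):
    # Store the flag privately; callers only interact with count_nan_as_value at init time.
    self._dropnan = not count_nan_as_value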

elif count_unique == 2 and not self.dropnan and any_nulls:
    return DataCheckWarning(f"{column_name_representation} two unique values including nulls. "
                            "Consider encoding the nulls for "
                            "this column to be useful for machine learning.", self.name)
@dsherry (Collaborator) commented Jul 1, 2020

Nice, this is helpful.
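As a concrete illustration of the case this branch covers (hypothetical data; the exact return structure is not asserted here):

import pandas as pd
from evalml.data_checks import NoVarianceDataCheck

# "mostly_missing" has one real value plus nulls. With count_nan_as_value=True,
# that counts as two unique values, so this branch returns the encoding warning
# rather than a no-variance error.
X = pd.DataFrame({"mostly_missing": [1.0, None, None, None], "ok": [1, 2, 3, 4]})
y = pd.Series([0, 1, 0, 1])
results = NoVarianceDataCheck(count_nan_as_value=True).validate(X, y)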

if message:
    messages.append(message)

label_message = self._check_for_errors("The Labels have", y.nunique(dropna=self.dropnan), y.isnull().any())
@dsherry (Collaborator) commented Jul 1, 2020

Suggest "The labels" no caps

@pytest.mark.parametrize("X,y,countna,answer", cases)
def test_no_variance_data_check_warnings(X, y, countna, answer):
    check = NoVarianceDataCheck(countna)
    assert check.validate(X, y) == answer
@dsherry (Collaborator) commented Jul 1, 2020

Could be nice to change answer to expected_validation_result, and countna to count_nan_as_value to align with API

@freddyaboulton (Contributor, Author) commented Jul 1, 2020

Good idea!
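With the renaming applied, the test would read roughly:

@pytest.mark.parametrize("X,y,count_nan_as_value,expected_validation_result", cases)
def test_no_variance_data_check_warnings(X, y, count_nan_as_value, expected_validation_result):
    check = NoVarianceDataCheck(count_nan_as_value)
    assert check.validate(X, y) == expected_validation_result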

@freddyaboulton force-pushed the 197-no-variance-data-check branch from 93549a2 to f629e26 on Jul 1, 2020
@freddyaboulton merged commit 24a0bcc into master on Jul 1, 2020
2 checks passed
@freddyaboulton deleted the 197-no-variance-data-check branch on Jul 10, 2020
@dsherry mentioned this pull request on Jul 16, 2020