
Augment InvalidTargetDataCheck to check for mismatched indices in target and features #1816

Merged
merged 15 commits into main from 1726_mismatched_indices_dc on Feb 11, 2021

Conversation

@angela97lin (Contributor) commented Feb 10, 2021

Closes #1726.

I chose to check whether the values of the indices match, so we don't run into weird behavior with RangeIndex vs Index, but I'd love to discuss whether that's too flexible or whether we should check for explicit matches!

After discussion with @freddyaboulton, @chukarsten, and @dsherry, I separated out the case where different lengths cause an index mismatch from the case of different values. I also separated out the case where the indices have the same values (set-equivalent) but in a different order. I thought this was a good idea because if we just printed the set difference for that case, it would be an empty set and confuse users as to why the indices are reported as mismatched.

I'm choosing not to change the behavior pointed out by @dsherry here, and filed #1820 to track that update instead.

TODO:

  • Make sure tests are passing for mismatched indices, not-in-order-but-same indices, different length indices
  • Separate out different lengths case from different indices case
  • Only list X number of mismatched indices, rather than everything
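The cases described above (exact match, different lengths, same values out of order, and truly mismatched values truncated to a sample) can be sketched as follows. This is a minimal illustration, not the actual evalml implementation: the helper name `check_mismatched_indices`, the `n_to_show` parameter, and the plain-dict return values are all hypothetical; the real check emits `DataCheckWarning` objects.

```python
import pandas as pd

def check_mismatched_indices(X, y, n_to_show=10):
    """Hypothetical sketch of the index comparison described above."""
    X_index = list(X.index)
    y_index = list(y.index)
    if X_index == y_index:
        return None  # indices match exactly, in order
    if len(X_index) != len(y_index):
        # Different lengths reported as their own case.
        return {"warning": "different lengths",
                "features_length": len(X_index), "target_length": len(y_index)}
    if set(X_index) == set(y_index):
        # Same values, different order: the set difference would be empty,
        # so report this case separately to avoid confusing users.
        return {"warning": "same values, different order"}
    # Show only the first few mismatches so huge indices don't flood the user.
    not_in_features = sorted(set(y_index) - set(X_index))[:n_to_show]
    not_in_target = sorted(set(X_index) - set(y_index))[:n_to_show]
    return {"warning": "mismatched indices",
            "indices_not_in_features": not_in_features,
            "indices_not_in_target": not_in_target}

X = pd.DataFrame({"col": range(4)})                    # RangeIndex 0..3
y = pd.Series([0, 1, 0, 1], index=[10, 11, 12, 13])    # disjoint index
result = check_mismatched_indices(X, y)
```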

@angela97lin angela97lin added this to the Sprint 2021 Feb A milestone Feb 10, 2021
@angela97lin angela97lin self-assigned this Feb 10, 2021
codecov bot commented Feb 10, 2021

Codecov Report

Merging #1816 (5e7a497) into main (d4e2f54) will increase coverage by 0.1%.
The diff coverage is 100.0%.


@@            Coverage Diff            @@
##             main    #1816     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         255      255             
  Lines       20564    20621     +57     
=========================================
+ Hits        20556    20613     +57     
  Misses          8        8             
Impacted Files Coverage Δ
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/invalid_targets_data_check.py 100.0% <100.0%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py 100.0% <100.0%> (ø)
...ta_checks_tests/test_invalid_targets_data_check.py 100.0% <100.0%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@angela97lin angela97lin marked this pull request as ready for review February 10, 2021 23:14
@bchen1116 (Contributor) left a comment

LGTM! I think checking for list equality is fine for this.

@freddyaboulton (Contributor) left a comment

@angela97lin Looks good! My only comment is that outputting the entire indices may overload users if their data is big.

messages["warnings"].append(
    DataCheckWarning(message="Input target and features have mismatched indices",
                     data_check_name=self.name,
                     message_code=DataCheckMessageCode.MISMATCHED_INDICES,
                     details={"feature_index": X_index, "target_index": y_index}).to_dict())
Inline comment (Contributor):

I don't think it's a good idea to return the entire indices of X and y because they can be really big. Maybe we just return a sample of the diff?

angela97lin (Contributor, Author) replied:

@freddyaboulton Ooo yes, I love this suggestion!! Do you think returning the entire diff is okay, or should we use a parameter like n_unique to determine how many we display / keep, as we do for unique target values?

Pro: more flexibility
Con: another parameter

Or I could also combine it with n_unique / make that handle both cases 🤔

X = _convert_to_woodwork_structure(X)
X_index = list(X.to_dataframe().index)
y_index = list(y_df.index)
if X_index != y_index:
Inline comment (Contributor):

I think this makes sense! pandas treats RangeIndex and Int64Index the same as long as they have the same values when performing operations across different series/dataframes, so no NaNs will be introduced. So I think list comparison is a good idea.
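As a standalone illustration of this point (not code from the PR): pandas aligns by index values during arithmetic, so a RangeIndex and an explicit integer index with the same values align cleanly and produce no NaNs, which is why comparing the index values as lists suffices.

```python
import pandas as pd

a = pd.Series([1, 2, 3])                       # default RangeIndex(start=0, stop=3)
b = pd.Series([10, 20, 30], index=[0, 1, 2])   # explicit integer index, same values

# Arithmetic aligns on index values; identical values mean no NaNs appear.
total = a + b

# The list comparison from the PR treats these two indices as matching.
same_values = list(a.index) == list(b.index)
```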

@chukarsten (Contributor) left a comment

Another Angela classic. Well played.

assert invalid_targets_check.validate(X, y=y_multiclass_low_classes) == {"warnings": [], "errors": []}


def test_invalid_target_data_check_mismatched_indices():
Inline comment (Contributor):

I think these are really good tests, and well written. I like that I can read them and understand exactly what the check validates. I think this is fine as is, but since I don't entirely understand how this propagates to the user, I'll ask: is it ultimately useful for the user to distinguish between the index lengths differing and the index values differing? If not, then let's continue the mission. If so, an issue to further augment this works.

@angela97lin (Contributor, Author) commented:

@freddyaboulton @chukarsten @dsherry I believe I've addressed everything we discussed post-standup. I also separated out the case where the index sets are equivalent but ordered differently. Please take another look when you get the chance! 😁

Comment on lines +446 to +458
# Test that we only store ten mismatches when there are more than 10 differences in indices found
X_large = pd.DataFrame({"col": range(20)})
y_more_than_ten_diff_indices = pd.Series([0, 1] * 10, index=range(20, 40))
X_index_missing = list(set(y_more_than_ten_diff_indices.index) - set(X.index))
y_index_missing = list(set(X_large.index) - set(y_more_than_ten_diff_indices.index))
assert invalid_targets_check.validate(X_large, y_more_than_ten_diff_indices) == {
    "warnings": [DataCheckWarning(message="Input target and features have mismatched indices",
                                  data_check_name=invalid_targets_data_check_name,
                                  message_code=DataCheckMessageCode.MISMATCHED_INDICES,
                                  details={"indices_not_in_features": X_index_missing[:10],
                                           "indices_not_in_target": y_index_missing[:10]}).to_dict()],
    "errors": []
}
Inline comment (Contributor):

The legend of n_unique <3

@ParthivNaresh (Contributor) left a comment

Excellent test case coverage, looks ready to go!

@angela97lin angela97lin merged commit 2804bd0 into main Feb 11, 2021
@angela97lin angela97lin deleted the 1726_mismatched_indices_dc branch February 11, 2021 23:51
@chukarsten chukarsten mentioned this pull request Feb 23, 2021
@dsherry dsherry mentioned this pull request Mar 10, 2021