New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Augment InvalidTargetDataCheck to check for mismatched indices in target and features #1816
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1816 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 255 255
Lines 20564 20621 +57
=========================================
+ Hits 20556 20613 +57
Misses 8 8
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I think checking for list equality is fine for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@angela97lin Looks good! My only comment is that outputting the entire indices may overload users if their data is big.
messages["warnings"].append(DataCheckWarning(message="Input target and features have mismatched indices", | ||
data_check_name=self.name, | ||
message_code=DataCheckMessageCode.MISMATCHED_INDICES, | ||
details={"feature_index": X_index, "target_index": y_index}).to_dict()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's a good idea to return the entire indices of X and y because they can be really big. Maybe we just return a sample of the diff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton Ooo yes, I love this suggestion!! Do you think returning the entire diff is okay, or should we do something like use a parameter n_unique
to determine how many we should display / keep like we have for unique target values?
Pro: more flexibility
Con: another parameter
Or I could also combine it with n_unique / make that handle both cases 🤔
X = _convert_to_woodwork_structure(X) | ||
X_index = list(X.to_dataframe().index) | ||
y_index = list(y_df.index) | ||
if X_index != y_index: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense! pandas treats RangeIndex
and Int64Index
the same as long as they have the same values when it comes to doing operations on different series/dataframes so no NaNs will be imputed. So I think list comparison is a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another Angela classic. Well played.
assert invalid_targets_check.validate(X, y=y_multiclass_low_classes) == {"warnings": [], "errors": []} | ||
|
||
|
||
def test_invalid_target_data_check_mismatched_indices(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are really good tests and well written. I like that I can read them and understand exactly what the check is validating for. I think this is fine as is, but since I don't understand entirely how this propagates to the user...I'll ask : is it useful ultimately for there to be a distinction between the length of the index being the same and the values being different and vice versa? If not, then let's continue mission. If so, an issue to further augment works.
@freddyaboulton @chukarsten @dsherry I believe I addressed everything that we discussed post-standup. Also separated out the case of equivalent indices sets, but not order. Please take another look if you get the chance! 😁 |
# Test that we only store ten mismatches when there are more than 10 differences in indices found | ||
X_large = pd.DataFrame({"col": range(20)}) | ||
y_more_than_ten_diff_indices = pd.Series([0, 1] * 10, index=range(20, 40)) | ||
X_index_missing = list(set(y_more_than_ten_diff_indices.index) - set(X.index)) | ||
y_index_missing = list(set(X_large.index) - set(y_more_than_ten_diff_indices.index)) | ||
assert invalid_targets_check.validate(X_large, y_more_than_ten_diff_indices) == { | ||
"warnings": [DataCheckWarning(message="Input target and features have mismatched indices", | ||
data_check_name=invalid_targets_data_check_name, | ||
message_code=DataCheckMessageCode.MISMATCHED_INDICES, | ||
details={"indices_not_in_features": X_index_missing[:10], | ||
"indices_not_in_target": y_index_missing[:10]}).to_dict()], | ||
"errors": [] | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The legend of n_unique <3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent test case coverage, looks ready to go!
Closes #1726.
I chose to check if the values of the indices match, so we don't run into weird behavior with RangeIndex vs Index, but would love to discuss if that's too flexible or if we should check for explicit matches!
After discussion w/ @freddyaboulton @chukarsten and @dsherry, separated out case where different lengths cause index mismatch vs different values. Also separated out the case where the indices have the same values (set equivalent) but are of a different order. I thought this could be a good idea because if we just printed the set difference, it would be an empty set and confuse users as to why there are mismatched indices.
Choosing not to change this behavior as pointed out by @dsherry here and filed #1820 to track that update instead.
TODO: