-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Port over all other existing guardrails as data checks #789
Conversation
Codecov Report
@@ Coverage Diff @@
## master #789 +/- ##
==========================================
+ Coverage 99.52% 99.56% +0.03%
==========================================
Files 160 161 +1
Lines 6333 6404 +71
==========================================
+ Hits 6303 6376 +73
+ Misses 30 28 -2
Continue to review full report at Codecov.
|
@angela97lin I'll review the docs changes but will hold off on making comments to the code while you're updating it. Lmk if you'd rather I just wait for you to finish. |
@angela97lin I took a look at the docs and I have some thoughts. I see that your changes so far update the existing documentation to refer to data checks rather than guard rails, which is good. But we should show off the new API and make it easy for people to use it and extend it. So, what do you think of this reorganization:
Note I didn't include a section where we describe each implemented data check in detail. I think we should rely on the API ref for that. |
@angela97lin I took another look at the docs. Good stuff! I think what's there now is good enough to merge. All the content is there. I think at this point it can be improved mainly by deleting some stuff and shuffling things around a bit. Specific thoughts:
I also think its totally fine to punt on this stuff, file an issue and get the impl done first. |
DataCheckWarning("Column 'd' is 50.0% or more correlated with the target", "LabelLeakageDataCheck")] | ||
|
||
|
||
def test_label_leakage_data_check_input_formats(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain what this test is covering? I'm not sure I follow at first glance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just different data types passed in as X and y!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, got it, so X contains int/float/bool? What's the difference between columns a and b? If they're both int perhaps one can be deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not sure if it’s linking to the right thing. Are you talking about test_label_leakage_data_check_input_formats()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not sure if it’s linking to the right thing. Are you talking about test_label_leakage_data_check_input_formats()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first test was just a test that I posted over which checks that warnings are returned when expected, the second test tests different inputs for X and y (ex: X as a np.array, pd.DataFrame, y as array, list)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, and I was talking about the X
dataframe you set up here, which defines columns a - e. I guess I should've linked directly to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, values in X dataframe don't really matter, I was just using it as I set y to a list, and then grabbing a numpy version via X.to_numpy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This rocks! I left another round of docs suggestions. I don't really have any code issues--I could comment on the impls but I know you ported them from previous code, plus the test coverage looks great. Nice work!
|
||
# test y as list | ||
messages = label_leakage_check.validate(X, [1, 0, 1, 1]) | ||
assert messages == expected_messages |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can save a line and do assert label_leakage_check.validate(X, [1, 0, 1, 1]) == expected_messages
, same for other callsites
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, changed this for test_highly_null_data_check_input_formats
but the lines are so incredibly long, I don't know if I really like this better lol
Closes #370
Adds IDColumnsDataCheck, LabelLeakageDataCheck, HighlyNullDataCheck to DefaultDataChecks. We chose not to add OutliersDataChecks because it trains a model and takes too much time; on a dataset with 100,000 rows, it took ~30 seconds!