Port over all other existing guardrails as data checks #789

Merged
merged 49 commits into from May 29, 2020

@angela97lin angela97lin commented May 21, 2020

Closes #370

Adds IDColumnsDataCheck, LabelLeakageDataCheck, and HighlyNullDataCheck to DefaultDataChecks. We chose not to add OutliersDataCheck because it trains a model and takes too long: on a dataset with 100,000 rows, it took ~30 seconds!
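
For readers unfamiliar with the idea, here is a minimal sketch of what a default set of cheap checks might look like. It is not the evalml implementation: the function names, thresholds, and the ID-column heuristic are all assumptions made for illustration, and plain dicts of lists stand in for a DataFrame so the sketch has no dependencies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CheckWarning:
    """Hypothetical message type: what was found and which check found it."""
    message: str
    check_name: str


def highly_null_check(X: Dict[str, list], pct_null_threshold: float = 0.95) -> List[CheckWarning]:
    """Warn about columns whose share of missing (None) values meets the threshold."""
    warnings = []
    for name, values in X.items():
        pct_null = sum(v is None for v in values) / len(values) if values else 0.0
        if pct_null >= pct_null_threshold:
            warnings.append(CheckWarning(
                f"Column '{name}' is {pct_null * 100:.1f}% null", "highly_null_check"))
    return warnings


def id_columns_check(X: Dict[str, list]) -> List[CheckWarning]:
    """Warn about columns that look like row IDs (assumed heuristic: ID-like name, all values unique)."""
    warnings = []
    for name, values in X.items():
        looks_like_id = name.lower() == "id" or name.lower().endswith("_id")
        all_unique = len(set(values)) == len(values)
        if looks_like_id and all_unique:
            warnings.append(CheckWarning(
                f"Column '{name}' is likely an ID column", "id_columns_check"))
    return warnings


# Only inexpensive checks go in the default set; a model-based outlier
# check would be left out here because of its runtime.
DEFAULT_CHECKS = [highly_null_check, id_columns_check]


def run_default_checks(X: Dict[str, list]) -> List[CheckWarning]:
    """Run every default check and concatenate the messages in order."""
    messages: List[CheckWarning] = []
    for check in DEFAULT_CHECKS:
        messages.extend(check(X))
    return messages
```

Calling `run_default_checks({"user_id": [1, 2, 3], "x": [None, None, None]})` in this sketch would yield one warning from each check.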

@angela97lin angela97lin self-assigned this May 21, 2020
codecov bot commented May 21, 2020

Codecov Report

Merging #789 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #789      +/-   ##
==========================================
+ Coverage   99.52%   99.56%   +0.03%     
==========================================
  Files         160      161       +1     
  Lines        6333     6404      +71     
==========================================
+ Hits         6303     6376      +73     
+ Misses         30       28       -2     
Impacted Files | Coverage Δ
evalml/__init__.py | 100.00% <ø> (ø)
evalml/data_checks/__init__.py | 100.00% <100.00%> (ø)
evalml/data_checks/default_data_checks.py | 100.00% <100.00%> (ø)
evalml/data_checks/highly_null_data_check.py | 100.00% <100.00%> (ø)
evalml/data_checks/id_columns_data_check.py | 100.00% <100.00%> (ø)
evalml/data_checks/label_leakage_data_check.py | 100.00% <100.00%> (ø)
evalml/data_checks/outliers_data_check.py | 100.00% <100.00%> (ø)
evalml/tests/data_checks_tests/test_data_check.py | 100.00% <100.00%> (ø)
evalml/tests/data_checks_tests/test_data_checks.py | 100.00% <100.00%> (ø)
...s/data_checks_tests/test_highly_null_data_check.py | 100.00% <100.00%> (ø)
... and 10 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5cb5842...8563fb9. Read the comment docs.

@angela97lin angela97lin marked this pull request as ready for review May 26, 2020
@angela97lin angela97lin requested a review from dsherry May 26, 2020
dsherry commented May 27, 2020

@angela97lin I'll review the docs changes but will hold off on making comments to the code while you're updating it. Lmk if you'd rather I just wait for you to finish.

dsherry commented May 27, 2020

@angela97lin I took a look at the docs and I have some thoughts. I see that your changes so far update the existing documentation to refer to data checks rather than guard rails, which is good. But we should show off the new API and make it easy for people to use it and extend it.

So, what do you think of this reorganization:

  • Under the "Automated Machine Learning" header:
    • "Overfitting Protections": move most of what's currently under "Overfitting Data Checks" to here, like our discussion of cross-validation and the holdout. We could also refer to the API reference for the label leakage data check.
  • Under the "Data Checks" header
    • "Data Check API": explain what data checks are and give a simple example of how to write one.
    • "Data Checks in AutoML": list the data checks used in AutoML, perhaps link to the API doc, and talk about how those work (errors will stop the search, checks can be disabled, etc.)

Note I didn't include a section where we describe each implemented data check in detail. I think we should rely on the API ref for that.
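
As a rough illustration of the kind of "simple example of how to write one" proposed above, here is a sketch of a check base class, two custom checks, and a stubbed search that halts on errors. Every name in it (`Check`, `Message`, `EmptyTargetCheck`, `ConstantColumnCheck`, `ChecksFailed`, `run_search`) is hypothetical and chosen for this sketch; it is not the evalml API, and the real search behavior may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Message:
    """Hypothetical result of a check; is_error separates errors from warnings."""
    text: str
    check_name: str
    is_error: bool = False


class Check(ABC):
    """Base class: subclasses inspect the data and return a list of messages."""

    @property
    def name(self) -> str:
        return type(self).__name__

    @abstractmethod
    def validate(self, X: Dict[str, list], y: list) -> List[Message]:
        ...


class EmptyTargetCheck(Check):
    """Custom check: a target with no non-null labels is treated as an error."""

    def validate(self, X, y):
        if all(label is None for label in y):
            return [Message("Target has no non-null values", self.name, is_error=True)]
        return []


class ConstantColumnCheck(Check):
    """Custom check: a column with one distinct value only earns a warning."""

    def validate(self, X, y):
        return [Message(f"Column '{col}' has a single unique value", self.name)
                for col, values in X.items() if len(set(values)) == 1]


class ChecksFailed(Exception):
    """Raised by run_search when any check reports an error."""


def run_search(X: Dict[str, list], y: list, checks: Optional[List[Check]] = None) -> List[Message]:
    """Run checks before a (stubbed) search. Passing checks=[] disables them."""
    if checks is None:
        checks = [EmptyTargetCheck(), ConstantColumnCheck()]
    messages = [m for check in checks for m in check.validate(X, y)]
    errors = [m for m in messages if m.is_error]
    if errors:
        # Errors stop the search before any model is trained.
        raise ChecksFailed("; ".join(m.text for m in errors))
    return messages  # warnings are surfaced, and the search would proceed
```

The design choice sketched here is that a check only reports; the caller decides what an error means, which keeps individual checks easy to write and to disable.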

dsherry commented May 28, 2020

@angela97lin I took another look at the docs. Good stuff! I think what's there now is good enough to merge. All the content is there. I think at this point it can be improved mainly by deleting some stuff and shuffling things around a bit.

Specific thoughts:

  • On the Data Check "Overview" page
  • The second Data Check page shows up with three entries in the sidebar. I suggest we cut that down to one: "Data Checks in AutoML". Other than that I think the content organization is pretty solid.

I also think it's totally fine to punt on this stuff, file an issue and get the impl done first.

DataCheckWarning("Column 'd' is 50.0% or more correlated with the target", "LabelLeakageDataCheck")]


def test_label_leakage_data_check_input_formats():

@dsherry dsherry May 28, 2020

Could you explain what this test is covering? I'm not sure I follow at first glance

@angela97lin angela97lin May 28, 2020

Just different data types passed in as X and y!

@dsherry dsherry May 28, 2020

Oh, got it, so X contains int/float/bool? What's the difference between columns a and b? If they're both int perhaps one can be deleted.

@angela97lin angela97lin May 28, 2020

Hmm not sure if it’s linking to the right thing. Are you talking about test_label_leakage_data_check_input_formats()?

@angela97lin angela97lin May 28, 2020

The first test is just one I ported over, which checks that warnings are returned when expected. The second test covers different input formats for X and y (e.g., X as an np.array or pd.DataFrame, y as an array or list).

@dsherry dsherry May 28, 2020

Yeah, and I was talking about the X dataframe you set up here, which defines columns a - e. I guess I should've linked directly to that.

@angela97lin angela97lin May 29, 2020

Hm, the values in the X dataframe don't really matter. I was just using it while setting y to a list, and then grabbing a numpy version of it via X.to_numpy().
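
The input-format test discussed in this thread can be sketched roughly as follows. This is an assumption-laden stand-in, not the test from the PR: `label_leakage_check` and `pearson` are hypothetical helpers written for this sketch, the 0.95 threshold is an arbitrary choice, and standard-library sequence types (list, tuple, `array.array`) replace the numpy and pandas inputs so the example runs without dependencies.

```python
from array import array
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    xs, ys = [float(v) for v in xs], [float(v) for v in ys]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def label_leakage_check(X, y, pct_corr_threshold=0.95):
    """Return warning strings for columns whose |correlation| with y meets the threshold."""
    y = list(y)  # normalize once so list, tuple, and array inputs behave the same
    return [
        f"Column '{name}' is {pct_corr_threshold * 100:.1f}% or more correlated with the target"
        for name, values in X.items()
        if abs(pearson(values, y)) >= pct_corr_threshold
    ]


def test_label_leakage_check_input_formats():
    # The values in X matter less than the container types passed for y.
    X = {"a": [1, 0, 1, 1], "b": [3, 1, 2, 2]}
    expected = ["Column 'a' is 95.0% or more correlated with the target"]
    y_formats = (
        [1, 0, 1, 1],                 # list
        (1, 0, 1, 1),                 # tuple
        array("d", [1, 0, 1, 1]),     # typed array
        [True, False, True, True],    # booleans
    )
    for y in y_formats:
        assert label_leakage_check(X, y) == expected
```

Looping over the formats against one shared `expected` value keeps the focus on the input types rather than on the data in X.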

@dsherry dsherry left a comment

This rocks! I left another round of docs suggestions. I don't really have any code issues--I could comment on the impls but I know you ported them from previous code, plus the test coverage looks great. Nice work!


# test y as list
messages = label_leakage_check.validate(X, [1, 0, 1, 1])
assert messages == expected_messages

@dsherry dsherry May 28, 2020

Can save a line and do assert label_leakage_check.validate(X, [1, 0, 1, 1]) == expected_messages, same for other callsites

@angela97lin angela97lin May 29, 2020

Hm, changed this for test_highly_null_data_check_input_formats but the lines are so incredibly long, I don't know if I really like this better lol

@angela97lin angela97lin merged commit c026a89 into master May 29, 2020
2 checks passed
@dsherry dsherry deleted the 370_port branch Oct 29, 2020
Development

Successfully merging this pull request may close these issues.

Data Checks API: Port all other existing guard rails to use new API