
Change OutliersDataCheck to find outliers for columns #1377

Merged: 8 commits merged into main on Nov 6, 2020

Conversation

bchen1116 (Contributor):
Fixes #1313.

Return warnings for a list of columns rather than rows.

@bchen1116 bchen1116 self-assigned this Oct 30, 2020
codecov bot commented Oct 30, 2020:

Codecov Report

Merging #1377 into main will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1377     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         213      213             
  Lines       13936    13942      +6     
=========================================
+ Hits        13929    13935      +6     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/data_checks/outliers_data_check.py 100.0% <100.0%> (ø)
...ests/data_checks_tests/test_outliers_data_check.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03bf466...5c59b1b.

@bchen1116 bchen1116 marked this pull request as ready for review November 2, 2020 14:45
freddyaboulton (Contributor) left a comment:


@bchen1116 This is great! I have a question about whether or not we should still keep the old behavior of using scores from an IsolationForest.

[Review thread on evalml/data_checks/outliers_data_check.py (outdated, resolved)]
```python
indices = set()
# get the columns that fall out of the bounds, which means they contain outliers
for idx, bound in enumerate([lower_bound, upper_bound]):
    boundary = ((X[bound.keys()] >= bound.values) if idx == 0 else (X[bound.keys()] <= bound.values))[bound.keys()].all()
```
freddyaboulton (Contributor):

Before we were computing the IQR of the scores returned by an IsolationForest and now we are computing the IQR of the column values. Are those two things the same?

If the established best practice is to use an isolation forest, then one thing we can do is keep the isolation forest but then use SHAP to find the columns that contribute to the high/low scores. Here is a proof-of-concept based on one of the existing unit tests:

[screenshot: SHAP proof-of-concept based on an existing unit test]
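For context, the pre-PR behavior being discussed (compute IsolationForest anomaly scores per row, then flag rows whose scores fall outside the IQR fences) can be sketched roughly as follows. This is a hedged illustration assuming scikit-learn and pandas; `score_outlier_rows` is a hypothetical name, not the actual data check code, and the SHAP attribution step from the proof-of-concept above is omitted:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def score_outlier_rows(X, random_state=0):
    """Flag ROWS whose IsolationForest anomaly score falls outside the
    Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) computed over all scores."""
    forest = IsolationForest(random_state=random_state).fit(X)
    # score_samples: higher means more "normal", lower means more anomalous
    scores = pd.Series(forest.score_samples(X))
    q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return scores[(scores < lower) | (scores > upper)].index.tolist()
```

Note that the IQR here is taken over the model's scores, not over the raw column values, which is exactly the distinction raised in the comment above.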

A contributor replied:

Don't think they're exactly the same; here's the post I had referenced while implementing via IsolationForest: https://towardsdatascience.com/isolation-forest-with-statistical-rules-4dd27dad2da9; that being said, I'm for keeping it simple via IQR only or exploring using SHAP--what @freddyaboulton outlined looks pretty cool 😲
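The simpler IQR-only approach under discussion could look something like this (a minimal sketch assuming pandas; `columns_with_outliers` and the 1.5 fence multiplier are illustrative choices, not the merged implementation):

```python
import pandas as pd

def columns_with_outliers(X: pd.DataFrame, k: float = 1.5):
    """Flag COLUMNS (not rows) containing at least one value outside
    the Tukey fences [Q1 - k*IQR, Q3 + k*IQR] of that column."""
    q1, q3 = X.quantile(0.25), X.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # comparisons broadcast the per-column bounds across rows
    has_outlier = (X.lt(lower) | X.gt(upper)).any()
    return has_outlier[has_outlier].index.tolist()

X = pd.DataFrame({
    "no_outliers": [1, 2, 3, 4, 5],
    "one_outlier": [1, 2, 3, 4, 1000],
})
print(columns_with_outliers(X))  # → ['one_outlier']
```

Unlike the IsolationForest variant, this needs no model fit and its verdict per column is independent of the other columns.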

bchen1116 (Contributor, Author):

Yeah, they are different! I was experimenting around with SHAP since I'm not very familiar with it, and I think for now, it makes more sense to keep the simpler implementation and not use IsolationForest/SHAP. One of the tests I ran was this:
[screenshot: test comparing the iso/SHAP combo with the simpler IQR implementation, Nov 2, 2020]

When I included lots of other values in column 3, where half the dataset was > 500 and half was very small values, the iso/SHAP combo still classifies column 3 as an outlier, while the simpler implementation doesn't. Not sure which behavior we prefer, but statistically, I'm more inclined to say the simpler implementation makes more sense. Let me know your thoughts @angela97lin @freddyaboulton
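The exact data from the screenshot isn't shown, but a hedged reconstruction of the described column 3 (half large values, half very small) shows why a per-column IQR check stays quiet on bimodal data: the quartiles straddle both modes, so the fences grow wide enough to cover everything.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# stand-in for "column 3": half the values are > 500, half are tiny
col3 = pd.Series(np.concatenate([rng.uniform(500, 600, 50),
                                 rng.uniform(0.0, 1.0, 50)]))
q1, q3 = col3.quantile(0.25), col3.quantile(0.75)
iqr = q3 - q1  # huge, because Q1 sits in the small mode and Q3 in the large one
outliers = col3[(col3 < q1 - 1.5 * iqr) | (col3 > q3 + 1.5 * iqr)]
print(len(outliers))  # → 0: nothing is flagged
```

An IsolationForest, by contrast, can still assign anomalous scores to rows in such a column, which is consistent with the disagreement described above.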

freddyaboulton (Contributor):

@bchen1116 If I follow your example, column 3 is classified as an outlier for row 0 but I don't think we should compute the SHAP value for row 0 in your example because it doesn't have an isolation forest score outside the IQR. I computed the shap for row 0 in my example because I knew it would have a score outside the IQR (that was the behavior from the old unit test). Sorry if that was confusing!

I think your solution makes sense and I think between a complex implementation and a simple implementation we should pick the simple one. That being said, I'm not sure what the best practice is for outlier detection! Maybe others on the team have more decisive thoughts hehe 🙈

bchen1116 (Contributor, Author):

Ahh, I see. Yeah I misunderstood what your code was doing, but this makes sense now! Thanks for clarifying

I agree though, I think this implementation should catch values that are statistically outliers, but would love other input on whether using an IsolationForest would be preferred or not.

jeremyliweishih (Contributor) left a comment:

Looks good! One comment for improvement.

[Review thread on evalml/data_checks/outliers_data_check.py (outdated, resolved)]
angela97lin (Contributor) left a comment:

Agree with @freddyaboulton's comment, could be cool to look into. Regardless of approach, let's update the docs too: https://evalml.alteryx.com/en/bc_1313_outlier/generated/evalml.data_checks.OutliersDataCheck.html

(The text still says: "Checks if there are any outliers in input data by using an Isolation Forest to obtain the anomaly score of each index and then using IQR to determine score anomalies. Indices with score anomalies are considered outliers.")

CLAassistant commented Nov 3, 2020:

CLA assistant check
All committers have signed the CLA.

dsherry (Contributor) left a comment:

@bchen1116 looks good! I left some impl and test suggestions

@bchen1116 bchen1116 merged commit 4bf2f9b into main Nov 6, 2020
@dsherry dsherry mentioned this pull request Nov 24, 2020
@freddyaboulton freddyaboulton deleted the bc_1313_outlier branch May 13, 2022 14:58
Successfully merging this pull request may close these issues.

Outlier detection data check: identify column, not row, with outliers
6 participants