Change OutliersDataCheck to find outliers for columns #1377
Conversation
Codecov Report

```
@@           Coverage Diff            @@
##            main    #1377     +/-  ##
=========================================
+ Coverage  100.0%   100.0%    +0.1%
=========================================
  Files        213      213
  Lines      13936    13942       +6
=========================================
+ Hits       13929    13935       +6
  Misses         7        7
=========================================
```

Continue to review full report at Codecov.
@bchen1116 This is great! I have a question about whether we should still keep the old behavior of using scores from an IsolationForest.
```python
indices = set()
# get the columns that fall out of the bounds, which means they contain outliers
for idx, bound in enumerate([lower_bound, upper_bound]):
    boundary = ((X[bound.keys()] >= bound.values) if idx == 0 else (X[bound.keys()] <= bound.values))[bound.keys()].all()
```
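For context, the per-column IQR check this diff implements can be sketched end-to-end on hypothetical data (the frame, column names, and values below are invented for illustration, not taken from the PR):

```python
import pandas as pd

# Hypothetical frame: column "b" contains one obvious outlier.
X = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [10, 11, 12, 13, 1000]})

# Per-column IQR bounds: any value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# marks its whole column as containing outliers.
q1, q3 = X.quantile(0.25), X.quantile(0.75)
iqr = q3 - q1
lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr

has_outliers = (X.lt(lower_bound) | X.gt(upper_bound)).any()
cols_with_outliers = list(X.columns[has_outliers])  # ["b"]
```

Note this reports columns, not row indices, which is exactly the behavior change the PR title describes.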
Before we were computing the IQR of the scores returned by an IsolationForest and now we are computing the IQR of the column values. Are those two things the same?
If the established best practice is to use an isolation forest, then one thing we can do is keep the isolation forest but then use SHAP to find the columns that contribute to the high/low scores. Here is a proof-of-concept based on one of the existing unit tests:
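(The proof-of-concept snippet itself wasn't preserved in this thread. As a rough, non-authoritative sketch of the *prior* behavior under discussion — IQR applied to IsolationForest anomaly scores — assuming scikit-learn and synthetic data:)

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical data: 100 rows, 3 columns, one extreme value in the last column.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[0, 2] = 1000.0

# Old behavior per the docstring: score every ROW with an IsolationForest,
# then apply the 1.5 * IQR rule to those scores (lower score = more anomalous).
scores = IsolationForest(random_state=0).fit(X).score_samples(X)
q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
outlier_rows = np.where((scores < q1 - 1.5 * iqr) | (scores > q3 + 1.5 * iqr))[0]
# This flags rows, not the columns responsible -- attributing the score
# to columns is the gap the SHAP idea above would fill.
```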
Don't think they're exactly the same; here's the post I referenced while implementing via IsolationForest: https://towardsdatascience.com/isolation-forest-with-statistical-rules-4dd27dad2da9. That said, I'm for either keeping it simple via IQR only or exploring SHAP; what @freddyaboulton outlined looks pretty cool 😲
Yeah, they are different! I was experimenting with SHAP since I'm not very familiar with it, and I think for now it makes more sense to keep the simpler implementation and not use IsolationForest/SHAP. One of the tests I ran was this:
When I included lots of other values in column 3, where half the dataset was > 500 and half were very small values, the iso/SHAP combo still classifies column 3 as containing outliers, while the simpler implementation doesn't. Not sure which behavior we prefer, but statistically, I'm more inclined to say the simpler implementation makes more sense. Let me know your thoughts @angela97lin @freddyaboulton
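(As a quick illustration of why a bimodal column like that isn't flagged by the plain per-column IQR rule — synthetic values standing in for the actual test data:)

```python
import numpy as np

# Hypothetical bimodal column: half the values near 1, half above 500.
col = np.concatenate([np.full(50, 1.0), np.full(50, 600.0)])

q1, q3 = np.percentile(col, [25, 75])
iqr = q3 - q1  # the IQR spans both modes, so the bounds become very wide
is_outlier = (col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)
print(is_outlier.any())  # False: the IQR rule sees two clusters, not outliers
```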
@bchen1116 If I follow your example, column 3 is classified as an outlier for row 0, but I don't think we should compute the SHAP value for row 0 in your example because it doesn't have an isolation forest score outside the IQR. I computed the SHAP value for row 0 in my example because I knew it would have a score outside the IQR (that was the behavior from the old unit test). Sorry if that was confusing!
I think your solution makes sense and I think between a complex implementation and a simple implementation we should pick the simple one. That being said, I'm not sure what the best practice is for outlier detection! Maybe others on the team have more decisive thoughts hehe 🙈
Ahh, I see. Yeah I misunderstood what your code was doing, but this makes sense now! Thanks for clarifying
I agree, though. I think this implementation should catch values that are statistical outliers, but I would love other input on whether using an IsolationForest would be preferred.
Looks good! One comment for improvement.
Agree with @freddyaboulton's comment, could be cool to look into. Regardless of approach, let's update the docs too: https://evalml.alteryx.com/en/bc_1313_outlier/generated/evalml.data_checks.OutliersDataCheck.html#evalml.data_checks.OutliersDataCheck
(The text still says: "Checks if there are any outliers in input data by using an Isolation Forest to obtain the anomaly score of each index and then using IQR to determine score anomalies. Indices with score anomalies are considered outliers.")
Force-pushed from a68f506 to 183f118
@bchen1116 looks good! I left some impl and test suggestions
fix #1313
Return warnings for a list of columns rather than rows.