
Change OutliersDataCheck to find outliers for columns #1377

Merged: 8 commits merged into main on Nov 6, 2020

Conversation

bchen1116 (Contributor):
Fixes #1313.

Return warnings for a list of columns rather than rows.

@bchen1116 bchen1116 self-assigned this Oct 30, 2020
codecov bot commented Oct 30, 2020:

Codecov Report

Merging #1377 into main will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1377     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         213      213             
  Lines       13936    13942      +6     
=========================================
+ Hits        13929    13935      +6     
  Misses          7        7             
Impacted Files Coverage Δ
evalml/data_checks/outliers_data_check.py 100.0% <100.0%> (ø)
...ests/data_checks_tests/test_outliers_data_check.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03bf466...5c59b1b.

@bchen1116 bchen1116 marked this pull request as ready for review November 2, 2020 14:45
freddyaboulton (Contributor) left a comment:


@bchen1116 This is great! I have a question about whether or not we should still keep the old behavior of using scores from an IsolationForest.

[Review thread on evalml/data_checks/outliers_data_check.py (outdated, resolved)]
```python
indices = set()
# get the columns that fall out of the bounds, which means they contain outliers
for idx, bound in enumerate([lower_bound, upper_bound]):
    boundary = ((X[bound.keys()] >= bound.values) if idx == 0 else (X[bound.keys()] <= bound.values))[bound.keys()].all()
```
freddyaboulton (Contributor):

Before we were computing the IQR of the scores returned by an IsolationForest and now we are computing the IQR of the column values. Are those two things the same?

If the established best practice is to use an isolation forest, then one thing we can do is keep the isolation forest but then use SHAP to find the columns that contribute to the high/low scores. Here is a proof-of-concept based on one of the existing unit tests:

[screenshot: SHAP proof-of-concept based on an existing unit test]
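For context, the pre-PR behavior being discussed (compute IsolationForest anomaly scores per row, then flag rows whose scores fall outside the IQR fences) can be sketched roughly as follows. This is a hedged illustration assuming scikit-learn and pandas; `score_outlier_rows` is a hypothetical name, not the actual data check code, and the SHAP attribution step from the proof-of-concept above is omitted:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def score_outlier_rows(X, random_state=0):
    """Flag ROWS whose IsolationForest anomaly score falls outside the
    Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR) computed over all scores."""
    forest = IsolationForest(random_state=random_state).fit(X)
    # score_samples: higher means more "normal", lower means more anomalous
    scores = pd.Series(forest.score_samples(X))
    q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return scores[(scores < lower) | (scores > upper)].index.tolist()
```

Note that the IQR here is taken over the model's scores, not over the raw column values, which is exactly the distinction raised in the comment above.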

A contributor replied:

Don't think they're exactly the same; here's the post I had referenced while implementing via IsolationForest: https://towardsdatascience.com/isolation-forest-with-statistical-rules-4dd27dad2da9; that being said, I'm for keeping it simple via IQR only or exploring using SHAP--what @freddyaboulton outlined looks pretty cool 😲
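The simpler IQR-only approach under discussion could look something like this (a minimal sketch assuming pandas; `columns_with_outliers` and the 1.5 fence multiplier are illustrative choices, not the merged implementation):

```python
import pandas as pd

def columns_with_outliers(X: pd.DataFrame, k: float = 1.5):
    """Flag COLUMNS (not rows) containing at least one value outside
    the Tukey fences [Q1 - k*IQR, Q3 + k*IQR] of that column."""
    q1, q3 = X.quantile(0.25), X.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # comparisons broadcast the per-column bounds across rows
    has_outlier = (X.lt(lower) | X.gt(upper)).any()
    return has_outlier[has_outlier].index.tolist()

X = pd.DataFrame({
    "no_outliers": [1, 2, 3, 4, 5],
    "one_outlier": [1, 2, 3, 4, 1000],
})
print(columns_with_outliers(X))  # → ['one_outlier']
```

Unlike the IsolationForest variant, this needs no model fit and its verdict per column is independent of the other columns.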

bchen1116 (Contributor, Author):

Yeah, they are different! I was experimenting around with SHAP since I'm not very familiar with it, and I think for now, it makes more sense to keep the simpler implementation and not use IsolationForest/SHAP. One of the tests I ran was this:
[screenshot: test comparing the iso/SHAP combo with the simpler IQR implementation, Nov 2, 2020]

When I included lots of other values in column 3, where half the dataset was > 500 and half was very small values, the iso/SHAP combo still classifies column 3 as an outlier, while the simpler implementation doesn't. Not sure which behavior we prefer, but statistically, I'm more inclined to say the simpler implementation makes more sense. Let me know your thoughts @angela97lin @freddyaboulton
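The exact data from the screenshot isn't shown, but a hedged reconstruction of the described column 3 (half large values, half very small) shows why a per-column IQR check stays quiet on bimodal data: the quartiles straddle both modes, so the fences grow wide enough to cover everything.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# stand-in for "column 3": half the values are > 500, half are tiny
col3 = pd.Series(np.concatenate([rng.uniform(500, 600, 50),
                                 rng.uniform(0.0, 1.0, 50)]))
q1, q3 = col3.quantile(0.25), col3.quantile(0.75)
iqr = q3 - q1  # huge, because Q1 sits in the small mode and Q3 in the large one
outliers = col3[(col3 < q1 - 1.5 * iqr) | (col3 > q3 + 1.5 * iqr)]
print(len(outliers))  # → 0: nothing is flagged
```

An IsolationForest, by contrast, can still assign anomalous scores to rows in such a column, which is consistent with the disagreement described above.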

freddyaboulton (Contributor):

@bchen1116 If I follow your example, column 3 is classified as an outlier for row 0 but I don't think we should compute the SHAP value for row 0 in your example because it doesn't have an isolation forest score outside the IQR. I computed the shap for row 0 in my example because I knew it would have a score outside the IQR (that was the behavior from the old unit test). Sorry if that was confusing!

I think your solution makes sense and I think between a complex implementation and a simple implementation we should pick the simple one. That being said, I'm not sure what the best practice is for outlier detection! Maybe others on the team have more decisive thoughts hehe 🙈

bchen1116 (Contributor, Author):

Ahh, I see. Yeah I misunderstood what your code was doing, but this makes sense now! Thanks for clarifying

I agree though, I think this implementation should catch values that are statistically outliers, but would love other input on whether using an IsolationForest would be preferred or not.

jeremyliweishih (Contributor) left a comment:

Looks good! One comment for improvement.

[Review thread on evalml/data_checks/outliers_data_check.py (outdated, resolved)]
angela97lin (Contributor) left a comment:

Agree with @freddyaboulton's comment, could be cool to look into. Regardless of approach, let's update the docs too: https://evalml.alteryx.com/en/bc_1313_outlier/generated/evalml.data_checks.OutliersDataCheck.html

(The text still says: "Checks if there are any outliers in input data by using an Isolation Forest to obtain the anomaly score of each index and then using IQR to determine score anomalies. Indices with score anomalies are considered outliers.")

CLAassistant commented Nov 3, 2020:

CLA assistant check
All committers have signed the CLA.

dsherry (Contributor) left a comment:

@bchen1116 looks good! I left some impl and test suggestions

@bchen1116 bchen1116 merged commit 4bf2f9b into main Nov 6, 2020
@dsherry dsherry mentioned this pull request Nov 24, 2020
@freddyaboulton freddyaboulton deleted the bc_1313_outlier branch May 13, 2022 14:58
Successfully merging this pull request may close these issues.

Outlier detection data check: identify column, not row, with outliers
6 participants