Adding basic detect id columns guardrail #135

Merged
merged 22 commits into from Nov 5, 2019

@angela97lin angela97lin commented Oct 16, 2019

Fixes #115

@angela97lin angela97lin reopened this Oct 16, 2019

codecov bot commented Oct 16, 2019

Codecov Report

Merging #135 into master will increase coverage by 0.01%.
The diff coverage is 97.36%.

@@            Coverage Diff             @@
##           master     #135      +/-   ##
==========================================
+ Coverage   96.64%   96.65%   +0.01%     
==========================================
  Files          89       90       +1     
  Lines        2233     2271      +38     
==========================================
+ Hits         2158     2195      +37     
- Misses         75       76       +1

| Impacted Files | Coverage Δ |
|---|---|
| evalml/models/auto_regressor.py | 90.9% <ø> (ø) ⬆️ |
| evalml/models/auto_classifier.py | 100% <ø> (ø) ⬆️ |
| ...ests/preprocessing_tests/test_detect_id_columns.py | 100% <100%> (ø) |
| evalml/guardrails/utils.py | 96.42% <100%> (+3.09%) ⬆️ |
| evalml/models/auto_base.py | 93.19% <80%> (-0.29%) ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4a58a11...6933c6b.

@angela97lin angela97lin self-assigned this Oct 17, 2019
@angela97lin angela97lin requested a review from kmax12 Oct 17, 2019
@angela97lin angela97lin removed the request for review from kmax12 Oct 18, 2019
evalml/guardrails/utils.py (outdated)
A dictionary of features with column name or index and their probability of being ID columns
"""
id_cols = {}
col_names = [str(col) for col in X.columns.tolist()]

@kmax12 kmax12 Oct 22, 2019

I find the internal logic here a bit hard to follow.

As far as I can tell, a column gets:

.95 if any one of the 3 cases is true

or

1.0 if cases 1 and 2 are true or cases 2 and 3 are true (cases 1 and 3 can't both be true, but that's not immediately obvious from reading the code).

Maybe we can take another stab at refactoring? Happy to discuss more if needed.
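
For reference, the scoring described in this comment could be sketched roughly as follows. This is a minimal sketch, not the code in this PR: the function name `score_id_columns` is hypothetical, and the three checks (name equals "id", all values unique, name ends with "_id") are assumptions, since the thread only refers to "3 cases". It takes a plain mapping of column name to values so that it runs without pandas.

```python
def score_id_columns(columns, threshold=0.0):
    """Hypothetical sketch: score each column by how many ID checks it passes.

    `columns` maps a column name to a sequence of hashable values.
    Returns {column name: score} for columns scoring at least `threshold`.
    """
    scores = {}
    for col, values in columns.items():
        name = str(col).lower()
        values = list(values)
        # Assumed checks; the thread mentions "3 cases" without listing them.
        checks = [
            name == "id",                                            # case 1: named "id"
            len(values) > 0 and len(set(values)) == len(values),     # case 2: all values unique
            name.endswith("_id"),                                    # case 3: name ends with "_id"
        ]
        passed = sum(checks)
        if passed == 0:
            continue
        # One passing check gives 0.95; two passing checks give 1.0.
        # Cases 1 and 3 are mutually exclusive, so `passed` is at most 2.
        score = 0.95 if passed == 1 else 1.0
        if score >= threshold:
            scores[str(col)] = score
    return scores
```

Counting the passed checks makes the "1 and 3 can't both be true" constraint visible in one place. With a pandas DataFrame `X`, passing `X.to_dict(orient="list")` should feed the same function.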

@jeremyliweishih jeremyliweishih Oct 23, 2019

Related to my comment on parameters about being more generous: since we're just issuing warnings, would it be better to just set the probability to 1.0 if any of the cases are true?

@angela97lin angela97lin Oct 25, 2019

Given the current checks, it may make sense to do as you suggested @jeremyliweishih, as each of the checks is a decent indication of an ID column. Otherwise, I could give each check a "confidence percentage" and sum up a column's percentage across the three current checks. Thoughts?
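
To make the two options concrete, here is a rough sketch of both. The function names are hypothetical and the per-check weights are placeholders chosen only for illustration; nothing in the thread fixes their values. Each function takes the list of boolean check results for a single column.

```python
def any_check_score(checks):
    """Option A: treat the column as an ID column (1.0) if any check passes."""
    return 1.0 if any(checks) else 0.0


def summed_confidence_score(checks, weights):
    """Option B: add up a per-check confidence for passing checks, capped at 1.0.

    `weights` holds one illustrative confidence value per check, in the same order.
    """
    total = sum(weight for passed, weight in zip(checks, weights) if passed)
    return min(1.0, total)
```

Option A keeps the output binary, which fits a warning-only guardrail. Option B keeps a graded score, at the cost of having to choose and justify the weights.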

@kmax12 kmax12 Nov 5, 2019

i think for now, let's not worry about the implementation much. as long as we're happy with API, we can change implementation in the future

@jeremyliweishih jeremyliweishih left a comment

Regardless of whether this is intended as a separate tool or as part of AutoBase: if our process is clear through documentation and we're not actively removing columns, I think it would be best to mark a column as an ID column if it passes any of the checks.

kmax12 previously approved these changes Nov 5, 2019

@kmax12 kmax12 left a comment

LGTM

@angela97lin angela97lin merged commit 9525b3f into master Nov 5, 2019
@angela97lin angela97lin mentioned this pull request Nov 15, 2019
@angela97lin angela97lin deleted the gr_id branch Apr 17, 2020
Linked issue (may be closed by this pull request): Guardrail: Detect id columns
3 participants