-
Notifications
You must be signed in to change notification settings - Fork 89
Adding basic detect id columns guardrail #135
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #135 +/- ##
==========================================
+ Coverage 96.64% 96.65% +0.01%
==========================================
Files 89 90 +1
Lines 2233 2271 +38
==========================================
+ Hits 2158 2195 +37
- Misses 75 76 +1
Continue to review full report at Codecov.
|
A dictionary of features with column name or index and their probability of being ID columns | ||
""" | ||
id_cols = {} | ||
col_names = [str(col) for col in X.columns.tolist()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the logic internally here a bit a hard to follow.
it seems to me that if
.95 if any one of the 3 cases are true
or
1.0 if case 1 and 2 are true or 2 and 3 are true (case 1 and 3 being true isn't possible, but it's not immediately obvious through reading).
maybe we can take another stab to refactor? happy to discuss more if needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to my comment on parameters about being more generous; since we're just issuing warnings would it be better to just set to 1.0 if any of the cases are true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the current checks, it may make sense to do as you suggested @jeremyliweishih, as each of the checks are decent indications of an ID column... Otherwise, I could give each check a "confidence percentage" and sum up a column's percentage across the three current checks. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think for now, let's not worry about the implementation much. as long as we're happy with API, we can change implementation in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think regardless of intent to use as a separate tool or as part of AutoBase: if our process is clear through documentation and we're not actively removing columns, I think it would be best to set a column as ID if it passes any of the checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
A dictionary of features with column name or index and their probability of being ID columns | ||
""" | ||
id_cols = {} | ||
col_names = [str(col) for col in X.columns.tolist()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think for now, let's not worry about the implementation much. as long as we're happy with API, we can change implementation in the future
Fixes #115