-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added data check action for setting first column as primary key if identified as ID #3634
Added data check action for setting first column as primary key if identified as ID #3634
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3634 +/- ##
=======================================
+ Coverage 99.7% 99.7% +0.1%
=======================================
Files 335 335
Lines 33638 33666 +28
=======================================
+ Hits 33515 33543 +28
Misses 123 123
Help us with your feedback. Take ten seconds to tell us how you rate us. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, I think we should expect other columns that could be IDs to also show up in the response alongside the potential primary key column
docs/source/release_notes.rst
Outdated
@@ -28,6 +28,7 @@ Release Notes | |||
* Changes | |||
* Add pre-commit hooks for linting :pr:`3608` | |||
* Implemented a lower threshold and window size for the ``TimeSeriesRegularizer`` and ``DatetimeFormatDataCheck`` :pr:`3627` | |||
* Datasets where the first column is identified as an id column now produce a recommended action to set the column as the primary key :pr:`3634` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you include the reference to the data check that was changed, so something like
"Updated IDColumnsDataCheck
to return an action to set the first column as the primary key if it is identified as an ID column"
) | ||
message_code = DataCheckMessageCode.HAS_ID_FIRST_COLUMN | ||
action_code = DataCheckActionCode.SET_FIRST_COL_ID | ||
details = {"column": col_names[0]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be changed to columns
message_code = DataCheckMessageCode.HAS_ID_FIRST_COLUMN | ||
action_code = DataCheckActionCode.SET_FIRST_COL_ID | ||
details = {"column": col_names[0]} | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we trying to make this an either/or scenario? I think we would still want to know if other columns could be ID columns right?
"col_6": [0.1, 0.2, 0.3, 0.4], | ||
} | ||
|
||
X = pd.DataFrame.from_dict(X_dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would expect col_3_id
to also show up under DataCheckMessageCode.HAS_ID_COLUMN
.
), | ||
], | ||
).to_dict(), | ||
] | ||
|
||
# test np.array | ||
assert id_cols_check.validate( | ||
np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]]), | ||
np.array([[0, 1], [0, 2], [2, 3], [3, 4], [4, 5]]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why were some of these other tests changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed with the comments Parthiv left. Left some nits, but otherwise looks good to me!
@@ -22,6 +22,9 @@ class DataCheckActionCode(Enum): | |||
REGULARIZE_AND_IMPUTE_DATASET = "regularize_and_impute_dataset" | |||
"""Action code for regularizing and imputing all features and target time series data.""" | |||
|
|||
SET_FIRST_COL_ID = "set_first_col_id" | |||
"""Action code for setting the first column as an id column""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: period
@@ -17,6 +17,9 @@ class DataCheckMessageCode(Enum): | |||
HAS_ID_COLUMN = "has_id_column" | |||
"""Message code for data that has ID columns.""" | |||
|
|||
HAS_ID_FIRST_COLUMN = "has_id_first_column" | |||
"""Message code for data that has an ID column as the first column""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: period
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Pull Request Description
Added data check action for setting first column as primary key if identified as ID