Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added data check action for setting first column as primary key if identified as ID #3634

Merged
merged 5 commits into from
Aug 1, 2022

Conversation

simha104
Copy link
Contributor

@simha104 simha104 commented Aug 1, 2022

Pull Request Description

Added data check action for setting first column as primary key if identified as ID

@codecov
Copy link

codecov bot commented Aug 1, 2022

Codecov Report

Merging #3634 (00cd963) into main (e83e7db) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3634     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        335     335             
  Lines      33638   33666     +28     
=======================================
+ Hits       33515   33543     +28     
  Misses       123     123             
Impacted Files Coverage Δ
evalml/data_checks/data_check_action_code.py 100.0% <100.0%> (ø)
evalml/data_checks/data_check_message_code.py 100.0% <100.0%> (ø)
evalml/data_checks/id_columns_data_check.py 100.0% <100.0%> (ø)
...ts/data_checks_tests/test_id_columns_data_check.py 100.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us.

@simha104 simha104 changed the title Added data check action for setting first column as primary key if in… Added data check action for setting first column as primary key if identified as ID Aug 1, 2022
@simha104 simha104 marked this pull request as ready for review August 1, 2022 16:35
Copy link
Contributor

@ParthivNaresh ParthivNaresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, I think we should expect other columns that could be IDs to also show up in the response alongside the potential primary key column

@@ -28,6 +28,7 @@ Release Notes
* Changes
* Add pre-commit hooks for linting :pr:`3608`
* Implemented a lower threshold and window size for the ``TimeSeriesRegularizer`` and ``DatetimeFormatDataCheck`` :pr:`3627`
* Datasets where the first column is identified as an id column now produce a recommended action to set the column as the primary key :pr:`3634`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you include the reference to the data check that was changed, so something like
"Updated IDColumnsDataCheck to return an action to set the first column as the primary key if it is identified as an ID column"

)
message_code = DataCheckMessageCode.HAS_ID_FIRST_COLUMN
action_code = DataCheckActionCode.SET_FIRST_COL_ID
details = {"column": col_names[0]}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be changed to columns

message_code = DataCheckMessageCode.HAS_ID_FIRST_COLUMN
action_code = DataCheckActionCode.SET_FIRST_COL_ID
details = {"column": col_names[0]}
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we trying to make this an either/or scenario? I think we would still want to know if other columns could be ID columns right?

"col_6": [0.1, 0.2, 0.3, 0.4],
}

X = pd.DataFrame.from_dict(X_dict)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would expect col_3_id to also show up under DataCheckMessageCode.HAS_ID_COLUMN.

),
],
).to_dict(),
]

# test np.array
assert id_cols_check.validate(
np.array([[0, 1], [1, 2], [2, 3], [3, 4], [4, 5]]),
np.array([[0, 1], [0, 2], [2, 3], [3, 4], [4, 5]]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were some of these other tests changed?

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with the comments Parthiv left. Left some nits, but otherwise looks good to me!

@@ -22,6 +22,9 @@ class DataCheckActionCode(Enum):
REGULARIZE_AND_IMPUTE_DATASET = "regularize_and_impute_dataset"
"""Action code for regularizing and imputing all features and target time series data."""

SET_FIRST_COL_ID = "set_first_col_id"
"""Action code for setting the first column as an id column"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: period

@@ -17,6 +17,9 @@ class DataCheckMessageCode(Enum):
HAS_ID_COLUMN = "has_id_column"
"""Message code for data that has ID columns."""

HAS_ID_FIRST_COLUMN = "has_id_first_column"
"""Message code for data that has an ID column as the first column"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: period

Copy link
Contributor

@ParthivNaresh ParthivNaresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@simha104 simha104 merged commit 05a38e0 into main Aug 1, 2022
@simha104 simha104 deleted the TML-2022-backend-1st-column-id-mark-as-primary-key branch August 1, 2022 19:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants