Uniqueness of first column is now checked before flagging it as a primary key #3639
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main    #3639    +/-  ##
=======================================
+ Coverage    99.7%    99.7%    +0.1%
=======================================
  Files         337      337
  Lines       34004    34021      +17
=======================================
+ Hits        33873    33890      +17
  Misses        131      131
action_options=[
    DataCheckActionOption(
        DataCheckActionCode.SET_FIRST_COL_ID,
        data_check_name=id_data_check_name,
-       metadata={"columns": [0, 1]},
+       metadata={"columns": ["ID", "col_2", "col_3_id"]},
How is this working exactly? How can we tell which columns the recommendation says to drop and which it recommends marking as the primary key?
It is checking whether the first column satisfies these two characteristics:
- It is either named 'ID' or its name ends with '_id'
- All of its values are unique

If both characteristics are met, it suggests marking the column as the primary key. Otherwise, it suggests dropping the columns.
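A minimal sketch of that two-part check (hypothetical helper name; plain Python dict-of-lists stands in for the real DataFrame input):

```python
def first_col_is_likely_primary_key(col_names, data):
    """Hypothetical sketch of the check described above.

    `data` maps column names to lists of values; the real check runs
    on a DataFrame inside IDColumnsDataCheck.
    """
    first = col_names[0]
    # Characteristic 1: named 'ID' or name ends with '_id'
    name_matches = first == "ID" or first.lower().endswith("_id")
    # Characteristic 2: all values are unique
    values = data[first]
    all_unique = len(set(values)) == len(values)
    return name_matches and all_unique
```

For example, a first column named "ID" with values ["a", "b", "c", "d"] passes both checks, while the same column with a repeated value does not.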
In this case, the column was named 'ID' and held the values ["a", "b", "c", "d"], which are all unique.
Right, I think the point of confusion comes in the metadata. When we look at the action code SET_FIRST_COL_ID and then at the column metadata ['ID', 'col_2', ...], how do we know which of these is the column to set as the primary key? It would be clearer if the metadata had only one value when we need to set the column, so that we don't have to worry about confusion in later steps. The additional columns should be listed somewhere else.
Oh okay, in the case where the data check action code is SET_FIRST_COL_ID, the first column name in the metadata is always the column to set as the primary key and the other columns are the ones to drop. Would it be a good idea for me to change the metadata structure to something like { "primary_key": "ID", "drop": ['col_2', 'col_3_id'] }?
Yea, this seems clearer to me! It makes more sense and will be much easier for developers down the line to understand
Okay done!
Ah, rather than using drop for the other columns, can you just continue to list them under the columns value? I think that'd be fine rather than introducing a new value to check.
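With the shape settled on in this thread (the primary key under its own key, the remaining columns still under columns), a consumer of the action option could be sketched like this (hypothetical helper name, not the actual EvalML code):

```python
def interpret_set_first_col_id(metadata):
    # "primary_key" names the column to mark as the primary key;
    # "columns" continues to hold the other ID-like columns to drop
    primary = metadata["primary_key"]
    to_drop = metadata["columns"]
    return primary, to_drop

primary, to_drop = interpret_set_first_col_id(
    {"primary_key": "ID", "columns": ["col_2", "col_3_id"]}
)
```

This way no consumer ever has to guess which entry of a single list is the primary key.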
Left one non-blocking comment for discussion about how we do the data check actions, but the tests and implementation look good to me!
Would probably be good to update the docs to mention this additional capability of the ID column check!
        data_check_name=id_data_check_name,
        metadata={
            "primary_key": "col_1_id",
            "columns": ["col_2", "col_3_id"],
Curious what the others think about this. I think it'd be ideal if the SET_FIRST_COL_ID data check action were associated only with the column that we want to set as the primary key, and the DROP_COL data check action with the other columns, rather than associating those columns with the SET_FIRST_COL_ID action. I don't think this is blocking, but it might be something to file as a future fix if we don't get to it now.
Great work on changing this! Left some nits and suggestions, but once those are addressed, lmk and I'll approve!
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Primary key columns however, can be useful. Though they are ignored from the modeling process, they can be used as an identifier to query on before or after the modeling process. `IDColumnsDataCheck` will also remind you if a primary key exists. In the given example, 'user_id' is identified as a primary key, while 'revenue_id' was identified as a regular unique identifier. " |
Can we put something in here about how the primary key is only recommended if the first column is identified as an ID column?
@@ -858,8 +891,13 @@
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.8.6"
},
"vscode": {
Remove this
Thanks for making the changes!
    col_names
    and col_names[0] in id_cols_above_threshold
    and id_cols_above_threshold[col_names[0]] == 1.0
):
    first_col_id = True
Would it make sense to just do the warning append in this conditional block, to keep all that logic together?
Yeah you're right, made a change to remove the redundant logic
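The merged logic might look roughly like this (a sketch using the names `col_names` and `id_cols_above_threshold` from the diff above; the warning text follows the wording proposed later in this thread, so this is illustrative rather than the exact implementation):

```python
def flag_first_col_primary_key(col_names, id_cols_above_threshold):
    """Sketch: do the warning append inside the same conditional."""
    warnings = []
    if (
        col_names
        and col_names[0] in id_cols_above_threshold
        and id_cols_above_threshold[col_names[0]] == 1.0
    ):
        # First column is a fully unique ID column: treat it as the
        # likely primary key rather than a plain ID column to drop.
        del id_cols_above_threshold[col_names[0]]
        warnings.append(
            "The first column '{}' is likely to be the primary key".format(
                col_names[0]
            )
        )
    return warnings
```

Keeping the append next to the `del` means the "promote to primary key" path lives in exactly one place.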
@@ -89,66 +89,81 @@ def validate(self, X, y=None):
    ... }
    ... ]

If the first column of the dataframe is identified as an ID column it is most likely the primary key.
Despite being all unique, "Country_Rank" will not be identified as an ID column as id_threshold is set to 1.0
Thanks for clearing this up, this text is much more helpful.
NAB, but is this confusing? If the threshold is set to 1, I would sort of expect the column to be identified as an ID whether or not that was included in the column header. Might be worth discussing later.
Should we reverse the threshold measurement, such that the lower it is, the more selective the ID column identification is?
This is looking good to me, but let's make sure to get one review from the EvalML folks before merging.
I just left a couple small nitpicks, but otherwise looks good!
We're currently in the process of generating a release, so if you could hold off on merging this until after the release has gone out, that would be appreciated. It should be all set sometime today or early next week.
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Primary key columns however, can be useful. Though they are ignored from the modeling process, they can be used as an identifier to query on before or after the modeling process. `IDColumnsDataCheck` will also remind you if it finds that the first column of the DataFrame is a primary key. In the given example, 'user_id' is identified as a primary key, while 'revenue_id' was identified as a regular unique identifier. " |
Can you add a bit more clarification for users here about what a primary key column is? There's useful information in here but having a clearer definition will make this more understandable!
    ... ]

Despite being all unique, "Country_Rank" will not be identified as an ID column as id_threshold is set to 1.0
by default and its name doesn't indicate that it's an ID.
If the first column of the dataframe is 100% likely to be an ID column, it is probably the primary key.
Nit: "is 100% likely" doesn't make much sense. Can you rephrase this to something along the lines of "has all unique values" instead?
    and id_cols_above_threshold[col_names[0]] == 1.0
):
    del id_cols_above_threshold[col_names[0]]
    warning_msg = "The first column '{}' has a high likelihood of being the primary key"
Nit: can we rephrase this to something like "The first column '{}' is likely to be the primary key"?
@@ -168,6 +141,194 @@ def test_id_columns_strings():
    ).to_dict(),
]


def test_id_cols_data_check_input_formats():
This test has a lot of repeated code. You can clean this up by parameterizing over the data type and setting up the input data that way - see here for an example
Agreed with @eccabay, let's please use @pytest.mark.parametrize to vastly reduce the repeated code in this PR. Feel free to reach out to me if you need help with this! Would be happy to show you how.
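A hedged sketch of what that parametrization could look like (the `make_input` builder, the input-type names, and the assertion body are illustrative stand-ins, not the real test):

```python
import pytest


def make_input(input_type):
    # Illustrative builder: the same data returned in different formats
    data = {"ID": [0, 1, 2, 3], "col_2": [2, 3, 4, 5]}
    if input_type == "dict":
        return data
    if input_type == "list_of_rows":
        return [list(row) for row in zip(*data.values())]
    raise ValueError(input_type)


@pytest.mark.parametrize("input_type", ["dict", "list_of_rows"])
def test_id_cols_data_check_input_formats(input_type):
    X = make_input(input_type)
    # Stand-in assertion: the real test would validate the data check's
    # messages and action options for each input format
    assert X is not None
```

Each input format then becomes one parametrized case instead of one copy-pasted test body.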
@@ -89,6 +89,48 @@ def graphviz():
    return graphviz


@pytest.fixture
def get_test_data_with_or_without_primary_key():
    def _get_test_data_with_primary_key(input_type, has_primary_key):
Clever, haven't seen a nested fixture like this. Very cool.
Pull Request Description
IDColumnsDataCheck now only returns an action code to set the first column as the primary key if it contains unique values.