Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uniqueness of first column is now checked before flagging it as a primary key #3639

Merged
merged 27 commits into from
Aug 18, 2022

Conversation

simha104
Copy link
Contributor

@simha104 simha104 commented Aug 4, 2022

Pull Request Description

IDColumnsDataCheck now only returns an action code to set the first column as the primary key if it contains unique values

@codecov
Copy link

codecov bot commented Aug 4, 2022

Codecov Report

Merging #3639 (88a39de) into main (39e5102) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@           Coverage Diff           @@
##            main   #3639     +/-   ##
=======================================
+ Coverage   99.7%   99.7%   +0.1%     
=======================================
  Files        337     337             
  Lines      34004   34021     +17     
=======================================
+ Hits       33873   33890     +17     
  Misses       131     131             
Impacted Files Coverage Δ
evalml/data_checks/id_columns_data_check.py 100.0% <100.0%> (ø)
evalml/tests/conftest.py 98.0% <100.0%> (+0.1%) ⬆️
...ts/data_checks_tests/test_id_columns_data_check.py 100.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@simha104 simha104 marked this pull request as ready for review August 4, 2022 18:06
action_options=[
DataCheckActionOption(
DataCheckActionCode.SET_FIRST_COL_ID,
data_check_name=id_data_check_name,
metadata={"columns": [0, 1]},
metadata={"columns": ["ID", "col_2", "col_3_id"]},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this working exactly? How can we tell which columns the recommendation is to drop and which columns it recommends to mark primary?

Copy link
Contributor Author

@simha104 simha104 Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is checking if the first column satisfies these two characteristics:

  • It is either named 'ID' or its name ends with '_id'
  • All of its values are unique

If those characteristics are met, it suggests to mark as primary key. Otherwise, it suggests to drop columns

Copy link
Contributor Author

@simha104 simha104 Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case , the column's name was 'ID' and had the values ["a", "b", "c", "d"] which are all unique

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think the point of confusion comes in the metadata. When we look at the action code SET_FIRST_COL_ID and then look at the column metadata ['ID', 'col_2', ...], how do we know which of these is the col_id to set as the primary key? It would be clearer if the metadata had one value only when we need set the column so that we don't have to worry about confusion on later steps. The additional columns should be listed somewhere else.

Copy link
Contributor Author

@simha104 simha104 Aug 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh okay in the case where the data check action code is SET_FIRST_COL_ID , the first column name in the metadata is always the column to set as the primary key and the other columns are the ones to drop. Would it be a good idea for me to change the metadata structure to something like
{ "primary_key": "ID", "drop": ['col_2', 'col_3_id'] } ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this seems clearer to me! It makes more sense and will be much easier for developers down the line to understand

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay done!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, rather than using drop for the other columns, can you just continue to list them under columns value? I think that'd be fine rather than introducing a new value to check.

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one non-blocking comment for discussion about how we do the data check actions, but the tests and implementation look good to me!

Would probably be good to update the docs to mention this additional capability of the ID column check!

data_check_name=id_data_check_name,
metadata={
"primary_key": "col_1_id",
"columns": ["col_2", "col_3_id"],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious what the others think about this. I think it'd be ideal if the SET_FIRST_COL_ID data check action is only associated with the column that we want to set as the primary key, and DROP_COL data check action to be associated with the other columns, rather than associating those columns with the SET_FIRST_COL_ID action. I don't think this is blocking, but might be something to file for future fix if we don't get to it now.

@simha104 simha104 requested a review from bchen1116 August 8, 2022 16:37
Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on changing this! Left some nits and suggestions, but once those are addressed, lmk and I'll approve!

"cell_type": "markdown",
"metadata": {},
"source": [
"Primary key columns however, can be useful. Though they are ignored from the modeling process, they can be used as an identifier to query on before or after the modeling process. `IDColumnsDataCheck` will also remind you if a primary key exists. In the given example, 'user_id' is identified as a primary key, while 'revenue_id' was identified as a regular unique identifier. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put something in here about how the primary key must is only recommended if first column is identified as ID?

@@ -858,8 +891,13 @@
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.6"
},
"vscode": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this

@simha104 simha104 requested a review from bchen1116 August 8, 2022 20:31
Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes!

@simha104 simha104 requested review from chukarsten and fjlanasa and removed request for ParthivNaresh August 8, 2022 20:54
col_names
and col_names[0] in id_cols_above_threshold
and id_cols_above_threshold[col_names[0]] == 1.0
):
first_col_id = True
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to just do the warning append in this conditional block to keep all that logic together?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you're right, made a change to remove the redundant logic

@@ -89,66 +89,81 @@ def validate(self, X, y=None):
... }
... ]

If the first column of the dataframe is identified as an ID column it is most likely the primary key.
Despite being all unique, "Country_Rank" will not be identified as an ID column as id_threshold is set to 1.0
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for clearing this up, this text is much more helpful.

NAB but is this confusing? If the threshold is set to 1, I would sort of expect it to be identified as ID whether or not that was including in the column header. Might be worth discussing later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we reverse the threshold measurement such that the lower it is, the more selective ID column identification is?

Copy link

@Cmancuso Cmancuso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me, but lets make sure to get 1 review from the EvalML folks before merging.

Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just left a couple small nitpicks, but otherwise looks good!

We're currently in the process of generating a release, so if you could hold off on merging this until after the release has gone out, that would be appreciated. It should be all set sometime today or early next week.

"cell_type": "markdown",
"metadata": {},
"source": [
"Primary key columns however, can be useful. Though they are ignored from the modeling process, they can be used as an identifier to query on before or after the modeling process. `IDColumnsDataCheck` will also remind you if it finds that the first column of the DataFrame is a primary key. In the given example, 'user_id' is identified as a primary key, while 'revenue_id' was identified as a regular unique identifier. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a bit more clarification for users here about what a primary key column is? There's useful information in here but having a clearer definition will make this more understandable!

... ]

Despite being all unique, "Country_Rank" will not be identified as an ID column as id_threshold is set to 1.0
by default and its name doesn't indicate that it's an ID.
If the first column of the dataframe is 100% likely to be an ID column, it is probably the primary key.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "is 100% likely" doesn't make much sense. Can you rephrase this to something along the lines of "has all unique values" instead?

and id_cols_above_threshold[col_names[0]] == 1.0
):
del id_cols_above_threshold[col_names[0]]
warning_msg = "The first column '{}' has a high likelihood of being the primary key"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: can we rephrase this to something like "The first column '{}' is likely to be the primary key"?

@@ -168,6 +141,194 @@ def test_id_columns_strings():
).to_dict(),
]


def test_id_cols_data_check_input_formats():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test has a lot of repeated code. You can clean this up by parameterizing over the data type and setting up the input data that way - see here for an example

Copy link
Contributor

@chukarsten chukarsten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed with @eccabay , let's please use @pytest.mark.parametrize to vastly reduced repeated code in this PR. Feel free to reach out to me if you need help with this! Would be happy to show you how.

@@ -89,6 +89,48 @@ def graphviz():
return graphviz


@pytest.fixture
def get_test_data_with_or_without_primary_key():
def _get_test_data_with_primary_key(input_type, has_primary_key):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clever, haven't seen a nested fixture like this. Very cool.

@chukarsten chukarsten enabled auto-merge (squash) August 18, 2022 21:35
@simha104 simha104 disabled auto-merge August 18, 2022 21:37
@simha104 simha104 enabled auto-merge (squash) August 18, 2022 21:40
@simha104 simha104 merged commit 85472b6 into main Aug 18, 2022
@simha104 simha104 deleted the TML-2022-backend-1st-column-id-mark-as-primary-key branch August 18, 2022 22:06
@chukarsten chukarsten mentioned this pull request Aug 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants