Uniqueness of first column is now checked before flagging it as a primary key #3639
Conversation
Codecov Report
@@           Coverage Diff           @@
##            main    #3639    +/-  ##
=======================================
+ Coverage    99.7%    99.7%    +0.1%
=======================================
  Files         337      337
  Lines       34004    34021      +17
=======================================
+ Hits        33873    33890      +17
  Misses        131      131
action_options=[
    DataCheckActionOption(
        DataCheckActionCode.SET_FIRST_COL_ID,
        data_check_name=id_data_check_name,
-       metadata={"columns": [0, 1]},
+       metadata={"columns": ["ID", "col_2", "col_3_id"]},
How is this working exactly? How can we tell which columns the recommendation says to drop and which it recommends marking as the primary key?
It is checking whether the first column satisfies these two characteristics:
- It is either named 'ID' or its name ends with '_id'
- All of its values are unique

If both characteristics are met, it suggests marking the column as the primary key. Otherwise, it suggests dropping the columns.
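A minimal sketch of that two-part check (hypothetical helper name; plain Python dict-of-lists stands in for the real DataFrame input):

```python
def first_col_is_likely_primary_key(col_names, data):
    """Hypothetical sketch of the check described above.

    `data` maps column names to lists of values; the real check runs
    on a DataFrame inside IDColumnsDataCheck.
    """
    first = col_names[0]
    # Characteristic 1: named 'ID' or name ends with '_id'
    name_matches = first == "ID" or first.lower().endswith("_id")
    # Characteristic 2: all values are unique
    values = data[first]
    all_unique = len(set(values)) == len(values)
    return name_matches and all_unique
```

For example, a first column named "ID" with values ["a", "b", "c", "d"] passes both checks, while the same column with a repeated value does not.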
In this case, the column was named 'ID' and held the values ["a", "b", "c", "d"], which are all unique.
Right, I think the point of confusion comes in the metadata. When we look at the action code SET_FIRST_COL_ID and then at the column metadata ['ID', 'col_2', ...], how do we know which of these is the column to set as the primary key? It would be clearer if the metadata had only one value when we need to set the column, so that we don't have to worry about confusion in later steps. The additional columns should be listed somewhere else.
Oh okay, in the case where the data check action code is SET_FIRST_COL_ID, the first column name in the metadata is always the column to set as the primary key and the other columns are the ones to drop. Would it be a good idea for me to change the metadata structure to something like { "primary_key": "ID", "drop": ['col_2', 'col_3_id'] }?
Yea, this seems clearer to me! It makes more sense and will be much easier for developers down the line to understand
Okay done!
Ah, rather than using drop for the other columns, can you just continue to list them under the columns value? I think that'd be fine rather than introducing a new value to check.
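With the shape settled on in this thread (the primary key under its own key, the remaining columns still under columns), a consumer of the action option could be sketched like this (hypothetical helper name, not the actual EvalML code):

```python
def interpret_set_first_col_id(metadata):
    # "primary_key" names the column to mark as the primary key;
    # "columns" continues to hold the other ID-like columns to drop
    primary = metadata["primary_key"]
    to_drop = metadata["columns"]
    return primary, to_drop

primary, to_drop = interpret_set_first_col_id(
    {"primary_key": "ID", "columns": ["col_2", "col_3_id"]}
)
```

This way no consumer ever has to guess which entry of a single list is the primary key.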
Left one non-blocking comment for discussion about how we do the data check actions, but the tests and implementation look good to me!
Would probably be good to update the docs to mention this additional capability of the ID column check!
        data_check_name=id_data_check_name,
        metadata={
            "primary_key": "col_1_id",
            "columns": ["col_2", "col_3_id"],
Curious what the others think about this. I think it'd be ideal if the SET_FIRST_COL_ID data check action were associated only with the column that we want to set as the primary key, and the DROP_COL data check action with the other columns, rather than associating those columns with the SET_FIRST_COL_ID action. I don't think this is blocking, but it might be something to file as a future fix if we don't get to it now.
Great work on changing this! Left some nits and suggestions, but once those are addressed, lmk and I'll approve!
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Primary key columns however, can be useful. Though they are ignored from the modeling process, they can be used as an identifier to query on before or after the modeling process. `IDColumnsDataCheck` will also remind you if a primary key exists. In the given example, 'user_id' is identified as a primary key, while 'revenue_id' was identified as a regular unique identifier. " |
Can we put something in here about how the primary key is only recommended if the first column is identified as an ID column?
@@ -858,8 +891,13 @@
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
    "version": "3.8.6"
},
"vscode": {
Remove this
Thanks for making the changes!
    col_names
    and col_names[0] in id_cols_above_threshold
    and id_cols_above_threshold[col_names[0]] == 1.0
):
    first_col_id = True
Would it make sense to just do the warning append in this conditional block, to keep all that logic together?
Yeah you're right, made a change to remove the redundant logic
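The merged logic might look roughly like this (a sketch using the names `col_names` and `id_cols_above_threshold` from the diff above; the warning text follows the wording proposed later in this thread, so this is illustrative rather than the exact implementation):

```python
def flag_first_col_primary_key(col_names, id_cols_above_threshold):
    """Sketch: do the warning append inside the same conditional."""
    warnings = []
    if (
        col_names
        and col_names[0] in id_cols_above_threshold
        and id_cols_above_threshold[col_names[0]] == 1.0
    ):
        # First column is a fully unique ID column: treat it as the
        # likely primary key rather than a plain ID column to drop.
        del id_cols_above_threshold[col_names[0]]
        warnings.append(
            "The first column '{}' is likely to be the primary key".format(
                col_names[0]
            )
        )
    return warnings
```

Keeping the append next to the `del` means the "promote to primary key" path lives in exactly one place.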
@@ -89,66 +89,81 @@ def validate(self, X, y=None):
    ... }
    ... ]

If the first column of the dataframe is identified as an ID column it is most likely the primary key.
Despite being all unique, "Country_Rank" will not be identified as an ID column as id_threshold is set to 1.0
Thanks for clearing this up, this text is much more helpful.
NAB, but is this confusing? If the threshold is set to 1, I would sort of expect the column to be identified as an ID whether or not that was included in the column header. Might be worth discussing later.
Should we reverse the threshold measurement, such that the lower it is, the more selective the ID column identification is?
This is looking good to me, but let's make sure to get one review from the EvalML folks before merging.
I just left a couple small nitpicks, but otherwise looks good!
We're currently in the process of generating a release, so if you could hold off on merging this until after the release has gone out, that would be appreciated. It should be all set sometime today or early next week.
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Primary key columns however, can be useful. Though they are ignored from the modeling process, they can be used as an identifier to query on before or after the modeling process. `IDColumnsDataCheck` will also remind you if it finds that the first column of the DataFrame is a primary key. In the given example, 'user_id' is identified as a primary key, while 'revenue_id' was identified as a regular unique identifier. " |
Can you add a bit more clarification for users here about what a primary key column is? There's useful information in here but having a clearer definition will make this more understandable!
    ... ]

Despite being all unique, "Country_Rank" will not be identified as an ID column as id_threshold is set to 1.0
by default and its name doesn't indicate that it's an ID.
If the first column of the dataframe is 100% likely to be an ID column, it is probably the primary key.
Nit: "is 100% likely" doesn't make much sense. Can you rephrase this to something along the lines of "has all unique values" instead?
    and id_cols_above_threshold[col_names[0]] == 1.0
):
    del id_cols_above_threshold[col_names[0]]
    warning_msg = "The first column '{}' has a high likelihood of being the primary key"
Nit: can we rephrase this to something like "The first column '{}' is likely to be the primary key"?
@@ -168,6 +141,194 @@ def test_id_columns_strings():
    ).to_dict(),
]


def test_id_cols_data_check_input_formats():
This test has a lot of repeated code. You can clean this up by parameterizing over the data type and setting up the input data that way - see here for an example
Agreed with @eccabay, let's please use @pytest.mark.parametrize to vastly reduce the repeated code in this PR. Feel free to reach out to me if you need help with this! Would be happy to show you how.
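A hedged sketch of what that parametrization could look like (the `make_input` builder, the input-type names, and the assertion body are illustrative stand-ins, not the real test):

```python
import pytest


def make_input(input_type):
    # Illustrative builder: the same data returned in different formats
    data = {"ID": [0, 1, 2, 3], "col_2": [2, 3, 4, 5]}
    if input_type == "dict":
        return data
    if input_type == "list_of_rows":
        return [list(row) for row in zip(*data.values())]
    raise ValueError(input_type)


@pytest.mark.parametrize("input_type", ["dict", "list_of_rows"])
def test_id_cols_data_check_input_formats(input_type):
    X = make_input(input_type)
    # Stand-in assertion: the real test would validate the data check's
    # messages and action options for each input format
    assert X is not None
```

Each input format then becomes one parametrized case instead of one copy-pasted test body.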
@@ -89,6 +89,48 @@ def graphviz():
    return graphviz


@pytest.fixture
def get_test_data_with_or_without_primary_key():
    def _get_test_data_with_primary_key(input_type, has_primary_key):
Clever, haven't seen a nested fixture like this. Very cool.
Pull Request Description
IDColumnsDataCheck now only returns an action code to set the first column as the primary key if it contains unique values.