
Commit

Uniqueness of first column is now checked before flagging it as a primary key (#3639)

* Uniqueness of first column is now checked before flagging it as a primary key

Co-authored-by: chukarsten <64713315+chukarsten@users.noreply.github.com>
simha104 and chukarsten committed Aug 18, 2022
1 parent 39e5102 commit 85472b6
Showing 5 changed files with 282 additions and 142 deletions.
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -3,6 +3,7 @@ Release Notes
**Latest Release**
* Enhancements
* Fixes
* ``IDColumnsDataCheck`` now only returns an action code to set the first column as the primary key if it contains unique values :pr:`3639`
* Changes
* Documentation Changes
* Testing Changes
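To see the change end to end, here is a minimal sketch against the documented `IDColumnsDataCheck` API; the frame and its column names (`order_id`, `amount`) are invented for illustration, and the expected codes follow the docstring examples further down this diff.

import pandas as pd
from evalml.data_checks import IDColumnsDataCheck

check = IDColumnsDataCheck(id_threshold=0.9)

# All-unique, ID-named first column: flagged as the probable primary key,
# with a SET_FIRST_COL_ID action option.
X_unique = pd.DataFrame({"order_id": [0, 1, 2, 3], "amount": [10, 42, 42, 51]})
for message in check.validate(X_unique):
    print(message["code"], message["action_options"][0]["code"])

# The same column with a duplicate value still looks ID-like at this
# threshold, but it is no longer promoted to primary key; the suggested
# action falls back to dropping the column.
X_dupes = pd.DataFrame({"order_id": [0, 0, 1, 2], "amount": [10, 42, 42, 51]})
for message in check.validate(X_dupes):
    print(message["code"], message["action_options"][0]["code"])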
37 changes: 35 additions & 2 deletions docs/source/user_guide/data_checks.ipynb
@@ -248,7 +248,7 @@
"source": [
"### ID Columns\n",
"\n",
"ID columns in your dataset provide little to no benefit to a machine learning pipeline as the pipeline cannot extrapolate useful information from unique identifiers. Thus, `IDColumnsDataCheck` reminds you if these columns exists. In the given example, 'user_number' and 'id' columns are both identified as potentially being unique identifiers that should be removed."
"ID columns in your dataset provide little to no benefit to a machine learning pipeline as the pipeline cannot extrapolate useful information from unique identifiers. Thus, `IDColumnsDataCheck` reminds you if these columns exists. In the given example, 'user_number' and 'revenue_id' columns are both identified as potentially being unique identifiers that should be removed."
]
},
{
@@ -261,7 +261,40 @@
"\n",
"X = pd.DataFrame(\n",
" [[0, 53, 6325, 5], [1, 90, 6325, 10], [2, 90, 18, 20]],\n",
" columns=[\"user_number\", \"cost\", \"revenue\", \"id\"],\n",
" columns=[\"user_number\", \"cost\", \"revenue\", \"revenue_id\"],\n",
")\n",
"\n",
"id_col_check = IDColumnsDataCheck(id_threshold=0.9)\n",
"messages = id_col_check.validate(X)\n",
"\n",
"errors = [message for message in messages if message[\"level\"] == \"error\"]\n",
"warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n",
"\n",
"for warning in warnings:\n",
" print(\"Warning:\", warning[\"message\"])\n",
"\n",
"for error in errors:\n",
" print(\"Error:\", error[\"message\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Primary key columns however, can be useful. Primary key columns are typically the first column in the dataset, have all unique values, and are either named `ID` or a name that ends with `_id`. Though they are ignored from the modeling process, they can be used as an identifier to query on before or after the modeling process. `IDColumnsDataCheck` will also remind you if it finds that the first column of the DataFrame is a primary key. In the given example, `user_id` is identified as a primary key, while `revenue_id` was identified as a regular unique identifier. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from evalml.data_checks import IDColumnsDataCheck\n",
"\n",
"X = pd.DataFrame(\n",
" [[0, 53, 6325, 5], [1, 90, 6325, 10], [2, 90, 18, 20]],\n",
" columns=[\"user_id\", \"cost\", \"revenue\", \"revenue_id\"],\n",
")\n",
"\n",
"id_col_check = IDColumnsDataCheck(id_threshold=0.9)\n",
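The diff omits the rest of this cell. Based on the first example above, it presumably continues with the same validate-and-print pattern; the following is a sketch of that continuation, not the verbatim notebook source.

messages = id_col_check.validate(X)

errors = [message for message in messages if message["level"] == "error"]
warnings = [message for message in messages if message["level"] == "warning"]

for warning in warnings:
    print("Warning:", warning["message"])

for error in errors:
    print("Error:", error["message"])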
142 changes: 78 additions & 64 deletions evalml/data_checks/id_columns_data_check.py
@@ -89,66 +89,81 @@ def validate(self, X, y=None):
... }
... ]
If the first column of the dataframe is identified as an ID column, it is most likely the primary key.
Despite being all unique, "Country_Rank" will not be identified as an ID column because id_threshold is set to 1.0
by default and its name doesn't indicate that it's an ID.
>>> df = pd.DataFrame({
... "sales_id": [0, 1, 2, 3, 4],
... "customer_id": [123, 124, 125, 126, 127],
... "Sales": [10, 42, 31, 51, 61]
... "humidity": ["high", "very high", "low", "low", "high"],
... "Country_Rank": [1, 2, 3, 4, 5],
... "Sales": ["very high", "high", "high", "medium", "very low"]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == []
However, lowering the threshold will cause this column to be identified as an ID.
>>> id_col_check = IDColumnsDataCheck()
>>> id_col_check = IDColumnsDataCheck(id_threshold=0.95)
>>> assert id_col_check.validate(df) == [
... {
... "message": "The first column 'sales_id' is most likely to be the ID column. Columns 'customer_id' are also 100.0% or more likely to be an ID column",
... "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
... "data_check_name": "IDColumnsDataCheck",
... "level": "warning",
... "code": "HAS_ID_FIRST_COLUMN",
... "details": {'columns': ["sales_id", "customer_id"], 'rows': None},
... "details": {"columns": ["Country_Rank"], "rows": None},
... "code": "HAS_ID_COLUMN",
... "action_options": [
... {
... "code": "SET_FIRST_COL_ID",
... "code": "DROP_COL",
... "data_check_name": "IDColumnsDataCheck",
... "parameters": {},
... "metadata": {'columns': ["sales_id", "customer_id"], 'rows': None}
... "metadata": {"columns": ["Country_Rank"], "rows": None}
... }
... ]
... }
... }
... ]
Despite being all unique, "Country_Rank" will not be identified as an ID column because id_threshold is set to 1.0
by default and its name doesn't indicate that it's an ID.
If the first column of the dataframe has all unique values and is either named 'ID' or has a name that ends with '_id', it is probably the primary key.
The other ID columns should be dropped.
>>> df = pd.DataFrame({
... "humidity": ["high", "very high", "low", "low", "high"],
... "Country_Rank": [1, 2, 3, 4, 5],
... "Sales": ["very high", "high", "high", "medium", "very low"]
... "sales_id": [0, 1, 2, 3, 4],
... "customer_id": [123, 124, 125, 126, 127],
... "Sales": [10, 42, 31, 51, 61]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == []
However, lowering the threshold will cause this column to be identified as an ID.
>>> id_col_check = IDColumnsDataCheck()
>>> id_col_check = IDColumnsDataCheck(id_threshold=0.95)
>>> assert id_col_check.validate(df) == [
... {
... "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
... "message": "The first column 'sales_id' is likely to be the primary key",
... "data_check_name": "IDColumnsDataCheck",
... "level": "warning",
... "code": "HAS_ID_FIRST_COLUMN",
... "details": {"columns": "sales_id", "rows": None},
... "action_options": [
... {
... "code": "SET_FIRST_COL_ID",
... "data_check_name": "IDColumnsDataCheck",
... "parameters": {},
... "metadata": {"columns": "sales_id", "rows": None}
... }
... ]
... },
... {
... "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column",
... "data_check_name": "IDColumnsDataCheck",
... "level": "warning",
... "details": {"columns": ["Country_Rank"], "rows": None},
... "code": "HAS_ID_COLUMN",
... "details": {"columns": ["customer_id"], "rows": None},
... "action_options": [
... {
... "code": "DROP_COL",
... "data_check_name": "IDColumnsDataCheck",
... "parameters": {},
... "metadata": {"columns": ["Country_Rank"], "rows": None}
... "metadata": {"columns": ["customer_id"], "rows": None}
... }
... ]
... }
... }
... ]
"""
messages = []
@@ -187,54 +202,53 @@ def validate(self, X, y=None):
key: value for key, value in id_cols.items() if value >= self.id_threshold
}

first_col_id = False

if col_names and col_names[0] in id_cols_above_threshold:
first_col_id = True
del id_cols_above_threshold[col_names[0]]

if id_cols_above_threshold:
warning_msg = ""
message_code = None
action_code = None
if first_col_id:
warning_msg = "The first column '{}' is most likely to be the ID column. Columns {} are also {}% or more likely to be an ID column"
if (
col_names[0] in id_cols_above_threshold
and id_cols_above_threshold[col_names[0]] == 1.0
):
del id_cols_above_threshold[col_names[0]]
warning_msg = "The first column '{}' is likely to be the primary key"
warning_msg = warning_msg.format(
col_names[0],
(", ").join(
["'{}'".format(str(col)) for col in id_cols_above_threshold],
),
self.id_threshold * 100,
)
message_code = DataCheckMessageCode.HAS_ID_FIRST_COLUMN
action_code = DataCheckActionCode.SET_FIRST_COL_ID
details = {"columns": [col_names[0]] + (list(id_cols_above_threshold))}
else:
messages.append(
DataCheckWarning(
message=warning_msg,
data_check_name=self.name,
message_code=DataCheckMessageCode.HAS_ID_FIRST_COLUMN,
details={"columns": col_names[0]},
action_options=[
DataCheckActionOption(
DataCheckActionCode.SET_FIRST_COL_ID,
data_check_name=self.name,
metadata={"columns": col_names[0]},
),
],
).to_dict(),
)
if id_cols_above_threshold:
warning_msg = "Columns {} are {}% or more likely to be an ID column"
warning_msg = warning_msg.format(
(", ").join(
["'{}'".format(str(col)) for col in id_cols_above_threshold],
),
self.id_threshold * 100,
)
message_code = DataCheckMessageCode.HAS_ID_COLUMN
action_code = DataCheckActionCode.DROP_COL
details = {"columns": list(id_cols_above_threshold)}

messages.append(
DataCheckWarning(
message=warning_msg,
data_check_name=self.name,
message_code=message_code,
details=details,
action_options=[
DataCheckActionOption(
action_code,
data_check_name=self.name,
metadata=details,
),
],
).to_dict(),
)
messages.append(
DataCheckWarning(
message=warning_msg,
data_check_name=self.name,
message_code=DataCheckMessageCode.HAS_ID_COLUMN,
details={"columns": list(id_cols_above_threshold)},
action_options=[
DataCheckActionOption(
DataCheckActionCode.DROP_COL,
data_check_name=self.name,
metadata={"columns": list(id_cols_above_threshold)},
),
],
).to_dict(),
)

return messages
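The new guard promotes the first column only when its recorded likelihood is exactly 1.0, which the check assigns only to an ID-named column whose values are all unique. Conceptually, the condition reduces to the standalone sketch below (data invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "sales_id": [0, 1, 2, 3, 4],
    "customer_id": [123, 124, 125, 126, 127],
})

first_col = df.columns[0]
# Promote the first column to primary key only when it is ID-named and
# every value in it is unique.
looks_like_id = str(first_col).lower() == "id" or str(first_col).lower().endswith("_id")
print(looks_like_id and df[first_col].is_unique)  # True for this frame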
42 changes: 42 additions & 0 deletions evalml/tests/conftest.py
@@ -89,6 +89,48 @@ def graphviz():
return graphviz


@pytest.fixture
def get_test_data_with_or_without_primary_key():
def _get_test_data_with_primary_key(input_type, has_primary_key):
X = None
if input_type == "integer":
X_dict = {
"col_1_id": [0, 1, 2, 3],
"col_2": [2, 3, 4, 5],
"col_3_id": [1, 1, 2, 3],
"col_5": [0, 0, 1, 2],
}
if not has_primary_key:
X_dict["col_1_id"] = [1, 1, 2, 3]
X = pd.DataFrame.from_dict(X_dict)

elif input_type == "string":
X_dict = {
"col_1_id": ["a", "b", "c", "d"],
"col_2": ["w", "x", "y", "z"],
"col_3_id": [
"123456789012345",
"234567890123456",
"3456789012345678",
"45678901234567",
],
"col_5": ["0", "0", "1", "2"],
}
if not has_primary_key:
X_dict["col_1_id"] = ["b", "b", "c", "d"]
X = pd.DataFrame.from_dict(X_dict)
X.ww.init(
logical_types={
"col_1_id": "categorical",
"col_2": "categorical",
"col_5": "categorical",
},
)
return X

return _get_test_data_with_primary_key


@pytest.fixture
def get_test_data_from_configuration():
def _get_test_data_from_configuration(
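A hypothetical test built on this fixture could exercise the new uniqueness guard directly; the test name and assertions below are illustrative and not part of this commit.

from evalml.data_checks import IDColumnsDataCheck


def test_first_column_flagged_only_when_unique(get_test_data_with_or_without_primary_key):
    # Unique first column: the check should emit the primary key warning.
    X = get_test_data_with_or_without_primary_key("integer", has_primary_key=True)
    codes = [message["code"] for message in IDColumnsDataCheck().validate(X)]
    assert "HAS_ID_FIRST_COLUMN" in codes

    # Duplicated first column: the primary key warning should not appear.
    X = get_test_data_with_or_without_primary_key("integer", has_primary_key=False)
    codes = [message["code"] for message in IDColumnsDataCheck().validate(X)]
    assert "HAS_ID_FIRST_COLUMN" not in codes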

