
Commit

Uniqueness of first column is now checked before flagging it as a primary key (#3639)

* Uniqueness of first column is now checked before flagging it as a primary key

Co-authored-by: chukarsten <64713315+chukarsten@users.noreply.github.com>
simha104 and chukarsten committed Aug 18, 2022
1 parent 39e5102 commit 85472b6
Showing 5 changed files with 282 additions and 142 deletions.
1 change: 1 addition & 0 deletions docs/source/release_notes.rst
@@ -3,6 +3,7 @@ Release Notes
**Latest Release**
* Enhancements
* Fixes
* ``IDColumnsDataCheck`` now only returns an action code to set the first column as the primary key if it contains unique values :pr:`3639`
* Changes
* Documentation Changes
* Testing Changes
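To see the change end to end, here is a minimal sketch against the documented `IDColumnsDataCheck` API; the frame and its column names (`order_id`, `amount`) are invented for illustration, and the expected codes follow the docstring examples further down this diff.

import pandas as pd
from evalml.data_checks import IDColumnsDataCheck

check = IDColumnsDataCheck(id_threshold=0.9)

# All-unique, ID-named first column: flagged as the probable primary key,
# with a SET_FIRST_COL_ID action option.
X_unique = pd.DataFrame({"order_id": [0, 1, 2, 3], "amount": [10, 42, 42, 51]})
for message in check.validate(X_unique):
    print(message["code"], message["action_options"][0]["code"])

# The same column with a duplicate value still looks ID-like at this
# threshold, but it is no longer promoted to primary key; the suggested
# action falls back to dropping the column.
X_dupes = pd.DataFrame({"order_id": [0, 0, 1, 2], "amount": [10, 42, 42, 51]})
for message in check.validate(X_dupes):
    print(message["code"], message["action_options"][0]["code"])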
37 changes: 35 additions & 2 deletions docs/source/user_guide/data_checks.ipynb
@@ -248,7 +248,7 @@
"source": [
"### ID Columns\n",
"\n",
"ID columns in your dataset provide little to no benefit to a machine learning pipeline as the pipeline cannot extrapolate useful information from unique identifiers. Thus, `IDColumnsDataCheck` reminds you if these columns exists. In the given example, 'user_number' and 'id' columns are both identified as potentially being unique identifiers that should be removed."
"ID columns in your dataset provide little to no benefit to a machine learning pipeline as the pipeline cannot extrapolate useful information from unique identifiers. Thus, `IDColumnsDataCheck` reminds you if these columns exists. In the given example, 'user_number' and 'revenue_id' columns are both identified as potentially being unique identifiers that should be removed."
]
},
{
@@ -261,7 +261,40 @@
"\n",
"X = pd.DataFrame(\n",
" [[0, 53, 6325, 5], [1, 90, 6325, 10], [2, 90, 18, 20]],\n",
" columns=[\"user_number\", \"cost\", \"revenue\", \"id\"],\n",
" columns=[\"user_number\", \"cost\", \"revenue\", \"revenue_id\"],\n",
")\n",
"\n",
"id_col_check = IDColumnsDataCheck(id_threshold=0.9)\n",
"messages = id_col_check.validate(X)\n",
"\n",
"errors = [message for message in messages if message[\"level\"] == \"error\"]\n",
"warnings = [message for message in messages if message[\"level\"] == \"warning\"]\n",
"\n",
"for warning in warnings:\n",
" print(\"Warning:\", warning[\"message\"])\n",
"\n",
"for error in errors:\n",
" print(\"Error:\", error[\"message\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Primary key columns however, can be useful. Primary key columns are typically the first column in the dataset, have all unique values, and are either named `ID` or a name that ends with `_id`. Though they are ignored from the modeling process, they can be used as an identifier to query on before or after the modeling process. `IDColumnsDataCheck` will also remind you if it finds that the first column of the DataFrame is a primary key. In the given example, `user_id` is identified as a primary key, while `revenue_id` was identified as a regular unique identifier. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from evalml.data_checks import IDColumnsDataCheck\n",
"\n",
"X = pd.DataFrame(\n",
" [[0, 53, 6325, 5], [1, 90, 6325, 10], [2, 90, 18, 20]],\n",
" columns=[\"user_id\", \"cost\", \"revenue\", \"revenue_id\"],\n",
")\n",
"\n",
"id_col_check = IDColumnsDataCheck(id_threshold=0.9)\n",
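The diff omits the rest of this cell. Based on the first example above, it presumably continues with the same validate-and-print pattern; the following is a sketch of that continuation, not the verbatim notebook source.

messages = id_col_check.validate(X)

errors = [message for message in messages if message["level"] == "error"]
warnings = [message for message in messages if message["level"] == "warning"]

for warning in warnings:
    print("Warning:", warning["message"])

for error in errors:
    print("Error:", error["message"])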
142 changes: 78 additions & 64 deletions evalml/data_checks/id_columns_data_check.py
@@ -89,66 +89,81 @@ def validate(self, X, y=None):
... }
... ]
If the first column of the dataframe is identified as an ID column, it is most likely the primary key.
Despite being all unique, "Country_Rank" will not be identified as an ID column because id_threshold is set to 1.0
by default and its name doesn't indicate that it's an ID.
>>> df = pd.DataFrame({
... "sales_id": [0, 1, 2, 3, 4],
... "customer_id": [123, 124, 125, 126, 127],
... "Sales": [10, 42, 31, 51, 61]
... "humidity": ["high", "very high", "low", "low", "high"],
... "Country_Rank": [1, 2, 3, 4, 5],
... "Sales": ["very high", "high", "high", "medium", "very low"]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == []
However, lowering the threshold will cause this column to be identified as an ID.
>>> id_col_check = IDColumnsDataCheck()
>>> id_col_check = IDColumnsDataCheck(id_threshold=0.95)
>>> assert id_col_check.validate(df) == [
... {
... "message": "The first column 'sales_id' is most likely to be the ID column. Columns 'customer_id' are also 100.0% or more likely to be an ID column",
... "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
... "data_check_name": "IDColumnsDataCheck",
... "level": "warning",
... "code": "HAS_ID_FIRST_COLUMN",
... "details": {'columns': ["sales_id", "customer_id"], 'rows': None},
... "details": {"columns": ["Country_Rank"], "rows": None},
... "code": "HAS_ID_COLUMN",
... "action_options": [
... {
... "code": "SET_FIRST_COL_ID",
... "code": "DROP_COL",
... "data_check_name": "IDColumnsDataCheck",
... "parameters": {},
... "metadata": {'columns': ["sales_id", "customer_id"], 'rows': None}
... "metadata": {"columns": ["Country_Rank"], "rows": None}
... }
... ]
... }
... }
... ]
Despite being all unique, "Country_Rank" will not be identified as an ID column because id_threshold is set to 1.0
by default and its name doesn't indicate that it's an ID.
If the first column of the dataframe has all unique values and is either named 'ID' or has a name that ends with '_id', it is probably the primary key.
The other ID columns should be dropped.
>>> df = pd.DataFrame({
... "humidity": ["high", "very high", "low", "low", "high"],
... "Country_Rank": [1, 2, 3, 4, 5],
... "Sales": ["very high", "high", "high", "medium", "very low"]
... "sales_id": [0, 1, 2, 3, 4],
... "customer_id": [123, 124, 125, 126, 127],
... "Sales": [10, 42, 31, 51, 61]
... })
...
>>> id_col_check = IDColumnsDataCheck()
>>> assert id_col_check.validate(df) == []
However, lowering the threshold will cause this column to be identified as an ID.
>>> id_col_check = IDColumnsDataCheck()
>>> id_col_check = IDColumnsDataCheck(id_threshold=0.95)
>>> assert id_col_check.validate(df) == [
... {
... "message": "Columns 'Country_Rank' are 95.0% or more likely to be an ID column",
... "message": "The first column 'sales_id' is likely to be the primary key",
... "data_check_name": "IDColumnsDataCheck",
... "level": "warning",
... "code": "HAS_ID_FIRST_COLUMN",
... "details": {"columns": "sales_id", "rows": None},
... "action_options": [
... {
... "code": "SET_FIRST_COL_ID",
... "data_check_name": "IDColumnsDataCheck",
... "parameters": {},
... "metadata": {"columns": "sales_id", "rows": None}
... }
... ]
... },
... {
... "message": "Columns 'customer_id' are 100.0% or more likely to be an ID column",
... "data_check_name": "IDColumnsDataCheck",
... "level": "warning",
... "details": {"columns": ["Country_Rank"], "rows": None},
... "code": "HAS_ID_COLUMN",
... "details": {"columns": ["customer_id"], "rows": None},
... "action_options": [
... {
... "code": "DROP_COL",
... "data_check_name": "IDColumnsDataCheck",
... "parameters": {},
... "metadata": {"columns": ["Country_Rank"], "rows": None}
... "metadata": {"columns": ["customer_id"], "rows": None}
... }
... ]
... }
... }
... ]
"""
messages = []
@@ -187,54 +202,53 @@ def validate(self, X, y=None):
key: value for key, value in id_cols.items() if value >= self.id_threshold
}

first_col_id = False

if col_names and col_names[0] in id_cols_above_threshold:
first_col_id = True
del id_cols_above_threshold[col_names[0]]

if id_cols_above_threshold:
warning_msg = ""
message_code = None
action_code = None
if first_col_id:
warning_msg = "The first column '{}' is most likely to be the ID column. Columns {} are also {}% or more likely to be an ID column"
if (
col_names[0] in id_cols_above_threshold
and id_cols_above_threshold[col_names[0]] == 1.0
):
del id_cols_above_threshold[col_names[0]]
warning_msg = "The first column '{}' is likely to be the primary key"
warning_msg = warning_msg.format(
col_names[0],
(", ").join(
["'{}'".format(str(col)) for col in id_cols_above_threshold],
),
self.id_threshold * 100,
)
message_code = DataCheckMessageCode.HAS_ID_FIRST_COLUMN
action_code = DataCheckActionCode.SET_FIRST_COL_ID
details = {"columns": [col_names[0]] + (list(id_cols_above_threshold))}
else:
messages.append(
DataCheckWarning(
message=warning_msg,
data_check_name=self.name,
message_code=DataCheckMessageCode.HAS_ID_FIRST_COLUMN,
details={"columns": col_names[0]},
action_options=[
DataCheckActionOption(
DataCheckActionCode.SET_FIRST_COL_ID,
data_check_name=self.name,
metadata={"columns": col_names[0]},
),
],
).to_dict(),
)
if id_cols_above_threshold:
warning_msg = "Columns {} are {}% or more likely to be an ID column"
warning_msg = warning_msg.format(
(", ").join(
["'{}'".format(str(col)) for col in id_cols_above_threshold],
),
self.id_threshold * 100,
)
message_code = DataCheckMessageCode.HAS_ID_COLUMN
action_code = DataCheckActionCode.DROP_COL
details = {"columns": list(id_cols_above_threshold)}

messages.append(
DataCheckWarning(
message=warning_msg,
data_check_name=self.name,
message_code=message_code,
details=details,
action_options=[
DataCheckActionOption(
action_code,
data_check_name=self.name,
metadata=details,
),
],
).to_dict(),
)
messages.append(
DataCheckWarning(
message=warning_msg,
data_check_name=self.name,
message_code=DataCheckMessageCode.HAS_ID_COLUMN,
details={"columns": list(id_cols_above_threshold)},
action_options=[
DataCheckActionOption(
DataCheckActionCode.DROP_COL,
data_check_name=self.name,
metadata={"columns": list(id_cols_above_threshold)},
),
],
).to_dict(),
)

return messages
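The new guard promotes the first column only when its recorded likelihood is exactly 1.0, which the check assigns only to an ID-named column whose values are all unique. Conceptually, the condition reduces to the standalone sketch below (data invented for illustration):

import pandas as pd

df = pd.DataFrame({
    "sales_id": [0, 1, 2, 3, 4],
    "customer_id": [123, 124, 125, 126, 127],
})

first_col = df.columns[0]
# Promote the first column to primary key only when it is ID-named and
# every value in it is unique.
looks_like_id = str(first_col).lower() == "id" or str(first_col).lower().endswith("_id")
print(looks_like_id and df[first_col].is_unique)  # True for this frame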
42 changes: 42 additions & 0 deletions evalml/tests/conftest.py
@@ -89,6 +89,48 @@ def graphviz():
return graphviz


@pytest.fixture
def get_test_data_with_or_without_primary_key():
def _get_test_data_with_primary_key(input_type, has_primary_key):
X = None
if input_type == "integer":
X_dict = {
"col_1_id": [0, 1, 2, 3],
"col_2": [2, 3, 4, 5],
"col_3_id": [1, 1, 2, 3],
"col_5": [0, 0, 1, 2],
}
if not has_primary_key:
X_dict["col_1_id"] = [1, 1, 2, 3]
X = pd.DataFrame.from_dict(X_dict)

elif input_type == "string":
X_dict = {
"col_1_id": ["a", "b", "c", "d"],
"col_2": ["w", "x", "y", "z"],
"col_3_id": [
"123456789012345",
"234567890123456",
"3456789012345678",
"45678901234567",
],
"col_5": ["0", "0", "1", "2"],
}
if not has_primary_key:
X_dict["col_1_id"] = ["b", "b", "c", "d"]
X = pd.DataFrame.from_dict(X_dict)
X.ww.init(
logical_types={
"col_1_id": "categorical",
"col_2": "categorical",
"col_5": "categorical",
},
)
return X

return _get_test_data_with_primary_key


@pytest.fixture
def get_test_data_from_configuration():
def _get_test_data_from_configuration(
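A hypothetical test built on this fixture could exercise the new uniqueness guard directly; the test name and assertions below are illustrative and not part of this commit.

from evalml.data_checks import IDColumnsDataCheck


def test_first_column_flagged_only_when_unique(get_test_data_with_or_without_primary_key):
    # Unique first column: the check should emit the primary key warning.
    X = get_test_data_with_or_without_primary_key("integer", has_primary_key=True)
    codes = [message["code"] for message in IDColumnsDataCheck().validate(X)]
    assert "HAS_ID_FIRST_COLUMN" in codes

    # Duplicated first column: the primary key warning should not appear.
    X = get_test_data_with_or_without_primary_key("integer", has_primary_key=False)
    codes = [message["code"] for message in IDColumnsDataCheck().validate(X)]
    assert "HAS_ID_FIRST_COLUMN" not in codes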

