New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Modifying ohe get_feature_names so encoded columns are always unique #1349
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1349 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 220 220
Lines 14699 14742 +43
=========================================
+ Hits 14692 14735 +43
Misses 7 7
Continue to review full report at Codecov.
|
def get_feature_names(self): | ||
"""Return feature names for the input features after fitting. | ||
|
||
Returns: | ||
np.array: The feature names after encoding, provided in the same order as input_features. | ||
""" | ||
return self._encoder.get_feature_names(self.features_to_encode) | ||
unique_names = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested this on a df with two columns with 10,000 unique categories and top_n set to None. timeit
reports a mean time of 5.75 ms which I think is acceptable because we usually pass in top_n and name clashes are pretty rare in my experience.
@@ -122,6 +122,31 @@ def test_drop(): | |||
assert col_names == expected_col_names | |||
|
|||
|
|||
def test_drop_binary(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noticed we didn't have coverage for all accepted values of drop.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I left a comment on potential styling but looks good either way!
|
||
df = pd.DataFrame({"A": ["x_y", "z"], "A_x": ["y_1", "y"], "A_x_y": ["1", "y"]}) | ||
df_transformed = OneHotEncoder().fit_transform(df) | ||
assert set(df_transformed.columns) == {"A_x_y", "A_z", "A_x_y_1", "A_x_y_1_1", "A_x_y_1_1_2", "A_x_y_y"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks kinda wild, although I guess this instance shouldn't occur very often for a user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea I don't think there is a perfect solution here in the sense that any solution will slightly obfuscate the column name/category level that results in each column in the transformed dataframe (you can create an adversarial example no matter the format we use to name the new columns).
That being said, I think it's important to have unique names to avoid any potential downstream bugs that can arise from having duplicate column names.
Maybe we can better explain how the columns are named in the docs (and explain what happens when there are collisions) so users can better trace the data through the pipeline if they are debugging.
"""Helper to make the name unique.""" | ||
i = 1 | ||
while name in seen_before: | ||
name = f"{name}_{i}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we do something like
i = 1
name = f"{name}_{i}"
while name in seen_before:
i += 1
name = f"{name[:name.rindex("_")]}_{i}"
It's messier, but I think it makes it cleaner from a user standpoint to see
transformed.columns = ['X_y_1', 'X_y_2', 'X_y_3']
versus
transformed.columns = ['X_y_1', 'X_y_1_2', 'X_y_1_2_3']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great suggestion! I am not sure there is a perfect solution here but I think this is nicer than what I originally had hehe
abfd78b
to
e73f387
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Left one comment about a redundant variable
unique_names.append(proposed_name) | ||
seen_before.add(proposed_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary to have these be two separate variables? It looks like they store the same information
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question! The _make_name_unique
needs to repeatedly check whether we have already seen a name so a set is better than a list for that I think. The other option is to create a set from unique_names
inside _make_name_unique
but I figured we might as well just keep one set around at that point. What do you think?
We definitely need to return a list because the order of the columns matters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, that makes a lot of sense. Totally fine with keeping both!
fffe11d
to
3ec02ec
Compare
proposed_name = f"{col}_{category}" | ||
if proposed_name in seen_before: | ||
proposed_name = self._make_name_unique(proposed_name, seen_before) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really nit-picky comment but we can keep the logic of making a unique name in make_name_unique()
by not checking here, and doing so in the helper instead. What I mean is here,
proposed_name = self._make_name_unique(f"{col}_{category}", seen_before)
And then in _make_name_unique
:
def _make_name_unique(name, seen_before):
"""Helper to make the name unique."""
i = 1
while name in seen_before:
name = f"{name[:name.rindex('_')]}_{i}"
i += 1
return name
Something like that? :D
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good suggestion! I had to add some minor modifications but the loop within get_feature_names
is easier to read now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really awesome stuff! LGTM, left one really nit-picky comment on how we could cut down on checking for whether or not we need to make a unique name or not :)
c4f9c16
to
5fedea1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, one more thing: it would be good to document this behavior somewhere, so that a user who uses the OHE isn't confused ("What is A_x_y_1_2? Is it encoding col A_x_y_1 value 2? Or category 1 in A_x_y?" Docstring update is sufficent imo :)
5fedea1
to
0a21912
Compare
an integer will be added at the end of the feature name to distinguish it. | ||
|
||
For example, consider a dataframe with a column called "A" and category "x_y" and another column | ||
called "A_x" with "y". In this example, the feature names would be "A_x_y" and "A_x_y_1". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edge case: does this fix address the case where you have a feature "A_x" with category "y" and a feature "A" with category "x_y"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the first case in test_ohe_column_names_unique
but let me know if I'm misunderstanding!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool!
0a21912
to
8f53096
Compare
Pull Request Description
Fixes #1298
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of
docs/source/release_notes.rst
to include this pull request by adding :pr:123
.