Modifying ohe get_feature_names so encoded columns are always unique#1349
Modifying ohe get_feature_names so encoded columns are always unique#1349freddyaboulton merged 10 commits intomainfrom
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1349 +/- ##
=========================================
+ Coverage 100.0% 100.0% +0.1%
=========================================
Files 220 220
Lines 14699 14742 +43
=========================================
+ Hits 14692 14735 +43
Misses 7 7
Continue to review full report at Codecov.
|
| np.array: The feature names after encoding, provided in the same order as input_features. | ||
| """ | ||
| return self._encoder.get_feature_names(self.features_to_encode) | ||
| unique_names = [] |
There was a problem hiding this comment.
Tested this on a df with two columns with 10,000 unique categories and top_n set to None. timeit reports a mean time of 5.75 ms which I think is acceptable because we usually pass in top_n and name clashes are pretty rare in my experience.
| assert col_names == expected_col_names | ||
|
|
||
|
|
||
| def test_drop_binary(): |
There was a problem hiding this comment.
Noticed we didn't have coverage for all accepted values of drop.
bchen1116
left a comment
There was a problem hiding this comment.
LGTM! I left a comment on potential styling but looks good either way!
|
|
||
| df = pd.DataFrame({"A": ["x_y", "z"], "A_x": ["y_1", "y"], "A_x_y": ["1", "y"]}) | ||
| df_transformed = OneHotEncoder().fit_transform(df) | ||
| assert set(df_transformed.columns) == {"A_x_y", "A_z", "A_x_y_1", "A_x_y_1_1", "A_x_y_1_1_2", "A_x_y_y"} |
There was a problem hiding this comment.
This looks kinda wild, although I guess this instance shouldn't occur very often for a user.
There was a problem hiding this comment.
Yea I don't think there is a perfect solution here in the sense that any solution will slightly obfuscate the column name/category level that results in each column in the transformed dataframe (you can create an adversarial example no matter the format we use to name the new columns).
That being said, I think it's important to have unique names to avoid any potential downstream bugs that can arise from having duplicate column names.
Maybe we can better explain how the columns are named in the docs (and explain what happens when there are collisions) so users can better trace the data through the pipeline if they are debugging.
| """Helper to make the name unique.""" | ||
| i = 1 | ||
| while name in seen_before: | ||
| name = f"{name}_{i}" |
There was a problem hiding this comment.
Could we do something like
i = 1
name = f"{name}_{i}"
while name in seen_before:
i += 1
name = f"{name[:name.rindex("_")]}_{i}"It's messier, but I think it makes it cleaner from a user standpoint to see
transformed.columns = ['X_y_1', 'X_y_2', 'X_y_3']
versus
transformed.columns = ['X_y_1', 'X_y_1_2', 'X_y_1_2_3']
There was a problem hiding this comment.
Great suggestion! I am not sure there is a perfect solution here but I think this is nicer than what I originally had hehe
abfd78b to
e73f387
Compare
eccabay
left a comment
There was a problem hiding this comment.
LGTM! Left one comment about a redundant variable
| unique_names.append(proposed_name) | ||
| seen_before.add(proposed_name) |
There was a problem hiding this comment.
Is it necessary to have these be two separate variables? It looks like they store the same information
There was a problem hiding this comment.
Good question! The _make_name_unique needs to repeatedly check whether we have already seen a name so a set is better than a list for that I think. The other option is to create a set from unique_names inside _make_name_unique but I figured we might as well just keep one set around at that point. What do you think?
We definitely need to return a list because the order of the columns matters.
There was a problem hiding this comment.
I see, that makes a lot of sense. Totally fine with keeping both!
fffe11d to
3ec02ec
Compare
| proposed_name = f"{col}_{category}" | ||
| if proposed_name in seen_before: | ||
| proposed_name = self._make_name_unique(proposed_name, seen_before) |
There was a problem hiding this comment.
Really nit-picky comment but we can keep the logic of making a unique name in make_name_unique() by not checking here, and doing so in the helper instead. What I mean is here,
proposed_name = self._make_name_unique(f"{col}_{category}", seen_before)
And then in _make_name_unique:
def _make_name_unique(name, seen_before):
"""Helper to make the name unique."""
i = 1
while name in seen_before:
name = f"{name[:name.rindex('_')]}_{i}"
i += 1
return name
Something like that? :D
There was a problem hiding this comment.
Good suggestion! I had to add some minor modifications but the loop within get_feature_names is easier to read now.
angela97lin
left a comment
There was a problem hiding this comment.
Really awesome stuff! LGTM, left one really nit-picky comment on how we could cut down on checking for whether or not we need to make a unique name or not :)
c4f9c16 to
5fedea1
Compare
angela97lin
left a comment
There was a problem hiding this comment.
Ah, one more thing: it would be good to document this behavior somewhere, so that a user who uses the OHE isn't confused ("What is A_x_y_1_2? Is it encoding col A_x_y_1 value 2? Or category 1 in A_x_y?" Docstring update is sufficent imo :)
5fedea1 to
0a21912
Compare
| an integer will be added at the end of the feature name to distinguish it. | ||
|
|
||
| For example, consider a dataframe with a column called "A" and category "x_y" and another column | ||
| called "A_x" with "y". In this example, the feature names would be "A_x_y" and "A_x_y_1". |
There was a problem hiding this comment.
Edge case: does this fix address the case where you have a feature "A_x" with category "y" and a feature "A" with category "x_y"?
There was a problem hiding this comment.
I think this is the first case in test_ohe_column_names_unique but let me know if I'm misunderstanding!
0a21912 to
8f53096
Compare
Pull Request Description
Fixes #1298
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of
docs/source/release_notes.rstto include this pull request by adding :pr:123.