Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modifying ohe get_feature_names so encoded columns are always unique #1349

Merged
merged 10 commits into from
Nov 20, 2020

Conversation

freddyaboulton
Copy link
Contributor

Pull Request Description

Fixes #1298


After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

@codecov
Copy link

codecov bot commented Oct 26, 2020

Codecov Report

Merging #1349 (8f53096) into main (9dfcc4b) will increase coverage by 0.1%.
The diff coverage is 100.0%.

Impacted file tree graph

@@            Coverage Diff            @@
##             main    #1349     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         220      220             
  Lines       14699    14742     +43     
=========================================
+ Hits        14692    14735     +43     
  Misses          7        7             
Impacted Files Coverage Δ
...components/transformers/encoders/onehot_encoder.py 100.0% <100.0%> (ø)
...alml/tests/component_tests/test_one_hot_encoder.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9dfcc4b...8f53096. Read the comment docs.

def get_feature_names(self):
"""Return feature names for the input features after fitting.

Returns:
np.array: The feature names after encoding, provided in the same order as input_features.
"""
return self._encoder.get_feature_names(self.features_to_encode)
unique_names = []
Copy link
Contributor Author

@freddyaboulton freddyaboulton Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested this on a df with two columns with 10,000 unique categories and top_n set to None. timeit reports a mean time of 5.75 ms which I think is acceptable because we usually pass in top_n and name clashes are pretty rare in my experience.

@@ -122,6 +122,31 @@ def test_drop():
assert col_names == expected_col_names


def test_drop_binary():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed we didn't have coverage for all accepted values of drop.

Copy link
Contributor

@bchen1116 bchen1116 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I left a comment on potential styling but looks good either way!


df = pd.DataFrame({"A": ["x_y", "z"], "A_x": ["y_1", "y"], "A_x_y": ["1", "y"]})
df_transformed = OneHotEncoder().fit_transform(df)
assert set(df_transformed.columns) == {"A_x_y", "A_z", "A_x_y_1", "A_x_y_1_1", "A_x_y_1_1_2", "A_x_y_y"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks kinda wild, although I guess this instance shouldn't occur very often for a user.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea I don't think there is a perfect solution here in the sense that any solution will slightly obfuscate the column name/category level that results in each column in the transformed dataframe (you can create an adversarial example no matter the format we use to name the new columns).

That being said, I think it's important to have unique names to avoid any potential downstream bugs that can arise from having duplicate column names.

Maybe we can better explain how the columns are named in the docs (and explain what happens when there are collisions) so users can better trace the data through the pipeline if they are debugging.

"""Helper to make the name unique."""
i = 1
while name in seen_before:
name = f"{name}_{i}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we do something like

i = 1
name = f"{name}_{i}"
while name in seen_before:
    i += 1
    name = f"{name[:name.rindex("_")]}_{i}"

It's messier, but I think it makes it cleaner from a user standpoint to see

transformed.columns = ['X_y_1', 'X_y_2', 'X_y_3']

versus

transformed.columns = ['X_y_1', 'X_y_1_2', 'X_y_1_2_3']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion! I am not sure there is a perfect solution here but I think this is nicer than what I originally had hehe

Copy link
Contributor

@eccabay eccabay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Left one comment about a redundant variable

Comment on lines +213 to +229
unique_names.append(proposed_name)
seen_before.add(proposed_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it necessary to have these be two separate variables? It looks like they store the same information

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! The _make_name_unique needs to repeatedly check whether we have already seen a name so a set is better than a list for that I think. The other option is to create a set from unique_names inside _make_name_unique but I figured we might as well just keep one set around at that point. What do you think?

We definitely need to return a list because the order of the columns matters.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, that makes a lot of sense. Totally fine with keeping both!

@freddyaboulton freddyaboulton force-pushed the 1298-one-hot-encoder-unique-names branch 2 times, most recently from fffe11d to 3ec02ec Compare October 30, 2020 20:22
Comment on lines 210 to 212
proposed_name = f"{col}_{category}"
if proposed_name in seen_before:
proposed_name = self._make_name_unique(proposed_name, seen_before)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nit-picky comment but we can keep the logic of making a unique name in make_name_unique() by not checking here, and doing so in the helper instead. What I mean is here,

proposed_name = self._make_name_unique(f"{col}_{category}", seen_before)

And then in _make_name_unique:

    def _make_name_unique(name, seen_before):
        """Helper to make the name unique."""
        i = 1
        while name in seen_before:
            name = f"{name[:name.rindex('_')]}_{i}"
            i += 1
        return name

Something like that? :D

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! I had to add some minor modifications but the loop within get_feature_names is easier to read now.

Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really awesome stuff! LGTM, left one really nit-picky comment on how we could cut down on checking for whether or not we need to make a unique name or not :)

@CLAassistant
Copy link

CLAassistant commented Nov 4, 2020

CLA assistant check
All committers have signed the CLA.

@freddyaboulton freddyaboulton force-pushed the 1298-one-hot-encoder-unique-names branch 2 times, most recently from c4f9c16 to 5fedea1 Compare November 4, 2020 16:29
Copy link
Contributor

@angela97lin angela97lin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, one more thing: it would be good to document this behavior somewhere, so that a user who uses the OHE isn't confused ("What is A_x_y_1_2? Is it encoding col A_x_y_1 value 2? Or category 1 in A_x_y?" Docstring update is sufficent imo :)

@freddyaboulton freddyaboulton force-pushed the 1298-one-hot-encoder-unique-names branch from 5fedea1 to 0a21912 Compare November 5, 2020 15:57
an integer will be added at the end of the feature name to distinguish it.

For example, consider a dataframe with a column called "A" and category "x_y" and another column
called "A_x" with "y". In this example, the feature names would be "A_x_y" and "A_x_y_1".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edge case: does this fix address the case where you have a feature "A_x" with category "y" and a feature "A" with category "x_y"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the first case in test_ohe_column_names_unique but let me know if I'm misunderstanding!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

@freddyaboulton freddyaboulton force-pushed the 1298-one-hot-encoder-unique-names branch from 0a21912 to 8f53096 Compare November 20, 2020 20:08
@freddyaboulton freddyaboulton merged commit 5d1ea1f into main Nov 20, 2020
@freddyaboulton freddyaboulton deleted the 1298-one-hot-encoder-unique-names branch November 20, 2020 20:25
@dsherry dsherry mentioned this pull request Nov 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OHE doesn't always result in unique names
6 participants