Modifying ohe get_feature_names so encoded columns are always unique #1349

freddyaboulton · 2020-10-26T22:07:15Z

Pull Request Description

After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of docs/source/release_notes.rst to include this pull request by adding :pr:123.

codecov · 2020-10-26T22:14:21Z

Codecov Report

Merging #1349 (8f53096) into main (9dfcc4b) will increase coverage by 0.1%.
The diff coverage is 100.0%.

@@            Coverage Diff            @@
##             main    #1349     +/-   ##
=========================================
+ Coverage   100.0%   100.0%   +0.1%     
=========================================
  Files         220      220             
  Lines       14699    14742     +43     
=========================================
+ Hits        14692    14735     +43     
  Misses          7        7

Impacted Files	Coverage Δ
...components/transformers/encoders/onehot_encoder.py	`100.0% <100.0%> (ø)`
...alml/tests/component_tests/test_one_hot_encoder.py	`100.0% <100.0%> (ø)`


freddyaboulton · 2020-10-27T15:16:13Z

evalml/pipelines/components/transformers/encoders/onehot_encoder.py

    def get_feature_names(self):
        """Return feature names for the input features after fitting.

        Returns:
            np.array: The feature names after encoding, provided in the same order as input_features.
        """
-        return self._encoder.get_feature_names(self.features_to_encode)
+        unique_names = []


Tested this on a DataFrame with two columns with 10,000 unique categories and top_n set to None. timeit reports a mean time of 5.75 ms, which I think is acceptable because we usually pass in top_n and, in my experience, name clashes are pretty rare.
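A rough way to reproduce this kind of measurement without evalml is sketched below. It is only an illustration: `unique_feature_names` is a hypothetical stand-in for the naming loop (not the merged code), and the data assumes each of the two columns has 10,000 categories, which the comment above does not state explicitly. Timings will differ by machine.

```python
import timeit


def unique_feature_names(categories_by_column):
    """Hypothetical stand-in: build 'column_category' names, adding a counter on collisions."""
    seen_before = set()
    unique_names = []
    for col, categories in categories_by_column.items():
        for category in categories:
            base = f"{col}_{category}"
            name = base
            i = 1
            # Only loops when the proposed name was already produced.
            while name in seen_before:
                name = f"{base}_{i}"
                i += 1
            unique_names.append(name)
            seen_before.add(name)
    return unique_names


# Assumed benchmark shape: two columns, 10,000 distinct categories each.
categories_by_column = {
    "a": [str(i) for i in range(10_000)],
    "b": [str(i) for i in range(10_000)],
}

runs = 20
total_seconds = timeit.timeit(
    lambda: unique_feature_names(categories_by_column), number=runs
)
mean_seconds = total_seconds / runs
print(f"mean time per call: {mean_seconds * 1000:.2f} ms")
```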

freddyaboulton · 2020-10-27T15:17:23Z

evalml/tests/component_tests/test_one_hot_encoder.py

@@ -122,6 +122,31 @@ def test_drop():
    assert col_names == expected_col_names


+def test_drop_binary():


Noticed we didn't have coverage for all accepted values of drop.

bchen1116

LGTM! I left a comment on potential styling but looks good either way!

bchen1116 · 2020-10-28T15:35:56Z

evalml/tests/component_tests/test_one_hot_encoder.py

+
+    df = pd.DataFrame({"A": ["x_y", "z"], "A_x": ["y_1", "y"], "A_x_y": ["1", "y"]})
+    df_transformed = OneHotEncoder().fit_transform(df)
+    assert set(df_transformed.columns) == {"A_x_y", "A_z", "A_x_y_1", "A_x_y_1_1", "A_x_y_1_1_2", "A_x_y_y"}


This looks kinda wild, although I guess this instance shouldn't occur very often for a user.

Yea I don't think there is a perfect solution here in the sense that any solution will slightly obfuscate the column name/category level that results in each column in the transformed dataframe (you can create an adversarial example no matter the format we use to name the new columns).

That being said, I think it's important to have unique names to avoid any potential downstream bugs that can arise from having duplicate column names.

Maybe we can better explain how the columns are named in the docs (and explain what happens when there are collisions) so users can better trace the data through the pipeline if they are debugging.
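To make the collision concrete, here is a small sketch in plain Python (no evalml involved) that joins column and category with an underscore, using the same toy data as the test above. The dictionary layout and variable names are illustrative assumptions.

```python
from collections import Counter

# Toy data mirroring the test above: column name -> categories seen in that column.
categories_by_column = {
    "A": ["x_y", "z"],
    "A_x": ["y", "y_1"],
    "A_x_y": ["1", "y"],
}

# Naive scheme: "<column>_<category>" with no uniqueness check.
naive_names = [
    f"{col}_{category}"
    for col, categories in categories_by_column.items()
    for category in categories
]

# Any name produced more than once is a collision.
duplicates = sorted(name for name, count in Counter(naive_names).items() if count > 1)

print(naive_names)  # ['A_x_y', 'A_z', 'A_x_y', 'A_x_y_1', 'A_x_y_1', 'A_x_y_y']
print(duplicates)   # ['A_x_y', 'A_x_y_1']
```

Six encoded columns end up with only four distinct names, which is the situation the PR is meant to rule out.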

bchen1116 · 2020-10-28T15:45:23Z

evalml/pipelines/components/transformers/encoders/onehot_encoder.py

+        """Helper to make the name unique."""
+        i = 1
+        while name in seen_before:
+            name = f"{name}_{i}"


Could we do something like

    i = 1
    name = f"{name}_{i}"
    while name in seen_before:
        i += 1
        name = f"{name[:name.rindex('_')]}_{i}"

It's messier, but I think it makes it cleaner from a user standpoint to see

transformed.columns = ['X_y_1', 'X_y_2', 'X_y_3']

versus

transformed.columns = ['X_y_1', 'X_y_1_2', 'X_y_1_2_3']

Great suggestion! I am not sure there is a perfect solution here but I think this is nicer than what I originally had hehe
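The two naming schemes being compared can be sketched side by side as follows. This is a standalone illustration, not the code in the PR: `cumulative_suffix`, `replaced_suffix`, and `dedupe` are hypothetical names, and the caller-side `if name in seen_before` check follows the version of the diff under review at this point in the thread.

```python
def cumulative_suffix(name, seen_before):
    """Original idea: keep appending a new counter to the already-suffixed name."""
    i = 1
    while name in seen_before:
        name = f"{name}_{i}"
        i += 1
    return name


def replaced_suffix(name, seen_before):
    """Suggested idea: append one counter, then replace it on each retry.

    Assumes the caller only invokes this for names that were already seen.
    """
    i = 1
    name = f"{name}_{i}"
    while name in seen_before:
        i += 1
        name = f"{name[:name.rindex('_')]}_{i}"
    return name


def dedupe(names, make_unique):
    """Apply a renaming helper to every name that has already been used."""
    seen_before = set()
    unique_names = []
    for name in names:
        if name in seen_before:
            name = make_unique(name, seen_before)
        unique_names.append(name)
        seen_before.add(name)
    return unique_names


names = ["X_y"] * 4
print(dedupe(names, cumulative_suffix))  # ['X_y', 'X_y_1', 'X_y_1_2', 'X_y_1_2_3']
print(dedupe(names, replaced_suffix))    # ['X_y', 'X_y_1', 'X_y_2', 'X_y_3']
```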

eccabay

LGTM! Left one comment about a redundant variable

eccabay · 2020-10-30T14:26:51Z

evalml/pipelines/components/transformers/encoders/onehot_encoder.py

+                unique_names.append(proposed_name)
+                seen_before.add(proposed_name)


Is it necessary to have these be two separate variables? It looks like they store the same information

Good question! The _make_name_unique needs to repeatedly check whether we have already seen a name so a set is better than a list for that I think. The other option is to create a set from unique_names inside _make_name_unique but I figured we might as well just keep one set around at that point. What do you think?

We definitely need to return a list because the order of the columns matters.

I see, that makes a lot of sense. Totally fine with keeping both!
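The reasoning for keeping both containers can be checked with a quick membership benchmark. The sketch below is an assumption-laden illustration (the names and sizes are made up): it compares `in` on a list with `in` on a set holding the same strings. The list keeps column order, while the set gives hash-based lookups.

```python
import timeit

# Hypothetical feature names; the list preserves order, the set mirrors its contents.
unique_names = [f"col_{i}" for i in range(10_000)]
seen_before = set(unique_names)

# Probe the last element, which is the slowest case for a linear scan of the list.
probe = unique_names[-1]

list_seconds = timeit.timeit(lambda: probe in unique_names, number=200)
set_seconds = timeit.timeit(lambda: probe in seen_before, number=200)

print(f"list membership: {list_seconds:.6f} s for 200 lookups")
print(f"set membership:  {set_seconds:.6f} s for 200 lookups")
```

On typical hardware the set lookup should be much faster, but the exact ratio depends on the machine and the data.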

angela97lin · 2020-11-03T03:48:04Z

evalml/pipelines/components/transformers/encoders/onehot_encoder.py

+                proposed_name = f"{col}_{category}"
+                if proposed_name in seen_before:
+                    proposed_name = self._make_name_unique(proposed_name, seen_before)


Really nit-picky comment, but we can keep all the logic for making a name unique in _make_name_unique() by not checking here and doing the check in the helper instead. What I mean is, here:

proposed_name = self._make_name_unique(f"{col}_{category}", seen_before)

And then in _make_name_unique:

    def _make_name_unique(name, seen_before):
        """Helper to make the name unique."""
        i = 1
        while name in seen_before:
            name = f"{name[:name.rindex('_')]}_{i}"
            i += 1
        return name

Something like that? :D

Good suggestion! I had to add some minor modifications but the loop within get_feature_names is easier to read now.
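The thread does not say what the modifications were. One thing worth noting is that the suggested loop, run exactly as written, strips the last underscore-separated piece of the original name on its first pass. The sketch below shows that behavior next to one possible adjustment that remembers the original name; `suggested_literal` and `make_name_unique` are hypothetical names, and the second function is a guess at a workable variant, not the merged implementation.

```python
def suggested_literal(name, seen_before):
    """The loop from the review comment, run as written."""
    i = 1
    while name in seen_before:
        # On the first pass this drops the tail of the *original* name.
        name = f"{name[:name.rindex('_')]}_{i}"
        i += 1
    return name


def make_name_unique(name, seen_before):
    """Hypothetical variant: keep the original name as the base for every retry."""
    base = name
    i = 1
    while name in seen_before:
        name = f"{base}_{i}"
        i += 1
    return name


seen = {"A_x_y"}
print(suggested_literal("A_x_y", seen))  # A_x_1
print(make_name_unique("A_x_y", seen))   # A_x_y_1

seen.add("A_x_y_1")
print(make_name_unique("A_x_y", seen))   # A_x_y_2
```

With the check inside the helper, the calling loop can pass every proposed name through it unconditionally.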

angela97lin

Really awesome stuff! LGTM, left one really nit-picky comment on how we could cut down on checking whether we need to make a unique name :)

CLAassistant · 2020-11-04T14:42:19Z

All committers have signed the CLA.

angela97lin

Ah, one more thing: it would be good to document this behavior somewhere, so that a user who uses the OHE isn't confused ("What is A_x_y_1_2? Is it encoding col A_x_y_1 value 2? Or category 1 in A_x_y?"). A docstring update is sufficient imo :)
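To illustrate why such a name is hard to read back, the sketch below records which (column, category) pair each generated name came from. It is a standalone illustration: `feature_name_provenance` is a hypothetical helper, not part of evalml, and it assumes a scheme where collisions get an integer appended to the original proposed name.

```python
def feature_name_provenance(categories_by_column):
    """Map each unique feature name to the (column, category) pair it encodes."""
    provenance = {}
    for col, categories in categories_by_column.items():
        for category in categories:
            base = f"{col}_{category}"
            name = base
            i = 1
            # Append an integer to the original proposal until the name is unused.
            while name in provenance:
                name = f"{base}_{i}"
                i += 1
            provenance[name] = (col, category)
    return provenance


provenance = feature_name_provenance({
    "A": ["x_y", "z"],
    "A_x": ["y", "y_1"],
    "A_x_y": ["1", "y"],
})

for name, source in provenance.items():
    print(name, "<-", source)

# Under this assumed scheme, A_x_y_1_2 is category "1" of column "A_x_y".
print(provenance["A_x_y_1_2"])  # ('A_x_y', '1')
```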

dsherry · 2020-11-13T16:44:42Z

evalml/pipelines/components/transformers/encoders/onehot_encoder.py

+        an integer will be added at the end of the feature name to distinguish it.
+
+        For example, consider a dataframe with a column called "A" and category "x_y" and another column
+        called "A_x" with "y". In this example, the feature names would be "A_x_y" and "A_x_y_1".


Edge case: does this fix address the case where you have a feature "A_x" with category "y" and a feature "A" with category "x_y"?

I think this is the first case in test_ohe_column_names_unique but let me know if I'm misunderstanding!

freddyaboulton commented Oct 27, 2020

freddyaboulton marked this pull request as ready for review October 27, 2020 15:30

freddyaboulton requested review from dsherry, bchen1116, christopherbunn, jeremyliweishih and eccabay October 27, 2020 15:30

bchen1116 approved these changes Oct 28, 2020

freddyaboulton force-pushed the 1298-one-hot-encoder-unique-names branch from abfd78b to e73f387 Compare October 28, 2020 17:10

dsherry assigned freddyaboulton Oct 28, 2020

eccabay approved these changes Oct 30, 2020

freddyaboulton force-pushed the 1298-one-hot-encoder-unique-names branch 2 times, most recently from fffe11d to 3ec02ec Compare October 30, 2020 20:22

freddyaboulton requested a review from angela97lin November 2, 2020 22:34

angela97lin reviewed Nov 3, 2020

angela97lin approved these changes Nov 3, 2020

freddyaboulton force-pushed the 1298-one-hot-encoder-unique-names branch 2 times, most recently from c4f9c16 to 5fedea1 Compare November 4, 2020 16:29

angela97lin reviewed Nov 5, 2020

freddyaboulton force-pushed the 1298-one-hot-encoder-unique-names branch from 5fedea1 to 0a21912 Compare November 5, 2020 15:57

dsherry reviewed Nov 13, 2020

freddyaboulton added 6 commits November 20, 2020 15:04

Modifying ohe get_feature_names so they're always unique

d6e58c2

Adding PR 1349 to release notes.

5662f26

writing make_name_unique private method to make sure names are unique.

27c07fa

Removing unused import

71e8a01

Deleting unused line.

52cbefb

Add test coverage for drop being an array.

1221d3e

freddyaboulton added 4 commits November 20, 2020 15:05

Deleting drop attribute from OHE.

7f9bf79

Better name format for duplicate columns. Add helpful comment to test.

68f055a

Making the name unique only happens in the _make_name_unique helper function.

bd8f617

Explaining how duplicates are handled in OHE get_feature_names docstring.

8f53096

freddyaboulton force-pushed the 1298-one-hot-encoder-unique-names branch from 0a21912 to 8f53096 Compare November 20, 2020 20:08

freddyaboulton merged commit 5d1ea1f into main Nov 20, 2020

freddyaboulton deleted the 1298-one-hot-encoder-unique-names branch November 20, 2020 20:25

dsherry mentioned this pull request Nov 24, 2020

Release v0.16.0 #1468

Merged
