
Make OHE deterministic when top_n < no. of categories #1324

Merged
freddyaboulton merged 4 commits into main from 1279-fix-ohe-not-deterministic on Oct 22, 2020

Conversation

freddyaboulton (Contributor):

Pull Request Description

Fixes #1279



@freddyaboulton freddyaboulton changed the title Make OHE deterministic Make OHE deterministic when top_n > # categories Oct 20, 2020
@freddyaboulton freddyaboulton changed the title Make OHE deterministic when top_n > # categories Make OHE deterministic when top_n < # categories Oct 20, 2020
@freddyaboulton freddyaboulton changed the title Make OHE deterministic when top_n < # categories Make OHE deterministic when top_n < no. of categories Oct 20, 2020
codecov bot commented Oct 20, 2020

Codecov Report

Merging #1324 into main will not change coverage (+0.00%).
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1324   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files         213      213           
  Lines       13425    13436   +11     
=======================================
+ Hits        13418    13429   +11     
  Misses          7        7           
Impacted Files                                          Coverage Δ
...components/transformers/encoders/onehot_encoder.py   100.00% <100.00%> (ø)
...alml/tests/component_tests/test_one_hot_encoder.py   100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b09f041...9ae0bc5.

@@ -111,6 +112,7 @@ def fit(self, X, y=None):
         if top_n is None or len(value_counts) <= top_n:
             unique_values = value_counts.index.tolist()
         else:
+            self.random_state.set_state(self._initial_state)
dsherry (Collaborator):

@freddyaboulton could you please explain why this change is necessary?

Is the culprit the call to value_counts.sample below? Is this change masking a pandas bug? If so, could we do

# patch non-determinism bug in pandas sample
rs_pandas_patch = np.random.RandomState()
rs_pandas_patch.set_state(self._initial_state)
value_counts = value_counts.sample(frac=1, random_state=rs_pandas_patch)

freddyaboulton (Contributor, Author):

@dsherry Yes, the culprit is the call to sample! The problem is that the internal state of self.random_state gets modified after each call to sample. This is not a pandas bug but expected behavior of np.random.RandomState. I believe my fix and your code snippet are equivalent (explicitly setting the state of np.random.RandomState to what it was when the OHE was initialized).
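The state-advancing behavior is easy to see with a toy Series; this sketch uses only pandas and numpy, and the data is hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the encoder's value_counts (hypothetical data).
value_counts = pd.Series([5, 5, 5, 5], index=["a", "b", "c", "d"])

rs = np.random.RandomState(5)
initial_state = rs.get_state()

first = value_counts.sample(frac=1, random_state=rs).index.tolist()
# `rs` has now advanced: a second call continues the stream, so the
# shuffle order is generally different from the first.
second = value_counts.sample(frac=1, random_state=rs).index.tolist()

# Resetting the state replays the exact same shuffle.
rs.set_state(initial_state)
replayed = value_counts.sample(frac=1, random_state=rs).index.tolist()
assert replayed == first
```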

dsherry (Collaborator):

Ah, got it. Yep, thanks for that. I think what threw me off was that this call to set_state was pretty far into the nested if/else tree. As a reader, that had me thinking this code must only run in this specific block, when really it could happen at a higher level.

(Aside: because Python passes objects by reference, we should avoid calling self.random_state.set_state. That would mutate the state of a RandomState object that may be shared outside this component.)

@freddyaboulton I left a long writeup of my thoughts below. Can we jump on a call tomorrow and discuss what to do?


So when I read the reproducer on #1279 a couple weeks back, I wasn't reading carefully, and I thought that the bug was that this would fail:

df1 = OneHotEncoder(top_n=4, random_state=5).fit_transform(df)
df2 = OneHotEncoder(top_n=4, random_state=5).fit_transform(df)
pd.testing.assert_frame_equal(df1, df2)

However returning to that reproducer, I see that it was in fact

ohe = OneHotEncoder(top_n=4, random_state=5)
df1 = ohe.fit_transform(df)
df2 = ohe.fit_transform(df)
pd.testing.assert_frame_equal(df1, df2)

The way I see it, there's two options here, for fit, and for other component methods like transform/predict:

  1. Calling fit multiple times in a row will produce different behavior (because one RNG is held by the object)
  2. Calling fit multiple times in a row will not produce different behavior.

I believe this is a design decision (which we haven't made yet!) rather than a bug. I think option 2 is the more user-centric choice, and option 1 the more developer-centric one, i.e. easier for us to implement, haha.

My gut is telling me to argue that the first snippet above should pass (as it currently does, I believe?) and the second snippet should always fail, i.e. option 1. My rationale: as the caller, if you create a single object, which initializes the RNG, and then call fit on that object twice in a row, you should expect the RNG to be consumed twice, once per call to fit. That would naturally produce different behavior the second time.
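That expectation can be checked with plain numpy, independent of evalml (a toy sketch):

```python
import numpy as np

# One shared RNG across two calls (the fit-twice-on-one-object case):
# the second draw continues the stream, so it generally differs.
rs = np.random.RandomState(5)
a = rs.permutation(10).tolist()
b = rs.permutation(10).tolist()

# Two fresh RNGs seeded identically (the two-objects case): identical draws.
c = np.random.RandomState(5).permutation(10).tolist()
d = np.random.RandomState(5).permutation(10).tolist()
assert c == d
assert a == c  # both are the first draw from seed 5
```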

However, I could see the other side as well, option 2. It's much simpler behavior to say that whenever you set the random state, be it a seed or an np.random.RandomState, no matter how many times you call fit/transform/predict, you always get the same output. It appears this is sklearn's opinion as well, because the following passes:

import numpy as np
from sklearn.linear_model import LogisticRegression

# X and y were not defined in the original snippet; toy data added here.
X = np.random.RandomState(0).rand(20, 3)
y = np.array([0, 1] * 10)

lr1 = LogisticRegression(random_state=np.random.RandomState(5))
state1 = lr1.random_state.get_state()
lr1.fit(X, y)
state2 = lr1.random_state.get_state()
np.testing.assert_array_equal(state1[1], state2[1])

Worth noting option 2 is harder to build, and if we decide that's the invariant we want to enforce, we should think about how to enforce that behavior across all our components, not just the OHE.
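If we did want option 2, one way it could be enforced across components is a decorator that snapshots and restores the component's RNG state around fit. This is only a sketch with hypothetical names (`fit_preserves_state`, `ToyComponent`), not the fix in this PR, and note it calls set_state on the component's RandomState, which has the shared-reference caveat raised above.

```python
import numpy as np

def fit_preserves_state(fit_method):
    # Hypothetical helper: snapshot the RNG state before fit and restore
    # it afterward, so calling fit repeatedly gives identical results.
    def wrapper(self, *args, **kwargs):
        saved = self.random_state.get_state()
        try:
            return fit_method(self, *args, **kwargs)
        finally:
            # Caveat: mutates a RandomState that may be shared externally.
            self.random_state.set_state(saved)
    return wrapper

class ToyComponent:
    def __init__(self, seed):
        self.random_state = np.random.RandomState(seed)

    @fit_preserves_state
    def fit(self):
        # Consume randomness, as a real fit might.
        self.order_ = self.random_state.permutation(5).tolist()
        return self

comp = ToyComponent(5)
first = comp.fit().order_
second = comp.fit().order_
assert first == second  # option 2: repeated fits are identical
```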

pd.testing.assert_frame_equal(df1, df2)

check_df_equality(5)
check_df_equality(get_random_state(5))
dsherry (Collaborator):

Test looks good! Have you verified this was failing before the fix was included?

freddyaboulton (Contributor, Author):

Thanks! Yes, this test fails on main!

freddyaboulton (Contributor, Author):

@dsherry @bchen1116 This is good for a second review. Once this is merged, I'll file an issue for tracking the "global" fix we discussed this afternoon.

@@ -111,7 +112,9 @@ def fit(self, X, y=None):
         if top_n is None or len(value_counts) <= top_n:
             unique_values = value_counts.index.tolist()
         else:
-            value_counts = value_counts.sample(frac=1, random_state=self.random_state)
+            new_random_state = np.random.RandomState()
+            new_random_state.set_state(self._initial_state)
+            value_counts = value_counts.sample(frac=1, random_state=new_random_state)
freddyaboulton (Contributor, Author):

Decided to define new_random_state here rather than at the top of the method because this is the only place it's needed. Happy to change it though!

dsherry (Collaborator):

This looks fine to me!

dsherry (Collaborator) left a review:

👍 🚢


@freddyaboulton freddyaboulton merged commit a2e534f into main Oct 22, 2020
2 checks passed
@freddyaboulton freddyaboulton deleted the 1279-fix-ohe-not-deterministic branch October 22, 2020 18:27
@dsherry dsherry mentioned this pull request Oct 29, 2020
Closes #1279: OHE not deterministic when fit multiple times with top_n < number categories