
Make OHE deterministic when top_n < no. of categories #1324

Merged
freddyaboulton merged 4 commits into main from 1279-fix-ohe-not-deterministic on Oct 22, 2020

Conversation

freddyaboulton (Contributor):

Pull Request Description

Fixes #1279



@freddyaboulton freddyaboulton changed the title Make OHE deterministic Make OHE deterministic when top_n > # categories Oct 20, 2020
@freddyaboulton freddyaboulton changed the title Make OHE deterministic when top_n > # categories Make OHE deterministic when top_n < # categories Oct 20, 2020
@freddyaboulton freddyaboulton changed the title Make OHE deterministic when top_n < # categories Make OHE deterministic when top_n < no. of categories Oct 20, 2020
codecov bot commented Oct 20, 2020

Codecov Report

Merging #1324 into main will not change coverage (+0.00%).
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##             main    #1324   +/-   ##
=======================================
  Coverage   99.94%   99.94%           
=======================================
  Files         213      213           
  Lines       13425    13436   +11     
=======================================
+ Hits        13418    13429   +11     
  Misses          7        7           
Impacted Files                                          Coverage Δ
...components/transformers/encoders/onehot_encoder.py   100.00% <100.00%> (ø)
...alml/tests/component_tests/test_one_hot_encoder.py   100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b09f041...9ae0bc5.

@@ -111,6 +112,7 @@ def fit(self, X, y=None):
         if top_n is None or len(value_counts) <= top_n:
             unique_values = value_counts.index.tolist()
         else:
+            self.random_state.set_state(self._initial_state)
dsherry (Collaborator):

@freddyaboulton could you please explain why this change is necessary?

Is the culprit the call to value_counts.sample below? Is this change masking a pandas bug? If so, could we do

# patch non-determinism bug in pandas sample
rs_pandas_patch = np.random.RandomState()
rs_pandas_patch.set_state(self._initial_state)
value_counts = value_counts.sample(frac=1, random_state=rs_pandas_patch)

freddyaboulton (Contributor, Author):

@dsherry Yes, the culprit is the call to sample! The problem is that the internal state of self.random_state gets modified after each call to sample. This is not a pandas bug but expected behavior of np.random.RandomState. I believe my fix and your code snippet are equivalent (explicitly setting the state of np.random.RandomState to what it was when the OHE was initialized).
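The state-advancing behavior is easy to see with a toy Series; this sketch uses only pandas and numpy, and the data is hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the encoder's value_counts (hypothetical data).
value_counts = pd.Series([5, 5, 5, 5], index=["a", "b", "c", "d"])

rs = np.random.RandomState(5)
initial_state = rs.get_state()

first = value_counts.sample(frac=1, random_state=rs).index.tolist()
# `rs` has now advanced: a second call continues the stream, so the
# shuffle order is generally different from the first.
second = value_counts.sample(frac=1, random_state=rs).index.tolist()

# Resetting the state replays the exact same shuffle.
rs.set_state(initial_state)
replayed = value_counts.sample(frac=1, random_state=rs).index.tolist()
assert replayed == first
```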

dsherry (Collaborator):

Ah, got it. Yep, thanks for that. I think what threw me off was that this call to set_state was pretty far into the nested if/else tree. As a reader, that had me thinking this code must only run in this specific block, when really it could happen at a higher level.

(Aside: because Python passes objects by reference, we should avoid calling self.random_state.set_state. That would mutate the state of a RandomState object that may be shared outside this component.)

@freddyaboulton I left a long writeup of my thoughts below. Can we jump on a call tomorrow and discuss what to do?


So when I read the reproducer on #1279 a couple weeks back, I wasn't reading carefully, and I thought that the bug was that this would fail:

df1 = OneHotEncoder(top_n=4, random_state=5).fit_transform(df)
df2 = OneHotEncoder(top_n=4, random_state=5).fit_transform(df)
pd.testing.assert_frame_equal(df1, df2)

However returning to that reproducer, I see that it was in fact

ohe = OneHotEncoder(top_n=4, random_state=5)
df1 = ohe.fit_transform(df)
df2 = ohe.fit_transform(df)
pd.testing.assert_frame_equal(df1, df2)

The way I see it, there's two options here, for fit, and for other component methods like transform/predict:

  1. Calling fit multiple times in a row will produce different behavior (because one RNG is held by the object)
  2. Calling fit multiple times in a row will not produce different behavior.

I believe this is a design decision (which we haven't made yet!) rather than a bug. I think option 2 is the more user-centric choice, and option 1 the more developer-centric one, i.e. easier for us to implement, haha.

My gut is telling me to argue that the first snippet above should pass (as it currently does, I believe?) and the second snippet should always fail, i.e. option 1. My rationale: as the caller, if you create a single object, which initializes the RNG, and then call fit on that object twice in a row, you should expect the RNG to be consumed twice, once per call to fit. That would naturally produce different behavior the second time.
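That expectation can be checked with plain numpy, independent of evalml (a toy sketch):

```python
import numpy as np

# One shared RNG across two calls (the fit-twice-on-one-object case):
# the second draw continues the stream, so it generally differs.
rs = np.random.RandomState(5)
a = rs.permutation(10).tolist()
b = rs.permutation(10).tolist()

# Two fresh RNGs seeded identically (the two-objects case): identical draws.
c = np.random.RandomState(5).permutation(10).tolist()
d = np.random.RandomState(5).permutation(10).tolist()
assert c == d
assert a == c  # both are the first draw from seed 5
```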

However, I could see the other side as well, option 2. It's much simpler behavior to say that whenever you set the random state, be it a seed or an np.random.RandomState, no matter how many times you call fit/transform/predict, you always get the same output. It appears this is sklearn's opinion as well, because the following passes:

import numpy as np
from sklearn.linear_model import LogisticRegression

# X and y were not defined in the original snippet; toy data added here.
X = np.random.RandomState(0).rand(20, 3)
y = np.array([0, 1] * 10)

lr1 = LogisticRegression(random_state=np.random.RandomState(5))
state1 = lr1.random_state.get_state()
lr1.fit(X, y)
state2 = lr1.random_state.get_state()
np.testing.assert_array_equal(state1[1], state2[1])

Worth noting option 2 is harder to build, and if we decide that's the invariant we want to enforce, we should think about how to enforce that behavior across all our components, not just the OHE.
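If we did want option 2, one way it could be enforced across components is a decorator that snapshots and restores the component's RNG state around fit. This is only a sketch with hypothetical names (`fit_preserves_state`, `ToyComponent`), not the fix in this PR, and note it calls set_state on the component's RandomState, which has the shared-reference caveat raised above.

```python
import numpy as np

def fit_preserves_state(fit_method):
    # Hypothetical helper: snapshot the RNG state before fit and restore
    # it afterward, so calling fit repeatedly gives identical results.
    def wrapper(self, *args, **kwargs):
        saved = self.random_state.get_state()
        try:
            return fit_method(self, *args, **kwargs)
        finally:
            # Caveat: mutates a RandomState that may be shared externally.
            self.random_state.set_state(saved)
    return wrapper

class ToyComponent:
    def __init__(self, seed):
        self.random_state = np.random.RandomState(seed)

    @fit_preserves_state
    def fit(self):
        # Consume randomness, as a real fit might.
        self.order_ = self.random_state.permutation(5).tolist()
        return self

comp = ToyComponent(5)
first = comp.fit().order_
second = comp.fit().order_
assert first == second  # option 2: repeated fits are identical
```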

pd.testing.assert_frame_equal(df1, df2)

check_df_equality(5)
check_df_equality(get_random_state(5))
dsherry (Collaborator):

Test looks good! Have you verified this was failing before the fix was included?

freddyaboulton (Contributor, Author):

Thanks! Yes, this test fails on main!

freddyaboulton (Contributor, Author):

@dsherry @bchen1116 This is good for a second review. Once this is merged, I'll file an issue for tracking the "global" fix we discussed this afternoon.

@@ -111,7 +112,9 @@ def fit(self, X, y=None):
         if top_n is None or len(value_counts) <= top_n:
             unique_values = value_counts.index.tolist()
         else:
-            value_counts = value_counts.sample(frac=1, random_state=self.random_state)
+            new_random_state = np.random.RandomState()
+            new_random_state.set_state(self._initial_state)
+            value_counts = value_counts.sample(frac=1, random_state=new_random_state)
freddyaboulton (Contributor, Author):

Decided to define new_random_state here rather than at the top of the method because this is the only place it's needed. Happy to change it though!

dsherry (Collaborator):

This looks fine to me!

dsherry (Collaborator) left a review:

👍 🚢


@freddyaboulton freddyaboulton merged commit a2e534f into main Oct 22, 2020
2 checks passed
@freddyaboulton freddyaboulton deleted the 1279-fix-ohe-not-deterministic branch October 22, 2020 18:27
@dsherry dsherry mentioned this pull request Oct 29, 2020
Closes #1279: OHE not deterministic when fit multiple times with top_n < number categories