-
Notifications
You must be signed in to change notification settings - Fork 83
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make OHE deterministic when top_n < no. of categories #1324
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1324 +/- ##
=======================================
Coverage 99.94% 99.94%
=======================================
Files 213 213
Lines 13425 13436 +11
=======================================
+ Hits 13418 13429 +11
Misses 7 7
Continue to review full report at Codecov.
|
@@ -111,6 +112,7 @@ def fit(self, X, y=None): | |||
if top_n is None or len(value_counts) <= top_n: | |||
unique_values = value_counts.index.tolist() | |||
else: | |||
self.random_state.set_state(self._initial_state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@freddyaboulton could you please explain why this change is necessary?
Is the culprit the call to value_counts.sample
below? Is this change masking a pandas bug? If so, could we do
# patch non-determinism bug in pandas sample
rs_pandas_patch = np.random_state.RandomState()
rs_pandas_patch.set_state(self._initial_state)
value_counts = value_counts.sample(frac=1, random_state=rs_pandas_patch)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsherry Yes, the culprit is the call to sample! The problem is that the internal state of self.random_state
gets modified after each call to sample. This is not a pandas bug but expected behavior of np.random.RandomState
. I believe my fix and your code snippet are equivalent (explicitly setting the state of np.random.RandomState
to what it was when the OHE was initialized).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! Got it. Yep, thanks for that. I think what threw me off was that this call to set state was pretty far into the nested if
/else
tree. As a reader that had me thinking this code must only happen in this specific block, but really it could happen at a higher level.
(Aside: because python passes by reference, we should avoid calling self.random_state.set_state
. That'll update the state for the random state obj which is shared outside of this object.)
@freddyaboulton I left a long writeup of my thoughts below. Can we jump on a call tomorrow and discuss what to do?
So when I read the reproducer on #1279 a couple weeks back, I wasn't reading carefully, and I thought that the bug was that this would fail:
df1 = OneHotEncoder(top_n=4, random_state=5).fit_transform(df)
df2 = OneHotEncoder(top_n=4, random_state=5).fit_transform(df)
pd.testing.assert_frame_equal(df1, df2)
However returning to that reproducer, I see that it was in fact
ohe = OneHotEncoder(top_n=4, random_state=5)
df1 = ohe.fit_transform(df)
df2 = ohe.fit_transform(df)
pd.testing.assert_frame_equal(df1, df2)
The way I see it, there's two options here, for fit
, and for other component methods like transform
/predict
:
- Calling
fit
multiple times in a row will produce different behavior (because one RNG is held by the object) - Calling
fit
multiple times in a row will not produce different behavior.
I believe this is a design decision (which we haven't made yet!) rather than a bug. I think option 2 is a more non-dev-user-centric opinion, and option 1 is more developer-centric, i.e. easier for us to write, haha.
My gut is telling me to argue that the first snippet above should pass (as it currently does I believe?), and the second snippet should always fail, i.e. option 1. My rationale for that argument is that, as the caller, if you create a single object, which initializes the RNG, and then call fit
on that object twice in a row, you should expect the RNG will be called two sets of times, once per call to fit
. And that necessarily would produce different behavior the second time.
However, I could see the other side as well, option 2. Its much simpler behavior to say that whenever you set the random state, be it a seed or np.random.RandomState
, no matter how many times you call fit
/transform
/predict
, you always get the same output. It appears this is sklearn's opinion as well, because the following passes:
import sklearn
import numpy as np
import pandas as pd
lr1 = sklearn.linear_model.LogisticRegression(random_state=np.random.RandomState(5))
state1 = lr1.random_state.get_state()
lr1.fit(X, y)
state2 = lr1.random_state.get_state()
np.testing.assert_array_equal(state1[1], state2[1])
Worth noting option 2 is harder to build, and if we decide that's the invariant we want to enforce, we should think about how to enforce that behavior across all our components, not just the OHE.
pd.testing.assert_frame_equal(df1, df2) | ||
|
||
check_df_equality(5) | ||
check_df_equality(get_random_state(5)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test looks good! Have you verified this was failing before the fix was included?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Yes, this test fails on main
!
321e6c6
to
9ae0bc5
Compare
@dsherry @bchen1116 This is good for a second review. Once this is merged, I'll file an issue for tracking the "global" fix we discussed this afternoon. |
@@ -111,7 +112,9 @@ def fit(self, X, y=None): | |||
if top_n is None or len(value_counts) <= top_n: | |||
unique_values = value_counts.index.tolist() | |||
else: | |||
value_counts = value_counts.sample(frac=1, random_state=self.random_state) | |||
new_random_state = np.random.RandomState() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to define new_random_state
here rather than at the top of the method because this is the only place it's needed. Happy to change it though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 🚢
@@ -111,7 +112,9 @@ def fit(self, X, y=None): | |||
if top_n is None or len(value_counts) <= top_n: | |||
unique_values = value_counts.index.tolist() | |||
else: | |||
value_counts = value_counts.sample(frac=1, random_state=self.random_state) | |||
new_random_state = np.random.RandomState() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me!
Pull Request Description
Fixes #1279
After creating the pull request: in order to pass the release_notes_updated check you will need to update the "Future Release" section of
docs/source/release_notes.rst
to include this pull request by adding :pr:123
.