There are multiple issues with Rule PD101:
`(s[0] == s).all()` can actually be slower than `s.nunique() <= 1`, for example when testing with time series data, since pandas converts `s[0]` to a `pd.Timedelta`, which adds overhead for shorter series.
This can be circumvented by testing `(s.values[0] == s.values).all()` instead. (Note: we should actually be using `.array` instead of `.values`!)
See the timing graphic in ENH: series.is_constant (pandas-dev/pandas#54033).
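To make the boxing overhead concrete, here is a minimal sketch (my own illustration, not the linked benchmark) showing the scalar types involved:

```python
import pandas as pd

# a short timedelta series, as in the time-series case described above
s = pd.Series(pd.timedelta_range("1s", periods=100, freq="s"))

print(type(s[0]))         # pandas.Timedelta: a boxed scalar, constructed on access
print(type(s.values[0]))  # numpy.timedelta64: raw value, no boxing overhead
```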
`(s[0] == s).all()` can yield different results than `s.nunique() <= 1`, for instance if the first entry of the series happens to be NaN / missing.
```python
import pandas as pd

s = pd.Series([None, 1.0, 1.0, 1.0])

assert s.nunique() <= 1   # passes, since dropna=True by default
assert (s[0] == s).all()  # fails, since comparison with NaN is falsy
```
Note that it can also yield wrong answers in the other direction:
`(s[0] == s).all()` will fail if the Series happens to be empty. Empty series arise naturally; consider, for example, slicing a 1h interval out of an irregularly sampled time series, as illustrated below.
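A minimal sketch of this failure mode (the data and timestamps are made up for illustration; `.iloc[0]` is used for explicit positional access):

```python
import pandas as pd

# an irregularly sampled time series
s = pd.Series(
    [1.0, 1.0, 2.0],
    index=pd.to_datetime(["2024-01-01 00:10", "2024-01-01 02:30", "2024-01-01 05:45"]),
)

# slicing a 1h interval that happens to contain no samples yields an empty series
hour = s["2024-01-01 03:00":"2024-01-01 04:00"]

print(hour.nunique() <= 1)    # True: an empty series is (vacuously) constant
(hour.iloc[0] == hour).all()  # raises IndexError: there is no first element
```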
The stated rationale is technically incorrect:

> In general, .nunique() requires iterating over the entire Series, while a more efficient approach allows short-circuiting the operation as soon as a non-equal value is found.
But `s[0] == s` performs the comparison for all elements, hence the runtime is O(N) regardless. An actual short-circuiting implementation would look something like this:
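(A sketch of such a short-circuiting check; the helper name is hypothetical, and NaN handling is omitted for brevity:)

```python
import pandas as pd

def is_constant(s: pd.Series) -> bool:
    # hypothetical helper: stops at the first non-equal element
    array = s.array
    if len(array) == 0:
        return True
    first = array[0]
    # all() over a generator short-circuits on the first False
    return all(x == first for x in array[1:])
```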
Two things should be changed:

1. The rationale should be rewritten, removing the "short-circuiting" part.
2. The suggested replacement code should be updated to better deal with missing values and extension arrays. In particular, instead of `array = data.to_numpy()` we should consider `array = data.dropna().array`.
For example:

```python
import pandas as pd

data = pd.Series([None, 1.0, 1.0, 3.4])

# replace data.nunique() <= 1 with
array = data.dropna().array
if array.shape[0] == 0 or (array[0] == array).all():
    print("Series is constant")

# replace data.nunique(dropna=False) with
array = data.dropna().array
if array.shape[0] == 0 or (data.notna().all() and (array[0] == array).all()):
    print("Series is constant")

# replace data.nunique(dropna=dropna) with
dropna = True
array = data.dropna().array
if array.shape[0] == 0 or ((dropna or data.notna().all()) and (array[0] == array).all()):
    print("Series is constant")
```
EDIT:
We should be using `.array` instead of `.values` or `.to_numpy()`, as these construct `numpy` arrays even if the series is backed by a pyarrow-encoded array. (See: https://pandas.pydata.org/docs/reference/api/pandas.Series.values.html)
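A quick way to see the difference (assuming pyarrow is installed):

```python
import pandas as pd

s = pd.Series([1, 2, 3], dtype="int64[pyarrow]")

print(type(s.array))       # ArrowExtensionArray: keeps the pyarrow backing
print(type(s.to_numpy()))  # numpy.ndarray: a conversion, potentially copying
```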