Issues with rule PD101. #11910

randolf-scholz · 2024-06-17T19:17:21Z

There are multiple issues with Rule PD101:

(s[0] == s).all() can actually be slower than s.nunique() <=1, for example when testing with time series data, as in pandas s[0] converts to pd.Timedelta type, which leads to overhead for shorter series:

import pandas as pd
import numpy as np
t = np.random.rand(100_000) * np.timedelta64(1, "s")
s3 = pd.Series(t[:10**3])
s4 = pd.Series(t[:10**4])
s5 = pd.Series(t[:10**5])
%timeit s3.nunique() <=1    # 51.4 µs ± 768 ns
%timeit (s3[0] == s).all()  # 69.1 µs ± 123 ns
%timeit s4.nunique() <=1    # 78.9 µs ± 197 ns
%timeit (s4[0] == s).all()  # 69.2 µs ± 62.6 ns
%timeit s5.nunique() <=1    # 375 µs ± 1.14 µs
%timeit (s5[0] == s).all()  # 70.5 µs ± 1.5 µs

This can be circumvented by testing (s.values[0] == s.values).all() instead. (Note: we actually should be using .array instead of .values!)
See this graphic: ENH: series.is_constant pandas-dev/pandas#54033.

(s[0] == s).all() can yield different results than s.nunique() <= 1, for instance if the first entry in the series happens to be NaN / missing.

import pandas as pd
s = pd.Series([None, 1.0, 1.0, 1.0])
assert s.nunique() <= 1   # checks, since dropna=True by default
assert (s[0] == s).all()  # fails, since comparison with NaN is falsy

Note that it can also yield wrong answers in the other direction:

import pandas as pd
s = pd.Series([None, 1, 1, 3], dtype="Int64")
assert (s[0] == s).all()  # incorrectly passes
assert s.nunique() <= 1   # correctly fails

(s[0] == s).all() will fail if the Series happens to be empty. Empty series can happen naturally, for example, consider slicing a 1h-interval of an irregularly sampled time series.
```
import pandas as pd
s = pd.Series([], dtype=float)
assert s.nunique() <= 1   # checks
assert (s[0] == s).all()  # KeyError
```

The stated rationale is technically incorrect:

In general, .nunique() requires iterating over the entire Series, while a more efficient approach allows short-circuiting the operation as soon as a non-equal value is found.

But s[0] == s performs the comparison for all elements, hence the runtime is O(N) regardless. An actual short-circuiting would look something like this:

import pandas as pd
import numpy as np

def is_constant(array):
    if len(s) <= 1:
        return True
    first = array[0]
    return all(item == first for item in array)

s = pd.Series(np.random.rand(100_000))
%timeit (s[0] == s).all()  # 75.5 µs ± 906 ns
%timeit is_constant(s.values)     # 3.12 µs ± 49.9 ns

Two things should be changed:

The rationale should be rewritten, removing the "short-circuiting" part.
The suggested replacement code should be updated to better deal with missing value and extension arrays. In particular, instead of array = data.to_numpy() we should consider array = data.dropna().array.

import pandas as pd

data = pd.Series([None, 1.0, 1.0, 3.4])

# replace data.nunique() <= 1 with
array = data.dropna().array
if array.shape[0] == 0 or (array[0] == array).all():
    print("Series is constant")

# replace data.nunique(dropna=False) with
array = data.dropna().array
if array.shape[0] == 0 or (data.notna().all() and (array[0] == array).all()):
    print("Series is constant")

# replace data.nunique(dropna=dropna) withdropna=True
array = data.dropna().array
if array.shape[0] == 0 or ((dropna or data.notna().all()) and (array[0] == array).all()):
    print("Series is constant")

EDITS

2024-06-18: For Series we should use .array instead of .values or .to_numpy(), as these construct numpy arrays, even if the series is based on a pyarrow-encoded array.
(see: https://pandas.pydata.org/docs/reference/api/pandas.Series.values.html)

The text was updated successfully, but these errors were encountered:

MichaReiser · 2024-06-18T05:51:22Z

Thanks for the great write up! This is excellent

randolf-scholz mentioned this issue Jun 17, 2024

ENH: Add a Series method which checks whether a Series is constant pandas-dev/pandas#58806

Open

3 tasks

MichaReiser added rule Implementing or modifying a lint rule great writeup A wonderful example of a quality contribution labels Jun 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Issues with rule PD101. #11910

Issues with rule PD101. #11910

randolf-scholz commented Jun 17, 2024 •

edited

Loading

MichaReiser commented Jun 18, 2024

Issues with rule PD101. #11910

Issues with rule PD101. #11910

Comments

randolf-scholz commented Jun 17, 2024 • edited Loading

EDITS

MichaReiser commented Jun 18, 2024

randolf-scholz commented Jun 17, 2024 •

edited

Loading