Open
Description
Describe the bug, including details regarding any error messages, version, and platform.
pd.options.mode.string_storage = "pyarrow" causes a large slowdown when repeatedly growing a string-typed DataFrame with loc row assignment.
The performance issue largely goes away if I switch to:
pd.options.mode.string_storage = "python"
Versions
pandas=3.0.1
pyarrow=23.0.1
python=3.12
platform=Linux
Minimal reproducer
import time

import pandas as pd
import pyarrow as pa


def bench(storage: str, rows: int = 1000, cols: int = 20) -> float:
    pd.options.mode.string_storage = storage
    source = pd.DataFrame(
        [[f"v{j % 10}" for j in range(cols)] for _ in range(rows)]
    ).astype(str)
    out = pd.DataFrame(columns=source.columns).astype(str)
    start = time.perf_counter()
    for i, row in enumerate(source.itertuples(index=False)):
        out.loc[i] = row
    return time.perf_counter() - start


print(f"pandas={pd.__version__} pyarrow={pa.__version__}")
for storage in ("python", "pyarrow"):
    elapsed = bench(storage)
    print(storage, elapsed)

Output on my machine
pandas=3.0.1 pyarrow=23.0.1 rows=1000 cols=20
storage=python array=StringArray seconds=0.420
storage=pyarrow array=ArrowStringArray seconds=3.316
slowdown(pyarrow/python)=7.89x
I also see the same pattern with smaller sizes, for example:
500x10: python=0.147s pyarrow=0.508s
500x20: python=0.200s pyarrow=0.930s
1000x10: python=0.292s pyarrow=1.759s
1000x20: python=0.411s pyarrow=3.358s
1500x20: python=0.624s pyarrow=7.174s
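To make the trend in these numbers easier to read, here is a small sketch that converts them into per-row cost and a pyarrow/python slowdown ratio. The names `timings` and `summarize` are introduced only for this illustration; the values are copied from the list above and nothing is re-measured.

```python
# Timings copied from the list above:
# (rows, cols) -> (python_seconds, pyarrow_seconds)
timings = {
    (500, 10): (0.147, 0.508),
    (500, 20): (0.200, 0.930),
    (1000, 10): (0.292, 1.759),
    (1000, 20): (0.411, 3.358),
    (1500, 20): (0.624, 7.174),
}


def summarize(timings):
    """Return (rows, cols, python ms/row, pyarrow ms/row, slowdown) per size."""
    summary = []
    for (rows, cols), (python_s, pyarrow_s) in timings.items():
        summary.append(
            (
                rows,
                cols,
                1000 * python_s / rows,   # milliseconds per assigned row
                1000 * pyarrow_s / rows,  # milliseconds per assigned row
                pyarrow_s / python_s,     # slowdown ratio
            )
        )
    return summary


for rows, cols, python_ms, pyarrow_ms, ratio in summarize(timings):
    print(
        f"{rows}x{cols}: python={python_ms:.3f} ms/row "
        f"pyarrow={pyarrow_ms:.3f} ms/row slowdown={ratio:.2f}x"
    )
```

For the 20-column runs, the per-row cost with "python" storage stays roughly flat as rows are added, while the per-row cost with "pyarrow" storage keeps rising. That pattern would be consistent with each `loc` assignment getting more expensive as the frame grows, though I have not confirmed the cause.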
Component(s)
Python