Deflate sizeof() of duplicate references to pandas object types #9776
crusaderky merged 1 commit into dask:main
Conversation
     @sizeof.register(pd.DataFrame)
     def sizeof_pandas_dataframe(df):
    -    p = sizeof(df.index)
    +    p = sizeof(df.index) + sizeof(df.columns)
Columns were completely ignored.
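A quick way to see what that omission costs, using pandas' own accounting rather than dask.sizeof (illustrative only):

```python
import pandas as pd

# A DataFrame whose column labels are large Python strings; the labels live
# in df.columns (an Index), which the old formula never measured.
df = pd.DataFrame({"x" * 1000: [1.0], "y" * 1000: [2.0]})

index_only = df.index.memory_usage(deep=True)
with_columns = index_only + df.columns.memory_usage(deep=True)

# The two 1000-character labels alone account for well over 2 kB
assert with_columns - index_only > 2000
```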
dask/sizeof.py
     @sizeof.register(pd.Series)
     def sizeof_pandas_series(s):
    -    p = int(s.memory_usage(index=True))
    +    p = sizeof(s.index) + s.memory_usage(index=False, deep=False)
The old code did not measure a MultiIndex correctly. The new code does not double-count the overhead of Series + index.
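The MultiIndex gap can be seen with pandas' own accounting; a sketch (not dask code) of the shallow-vs-deep difference for an object-dtype MultiIndex level:

```python
import pandas as pd

# Two long strings in a MultiIndex level: a shallow measurement counts
# only the 8-byte pointers, not the string payloads themselves.
mi = pd.MultiIndex.from_product([["x" * 100, "y" * 100], range(3)])
s = pd.Series(range(6), index=mi)

shallow = int(s.memory_usage(index=True))          # roughly the old formula
deep = int(s.memory_usage(index=True, deep=True))  # counts the strings too

assert deep > shallow
```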
    -    p = int(sum(object_size(l) for l in i.levels))
    -    for c in (i.codes if hasattr(i, "codes") else i.labels):
    +    p = object_size(*i.levels)
    +    for c in i.codes:
codes was introduced in pandas 0.24 (previously the attribute was called labels).
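The removed compatibility shim corresponds to this access pattern (illustrative):

```python
import pandas as pd

mi = pd.MultiIndex.from_arrays([[0, 0, 1], ["a", "b", "a"]])

# MultiIndex.labels was renamed to codes in pandas 0.24; on any pandas
# still supporting both, this picks whichever exists.
codes = mi.codes if hasattr(mi, "codes") else mi.labels

assert len(codes) == mi.nlevels  # one integer array per level
```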
Force-pushed from e81e153 to 96fc33e
Empty Index/Series/DataFrame sizes measured on pandas 1.4.2/linux64:

    import gc
    import pandas
    import psutil

    pandas.DataFrame([])  # Lazy init?

    N = 100_000
    p = psutil.Process()

    def bench(label, f, offset):
        m1 = p.memory_info().rss
        a = [f() for _ in range(N)]
        m2 = p.memory_info().rss
        nbytes = (m2 - m1) / N - offset
        print(label, nbytes)
        del a
        gc.collect()
        return nbytes

    idx = pandas.Index([1.1, 2.2])
    col = pandas.Index([1.1, 2.2, 3.3])
    nones = bench("None", lambda: None, 0)
    bench("Index", lambda: pandas.Index([1.1, 2.2]), nones + 16)
    bench("Series", lambda: pandas.Series([1.1, 2.2], index=idx), nones + 16)
    bench(
        "DataFrame (homogeneous)",
        lambda: pandas.DataFrame(
            [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]], index=idx, columns=col
        ),
        nones + 48,
    )
    bench(
        "DataFrame (heterogeneous)",
        lambda: pandas.DataFrame(
            [[1.1, 2.2, 3], [4.4, 5.5, 6]], index=idx, columns=col
        ),
        nones + 48,
    )
Force-pushed from 96fc33e to e1e5e5b
     def sizeof_pandas_series(s):
    -    p = int(s.memory_usage(index=True))
    +    # https://github.com/dask/dask/pull/9776#issuecomment-1359085962
    +    p = 1200 + sizeof(s.index) + s.memory_usage(index=False, deep=False)
The old code did not measure a MultiIndex correctly.
Force-pushed from 7dffdc7 to 5ef5b64
Force-pushed from 8bad15c to 2a21618
    for x in xs:
        sample = np.random.choice(x, size=100, replace=True)
        for i in sample.tolist():
            unique_samples[id(i)] = i
I'm guessing that this is fast, but can I ask you to verify briefly that it doesn't significantly increase the time cost of sizeof for Series or DataFrame objects? This has been a surprising bottleneck in the past.
On one partition of the NYC taxi dataset:

before: 195 µs ± 2.48 µs
after: 343 µs ± 4.09 µs

Note that I increased the number of samples from 20 to 100 for better accuracy.
I'm not thrilled to see any increase, but this is probably low enough not to show up frequently in profiles. For context, this often comes up in task-based shuffle, where much of what we do is create lots of small dataframes with getitem calls.

Anyway, thank you for doing the microbenchmark. Happy to relax my previous concern.
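In isolation, the sampling strategy discussed in this thread can be sketched roughly like this (the helper name and the 50% threshold are illustrative, not dask's actual code):

```python
import sys

import numpy as np


def sampled_object_nbytes(x, nsamples=100):
    """Estimate the deep size of an object array by sampling, counting each
    distinct Python object (keyed by id) only once."""
    sample = np.random.choice(x, size=nsamples, replace=True)
    unique_samples = {id(i): i for i in sample.tolist()}
    sample_nbytes = sum(sys.getsizeof(v) for v in unique_samples.values())
    if len(unique_samples) / nsamples > 0.5:
        # Mostly distinct objects: extrapolate linearly to the whole array
        return int(sample_nbytes * len(x) / nsamples)
    # Assume we have already found all unique objects and that remaining
    # references point to the same data
    return sample_nbytes


a = np.array(["Y", "N"] * 500, dtype=object)                  # 2 unique objects
b = np.array([str(i) * 10 for i in range(1000)], dtype=object)  # all distinct
assert sampled_object_nbytes(a) < sampled_object_nbytes(b)
```

The dedup by id() is what deflates the estimate when many rows reference the same interned or shared Python object.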
Ready for review and merge.
mrocklin left a comment:
What's here seems well thought-out to me. Thank you for the work @crusaderky
cc also @ntabris who was asking about this
    else:
        # Assume we've already found all unique objects and that all references
        # that we have not yet analyzed are going to point to the same data.
        return sample_nbytes
We must be taking up some memory anyway, no? If only to hold onto all of the pointers. If I have a billion-row Series with just "Y" and "N", I'd expect it to take at least a gigabyte. My understanding (perhaps flawed) is that we would return less than a GB here in that case. Is my understanding correct?
Ah, we get this by calling series.memory_usage(deep=False). Is that correct?
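The per-pointer accounting can be checked directly: with deep=False, pandas counts one machine-word pointer per row of an object column, duplicates or not (illustrative):

```python
import pandas as pd

s = pd.Series(["Y", "N"] * 500)  # 1000 rows, only 2 unique strings

# Shallow accounting: one PyObject* pointer per row (8 bytes on 64-bit
# builds), regardless of how many rows share the same object.
shallow = int(s.memory_usage(index=False, deep=False))
assert shallow >= 8 * len(s)
```

So a billion-row "Y"/"N" Series would still be billed at least 8 GB of pointers before any deep sampling is added on top.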
    # Contiguous columns of the same dtype share the same overhead
    p += 1200
    p += col.memory_usage(index=False, deep=False)
    if col.dtype == object:
Should we also handle string[python] here? (totally ok to defer this to future work though, this is likely scope creep)
I'll open a follow-up PR for it.
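For reference, string[python] data does not satisfy the dtype == object check in the hunk above, which is why it would need separate handling (illustrative):

```python
import pandas as pd

obj = pd.Series(["a", "bb"])                   # plain object dtype
strp = pd.Series(["a", "bb"], dtype="string")  # string[python] extension dtype

assert obj.dtype == object
assert strp.dtype != object  # StringDtype, so the object branch is skipped
```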
    df3 = pd.DataFrame([[x, y], [z, w]])
    df4 = pd.DataFrame([[x, y], [z, x]])
    df5 = pd.DataFrame([[x, x], [x, x]])
    assert sizeof(df5) < sizeof(df4) < sizeof(df3)
Fix bugs where sizeof() would return a severely inflated result:

- when a pandas object-dtype Series contains multiple references to the same Python objects (this is, for example, the case for the output of dd.read_parquet);
- when the same Python object is referenced from multiple Series (e.g. as in the demo below).
Demo
Before:
After:
Note 1: memory usage after releasing the keys is ~200 MiB per worker.
Note 2: overall memory usage is lower in the "Before" run because this data is highly compressible, so it takes much less space on disk than in memory.