Deflate sizeof() of duplicate references to pandas object types #9776
crusaderky merged 1 commit into dask:main
Conversation
     @sizeof.register(pd.DataFrame)
     def sizeof_pandas_dataframe(df):
    -    p = sizeof(df.index)
    +    p = sizeof(df.index) + sizeof(df.columns)
Columns were completely ignored.
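A quick way to see what that omission costs, using pandas' own accounting rather than dask.sizeof (illustrative only):

```python
import pandas as pd

# A DataFrame whose column labels are large Python strings; the labels live
# in df.columns (an Index), which the old formula never measured.
df = pd.DataFrame({"x" * 1000: [1.0], "y" * 1000: [2.0]})

index_only = df.index.memory_usage(deep=True)
with_columns = index_only + df.columns.memory_usage(deep=True)

# The two 1000-character labels alone account for well over 2 kB
assert with_columns - index_only > 2000
```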
dask/sizeof.py
     @sizeof.register(pd.Series)
     def sizeof_pandas_series(s):
    -    p = int(s.memory_usage(index=True))
    +    p = sizeof(s.index) + s.memory_usage(index=False, deep=False)
The old code did not measure a MultiIndex correctly. The new code does not double-count the overhead of Series + index.
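The MultiIndex gap can be seen with pandas' own accounting; a sketch (not dask code) of the shallow-vs-deep difference for an object-dtype MultiIndex level:

```python
import pandas as pd

# Two long strings in a MultiIndex level: a shallow measurement counts
# only the 8-byte pointers, not the string payloads themselves.
mi = pd.MultiIndex.from_product([["x" * 100, "y" * 100], range(3)])
s = pd.Series(range(6), index=mi)

shallow = int(s.memory_usage(index=True))          # roughly the old formula
deep = int(s.memory_usage(index=True, deep=True))  # counts the strings too

assert deep > shallow
```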
    -    p = int(sum(object_size(l) for l in i.levels))
    -    for c in (i.codes if hasattr(i, "codes") else i.labels):
    +    p = object_size(*i.levels)
    +    for c in i.codes:
codes was introduced in pandas 0.24 (previously the attribute was called labels).
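The removed compatibility shim corresponds to this access pattern (illustrative):

```python
import pandas as pd

mi = pd.MultiIndex.from_arrays([[0, 0, 1], ["a", "b", "a"]])

# MultiIndex.labels was renamed to codes in pandas 0.24; on any pandas
# still supporting both, this picks whichever exists.
codes = mi.codes if hasattr(mi, "codes") else mi.labels

assert len(codes) == mi.nlevels  # one integer array per level
```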
Force-pushed from e81e153 to 96fc33e
Empty Index/Series/DataFrame sizes measured on pandas 1.4.2/linux64:

    import gc
    import pandas
    import psutil

    pandas.DataFrame([])  # Lazy init?

    N = 100_000
    p = psutil.Process()

    def bench(label, f, offset):
        m1 = p.memory_info().rss
        a = [f() for _ in range(N)]
        m2 = p.memory_info().rss
        nbytes = (m2 - m1) / N - offset
        print(label, nbytes)
        del a
        gc.collect()
        return nbytes

    idx = pandas.Index([1.1, 2.2])
    col = pandas.Index([1.1, 2.2, 3.3])
    nones = bench("None", lambda: None, 0)
    bench("Index", lambda: pandas.Index([1.1, 2.2]), nones + 16)
    bench("Series", lambda: pandas.Series([1.1, 2.2], index=idx), nones + 16)
    bench(
        "DataFrame (homogeneous)",
        lambda: pandas.DataFrame(
            [[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]], index=idx, columns=col
        ),
        nones + 48,
    )
    bench(
        "DataFrame (heterogeneous)",
        lambda: pandas.DataFrame(
            [[1.1, 2.2, 3], [4.4, 5.5, 6]], index=idx, columns=col
        ),
        nones + 48,
    )
Force-pushed from 96fc33e to e1e5e5b
     def sizeof_pandas_series(s):
    -    p = int(s.memory_usage(index=True))
    +    # https://github.com/dask/dask/pull/9776#issuecomment-1359085962
    +    p = 1200 + sizeof(s.index) + s.memory_usage(index=False, deep=False)
The old code did not measure a MultiIndex correctly.
Force-pushed from 7dffdc7 to 5ef5b64
Force-pushed from 8bad15c to 2a21618
    for x in xs:
        sample = np.random.choice(x, size=100, replace=True)
        for i in sample.tolist():
            unique_samples[id(i)] = i
I'm guessing that this is fast, but can I ask you to verify briefly that it doesn't significantly increase the time cost of sizeof for Series or DataFrame objects? This has been a surprising bottleneck in the past.
On one partition of the NYC taxi dataset:

before: 195 µs ± 2.48 µs
after: 343 µs ± 4.09 µs

Note that I increased the number of samples from 20 to 100 for better accuracy.
I'm not thrilled to see any increase, but this is probably low enough not to show up frequently in profiles. For context, this often comes up in task-based shuffle, where much of what we do is create lots of small dataframes with getitem calls.

Anyway, thank you for doing the microbenchmark. Happy to relax my previous concern.
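In isolation, the sampling strategy discussed in this thread can be sketched roughly like this (the helper name and the 50% threshold are illustrative, not dask's actual code):

```python
import sys

import numpy as np


def sampled_object_nbytes(x, nsamples=100):
    """Estimate the deep size of an object array by sampling, counting each
    distinct Python object (keyed by id) only once."""
    sample = np.random.choice(x, size=nsamples, replace=True)
    unique_samples = {id(i): i for i in sample.tolist()}
    sample_nbytes = sum(sys.getsizeof(v) for v in unique_samples.values())
    if len(unique_samples) / nsamples > 0.5:
        # Mostly distinct objects: extrapolate linearly to the whole array
        return int(sample_nbytes * len(x) / nsamples)
    # Assume we have already found all unique objects and that remaining
    # references point to the same data
    return sample_nbytes


a = np.array(["Y", "N"] * 500, dtype=object)                  # 2 unique objects
b = np.array([str(i) * 10 for i in range(1000)], dtype=object)  # all distinct
assert sampled_object_nbytes(a) < sampled_object_nbytes(b)
```

The dedup by id() is what deflates the estimate when many rows reference the same interned or shared Python object.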
Ready for review and merge.
mrocklin left a comment:
What's here seems well thought-out to me. Thank you for the work @crusaderky
cc also @ntabris who was asking about this
    else:
        # Assume we've already found all unique objects and that all references
        # that we have not yet analyzed are going to point to the same data.
        return sample_nbytes
We must be taking up some memory anyway, no? If only to hold onto all of the pointers. If I have a billion-row Series with just "Y" and "N", I'd expect it to take at least a gigabyte. My understanding (perhaps flawed) is that we would return less than a GB here in that case. Is my understanding correct?
Ah, we get this by calling series.memory_usage(deep=False). Is that correct?
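The per-pointer accounting can be checked directly: with deep=False, pandas counts one machine-word pointer per row of an object column, duplicates or not (illustrative):

```python
import pandas as pd

s = pd.Series(["Y", "N"] * 500)  # 1000 rows, only 2 unique strings

# Shallow accounting: one PyObject* pointer per row (8 bytes on 64-bit
# builds), regardless of how many rows share the same object.
shallow = int(s.memory_usage(index=False, deep=False))
assert shallow >= 8 * len(s)
```

So a billion-row "Y"/"N" Series would still be billed at least 8 GB of pointers before any deep sampling is added on top.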
    # Contiguous columns of the same dtype share the same overhead
    p += 1200
    p += col.memory_usage(index=False, deep=False)
    if col.dtype == object:
Should we also handle string[python] here? (totally ok to defer this to future work though, this is likely scope creep)
I'll open a follow-up PR for it.
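For reference, string[python] data does not satisfy the dtype == object check in the hunk above, which is why it would need separate handling (illustrative):

```python
import pandas as pd

obj = pd.Series(["a", "bb"])                   # plain object dtype
strp = pd.Series(["a", "bb"], dtype="string")  # string[python] extension dtype

assert obj.dtype == object
assert strp.dtype != object  # StringDtype, so the object branch is skipped
```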
    df3 = pd.DataFrame([[x, y], [z, w]])
    df4 = pd.DataFrame([[x, y], [z, x]])
    df5 = pd.DataFrame([[x, x], [x, x]])
    assert sizeof(df5) < sizeof(df4) < sizeof(df3)
Fix bugs where sizeof() would return a severely inflated result:

- when a pandas object-dtype Series contains multiple references to the same Python objects (this is, for example, the case for the output of dd.read_parquet);
- when the same Python object is referenced from multiple Series (e.g. as in the demo below).
Demo
Before:
After:
Note 1: memory usage after releasing the keys is ~200 MiB per worker.
Note 2: overall memory usage is lower in the "Before" run because this data is highly compressible, so it takes much less space on disk than in memory.