Deflate sizeof() of duplicate references to pandas object types#9776

Merged
crusaderky merged 1 commit into dask:main from crusaderky:pandas_obj_sizeof on Dec 21, 2022
Conversation

Collaborator (@crusaderky) commented Dec 19, 2022:

Fix bugs where sizeof() would return a severely inflated result:

1. when a Series is duplicated, e.g. `df[["x", "x", "x"]]`
2. when a pandas object-dtype Series contains multiple references to the same Python objects (this is, for example, the case for the output of dd.read_parquet)
3. when the same Python object is referenced from multiple Series, e.g.

```python
x = "x" * 10_000
DataFrame([[x, "y"], ["y", x]])
```
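The inflation comes from measuring a shared object once per reference instead of once per object. A minimal sketch of the difference, with illustrative helper names (not dask's actual code):

```python
import sys

def naive_nbytes(objs):
    # Counts every reference: a shared object is measured repeatedly
    return sum(sys.getsizeof(o) for o in objs)

def deduped_nbytes(objs):
    # Keyed by id(), each distinct object is measured exactly once
    unique = {id(o): o for o in objs}
    return sum(sys.getsizeof(o) for o in unique.values())

x = "x" * 10_000
objs = [x, "y", "y", x]  # x (and the interned "y") referenced twice
assert naive_nbytes(objs) > 2 * sys.getsizeof(x)
assert deduped_nbytes(objs) < naive_nbytes(objs)
```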

Demo

```python
import coiled
import distributed
import dask.dataframe as dd

cluster = coiled.Cluster(n_workers=5, worker_vm_types=["t3.xlarge"])
client = distributed.Client(cluster)
df = dd.read_parquet(
    "s3://coiled-datasets/dask-book/nyc-tlc/2009",
    storage_options={"anon": True},
).persist()
```

Before:

(screenshot: worker memory usage before the fix)

After:

Note 1: memory usage after releasing the keys is ~200 MiB per worker.
Note 2: the overall memory usage is lower before because this data is highly compressible, so it takes much less space on disk than in memory.

(screenshot: worker memory usage after the fix)

```diff
 @sizeof.register(pd.DataFrame)
 def sizeof_pandas_dataframe(df):
-    p = sizeof(df.index)
+    p = sizeof(df.index) + sizeof(df.columns)
```
Collaborator Author (@crusaderky):

The columns were completely ignored before this change.
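As a quick illustration of why this matters (sizes below are approximate and not from the PR): column labels are Python objects too, so ignoring `df.columns` undercounts DataFrames whose labels are long strings.

```python
import pandas as pd

# 100 distinct labels of ~500 characters each
labels = [f"column_{i:03d}" * 50 for i in range(100)]
df = pd.DataFrame(columns=labels)

# deep=True measures the label strings themselves, not just pointers
assert df.columns.memory_usage(deep=True) > 100 * 500
```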

dask/sizeof.py (outdated)
```diff
 @sizeof.register(pd.Series)
 def sizeof_pandas_series(s):
-    p = int(s.memory_usage(index=True))
+    p = sizeof(s.index) + s.memory_usage(index=False, deep=False)
```
Collaborator Author (@crusaderky):

The old code did not measure MultiIndex correctly; the new code does not double-count the overhead of the Series plus its index.

```diff
-    p = int(sum(object_size(l) for l in i.levels))
-    for c in i.codes if hasattr(i, "codes") else i.labels:
+    p = object_size(*i.levels)
+    for c in i.codes:
```
Collaborator Author (@crusaderky):

`codes` was introduced in pandas 0.24, so the `hasattr` fallback to `labels` is no longer needed.
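For reference, a MultiIndex stores its data as per-level unique values (`.levels`) plus one integer position array per level (`.codes`, formerly `.labels`), and both parts need to be measured:

```python
import pandas as pd

mi = pd.MultiIndex.from_product([["a", "b"], [1, 2, 3]])
assert len(mi) == 6
assert len(mi.levels) == 2  # unique values, one Index per level
assert all(len(c) == len(mi) for c in mi.codes)  # positions per level
```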

@crusaderky crusaderky self-assigned this Dec 19, 2022
Collaborator Author (@crusaderky) commented Dec 20, 2022:

Empty Index/Series/DataFrame sizes measured on pandas 1.4.2/linux64:

```python
import gc
import pandas
import psutil

pandas.DataFrame([])  # Lazy init?
N = 100_000
p = psutil.Process()


def bench(label, f, offset):
    m1 = p.memory_info().rss
    a = [f() for _ in range(N)]
    m2 = p.memory_info().rss
    nbytes = (m2 - m1) / N - offset
    print(label, nbytes)
    del a
    gc.collect()
    return nbytes


idx = pandas.Index([1.1, 2.2])
col = pandas.Index([1.1, 2.2, 3.3])
nones = bench("None", lambda: None, 0)
bench("Index", lambda: pandas.Index([1.1, 2.2]), nones + 16)
bench("Series", lambda: pandas.Series([1.1, 2.2], index=idx), nones + 16)
bench(
    "DataFrame (homogeneous)",
    lambda: pandas.DataFrame([[1.1, 2.2, 3.3], [4.4, 5.5, 6.6]], index=idx, columns=col),
    nones + 48,
)
bench(
    "DataFrame (heterogeneous)",
    lambda: pandas.DataFrame([[1.1, 2.2, 3], [4.4, 5.5, 6]], index=idx, columns=col),
    nones + 48,
)
```

```diff
 def sizeof_pandas_series(s):
-    p = int(s.memory_usage(index=True))
+    # https://github.com/dask/dask/pull/9776#issuecomment-1359085962
+    p = 1200 + sizeof(s.index) + s.memory_usage(index=False, deep=False)
```
Collaborator Author (@crusaderky):

Did not measure MultiIndex correctly

```python
for x in xs:
    sample = np.random.choice(x, size=100, replace=True)
    for i in sample.tolist():
        unique_samples[id(i)] = i
```
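Roughly, the sampling approach can be sketched as a self-contained function (illustrative names and thresholds, not the exact dask implementation):

```python
import sys
import numpy as np

def estimate_object_nbytes(arr, n_samples=100):
    """Estimate the size of the Python objects held by an object array
    by sampling elements and deduplicating repeated references by id()."""
    sample = np.random.choice(arr, size=n_samples, replace=True)
    unique_samples = {}
    for obj in sample.tolist():
        unique_samples[id(obj)] = obj
    sample_nbytes = sum(sys.getsizeof(o) for o in unique_samples.values())
    if len(unique_samples) / n_samples > 0.5:
        # Mostly distinct objects: extrapolate to the whole array
        return int(sample_nbytes * len(arr) / n_samples)
    # Mostly duplicates: assume the sample already saw every unique object
    return sample_nbytes

# 1000 references to two interned strings should measure as tiny
arr = np.array(["Y", "N"] * 500, dtype=object)
assert estimate_object_nbytes(arr) < 10_000
```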
Member:

I'm guessing that this is fast, but can I ask you to verify briefly that this doesn't significantly increase the time cost of sizeof for Series or DataFrame objects? This has been a surprising bottleneck in the past.

Collaborator Author (@crusaderky):

On one partition of the NYC taxi dataset:

before: 195 µs ± 2.48 µs
after: 343 µs ± 4.09 µs

Note that I increased the number of samples from 20 to 100 for better accuracy.

Member:

I'm not thrilled to see any increase, but this is probably low enough not to show up frequently on profiles. For context, this often comes up in task-based shuffle, where a lot of what we do is create lots of small dataframes with getitem calls.

Anyway, thank you for doing the microbenchmark. Happy to relax my previous concern.

@crusaderky crusaderky marked this pull request as ready for review December 20, 2022 15:29
Collaborator Author (@crusaderky):

Ready for review and merge.
The failures on 3.11 are already visible in main.

Member (@mrocklin) left a review:

What's here seems well thought-out to me. Thank you for the work @crusaderky

cc also @ntabris who was asking about this

```python
else:
    # Assume we've already found all unique objects and that all references that
    # we have not yet analyzed are going to point to the same data.
    return sample_nbytes
```
Member:

We must be taking up some memory anyway no? If only to hold onto all of the pointers. If I have a billion-row series with just "Y" and "N" I'd expect it to take at least a gigabyte. My understanding here (perhaps flawed) is that we would return less than a GB here in that case. Is my understanding correct?

Member:

Ah, we get this by calling series.memory_usage(deep=False). Is that correct?

Collaborator Author (@crusaderky):

Correct.
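That is, the 8-byte pointer per element is still counted by memory_usage(deep=False), even when the pointed-to objects are deduplicated. A quick check (pointer size assumes a 64-bit build):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array(["Y", "N"] * 500_000, dtype=object))
# deep=False measures the pointer array, not the strings behind it
shallow = int(s.memory_usage(index=False, deep=False))
assert shallow >= 1_000_000 * 8
```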

```python
# Contiguous columns of the same dtype share the same overhead
p += 1200
p += col.memory_usage(index=False, deep=False)
if col.dtype == object:
```
Member:

Should we also handle string[python] here? (totally ok to defer this to future work though, this is likely scope creep)

Collaborator Author (@crusaderky):

I'll open a follow-up PR for it.

```python
df3 = pd.DataFrame([[x, y], [z, w]])
df4 = pd.DataFrame([[x, y], [z, x]])
df5 = pd.DataFrame([[x, x], [x, x]])
assert sizeof(df5) < sizeof(df4) < sizeof(df3)
```
Member:

Nice tests. Thanks for these.

@crusaderky crusaderky merged commit 80dd84d into dask:main Dec 21, 2022
@crusaderky crusaderky deleted the pandas_obj_sizeof branch December 21, 2022 11:42
@jrbourbeau jrbourbeau mentioned this pull request May 1, 2023
3 tasks