Skip to content

Descope pickled outputs sooner#64

Merged
crusaderky merged 3 commits into
dask:mainfrom
crusaderky:descope_pickled
Mar 11, 2022
Merged

Descope pickled outputs sooner#64
crusaderky merged 3 commits into
dask:mainfrom
crusaderky:descope_pickled

Conversation

@crusaderky
Copy link
Copy Markdown
Collaborator

Closes #63

@crusaderky crusaderky self-assigned this Mar 7, 2022
@fjetter fjetter requested a review from ncclementi March 8, 2022 17:11
Comment thread zict/tests/test_func.py
def __len__(self):
return 0

d = Func(lambda v: Dumped(), lambda w: None, Dummy())
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear to me what are we testing here, what should we have in d before and after the update, what are we checking?

I might be missing something but it's not clear to me that we are testing that "doesn't store everything into a list".

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dumped.n increases by 1 every time you create an instance and decreases by 1 every time you garbage-collect it.
Without the change in common.py, n would reach 10 and the assertion on line 69 would fail.

Real life example:

d = Func(pickle.dumps, pickle.loads, File(somedir))
mydata = {str(i): numpy.random.random(10e6) for i in range(100)}
d.update(mydata)

mydata occupies 8 GB in RAM (8 * 10e6 * 100).
d.update(mydata) pickles each element and writes it to disk.
Before this PR, you will first create a list of 100 pickles, then write them to disk, and finally release them all at once. You'll observe a buildup of memory while update() is running, going up to 16GB (8GB for mydata + 100*80MB) and then dropping down all of sudden back to 8GB.
After this PR, update() keeps in memory no more than two pickled arrays at any given time, resulting in a peak memory usage of 8.16 GB (8 GB for mydata + 2*80MB).

As noted in #63, distributed is not affected by this issue.

Comment thread zict/common.py
other = args[0]
if isinstance(other, Mapping) or hasattr(other, "items"):
items += other.items()
items = other.items()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like now items are a dict_items type, but we are still initializing it as a list on line 20. Is this ok, does it matter?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The list is just a dummy empty iterable to simplify chaining on line 48. It's never filled.

Comment thread zict/common.py
other = args[0]
if isinstance(other, Mapping) or hasattr(other, "items"):
items += other.items()
items = other.items()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more of a general comment, but items is no longer a list, but in our check_items used in check_mappings in utils_test.py we compare against a list. This still works but should we modify this, to be consistent?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I understand. check_items has nothing to do with the update() method?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it hasn't but I noticed that we do things like this check_items(z, [("abc", b"456"), ("xyz", b"12")]) where check_items takes z and does list(z.item()) to compare. I was wondering if we should compare the z.items() directly to the dict_items() but I guess it's not necessary.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a test and it's not testing memory descoping.

Comment thread zict/common.py
else:
# Assuming (key, value) pairs
items += other
items = other
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is what is really different. If other is an iterator, it is not unpacked into a list; instead it is fed directly into _do_update on line 49 which in turn does not load it in memory - at least in the default implementation on line 51, which is not overridden by File.

@crusaderky crusaderky merged commit 5b74a18 into dask:main Mar 11, 2022
@crusaderky crusaderky deleted the descope_pickled branch March 11, 2022 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory flare on Func.update() with File backend

2 participants