Descope pickled outputs sooner #64
    def __len__(self):
        return 0


d = Func(lambda v: Dumped(), lambda w: None, Dummy())
There was a problem hiding this comment.
It's not clear to me what we are testing here. What should d contain before and after the update, and what are we checking?
I might be missing something, but I don't see how this tests that update() "doesn't store everything into a list".
There was a problem hiding this comment.
Dumped.n increases by 1 every time you create an instance and decreases by 1 every time you garbage-collect it.
Without the change in common.py, n would reach 10 and the assertion on line 69 would fail.
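The counting technique can be sketched as follows. This is a minimal illustration, not the actual test: `Tracked` and `peak_live` are hypothetical names, and the immediate decrement in `__del__` assumes CPython's reference counting.

```python
class Tracked:
    """Hypothetical stand-in for Dumped: counts live instances."""

    n = 0

    def __init__(self):
        Tracked.n += 1

    def __del__(self):
        # On CPython this runs as soon as the last reference goes away.
        Tracked.n -= 1


def peak_live(pairs):
    """Consume (key, value) pairs without keeping them and return the
    largest live-instance count observed inside the loop body."""
    peak = 0
    for _key, _value in pairs:
        peak = max(peak, Tracked.n)
    return peak


# Eager: all ten instances exist before the first one is consumed.
eager_peak = peak_live([(i, Tracked()) for i in range(10)])

# Lazy: each instance is created only when the loop asks for it,
# and the previous one is released when the loop variable is rebound.
lazy_peak = peak_live((i, Tracked()) for i in range(10))
```

With the eager list the counter reaches 10, which is the situation the assertion is meant to catch. With the generator it stays at no more than two.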
Real life example:
d = Func(pickle.dumps, pickle.loads, File(somedir))
mydata = {str(i): numpy.random.random(int(10e6)) for i in range(100)}
d.update(mydata)
mydata occupies 8 GB in RAM (8 bytes * 10e6 elements * 100 arrays).
d.update(mydata) pickles each element and writes it to disk.
Before this PR, update() first creates a list of 100 pickles, then writes them to disk, and finally releases them all at once. You'll observe a buildup of memory while update() is running, going up to 16 GB (8 GB for mydata + 100 * 80 MB) and then dropping all of a sudden back to 8 GB.
After this PR, update() keeps no more than two pickled arrays in memory at any given time, resulting in a peak memory usage of 8.16 GB (8 GB for mydata + 2 * 80 MB).
As noted in #63, distributed is not affected by this issue.
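The figures quoted above can be reproduced with a few lines of arithmetic. This is a rough sketch that assumes 8-byte float64 elements, decimal units (1 GB = 10^9 bytes), and a pickle roughly the same size as the array it encodes; the variable names are illustrative only.

```python
n_arrays = 100
elements_per_array = int(10e6)  # 10e6 == 1e7 elements per array
bytes_per_element = 8           # assumption: float64

array_bytes = elements_per_array * bytes_per_element  # 80 MB per array
data_bytes = n_arrays * array_bytes                   # 8 GB for mydata

# Before: every pickle is alive at once, on top of the original data.
peak_before = data_bytes + n_arrays * array_bytes     # 16 GB

# After: at most two pickles are alive at any given time.
peak_after = data_bytes + 2 * array_bytes             # 8.16 GB

print(peak_before / 1e9, peak_after / 1e9)
```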
  other = args[0]
  if isinstance(other, Mapping) or hasattr(other, "items"):
-     items += other.items()
+     items = other.items()
There was a problem hiding this comment.
It looks like items is now a dict_items object, but we still initialize it as a list on line 20. Is this OK? Does it matter?
There was a problem hiding this comment.
The list is just a dummy empty iterable to simplify chaining on line 48. It's never filled.
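A small sketch of why the empty list is harmless as a placeholder: `itertools.chain` accepts any iterables, so an empty list contributes nothing and is never appended to. The variable names below are illustrative and not taken from common.py.

```python
import itertools

kwargs = {"a": 1, "b": 2}

# Case 1: no positional argument, so items is still the empty placeholder.
items = []
only_kwargs = list(itertools.chain(items, kwargs.items()))

# Case 2: the placeholder has been replaced by a lazy iterable of pairs.
items = ((str(i), i) for i in range(2))
both = list(itertools.chain(items, kwargs.items()))

print(only_kwargs)
print(both)
```

In both cases the chain is itself lazy; `list()` is used here only to display the result.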
  other = args[0]
  if isinstance(other, Mapping) or hasattr(other, "items"):
-     items += other.items()
+     items = other.items()
There was a problem hiding this comment.
This is more of a general comment: items is no longer a list, but check_items, used in check_mappings in utils_test.py, compares against a list. This still works, but should we modify it to be consistent?
There was a problem hiding this comment.
Not sure if I understand. check_items has nothing to do with the update() method?
There was a problem hiding this comment.
I agree that it doesn't, but I noticed that we do things like check_items(z, [("abc", b"456"), ("xyz", b"12")]), where check_items takes z and does list(z.items()) to compare. I was wondering whether we should compare z.items() directly against a dict_items object, but I guess it's not necessary.
There was a problem hiding this comment.
It's a test and it's not testing memory descoping.
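For reference, a short sketch of how an items view compares against a list, using a plain dict as a stand-in for z (an assumption; the real z is a mapping from the test suite). An items view is set-like, so it never compares equal to a list, which is why converting with `list()` first is the straightforward choice.

```python
z = {"abc": b"456", "xyz": b"12"}  # plain dict standing in for the mapping
expected = [("abc", b"456"), ("xyz", b"12")]

direct = z.items() == expected          # view vs. list: never equal
via_list = list(z.items()) == expected  # list vs. list: order-sensitive
via_set = z.items() == set(expected)    # view vs. set: order-insensitive

print(direct, via_list, via_set)
```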
  else:
      # Assuming (key, value) pairs
-     items += other
+     items = other
There was a problem hiding this comment.
This line is what is really different. If other is an iterator, it is no longer unpacked into a list; instead it is fed directly into _do_update on line 49, which in turn does not load it into memory, at least in the default implementation on line 51, which File does not override.
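The behaviour described above can be sketched with a toy mapping. `LoggingStore` and `produce` are hypothetical names, and the `update()` body below is a simplified reading of the diff hunks in this thread, not a copy of common.py. The log shows that each pair is stored before the next one is produced.

```python
import itertools
from collections.abc import Mapping


class LoggingStore(dict):
    """Hypothetical store that records every write in a shared log."""

    def __init__(self, log):
        super().__init__()
        self.log = log

    def __setitem__(self, key, value):
        self.log.append(("store", key))
        super().__setitem__(key, value)

    def update(self, *args, **kwargs):
        items = []  # empty placeholder, never filled
        if args:
            other = args[0]
            if isinstance(other, Mapping) or hasattr(other, "items"):
                items = other.items()
            else:
                items = other  # assumed to yield (key, value) pairs
        if kwargs:
            items = itertools.chain(items, kwargs.items())
        self._do_update(items)

    def _do_update(self, items):
        # Default behaviour: consume one pair at a time.
        for k, v in items:
            self[k] = v


def produce(log, n):
    """Lazily yield (key, value) pairs, logging each one as it is made."""
    for i in range(n):
        log.append(("produce", i))
        yield i, i * 10


log = []
store = LoggingStore(log)
store.update(produce(log, 3), extra=99)
print(log)
```

Had `update()` first collected the pairs into a list, all three "produce" entries would appear before the first "store" entry.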
Closes #63