Create doc folder and add immutability options #59
doc/ImumutabilityOptions.md
Outdated
| ## Immutability Options | ||
| When creating a dataset you can choose between the immutability options `copy` and `pickle`. If you are working with multiple processes in parallel each child process will share its entire memory space with the main process. But because of the large nested structure with many python objects accessing the dataset will create a copy of it in the RAM for the process accessing it ("copy-on-read"). Therefore if the processes access the dataset it will be loaded into the RAM multiple times. While 'copy' gives a complete copy of the dataset to each process pickle uses bytestreams and compresses the data so the memory usage is decreased compared to copy. |
But because of the large nested structure with many python objects accessing the dataset will create a copy of it in the RAM for the process accessing it ("copy-on-read"). Therefore if the processes access the dataset it will be loaded into the RAM multiple times. While 'copy' gives a complete copy of the dataset to each process pickle uses bytestreams and compresses the data so the memory usage is decreased compared to copy.
Suggestion for an alternative:
Usually the data of the dataset consists of many Python objects, e.g. `dict`, `list`, ... . When a process reads or touches a value, the reference counter of the object is incremented. That is a write to the memory page holding the object, which triggers a copy of that page. So the Linux "copy-on-write" behaviour is effectively "copy-on-read" for Python objects. Therefore, if several processes access the dataset, it will be loaded into RAM multiple times.
While the `copy` option gives a complete copy of the dataset to each process, the `pickle` option uses bytestreams and compresses the data, so the memory usage is lower than with `copy`.
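The effect described in this suggestion can be illustrated with a minimal sketch in plain CPython. It does not use `lazy_dataset`; the `examples` data is made up for illustration. It only shows that taking a reference changes an object's reference count (a write to the object's memory), and that pickling turns many small objects into a single `bytes` object:

```python
import pickle
import sys

# Hypothetical example data: many small nested Python objects.
examples = [{"id": i, "tags": ["a", "b"]} for i in range(1000)]

first = examples[0]
before = sys.getrefcount(first)
alias = first  # merely holding another reference updates the refcount
after = sys.getrefcount(first)

# A pickled dataset is one bytes object; reading slices of it does not
# touch the reference counters of thousands of nested objects.
blob = pickle.dumps(examples, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

print(f"refcount before: {before}, after: {after}")
print(f"pickled size: {len(blob)} bytes")
```

How much memory this saves in forked worker processes depends on the data and on the operating system, so it should be measured rather than assumed.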
lazy_dataset/core.py
Outdated
| def __init__(self, examples, name=None): | ||
| assert isinstance(examples, (tuple, list)), (type(examples), examples) | ||
| assert isinstance(examples, (tuple, list, NumpySerializedList)), (type(examples), examples) |
Could you change this to test for `collections.UserList`?
In theory, we could also check for `collections.abc.Sequence`, but I am not sure whether that could hide bugs (e.g. `str` is a sequence, and `examples` shouldn't be a `str`).
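A minimal sketch of the trade-off, under the assumption that the custom container derives from `collections.UserList`. `SerializedList` and `check_examples` are hypothetical stand-ins, not the classes or functions from this PR:

```python
from collections import UserList
from collections.abc import Sequence


class SerializedList(UserList):
    """Hypothetical stand-in for a custom list-like container."""


def check_examples(examples):
    # Accept concrete containers only; a plain Sequence check would let str through.
    assert isinstance(examples, (tuple, list, UserList)), (type(examples), examples)
    return examples


check_examples([1, 2, 3])
check_examples(SerializedList([1, 2, 3]))

# str is a Sequence, so an abc-based check would accept it silently.
str_is_sequence = isinstance("abc", Sequence)

try:
    check_examples("abc")
    str_rejected = False
except AssertionError:
    str_rejected = True
```

With the `UserList` check, a string passed by mistake fails early instead of being iterated character by character.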
tests/test_immutability_options.py
Outdated
| # Download from https://huggingface.co/datasets/merve/coco/resolve/main/annotations/instances_train2017.json | ||
| def create_coco() -> list[Any]: | ||
| with open("instances_train2017.json") as f: |
Could you use `paderbox.testing.io.cache.url_to_local_path` to download that file? You will probably have to add `paderbox` as a test dependency in `setup.py`.
tests/test_immutability_options.py
Outdated
| axis.set_xlabel("Times (s)") | ||
| axis.legend() | ||
| axis.set_ylabel("Memory usage (MB)") | ||
| plt.savefig(f"/net/vol/deegen/SHK/Lazy_dataset_test/{immutable_warranty}.png", dpi=600) |
Could you convert this line to a comment?
tests/test_immutability_options.py
Outdated
| @@ -0,0 +1,146 @@ | |||
| from __future__ import annotations | |||
Add a small comment that this is an integration test and that it will not be executed by pytest.
Updated Readme with SVG graphics
Is the Immutability file with the description already complete, or what can and should be changed there?