Create doc folder and add immutability options #59

mdeegen · 2023-05-25T11:45:25Z

Is the immutability file with the description already complete, or what can and should be changed there?

boeddeker · 2023-05-25T13:32:30Z

Could you remove files that belong to comparison.md? We will shift them later.
The images are relatively large. Would SVG be smaller?
Could you also include the relevant source code?

boeddeker · 2023-05-26T13:00:44Z

doc/ImumutabilityOptions.md

+## Immutability Options
+
+
+When creating a dataset you can choose between the immutability options `copy` and `pickle`. If you are working with multiple processes in parallel each child process will share its entire memory space with the main process. But because of the large nested structure with many python objects accessing the dataset will create a copy of it in the RAM for the process accessing it ("copy-on-read"). Therefore if the processes access the dataset it will be loaded into the RAM multiple times. While 'copy' gives a complete copy of the dataset to each process pickle uses bytestreams and compresses the data so the memory usage is decreased compared to copy.


But because of the large nested structure with many python objects accessing the dataset will create a copy of it in the RAM for the process accessing it ("copy-on-read"). Therefore if the processes access the dataset it will be loaded into the RAM multiple times. While 'copy' gives a complete copy of the dataset to each process pickle uses bytestreams and compresses the data so the memory usage is decreased compared to copy.

Suggestion for an alternative:

Usually, the data of a dataset consists of many Python objects, e.g. dict, list, ... . When a process reads/touches a value, the reference counter of the object is incremented. That is a write to the memory page that holds the object, so the operating system copies the page. So the Linux behaviour of "copy-on-write" becomes a "copy-on-read" for Python objects. Therefore, if several processes access the dataset, it will be loaded into the RAM multiple times.
While the `copy` option gives a complete copy of the dataset to each process, the `pickle` option uses bytestreams and compresses the data, so the memory usage is decreased compared to `copy`.
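
To illustrate the bytestream idea, here is a minimal sketch that only assumes the standard library. `PickledExamples` is a hypothetical name and not the class used in lazy_dataset. All examples are pickled into one `bytes` buffer and the offsets are kept as raw integers, so reading an example does not touch the reference counters of thousands of small objects.

```python
import array
import pickle


class PickledExamples:
    """Illustrative only: keeps all examples in one bytes buffer."""

    def __init__(self, examples):
        blobs = [pickle.dumps(example) for example in examples]
        # End offset of every blob inside the joined buffer, stored as raw
        # 64-bit integers rather than as individual Python int objects.
        self._ends = array.array("Q")
        end = 0
        for blob in blobs:
            end += len(blob)
            self._ends.append(end)
        self._buffer = b"".join(blobs)

    def __len__(self):
        return len(self._ends)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = self._ends[index - 1] if index > 0 else 0
        # Unpickling builds a fresh object on every access, so the caller
        # cannot mutate the stored data.
        return pickle.loads(self._buffer[start:self._ends[index]])
```

Because every access returns a freshly unpickled object, the stored examples stay immutable as a side effect, which is probably why this counts as an immutability option.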

Update readme

boeddeker · 2023-05-26T13:45:39Z

lazy_dataset/core.py


    def __init__(self, examples, name=None):
-        assert isinstance(examples, (tuple, list)), (type(examples), examples)
+        assert isinstance(examples, (tuple, list, NumpySerializedList)), (type(examples), examples)


Could you change this to test for `collections.UserList`?

In theory, we could also check for `collections.abc.Sequence`, but I am not sure whether that could hide bugs (e.g. `str` is a sequence, and `examples` shouldn't be a `str`).
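
A small sketch of the difference between the two checks. It assumes that the serialized list class derives from `collections.UserList`; `SerializedList` and `check_examples` are hypothetical stand-ins, not names from lazy_dataset.

```python
import collections
import collections.abc


class SerializedList(collections.UserList):
    """Hypothetical stand-in for a list-like container."""


def check_examples(examples):
    # Accepts tuple, list and every UserList subclass, but not str.
    assert isinstance(examples, (tuple, list, collections.UserList)), (
        type(examples), examples)
    return examples


check_examples([{"id": 0}])
check_examples(SerializedList([{"id": 0}]))

# A str would pass a collections.abc.Sequence check, which is the
# concern raised above.
print(isinstance("abc", collections.abc.Sequence))
```

With the `UserList` check, passing a `str` by accident still fails the assertion instead of being iterated character by character.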

boeddeker · 2023-05-26T13:50:17Z

tests/test_immutability_options.py

+
+# Download from https://huggingface.co/datasets/merve/coco/resolve/main/annotations/instances_train2017.json
+def create_coco() -> list[Any]:
+    with open("instances_train2017.json") as f:


Could you use `paderbox.testing.io.cache.url_to_local_path` to download that file? You will probably have to add paderbox as a test dependency in setup.py.
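
For readers without paderbox, the caching idea can be sketched with the standard library only. This is not the paderbox implementation; `cached_download` and its parameters are hypothetical, and the `fetch` argument exists only so that the download can be replaced in a test.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def cached_download(url, cache_dir, fetch=urlretrieve):
    """Return a local path for `url`, downloading only on the first call."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Use the last component of the URL path as the local file name.
    local_path = cache_dir / Path(urlparse(url).path).name
    if not local_path.exists():
        fetch(url, str(local_path))
    return local_path
```

The test could then open the returned path instead of expecting `instances_train2017.json` in the working directory.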

boeddeker · 2023-05-26T13:52:57Z

tests/test_immutability_options.py

+        axis.set_xlabel("Times (s)")
+        axis.legend()
+        axis.set_ylabel("Memory usage (MB)")
+        plt.savefig(f"/net/vol/deegen/SHK/Lazy_dataset_test/{immutable_warranty}.png", dpi=600)


Could you convert this line to a comment?

boeddeker · 2023-05-26T13:54:13Z

tests/test_immutability_options.py

@@ -0,0 +1,146 @@
+from __future__ import annotations


Add a short comment stating that this is an integration test and that it will not be executed by pytest.
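
One possible shape for such a file, as a sketch under pytest's default collection rules. `run_memory_comparison` is a hypothetical placeholder and not the function from this PR.

```python
"""Integration test for the immutability options.

Run it by hand. pytest may import this module, but it finds nothing to
execute, because no function name starts with ``test_``.
"""
from __future__ import annotations


def run_memory_comparison(options=("copy", "pickle")):
    # Placeholder body: the real script measures memory usage per option.
    return {option: None for option in options}


if __name__ == "__main__":
    # Only reached when the file is started as a script, never on import.
    print(run_memory_comparison())
```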

Updated Readme with SVG grafics

Create doc folder and add immutability options

a3895b0

boeddeker reviewed May 26, 2023

mdeegen and others added 2 commits May 26, 2023 15:13

add svg grafics and test for immutability options

4a205f4

Update ImumutabilityOptions.md

00fa3b3

Update readme

boeddeker reviewed May 26, 2023

lazy_dataset/core.py (resolved)

boeddeker reviewed May 26, 2023

mdeegen and others added 6 commits May 26, 2023 17:28

Update ImumutabilityOptions.md

20f28a4

Updated Readme with SVG grafics

add immutability options script and update Listdataset assert

da79c0f

Merge remote-tracking branch 'refs/remotes/origin/master'

c7f7e48

Fix documentation

033dda2

Fix immutability_options script for pytest

e8bd3c6

Fix imports in if statement for PyTests

45a0537

boeddeker merged commit 332d6d5 into fgnt:master Aug 11, 2023
