Skip to content

Conversation

@mdeegen
Copy link
Contributor

@mdeegen mdeegen commented May 25, 2023

Is the Immutability file with the description already complete or what can and should be changed there?

@boeddeker
Copy link
Member

  • Could you remove files that belong to comparison.md? We will shift them later.
  • The images are relative large. Would SVG be smaller?
  • Could you also include the relevant source code?

## Immutability Options


When creating a dataset you can choose between the immutability options `copy` and `pickle`. If you are working with multiple processes in parallel each child process will share its entire memory space with the main process. But because of the large nested structure with many python objects accessing the dataset will create a copy of it in the RAM for the process accessing it ("copy-on-read"). Therefore if the processes access the dataset it will be loaded into the RAM multiple times. While 'copy' gives a complete copy of the dataset to each process pickle uses bytestreams and compresses the data so the memory usage is decreased compared to copy.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But because of the large nested structure with many python objects accessing the dataset will create a copy of it in the RAM for the process accessing it ("copy-on-read"). Therefore if the processes access the dataset it will be loaded into the RAM multiple times. While 'copy' gives a complete copy of the dataset to each process pickle uses bytestreams and compresses the data so the memory usage is decreased compared to copy.

Suggestion for an alternative:

Usually the data of the dataset consists of many python objects, e.g. dict, list, ... . When a process reads/touchs a value, the reference counter for the object will be increased, which triggers a copy. So the Linux behaviour of "copy-on-write" is a "copy-on-read" for Python objects. Therefore if the processes access the dataset it will be loaded into the RAM multiple times.
While the copy option gives a complete copy of the dataset to each process, the pickle option uses bytestreams and compresses the data so the memory usage is decreased compared to copy.


def __init__(self, examples, name=None):
assert isinstance(examples, (tuple, list)), (type(examples), examples)
assert isinstance(examples, (tuple, list, NumpySerializedList)), (type(examples), examples)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you change this to test for collections.UserList?

In theory, we could also check for collections.abc.Sequence, but I am not sure, if it could hide bugs. (e.g. str is a sequence and examples shouldn't be a str.)


# Download from https://huggingface.co/datasets/merve/coco/resolve/main/annotations/instances_train2017.json
def create_coco() -> list[Any]:
with open("instances_train2017.json") as f:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use paderbox.testing.io.cache.url_to_local_path to download that file? Probably you have to add paderbox as a test dependency in setup.py.

axis.set_xlabel("Times (s)")
axis.legend()
axis.set_ylabel("Memory usage (MB)")
plt.savefig(f"/net/vol/deegen/SHK/Lazy_dataset_test/{immutable_warranty}.png", dpi=600)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you convert this line to a comment?

@@ -0,0 +1,146 @@
from __future__ import annotations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a small comment, that this is an integration test, and it will not be executed by pytest.

@boeddeker boeddeker merged commit 332d6d5 into fgnt:master Aug 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants