
Possible out-of-memory issue of dataloader #31

Closed
zhiqiangdon opened this issue Nov 17, 2021 · 4 comments


zhiqiangdon commented Nov 17, 2021

Hello,

I have read through your code but haven't run it yet. I have one question about the dataloader implementation. According to

https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L43

you load all the arrow files into memory. The pre-training data amounts to hundreds of gigabytes. Could this cause an out-of-memory issue, or does this implementation assume a machine with large memory?

Thanks,

@dandelin (Owner)

Hi @zhiqiangdon,

Apache Arrow's read_all() function actually does lazy loading, so there will be no OOM issue.
However, if you call the .to_pandas() method, Arrow loads the dataset eagerly, and you will face an OOM issue.
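
For what it's worth, a minimal sketch of the lazy/eager distinction in pyarrow, assuming the file is opened through pa.memory_map as in the linked base_dataset.py (the file path is a placeholder):

```python
import pyarrow as pa

# Memory-map the file: bytes stay on disk and are paged in on demand,
# so read_all() only assembles table metadata and buffer references.
source = pa.memory_map("pretrain_data.arrow", "r")  # placeholder path
table = pa.ipc.RecordBatchFileReader(source).read_all()

print(table.num_rows)  # cheap: no column data is materialized yet

# Eager path: copies every column (including raw image bytes) into
# pandas/NumPy memory, which can OOM on a multi-hundred-GB dataset.
# df = table.to_pandas()
```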

@zhiqiangdon (Author)

Thanks @dandelin,

I see that you call .to_pandas() on the text column:
https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L58
I guess this operation doesn't load the image data, right?

@dandelin (Owner)

@zhiqiangdon

Yep, you are right. Arrow is a columnar format, so the data is loaded in a column-wise manner.
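
A small sketch of that column-wise behavior, assuming the same memory-mapped setup (the path and the "caption" column name are placeholders):

```python
import pyarrow as pa

table = pa.ipc.RecordBatchFileReader(
    pa.memory_map("pretrain_data.arrow", "r")  # placeholder path
).read_all()

# Only this column's buffers are materialized; the much larger
# image-bytes column stays behind the memory map until accessed.
texts = table["caption"].to_pandas().tolist()  # placeholder column name
```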

@zhiqiangdon (Author)

Thanks @dandelin!
