
Possible out-of-memory issue of dataloader #31

Closed
zhiqiangdon opened this issue Nov 17, 2021 · 4 comments


zhiqiangdon commented Nov 17, 2021

Hello,

I have read through your code but haven't run it yet. I have one question about the dataloader implementation. According to

https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L43

you load all the arrow files into memory. The pre-training data amounts to hundreds of gigabytes. Could this cause an out-of-memory issue, or does this implementation assume a machine with large memory?

Thanks,

@dandelin (Owner)

Hi @zhiqiangdon,

Apache Arrow's read_all() function actually does lazy loading, so there will be no OOM issue.
However, if you call the .to_pandas() method, Arrow loads the dataset eagerly, and you will face an OOM issue.
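
For what it's worth, a minimal sketch of the lazy/eager distinction in pyarrow, assuming the file is opened through pa.memory_map as in the linked base_dataset.py (the file path is a placeholder):

```python
import pyarrow as pa

# Memory-map the file: bytes stay on disk and are paged in on demand,
# so read_all() only assembles table metadata and buffer references.
source = pa.memory_map("pretrain_data.arrow", "r")  # placeholder path
table = pa.ipc.RecordBatchFileReader(source).read_all()

print(table.num_rows)  # cheap: no column data is materialized yet

# Eager path: copies every column (including raw image bytes) into
# pandas/NumPy memory, which can OOM on a multi-hundred-GB dataset.
# df = table.to_pandas()
```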

@zhiqiangdon (Author)

Thanks @dandelin,

I see that you call .to_pandas() on the text column:
https://github.com/dandelin/ViLT/blob/master/vilt/datasets/base_dataset.py#L58
I guess this operation doesn't load the image data, right?

@dandelin (Owner)

@zhiqiangdon

Yep, you are right. Arrow is a columnar format, so the data is loaded in a column-wise manner.
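
A small sketch of that column-wise behavior, assuming the same memory-mapped setup (the path and the "caption" column name are placeholders):

```python
import pyarrow as pa

table = pa.ipc.RecordBatchFileReader(
    pa.memory_map("pretrain_data.arrow", "r")  # placeholder path
).read_all()

# Only this column's buffers are materialized; the much larger
# image-bytes column stays behind the memory map until accessed.
texts = table["caption"].to_pandas().tolist()  # placeholder column name
```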

@zhiqiangdon (Author)

Thanks @dandelin!
