Feature Request: Support reading large batch of tabular data from Parquet files efficiently. #3

2sin18 · 2021-12-01T11:43:54Z

User Story

As a recommender system engineer, I want to read large batches of tabular data from Parquet files efficiently, so that the training performance of large deep recommenders can be improved.

Detailed requirements

It should work easily with existing Dataset-based data pipelines.
It should be optimized for very large batch sizes and take advantage of Parquet features such as column selection, batch reading, and row group filtering.
It should be compatible with vanilla TensorFlow >= 1.14, < 2.0.

API Compatibility

Only new APIs should be introduced; existing APIs should not change.

Willing to contribute

Yes

2sin18 added the enhancement New feature or request label Dec 1, 2021

2sin18 self-assigned this Dec 1, 2021

2sin18 mentioned this issue Dec 1, 2021

Add ParquetDataset for reading tabular data. #4

Merged

2sin18 closed this as completed in #4 Dec 1, 2021

liurcme mentioned this issue Mar 23, 2022

Using shuffle or rebatch may cause OOM problem #43

Closed
