Bytewax pods consume lots of resource #3825

Closed
sudohainguyen opened this issue Nov 4, 2023 · 0 comments · Fixed by #3826

Comments

@sudohainguyen
Collaborator

sudohainguyen commented Nov 4, 2023

Expected Behavior

I ran a benchmark on a feature table with 50M rows x 10 columns and expected to be able to materialize the records into the online store efficiently. With the Bytewax engine, the latest records are extracted to a staging location as parquet files; in my case each file contains ~140k rows.
Ideally, each Bytewax pod should process its file with as small a memory footprint as possible.

Current Behavior

Currently every Bytewax pod loads the entire parquet file into memory before writing to the online store, which causes a huge memory footprint (~3 GB of memory).

Steps to reproduce

Run materialization with the Bytewax engine.

Specifications

  • Version: master

Possible Solution

Use pyarrow's zero-copy mechanism to stream the parquet files and process them on the fly before pushing to the online store.
