Update _WriteBuffer to write several parquet files #483

Merged
merged 5 commits into from
Mar 26, 2024

gabrielmbmb
Member

Description

This PR updates the _WriteBuffer to write several parquet files instead of a single one through the ParquetWriter. Because we infer the schema from the first row added to the _WriteBuffer, we had to check that the schema of each batch matched the writer schema and, if it didn't, create a new schema unifying both. That required closing the ParquetWriter and reopening it with the new schema, but reopening overwrites the content of the parquet file, losing the data from previous steps.

With this new approach, we write a new parquet file (00001.parquet, 00002.parquet, etc.) each time the buffer for a step is full. The buffer is considered full once it holds the defined number of elements.
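
For readers who want to see the idea in isolation, here is a minimal sketch of the approach described above, not the actual _WriteBuffer code from this PR. The names `PartitionedWriteBuffer`, `dump_every`, `write_fn` and `write_parquet` are hypothetical and only used for illustration. It assumes rows are plain dicts and that pyarrow is installed when the default writer is used.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def write_parquet(rows: List[Row], path: Path) -> None:
    """Write `rows` to a single parquet file (needs pyarrow at call time)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    # The schema is inferred per file, so batches whose columns differ never
    # have to share (and reopen) one long-lived ParquetWriter.
    pq.write_table(pa.Table.from_pylist(rows), str(path))


class PartitionedWriteBuffer:
    """Hypothetical stand-in for _WriteBuffer: one numbered file per full buffer."""

    def __init__(
        self,
        directory: str,
        dump_every: int = 2,
        write_fn: Callable[[List[Row], Path], None] = write_parquet,
    ) -> None:
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)
        self._dump_every = dump_every  # number of rows that makes the buffer "full"
        self._write_fn = write_fn
        self._rows: List[Row] = []
        self._files_written = 0

    def add_rows(self, rows: List[Row]) -> None:
        for row in rows:
            self._rows.append(row)
            if len(self._rows) >= self._dump_every:
                self._flush()

    def close(self) -> None:
        # Write whatever is left so a partially filled buffer is not lost.
        if self._rows:
            self._flush()

    def _flush(self) -> None:
        # Each flush creates a new file (00001.parquet, 00002.parquet, ...),
        # so files written earlier are never reopened or overwritten.
        self._files_written += 1
        path = self._dir / f"{self._files_written:05d}.parquet"
        self._write_fn(self._rows, path)
        self._rows = []
```

In this sketch, adding three rows with `dump_every=2` and then calling `close()` would produce 00001.parquet with two rows and 00002.parquet with the remaining one. Passing a custom `write_fn` is just a convenience for trying the flushing logic without pyarrow.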

@gabrielmbmb gabrielmbmb marked this pull request as ready for review March 26, 2024 14:26
@gabrielmbmb gabrielmbmb added enhancement New feature or request fix labels Mar 26, 2024
@gabrielmbmb gabrielmbmb self-assigned this Mar 26, 2024
@gabrielmbmb gabrielmbmb added this to the 1.0.0 milestone Mar 26, 2024
Contributor

@plaguss plaguss left a comment

lgtm

@gabrielmbmb gabrielmbmb merged commit 29d4ca9 into core-refactor Mar 26, 2024
4 checks passed
@gabrielmbmb gabrielmbmb deleted the write_buffer_partitioned_files branch March 26, 2024 14:48
Labels
enhancement New feature or request fix
Projects
Status: Done

2 participants