[Python][Parquet] Parquet Support write and validate CRC #37242

Closed
mapleFU opened this issue Aug 18, 2023 · 3 comments · Fixed by #38360

Comments

@mapleFU
Member

mapleFU commented Aug 18, 2023

Describe the enhancement requested

The C++ Parquet API already supports CRC checksums for both reading and writing.

Systems like S3 ensure the integrity of the data they store, but other storage such as HDDs or SSDs can corrupt data, and the network can deliver bad results. Having a CRC would help in those cases.

It would be good to have CRC support in the Python API as well.

Component(s)

Parquet, Python

@pitrou
Member

pitrou commented Aug 22, 2023

@danepitkin @AlenkaF This seems useful to expose in Python indeed.

@mapleFU
Member Author

mapleFU commented Oct 25, 2023

@frazar Would you mind commenting "take" here?

GitHub can only assign an issue to someone who has replied to it.

@frazar
Contributor

frazar commented Nov 20, 2023

I'll take this!

AlenkaF added a commit that referenced this issue Nov 20, 2023
…RC (#38360)

### Rationale for this change

The C++ Parquet API already supports enabling CRC checksums for read and write operations.

CRC checksums are optional and can detect data corruption due to, for example, file storage issues or [cosmic rays](https://en.wikipedia.org/wiki/Soft_error).

It would then be beneficial to expose this optional functionality to the Python API too.
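
As a rough illustration of the kind of corruption a page checksum can catch, here is a minimal sketch using only Python's standard library. It is not how Arrow or PyArrow implement the feature: the helper names (`page_crc`, `flip_bit`, `verify_page`) and the sample payload are hypothetical, and the sketch only assumes that a standard CRC-32 (as provided by `zlib.crc32`) is stored alongside the page bytes at write time and recomputed at read time.

```python
import zlib


def page_crc(page_bytes: bytes) -> int:
    """Return the CRC-32 of the given bytes as an unsigned 32-bit integer."""
    return zlib.crc32(page_bytes) & 0xFFFFFFFF


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit flipped, simulating a soft error."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def verify_page(page_bytes: bytes, stored_crc: int) -> bool:
    """Recompute the checksum and compare it with the one stored at write time."""
    return page_crc(page_bytes) == stored_crc


# Writer side: compute the checksum once and store it next to the page.
page = b"hypothetical serialized page payload"  # placeholder, not real Parquet data
stored_crc = page_crc(page)

# Reader side: an intact page passes, a page with a single flipped bit does not.
assert verify_page(page, stored_crc)
assert not verify_page(flip_bit(page, 42), stored_crc)
```

Verification costs one extra pass over the page bytes on read, which is one reason to keep it optional.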

This PR is based on a previous PR which became stale: #37439

### What changes are included in this PR?

The PyArrow interface is expanded to include a `page_checksum_enabled` flag.

### Are these changes tested?

[ ] NOT YET!

### Are there any user-facing changes?

The change is backward compatible. An additional, optional keyword argument is added to some interfaces.

Closes #37242
Supersedes #37439
* Closes: #37242

Lead-authored-by: Francesco Zardi <frazar0@hotmail.it>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Co-authored-by: mwish <maplewish117@gmail.com>
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Signed-off-by: AlenkaF <frim.alenka@gmail.com>

AlenkaF added this to the 15.0.0 milestone Nov 20, 2023

dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…Page CRC (apache#38360)
