[Python][Parquet] Support hashing for FileMetaData and ParquetSchema #39780

Closed
milesgranger opened this issue Jan 24, 2024 · 0 comments · Fixed by #39781

Comments

@milesgranger
Contributor

Describe the enhancement requested

Being able to call `hash(parquet_file.metadata)` and `hash(parquet_file.metadata.schema)` would be helpful for dask (dask/dask-expr#703).

Component(s)

Python

pitrou pushed a commit that referenced this issue Feb 12, 2024
…uetSchema (#39781)

I think the hash, especially for `FileMetaData`, could be better. Maybe just use the return value of `__repr__`, even though that won't include row group info?

### Rationale for this change

Helpful for dependent projects. 

### What changes are included in this PR?

Implement `__hash__` for `ParquetSchema` and `FileMetaData`.
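
For readers unfamiliar with the hash contract, here is a minimal sketch of the idea, not the code in this PR. `FakeFileMetaData` and its fields are hypothetical stand-ins, not pyarrow classes. It assumes equality compares everything (including row group detail) while the hash uses only a cheap summary, which is the trade-off raised above: objects that compare equal must hash equal, but unequal objects are allowed to collide.

```python
from dataclasses import dataclass


@dataclass(eq=False)
class FakeFileMetaData:
    """Hypothetical stand-in for Parquet file metadata (not the pyarrow class)."""
    schema_repr: str
    num_rows: int
    num_row_groups: int
    row_group_sizes: tuple = ()

    def __eq__(self, other):
        if not isinstance(other, FakeFileMetaData):
            return NotImplemented
        # Equality looks at every field, including per-row-group detail.
        return (
            self.schema_repr == other.schema_repr
            and self.num_rows == other.num_rows
            and self.num_row_groups == other.num_row_groups
            and self.row_group_sizes == other.row_group_sizes
        )

    def __hash__(self):
        # The hash uses a summary only. Equal objects still hash equal,
        # which is all the contract requires; collisions are resolved by __eq__.
        return hash((self.schema_repr, self.num_rows, self.num_row_groups))


a = FakeFileMetaData("x: int64", 100, 2, (50, 50))
b = FakeFileMetaData("x: int64", 100, 2, (50, 50))  # equal to a
c = FakeFileMetaData("x: int64", 100, 2, (60, 40))  # differs only in row groups

# Hashable objects can be used as dict keys, e.g. to cache per-file work.
cache = {a: "plan for a"}
```

With this shape, `cache[b]` finds the entry stored under `a`, while `c` collides on the hash but is still treated as a different key because `__eq__` tells them apart.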

### Are these changes tested?

Yes

### Are there any user-facing changes?

Supports hashing metadata:

```python
In [1]: import pyarrow.parquet as pq

In [2]: f = pq.ParquetFile('test.parquet')

In [3]: hash(f.metadata)
Out[3]: 4816453453708427907

In [4]: hash(f.metadata.schema)
Out[4]: 2300988959078172540
```
* Closes: #39780

Authored-by: Miles Granger <miles59923@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
@pitrou pitrou added this to the 16.0.0 milestone Feb 12, 2024
dgreiss pushed a commit to dgreiss/arrow that referenced this issue Feb 19, 2024
…d ParquetSchema (apache#39781)

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this issue Feb 28, 2024
…d ParquetSchema (apache#39781)

thisisnic pushed a commit to thisisnic/arrow that referenced this issue Mar 8, 2024
…d ParquetSchema (apache#39781)
