ARROW-2760: [Python] Remove legacy property definition syntax from parquet module and test them #2187
kszucs wants to merge 16 commits into apache:master
python/pyarrow/tests/test_parquet.py
Outdated
There was a problem hiding this comment.
The current implementation returns with zero distinct_count and INT64 for all of the integer types.
I'm not sure these values are correct. Can someone confirm?
There was a problem hiding this comment.
This surprises me a bit. I would have expected that the small int types would have INT32 as their physical type.
There was a problem hiding this comment.
@xhochy Which means INT64 is the expected value? :)
How about distinct_count?
There was a problem hiding this comment.
No, INT64 is not expected. distinct_count was not computed, so I would expect np.nan or None as the result.
There was a problem hiding this comment.
I guess I need to dig deeper, then, to find out why the physical types are incorrect.
python/pyarrow/_parquet.pyx
Outdated
There was a problem hiding this comment.
Shouldn't this be called physical_type like in RowGroupStatistics?
@xhochy How do I trigger Actually I can't find any parquet-cpp tests mentioning AFAICS

We don't have that implemented yet AFAIK so the field should be set (i.e. we should return

@xhochy wouldn't it be better to raise a

No, we could have

I guess here we should check for

According to the has flags I should expose both
Codecov Report
@@            Coverage Diff             @@
##           master    #2187      +/-   ##
==========================================
+ Coverage   84.45%   84.47%   +0.01%
==========================================
  Files         293      293
  Lines       45064    45113      +49
==========================================
+ Hits        38061    38111      +50
+ Misses       6972     6971       -1
  Partials       31       31
👍
python/pyarrow/_parquet.pyx
Outdated
There was a problem hiding this comment.
This will return an iterator on Python 3, right? How about returning an actual container (dict or list)?
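The diff line this comment refers to is not shown here, so the following is only a generic sketch of the concern. `MetadataSketch`, `items_lazy`, and `items_eager` are hypothetical names, not part of pyarrow. On Python 3, `map()` returns a one-shot iterator, so a caller who walks the result twice gets nothing the second time, whereas a materialized container can be reused.

```python
class MetadataSketch(object):
    """Hypothetical stand-in for a metadata wrapper; not pyarrow's API."""

    def __init__(self, fields):
        self._fields = dict(fields)

    def items_lazy(self):
        # On Python 3 this is a lazy, single-use iterator (a list on Python 2).
        return map(lambda kv: kv, self._fields.items())

    def items_eager(self):
        # Returning a real container gives callers something reusable.
        return dict(self._fields)


meta = MetadataSketch({"num_rows": 3, "num_columns": 2})

lazy = meta.items_lazy()
first_pass = list(lazy)
second_pass = list(lazy)  # empty on Python 3: the iterator is exhausted

eager = meta.items_eager()
```

Returning a dict or list keeps the behavior identical across Python 2 and 3, which is presumably what the reviewer is asking for.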
python/pyarrow/_parquet.pyx
Outdated
There was a problem hiding this comment.
How about __ne__? Is it generated automatically by Cython?
There was a problem hiding this comment.
Python's default implementation of __ne__ is the negation of __eq__.
There was a problem hiding this comment.
That seems true on Python 3, but not on Python 2:
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> class C(object):
...     def __eq__(self, other): return True
...
>>> a = C()
>>> b = C()
>>> a == b
True
>>> a != b
True
>>>
There was a problem hiding this comment.
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:14:23)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: class E:
   ...:     def __eq__(self, other):
   ...:         return True
   ...:
In [5]: a = E()
In [6]: b = E()
In [7]: a == b
Out[7]: True
In [8]: a != b
Out[8]: False
I guess this is a Python 2/3 incompatibility. Cython does handle this one, right?
There was a problem hiding this comment.
I don't know, best would be to add a test to exercise that :-)
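A minimal sketch of such a test, written against plain Python classes rather than the Cython extension types in the PR (whether Cython generates `__ne__` is left open in this thread). `EqOnly`, `EqAndNe`, and `check_equality_contract` are hypothetical names used only for illustration.

```python
class EqOnly(object):
    # Only __eq__ is defined; on Python 2, != falls back to identity comparison.
    def __eq__(self, other):
        return isinstance(other, EqOnly)


class EqAndNe(object):
    def __eq__(self, other):
        return isinstance(other, EqAndNe)

    def __ne__(self, other):
        # Explicit negation so Python 2 and Python 3 agree.
        return not self.__eq__(other)


def check_equality_contract(cls):
    """Return True when == and != give consistent answers for two instances."""
    a, b = cls(), cls()
    return (a == b) and not (a != b)


consistent = check_equality_contract(EqAndNe)
```

Running `check_equality_contract(EqOnly)` under both interpreters would expose the incompatibility shown in the sessions above; the same check could be applied to the real classes in test_parquet.py.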
python/pyarrow/tests/test_parquet.py
Outdated
There was a problem hiding this comment.
I'm curious, why is distinct_count always None in the examples below?
There was a problem hiding this comment.
Uwe said:
distinct_count was not computed, so I would expect np.nan or None as the result.
There was a problem hiding this comment.
I see, can you add a small comment?
There was a problem hiding this comment.
Will do, just doing some packaging stuff.
Can you rebase this?
…column stat properties
…ectly __cinit__ constructors instead of custom init methods
14c37e8 to 1e1c7cd
python/pyarrow/_parquet.pyx
Outdated
    return self.metadata.index_page_offset()

@property
def dictionary_page_offset(self):
    return self.metadata.dictionary_page_offset()
There was a problem hiding this comment.
@xhochy Should it return None if has_dictionary_page is False?
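The thread as captured does not record an answer to this question. For illustration only, the behavior being proposed could look like the sketch below. `FakeColumnChunkMetadata` and `ColumnChunkSketch` are hypothetical stand-ins for the wrapped C++ object and the Cython class, not the actual pyarrow implementation.

```python
class FakeColumnChunkMetadata(object):
    """Hypothetical stand-in for the wrapped C++ metadata object."""

    def __init__(self, has_dictionary_page, dictionary_page_offset):
        self._has = has_dictionary_page
        self._offset = dictionary_page_offset

    def has_dictionary_page(self):
        return self._has

    def dictionary_page_offset(self):
        return self._offset


class ColumnChunkSketch(object):
    def __init__(self, metadata):
        self.metadata = metadata

    @property
    def has_dictionary_page(self):
        return self.metadata.has_dictionary_page()

    @property
    def dictionary_page_offset(self):
        # Report None instead of a meaningless offset when no page exists.
        if not self.has_dictionary_page:
            return None
        return self.metadata.dictionary_page_offset()


with_dict = ColumnChunkSketch(FakeColumnChunkMetadata(True, 4))
without_dict = ColumnChunkSketch(FakeColumnChunkMetadata(False, 0))
```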
python/pyarrow/_parquet.pyx
Outdated
    return self.metadata.total_compressed_size()

@property
def index_page_offset(self):
    return self.metadata.index_page_offset()
There was a problem hiding this comment.
Similarly, index_page_offset is 0. Is it supposed to be zero, or is index_page optional too? (I'm not that familiar with the Parquet spec yet.)
There was a problem hiding this comment.
index_page is optional. This should return None.
There was a problem hiding this comment.
How do I know that index_page hasn't been set? (index_page_offset == 0) => return None?
There was a problem hiding this comment.
No, this is buggy. We should not export it at the moment. Please raise NotImplementedError for now. I'll fix the parquet-cpp part over at https://issues.apache.org/jira/browse/PARQUET-1358
There was a problem hiding this comment.
Do you mean:
@property
def has_index_page(self):
    raise NotImplementedError

@property
def index_page_offset(self):
    # leave it as is
    return self.metadata.index_page_offset()
or
@property
def index_page_offset(self):
    raise NotImplementedError
?
There was a problem hiding this comment.
Files not written with parquet-cpp will currently yield garbage for index_page_offset, as it is mostly not set.
@property
def has_index_page(self):
    raise NotImplementedError("not supported in parquet-cpp")

@property
def index_page_offset(self):
    raise NotImplementedError("parquet-cpp doesn't return valid values")
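As a rough sketch of how a test might observe this behavior, the snippet below copies the two properties onto a hypothetical `IndexPageSketch` class and checks that accessing them raises. `raises_not_implemented` is an illustrative helper, not something from the PR or from pyarrow.

```python
class IndexPageSketch(object):
    """Hypothetical class carrying the two properties suggested above."""

    @property
    def has_index_page(self):
        raise NotImplementedError("not supported in parquet-cpp")

    @property
    def index_page_offset(self):
        raise NotImplementedError("parquet-cpp doesn't return valid values")


def raises_not_implemented(obj, attr):
    """Return True if reading obj.<attr> raises NotImplementedError."""
    try:
        getattr(obj, attr)
    except NotImplementedError:
        return True
    return False


chunk = IndexPageSketch()
results = [raises_not_implemented(chunk, name)
           for name in ("has_index_page", "index_page_offset")]
```

Failing loudly here means callers cannot mistake an unset offset for a real one.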
python/pyarrow/tests/test_parquet.py
Outdated
def test_parquet_metadata_api():
    import pyarrow.parquet as pq
    import pyarrow._parquet as _pq
There was a problem hiding this comment.
Should I expose the missing classes from pyarrow._parquet to pyarrow.parquet?
There was a problem hiding this comment.
Yes, they seem to be helpful, so a user could also use them.
@xhochy Do we need to release a new parquet-cpp version too with arrow=0.10.0? If not, then I wouldn't change parquet's API before the release.

We will release a new parquet-cpp version once Arrow 0.10.0 is released. We should not change/break anything in parquet-cpp before the Arrow 0.10.0 release. parquet-cpp releases are much simpler, so I can do more. The releases are mainly limited by PMCs that vote on the release.

So until we have
f037ead to 74d53bb
This seems ready to go; I moved it into the 0.10.0 milestone so it gets merged.