Skip to content

ARROW-2760: [Python] Remove legacy property definition syntax from parquet module and test them #2187

Closed
kszucs wants to merge 16 commits intoapache:masterfrom
kszucs:cython_properties
Closed

ARROW-2760: [Python] Remove legacy property definition syntax from parquet module and test them #2187
kszucs wants to merge 16 commits intoapache:masterfrom
kszucs:cython_properties

Conversation

@kszucs
Copy link
Copy Markdown
Member

@kszucs kszucs commented Jun 28, 2018

No description provided.

Copy link
Copy Markdown
Member Author

@kszucs kszucs Jun 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation returns with zero distinct_count and INT64 for all of the integer types.
I'm not sure about the correctness of these values, can someone confirm (my doubt)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This surprises me a bit. I would have expected that the small int types would have INT32 as their physical type.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy Which means INT64 is the expected? :)
How about distinct_count?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, INT64 is not the expected. distinct_count was not computed, so I would expect np.nan or None as the result.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I need to dig deeper than to find why the physical_types are incorrect.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be called pshysical_type like in RowGroupStatistics?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that would be much better

@kszucs kszucs force-pushed the cython_properties branch from 669bf90 to f402963 Compare July 4, 2018 10:29
@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Jul 4, 2018

@xhochy How do I trigger distinct_count computation?

Actually I can't find any parquet-cpp tests mentioning distinct_count.

AFAICS distinct_count is set along with num_values and null_count.

@xhochy
Copy link
Copy Markdown
Member

xhochy commented Jul 4, 2018

We don't have that implemented yet AFAIK so the field should be set (i.e. we should return None)

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Jul 4, 2018

@xhochy wouldn't it be better to raise a NotImplementedError for stat.distinct_count instead?

@xhochy
Copy link
Copy Markdown
Member

xhochy commented Jul 4, 2018

No, we could have distinct_count set when we load Parquet files that were written by parquet-mr. Just the ones written by parquet-cpp / pyarrow don't have it.

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Jul 4, 2018

I guess here we should check for has_distinct_count then, how should I represent missing distinct_count on the cpp API? Expose HasDistinctCount()?

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Jul 4, 2018

According to the has flags I should expose both HasNullCount and HasDistinctCount to be able to return None for stat.null_count and stat.distinct_count.

@codecov-io
Copy link
Copy Markdown

codecov-io commented Jul 4, 2018

Codecov Report

Merging #2187 into master will increase coverage by 0.01%.
The diff coverage is 72.26%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2187      +/-   ##
==========================================
+ Coverage   84.45%   84.47%   +0.01%     
==========================================
  Files         293      293              
  Lines       45064    45113      +49     
==========================================
+ Hits        38061    38111      +50     
+ Misses       6972     6971       -1     
  Partials       31       31
Impacted Files Coverage Δ
python/pyarrow/parquet.py 91.28% <100%> (ø) ⬆️
python/pyarrow/tests/test_parquet.py 97.4% <100%> (+0.1%) ⬆️
python/pyarrow/_parquet.pyx 68.95% <64.48%> (-0.03%) ⬇️
python/pyarrow/formatting.py 68.75% <83.33%> (+25%) ⬆️
cpp/src/arrow/util/thread-pool-test.cc 98.91% <0%> (-0.55%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8e6af29...d6a7f77. Read the comment docs.

@xhochy
Copy link
Copy Markdown
Member

xhochy commented Jul 8, 2018

According to the has flags I should expose both HasNullCount and HasDistinctCount to be able to return None for stat.null_count and stat.distinct_count.

👍

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will return an iterator on Python 3, right? How about returning an actual container (dict or list)?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about __ne__? Is it generated automatically by Cython?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python's default implementation is for __ne__ is not __eq__

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems true on Python 3, but not on Python 2:

Python 2.7.12 (default, Dec  4 2017, 14:50:18) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> class C(object):
...   def __eq__(self, other): return True
... 
>>> a = C()
>>> b = C()
>>> a == b
True
>>> a != b
True
>>> 

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:14:23)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: class E:
   ...:     def __eq__(self, other):
   ...:         return True
   ...:

In [5]: a = E()

In [6]: b = E()

In [7]: a == b
Out[7]: True

In [8]: a != b
Out[8]: False

I guess this is a python 2/3 incompatibility. Cython does handle this one, right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, best would be to add a test to exercise that :-)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a test for ColumnSchema

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious, why is distinct_count always None in the examples below?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uwe said:

distinct_count was not computed, so I would expect np.nan or None as the result.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, can you add a small comment?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do, just doing some packaging stuff.

@wesm
Copy link
Copy Markdown
Member

wesm commented Jul 24, 2018

Can you rebase this?

@kszucs kszucs force-pushed the cython_properties branch from 14c37e8 to 1e1c7cd Compare July 25, 2018 09:12
return self.metadata.index_page_offset()
@property
def dictionary_page_offset(self):
return self.metadata.dictionary_page_offset()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy Should it return None if has_dictionary_page is False?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

return self.metadata.total_compressed_size()
@property
def index_page_offset(self):
return self.metadata.index_page_offset()
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly index_page_offset is 0, is it supposed to be zero, or index_page is optional too? (I'm not that familiar with parquet specs yet)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index_page is optional. This should return None.

Copy link
Copy Markdown
Member Author

@kszucs kszucs Jul 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do I know that index_page hasn't been set? (index_page_offset == 0) => return None?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xhochy ping

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is buggy. We should export it at the moment. Please raise NotImplementedError for now. I'll fix the parquet-cpp part over at https://issues.apache.org/jira/browse/PARQUET-1358

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do You mean:

@property
def has_index_page(self):
    raise NotImplementedError

@property
def index_page_offset(self):
    # leave it as is
    return self.metadata.index_page_offset()

or

@property
def index_page_offset(self):
    raise NotImplementedError

?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Files not written with parquet-cpp will lead to garbage on index_page_offset currently as it is mostly not set.

@property
def has_index_page(self):
    raise NotImplementedError("not supported in parquet-cpp")

@property
def index_page_offset(self):
    raise NotImplementedError("parquet-cpp doesn't return valid values")

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, Thanks!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


def test_parquet_metadata_api():
import pyarrow.parquet as pq
import pyarrow._parquet as _pq
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I expose the missing classes from pyarrow._parquet to pyarrow.parquet?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they seem to be helpful, so a user could also use them.

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Jul 25, 2018

According to the has flags I should expose both HasNullCount and HasDistinctCount to be able to return None for stat.null_count and stat.distinct_count.

@xhochy Do We need to release a new parquet-cpp version too with arrow=0.10.0? If not, then I wouldn't change parquet's API before the release.

@xhochy
Copy link
Copy Markdown
Member

xhochy commented Jul 25, 2018

We will release a new parquet-cpp version once Arrow 0.10.0 is released. We should not change/break anything in parquet-cpp before the Arrow 0.10.0 release.

parquet-cpp releases are much simpler, so I can do more. The releases are mainly limited by PMCs that vote on the release.

@kszucs
Copy link
Copy Markdown
Member Author

kszucs commented Jul 25, 2018

So until We have HasNullCount and HasDistinctCount I'll assert stat.distinct_count == 0 instead of None with a note.

@kszucs kszucs force-pushed the cython_properties branch from f037ead to 74d53bb Compare July 26, 2018 12:37
@wesm
Copy link
Copy Markdown
Member

wesm commented Jul 27, 2018

This seems ready to go, I moved this into the 0.10.0 milestone so it gets merged

Copy link
Copy Markdown
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

Leaving this open for @pitrou for a final review

@wesm wesm closed this in 49ccf6a Jul 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants