ARROW-6339: [Python][C++] Rowgroup statistics for pd.NaT array ill defined #5180
Can you open a JIRA issue?
@ursabot build
@fjetter this gives a bunch of statistics-related failures in the tests. It seems that the
@@ -623,6 +623,11 @@ bool ApplicationVersion::HasCorrectStatistics(Type::type col_type,
return false;
}

// Null only arrays do not have proper statistics
if (statistics.null_count > 0 && statistics.distinct_count == 0) {
We never actually compute distinct_count so it's always going to be zero I think
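To illustrate the concern, here is a minimal sketch in plain Python rather than the actual parquet-cpp code. It assumes, as the comment above suggests, that distinct_count is never computed and therefore stays at zero. `ColumnStats`, its `row_count` field, and `is_all_null` are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Toy stand-in for row group statistics; not the parquet-cpp class."""
    null_count: int
    row_count: int           # hypothetical field, assumed for illustration
    distinct_count: int = 0  # assumed never computed, so it stays at 0


def proposed_check_rejects(stats: ColumnStats) -> bool:
    # Mirrors the condition in the diff above.
    return stats.null_count > 0 and stats.distinct_count == 0


def is_all_null(stats: ColumnStats) -> bool:
    # Hypothetical alternative: compare the null count with the row count.
    return stats.row_count > 0 and stats.null_count == stats.row_count


all_null = ColumnStats(null_count=3, row_count=3)
partly_null = ColumnStats(null_count=1, row_count=3)

print(proposed_check_rejects(all_null), is_all_null(all_null))        # True True
print(proposed_check_rejects(partly_null), is_all_null(partly_null))  # True False
```

Under that assumption the proposed condition would also reject statistics for a column that contains only a single null, which may explain the test failures mentioned earlier.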
I'll try to raise an exception as was suggested by @xhochy
I guess the proper position for the exception would be the
@xhochy No idea. Can None be a regular value for those statistics?
For an all-null column
this was done in 62202ee
The issue is that a NaT array is not initialised as a NullArray, so the statistics are not invalidated, which causes the ill-defined behaviour.
Compared to a real NullArray (pd.DataFrame({"t": [None]})), this gives the same behaviour, i.e. statistics return None. I'm not entirely sure whether this is the intended behaviour.