Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-6339: [Python][C++] Rowgroup statistics for pd.NaT array ill defined #5180

Closed
wants to merge 1 commit into from

Conversation

fjetter
Copy link
Contributor

@fjetter fjetter commented Aug 23, 2019

The issue is that a NaT array is not initialised as a NullArray and the statistics are not invalidated causing the ill defined behaviour.

Compared to a real NullArray (pd.DataFrame({"t": [None]})) this gives the same behaviour, i.e. statistics return None. I'm not entirely sure if this is the intended behaviour.

@wesm
Copy link
Member

wesm commented Aug 23, 2019

Can you open a JIRA issue?

@fjetter fjetter changed the title [Python][C++] Rowgroup statistics for pd.NaT array ill defined ARROW-6339: [Python][C++] Rowgroup statistics for pd.NaT array ill defined Aug 24, 2019
@kszucs
Copy link
Member

kszucs commented Aug 29, 2019

@ursabot build

@jorisvandenbossche
Copy link
Member

@fjetter this gives a bunch of statistics-related failures in the tests. It seems that the statistics now return None even in cases where there are valid data (not all-null).

@@ -623,6 +623,11 @@ bool ApplicationVersion::HasCorrectStatistics(Type::type col_type,
return false;
}

// Null only arrays do not have proper statistics
if (statistics.null_count > 0 && statistics.distinct_count == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We never actually compute distinct_count so it's always going to be zero I think

@fjetter
Copy link
Contributor Author

fjetter commented Sep 17, 2019

I'll try to raise an exception as was suggested by @xhochy

@fjetter
Copy link
Contributor Author

fjetter commented Sep 17, 2019

I guess the proper position for the exception would be the Statistics class directly?

@xhochy
Copy link
Member

xhochy commented Sep 17, 2019

@pitrou @wesm Do you have an opinion on execption vs returning None?

@pitrou
Copy link
Member

pitrou commented Sep 17, 2019

@xhochy No idea. Can None be a regular value for those statistics?

@xhochy
Copy link
Member

xhochy commented Sep 17, 2019

For an all-null column None could be valid?

@wesm
Copy link
Member

wesm commented Sep 18, 2019

this was done in 62202ee

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants