Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract Date32 parquet statistics as Date32Array rather than Int32Array #10593

Merged
merged 5 commits into from
May 23, 2024

Conversation

xinlifoobar
Copy link
Contributor

Which issue does this PR close?

Closes #10587

Rationale for this change

This is to fix a bug when reading a Date32 or Date64 column from a parquet file, DataFusion currently returns an Int32 array

What changes are included in this PR?

Adds conversions in the get_statistic marco from Int32 to Date32 and Date64 respectively.

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label May 21, 2024
}
Some(DataType::Date64) => {
Some(ScalarValue::Date64(Some(i64::from(*s.$func()) * 24 * 60 * 60 * 1000)))
}
_ => Some(ScalarValue::Int32(Some(*s.$func()))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be the root cause of the bug, is it something that needs to be addressed in a different PR @alamb ? "catch-all" pattern matching branches are evil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is likely what is masking all the bugs

I am not sure if it is worth explicitly listing all the types out here to be honest 🤔 I am hoping in some future PR to revamp how these statistics are done entirely (basically match on the arrow target type in the outer loop rather than the inner loop)

Copy link
Contributor

@edmondop edmondop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a PMC or a full-time committer, my review is not very valuable, but looks good to me.
Added a comment to the PR outside the review to track whether we are ok with a catch-all pattern matching statement or we want to track somewhere to make the patterns explicit

@alamb
Copy link
Contributor

alamb commented May 21, 2024

Not a PMC or a full-time committer, my review is not very valuable, but looks good to me.

Your review is valuable in my opinion -- thank you @edmondop

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @xinlifoobar -- this looks really nice 👌 I had only one question related to Date64 but otherwise this PR looks good to me

}
Some(DataType::Date64) => {
Some(ScalarValue::Date64(Some(i64::from(*s.$func()) * 24 * 60 * 60 * 1000)))
}
_ => Some(ScalarValue::Int32(Some(*s.$func()))),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is likely what is masking all the bugs

I am not sure if it is worth explicitly listing all the types out here to be honest 🤔 I am hoping in some future PR to revamp how these statistics are done entirely (basically match on the arrow target type in the outer loop rather than the inner loop)

@alamb alamb changed the title Fixes bug expect Date32Array but returns Int32Array Extract Date32 parquet statistics as Date32Array rather than Int32Array May 21, 2024
@alamb
Copy link
Contributor

alamb commented May 22, 2024

This file had some merge conflicts so I took the liberty of merging up from main and addressing the comment #10593 (comment) in 43cf435

Thank you again so much @xinlifoobar -- very much appreciated

@alamb alamb merged commit 529d2c0 into apache:main May 23, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented May 23, 2024

Thanks agian @xinlifoobar

alamb added a commit to alamb/datafusion that referenced this pull request May 23, 2024
…32Array` (apache#10593)

* Fixes bug expect `Date32Array` but returns Int32Array

* Add round trip ut

* Update arrow_statistics.rs

* remove unreachable code

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@xinlifoobar xinlifoobar deleted the dev/xinli/date_stat branch May 24, 2024 02:47
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…32Array` (apache#10593)

* Fixes bug expect `Date32Array` but returns Int32Array

* Add round trip ut

* Update arrow_statistics.rs

* remove unreachable code

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFusion reads Date32 and Date64 parquet statistics in as Int32Array
4 participants