[FLINK-5566] [Table API & SQL] Introduce structure to hold table and column level statistics #3196

Closed
wants to merge 2 commits into from

beyond1920
Contributor

This PR introduces structures to hold table-level and column-level statistics.
TableStats: holds table-level statistics.
ColumnStats: holds column-level statistics.

Contributor

@fhueske fhueske left a comment

Hi @beyond1920, thanks for the PR, and sorry for the long wait for the review.

I had a few minor comments inline. Another thing I'd like to discuss is compatibility with Java. I think it is important that we can provide statistics without importing the Scala library. There are a few connectors which are implemented in Java and adding a Scala dependency just to provide statistics would not be good.

What do you think?

Best, Fabian

* @param min min value of column values
*/
case class ColumnStats(
ndv: Long,
Contributor

Make all stats optional?

Contributor Author

@fhueske, there is no need to make all stats optional. If no statistics are available for ndv/nullCount/avgLen/maxLen, we can give them an invalid value, e.g., -1. That does not work for max/min, because a max/min value can itself be negative, so max/min are optional. What do you think?
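
To make the trade-off above concrete, here is a minimal sketch of the two conventions side by side. It is not the code from this PR: the field names and types follow the diff excerpt, while the `Unknown` constant and the `has...` helpers are hypothetical names introduced only for illustration.

```scala
// Sketch only: field names and types follow the diff excerpt in this thread.
// `Unknown` and the `has...` helpers are assumptions, not part of the PR.
final case class ColumnStats(
    ndv: Long,        // number of distinct values, Unknown (-1) if not available
    nullCount: Long,  // number of null values, Unknown (-1) if not available
    avgLen: Long,     // average value length, Unknown (-1) if not available
    maxLen: Long,     // maximum value length, Unknown (-1) if not available
    max: Option[Any], // None if not available; -1 could be a real value here
    min: Option[Any]) {

  // Counts and lengths are never negative, so -1 is a safe "missing" marker.
  def hasNdv: Boolean = ndv != ColumnStats.Unknown
  def hasNullCount: Boolean = nullCount != ColumnStats.Unknown
}

object ColumnStats {
  // Sentinel for count-like statistics that the provider did not supply.
  val Unknown: Long = -1L
}

object ColumnStatsDemo {
  def main(args: Array[String]): Unit = {
    // A column whose real maximum is -1: a sentinel could not express this.
    val stats = ColumnStats(
      ndv = 120L,
      nullCount = ColumnStats.Unknown,
      avgLen = 8L,
      maxLen = 8L,
      max = Some(-1),
      min = Some(-40))

    println(stats.hasNdv)       // true
    println(stats.hasNullCount) // false
    println(stats.max)          // Some(-1)
  }
}
```

The sentinel keeps the count-like fields as plain primitives, while `Option` is reserved for the two fields whose value domain includes negative numbers.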

case class ColumnStats(
ndv: Long,
nullCount: Long,
avgLen: Long,
Contributor

I think Int should be sufficient for value length.

Contributor Author

Yes, I agree.

avgLen: Long,
maxLen: Long,
max: Option[Any],
min: Option[Any]) {
Contributor

Does it make sense to denote whether stats are precise or approximate? Also, an optional field could hold a timestamp of when the stats were generated.

Contributor Author

We prefer to use whatever stats are available; it is the stats provider who should be responsible for providing reliable stats. What do you think?

Contributor

Yes, I think you are right, @beyond1920.

@beyond1920
Contributor Author

@fhueske, thanks for your review. I have updated the code based on your advice, including Java compatibility and the column stats field types.

@fhueske
Contributor

fhueske commented Feb 14, 2017

Thanks for the update @beyond1920!
PR is good to merge.

@asfgit asfgit closed this in 663c1e3 Feb 14, 2017
@fhueske
Contributor

fhueske commented Feb 14, 2017

Merged.
Thanks @beyond1920!
