New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-5566] [Table API & SQL]Introduce structure to hold table and column level statistics #3196
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @beyond1920, thanks for the PR and sorry of the long time until the review.
I had a few minor comments inline. Another thing I'd like to discuss is compatibility with Java. I think it is important that we can provide statistics without importing the Scala library. There are a few connectors which are implemented in Java and adding a Scala dependency just to provide statistics would not be good.
What do you think?
Best, Fabian
* @param min min value of column values | ||
*/ | ||
case class ColumnStats( | ||
ndv: Long, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make all stats optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fhueske , there is no need to make all stats optional. If there is no statistics for ndv/nullcount/avgLen/maxLen, we could give them an invalid value, e.g, -1. But it does not work for max/min, because max/min value could be possible negative, so max/min is optional. What do you think?
case class ColumnStats( | ||
ndv: Long, | ||
nullCount: Long, | ||
avgLen: Long, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Int
should be sufficient for value length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I agree.
avgLen: Long, | ||
maxLen: Long, | ||
max: Option[Any], | ||
min: Option[Any]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to denote whether stats are precise or approximate? Also an optional field could hold the a timestamp when the stats were generated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We prefer to use the available stats, it's the stats provider who should be responsible for providing reliable stats. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think you are right @beyond1920
cadc16e
to
3715780
Compare
@fhueske , thanks for your review. I modify code based on your advice, including compatibility with Java and column stats field type. |
Thanks for the update @beyond1920! |
Merged. |
This pr aims to introduce structure to hold table and column level statistics.
TableStats: Responsible for hold table level statistics
ColumnStats: Responsible for hold column level statistics.