Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21322][SQL][followup] support histogram in filter cardinality estimation #19952

Closed
wants to merge 3 commits into from

Conversation

cloud-fan
Copy link
Contributor

What changes were proposed in this pull request?

some code cleanup/refactor and naming improvement.

How was this patch tested?

existing tests

@cloud-fan
Copy link
Contributor Author

cc @ron8hu @wzhfy

*
* @param value a literal value of a column
* @param bins an array of bins for a given numeric equi-height histogram
* @return the id of the first bin into which a column value falls.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it's a private method and we can omit the parameter doc as it's trivial.

* @param upperBound the highest value of the given range
* @param upperBoundInclusive whether the upperBound is included in the range
* @param lowerBound the lowest value of the given range
* @param lowerBoundInclusive whether the lowerBound is included in the range
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of asking the callers to pass in the upper and lower bin id, it's more intuitive to pass whether to include the range boundaries.

@SparkQA
Copy link

SparkQA commented Dec 12, 2017

Test build #84773 has finished for PR 19952 at commit ebcd6d1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Dec 12, 2017

Test build #84784 has finished for PR 19952 at commit ebcd6d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@wzhfy wzhfy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except some minor comments

* @param lowerId id of the low end bin holding the low end value of a column range
* @param higherEnd a given upper bound value of a specified column value range
* @param lowerEnd a given lower bound value of a specified column value range
* Note that the return value is double type, because the range boundaries usually occupy a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: returned value

* @param lowerEnd a given lower bound value of a specified column value range
* Note that the return value is double type, because the range boundaries usually occupy a
* portion of a bin. An extrema case is [value, value] which is generated by equal predicate
* `col = value`, we can get more accuracy by allowing returning portion of histogram bins.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: get higher accuracy

* @param min the lower bound of the current valid range for a given column
* @param datumNumber the numeric value of a literal
* @return the selectivity percentage for a condition in the current range.
* Computes the possibility of a equal predicate using histogram.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: an equality predicate

upperBound: Double,
upperBoundInclusive: Boolean,
lowerBound: Double,
lowerBoundInclusive: Boolean,
histogram: Histogram): Double = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to pass the bin array instead of histogram? we can simplify many histogram.bins here.

val numBinsHoldingEntireRange = EstimationUtils.numBinsHoldingRange(
max, upperBoundInclusive = true, min, lowerBoundInclusive = true, histogram)

val numBinsHoldingDatum = op match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numBinsHoldingRange

* [lowerBound, upperBound].
*
* Note that the returned value is double type, because the range boundaries usually occupy a
* portion of a bin. An extrema case is [value, value] which is generated by equal predicate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: extreme

@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84822 has finished for PR 19952 at commit 8fe0c49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84827 has finished for PR 19952 at commit 4e35c43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

thanks, merging to master!

@asfgit asfgit closed this in bdb5e55 Dec 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants