Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3891][SQL] Add array support to percentile, percentile_approx and constant inspectors support #2802

Closed
wants to merge 8 commits into from

Conversation

gvramana
Copy link
Contributor

Supported passing array to percentile and percentile_approx UDAFs
To support percentile_approx, constant inspectors are supported for GenericUDAF
Constant folding support added to CreateArray expression
Avoided constant udf expression re-evaluation

Author: Venkata Ramana G ramana.gollamudi@huawei.com

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@liancheng
Copy link
Contributor

test this please

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #447 has started for PR 2802 at commit f425daf.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #447 has finished for PR 2802 at commit f425daf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gvramana
Copy link
Contributor Author

gvramana commented Nov 3, 2014

Reworked on this over the changes of #2762. Updated the pull request comment.
Please test the same

@liancheng
Copy link
Contributor

test this please

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22810 has started for PR 2802 at commit 2a59843.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22810 has finished for PR 2802 at commit 2a59843.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22810/
Test FAILed.

@gvramana
Copy link
Contributor Author

gvramana commented Nov 3, 2014

fixed test, please re-trigger test

@liancheng
Copy link
Contributor

retest this please

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22860 has started for PR 2802 at commit b5f0ce6.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22860 has finished for PR 2802 at commit b5f0ce6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22860/
Test PASSed.


private val function = resolver.getEvaluator(exprs.map(_.dataType.toTypeInfo).toArray)

private val inspectors =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have to distinguish the isUDAFBridgeRequired? Can we just use the exprs.map(toInspector).toArray?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously flag added to not disturb UDAFBridge path. Fixed and tested the same.

@chenghao-intel
Copy link
Contributor

Thank you @gvramana, this is a good follow up for #2762 , which I didn't cover the constant value for Hive UDAF. I have some comments on it, let's keep the minimum change for the fixing.

@gvramana
Copy link
Contributor Author

gvramana commented Nov 4, 2014

@chenghao-intel , Thanks for review, fixed and replied on comments. Please check the same.

@marmbrus
Copy link
Contributor

marmbrus commented Nov 4, 2014

ok to test

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22897 has started for PR 2802 at commit 2b2859a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 4, 2014

Test build #22897 has finished for PR 2802 at commit 2b2859a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22897/
Test PASSed.

private val inspectors = exprs.map(toInspector).toArray

private val function = {
val parameterInfo = new SimpleGenericUDAFParameterInfo(inspectors,false,false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: spaces after ,.

@SparkQA
Copy link

SparkQA commented Nov 23, 2014

Test build #23759 has started for PR 2802 at commit a18f917.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 23, 2014

Test build #23759 has finished for PR 2802 at commit a18f917.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23759/
Test PASSed.

@gvramana
Copy link
Contributor Author

gvramana commented Dec 1, 2014

@marmbrus Please review and merge the same, thanks

@@ -172,6 +177,8 @@ private[hive] case class HiveGenericUdf(functionClassName: String, children: Seq

override def eval(input: Row): Any = {
returnInspector // Make sure initialized.
if(foldable) return constantReturnValue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably unnecessary, as constant folding rule in Optimizer will replace the foldable expression with Literal. Please correct me if there is exceptional case. (and above property constantreturnValue)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constant check and returning value is required for two reasons:

  1. some UDF functions returns constant iterator when initializeAndFoldConstants called with constant iterators, by executing them once.
    But if the same are called with "function.evaluate" they will not return the same constant value type. There will be mismatch in the datatype expected by constantReturnInspector datatype vs datatype returned by function.evaluate.(Ex: org.apache.hadoop.io.Text vs String). This fails unwrap.
    So if return Inspector is constant we don't need to call "function.evaluate" as the expression is already evaluated and return value is already present in constant iterator.
    This I have uncovered, when I made CreateArrayExression as foldable, then test fails. So modified as part of current defect fix only.
  2. Even if Literal is identified during optimization, expression is evaluated twice, once during return Inspector creation and next during eval function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, I guess there is probably a bug in master as you described, can you paste a query to reproduce the failure? (Text V.S. String).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In HiveQuerySuite, "constant array" testcase was failing
SELECT sort_array(
sort_array(
array("hadoop distributed file system",
"enterprise databases", "hadoop map-reduce")))
FROM src LIMIT 1

[info] - constant array *** FAILED *** (596 milliseconds)
[info] Failed to execute query using catalyst:
[info] Error: java.lang.String cannot be cast to org.apache.hadoop.io.Text

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenghao-intel, I think you understand this code better than I do. Are you satisfied with the explanation? Does this approach seem reasonable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this failure is due to the bug of nested constant Expression <-> ObjectInspector in HiveInspectors, and I will fix that in #3429.

@gvramana , how about revert the changes in CreateArray and HiveGenericUDF? I think we can merge the others first. And you can create a new PR for CreateArray.foldable which depends on #3429, since it currently doesn't break anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chenghao-intel This PR cannot be separately merged without CreateArray, as percentile_approx accepts only constant array iterator and fails otherwise. I think we can go ahead and merge all these changes as they don't break build or tests, and are not directly dependent on #3429 in order of merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the #3429, I think this PR can be more simpler after #3429 merged. Besides, invoking the foldable win eval probably too heavy, which is supposed to be eliminated in Optimizer.

@chenghao-intel
Copy link
Contributor

The code looks good to me in general, except the unnecessary change in the class HiveGenericUDF. Sorry, I didn't test that, please correct me if I am wrong.

@gvramana
Copy link
Contributor Author

gvramana commented Dec 1, 2014

That change is required, please find the updated comment above. Thanks

@gvramana
Copy link
Contributor Author

gvramana commented Dec 1, 2014

@marmbrus, @chenghao-intel any other comments? Can you merge the same, thanks

@@ -165,6 +165,11 @@ private[hive] case class HiveGenericUdf(functionClassName: String, children: Seq
isUDFDeterministic && returnInspector.isInstanceOf[ConstantObjectInspector]

@transient
protected lazy val constantReturnValue = unwrap(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be a def? It will only be called once correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fixed the same.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24246 has started for PR 2802 at commit a0182e5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Dec 9, 2014

Test build #24246 has finished for PR 2802 at commit a0182e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24246/
Test PASSed.

@marmbrus
Copy link
Contributor

Thanks! Merged to master.

@asfgit asfgit closed this in f33d550 Dec 17, 2014
asfgit pushed a commit that referenced this pull request Dec 30, 2014
Since #3429 has been merged, the bug of wrapping to Writable for HiveGenericUDF is resolved, we can safely remove the foldable checking in `HiveGenericUdf.eval`, which discussed in #2802.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3745 from chenghao-intel/generic_udf and squashes the following commits:

622ad03 [Cheng Hao] Remove the unnecessary code change in Generic UDF
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Jan 17, 2015
…and constant inspectors support

Supported passing array to percentile and percentile_approx UDAFs
To support percentile_approx,  constant inspectors are supported for GenericUDAF
Constant folding support added to CreateArray expression
Avoided constant udf expression re-evaluation

Author: Venkata Ramana G <ramana.gollamudihuawei.com>

Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>

Closes apache#2802 from gvramana/percentile_array_support and squashes the following commits:

a0182e5 [Venkata Ramana Gollamudi] fixed review comment
a18f917 [Venkata Ramana Gollamudi] avoid constant udf expression re-evaluation - fixes failure due to return iterator and value type mismatch
c46db0f [Venkata Ramana Gollamudi] Removed TestHive reset
4d39105 [Venkata Ramana Gollamudi] Unified inspector creation, style check fixes
f37fd69 [Venkata Ramana Gollamudi] Fixed review comments
47f6365 [Venkata Ramana Gollamudi] fixed test
cb7c61e [Venkata Ramana Gollamudi] Supported ConstantInspector for UDAF Fixed HiveUdaf wrap object issue.
7f94aff [Venkata Ramana Gollamudi] Added foldable support to CreateArray
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants