[SPARK-22274][PYTHON][SQL] User-defined aggregation functions with pandas udf (full shuffle) #19872

icexelloss · 2017-12-04T06:38:35Z

What changes were proposed in this pull request?

Add support for using pandas UDFs with groupby().agg().

This PR introduces a new type of pandas UDF: the group aggregate pandas UDF. This type of UDF defines a transformation from one or more pandas Series to a scalar value. Group aggregate pandas UDFs can be used with groupby().agg(). Note that group aggregate pandas UDFs do not support partial aggregation, i.e., a full shuffle is required.

This PR doesn't support group aggregate pandas UDFs that return ArrayType, StructType or MapType. Support for these types is left for a future PR.
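To make the "Series -> scalar" contract and the full-shuffle requirement concrete, here is a minimal pure-Python sketch. It uses neither Spark nor pandas: `group_aggregate` and `weighted_mean` are hypothetical names, and plain lists stand in for pandas Series. The point it illustrates is that the function sees every value of a group at once, so all rows of a group have to be brought together before it can run.

```python
from collections import defaultdict


def group_aggregate(rows, key, columns, func):
    """Call func once per group, passing the complete columns of that group."""
    # One list per requested column, per group (stand-ins for pandas Series).
    groups = defaultdict(lambda: [[] for _ in columns])
    for row in rows:
        for i, col in enumerate(columns):
            groups[row[key]][i].append(row[col])
    # func receives whole columns and returns a single scalar per group.
    return {k: func(*cols) for k, cols in sorted(groups.items())}


def weighted_mean(v, w):
    # Needs all values of the group; there is no partial/merge step.
    return sum(x * y for x, y in zip(v, w)) / sum(w)


rows = [
    {"id": 1, "v": 1.0, "w": 1.0},
    {"id": 1, "v": 3.0, "w": 3.0},
    {"id": 2, "v": 5.0, "w": 1.0},
]
result = group_aggregate(rows, "id", ["v", "w"], weighted_mean)
print(result)
```

Because `weighted_mean` has no way to combine partial results, this sketch cannot pre-aggregate per partition, which mirrors the limitation stated in the description.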

How was this patch tested?

GroupbyAggPandasUDFTests

icexelloss · 2017-12-04T06:42:56Z

cc @HyukjinKwon @holdenk @ueshin

Passing some basic tests. I will work on this more next week to clean up and add more testing.

SparkQA · 2017-12-04T06:44:09Z

Test build #84414 has finished for PR 19872 at commit 4cfaf0e.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-12-04T06:52:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

@@ -113,6 +113,7 @@ object ExtractPythonUDFs extends Rule[SparkPlan] with PredicateHelper {
  def apply(plan: SparkPlan): SparkPlan = plan transformUp {
    // FlatMapGroupsInPandas can be evaluated directly in python worker
    // Therefore we don't need to extract the UDFs


FlatMapGroupsInPandas and AggregateInPandasExec can be...

SparkQA · 2017-12-04T06:54:37Z

Test build #84415 has finished for PR 19872 at commit a1058b8.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-12-04T06:58:50Z

python/pyspark/sql/group.py

+                jdf = self._jgd.aggInPandas(
+                    _to_seq(self.sql_ctx._sc, [c._jc for c in exprs]))
+            else:
+                jdf = self._jgd.agg(exprs[0]._jc,


What if exprs[n] (n > 0) is a UDFColumn? I think we should make sure that if any column is a UDFColumn, all columns are UDFColumns.

This code is removed.

viirya · 2017-12-04T06:58:59Z

python/pyspark/sql/group.py

-            jdf = self._jgd.agg(exprs[0]._jc,
-                                _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
+            if isinstance(exprs[0], UDFColumn):
+                assert all(isinstance(c, UDFColumn) for c in exprs)


An informative error message would be better, e.g. "all exprs should be UDFColumn".
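A minimal sketch of the kind of check being asked for here, replacing the bare `assert` with an explicit error. The `Column` and `UDFColumn` classes below are simplified stand-ins (the real `UDFColumn` was later removed from this PR), and `check_agg_exprs` is a hypothetical helper, not code from the patch.

```python
class Column(object):
    """Simplified stand-in for pyspark.sql.Column."""

    def __init__(self, name):
        self.name = name


class UDFColumn(Column):
    """Simplified stand-in for the UDFColumn subclass proposed in this diff."""


def check_agg_exprs(exprs):
    """Return True if all exprs are UDF columns, False if none are.

    Raises ValueError with an informative message when the two kinds are mixed.
    """
    udf_flags = [isinstance(c, UDFColumn) for c in exprs]
    if any(udf_flags) and not all(udf_flags):
        offending = [c.name for c, is_udf in zip(exprs, udf_flags) if not is_udf]
        raise ValueError(
            "all exprs should be UDFColumn, but got non-UDF columns: %s"
            % ", ".join(offending))
    return len(exprs) > 0 and all(udf_flags)
```

Unlike an `assert`, the error survives `python -O` and tells the user which expressions caused the problem.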

So I'm a little worried about this change. If other folks have wrapped Java UDAFs (which is reasonable, since there was no other way to make UDAFs in PySpark before this), it seems they won't be able to mix them with pandas UDFs. I'd suggest doing what @viirya suggested below, but emitting a warning instead of a failure until Spark 3.

What do y'all think?

I am still trying to figure out the best way to dispatch this, but either way I think we won't be able to mix Java UDAFs with pandas UDFs.

@holdenk I am not sure what kind of warning message you have in mind. Can you please explain?

Ah, so what you're saying is that you don't support mixing Python and Java UDAFs? That's certainly something which needs to be communicated in both the documentation and the error message.

Is there a reason why we don't support this?

Answered in #19872 (comment)

holdenk

Thanks for working on this. I'm off for a flight to Strata but a few quick questions. I'll read this more over the coming week :)

holdenk · 2017-12-04T10:55:37Z

python/pyspark/sql/udf.py

@@ -56,6 +56,10 @@ def _create_udf(f, returnType, evalType):
    return udf_obj._wrapped()


+class UDFColumn(Column):


Why did we add this new sub-class?

holdenk · 2017-12-04T10:59:09Z

python/pyspark/sql/functions.py

@@ -2070,6 +2070,8 @@ class PandasUDFType(object):

    GROUP_MAP = PythonEvalType.SQL_PANDAS_GROUP_MAP_UDF

+    GROUP_AGG = PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF


So I'm worried that it isn't clear to the user that this will result in a full shuffle with no partial aggregation. Is there a place where we can document this warning?

Added in docstring of pandas_udf and groupby().agg()

HyukjinKwon

I thought @ueshin is working on this BTW.

HyukjinKwon · 2017-12-04T13:05:57Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala

+
+    val argOffsets = inputs.map { input =>
+      input.map { e =>
+          allInputs += e


indentation nit

Fixed. Thanks!

HyukjinKwon · 2017-12-04T13:09:21Z

...lyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/pythonLogicalOperators.scala

+    functionExprs: Seq[Expression],
+    output: Seq[Attribute],
+    child: LogicalPlan
+) extends UnaryNode {


nit:

child: LogicalPlan) extends UnaryNode {

HyukjinKwon · 2017-12-04T13:15:12Z

python/pyspark/sql/udf.py

@@ -56,6 +56,10 @@ def _create_udf(f, returnType, evalType):
    return udf_obj._wrapped()


+class UDFColumn(Column):


BTW, what do you think about setting an attribute (like a flag) in __call__ instead?

HyukjinKwon · 2017-12-04T13:15:51Z

sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

+    val childrenExpressions = exprs.flatMap(expr =>
+      expr.children.map {
+      case ne: NamedExpression => ne
+      case other => Alias(other, other.toString)()


indentation nit

HyukjinKwon · 2017-12-04T13:16:20Z

sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala

+
+    val udfOutputs = exprs.flatMap(expr =>
+      Seq(AttributeReference(expr.name, expr.dataType)())
+    )


I think this could be inlined.

HyukjinKwon · 2017-12-04T13:17:46Z

python/pyspark/sql/tests.py

+class GroupbyAggTests(ReusedSQLTestCase):
+    def assertFramesEqual(self, expected, result):
+        msg = ("DataFrames are not equal: " +
+               ("\n\nExpected:\n%s\n%s" % (expected, expected.dtypes)) +


indentation nit

icexelloss · 2017-12-04T19:12:11Z

I thought @ueshin is working on this BTW.

Oh, I certainly don't want to duplicate @ueshin's work. I am under the impression that @ueshin is working on a two-stage PySpark UDAF with pandas_udf, but I cannot really find the Jira for it...

@ueshin can you point me to what you are working on so I don't overstep?

SparkQA · 2017-12-04T23:29:48Z

Test build #84446 has finished for PR 19872 at commit c1dc543.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class UDFColumn(Column):
case class AggregateInPandas(
case class AggregateInPandasExec(

SparkQA · 2017-12-08T00:39:49Z

Test build #84628 has finished for PR 19872 at commit 3352050.

This patch fails Python style tests.
This patch does not merge cleanly.
This patch adds no public classes.

icexelloss · 2017-12-08T00:44:15Z

I ended up removing the UDFColumn class and using the existing Aggregate logical plan for the pandas group_agg UDF. I also moved the dispatch logic to SparkStrategy. This reuses a lot of the existing Aggregate code and minimizes the code changes needed for the pandas group_agg UDF.

The code works and three tests (test_basic, test_alias, test_multiple) pass now, but the code is kind of messy. I am going on vacation next week, but I will clean up the code and move this PR forward when I get back (Dec 16).

Thanks all.

icexelloss · 2017-12-08T00:47:51Z

And to @holdenk's question: the pandas group_agg UDF fundamentally uses a different physical plan than the existing Java/Scala UDAFs, and therefore it's hard to combine them. I don't know a good way to do this; the closest is maybe to compute the Java/Scala and Python aggregations separately and join them together.
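The "compute separately and join" idea can be sketched in plain Python. This is only an illustration of the workaround mentioned above, not something this PR implements; `builtin_sum`, `python_median`, and `join_on_key` are hypothetical names, with `builtin_sum` standing in for a built-in (incrementally computable) aggregate and `python_median` for a Python aggregate that needs the whole group.

```python
from collections import defaultdict
from statistics import median

rows = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]


def builtin_sum(rows):
    # Stand-in for a built-in aggregate: can be updated one row at a time.
    totals = defaultdict(float)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


def python_median(rows):
    # Stand-in for a Python aggregate: needs all values of a group at once.
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: median(values) for key, values in groups.items()}


def join_on_key(left, right):
    # Both sides were grouped by the same key, so an inner join keeps every group.
    return {k: (left[k], right[k]) for k in sorted(left) if k in right}


joined = join_on_key(builtin_sum(rows), python_median(rows))
print(joined)
```

The cost of this approach is that the input is scanned and grouped twice, once per aggregation path, before the join.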

SparkQA · 2017-12-08T00:54:51Z

Test build #84630 has finished for PR 19872 at commit 184b37f.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-08T01:09:50Z

Test build #84631 has finished for PR 19872 at commit 4332f28.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-08T01:14:59Z

Test build #84632 has finished for PR 19872 at commit 37eff29.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2017-12-08T04:42:23Z

@icexelloss I'm sorry for the late response.
Actually, I tried to implement prototypes of a Pandas UDAF with partial aggregation that can be combined with existing aggregate functions, but they are still quite complicated (ueshin#2, ueshin#3, ueshin#4). I have been thinking about an easier way to achieve that, but haven't found one yet.
I've not looked into this PR yet, but I guess we can start with this PR and pick some functionality from my prototypes if needed.

ueshin · 2017-12-11T10:10:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala

@@ -32,7 +31,5 @@ case class PythonUDF(
    evalType: Int)
  extends Expression with Unevaluable with NonSQLExpression with UserDefinedExpression {

-  override def toString: String = s"$name(${children.mkString(", ")})"


Why was this removed?

Whoops, my bad, adding back

ueshin · 2017-12-11T10:21:32Z

python/pyspark/sql/tests.py

@@ -4016,6 +4016,124 @@ def test_unsupported_types(self):
            with self.assertRaisesRegexp(Exception, 'Unsupported data type'):
                df.groupby('id').apply(f).collect()

+@unittest.skipIf(not _have_pandas or not _have_arrow, "Pandas or Arrow not installed")
+class GroupbyAggTests(ReusedSQLTestCase):
+    def assertFramesEqual(self, expected, result):


nit: how about making this the common method?

ueshin · 2017-12-11T10:32:36Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala

+      val joined = new JoinedRow
+      val resultProj = UnsafeProjection.create(output, output)
+
+      columnarBatchIter.map(_.rowIterator.next()).map{ outputRow =>


nit: columnarBatchIter.flatMap(_.rowIterator)?
nit: style, add a space between map and { outputRow =>.

columnarBatchIter.flatMap(_.rowIterator)

This doesn't work because rowIterator returns a Java iterator, not a Scala iterator. We can convert it, but I am not sure that's better. @ueshin, if you prefer the flatMap version, I can change it.

Sorry, I meant columnarBatchIter.flatMap(_.rowIterator.asScala). I'd prefer this one.
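The difference between the two forms discussed here can be shown with a small Python analogy, where lists stand in for columnar batches and their row iterators. It is an assumption of this sketch that a batch may hold more than one row; the thread does not say whether that can happen for this operator.

```python
from itertools import chain

# Each inner list stands in for one columnar batch and its rows.
batches = [["row0"], ["row1", "row2"], ["row3"]]

# Analogue of columnarBatchIter.map(_.rowIterator.next()):
# takes only the first row of every batch.
first_only = [next(iter(batch)) for batch in batches]

# Analogue of columnarBatchIter.flatMap(_.rowIterator.asScala):
# yields every row of every batch, in order.
flattened = list(chain.from_iterable(batches))

print(first_only)
print(flattened)
```

When every batch holds exactly one row the two forms agree; the flattened form simply does not depend on that.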

ueshin · 2017-12-11T10:45:01Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

@@ -48,9 +48,26 @@ object ExtractPythonUDFFromAggregate extends Rule[LogicalPlan] {
    }.isDefined
  }

+  private def isPandasGroupAggUdf(expr: Expression): Boolean = expr match {
+      case _ @ PythonUDF(_, _, _, _, PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF ) => true


We don't need _ @ here.
nit: remove extra space after SQL_PANDAS_GROUP_AGG_UDF.

ueshin · 2017-12-11T10:47:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

+    if (hasPandasGroupAggUdf(agg)) {
+      Aggregate(agg.groupingExpressions, agg.aggregateExpressions, agg.child)
+    } else {
+


nit: style, we need indent for this block.

ueshin · 2017-12-11T10:48:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/PythonUDF.scala

@@ -15,10 +15,9 @@
 * limitations under the License.
 */

-package org.apache.spark.sql.execution.python
+package org.apache.spark.sql.catalyst.expressions


Do we need to move package to catalyst?

We do. This is similar to https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala

The reason is we need to access the class PythonUDF in analyzer.

I see, thanks!

SparkQA · 2017-12-19T22:20:38Z

Test build #85136 has finished for PR 19872 at commit ab91314.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-20T01:35:31Z

Test build #85137 has finished for PR 19872 at commit 1a197b7.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-12-20T01:39:19Z

Test build #85138 has finished for PR 19872 at commit 62c8f00.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2017-12-20T04:04:37Z

@ramacode2014 Hi, I'm not sure why you received notifications from this PR, but I guess you can unsubscribe by the "Unsubscribe" button in the right column of this page. Sorry for the inconvenience. Thanks!

ueshin · 2017-12-20T05:10:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

-            alias.toAttribute
+
+    if (hasPandasGroupAggUdf(agg)) {
+      Aggregate(agg.groupingExpressions, agg.aggregateExpressions, agg.child)


Do we need to copy?

I am not sure. But I added copy in ExtractGroupAggPandasUDFFromAggregate similar to existing rules.

ueshin · 2017-12-20T05:20:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

+  }
+
+  private def hasPandasGroupAggUdf(agg: Aggregate): Boolean = {
+    val actualAggExpr = agg.aggregateExpressions.drop(agg.groupingExpressions.length)


Do we need to drop the grouping expressions?
If we do need to, we should drop them only if conf.dataFrameRetainGroupColumns == true; otherwise aggregateExpressions doesn't contain groupingExpressions, right?

This is fixed. Added test_retain_grouping_columns test
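The concern raised above can be sketched as follows. This is a hypothetical Python rendering of the logic, not the Scala code that was merged; it assumes, as the review comment implies, that the grouping columns are prepended to the aggregate expressions only when grouping columns are retained.

```python
def actual_agg_exprs(grouping_exprs, aggregate_exprs, retain_group_columns):
    """Return the aggregate expressions without the leading grouping columns."""
    if retain_group_columns:
        # Grouping columns were prepended, so skip exactly that many entries.
        return list(aggregate_exprs[len(grouping_exprs):])
    # Nothing was prepended; dropping here would discard real aggregates.
    return list(aggregate_exprs)


retained = actual_agg_exprs(["id"], ["id", "mean_udf(v)"], True)
not_retained = actual_agg_exprs(["id"], ["mean_udf(v)"], False)
print(retained, not_retained)
```

An unconditional drop would return an empty list in the second case, so the pandas UDF would never be detected.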

ueshin · 2017-12-20T06:21:10Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala

+    val allInputs = new ArrayBuffer[Expression]
+    val dataTypes = new ArrayBuffer[DataType]
+
+    allInputs.appendAll(groupingExpressions)


I guess we don't need to append groupingExpressions. Seems like they are dropped later.

This is fixed.

ueshin · 2017-12-20T06:24:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/AggregateInPandasExec.scala

+        .compute(projectedRowIter, context.partitionId(), context)
+
+      val joined = new JoinedRow
+      val resultProj = UnsafeProjection.create(output, output)


We need to handle resultExpressions for the following cases:

def test_result_expressions(self):
    import numpy as np
    from pyspark.sql.functions import lit, mean, pandas_udf, PandasUDFType

    df = self.data

    @pandas_udf('double', PandasUDFType.GROUP_AGG)
    def mean_udf(v, w):
        return np.average(v, weights=w)

    result1 = (df.groupby('id')
               .agg(mean_udf(df.v, lit(1.0)) + 1)
               .sort('id')
               .toPandas())

    expected1 = (df.groupby('id')
                 .agg(mean(df.v) + 1)
                 .sort('id')
                 .toPandas())

    self.assertPandasEqual(expected1, result1)

Thanks @ueshin for reminding me of this. Just want to clarify the semantics:

Does

.agg(mean(df.v) + 1)

mean "compute the mean of df.v and add one to it as the output", i.e., the same as

.agg(mean(df.v).alias('mean'))
.withColumn('mean', col('mean') + 1)

?

Yes, I think that is the behavior. I guess the plan could be different, though.
We can compare against the behavior of non-UDF aggregation and follow that.

I added an ExtractGroupAggPandasUDFFromAggregate rule to deal with this.
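The semantics agreed on in this exchange, aggregate first and then evaluate the surrounding result expression on the aggregated value, can be sketched in plain Python. The names `aggregate_then_project` and `mean` below are hypothetical and the sketch says nothing about how the actual rule rewrites the plan.

```python
from collections import defaultdict


def aggregate_then_project(rows, agg_func, result_expr):
    """Evaluate result_expr(agg_func(group)) in two explicit stages."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    # Stage 1: one aggregated value per group (the UDF part).
    aggregated = {key: agg_func(values) for key, values in groups.items()}
    # Stage 2: the surrounding expression, applied to the aggregated value.
    return {key: result_expr(agg) for key, agg in sorted(aggregated.items())}


def mean(values):
    return sum(values) / len(values)


rows = [(1, 1.0), (1, 3.0), (2, 4.0)]
result = aggregate_then_project(rows, mean, lambda m: m + 1)
print(result)
```

Here `lambda m: m + 1` plays the role of the `+ 1` in `.agg(mean_udf(...) + 1)`.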

ueshin · 2017-12-20T06:54:46Z

sql/core/src/main/scala/org/apache/spark/sql/execution/python/ExtractPythonUDFs.scala

+    actualAggExpr.exists(isPandasGroupAggUdf)
+  }
+
+


nit: remove an extra line.

SparkQA · 2017-12-20T07:35:09Z

Test build #85152 has finished for PR 19872 at commit ea5d6f3.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

icexelloss · 2017-12-27T21:59:37Z

@ueshin I pushed some more changes to address your comments. There is one regression in the existing test SQLTests.test_udf_with_aggregate_function. I will try to fix it tomorrow.

SparkQA · 2017-12-28T00:18:44Z

Test build #85442 has finished for PR 19872 at commit 99367a6.

This patch fails PySpark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2017-12-28T01:00:05Z

Test build #85446 has finished for PR 19872 at commit 66a31f9.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-18T17:00:01Z

Test build #86345 has finished for PR 19872 at commit 17fad5c.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-18T17:02:50Z

Test build #86344 has finished for PR 19872 at commit a94b146.

This patch fails to build.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2018-01-18T20:58:35Z

Test build #86346 has finished for PR 19872 at commit 0fec5cf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-18T22:24:08Z

Test build #86350 has finished for PR 19872 at commit 4d22107.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

icexelloss · 2018-01-19T14:29:56Z

@ueshin I think all comments are addressed. Can you take a final look? Thanks!

ueshin

We also need to add PythonEvalType.SQL_PANDAS_GROUP_AGG_UDF to udf.py#L40-L41 to pass require_minimum_pyarrow_version().

LGTM except for the comments.

Btw, I'm afraid we shouldn't merge this into branch-2.3 since we are already close to the 2.3 release.
WDYT? @HyukjinKwon @cloud-fan

ueshin · 2018-01-22T09:44:10Z

python/pyspark/sql/functions.py

+    3. GROUP_AGG
+
+       A group aggregate UDF defines a transformation: One or more `pandas.Series` -> A scalar
+       The `returnType` should be a primitive data type, e.g, :class:`DoubleType`.


very small nit: e.g. instead of e.g,

Fixed. Thanks!

HyukjinKwon · 2018-01-22T13:06:56Z

+1 for master-only. We can still cherry-pick and backport later if needed, even after this gets merged. As a reminder, we should complete the doc #19575 too.

icexelloss · 2018-01-22T16:00:42Z

Addressed latest comments. Yeah I think master only is fine.

SparkQA · 2018-01-22T19:20:20Z

Test build #86487 has finished for PR 19872 at commit 91885e5.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-22T23:52:34Z

Test build #86492 has finished for PR 19872 at commit cc659bc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-01-23T05:11:26Z

Thanks! merging to master.

icexelloss · 2018-01-23T15:04:31Z

Thanks all for review!

yhuai · 2018-01-31T23:28:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/patterns.scala

@@ -199,7 +200,7 @@ object ExtractFiltersAndInnerJoins extends PredicateHelper {
 object PhysicalAggregation {
  // groupingExpressions, aggregateExpressions, resultExpressions, child
  type ReturnType =
-    (Seq[NamedExpression], Seq[AggregateExpression], Seq[NamedExpression], LogicalPlan)
+    (Seq[NamedExpression], Seq[Expression], Seq[NamedExpression], LogicalPlan)


@icexelloss Thank you for this contribution! I just came across the change in this file. I am not sure that changing the type here is the best option. The reason is that whenever we use this PhysicalAggregation rule, we have to check the instance type of those aggregate expressions and cast them. To me, it seems better to leave this rule untouched and create a new rule just for Python UDAF. What do you think?

(maybe you and reviewers already discussed it. If so, can you point me to the discussion?)

Thank you!

Hi @yhuai,

You bring up a good point. I agree that ideally we should avoid doing this. When I was making the change, I found that the implemented solution results in the least amount of duplicate code, because a lot of logic is shared between AggregateExpression and Python UDF, but the downside is exactly what you mentioned.

One alternative is to create new rules for Python UDAF, but my concern is that this could result in quite a bit of code duplication. Maybe there is a way to avoid code duplication and keep the type safety; I am happy to explore that option. (Maybe create a parent class for AggregateExpression and Python UDAF?)

I prefer that we try out using a new rule. We can create utility functions to reuse code. Will you have a chance to try it out?

@yhuai Yeah I can certainly try it out. Created https://issues.apache.org/jira/browse/SPARK-23302 to track.

I assume this is not urgent?

It will be good to try it out soon. But it is not urgent.

yhuai · 2018-02-01T03:46:13Z

python/pyspark/sql/tests.py

+        from pyspark.sql.functions import pandas_udf, PandasUDFType
+
+        with QuietTest(self.sc):
+            with self.assertRaisesRegex(NotImplementedError, 'not supported'):


@icexelloss This line does not compile (we need assertRaisesRegexp). Can you file a PR to fix it? Thanks! Meanwhile, we will look into the Jenkins setup and see why the test was not exercised.

I'll file the follow-up pr to fix it soon.

I filed #20467. Thanks.

@yhuai, if you meant not running tests in Python 2, this link might be helpful. Let me leave it just in case - #19884 (comment).

@ueshin Thanks for fixing this. (I am late to the party)

… of `assertRaisesRegex`.

## What changes were proposed in this pull request?

This is a follow-up PR of apache#19872, which uses `assertRaisesRegex`. That method doesn't exist in Python 2, so some tests fail when run in a Python 2 environment. Unfortunately, we missed it because the Python 2 environment of the PR builder currently doesn't have proper versions of pandas or pyarrow, so the tests were skipped.

This PR uses `assertRaisesRegexp` instead of `assertRaisesRegex`.

## How was this patch tested?

Tested manually in my local environment.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes apache#20467 from ueshin/issues/SPARK-22274/fup1.
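The follow-up switched the test to `assertRaisesRegexp`, which was the name available in Python 2. For code that has to run across versions, one alternative is a small compatibility shim; the sketch below is an assumption of this note, not what the follow-up PR did, and `CompatTestCase` and `ExampleTest` are hypothetical names. It relies on `assertRaisesRegex` having been added in Python 3.2 and on the old `assertRaisesRegexp` alias having been removed in Python 3.12.

```python
import unittest


class CompatTestCase(unittest.TestCase):
    # Python 2 only provides assertRaisesRegexp; recent Python 3 only provides
    # assertRaisesRegex. Alias the new name to the old one when it is missing.
    if not hasattr(unittest.TestCase, "assertRaisesRegex"):
        assertRaisesRegex = unittest.TestCase.assertRaisesRegexp


class ExampleTest(CompatTestCase):
    def test_not_supported(self):
        with self.assertRaisesRegex(NotImplementedError, "not supported"):
            raise NotImplementedError("MapType is not supported")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Tests written against `CompatTestCase` can then use the newer method name everywhere.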

viirya reviewed Dec 4, 2017


holdenk reviewed Dec 4, 2017


HyukjinKwon reviewed Dec 4, 2017


icexelloss force-pushed the SPARK-22274-groupby-agg branch from a1058b8 to c1dc543 Compare December 4, 2017 23:23

icexelloss force-pushed the SPARK-22274-groupby-agg branch from 3352050 to 184b37f Compare December 8, 2017 00:48

ueshin reviewed Dec 11, 2017


ueshin reviewed Dec 20, 2017


icexelloss force-pushed the SPARK-22274-groupby-agg branch from 99367a6 to 66a31f9 Compare December 27, 2017 21:55

icexelloss added 4 commits January 18, 2018 11:46

Minor style fix

cf9e7dc

Minor style fix

6d505d3

Revert accidental removal

8d2d943

Fix docs. Address PR comments.

17fad5c

icexelloss force-pushed the SPARK-22274-groupby-agg branch from a94b146 to 17fad5c Compare January 18, 2018 16:47

Fix SparkStrategies

0fec5cf

Add a manual test

4d22107

ueshin reviewed Jan 22, 2018


Address comments

91885e5

Add doctest SKIP

cc659bc

asfgit closed this in b2ce17b Jan 23, 2018

icexelloss deleted the SPARK-22274-groupby-agg branch January 26, 2018 21:05

icexelloss restored the SPARK-22274-groupby-agg branch January 26, 2018 21:05

yhuai reviewed Jan 31, 2018


yhuai reviewed Feb 1, 2018


ueshin mentioned this pull request Feb 1, 2018

[SPARK-22274][PYTHON][SQL][FOLLOWUP] Use assertRaisesRegexp instead of assertRaisesRegex. #20467

Closed
