UDF using DataFrame.apply #65
There are some problems with using it directly.

Actually, I think we could hack this via …

Would groupby-apply introduce an exchange/shuffle on every …?

Yeah, we should avoid that. With Spark 3.0, we can avoid a shuffle with …
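For context, one Spark 3.0 API that runs a pandas function per partition with no shuffle is `mapInPandas` (whether that is the API the comment elides is an assumption). The per-partition function is plain pandas, so it can be sketched and checked without a cluster:

```python
# Sketch, assuming Spark 3.0+ and its mapInPandas API. The per-partition
# function receives an iterator of pandas DataFrames and yields DataFrames.
import pandas as pd

def double_v(batches):
    # batches: iterator of pandas DataFrames (one or more batches per partition)
    for pdf in batches:
        yield pdf.assign(v=pdf["v"] * 2)

# On a Spark 3.0 DataFrame `df` with schema (id long, v double), this would be:
#   df.mapInPandas(double_v, schema="id long, v double")
# and runs per partition without exchanging data across the cluster.

# Local check, faking a single partition's batch iterator:
result = pd.concat(double_v(iter([pd.DataFrame({"id": [1, 2], "v": [1.0, 3.0]})])))
print(result["v"].tolist())  # expect [2.0, 6.0]
```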
This PR proposes to implement `DataFrame.apply` with both `axis` 0 and 1. Note that `DataFrame.apply(..., axis=1)` with global aggregations is impossible. It can be tested with the examples below:

```python
import numpy as np
import databricks.koalas as ks

df = ks.DataFrame([[4, 9]] * 10, columns=['A', 'B'])

df.apply(np.sqrt, axis=0)

def sqrt(x) -> ks.Series[float]:
    return np.sqrt(x)

df.apply(sqrt, axis=0)

df.apply(np.sum, axis=1)

def summation(x) -> int:
    return np.sum(x)

df.apply(summation, axis=1)
```

Basically, the approach is to use a grouped map Pandas UDF, grouping by partition:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def func(pdf):
    return pdf.apply(...)

df.groupby(F.spark_partition_id()).apply(func).show()
```

Resolves #1228
Resolves #65
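For reference (an illustrative sketch, not part of the PR), the same `apply` calls behave as follows in plain pandas, whose semantics Koalas mirrors:

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame([[4, 9]] * 3, columns=["A", "B"])

# axis=0 applies the function per column; np.sqrt is element-wise here,
# so the result keeps the frame's shape.
sqrt_df = pdf.apply(np.sqrt, axis=0)
print(sqrt_df["A"].tolist())  # expect [2.0, 2.0, 2.0]

# axis=1 applies the function per row; np.sum reduces each row to a scalar,
# which is why row-wise apply cannot express a *global* aggregation.
row_sums = pdf.apply(np.sum, axis=1)
print(row_sums.tolist())  # expect [13, 13, 13]
```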
These are the patterns that I have seen with the apply() function:
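The commenter's original list is not preserved in this excerpt; purely for illustration, some commonly seen `DataFrame.apply` patterns in plain pandas are:

```python
# Illustrative only: common pandas DataFrame.apply patterns
# (not the commenter's original list, which was elided).
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# 1. Column-wise reduction: one scalar per column.
col_max = pdf.apply(np.max, axis=0)  # A -> 3, B -> 6

# 2. Row-wise reduction: one scalar per row.
row_sum = pdf.apply(np.sum, axis=1)  # [5, 7, 9]

# 3. Column-wise transform: returns a same-shaped frame.
shifted = pdf.apply(lambda col: col + 1, axis=0)

# 4. Row-wise function returning a Series: expands into new columns.
spread = pdf.apply(lambda row: pd.Series({"lo": row.min(), "hi": row.max()}),
                   axis=1)
```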