UDF using DataFrame.apply #65

Closed
AbdealiLoKo opened this issue Apr 9, 2019 · 4 comments · Fixed by #1259
Labels
enhancement (New feature or request)

Comments

@AbdealiLoKo (Contributor)

These are the patterns that I have seen with the apply() function:

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7], 'b': [7, 6, 5, 4, 3, 2, 1]})

# Option 1: Use a series and apply on it to run a UDF (Technically a Series.apply)
df['c'] = df['a'].apply(lambda x: x*2)

# Option 2: Use the entire row from a dataframe
df['d'] = df.apply(lambda x: x.a*x.b, axis=1)

# Option 3: Use some columns in a row
df['e'] = df[['a', 'b']].apply(lambda x: x.a*x.b, axis=1)

# Option 4: Apply aggregates on every column to create a new dataframe
new_df = df.apply(lambda x: x.sum())

# Option 5: Apply aggregates but return multiple values - here it pads the shorter columns with NaNs
new_df = df.apply(lambda x: x[:x[0]])
@HyukjinKwon (Member)

There are some problems with using it directly.

  1. The current Scalar UDF doesn't support aggregation across the whole data. If we do the same thing, it will aggregate over each batch unit that's used internally (the Arrow batch); see the sketch below.

  2. If we restrict it to 1-to-1, then we can consider transform instead. But we shouldn't forget to check that the input and output lengths are the same.

  3. When the axis is different, it will need a different execution style (row by row), which will probably look like the d397c1a approach.
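
A minimal sketch of point 1, assuming a Spark 2.4-style `SCALAR` pandas UDF (`batchwise_sum` is a hypothetical name, not an existing API): the UDF only ever sees one Arrow batch at a time, so any aggregation inside it is computed per batch, not across the whole column.

```python
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType

@pandas_udf("long", PandasUDFType.SCALAR)
def batchwise_sum(s):
    # s is only the current Arrow batch, not the whole column,
    # so this "sum" is a per-batch sum
    return pd.Series([s.sum()] * len(s))

df = spark.range(100000)
# with the default Arrow batch size, this typically shows several distinct sums
df.select(batchwise_sum(col("id"))).distinct().show()
```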

@HyukjinKwon (Member)

Actually, I think we could hack this via df.groupby(F.spark_partition_id()).apply(pandas_udf). Let me take a look soon.

@chunyang

Would groupby-apply introduce an exchange/shuffle on every apply? If so, it seems like that would be an expensive operation.

@HyukjinKwon (Member)

Yeah, we should avoid that. With Spark 3.0, we can avoid the shuffle with the mapInPandas API.
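
For reference, a minimal sketch of what this could look like with the Spark 3.0+ mapInPandas API (`double_v` is a hypothetical function, not what the PR ended up using): the function receives an iterator of pandas DataFrames, one per Arrow batch, and no shuffle is required.

```python
import pandas as pd

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def double_v(batches):
    # batches is an iterator of pandas DataFrames, one per Arrow batch
    for pdf in batches:
        pdf["v"] = pdf["v"] * 2
        yield pdf

df.mapInPandas(double_v, schema="id long, v double").show()
```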

@ueshin added the enhancement (New feature or request) label on Jan 27, 2020
HyukjinKwon added a commit that referenced this issue on Feb 11, 2020
This PR proposes to implement `DataFrame.apply` with both `axis=0` and `axis=1`. Note that `DataFrame.apply(..., axis=1)` with global aggregations is impossible.

It can be tested with the examples below:

```python
import numpy as np
import databricks.koalas as ks

df = ks.DataFrame([[4, 9]] * 10, columns=['A', 'B'])

# axis=0: column-wise apply with a plain NumPy function
df.apply(np.sqrt, axis=0)

# axis=0: column-wise apply with a return-type-annotated function
def sqrt(x) -> ks.Series[float]:
    return np.sqrt(x)
df.apply(sqrt, axis=0)


# axis=1: row-wise apply with a plain NumPy function
df.apply(np.sum, axis=1)

# axis=1: row-wise apply with a return-type-annotated function
def summation(x) -> int:
    return np.sum(x)

df.apply(summation, axis=1)
```
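
For comparison, the same operations on a plain pandas DataFrame (this is ordinary pandas, shown only to illustrate the expected results):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame([[4, 9]] * 10, columns=['A', 'B'])

print(pdf.apply(np.sqrt, axis=0))  # every row is [2.0, 3.0]
print(pdf.apply(np.sum, axis=1))   # every value is 13
```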

Basically, the approach uses a grouped map Pandas UDF, grouping by partitions.

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def func(pdf):
    return pdf.apply(...)

# grouping by spark_partition_id() treats each physical partition as one group
df.groupby(F.spark_partition_id()).apply(func).show()
```

Resolves #1228
Resolves #65