
[SPARK-37491][PYTHON] Fix Series.asof for unsorted values #35191

Closed
wants to merge 14 commits

Conversation

pralabhkumar
Contributor

@pralabhkumar pralabhkumar commented Jan 13, 2022

What changes were proposed in this pull request?

Fix Series.asof when the values of the series are not sorted.

Before

>>> import pandas as pd
>>> from pyspark import pandas as ps
>>> import numpy as np
>>> pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas")
>>> psser = ps.from_pandas(pser)
>>> psser.asof([5, 25])
5     NaN
25    2.0
Name: Koalas, dtype: float64

>>> pser = pd.Series([4, np.nan, np.nan, 2], index=[10, 20, 30, 40], name="Koalas")
>>> psser = ps.from_pandas(pser)
>>> psser.asof([5, 100])
5      NaN
100    4.0

After

>>> import pandas as pd
>>> from pyspark import pandas as ps
>>> import numpy as np
>>> pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas")
>>> psser = ps.from_pandas(pser)
>>> psser.asof([5, 25])
5     NaN
25    1.0
Name: Koalas, dtype: float64

>>> pser = pd.Series([4, np.nan, np.nan, 2], index=[10, 20, 30, 40], name="Koalas")
>>> psser = ps.from_pandas(pser)
>>> psser.asof([5, 100])
5      NaN
100    2.0

Why are the changes needed?

There is a bug in Series.asof when the series is not sorted.

Does this PR introduce any user-facing change?

Yes, users will see behavior exactly matching pandas.

How was this patch tested?

unit tests
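For reference, a minimal parity-style check in the spirit of those tests (a sketch, not the exact test code; the real tests compare against pandas with self.assert_eq in the pandas-on-Spark test suite):

import numpy as np
import pandas as pd
from pyspark import pandas as ps

# Same data as the Before/After examples above.
pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas")
psser = ps.from_pandas(pser)

# With the fix, both return NaN for 5 and 1.0 for 25.
print(pser.asof([5, 25]))
print(psser.asof([5, 25]))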

@pralabhkumar
Contributor Author

@itholic @HyukjinKwon Please review. I have added details in the JIRA about the approach.

@AmplabJenkins

Can one of the admins verify this patch?

@HyukjinKwon
Member

@pralabhkumar sorry do you mind rebasing and syncing to the latest master? Seems like something went wrong in CI.

Member

@Yikun Yikun left a comment


Yep, just FYI, here is the original CI link: https://github.com/pralabhkumar/spark/runs/4801311278

.withColumn("identifier", col("values.identifier"))
.withColumn("value", col("values.Koalas"))
.drop("values")
.na.drop(subset="value")
Member

@Yikun Yikun Jan 14, 2022


BTW, there was a complaint in the lint CI; it looks like a bug in the type hint of na.drop. I submitted the fix in #35201.

You could just use dropna(subset="value") instead of .na.drop(subset="value") as a workaround, and I also think dropna is more reasonable and simpler here.
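For reference, a tiny self-contained illustration of the two equivalent spellings (the DataFrame below is sample data, not the frame from this PR):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1, 2.0), (2, None)], ["identifier", "value"])

# Both drop the row whose "value" is null; only the type hint of
# na.drop was the issue flagged by the lint CI.
via_na_drop = sdf.na.drop(subset="value")
via_dropna = sdf.dropna(subset="value")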

Contributor Author

Thanks for the comment, I'll do the same.

@pralabhkumar
Contributor Author

@HyukjinKwon Will rebase and sync to latest master.

HyukjinKwon pushed a commit that referenced this pull request Jan 17, 2022
### What changes were proposed in this pull request?
Fix drop subset inline type hint

### Why are the changes needed?
it should be the same as `DataFrame.dropna`:
https://github.com/apache/spark/blob/90003398745bfee78416074ed786e986fcb2c8cd/python/pyspark/sql/dataframe.py#L2359

See also: #35191 (comment)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #35201 from Yikun/SPARK-36885-FOLLOWUP.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@itholic
Contributor

itholic commented Jan 17, 2022

Could you update the PR description with a Before & After example? e.g. #34931

@pralabhkumar
Contributor Author

Could you update the PR description with a Before & After example? e.g. #34931

Done
@itholic

@itholic
Contributor

itholic commented Jan 17, 2022

Thanks! Let me leave some comments after taking deeper look tomorrow.

Btw, you can make the code example prettier with the python keyword on the fenced code block as below, just FYI :-)

[Screenshot of the comment source: the example wrapped in a ```python fenced code block]

this renders as:

import pandas as pd
from pyspark import pandas as ps
import numpy as np
pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas")
psser = ps.from_pandas(pser)
psser.asof([5, 25])
5     NaN
25    2.0
Name: Koalas, dtype: float64

pser = pd.Series([4, np.nan, np.nan, 2], index=[10, 20, 30, 40], name="Koalas")
psser = ps.from_pandas(pser)
psser.asof([5, 100])

5      NaN
100    4.0

@pralabhkumar
Contributor Author

@itholic, please let me know your review comments.

Contributor

@itholic itholic left a comment


Can you add some more comments for each step to make the code more understandable?

It's hard to track each sdf.

Maybe Series.argsort is a good example: https://github.com/apache/spark/blob/master/python/pyspark/pandas/series.py#L5612-L5680

@@ -5228,22 +5228,62 @@ def asof(self, where: Union[Any, List]) -> Union[Scalar, "Series"]:
where = [where]
index_scol = self._internal.index_spark_columns[0]
index_type = self._internal.spark_type_for(index_scol)
from pyspark.sql.functions import struct, lit, explode, col, row_number
Contributor


pyspark.sql.functions is already imported as F. I think we can just reuse it.

@@ -2071,6 +2071,18 @@ def test_asof(self):
with ps.option_context("compute.eager_check", False):
self.assert_eq(psser.asof(20), 4.0)

pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas")
psser = ps.from_pandas(pser)
self.assert_eq(psser.asof([5, 25]), pser.asof([5, 25]))
Contributor


How about psser.asof([25, 25])? It might fail.

F.when(
index_scol <= SF.lit(index).cast(index_type),
struct(
lit(column_prefix_constant + str(index)).alias("identifier"),
Contributor


Since where can contain the same value, the column name here can be duplicated.

index_scol <= SF.lit(index).cast(index_type),
struct(
lit(column_prefix_constant + str(index)).alias("identifier"),
self.spark.column.alias("Koalas"),
Contributor


Do we have to use an alias here, instead of just using the existing column name?

At least, the alias "Koalas" looks a bit weird to me here.

Contributor Author


Changed the alias name. It was required to alias the column so that I can easily refer to it later on.

@pralabhkumar
Contributor Author

pralabhkumar commented Jan 20, 2022

@itholic Thanks for the comments, working on it.

@pralabhkumar
Contributor Author

@itholic Addressed the review comments, please review

@itholic
Contributor

itholic commented Jan 27, 2022

@pralabhkumar Sorry for the delay. I've been busy for a couple of days. Will take a closer look soon 🙏

@pralabhkumar
Contributor Author

@itholic Please review the PR. Thanks for your time.

@pralabhkumar
Contributor Author

@itholic Gentle ping

@itholic
Contributor

itholic commented Feb 2, 2022

@pralabhkumar Just came back from vacation. Will leave the comment very soon! 🙏

Comment on lines 5238 to 5244
F.when(
index_scol <= SF.lit(index).cast(index_type),
F.struct(
F.lit(column_prefix_constant + str(index) + "_" + str(idx)).alias("identifier"),
self.spark.column.alias("col_value"),
),
).alias(column_prefix_constant + str(index) + "_" + str(idx))
Contributor

@itholic itholic Feb 3, 2022


I think maybe we can use F.last with ignorenulls=True instead of F.max ??
(https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.last.html)

e.g.

        cond = [
            F.last(F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column), ignorenulls=True)
            for index in where
        ]

It returns the last non-null value from the given column, and it seems like you want to do the same thing in your fix.

If this works, I think we simply need to fix only this part.
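If it helps, here is a standalone sketch of why F.last(..., ignorenulls=True) behaves like asof (the data and column names are illustrative, not the PR's actual frame): rows past the requested index are turned into nulls by F.when, and F.last then returns the last remaining non-null value.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(10, 2.0), (20, 1.0), (30, None), (40, 4.0)], ["idx", "value"]
)

# "asof 25": rows with idx > 25 become null via F.when; the last non-null
# value among the remaining rows (1.0 at idx 20) wins. Row order here is only
# illustrative; the real code relies on the frame's existing ordering.
sdf.select(
    F.last(F.when(F.col("idx") <= 25, F.col("value")), ignorenulls=True)
).show()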

Contributor Author

@pralabhkumar pralabhkumar Feb 4, 2022


Yes @itholic, this is working (since index_level_0 is sorted). However, the test case with psser.asof([25, 25]) is failing due to ambiguous duplicate columns in psdf = ps.DataFrame(sdf). Therefore, in order to pass the above test case, below is the change.

cond = [
            F.last(
                F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column),
                ignorenulls=True,
            ).alias(column_prefix_constant + str(index) + "_" + str(idx))
            for idx, index in enumerate(where)
        ]

Then

with ps.option_context("compute.default_index_type", "distributed", "compute.max_rows", 1):
            psdf = ps.DataFrame(sdf)  # type: DataFrame
            df = pd.DataFrame(psdf.transpose().values, columns=[self.name], index=where)
            return df[df.columns[0]]

Please let me know if it's okay, and I'll update the PR.

Contributor

@itholic itholic Feb 7, 2022


I think we might want to return the pandas-on-Spark Series rather than a pandas Series, and should leverage the pandas DataFrame directly only when where has duplicate items.

So, how about this ??

        cond = [
            F.last(
                F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column),
                ignorenulls=True,
            )
            for index in where
        ]

Then

        # The data is expected to be small so it's fine to transpose/use default index.
        with ps.option_context("compute.default_index_type", "distributed", "compute.max_rows", 1):
            if len(where) == len(set(where)):
                psdf: DataFrame = DataFrame(sdf)
                psdf.columns = pd.Index(where)
                return first_series(psdf.transpose()).rename(self.name)
            else:
                # If `where` has duplicate items, leverage the pandas directly
                # since pandas API on Spark doesn't support the duplicate column name.
                pdf: pd.DataFrame = sdf.limit(1).toPandas()
                pdf.columns = pd.Index(where)
                return first_series(DataFrame(pdf.transpose())).rename(self.name)

??

Contributor Author


@itholic Yes, I think this is a good suggestion. Please let me know if I should update this PR, or if you are planning to create a new PR with the suggested code changes.

Contributor


It's okay to just update here! :-)

Contributor

@itholic itholic left a comment


Otherwise, looks pretty good.


pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas")
psser = ps.from_pandas(pser)
self.assert_eq(psser.asof([25, 25]), pser.asof([25, 25]))
Contributor


Can we also test the string & timestamp index ??

e.g.

>>> pser = pd.Series([2, 1, np.nan, 4], index=['a', 'b', 'c', 'd'])
>>> pser.asof(['a', 'd'])
a    2.0
d    4.0
dtype: float64
>>> pser = pd.Series([2, 1, np.nan, 4], index=[pd.Timestamp(2020, 1, 1), pd.Timestamp(2020, 2, 2), pd.Timestamp(2020, 3, 3), pd.Timestamp(2020, 4, 4)])
>>> pser.asof([pd.Timestamp(2020, 1, 1), pd.Timestamp(2020, 2, 4)])
2020-01-01    2.0
2020-02-04    1.0
dtype: float64

Contributor Author

@pralabhkumar pralabhkumar Feb 11, 2022


@itholic For the timestamp, it's throwing an exception: since the Timestamp name is tuple-like, pd.Index throws the assertion error assert is_name_like_tuple(column_label, check_type=True), column_label. However, the earlier (suggested) code is working fine. Currently working on resolving it.

Contributor Author


Had to revert to the earlier code because of this issue.


pser = pd.Series([2, 1, np.nan, 4], index=[10, 20, 30, 40], name="Koalas")
psser = ps.from_pandas(pser)
self.assert_eq(psser.asof([25, 25]), pser.asof([25, 25]))
Contributor


And can we also test np.nan as where?

>>> pser.asof([np.nan, np.nan])
NaN    3.0
NaN    3.0
dtype: float64

Seems like this case is only supported for a numeric type index.

@pralabhkumar
Contributor Author

@itholic, please review the PR.

Comment on lines 5257 to 5263
if len(original_where) > 0:
df = pd.DataFrame(
psdf.transpose().values, columns=[self.name], index=original_where
)
else:
df = pd.DataFrame(psdf.transpose().values, columns=[self.name], index=where)
return df[df.columns[0]]
Contributor


Seems like this returns the pandas DataFrame, but we should return the pandas-on-Spark DataFrame.

I think maybe we can just keep the previous fix and address the Timestamp case separately ??

e.g.

        with ps.option_context("compute.default_index_type", "distributed", "compute.max_rows", 1):
            if (len(where) == len(set(where))) and not isinstance(index_type, TimestampType):
                psdf: DataFrame = DataFrame(sdf)
                psdf.columns = pd.Index(where)
                return first_series(psdf.transpose()).rename(self.name)
            else:
                # If `where` has duplicate items, leverage the pandas directly
                # since pandas API on Spark doesn't support the duplicate column name.
                pdf: pd.DataFrame = sdf.limit(1).toPandas()
                pdf.columns = pd.Index(where)
                return first_series(DataFrame(pdf.transpose())).rename(self.name)

@pralabhkumar
Contributor Author

@itholic Please review the changes

Contributor

@itholic itholic left a comment


Otherwise, looks pretty good

Comment on lines 5233 to 5250
if np.nan in where:
max_index = self._internal.spark_frame.select(F.last(index_scol)).take(1)[0][0]
modified_where = [max_index if x is np.nan else x for x in where]
cond = [
F.last(
F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column),
ignorenulls=True,
)
for idx, index in enumerate(modified_where)
]
else:
cond = [
F.last(
F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column),
ignorenulls=True,
)
for idx, index in enumerate(where)
]
Contributor


nit: I think maybe we can unify the cond and leave a comment to improve readability a bit.

e.g.

        if np.nan in where:
            # When `where` is np.nan, pandas returns the last index value.
            max_index = self._internal.spark_frame.select(F.last(index_scol)).take(1)[0][0]
            modified_where = [max_index if x is np.nan else x for x in where]
        else:
            modified_where = where

        cond = [
            F.last(
                F.when(index_scol <= SF.lit(index).cast(index_type), self.spark.column),
                ignorenulls=True,
            )
            for idx, index in enumerate(modified_where)
        ]

@ueshin
Member

ueshin commented Mar 11, 2022

@pralabhkumar I think it's already fixed at 54abb85. Could you merge the latest master branch and push the commit?

@ueshin
Member

ueshin commented Mar 11, 2022

@itholic @Yikun @HyukjinKwon @xinrong-databricks Could you take another look? Thanks.

@HyukjinKwon
Member

Should be good to go if it looks fine to you, Takuya.

Member

@Yikun Yikun left a comment


LGTM

@pralabhkumar
Contributor Author

@ueshin
The build is passing. Please find some time to review it.

Member

@ueshin ueshin left a comment


LGTM.

@ueshin
Member

ueshin commented Mar 14, 2022

Thanks! Merging to master.

@ueshin ueshin closed this in f6c4634 Mar 14, 2022
@itholic
Contributor

itholic commented Mar 14, 2022

Thanks for your efforts :-)

@pralabhkumar
Contributor Author

Thanks for merging to master.

HyukjinKwon pushed a commit that referenced this pull request Apr 12, 2024
### What changes were proposed in this pull request?

Use the monotonically ID as a sorting condition for `max_by` instead of a literal string.

### Why are the changes needed?
#35191 had an error where the literal string `"__monotonically_increasing_id__"` was used as the tie-breaker in `max_by` instead of the actual ID.

### Does this PR introduce _any_ user-facing change?
Fixes nondeterminism in `asof`

### How was this patch tested?
In some circumstances `//python:pyspark.pandas.tests.connect.series.test_parity_as_of` is sufficient to reproduce

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46018 from markj-db/SPARK-47824.

Authored-by: Mark Jarvin <mark.jarvin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
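For context, a minimal hypothetical sketch of the difference described in this follow-up commit (the data and column names are illustrative, not the actual patch):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.range(4).withColumn(
    "__monotonically_increasing_id__", F.monotonically_increasing_id()
)

# Buggy: the ordering argument is a constant string literal, so every row
# ties and max_by may pick an arbitrary row (nondeterministic result).
buggy = sdf.select(F.max_by("id", F.lit("__monotonically_increasing_id__")))

# Fixed: order by the actual ID column so the last row wins deterministically.
fixed = sdf.select(F.max_by("id", F.col("__monotonically_increasing_id__")))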
HyukjinKwon pushed a commit that referenced this pull request Apr 12, 2024
### What changes were proposed in this pull request?

Use the monotonically ID as a sorting condition for `max_by` instead of a literal string.

### Why are the changes needed?
#35191 had an error where the literal string `"__monotonically_increasing_id__"` was used as the tie-breaker in `max_by` instead of the actual ID.

### Does this PR introduce _any_ user-facing change?
Fixes nondeterminism in `asof`

### How was this patch tested?
In some circumstances `//python:pyspark.pandas.tests.connect.series.test_parity_as_of` is sufficient to reproduce

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46018 from markj-db/SPARK-47824.

Authored-by: Mark Jarvin <mark.jarvin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a0ccdf2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Apr 12, 2024
### What changes were proposed in this pull request?

Use the monotonically ID as a sorting condition for `max_by` instead of a literal string.

### Why are the changes needed?
#35191 had an error where the literal string `"__monotonically_increasing_id__"` was used as the tie-breaker in `max_by` instead of the actual ID.

### Does this PR introduce _any_ user-facing change?
Fixes nondeterminism in `asof`

### How was this patch tested?
In some circumstances `//python:pyspark.pandas.tests.connect.series.test_parity_as_of` is sufficient to reproduce

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #46018 from markj-db/SPARK-47824.

Authored-by: Mark Jarvin <mark.jarvin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a0ccdf2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024
### What changes were proposed in this pull request?

Use the monotonically ID as a sorting condition for `max_by` instead of a literal string.

### Why are the changes needed?
apache#35191 had an error where the literal string `"__monotonically_increasing_id__"` was used as the tie-breaker in `max_by` instead of the actual ID.

### Does this PR introduce _any_ user-facing change?
Fixes nondeterminism in `asof`

### How was this patch tested?
In some circumstances `//python:pyspark.pandas.tests.connect.series.test_parity_as_of` is sufficient to reproduce

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#46018 from markj-db/SPARK-47824.

Authored-by: Mark Jarvin <mark.jarvin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a0ccdf2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>