Conversation

xinrong-meng (Member) commented Jun 17, 2021

What changes were proposed in this pull request?

This PR proposes to support creating a Column of a numpy literal value in pandas-on-Spark. It consists of three main changes:

  • Enable the lit function defined in pyspark.pandas.spark.functions to accept numpy literal input (see the sketch after this list).
>>> import numpy as np
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
  • Substitute F.lit with SF.lit, that is, use the lit function defined in pyspark.pandas.spark.functions rather than the one defined in pyspark.sql.functions, so that columns can be created out of numpy literals.
  • Enable numpy literal input in the isin method.
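
A minimal sketch of such a numpy-aware lit, assuming an illustrative dtype-to-Spark-type mapping (the merged implementation may differ):

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import ByteType, FloatType, IntegerType, LongType

# Illustrative mapping; the PR's actual table may cover more numpy dtypes.
_NUMPY_TO_SPARK = {
    np.dtype("int8"): ByteType(),
    np.dtype("int32"): IntegerType(),
    np.dtype("int64"): LongType(),
    np.dtype("float32"): FloatType(),
}

def lit(literal):
    if isinstance(literal, np.generic):
        # item() unwraps the numpy scalar into a plain Python value, which
        # pyspark.sql.functions.lit accepts; the cast preserves the dtype.
        spark_type = _NUMPY_TO_SPARK.get(literal.dtype)
        if spark_type is not None:
            return F.lit(literal.item()).cast(spark_type)
        return F.lit(literal.item())
    return F.lit(literal)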

Non-goal:

  • Some pandas-on-Spark APIs use PySpark column-related APIs internally, and those column-related APIs don't support numpy literals, so numpy literals are disallowed as input (e.g. the to_replace parameter of the replace API). This PR doesn't aim to adjust all of them; it adjusts isin only, because the PR was inspired by that case (AttributeError: 'numpy.int64' object has no attribute '_get_object_id', databricks/koalas#2161).
  • Completing the mappings between all kinds of numpy literals and Spark data types should be a follow-up task.

Why are the changes needed?

Spark (the lit function defined in pyspark.sql.functions) doesn't support creating a Column out of a numpy literal value,
so the lit function defined in pyspark.pandas.spark.functions is adjusted to support that in pandas-on-Spark.

Does this PR introduce any user-facing change?

Yes.
Before:

>>> import numpy as np
>>> import pyspark.pandas as ps
>>> a = ps.DataFrame({'source': [1, 2, 3, 4, 5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'

After:

>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0     True
1     True
2    False
3    False
4    False
Name: source, dtype: bool

How was this patch tested?

Unit tests.

Keyword: SPARK-35337

SparkQA commented Jun 17, 2021

Test build #139943 has finished for PR 32955 at commit c222eb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44470/

SparkQA commented Jun 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44470/

@github-actions github-actions bot added the BUILD label Jun 21, 2021
SparkQA commented Jun 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44620/

SparkQA commented Jun 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44620/

SparkQA commented Jun 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44623/

SparkQA commented Jun 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44623/

SparkQA commented Jun 21, 2021

Test build #140092 has finished for PR 32955 at commit 0811c90.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 21, 2021

Test build #140095 has finished for PR 32955 at commit 8876e6d.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@xinrong-meng xinrong-meng force-pushed the datatypeops_literal branch from 8876e6d to 623cc1d Compare June 21, 2021 23:40
@xinrong-meng xinrong-meng changed the title [WIP] Support creating a Column of numpy literal value in pandas-on-Spark [SPARK-35344][PYTHON] Support creating a Column of numpy literal value in pandas-on-Spark Jun 21, 2021
xinrong-meng (Member, Author) commented:

CC @ueshin @HyukjinKwon @itholic

@xinrong-meng xinrong-meng marked this pull request as ready for review June 21, 2021 23:57
SparkQA commented Jun 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44632/

SparkQA commented Jun 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44632/

SparkQA commented Jun 22, 2021

Test build #140104 has finished for PR 32955 at commit 623cc1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member:

@xinrong-databricks, I think we can just support this natively all across PySpark. Can you add an input converter here, next to the existing converters (excerpted below), and see if it works?

class DatetimeConverter(object):
    def can_convert(self, obj):
        return isinstance(obj, datetime.datetime)

    def convert(self, obj, gateway_client):
        Timestamp = JavaClass("java.sql.Timestamp", gateway_client)
        seconds = (calendar.timegm(obj.utctimetuple()) if obj.tzinfo
                   else time.mktime(obj.timetuple()))
        t = Timestamp(int(seconds) * 1000)
        t.setNanos(obj.microsecond * 1000)
        return t

# datetime is a subclass of date, we should register DatetimeConverter first
register_input_converter(DatetimeConverter())
register_input_converter(DateConverter())

Also, I think we can simplify it with item() (https://stackoverflow.com/a/11389998/2438480) plus an np.generic type check (https://numpy.org/doc/stable/reference/arrays.scalars.html).
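
For example (illustrative):

>>> import numpy as np
>>> isinstance(np.float32(1), np.generic)
True
>>> np.float32(1).item(), type(np.float32(1).item())
(1.0, <class 'float'>)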

cc @ueshin @itholic too

Member:

cc @mengxr too FYI

Member (Author):

Certainly, I am working on this.

xinrong-meng (Member, Author) commented Jun 24, 2021:

Adding an input converter might be tricky in this case.

Py4J converts an instance of java.lang.Integer to a plain int rather than a JavaObject, but the return value of def convert is expected to be a JavaObject.

Unfortunately, there is also no wrapper in the java.sql package for numeric values (analogous to java.sql.Timestamp), and instantiating a JavaObject from a raw value seems impossible, judging by its constructor.

Do you have insights on that by any chance? @HyukjinKwon @ueshin

Here is my pseudo-code:

from py4j.java_gateway import JavaClass
from py4j.protocol import register_input_converter


class SomeConverter(object):
    def can_convert(self, obj):
        import numpy as np
        return isinstance(obj, np.generic)

    def convert(self, obj, gateway_client):
        Integer = JavaClass("java.lang.Integer", gateway_client)
        # Py4J auto-converts the returned java.lang.Integer to a Python
        # `int`, not a JavaObject, which breaks the caller's `_detach` call.
        return Integer.valueOf(obj.item())

...
register_input_converter(SomeConverter())

And the exception stack trace looks like:

Traceback (most recent call last):
  File "/Users/xinrong.meng/spark/python/pyspark/pandas/tests/data_type_ops/test_udt_ops.py", line 43, in test
    print(F.lit(np.int64(1)))
  File "/Users/xinrong.meng/spark/python/pyspark/sql/functions.py", line 100, in lit
    return col if isinstance(col, Column) else _invoke_function("lit", col)
  File "/Users/xinrong.meng/spark/python/pyspark/sql/functions.py", line 58, in _invoke_function
    return Column(jf(*args))
  File "/miniconda2/envs/pyspark-dev-pd-1.1.5/lib/python3.9/site-packages/py4j/java_gateway.py", line 1313, in __call__
    temp_arg._detach()
AttributeError: 'int' object has no attribute '_detach'

xinrong-meng (Member, Author) commented Jun 28, 2021:

I am wondering if we may support creating a Column of a numpy literal value in pandas-on-Spark first. We might need more research before supporting that in PySpark generally.

CC @HyukjinKwon @ueshin

Member:

I'm fine with the current implementation since it sounds like the converter approach is difficult.
I'd leave this to @HyukjinKwon.

Member:

Okie, I am fine as is.

Member (Author):

Thank you!

@xinrong-meng xinrong-meng force-pushed the datatypeops_literal branch from 623cc1d to c13414e Compare June 22, 2021 16:58
SparkQA commented Jun 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44684/

SparkQA commented Jun 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44684/

SparkQA commented Jun 22, 2021

Test build #140158 has finished for PR 32955 at commit c13414e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xinrong-meng xinrong-meng changed the title [SPARK-35344][PYTHON] Support creating a Column of numpy literal value in pandas-on-Spark [WIP][SPARK-35344][PYTHON] Support creating a Column of numpy literal value Jun 23, 2021
@xinrong-meng xinrong-meng marked this pull request as draft June 23, 2021 03:05
@xinrong-meng xinrong-meng force-pushed the datatypeops_literal branch from c13414e to 23982c1 Compare June 24, 2021 22:53
SparkQA commented Jun 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44816/

SparkQA commented Jun 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44816/

SparkQA commented Jun 25, 2021

Test build #140287 has finished for PR 32955 at commit 23982c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xinrong-meng xinrong-meng force-pushed the datatypeops_literal branch from 23982c1 to 73c43eb Compare June 28, 2021 16:21
@xinrong-meng xinrong-meng changed the title [WIP][SPARK-35344][PYTHON] Support creating a Column of numpy literal value [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark Jun 28, 2021
@xinrong-meng xinrong-meng marked this pull request as ready for review June 28, 2021 16:29
ueshin commented Jun 29, 2021

Thanks! merging to master.

@ueshin ueshin closed this in 5f0113e Jun 29, 2021
ueshin commented Jun 29, 2021

@xinrong-databricks Please file follow-up tickets if needed. Thanks!

xinrong-meng (Member, Author) commented:

Thanks @ueshin! I will file follow-up tickets.

HyukjinKwon added a commit to py4j/py4j that referenced this pull request Aug 10, 2022
This is more of a bug fix or safeguard.

After conversion of arguments (via our `Converter.convert` interface in Py4J), the returned argument might not be a plain `JavaObject`. For example, `JavaObject(java.lang.Integer)` would be converted to `int` automatically; see also #163.

However, the current codebase requires it to be a `JavaObject` by assuming the `_detach` method exists (to garbage-collect the instance). In fact, calling a Java method with these Python primitives is valid in Py4J, so it makes sense to allow returning primitive types via `Converter.convert`.

Therefore, this PR proposes to call `_detach` only when it exists, and delegates the type checking to the actual method invocation, which is consistent with calling the usual JVM methods via Py4J.

I manually tested, and will add the integration test on the PySpark side. It's a bit tricky to add a unit test, so it will be tested together with PySpark.

See also apache/spark#32955 (comment)
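
A minimal sketch of the safeguard described above (illustrative; the helper name is hypothetical and the actual py4j change may differ):

def _detach_temp_args(temp_args):
    # Detach only real JavaObjects after argument conversion; converted
    # primitives (e.g. a plain Python int) have no `_detach` method.
    for temp_arg in temp_args:
        if hasattr(temp_arg, "_detach"):
            temp_arg._detach()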