Conversation

xinrong-meng (Member) commented Jun 17, 2021

What changes were proposed in this pull request?

This PR proposes to support creating a Column of a numpy literal value in pandas-on-Spark. It consists of three main changes:

  • Enable the lit function defined in pyspark.pandas.spark.functions to accept numpy literal input (see the sketch after this list).
>>> import numpy as np
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
  • Substitute F.lit with SF.lit, that is, use the lit function defined in pyspark.pandas.spark.functions rather than the one defined in pyspark.sql.functions, so that columns can be created out of numpy literals.
  • Enable numpy literal input in the isin method.
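
A minimal sketch of such a numpy-aware lit, assuming an illustrative dtype-to-Spark-type mapping (the merged implementation may differ):

import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.types import ByteType, FloatType, IntegerType, LongType

# Illustrative mapping; the PR's actual table may cover more numpy dtypes.
_NUMPY_TO_SPARK = {
    np.dtype("int8"): ByteType(),
    np.dtype("int32"): IntegerType(),
    np.dtype("int64"): LongType(),
    np.dtype("float32"): FloatType(),
}

def lit(literal):
    if isinstance(literal, np.generic):
        # item() unwraps the numpy scalar into a plain Python value, which
        # pyspark.sql.functions.lit accepts; the cast preserves the dtype.
        spark_type = _NUMPY_TO_SPARK.get(literal.dtype)
        if spark_type is not None:
            return F.lit(literal.item()).cast(spark_type)
        return F.lit(literal.item())
    return F.lit(literal)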

Non-goal:

  • Some pandas-on-Spark APIs use PySpark column-related APIs internally, and those column-related APIs don't support numpy literals, so numpy literals are disallowed as input (e.g. the to_replace parameter of the replace API). This PR doesn't aim to adjust all of them; it adjusts isin only, because the PR was inspired by that case (AttributeError: 'numpy.int64' object has no attribute '_get_object_id', databricks/koalas#2161).
  • Completing the mappings between all kinds of numpy literals and Spark data types should be a follow-up task.

Why are the changes needed?

Spark (the lit function defined in pyspark.sql.functions) doesn't support creating a Column out of a numpy literal value,
so the lit function defined in pyspark.pandas.spark.functions is adjusted to support that in pandas-on-Spark.

Does this PR introduce any user-facing change?

Yes.
Before:

>>> import numpy as np
>>> import pyspark.pandas as ps
>>> a = ps.DataFrame({'source': [1, 2, 3, 4, 5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'

After:

>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0     True
1     True
2    False
3    False
4    False
Name: source, dtype: bool

How was this patch tested?

Unit tests.

Keyword: SPARK-35337

SparkQA commented Jun 17, 2021

Test build #139943 has finished for PR 32955 at commit c222eb7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44470/

SparkQA commented Jun 17, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44470/

@github-actions github-actions bot added the BUILD label Jun 21, 2021
SparkQA commented Jun 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44620/

SparkQA commented Jun 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44620/

SparkQA commented Jun 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44623/

SparkQA commented Jun 21, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44623/

SparkQA commented Jun 21, 2021

Test build #140092 has finished for PR 32955 at commit 0811c90.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 21, 2021

Test build #140095 has finished for PR 32955 at commit 8876e6d.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@xinrong-meng xinrong-meng force-pushed the datatypeops_literal branch from 8876e6d to 623cc1d Compare June 21, 2021 23:40
@xinrong-meng xinrong-meng changed the title [WIP] Support creating a Column of numpy literal value in pandas-on-Spark [SPARK-35344][PYTHON] Support creating a Column of numpy literal value in pandas-on-Spark Jun 21, 2021
xinrong-meng (Member, Author) commented:

CC @ueshin @HyukjinKwon @itholic

@xinrong-meng xinrong-meng marked this pull request as ready for review June 21, 2021 23:57
SparkQA commented Jun 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44632/

SparkQA commented Jun 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44632/

SparkQA commented Jun 22, 2021

Test build #140104 has finished for PR 32955 at commit 623cc1d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member:

@xinrong-databricks, I think we can just support this natively all across PySpark. Can you add an input converter here, next to the existing converters (excerpted below), and see if it works?

class DatetimeConverter(object):
    def can_convert(self, obj):
        return isinstance(obj, datetime.datetime)

    def convert(self, obj, gateway_client):
        Timestamp = JavaClass("java.sql.Timestamp", gateway_client)
        seconds = (calendar.timegm(obj.utctimetuple()) if obj.tzinfo
                   else time.mktime(obj.timetuple()))
        t = Timestamp(int(seconds) * 1000)
        t.setNanos(obj.microsecond * 1000)
        return t

# datetime is a subclass of date, we should register DatetimeConverter first
register_input_converter(DatetimeConverter())
register_input_converter(DateConverter())

Also, I think we can simplify it with item() (https://stackoverflow.com/a/11389998/2438480) plus an np.generic type check (https://numpy.org/doc/stable/reference/arrays.scalars.html).
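
For example (illustrative):

>>> import numpy as np
>>> isinstance(np.float32(1), np.generic)
True
>>> np.float32(1).item(), type(np.float32(1).item())
(1.0, <class 'float'>)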

cc @ueshin @itholic too

Member:

cc @mengxr too FYI

Member (Author):

Certainly, I am working on this.

xinrong-meng (Member, Author) commented Jun 24, 2021:

Adding an input converter might be tricky in this case.

Py4J converts an instance of java.lang.Integer to a plain int rather than a JavaObject, but the return value of def convert is expected to be a JavaObject.

Unfortunately, there is also no wrapper in the java.sql package for numeric values (analogous to java.sql.Timestamp), and instantiating a JavaObject from a raw value seems impossible, judging by its constructor.

Do you have insights on that by any chance? @HyukjinKwon @ueshin

Here is my pseudo-code:

from py4j.java_gateway import JavaClass
from py4j.protocol import register_input_converter


class SomeConverter(object):
    def can_convert(self, obj):
        import numpy as np
        return isinstance(obj, np.generic)

    def convert(self, obj, gateway_client):
        Integer = JavaClass("java.lang.Integer", gateway_client)
        # Py4J auto-converts the returned java.lang.Integer to a Python
        # `int`, not a JavaObject, which breaks the caller's `_detach` call.
        return Integer.valueOf(obj.item())

...
register_input_converter(SomeConverter())

And the exception stack trace looks like:

Traceback (most recent call last):
  File "/Users/xinrong.meng/spark/python/pyspark/pandas/tests/data_type_ops/test_udt_ops.py", line 43, in test
    print(F.lit(np.int64(1)))
  File "/Users/xinrong.meng/spark/python/pyspark/sql/functions.py", line 100, in lit
    return col if isinstance(col, Column) else _invoke_function("lit", col)
  File "/Users/xinrong.meng/spark/python/pyspark/sql/functions.py", line 58, in _invoke_function
    return Column(jf(*args))
  File "/miniconda2/envs/pyspark-dev-pd-1.1.5/lib/python3.9/site-packages/py4j/java_gateway.py", line 1313, in __call__
    temp_arg._detach()
AttributeError: 'int' object has no attribute '_detach'

xinrong-meng (Member, Author) commented Jun 28, 2021:

I am wondering if we may support creating a Column of a numpy literal value in pandas-on-Spark first. We might need more research before supporting that in PySpark generally.

CC @HyukjinKwon @ueshin

Member:

I'm fine with the current implementation since it sounds like the converter approach is difficult.
I'd leave this to @HyukjinKwon.

Member:

Okie, I am fine as is.

Member (Author):

Thank you!

@xinrong-meng xinrong-meng force-pushed the datatypeops_literal branch from 623cc1d to c13414e Compare June 22, 2021 16:58
SparkQA commented Jun 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44684/

SparkQA commented Jun 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44684/

SparkQA commented Jun 22, 2021

Test build #140158 has finished for PR 32955 at commit c13414e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xinrong-meng xinrong-meng changed the title [SPARK-35344][PYTHON] Support creating a Column of numpy literal value in pandas-on-Spark [WIP][SPARK-35344][PYTHON] Support creating a Column of numpy literal value Jun 23, 2021
@xinrong-meng xinrong-meng marked this pull request as draft June 23, 2021 03:05
@xinrong-meng xinrong-meng force-pushed the datatypeops_literal branch from c13414e to 23982c1 Compare June 24, 2021 22:53
SparkQA commented Jun 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44816/

SparkQA commented Jun 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44816/

SparkQA commented Jun 25, 2021

Test build #140287 has finished for PR 32955 at commit 23982c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xinrong-meng xinrong-meng force-pushed the datatypeops_literal branch from 23982c1 to 73c43eb Compare June 28, 2021 16:21
@xinrong-meng xinrong-meng changed the title [WIP][SPARK-35344][PYTHON] Support creating a Column of numpy literal value [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark Jun 28, 2021
@xinrong-meng xinrong-meng marked this pull request as ready for review June 28, 2021 16:29
ueshin commented Jun 29, 2021

Thanks! merging to master.

@ueshin ueshin closed this in 5f0113e Jun 29, 2021
ueshin commented Jun 29, 2021

@xinrong-databricks Please file follow-up tickets if needed. Thanks!

xinrong-meng (Member, Author) commented:

Thanks @ueshin! I will file follow-up tickets.

HyukjinKwon added a commit to py4j/py4j that referenced this pull request Aug 10, 2022
This is more of a bug fix or safeguard.

After conversion of arguments (via our `Converter.convert` interface in Py4J), the returned argument might not be a plain `JavaObject`. For example, `JavaObject(java.lang.Integer)` would be converted to `int` automatically; see also #163.

However, the current codebase requires it to be a `JavaObject` by assuming the `_detach` method exists (to garbage-collect the instance). In fact, calling a Java method with these Python primitives is valid in Py4J, so it makes sense to allow returning primitive types via `Converter.convert`.

Therefore, this PR proposes to call `_detach` only when it exists, and delegates the type checking to the actual method invocation, which is consistent with calling the usual JVM methods via Py4J.

I manually tested, and will add the integration test on the PySpark side. It's a bit tricky to add a unit test, so it will be tested together with PySpark.

See also apache/spark#32955 (comment)
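
A minimal sketch of the safeguard described above (illustrative; the helper name is hypothetical and the actual py4j change may differ):

def _detach_temp_args(temp_args):
    # Detach only real JavaObjects after argument conversion; converted
    # primitives (e.g. a plain Python int) have no `_detach` method.
    for temp_arg in temp_args:
        if hasattr(temp_arg, "_detach"):
            temp_arg._detach()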