
[SPARK-44733][PYTHON][DOCS] Add Python to Spark type conversion page to PySpark docs. #43369

Closed · wants to merge 5 commits

Conversation

@PhilDakin (Contributor) commented Oct 13, 2023

@allisonwang-db

What changes were proposed in this pull request?

Add documentation page showing Python to Spark type mappings for PySpark.

Why are the changes needed?

Surface this information to users navigating the PySpark docs per https://issues.apache.org/jira/browse/SPARK-44733.

Does this PR introduce any user-facing change?

Yes, adds new page to PySpark docs.

How was this patch tested?

Built the HTML docs with Sphinx and inspected the output visually.

Was this patch authored or co-authored using generative AI tooling?

No.

[full-page screenshot of the new docs page attached]

@github-actions github-actions bot added the DOCS label Oct 16, 2023
@HyukjinKwon (Member)

Mind attaching a screenshot of the output HTML? Otherwise looks fine from a cursory look. cc @itholic, @xinrong-meng and @zhengruifeng too, if you find some time to review.

@PhilDakin (Contributor, Author)

[screenshot attached: spark_python_docs_build_html_user_guide_sql_type_conversions.html]

@PhilDakin (Contributor, Author)

The test failure looks unrelated: pyspark-mllib failed with "Error: The operation was canceled."

@itholic (Contributor) commented Oct 18, 2023

Looks nice. Could you rebase the PR to master?

@PhilDakin (Contributor, Author)

[screenshot attached, 2023-10-18 11:33 AM]

@allisonwang-db (Contributor)

Hi @PhilDakin thanks for doing this! I personally think it's better to have the table here instead of a link to another page.

Also, I think we should explain why this conversion table matters. For example, it is useful when users want to map a Python return type to a Spark return type in a Python UDF.

Another thing we need to mention is type casting. What if I want to cast an int type in Python to a FloatType in Spark? Currently, for a regular Python UDF, it will return NULL, I believe, but an arrow-optimized Python UDF can cast the value properly (a sketch of this difference follows the table below). It would be valuable to have a table like this:

# The following table shows the results when the type coercion in Arrow is needed, that is,
# when the user-specified return type(SQL Type) of the UDF and the actual instance(Python
# Value(Type)) that the UDF returns are different.
# Arrow and Pickle have different type coercion rules, so a UDF might have a different result
# with/without Arrow optimization. That's the main reason the Arrow optimization for Python
# UDFs is disabled by default.
# +-----------------------------+--------------+----------+------+---------------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+----------------------------+------------+--------------+ # noqa
# |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|1(int)| a(str)| 1970-01-01(date)|1970-01-01 00:00:00(datetime)|1.0(float)|array('i', [1])(array)|[1](list)| (1,)(tuple)|bytearray(b'ABC')(bytearray)| 1(Decimal)|{'a': 1}(dict)| # noqa
# +-----------------------------+--------------+----------+------+---------------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+----------------------------+------------+--------------+ # noqa
# | boolean| None| True| None| None| None| None| None| None| None| None| None| None| None| # noqa
# | tinyint| None| None| 1| None| None| None| None| None| None| None| None| None| None| # noqa
# | smallint| None| None| 1| None| None| None| None| None| None| None| None| None| None| # noqa
# | int| None| None| 1| None| None| None| None| None| None| None| None| None| None| # noqa
# | bigint| None| None| 1| None| None| None| None| None| None| None| None| None| None| # noqa
# | string| None| 'true'| '1'| 'a'|'java.util.Gregor...| 'java.util.Gregor...| '1.0'| '[I@120d813a'| '[1]'|'[Ljava.lang.Obje...| '[B@48571878'| '1'| '{a=1}'| # noqa
# | date| None| X| X| X|datetime.date(197...| datetime.date(197...| X| X| X| X| X| X| X| # noqa
# | timestamp| None| X| X| X| X| datetime.datetime...| X| X| X| X| X| X| X| # noqa
# | float| None| None| None| None| None| None| 1.0| None| None| None| None| None| None| # noqa
# | double| None| None| None| None| None| None| 1.0| None| None| None| None| None| None| # noqa
# | binary| None| None| None|bytearray(b'a')| None| None| None| None| None| None| bytearray(b'ABC')| None| None| # noqa
# | decimal(10,0)| None| None| None| None| None| None| None| None| None| None| None|Decimal('1')| None| # noqa
# +-----------------------------+--------------+----------+------+---------------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+----------------------------+------------+--------------+ # noqa
# Note: Python 3.9.15, Pandas 1.5.2 and PyArrow 10.0.1 are used.
# Note: The values of 'SQL Type' are DDL formatted strings, which can be used as `returnType`s.
# Note: The values inside the table are generated by `repr`. 'X' means it throws an exception
# during the conversion.
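
For illustration, a minimal sketch of the casting difference described above. It assumes an active SparkSession bound to `spark` and a Spark version where `udf(..., useArrow=True)` is available (3.5+); the NULL-vs-1.0 outcome follows the `float` row of the table above.

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# Declared return type is FloatType, but the function returns a Python int.
@udf(returnType=FloatType())
def pickled_one(x):
    # Pickle-based conversion does not coerce int -> float, so this becomes NULL.
    return 1

@udf(returnType=FloatType(), useArrow=True)
def arrow_one(x):
    # Arrow's type coercion casts the int to 1.0.
    return 1

df = spark.range(1)
df.select(pickled_one("id").alias("pickled"), arrow_one("id").alias("arrow")).show()
# Expected, per the table above:
# +-------+-----+
# |pickled|arrow|
# +-------+-----+
# |   NULL|  1.0|
# +-------+-----+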

@allisonwang-db (Contributor)

^ We don't have to add everything in this PR, but I do think we should have a separate table for type conversion in PySpark docs, and then we can improve it.

@PhilDakin (Contributor, Author)

@allisonwang-db I brought back the table and added a section indicating when these conversions are relevant during UDF definition.

Will follow up with examples going into more depth on type conversion as a separate PR for https://issues.apache.org/jira/browse/SPARK-44734.

@PhilDakin (Contributor, Author)

I agree that duplicating the table is not ideal. It would be nice to have a cross-format inclusion mechanism for tables, between the main documentation and PySpark's. Seems a bit out of scope for this PR, though.

@PhilDakin (Contributor, Author)

@allisonwang-db what do you think here?

@zhengruifeng (Contributor)

Do we need to mention related configs like spark.sql.pyspark.inferNestedDictAsStruct.enabled and spark.sql.timestampType?

In createDataFrame, spark.sql.pyspark.inferNestedDictAsStruct.enabled controls whether a dict is treated as a map or a struct (a short sketch follows the examples below).

BTW, I think we may need to mention nested rows and numpy arrays (the examples below assume `import numpy as np` and `from pyspark.sql import Row`):

In [25]: spark.createDataFrame(np.zeros([3,3], "int8"))
Out[25]: DataFrame[_1: tinyint, _2: tinyint, _3: tinyint]

In [26]: spark.createDataFrame(np.zeros([3,3], "int64"))
Out[26]: DataFrame[_1: bigint, _2: bigint, _3: bigint]

In [27]: spark.createDataFrame([Row(a=1, b=Row(c=2))])
Out[27]: DataFrame[a: bigint, b: struct<c:bigint>]
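
To make the first config concrete, a hedged sketch of the nested-dict behavior (assuming an active SparkSession bound to `spark`; the printed schemas are approximate):

data = [{"a": {"b": 1}}]

# Default (false): nested dicts are inferred as MapType.
spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", "false")
spark.createDataFrame(data).printSchema()
# root
#  |-- a: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: long (valueContainsNull = true)

# Enabled: nested dicts are inferred as StructType.
spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", "true")
spark.createDataFrame(data).printSchema()
# root
#  |-- a: struct (nullable = true)
#  |    |-- b: long (nullable = true)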

@allisonwang-db (Contributor) left a review comment:

Thanks for working on this! This is much more clear. It would be great also to include a screenshot in the PR description.

Commits pushed:
…- no longs in Python, use note directive, fix title RST lines.
…- add example section emphasizing importance during UDFs, TODO for conversions.
…- add relevant configs, provided more examples.
@PhilDakin (Contributor, Author)

@allisonwang-db added full-page screenshot to description and rebased onto master.

@PhilDakin (Contributor, Author)

@allisonwang-db any further updates needed here?

@allisonwang-db (Contributor) left a review comment:

Thanks for working on this!

df = spark.createDataFrame(
    [[1]], schema=StructType([StructField("int", IntegerType())])
)

Member comment:

There should be two blank lines here, per PEP 8.

@udf(returnType=StringType())
def to_string(value):
    return str(value)

Member comment:

ditto

@udf(returnType=FloatType())
def to_float(value):
    return float(value)

Member comment:

ditto

# |-- Score: double (nullable = true)
# |-- Period: long (nullable = true)

import pandas as pd
Member comment:

I would move the imports to the top. numpy too.


.. code-block:: python

data = [
Member comment:

This should be indented either 3 spaces (per the Sphinx specification) or 4 spaces, to be consistent across the PySpark documentation (yes, we're using non-standard spacing in most of the rst files).

* - Configuration
- Description
- Default
* - spark.sql.execution.pythonUDF.arrow.enabled
Member comment:

Should we make it a code literal, like this suggested change?

- * - spark.sql.execution.pythonUDF.arrow.enabled
+ * - `spark.sql.execution.pythonUDF.arrow.enabled`


.. code-block:: python

from pyspark.sql.types import *
Member comment:

Let's avoid wildcard imports; they're discouraged by PEP 8.
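
For reference, a sketch of an explicit import that could replace the wildcard (the exact names are illustrative and depend on what the page's examples actually use):

# Instead of:
#     from pyspark.sql.types import *
# import only the names the examples need, for example:
from pyspark.sql.types import (
    IntegerType,
    FloatType,
    StringType,
    StructField,
    StructType,
)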


All Conversions
---------------
.. list-table::
@HyukjinKwon (Member) commented Nov 12, 2023:

Let's at least add a comment here to update docs/sql-ref-datatypes.md together whenever anyone makes a change. I still don't like that we're duplicating the docs, but it's probably fine since we're going to put all the Python-specific information here.

@@ -119,10 +119,10 @@ from pyspark.sql.types import *

|Data type|Value type in Python|API to access or create a data type|
Member comment:

Let's also add a comment noting that python/docs/source/user_guide/sql/type_conversions.rst should be fixed as well. You could use <!-- comment -->.

Python to Spark Type Conversions
================================

.. TODO: Add additional information on conversions when Arrow is enabled.
Member comment:

Should probably file a JIRA.

@PhilDakin (Contributor, Author) replied:
This is covered by the ticket in the TODO below, modifying to make this clear.

@HyukjinKwon (Member) left a review:
LGTM otherwise.

@HyukjinKwon (Member)

Merged to master.

@HyukjinKwon (Member)

@PhilDakin do you have a JIRA ID, so I can assign this ticket (SPARK-44733) to you? Feel free to comment directly in the JIRA.

@PhilDakin (Contributor, Author) commented Nov 14, 2023

@HyukjinKwon ah, I was still going to address your other comments before merge. Not a big deal.

Will comment on Jira.
