
[SPARK-44733][PYTHON][DOCS] Add Python to Spark type conversion page to PySpark docs. #43369

Closed · wants to merge 5 commits

Conversation

@PhilDakin (Contributor) commented Oct 13, 2023

@allisonwang-db

What changes were proposed in this pull request?

Add documentation page showing Python to Spark type mappings for PySpark.

Why are the changes needed?

Surface this information to users navigating the PySpark docs per https://issues.apache.org/jira/browse/SPARK-44733.

Does this PR introduce any user-facing change?

Yes, adds new page to PySpark docs.

How was this patch tested?

Built the HTML docs with Sphinx and inspected the output visually.

Was this patch authored or co-authored using generative AI tooling?

No.

[full-page screenshot of the new docs page attached]

@github-actions github-actions bot added the DOCS label Oct 16, 2023
@HyukjinKwon (Member)

Mind attaching a screenshot of the output HTML? Otherwise looks fine from a cursory look. cc @itholic, @xinrong-meng and @zhengruifeng too, if you find some time to review.

@PhilDakin (Contributor, Author)

[screenshot attached: spark_python_docs_build_html_user_guide_sql_type_conversions.html]

@PhilDakin (Contributor, Author)

The test failure looks unrelated: pyspark-mllib failed with "Error: The operation was canceled."

@itholic (Contributor) commented Oct 18, 2023

Looks nice. Could you rebase the PR to master?

@PhilDakin (Contributor, Author)

[screenshot attached, 2023-10-18 11:33 AM]

@allisonwang-db (Contributor)

Hi @PhilDakin thanks for doing this! I personally think it's better to have the table here instead of a link to another page.

Also, I think we should explain why this conversion table matters. For example, it is useful when users want to map a Python return type to a Spark return type in a Python UDF.

Another thing we need to mention is type casting. What if I want to cast an int type in Python to a FloatType in Spark? Currently, for a regular Python UDF, it will return NULL, I believe, but an arrow-optimized Python UDF can cast the value properly (a sketch of this difference follows the table below). It would be valuable to have a table like this:

# The following table shows the results when the type coercion in Arrow is needed, that is,
# when the user-specified return type(SQL Type) of the UDF and the actual instance(Python
# Value(Type)) that the UDF returns are different.
# Arrow and Pickle have different type coercion rules, so a UDF might have a different result
# with/without Arrow optimization. That's the main reason the Arrow optimization for Python
# UDFs is disabled by default.
# +-----------------------------+--------------+----------+------+---------------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+----------------------------+------------+--------------+ # noqa
# |SQL Type \ Python Value(Type)|None(NoneType)|True(bool)|1(int)| a(str)| 1970-01-01(date)|1970-01-01 00:00:00(datetime)|1.0(float)|array('i', [1])(array)|[1](list)| (1,)(tuple)|bytearray(b'ABC')(bytearray)| 1(Decimal)|{'a': 1}(dict)| # noqa
# +-----------------------------+--------------+----------+------+---------------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+----------------------------+------------+--------------+ # noqa
# | boolean| None| True| None| None| None| None| None| None| None| None| None| None| None| # noqa
# | tinyint| None| None| 1| None| None| None| None| None| None| None| None| None| None| # noqa
# | smallint| None| None| 1| None| None| None| None| None| None| None| None| None| None| # noqa
# | int| None| None| 1| None| None| None| None| None| None| None| None| None| None| # noqa
# | bigint| None| None| 1| None| None| None| None| None| None| None| None| None| None| # noqa
# | string| None| 'true'| '1'| 'a'|'java.util.Gregor...| 'java.util.Gregor...| '1.0'| '[I@120d813a'| '[1]'|'[Ljava.lang.Obje...| '[B@48571878'| '1'| '{a=1}'| # noqa
# | date| None| X| X| X|datetime.date(197...| datetime.date(197...| X| X| X| X| X| X| X| # noqa
# | timestamp| None| X| X| X| X| datetime.datetime...| X| X| X| X| X| X| X| # noqa
# | float| None| None| None| None| None| None| 1.0| None| None| None| None| None| None| # noqa
# | double| None| None| None| None| None| None| 1.0| None| None| None| None| None| None| # noqa
# | binary| None| None| None|bytearray(b'a')| None| None| None| None| None| None| bytearray(b'ABC')| None| None| # noqa
# | decimal(10,0)| None| None| None| None| None| None| None| None| None| None| None|Decimal('1')| None| # noqa
# +-----------------------------+--------------+----------+------+---------------+--------------------+-----------------------------+----------+----------------------+---------+--------------------+----------------------------+------------+--------------+ # noqa
# Note: Python 3.9.15, Pandas 1.5.2 and PyArrow 10.0.1 are used.
# Note: The values of 'SQL Type' are DDL formatted strings, which can be used as `returnType`s.
# Note: The values inside the table are generated by `repr`. 'X' means it throws an exception
# during the conversion.
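
For illustration, a minimal sketch of the casting difference described above. It assumes an active SparkSession bound to `spark` and a Spark version where `udf(..., useArrow=True)` is available (3.5+); the NULL-vs-1.0 outcome follows the `float` row of the table above.

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# Declared return type is FloatType, but the function returns a Python int.
@udf(returnType=FloatType())
def pickled_one(x):
    # Pickle-based conversion does not coerce int -> float, so this becomes NULL.
    return 1

@udf(returnType=FloatType(), useArrow=True)
def arrow_one(x):
    # Arrow's type coercion casts the int to 1.0.
    return 1

df = spark.range(1)
df.select(pickled_one("id").alias("pickled"), arrow_one("id").alias("arrow")).show()
# Expected, per the table above:
# +-------+-----+
# |pickled|arrow|
# +-------+-----+
# |   NULL|  1.0|
# +-------+-----+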

@allisonwang-db (Contributor)

^ We don't have to add everything in this PR, but I do think we should have a separate table for type conversion in PySpark docs, and then we can improve it.

@PhilDakin (Contributor, Author)

@allisonwang-db I brought back the table and added a section indicating when these conversions are relevant during UDF definition.

Will follow up with examples going into more depth on type conversion as a separate PR for https://issues.apache.org/jira/browse/SPARK-44734.

@PhilDakin (Contributor, Author)

I agree that duplicating the table is not ideal. It would be nice to have a cross-format inclusion mechanism for tables, between the main documentation and PySpark's. Seems a bit out of scope for this PR, though.

@PhilDakin (Contributor, Author)

@allisonwang-db what do you think here?

@zhengruifeng (Contributor)

Do we need to mention related configs like spark.sql.pyspark.inferNestedDictAsStruct.enabled and spark.sql.timestampType?

In createDataFrame, spark.sql.pyspark.inferNestedDictAsStruct.enabled controls whether a dict is treated as a map or a struct (a short sketch follows the examples below).

BTW, I think we may need to mention nested rows and numpy arrays (the examples below assume `import numpy as np` and `from pyspark.sql import Row`):

In [25]: spark.createDataFrame(np.zeros([3,3], "int8"))
Out[25]: DataFrame[_1: tinyint, _2: tinyint, _3: tinyint]

In [26]: spark.createDataFrame(np.zeros([3,3], "int64"))
Out[26]: DataFrame[_1: bigint, _2: bigint, _3: bigint]

In [27]: spark.createDataFrame([Row(a=1, b=Row(c=2))])
Out[27]: DataFrame[a: bigint, b: struct<c:bigint>]
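
To make the first config concrete, a hedged sketch of the nested-dict behavior (assuming an active SparkSession bound to `spark`; the printed schemas are approximate):

data = [{"a": {"b": 1}}]

# Default (false): nested dicts are inferred as MapType.
spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", "false")
spark.createDataFrame(data).printSchema()
# root
#  |-- a: map (nullable = true)
#  |    |-- key: string
#  |    |-- value: long (valueContainsNull = true)

# Enabled: nested dicts are inferred as StructType.
spark.conf.set("spark.sql.pyspark.inferNestedDictAsStruct.enabled", "true")
spark.createDataFrame(data).printSchema()
# root
#  |-- a: struct (nullable = true)
#  |    |-- b: long (nullable = true)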

@allisonwang-db (Contributor) left a review comment:

Thanks for working on this! This is much more clear. It would be great also to include a screenshot in the PR description.

Commits pushed:
…- no longs in Python, use note directive, fix title RST lines.
…- add example section emphasizing importance during UDFs, TODO for conversions.
…- add relevant configs, provided more examples.
@PhilDakin (Contributor, Author)

@allisonwang-db added full-page screenshot to description and rebased onto master.

@PhilDakin (Contributor, Author)

@allisonwang-db any further updates needed here?

@allisonwang-db (Contributor) left a review comment:

Thanks for working on this!

df = spark.createDataFrame(
    [[1]], schema=StructType([StructField("int", IntegerType())])
)

Member comment:

There should be two blank lines here, per PEP 8.

@udf(returnType=StringType())
def to_string(value):
    return str(value)

Member comment:

ditto

@udf(returnType=FloatType())
def to_float(value):
    return float(value)

Member comment:

ditto

# |-- Score: double (nullable = true)
# |-- Period: long (nullable = true)

import pandas as pd
Member comment:

I would move the imports to the top. numpy too.


.. code-block:: python

data = [
Member comment:

This should be indented either 3 spaces (per the Sphinx specification) or 4 spaces, to be consistent across the PySpark documentation (yes, we're using non-standard spacing in most of the rst files).

* - Configuration
- Description
- Default
* - spark.sql.execution.pythonUDF.arrow.enabled
Member comment:

Should we make it a code literal, like this suggested change?

- * - spark.sql.execution.pythonUDF.arrow.enabled
+ * - `spark.sql.execution.pythonUDF.arrow.enabled`


.. code-block:: python

from pyspark.sql.types import *
Member comment:

Let's avoid wildcard imports; they're discouraged by PEP 8.
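
For reference, a sketch of an explicit import that could replace the wildcard (the exact names are illustrative and depend on what the page's examples actually use):

# Instead of:
#     from pyspark.sql.types import *
# import only the names the examples need, for example:
from pyspark.sql.types import (
    IntegerType,
    FloatType,
    StringType,
    StructField,
    StructType,
)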


All Conversions
---------------
.. list-table::
@HyukjinKwon (Member) commented Nov 12, 2023:

Let's at least add a comment here to update docs/sql-ref-datatypes.md together whenever anyone makes a change. I still don't like that we're duplicating the docs, but it's probably fine since we're going to put all the Python-specific information here.

@@ -119,10 +119,10 @@ from pyspark.sql.types import *

|Data type|Value type in Python|API to access or create a data type|
Member comment:

Let's also add a comment noting that python/docs/source/user_guide/sql/type_conversions.rst should be fixed as well. You could use <!-- comment -->.

Python to Spark Type Conversions
================================

.. TODO: Add additional information on conversions when Arrow is enabled.
Member comment:

Should probably file a JIRA.

@PhilDakin (Contributor, Author) replied:
This is covered by the ticket in the TODO below, modifying to make this clear.

@HyukjinKwon (Member) left a review:
LGTM otherwise.

@HyukjinKwon (Member)

Merged to master.

@HyukjinKwon (Member)

@PhilDakin do you have a JIRA ID, so I can assign this ticket (SPARK-44733) to you? Feel free to comment directly in the JIRA.

@PhilDakin (Contributor, Author) commented Nov 14, 2023

@HyukjinKwon ah, I was still going to address your other comments before merge. Not a big deal.

Will comment on Jira.
