[SPARK-44733][PYTHON][DOCS] Add Python to Spark type conversion page to PySpark docs. #43369
Conversation
(force-pushed 68f4d6c to 320fec8)
Mind attaching the output HTML image? Otherwise looks fine from a cursory look. cc @itholic, @xinrong-meng and @zhengruifeng too if you find some time to review.
Test failure looks unrelated.
Looks nice. Could you rebase the PR to master?
(force-pushed 0e9fc71 to a7c7895)
Hi @PhilDakin, thanks for doing this! I personally think it's better to have the table here instead of a link to another page. Also, I think we should explain why this conversion table matters. For example, it is useful when users want to map a Python return type to a Spark return type in a Python UDF. Another thing we need to mention is type casting: what if I want to cast an int in Python to a FloatType in Spark? Currently, a regular Python UDF will return NULL, I believe, but an arrow-optimized Python UDF can cast the value properly. It would be valuable to have a table like the one in spark/python/pyspark/sql/udf.py, lines 94 to 119 at b41ea91.
^ We don't have to add everything in this PR, but I do think we should have a separate table for type conversion in PySpark docs, and then we can improve it.
(force-pushed a7c7895 to d91db17)
@allisonwang-db brought back the table and added a section indicating when these conversions are relevant during UDF definitions. Will follow up with examples going into more depth on type conversion as a separate PR for https://issues.apache.org/jira/browse/SPARK-44734.
(force-pushed d91db17 to a4d2fc8)
I agree that duplicating the table is not ideal. It would be nice to have a cross-format inclusion mechanism for tables, between the main documentation and PySpark's. Seems a bit out of scope for this PR, though.
@allisonwang-db what do you think here?
Do we need to mention related configs, too? In createDataFrame, BTW, I think we may need to mention nested rows and numpy arrays:
Thanks for working on this! This is much clearer. It would also be great to include a screenshot in the PR description.
- no longs in Python, use note directive, fix title RST lines.
- add example section emphasizing importance during UDFs, TODO for conversions.
- add relevant configs, provided more examples.
- period cleanup, move table.
(force-pushed 63d5793 to dac9cc1)
@allisonwang-db added full-page screenshot to description and rebased onto master.
@allisonwang-db any further updates needed here?
Thanks for working on this!
df = spark.createDataFrame(
    [[1]], schema=StructType([StructField("int", IntegerType())])
)
should be two blank lines according to PEP 8.
@udf(returnType=StringType())
def to_string(value):
    return str(value)
ditto
@udf(returnType=FloatType())
def to_float(value):
    return float(value)
ditto
# |-- Score: double (nullable = true)
# |-- Period: long (nullable = true)
import pandas as pd
I would move the imports to the top. numpy too.
.. code-block:: python

    data = [
should be either 3 spaces (per the Sphinx specification) or 4 spaces to be consistent across PySpark documentation (yes, we're using non-standard spacing in most of the rst files).
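For illustration, this is what the 4-space convention looks like in an RST source (the data values are placeholders):

```rst
.. code-block:: python

    data = [
        (1, "a"),
        (2, "b"),
    ]
```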
* - Configuration
  - Description
  - Default
* - spark.sql.execution.pythonUDF.arrow.enabled
Should we make it a code block, like
* - `spark.sql.execution.pythonUDF.arrow.enabled`
?
.. code-block:: python

    from pyspark.sql.types import *
Let's avoid wildcard imports; they're discouraged by PEP 8.
All Conversions
---------------
.. list-table::
Let's at least add a comment here to update docs/sql-ref-datatypes.md together if anyone makes some change. I still don't like that we're duplicating the docs, but it's probably fine since we're going to put all Python-specific information here.
@@ -119,10 +119,10 @@ from pyspark.sql.types import *
|Data type|Value type in Python|API to access or create a data type|
Let's also add a comment that we should fix python/docs/source/user_guide/sql/type_conversions.rst. You could use <!-- comment -->.
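A possible shape for such a cross-file sync comment (the wording is illustrative) near the top of docs/sql-ref-datatypes.md:

```markdown
<!--
  NOTE: Keep this table in sync with
  python/docs/source/user_guide/sql/type_conversions.rst
  whenever a Python type mapping changes.
-->
```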
Python to Spark Type Conversions
================================

.. TODO: Add additional information on conversions when Arrow is enabled.
Should probably file a JIRA
This is covered by the ticket in the TODO below, modifying to make this clear.
LGTM otherwise.
Merged to master.
@PhilDakin do you have a JIRA ID? So I can assign this ticket (SPARK-44733) to you. Feel free to comment directly in the JIRA.
@HyukjinKwon ah, I was still going to address your other comments before the merge. Not a big deal. Will comment on JIRA.
@allisonwang-db
What changes were proposed in this pull request?
Add documentation page showing Python to Spark type mappings for PySpark.
Why are the changes needed?
Surface this information to users navigating the PySpark docs per https://issues.apache.org/jira/browse/SPARK-44733.
Does this PR introduce any user-facing change?
Yes, adds new page to PySpark docs.
How was this patch tested?
Built the HTML docs using Sphinx and inspected them visually.
Was this patch authored or co-authored using generative AI tooling?
No.