[SPARK-22221][DOCS] Adding User Documentation for Arrow #19575
Conversation
This is a WIP to start adding user documentation describing how to use Arrow-enabled functionality and any differences the user might see. I'm not sure if the SQL programming guide is the right place to add it, but I'll start here and can move it if needed. Here is a high-level list of things to add:
Test build #83053 has finished for PR 19575 at commit
do we need some rudimentary doc for 2.3?
Yea, I would like to know it too.
Yes, I think we do need something to at least highlight some differences if using Arrow. I've been meaning to work on this, just been too busy lately. @icexelloss if you're able to help out on this, that would be great!
I am happy to help out with some sections.
0723e86 to 5699d1b (Compare)
Test build #86218 has finished for PR 19575 at commit
@HyukjinKwon @ueshin @gatorsmile does this seem like an appropriate place to put Arrow related user docs? I think we just need to add something for the additional pandas_udfs. It's still a little rough, so I will go over it all again.
Test build #86540 has finished for PR 19575 at commit
Thanks for working on this! I left some comments.
docs/sql-programming-guide.md (Outdated)
## How to Write Vectorized UDFs
A vectorized UDF is similar to a standard UDF in Spark except the inputs and output of the will
`output of the will` -> `output of the udf will`?
ooops, I think I meant the inputs and output will be Pandas Series
docs/sql-programming-guide.md (Outdated)
a `datetime64` type with nanosecond resolution, `datetime64[ns]` with optional time zone.
When timestamp data is transferred from Spark to Pandas it will be converted to nanoseconds
and made time zone aware using the Spark session time zone, if set, or local Python system time
We use session time zone anyway. If not set, the default is the JVM system timezone (the value returned by `TimeZone.getDefault()`).
I guess I was thinking of when `respectSessionTimeZone` is false, but it is true by default and will be deprecated. But to keep things simple, maybe it's best not to mention this conf and just say session tz or JVM default?
docs/sql-programming-guide.md (Outdated)
Apache Arrow is an in-memory columnar data format that is used in Spark to efficiently transfer
data between JVM and Python processes. This currently is most beneficial to Python users that
work with Pandas/NumPy data. It's usage is not automatic and might require some minor
`It's usage is` -> `Its usage is`?
yes, you're right
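(As an aside for readers: a minimal sketch of the kind of "minor configuration" the quoted text refers to; the conf name is taken from the surrounding doc text and the data here is made up.)

```python
import numpy as np
import pandas as pd

# Arrow-based conversion is opt-in; enable it for the active SparkSession
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

# Both directions go through Arrow once the conf is set
pdf = pd.DataFrame(np.random.rand(100, 3))
sdf = spark.createDataFrame(pdf)   # Pandas -> Spark
result_pdf = sdf.toPandas()        # Spark -> Pandas
```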
docs/sql-programming-guide.md (Outdated)
high memory usage in the JVM. To avoid possible out of memory exceptions, the size of the Arrow
record batches can be adjusted by setting the conf "spark.sql.execution.arrow.maxRecordsPerBatch"
to an integer that will determine the maximum number of rows for each batch. Using this limit,
each data partition will be made into 1 or more record batches for processing.
Should we mention the default value of `spark.sql.execution.arrow.maxRecordsPerBatch`?
yes
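(For reference, a minimal sketch of adjusting that conf once the default is documented; the value below is arbitrary.)

```python
# Lower the per-batch row cap (default 10,000) if wide rows cause
# memory pressure while converting partitions to Arrow record batches
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
```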
Thanks for the review @ueshin! If the RC passes, will this still be able to get in before the docs are updated? @icexelloss will you be able to write a brief section on groupby/apply soon, in case this can be merged?
Hi Bryan, sorry I haven't got a chance to take a look at this. Yes, I can write the groupby apply section tomorrow.
Test build #86610 has finished for PR 19575 at commit
docs/sql-programming-guide.md (Outdated)
and each column will be made time zone aware using the Spark session time zone. This will occur
when calling `toPandas()` or `pandas_udf` with a timestamp column. For example if the session time
zone is 'America/Los_Angeles' then the Pandas timestamp column will be of type
`datetime64[ns, America/Los_Angeles]`.
I'm afraid this is not correct. The timestamp value will be timezone naive anyway, which represents the timestamp respecting the session timezone, but the timezone info will be dropped. As a result, the timestamp column will be of type `datetime64[ns]`.
Sorry, I should have refreshed my memory better before writing.. fixing now
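(To make the corrected behavior concrete, a small sketch assuming Arrow is enabled; `spark.sql.session.timeZone` is the session time zone conf mentioned above.)

```python
import pandas as pd

spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

sdf = spark.createDataFrame(
    pd.DataFrame({"ts": [pd.Timestamp("2018-01-19 17:00:00")]}))

pdf = sdf.toPandas()
# The column comes back time zone naive, with values expressed in the
# session time zone, so the dtype is datetime64[ns] rather than
# datetime64[ns, America/Los_Angeles]
print(pdf["ts"].dtype)
```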
it looks like maybe we have a blocker for RC2?
How about a critical @felixcheung? I will focus on this one anyway to get this into 2.3.0. Seems RC3 will be going on if I didn't misunderstand.
Insane nitpicking. Thoughts and suggestions are mixed. Maybe you could just pick up what makes sense to you. Doc is kind of an important but grunting job to be honest .. thank you for doing this.
docs/sql-programming-guide.md (Outdated)
## How to Enable for Conversion to/from Pandas
Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
tiny nit: there's a trailing whitespace
docs/sql-programming-guide.md (Outdated)
Note that a standard UDF (non-Pandas) will load timestamp data as Python datetime objects, which is
different than a Pandas timestamp. It is recommended to use Pandas time series functionality when
working with timestamps in `pandas_udf`s to get the best performance, see
[here](https://pandas.pydata.org/pandas-docs/stable/timeseries.html) for details.
ditto for a trailing whitespace
docs/sql-programming-guide.md (Outdated)
give a high-level description of how to use Arrow in Spark and highlight any differences when
working with Arrow-enabled data.
## Ensure pyarrow Installed
Maybe, pyarrow -> PyArrow
docs/sql-programming-guide.md (Outdated)
## Ensure pyarrow Installed
If you install pyspark using pip, then pyarrow can be brought in as an extra dependency of the sql
maybe, pyspark -> PySpark
docs/sql-programming-guide.md (Outdated)
## Ensure pyarrow Installed
If you install pyspark using pip, then pyarrow can be brought in as an extra dependency of the sql
sql -> SQL
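(On the install side, a quick sanity check one might run; `pip install pyspark[sql]` is the extra described in the quoted text, and 0.8.0 is the minimum pyarrow version it mentions.)

```python
# Installed via: pip install pyspark[sql]  (pulls in pyarrow as an extra)
import pyarrow

# Spark 2.3 expects pyarrow >= 0.8.0 on both the driver and the workers
print(pyarrow.__version__)
```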
docs/sql-programming-guide.md (Outdated)
## How to Write Vectorized UDFs
A vectorized UDF is similar to a standard UDF in Spark except the inputs and output will be
Pandas Series, which allow the function to be composed with vectorized operations. This function
`Pandas Series` -> `Pandas Series/DataFrame`
Maybe also say please check the API doc. Maybe this one needs help from @icexelloss to generally organise these and clean up. This description sounds like it only applies to scalar UDFs.
Yeah I can help with that. @BryanCutler do you mind if I make some change to this section?
No, I don't mind please go ahead
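(In the meantime, a minimal sketch of the scalar case described above; the column and function names are made up, and `pandas_udf` defaults to the scalar type.)

```python
from pyspark.sql.functions import pandas_udf

df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["v"])

@pandas_udf("double")
def plus_one(v):
    # v is a pandas.Series covering a batch of rows; the return value
    # must be a pandas.Series of the same length
    return v + 1

df.select(plus_one(df["v"])).show()
```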
docs/sql-programming-guide.md (Outdated)
<div data-lang="python" markdown="1">
{% highlight python %}
import numpy as np
Can we have an example file separately?
I was thinking that too as they were a little bit longer than I thought. How about we leave them here for now and then follow up with separate files with proper runnable examples?
Yup, sounds fine.
docs/sql-programming-guide.md (Outdated)
0.8.0. You can install using pip or conda from the conda-forge channel. See pyarrow
[installation](https://arrow.apache.org/docs/python/install.html) for details.
## How to Enable for Conversion to/from Pandas
maybe "Enabling for Conversion to/from Pandas" just to match the sentence form
docs/sql-programming-guide.md (Outdated)
give a high-level description of how to use Arrow in Spark and highlight any differences when
working with Arrow-enabled data.
## Ensure pyarrow Installed
Seems there are more sub topics than I thought. Probably, we could consider removing this one too.
remove the section on installing or just the header and merge with the above section?
I changed to a sub-heading, let me know if you think that is better
docs/sql-programming-guide.md (Outdated)
@@ -1640,6 +1640,154 @@ Configuration of Hive is done by placing your `hive-site.xml`, `core-site.xml` a
You may run `./bin/spark-sql --help` for a complete list of all available
options.
# Usage Guide for Pandas with Arrow
Could we leave a word "PySpark" somewhere at the first?
@BryanCutler I added a section for groupby apply here: https://github.com/BryanCutler/spark/pull/29/files
Thanks all, I'll merge the groupby PR and do an update now
Test build #86648 has finished for PR 19575 at commit
Thanks everyone for reviewing! I think I addressed all the comments, so please take one more look.
Test build #86719 has finished for PR 19575 at commit
Test build #86720 has finished for PR 19575 at commit
</div>
### Group Map
Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern.
@rxin WDYT about this name?
Which name? If you mean "split-apply-combine", I think it's fine - https://pandas.pydata.org/pandas-docs/stable/groupby.html
Group Map
`Grouped Vectorized UDFs`?
I can change to whatever you guys like, but I think these two section names were made to reflect the different pandas_udf types - scalar and group map. Is that right @icexelloss ?
That is correct. The names in this section match the enums in `PandasUDFType`.
@icexelloss we already agreed on the names when we wrote the blog, right?
docs/sql-programming-guide.md (Outdated)
## Usage Notes
### Supported SQL-Arrow Types
-> Supported SQL Types
docs/sql-programming-guide.md (Outdated)
### Supported SQL-Arrow Types
Currently, all Spark SQL data types are supported except `MapType`, `ArrayType` of `TimestampType`, and
`are supported` -> `are supported by Arrow-based conversion`
docs/sql-programming-guide.md (Outdated)
high memory usage in the JVM. To avoid possible out of memory exceptions, the size of the Arrow
record batches can be adjusted by setting the conf "spark.sql.execution.arrow.maxRecordsPerBatch"
to an integer that will determine the maximum number of rows for each batch. The default value is
10,000 records per batch and does not take into account the number of columns, so it should be
The default value is 10,000 records per batch. Since the number of columns could be huge, the value should be adjusted accordingly.
How about "The default value is 10,000 records per batch. If the number of columns is large, the value should be adjusted accordingly"
Data partitions in Spark are converted into Arrow record batches, which can temporarily lead to
high memory usage in the JVM. To avoid possible out of memory exceptions, the size of the Arrow
record batches can be adjusted by setting the conf "spark.sql.execution.arrow.maxRecordsPerBatch"
Based on this description, it sounds like we should not use the number of records, but the size, right? cc @cloud-fan @ueshin too
No, both can be used where applicable.
We went with `maxRecordsPerBatch` because it's easy to implement, otherwise we may need some way to estimate/calculate the memory consumption of arrow data. @BryanCutler is it easy to do?
It's possible to estimate the size of the arrow buffer used, but it does make it more complicated to implement in Spark. I also wonder how useful this would be if the user hits memory problems. At least with a number of records, it's easy to understand and change.
The current approach just makes it hard for external users to tune. Now, `maxRecordsPerBatch` also depends on the width of your output schema. This is not user friendly to end users.
Yes, it's not an ideal approach. I'm happy to make a JIRA to follow up and look into other ways to break up the batches, but that won't be in before 2.3. So does that mean our options here are (unless I'm not understanding internal/external conf correctly):
1. Keep `maxRecordsPerBatch` internal and remove this doc section.
2. Externalize this conf and deprecate it once a better approach is found.

I think (2) is better because if the user hits memory issues, then they can at least find some way to adjust it
Since it is too late to add a new conf for the 2.3 release, we can do it in the 2.4 release. In 2.4, we can respect both confs; we just need to change the default of `maxRecordsPerBatch` to int.max. I am fine with externalizing it in the 2.3 release.
To use `groupBy().apply()`, the user needs to define the following:
* A Python function that defines the computation for each group.
* A `StructType` object or a string that defines the schema of the output `DataFrame`.
We need to warn users that the Group Map Pandas UDF requires loading all the data of a group into memory, which is not controlled by `spark.sql.execution.arrow.maxRecordsPerBatch`, and may OOM if the data is skewed and some partitions have a lot of records.
yeah good point, I'll add that
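(For context on both requirements, and on why a whole group is loaded into memory at once, a minimal grouped map sketch; the names are illustrative and `PandasUDFType.GROUPED_MAP` is the enum spelling that shipped in the 2.3 API, which this thread was still debating.)

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

# The string below is the schema of the output DataFrame
@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding *all* rows of one group,
    # which is why skewed groups can exhaust executor memory
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```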
Seems we better stick to same naming? - https://databricks.com/blog/2017/10/30/introducing-vectorized-udfs-for-pyspark.html and #18732 (comment)?
</div>
### Group Map
Group map Pandas UDFs are used with `groupBy().apply()` which implements the "split-apply-combine" pattern.
`Grouped Vectorized UDFs`?
or to wrap the function, no additional configuration is required. Currently, there are two types of
Pandas UDF: Scalar and Group Map.
### Scalar
`Scalar Vectorized UDFs`?
On a side note, I think in this context `Pandas UDFs` and `Vectorized UDFs` are interchangeable from a user's point of view; I am not sure of the need to introduce both to users. Maybe we should just stick to one of them?
docs/sql-programming-guide.md (Outdated)
## Enabling for Conversion to/from Pandas
Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
`a Spark DataFrame to Pandas` -> `a Spark DataFrame to Pandas DataFrame`?
docs/sql-programming-guide.md (Outdated)
## Enabling for Conversion to/from Pandas
Arrow is available as an optimization when converting a Spark DataFrame to Pandas using the call
`toPandas()` and when creating a Spark DataFrame from Pandas with `createDataFrame(pandas_df)`.
`a Spark DataFrame from Pandas` -> `a Spark DataFrame from Pandas DataFrame`?
Test build #86755 has finished for PR 19575 at commit
I have two major comments.
Actually, aggregation can only be executed on grouped data, so
Thanks @gatorsmile , I made https://issues.apache.org/jira/browse/SPARK-23258 to track changing the
It seems like we are changing around the groupBy-apply name a lot and I don't want to change things here unless this has been agreed upon, can you confirm @icexelloss ?
is it possible to decide on the names for groupBy()-apply() UDFs as a followup? it sounds like there are still things that need discussion
Thanks! I will submit a follow-up PR to rename it. Merged to 2.3 and master.
BTW, thanks for your great work! I will add all your names to the contributors of this PR
## What changes were proposed in this pull request?
Adding user facing documentation for working with Arrow in Spark

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19575 from BryanCutler/arrow-user-docs-SPARK-2221.

(cherry picked from commit 0d60b32)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
Thanks to everyone for contributing and reviewing!
@gatorsmile I don't think any change of naming (group map, group agg, etc) has been agreed upon yet. We can certainly open a PR to discuss it.
@icexelloss Thanks for your reply. I welcome your comments on my PR #20428
…xRecordsPerBatch

## What changes were proposed in this pull request?
This is a followup to #19575 which added a section on setting max Arrow record batches and this will externalize the conf that was referenced in the docs.

## How was this patch tested?
NA

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #20423 from BryanCutler/arrow-user-doc-externalize-maxRecordsPerBatch-SPARK-22221.

(cherry picked from commit f235df6)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

…xRecordsPerBatch

## What changes were proposed in this pull request?
This is a followup to #19575 which added a section on setting max Arrow record batches and this will externalize the conf that was referenced in the docs.

## How was this patch tested?
NA

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #20423 from BryanCutler/arrow-user-doc-externalize-maxRecordsPerBatch-SPARK-22221.
What changes were proposed in this pull request?
Adding user facing documentation for working with Arrow in Spark