
[SPARK-46253][PYTHON] Plan Python data source read using MapInArrow #44170

Closed

Conversation

allisonwang-db
Contributor

What changes were proposed in this pull request?

This PR changes how we plan the Python data source read. Instead of using a regular Python UDTF, we use an Arrow UDF and plan the data source read with the MapInArrow operator.
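The operator the new plan builds on can be exercised directly from the DataFrame API; a minimal sketch below illustrates the idea of producing the reader's output as Arrow record batches inside a function handed to mapInArrow. The partition layout and helper names are hypothetical, not taken from this PR.

import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pretend each input row identifies a partition of a hypothetical data source.
partitions = spark.range(4)

def read_partitions(batches):
    # `batches` is an iterator of pyarrow.RecordBatch for one Spark partition.
    for batch in batches:
        for pid in batch.column(0).to_pylist():  # the single "id" column from spark.range
            # Hypothetical per-partition read producing Python rows,
            # pivoted into an Arrow batch column by column.
            rows = [(pid, i) for i in range(3)]
            yield pa.RecordBatch.from_arrays(
                [pa.array([r[0] for r in rows]), pa.array([r[1] for r in rows])],
                names=["partition", "value"],
            )

df = partitions.mapInArrow(read_partitions, schema="partition long, value long")
df.show()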

Why are the changes needed?

To improve the performance of Python data source reads.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@allisonwang-db
Contributor Author

cc @ueshin

@allisonwang-db
Contributor Author

cc @cloud-fan @HyukjinKwon

def batched(iterator: Iterator, n: int) -> Iterator:
    return iter(functools.partial(lambda it: list(islice(it, n)), iterator), [])

max_batch_size = int(os.environ.get("ARROW_MAX_RECORDS_PER_BATCH", "10000"))
Member

While this is probably fine for now because the batch size is unlikely to change often, we should ideally send the configuration through the socket. Otherwise, a new Python worker will be created whenever this configuration changes instead of being reused.
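For reference, the helper above chunks the record iterator lazily into lists of at most n rows; a minimal standalone run (imports added here for self-containment):

import functools
import os
from itertools import islice
from typing import Iterator

def batched(iterator: Iterator, n: int) -> Iterator:
    # iter(callable, sentinel): keep pulling n-sized chunks until the empty-list sentinel.
    return iter(functools.partial(lambda it: list(islice(it, n)), iterator), [])

max_batch_size = int(os.environ.get("ARROW_MAX_RECORDS_PER_BATCH", "10000"))

print(list(batched(iter(range(10)), 4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]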

def __init__(self) -> None:
    self.ser = CloudPickleSerializer()

# Wrap the data source read logic in a mapInArrow UDF.
import pyarrow as pa
Member

And we shouldn't forget to document that this requires PyArrow.
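For context, PySpark already ships a version-check helper that raises a descriptive error when PyArrow is missing or too old; whether this exact guard is used on this code path is an assumption, but the usual pattern looks like:

from pyspark.sql.pandas.utils import require_minimum_pyarrow_version

# Fail fast with a clear ImportError if PyArrow is absent or below the supported version.
require_minimum_pyarrow_version()

import pyarrow as pa  # safe to import after the check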

@HyukjinKwon
Member

Merged to master.

pa_schema = to_arrow_schema(return_type)
column_names = return_type.fieldNames()
column_converters = [
    LocalDataToArrowConversion._create_converter(field.dataType)
Contributor

Since LocalDataToArrowConversion and ArrowTableToRowsConversion were not designed for the data source API, I think we should look into them to make sure the behavior is as expected, e.g.:
1. There is a _deduplicate_field_names logic introduced in 71bac15; not sure whether it should be used for the data source.
2. IIRC, the internally used to_arrow_schema doesn't support all SQL types.
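To make the concern concrete: the read path ultimately has to pivot plain Python rows into Arrow record batches that match the declared schema. A simplified, self-contained sketch using only public PyArrow and PySpark type APIs (the internal converters in this PR handle more cases, e.g. nested and null-aware conversions):

import pyarrow as pa
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.pandas.types import to_arrow_schema

return_type = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])
pa_schema = to_arrow_schema(return_type)

rows = [(1, "a"), (2, "b"), (3, "c")]

# Pivot rows into columns and build one Arrow batch that matches the schema.
columns = [
    pa.array([row[i] for row in rows], type=pa_schema.field(i).type)
    for i in range(len(pa_schema))
]
batch = pa.RecordBatch.from_arrays(columns, schema=pa_schema)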
