[SPARK-41833][SPARK-41881][SPARK-41815][CONNECT][PYTHON] Make `DataFrame.collect` handle None/NaN/Array/Binary properly #39386

zhengruifeng · 2023-01-04T07:37:07Z

What changes were proposed in this pull request?

The existing `DataFrame.collect` collects the incoming Arrow batches into a pandas DataFrame and then converts each series into a `Row`. This is problematic because it cannot correctly handle None, NaN, arrays, binary values, etc.

This PR refactors `DataFrame.collect` to build rows directly from the raw Arrow Table, in order to support:
1. None/NaN values
2. ArrayType
3. BinaryType
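
As a rough illustration of the idea (not the code in this PR), the sketch below transposes column-oriented data into rows without passing through an intermediate that coerces values. It is pure Python: the `columns` dict is a hypothetical stand-in for an Arrow Table already converted to Python objects, and the `namedtuple` is a hypothetical stand-in for `pyspark.sql.Row`. The helper name `rows_from_columns` is also an assumption made for this example.

```python
import math
from collections import namedtuple


def rows_from_columns(columns):
    """Transpose a column-oriented mapping into a list of row tuples.

    `columns` maps column name -> list of Python values. It stands in for
    an Arrow Table (assumption for this sketch). Values are passed through
    untouched, so None stays None, NaN stays NaN, lists stay lists, and
    bytes stay bytes.
    """
    names = list(columns)
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    # Hypothetical stand-in for pyspark.sql.Row.
    Row = namedtuple("Row", names)
    # zip(*...) walks the columns in lockstep, yielding one tuple per row.
    return [Row(*values) for values in zip(*columns.values())]


columns = {
    "id": [1, 2, 3],
    "score": [0.5, None, float("nan")],   # None and NaN must stay distinct
    "tags": [["a", "b"], [], None],       # array-like values
    "payload": [b"\x00\x01", None, b""],  # binary values
}
rows = rows_from_columns(columns)
```

In this sketch `rows[1].score` is `None` while `rows[2].score` is a float NaN, which is the distinction the list above asks `collect` to preserve.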

Why are the changes needed?

To be consistent with PySpark

Does this PR introduce any user-facing change?

yes

How was this patch tested?

enabled doctests

fix fix fix fix

zhengruifeng · 2023-01-04T11:03:13Z

cc @HyukjinKwon @grundprinzip

HyukjinKwon · 2023-01-04T23:51:59Z

Merged to master.

github-actions bot added CONNECT CORE PYTHON SQL labels Jan 4, 2023

zhengruifeng changed the title ~~[SPARK-41833][CONNECT][PYTHON] Make DataFrame.collect handle ArrayType and BinaryType porperly~~ [SPARK-41833][CONNECT][PYTHON][WIP] Make DataFrame.collect handle ArrayType and BinaryType porperly Jan 4, 2023

zhengruifeng force-pushed the connect_fix_41833 branch from 8d54dfa to 32ea86f Compare January 4, 2023 10:46

zhengruifeng changed the title ~~[SPARK-41833][CONNECT][PYTHON][WIP] Make DataFrame.collect handle ArrayType and BinaryType porperly~~ [SPARK-41833][SPARK-41881][SPARK-41815][CONNECT][PYTHON] Make DataFrame.collect handle None/NaN/Array/Binary porperly Jan 4, 2023

zhengruifeng force-pushed the connect_fix_41833 branch from 32ea86f to bee5a32 Compare January 4, 2023 10:54

init

8b69df3

fix fix fix fix

zhengruifeng force-pushed the connect_fix_41833 branch from bee5a32 to 8b69df3 Compare January 4, 2023 11:01

HyukjinKwon approved these changes Jan 4, 2023

HyukjinKwon closed this in 731b89d Jan 4, 2023

zhengruifeng deleted the connect_fix_41833 branch January 5, 2023 00:29
