@itholic itholic commented Sep 30, 2022

What changes were proposed in this pull request?

This PR fixes the ps.read_parquet test, because pandas' DataFrame.to_parquet is broken as of pandas 1.5.0 when the index is a MultiIndex.

The test uses DataFrame.to_parquet to write its input, so it fails with pandas 1.5.0 as shown below (the MultiIndex is not preserved):

DataFrame shape mismatch
[left]:  (20, 5)
[right]: (20, 4)

Left:
    i32  i64     f  bhello     index
0     0    0   0.0      yo  0.617492
1     1    1   1.0  people  0.823826
2     2    2   2.0  people  0.443275
3     0    3   3.0   hello  0.639776
4     1    4   4.0      yo  0.393410
5     2    0   5.0      yo  0.898860
6     0    1   6.0  people  0.725236
7     1    2   7.0      yo  0.933009
8     2    3   8.0      yo  0.663381
9     0    4   9.0   hello  0.471077
10    1    0  10.0   hello  0.562182
11    2    1  11.0  people  0.734902
12    0    2  12.0      yo  0.956519
13    1    3  13.0   hello  0.860517
14    2    4  14.0  people  0.012749
15    0    0  15.0  people  0.561815
16    1    1  16.0  people  0.389130
17    2    2  17.0   hello  0.930301
18    0    3  18.0   hello  0.835025
19    1    4  19.0      yo  0.212191
i32         int32
i64         int64
f         float64
bhello     object
index     float64
dtype: object

Right:
             i32  i64     f  bhello
   index                           
0  0.617492    0    0   0.0      yo
1  0.823826    1    1   1.0  people
2  0.443275    2    2   2.0  people
3  0.639776    0    3   3.0   hello
4  0.393410    1    4   4.0      yo
5  0.898860    2    0   5.0      yo
6  0.725236    0    1   6.0  people
7  0.933009    1    2   7.0      yo
8  0.663381    2    3   8.0      yo
9  0.471077    0    4   9.0   hello
10 0.562182    1    0  10.0   hello
11 0.734902    2    1  11.0  people
12 0.956519    0    2  12.0      yo
13 0.860517    1    3  13.0   hello
14 0.012749    2    4  14.0  people
15 0.561815    0    0  15.0  people
16 0.389130    1    1  16.0  people
17 0.930301    2    2  17.0   hello
18 0.835025    0    3  18.0   hello
19 0.212191    1    4  19.0      yo
i32         int32
i64         int64
f         float64
bhello     object
dtype: object
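
The shape mismatch above can be illustrated without writing any Parquet file. The following is a minimal sketch, not the actual test: the data are a hypothetical stand-in for the test frame, and `flattened` only imitates a frame whose extra index level came back as a regular column. It assumes nothing about how pandas 1.5.0 produces that result internally.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the test data (values are illustrative only).
rng = np.random.default_rng(0)
pdf = pd.DataFrame(
    {
        "i32": (np.arange(20) % 3).astype("int32"),
        "i64": (np.arange(20) % 5).astype("int64"),
        "f": np.arange(20.0),
        "bhello": rng.choice(["hello", "yo", "people"], size=20),
        "index": rng.random(20),
    }
)

# Expected frame: "index" is appended to the row index, giving a
# two-level MultiIndex and four data columns.
expected = pdf.set_index("index", append=True)

# Imitation of the failure: the "index" level ends up as a plain column,
# so the frame has five data columns and a single-level index.
flattened = expected.reset_index(level="index")

print(expected.shape)   # (20, 4)
print(flattened.shape)  # (20, 5)
```

A frame comparison between `expected` and `flattened` fails on shape before any values are compared, which matches the `(20, 5)` versus `(20, 4)` report above.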

Why are the changes needed?

All tests should pass with pandas 1.5.0.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manually tested with pandas 1.5.0.


expected3 = expected2.set_index("index", append=True)
# There is a bug in `to_parquet` from pandas 1.5.0 when writing MultiIndex.
# See https://github.com/pandas-dev/pandas/issues/48848 for the reported issue.
Contributor Author

This is the related discussion with the pandas community in the pandas repo; the issue has been confirmed as a regression.

Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>

itholic commented Oct 6, 2022

cc @HyukjinKwon Could you take a look when you have some time? The review comment has been resolved.

@HyukjinKwon
Member

Merged to master.

@itholic itholic deleted the SPARK-40590 branch April 22, 2023 05:46