Commit 0dc774e

itholic authored and HyukjinKwon committed
[SPARK-40590][TEST] Fix ps.read_parquet when pandas_metadata is True
### What changes were proposed in this pull request?

This PR proposes to fix the `ps.read_parquet` test, because pandas' `to_parquet` is broken in pandas 1.5.0 when the index is a `MultiIndex`. The test relies on `to_parquet` to write its input file, so it failed with pandas 1.5.0 as shown below (the `MultiIndex` is not respected):

```python
DataFrame shape mismatch
[left]:  (20, 5)
[right]: (20, 4)

Left:
    i32  i64     f  bhello     index
 0    0    0   0.0      yo  0.617492
 1    1    1   1.0  people  0.823826
 2    2    2   2.0  people  0.443275
 3    0    3   3.0   hello  0.639776
 4    1    4   4.0      yo  0.393410
 5    2    0   5.0      yo  0.898860
 6    0    1   6.0  people  0.725236
 7    1    2   7.0      yo  0.933009
 8    2    3   8.0      yo  0.663381
 9    0    4   9.0   hello  0.471077
10    1    0  10.0   hello  0.562182
11    2    1  11.0  people  0.734902
12    0    2  12.0      yo  0.956519
13    1    3  13.0   hello  0.860517
14    2    4  14.0  people  0.012749
15    0    0  15.0  people  0.561815
16    1    1  16.0  people  0.389130
17    2    2  17.0   hello  0.930301
18    0    3  18.0   hello  0.835025
19    1    4  19.0      yo  0.212191
i32         int32
i64         int64
f         float64
bhello     object
index     float64
dtype: object

Right:
              i32  i64     f  bhello
       index
 0  0.617492    0    0   0.0      yo
 1  0.823826    1    1   1.0  people
 2  0.443275    2    2   2.0  people
 3  0.639776    0    3   3.0   hello
 4  0.393410    1    4   4.0      yo
 5  0.898860    2    0   5.0      yo
 6  0.725236    0    1   6.0  people
 7  0.933009    1    2   7.0      yo
 8  0.663381    2    3   8.0      yo
 9  0.471077    0    4   9.0   hello
10  0.562182    1    0  10.0   hello
11  0.734902    2    1  11.0  people
12  0.956519    0    2  12.0      yo
13  0.860517    1    3  13.0   hello
14  0.012749    2    4  14.0  people
15  0.561815    0    0  15.0  people
16  0.389130    1    1  16.0  people
17  0.930301    2    2  17.0   hello
18  0.835025    0    3  18.0   hello
19  0.212191    1    4  19.0      yo
i32         int32
i64         int64
f         float64
bhello     object
dtype: object
```

### Why are the changes needed?

All tests should pass with pandas 1.5.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested with pandas 1.5.0.

Closes #38055 from itholic/SPARK-40590.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent 71bc4ad commit 0dc774e

1 file changed: +12 -4 lines changed

python/pyspark/pandas/tests/test_dataframe_spark_io.py

Lines changed: 12 additions & 4 deletions
@@ -18,6 +18,7 @@
 import unittest
 import glob
 import os
+from distutils.version import LooseVersion

 import numpy as np
 import pandas as pd
@@ -96,11 +97,18 @@ def test_parquet_read_with_pandas_metadata(self):
             self.assert_eq(ps.read_parquet(path2, pandas_metadata=True), expected2)

             expected3 = expected2.set_index("index", append=True)
+            # There is a bug in `to_parquet` from pandas 1.5.0 when writing MultiIndex.
+            # See https://github.com/pandas-dev/pandas/issues/48848 for the reported issue.
+            if LooseVersion(pd.__version__) == LooseVersion("1.5.0"):
+                expected_psdf = ps.read_parquet(path2, pandas_metadata=True).set_index(
+                    "index", append=True
+                )
+            else:
+                path3 = "{}/file3.parquet".format(tmp)
+                expected3.to_parquet(path3)
+                expected_psdf = ps.read_parquet(path3, pandas_metadata=True)

-            path3 = "{}/file3.parquet".format(tmp)
-            expected3.to_parquet(path3)
-
-            self.assert_eq(ps.read_parquet(path3, pandas_metadata=True), expected3)
+            self.assert_eq(expected_psdf, expected3)

     def test_parquet_write(self):
         with self.temp_dir() as tmp:
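
The diff gates the workaround on an exact version match using `distutils.version.LooseVersion`. `distutils` was deprecated by PEP 632 and removed in Python 3.12, so the following is a minimal, standard-library-only sketch of the same kind of gate. It is not part of this commit: `parse_version` and `is_affected` are hypothetical names introduced here for illustration, and the parsing deliberately ignores any pre-release or local suffix, which differs from how `LooseVersion` treats such strings.

```python
import re


def parse_version(version: str) -> tuple:
    """Return the leading numeric release components of a version string.

    Hypothetical helper (not from the commit): "1.5.0" -> (1, 5, 0).
    Anything after the numeric part, such as "rc1", is ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def is_affected(pandas_version: str) -> bool:
    """Return True only for the exact release the commit gates on (1.5.0)."""
    return parse_version(pandas_version) == (1, 5, 0)
```

In the test, a check such as `is_affected(pd.__version__)` could stand in for the `LooseVersion` comparison, assuming pandas is imported as `pd` as in the file above.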
