[SPARK-40590][TEST] Fix ps.read_parquet when pandas_metadata is True

itholic · HyukjinKwon · commit 0dc774e625bc · 2022-10-06T15:54:05.000+09:00
### What changes were proposed in this pull request?

This PR proposes to fix the `ps.read_parquet` test, since `pd.to_parquet` is broken as of pandas 1.5.0 when the index is a `MultiIndex`.

The test relies on `pd.to_parquet`, so it failed with pandas 1.5.0 as shown below (the `MultiIndex` is not respected):

```python
DataFrame shape mismatch
[left]:  (20, 5)
[right]: (20, 4)

Left:
    i32  i64     f  bhello     index
0     0    0   0.0      yo  0.617492
1     1    1   1.0  people  0.823826
2     2    2   2.0  people  0.443275
3     0    3   3.0   hello  0.639776
4     1    4   4.0      yo  0.393410
5     2    0   5.0      yo  0.898860
6     0    1   6.0  people  0.725236
7     1    2   7.0      yo  0.933009
8     2    3   8.0      yo  0.663381
9     0    4   9.0   hello  0.471077
10    1    0  10.0   hello  0.562182
11    2    1  11.0  people  0.734902
12    0    2  12.0      yo  0.956519
13    1    3  13.0   hello  0.860517
14    2    4  14.0  people  0.012749
15    0    0  15.0  people  0.561815
16    1    1  16.0  people  0.389130
17    2    2  17.0   hello  0.930301
18    0    3  18.0   hello  0.835025
19    1    4  19.0      yo  0.212191
i32         int32
i64         int64
f         float64
bhello     object
index     float64
dtype: object

Right:
             i32  i64     f  bhello
   index
0  0.617492    0    0   0.0      yo
1  0.823826    1    1   1.0  people
2  0.443275    2    2   2.0  people
3  0.639776    0    3   3.0   hello
4  0.393410    1    4   4.0      yo
5  0.898860    2    0   5.0      yo
6  0.725236    0    1   6.0  people
7  0.933009    1    2   7.0      yo
8  0.663381    2    3   8.0      yo
9  0.471077    0    4   9.0   hello
10 0.562182    1    0  10.0   hello
11 0.734902    2    1  11.0  people
12 0.956519    0    2  12.0      yo
13 0.860517    1    3  13.0   hello
14 0.012749    2    4  14.0  people
15 0.561815    0    0  15.0  people
16 0.389130    1    1  16.0  people
17 0.930301    2    2  17.0   hello
18 0.835025    0    3  18.0   hello
19 0.212191    1    4  19.0      yo
i32         int32
i64         int64
f         float64
bhello     object
dtype: object
```

### Why are the changes needed?

We should make all tests pass with pandas 1.5.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested with pandas 1.5.0.

Closes #38055 from itholic/SPARK-40590.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

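
The diff below gates the workaround on an exact match with pandas 1.5.0, using `distutils.version.LooseVersion`. As a rough illustration of that gate only, here is a minimal standard-library sketch. The helper name `is_affected_pandas` and the constant `AFFECTED_PANDAS` are hypothetical and not part of this commit; the sketch assumes that only the exact release string `1.5.0` should take the workaround branch, and it avoids `distutils` because that module was removed from the standard library in Python 3.12.

```python
import re

# Assumption for illustration: the only release gated on is 1.5.0,
# matching the equality check in the diff.
AFFECTED_PANDAS = (1, 5, 0)


def is_affected_pandas(version: str) -> bool:
    """Return True only for a plain dotted release equal to 1.5.0.

    Hypothetical helper, not part of the commit. Pre-release or dev
    strings such as "1.5.0rc0" are deliberately treated as not affected.
    """
    if re.fullmatch(r"\d+(?:\.\d+)*", version) is None:
        return False
    parts = tuple(int(piece) for piece in version.split("."))
    return parts == AFFECTED_PANDAS


if __name__ == "__main__":
    for candidate in ("1.4.4", "1.5.0", "1.5.1", "1.5.0rc0"):
        print(candidate, is_affected_pandas(candidate))
```

In a test, such a helper would be called with `pd.__version__` to choose between reusing the already-written parquet file and writing a new one with `to_parquet`, as the diff does.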
diff --git a/python/pyspark/pandas/tests/test_dataframe_spark_io.py b/python/pyspark/pandas/tests/test_dataframe_spark_io.py
@@ -18,6 +18,7 @@
 import unittest
 import glob
 import os
+from distutils.version import LooseVersion
 
 import numpy as np
 import pandas as pd
@@ -96,11 +97,18 @@ def test_parquet_read_with_pandas_metadata(self):
             self.assert_eq(ps.read_parquet(path2, pandas_metadata=True), expected2)
 
             expected3 = expected2.set_index("index", append=True)
+            # There is a bug in `to_parquet` from pandas 1.5.0 when writing MultiIndex.
+            # See https://github.com/pandas-dev/pandas/issues/48848 for the reported issue.
+            if LooseVersion(pd.__version__) == LooseVersion("1.5.0"):
+                expected_psdf = ps.read_parquet(path2, pandas_metadata=True).set_index(
+                    "index", append=True
+                )
+            else:
+                path3 = "{}/file3.parquet".format(tmp)
+                expected3.to_parquet(path3)
+                expected_psdf = ps.read_parquet(path3, pandas_metadata=True)
 
-            path3 = "{}/file3.parquet".format(tmp)
-            expected3.to_parquet(path3)
-
-            self.assert_eq(ps.read_parquet(path3, pandas_metadata=True), expected3)
+            self.assert_eq(expected_psdf, expected3)
 
     def test_parquet_write(self):
         with self.temp_dir() as tmp: