Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-27387][PYTHON][TESTS] Replace sqlutils.assertPandasEqual with Pandas assert_frame_equals #24306

Conversation

BryanCutler
Copy link
Member

@BryanCutler BryanCutler commented Apr 5, 2019

What changes were proposed in this pull request?

Running PySpark tests with Pandas 0.24.x causes a failure in test_pandas_udf_grouped_map test_supported_types:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

This is because a column is an ArrayType and the method sqlutils ReusedSQLTestCase.assertPandasEqual does not properly check this.

This PR removes assertPandasEqual and replaces it with the built-in pandas.util.testing.assert_frame_equal which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs.

Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times.

How was this patch tested?

Existing tests

@BryanCutler
Copy link
Member Author

I went ahead and did this before #24298 just in case we end up testing with a newer version of Pandas that also causes the same error as above. cc @HyukjinKwon @ueshin

"""
import sys
if sys.version < '3':
pd_assert_frame_equal(left, right, check_column_type=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the only surprise, and it's because calling DataFrame.apply with kwargs infers the column names as str for Python 2, and causes DataFrame.columns.inferred_type to be 'mixed' and the assert to fail

Copy link
Member

@viirya viirya Apr 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we let all tests using this wrapped version? I think it's better if we use a consistent version of assert_frame_equal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, the problem only comes up in these tests because of the way they call assign. How about I remove this function wrapping and just make the option conditional for Python version?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Sounds good. Thanks.

@SparkQA
Copy link

SparkQA commented Apr 5, 2019

Test build #104333 has finished for PR 24306 at commit 26eaa1a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Apr 5, 2019

Test build #104336 has finished for PR 24306 at commit 26eaa1a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good pending tests.

@HyukjinKwon
Copy link
Member

Seems the test failure is related with the Python version upgrade ..

@dongjoon-hyun
Copy link
Member

Yes. It does.

@dongjoon-hyun
Copy link
Member

Retest this please

@SparkQA
Copy link

SparkQA commented Apr 6, 2019

Test build #104346 has finished for PR 24306 at commit 26eaa1a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2019

Test build #104403 has finished for PR 24306 at commit 1361382.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler force-pushed the python-pandas-assert_frame_equal-SPARK-27387 branch from 1361382 to 7cdcb9f Compare April 8, 2019 20:23
@SparkQA
Copy link

SparkQA commented Apr 8, 2019

Test build #104404 has finished for PR 24306 at commit 7cdcb9f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Apr 8, 2019

Test build #104405 has finished for PR 24306 at commit 7cdcb9f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2019

Test build #104406 has finished for PR 24306 at commit 1c5452a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2019

Test build #104407 has finished for PR 24306 at commit 224063c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler BryanCutler force-pushed the python-pandas-assert_frame_equal-SPARK-27387 branch from 224063c to 26eaa1a Compare April 8, 2019 22:59
@SparkQA
Copy link

SparkQA commented Apr 8, 2019

Test build #104411 has finished for PR 24306 at commit 26eaa1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import pyarrow as pa


def assert_frame_equal(left, right):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more for my info, but what happens here if you dont' have pandas? how can this method work?
If you do have pandas elsewhere, do the other imports shadow this definition?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would fail if you don't have pandas with an error that it doesn't know what pd_assert_frame_equal is, since it's only imported above, conditional on pandas being imported already. This is only seen by this file and wouldn't change anything elsewhere.

I'm going to remove this though, and only define check_column_type here. That should make it clearer I think.

@SparkQA
Copy link

SparkQA commented Apr 9, 2019

Test build #104438 has finished for PR 24306 at commit 8dd7202.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Apr 9, 2019

Test build #104445 has finished for PR 24306 at commit 8dd7202.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member Author

BryanCutler commented Apr 9, 2019 via email

@BryanCutler
Copy link
Member Author

retest this please

@SparkQA
Copy link

SparkQA commented Apr 9, 2019

Test build #104454 has finished for PR 24306 at commit 8dd7202.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

rshkv pushed a commit to rshkv/spark that referenced this pull request Feb 27, 2020
…Pandas assert_frame_equals

## What changes were proposed in this pull request?

Running PySpark tests with Pandas 0.24.x causes a failure in `test_pandas_udf_grouped_map` test_supported_types:
`ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()`

This is because a column is an ArrayType and the method `sqlutils ReusedSQLTestCase.assertPandasEqual ` does not properly check this.

This PR removes `assertPandasEqual` and replaces it with the built-in `pandas.util.testing.assert_frame_equal` which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs.

Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times.

## How was this patch tested?

Existing tests

Closes apache#24306 from BryanCutler/python-pandas-assert_frame_equal-SPARK-27387.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
rshkv pushed a commit to palantir/spark that referenced this pull request Apr 8, 2020
…Pandas assert_frame_equals

## What changes were proposed in this pull request?

Running PySpark tests with Pandas 0.24.x causes a failure in `test_pandas_udf_grouped_map` test_supported_types:
`ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()`

This is because a column is an ArrayType and the method `sqlutils ReusedSQLTestCase.assertPandasEqual ` does not properly check this.

This PR removes `assertPandasEqual` and replaces it with the built-in `pandas.util.testing.assert_frame_equal` which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs.

Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times.

## How was this patch tested?

Existing tests

Closes apache#24306 from BryanCutler/python-pandas-assert_frame_equal-SPARK-27387.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
rshkv pushed a commit to palantir/spark that referenced this pull request May 1, 2020
…Pandas assert_frame_equals

## What changes were proposed in this pull request?

Running PySpark tests with Pandas 0.24.x causes a failure in `test_pandas_udf_grouped_map` test_supported_types:
`ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()`

This is because a column is an ArrayType and the method `sqlutils ReusedSQLTestCase.assertPandasEqual ` does not properly check this.

This PR removes `assertPandasEqual` and replaces it with the built-in `pandas.util.testing.assert_frame_equal` which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs.

Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times.

## How was this patch tested?

Existing tests

Closes apache#24306 from BryanCutler/python-pandas-assert_frame_equal-SPARK-27387.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
rshkv pushed a commit to palantir/spark that referenced this pull request May 1, 2020
…Pandas assert_frame_equals

## What changes were proposed in this pull request?

Running PySpark tests with Pandas 0.24.x causes a failure in `test_pandas_udf_grouped_map` test_supported_types:
`ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()`

This is because a column is an ArrayType and the method `sqlutils ReusedSQLTestCase.assertPandasEqual ` does not properly check this.

This PR removes `assertPandasEqual` and replaces it with the built-in `pandas.util.testing.assert_frame_equal` which can properly handle columns of ArrayType and also prints out better diff between the DataFrames when an error occurs.

Additionally, imports of pandas and pyarrow were moved to the top of related test files to avoid duplicating the same import many times.

## How was this patch tested?

Existing tests

Closes apache#24306 from BryanCutler/python-pandas-assert_frame_equal-SPARK-27387.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants