Spark Comparison fails to detect similarities in unordered data #314

shreya-goddu opened this issue Jun 18, 2024 · 2 comments

shreya-goddu commented Jun 18, 2024

I found a few cases where datacompy reports mismatches in Spark dataframe comparisons when the data is not sorted (using v0.11.3).

Cases where it reports mismatches when it shouldn't:

  1. A column has an array data type and the dataframes contain the same values in a different order

DF1
+---+------+
| a| b|
+---+------+
| 1|[1, 2]|
+---+------+
DF2
+---+------+
| a| b|
+---+------+
| 1|[2, 1]|
+---+------+

import datacompy

comparison = datacompy.SparkCompare(spark, df, df2, join_columns=['a'])
comparison.rows_both_mismatch.show()

+---+------+---------+-------+
| a|b_base|b_compare|b_match|
+---+------+---------+-------+
| 1|[1, 2]| [2, 1]| false|
+---+------+---------+-------+

  2. The dataframes have non-unique join_columns values, and the comparison fails to detect similarity when the rows are unordered

DF1
+---+---+
| a| b|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
DF2
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 1|
+---+---+

import datacompy

comparison = datacompy.SparkCompare(spark, df, df2, join_columns=['a'])
comparison.rows_both_mismatch.show()

+---+------+---------+-------+
| a|b_base|b_compare|b_match|
+---+------+---------+-------+
| 1| 1| 2| false|
+---+------+---------+-------+

fdosani commented Jun 18, 2024

@shreya-goddu Point 2 is a known issue. It happens because the legacy SparkCompare you are using drops duplicates before it does anything else. This was one of the reasons we wanted to align with the pandas behaviour; the legacy version has been deprecated and replaced by the new Spark comparators (the Pandas-on-Spark API and Spark SQL).

Just to show you what I mean:

import pandas as pd
from datacompy.spark.legacy import LegacySparkCompare  # starting v0.12.0
pdf1 = pd.DataFrame({'a': [1, 1], 'b': [1, 2]})
pdf2 = pd.DataFrame({'a': [1, 1], 'b': [2, 1]})
sdf1 = spark.createDataFrame(pdf1)
sdf2 = spark.createDataFrame(pdf2)

compare = LegacySparkCompare(spark, sdf1, sdf2, join_columns=["a"])
compare.rows_both_mismatch.show()  # gives the same issue you were experiencing
+---+------+---------+-------+
|  a|b_base|b_compare|b_match|
+---+------+---------+-------+
|  1|     1|        2|  false|
+---+------+---------+-------+

# under the hood one of the rows is actually dropped.
compare.base_df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  1|
+---+---+
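
For reference, a minimal sketch of how the newer comparator could be invoked on the same data (this assumes the SparkSQLCompare added in v0.12.0, imported from datacompy.spark.sql and mirroring the pandas Compare API; check the docs for the exact signature and for how it handles duplicate join keys):

import pandas as pd
from datacompy.spark.sql import SparkSQLCompare  # assumed import path, v0.12.0+

pdf1 = pd.DataFrame({'a': [1, 1], 'b': [1, 2]})
pdf2 = pd.DataFrame({'a': [1, 1], 'b': [2, 1]})
sdf1 = spark.createDataFrame(pdf1)
sdf2 = spark.createDataFrame(pdf2)

# unlike the legacy comparator, duplicate join keys are not dropped up front
compare = SparkSQLCompare(spark, sdf1, sdf2, join_columns=["a"])
print(compare.matches())
print(compare.report())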

fdosani commented Jun 18, 2024

For point 1:

For array types, I don’t know if we support those in our compare logic, so it isn’t surprising that it says [1, 2] doesn’t equal [2, 1]. It also depends on your definition of equality: some would say they are equal, others would not.
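
If order-insensitive array equality is what you want, one workaround (a sketch using standard PySpark functions, not behaviour datacompy provides itself) is to normalize the array column before comparing:

import datacompy
from pyspark.sql import functions as F

# sort the array column in both dataframes so [1, 2] and [2, 1]
# become the same value before datacompy sees them
df1_sorted = df.withColumn("b", F.sort_array("b"))
df2_sorted = df2.withColumn("b", F.sort_array("b"))

comparison = datacompy.SparkCompare(spark, df1_sorted, df2_sorted, join_columns=["a"])
comparison.rows_both_mismatch.show()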
