Spark Comparison fails to detect similarities in unordered data #314

shreya-goddu opened this issue Jun 18, 2024 · 2 comments

shreya-goddu commented Jun 18, 2024

I found a few cases where datacompy reports mismatches in Spark dataframe comparisons when the data is not sorted (using v0.11.3).

Cases where it reports mismatches when it shouldn't:

  1. A column has an array data type and the dataframes contain the same values in a different order

DF1
+---+------+
| a| b|
+---+------+
| 1|[1, 2]|
+---+------+
DF2
+---+------+
| a| b|
+---+------+
| 1|[2, 1]|
+---+------+

import datacompy

comparison = datacompy.SparkCompare(spark, df, df2, join_columns=['a'])
comparison.rows_both_mismatch.show()

+---+------+---------+-------+
| a|b_base|b_compare|b_match|
+---+------+---------+-------+
| 1|[1, 2]| [2, 1]| false|
+---+------+---------+-------+

  2. The dataframes have non-unique join_columns values, and the comparison fails to detect similarity when the rows are unordered

DF1
+---+---+
| a| b|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
DF2
+---+---+
| a| b|
+---+---+
| 1| 2|
| 1| 1|
+---+---+

import datacompy

comparison = datacompy.SparkCompare(spark, df, df2, join_columns=['a'])
comparison.rows_both_mismatch.show()

+---+------+---------+-------+
| a|b_base|b_compare|b_match|
+---+------+---------+-------+
| 1| 1| 2| false|
+---+------+---------+-------+

fdosani commented Jun 18, 2024

@shreya-goddu Point 2 is a known issue. It happens because the legacy SparkCompare you are using drops duplicates before it does anything else. This was one of the reasons we wanted to align with the pandas behaviour; the legacy version has been deprecated and replaced by the new Spark comparators (the Pandas-on-Spark API and Spark SQL).

Just to show you what I mean:

import pandas as pd
from datacompy.spark.legacy import LegacySparkCompare  # starting v0.12.0
pdf1 = pd.DataFrame({'a': [1, 1], 'b': [1, 2]})
pdf2 = pd.DataFrame({'a': [1, 1], 'b': [2, 1]})
sdf1 = spark.createDataFrame(pdf1)
sdf2 = spark.createDataFrame(pdf2)

compare = LegacySparkCompare(spark, sdf1, sdf2, join_columns=["a"])
compare.rows_both_mismatch.show()  # gives the same issue you were experiencing
+---+------+---------+-------+
|  a|b_base|b_compare|b_match|
+---+------+---------+-------+
|  1|     1|        2|  false|
+---+------+---------+-------+

# under the hood one of the rows is actually dropped.
compare.base_df.show()
+---+---+
|  a|  b|
+---+---+
|  1|  1|
+---+---+
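
For reference, a minimal sketch of how the newer comparator could be invoked on the same data (this assumes the SparkSQLCompare added in v0.12.0, imported from datacompy.spark.sql and mirroring the pandas Compare API; check the docs for the exact signature and for how it handles duplicate join keys):

import pandas as pd
from datacompy.spark.sql import SparkSQLCompare  # assumed import path, v0.12.0+

pdf1 = pd.DataFrame({'a': [1, 1], 'b': [1, 2]})
pdf2 = pd.DataFrame({'a': [1, 1], 'b': [2, 1]})
sdf1 = spark.createDataFrame(pdf1)
sdf2 = spark.createDataFrame(pdf2)

# unlike the legacy comparator, duplicate join keys are not dropped up front
compare = SparkSQLCompare(spark, sdf1, sdf2, join_columns=["a"])
print(compare.matches())
print(compare.report())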

fdosani commented Jun 18, 2024

For point 1:

For array types, I don’t know if we support those in our compare logic, so it isn’t surprising that it says [1, 2] doesn’t equal [2, 1]. It also depends on your definition of equality: some would say they are equal, others would not.
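
If order-insensitive array equality is what you want, one workaround (a sketch using standard PySpark functions, not behaviour datacompy provides itself) is to normalize the array column before comparing:

import datacompy
from pyspark.sql import functions as F

# sort the array column in both dataframes so [1, 2] and [2, 1]
# become the same value before datacompy sees them
df1_sorted = df.withColumn("b", F.sort_array("b"))
df2_sorted = df2.withColumn("b", F.sort_array("b"))

comparison = datacompy.SparkCompare(spark, df1_sorted, df2_sorted, join_columns=["a"])
comparison.rows_both_mismatch.show()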
