@shreya-goddu So this is a known issue (point 2), mainly because the legacy SparkCompare you are using drops duplicates before it does anything. This was one of the reasons we wanted to align with pandas: the legacy version has been deprecated in favour of the new Spark versions (the Pandas on Spark API and also Spark SQL).
Just to show you what I mean:
```python
import pandas as pd
from datacompy.spark.legacy import LegacySparkCompare  # starting v0.12.0

pdf1 = pd.DataFrame({'a': [1, 1], 'b': [1, 2]})
pdf2 = pd.DataFrame({'a': [1, 1], 'b': [2, 1]})
sdf1 = spark.createDataFrame(pdf1)
sdf2 = spark.createDataFrame(pdf2)

compare = LegacySparkCompare(spark, sdf1, sdf2, join_columns=["a"])
compare.rows_both_mismatch.show()  # gives the same issue you were experiencing
# +---+------+---------+-------+
# |  a|b_base|b_compare|b_match|
# +---+------+---------+-------+
# |  1|     1|        2|  false|
# +---+------+---------+-------+

# under the hood one of the rows is actually dropped
compare.base_df.show()
# +---+---+
# |  a|  b|
# +---+---+
# |  1|  1|
# +---+---+
```
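To make the duplicate handling concrete, here is a minimal pure-Python sketch (hypothetical helpers, not datacompy's actual implementation) of the difference between dropping duplicate join keys, as the legacy compare does, and numbering repeated keys so duplicates can be aligned positionally, which is roughly the pandas-style behaviour:

```python
from collections import defaultdict

def drop_duplicate_keys(rows, key):
    """Legacy-style behaviour: keep only the first row seen per join-key value."""
    seen, out = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            out.append(row)
    return out

def add_occurrence_index(rows, key):
    """Pandas-style alignment: number repeated join-key values so duplicates
    can be matched positionally instead of silently dropped."""
    counts = defaultdict(int)
    out = []
    for row in rows:
        out.append({**row, "_occurrence": counts[row[key]]})
        counts[row[key]] += 1
    return out

base = [{"a": 1, "b": 1}, {"a": 1, "b": 2}]
print(drop_duplicate_keys(base, "a"))   # only the first a=1 row survives
print(add_occurrence_index(base, "a"))  # both rows kept, numbered 0 and 1
```

With the occurrence index included in the join key, the two `a=1` rows on each side pair up one-to-one instead of one of them disappearing.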
For array types, I don't know if we support those in our compare logic, so it isn't surprising it says [1, 2] doesn't equal [2, 1]. This also depends on your definition of equality: someone might say they are equal; others might not.
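On the definition question, here is a tiny pure-Python sketch (hypothetical helpers, not datacompy APIs) of the two common notions of equality for array-valued cells:

```python
def arrays_equal_ordered(x, y):
    """Element-wise, order-sensitive equality: [1, 2] != [2, 1]."""
    return list(x) == list(y)

def arrays_equal_as_multisets(x, y):
    """Order-insensitive equality: [1, 2] == [2, 1], but duplicates still
    count, so [1, 1, 2] != [1, 2]."""
    return sorted(x) == sorted(y)

print(arrays_equal_ordered([1, 2], [2, 1]))       # False
print(arrays_equal_as_multisets([1, 2], [2, 1]))  # True
```

A compare tool has to pick one of these (or make it configurable), which is why "do [1, 2] and [2, 1] match?" has no single right answer.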
Found a few cases where datacompy returns mismatches with Spark dataframe comparisons when the data is not sorted (using v0.11.3)
Cases where it reports mismatches when it shouldn't:
DF1

```
+---+------+
|  a|     b|
+---+------+
|  1|[1, 2]|
+---+------+
```

DF2

```
+---+------+
|  a|     b|
+---+------+
|  1|[2, 1]|
+---+------+
```
```python
import datacompy

comparison = datacompy.SparkCompare(spark, df, df2, join_columns=['a'])
comparison.rows_both_mismatch.show()
```

```
+---+------+---------+-------+
|  a|b_base|b_compare|b_match|
+---+------+---------+-------+
|  1|[1, 2]|   [2, 1]|  false|
+---+------+---------+-------+
```
DF1

```
+---+---+
|  a|  b|
+---+---+
|  1|  1|
|  1|  2|
+---+---+
```

DF2

```
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  1|  1|
+---+---+
```
```python
import datacompy

comparison = datacompy.SparkCompare(spark, df, df2, join_columns=['a'])
comparison.rows_both_mismatch.show()
```

```
+---+------+---------+-------+
|  a|b_base|b_compare|b_match|
+---+------+---------+-------+
|  1|     1|        2|  false|
+---+------+---------+-------+
```