
Mismatch between Summary Report and Actual #157

Closed · guptaat opened this issue Aug 8, 2022 · 9 comments
Labels: bug (Something isn't working) · question (Further information is requested)

Comments

guptaat commented Aug 8, 2022

The summary report shows there are records in the base and compare datasets that did not have corresponding matches:

****** Row Summary ******
Number of rows in common: 2971140
Number of rows in base but not compare: 41536
Number of rows in compare but not base: 41536
Number of duplicate rows found in base: 3721
Number of duplicate rows found in compare: 3721

but when I try to get the records in base and/or compare that do not have matches, it returns 0.

[screenshot: notebook output showing 0 records returned]

Using the latest Spark version to do the comparison.

Any thoughts/suggestions on what might be the issue? I was hoping to see 41536 records from compare.rows_only_compare.
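For reference, a minimal sketch of the kind of comparison being run, assuming the legacy datacompy SparkCompare API; the DataFrame names (base_df, compare_df) are hypothetical:

import datacompy

# Compare the two DataFrames on the six join columns named later in this thread
comparison = datacompy.SparkCompare(
    spark,        # existing SparkSession
    base_df,      # hypothetical base DataFrame
    compare_df,   # hypothetical compare DataFrame
    join_columns=['fld1', 'fld2', 'fld3', 'fld4', 'fld5', 'fld6'],
)

# Prints the Row Summary quoted above
comparison.report()

# Expected ~41536 rows here, but it returns 0
comparison.rows_only_compare.show(n=1, vertical=True, truncate=False)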

fdosani (Member) commented Aug 15, 2022

Can you provide some more details on your initialization of the SparkCompare, and also the full commands you are running in the notebook? The image is cut off in cells 18 and 19.

The code being run is here. It would be nice to know your join_columns, and some sampling of the data, to get a sense of why this might be happening.

@fdosani fdosani added bug Something isn't working question Further information is requested labels Aug 15, 2022
@fdosani fdosani self-assigned this Aug 15, 2022
guptaat (Author) commented Aug 16, 2022

Thank you for the response and pointer.

Environment variables and initialization for Spark:
%env PYSPARK_PYTHON=python
%env PYSPARK_DRIVER_PYTHON_OPTS=notebook
%env PYSPARK_DRIVER_PYTHON=jupyter
%env JAVA_HOME=C:\Program Files\Eclipse Adoptium\jdk-11.0.14.9-hotspot

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.window import Window
from pyspark.sql.functions import when, expr, row_number, rank, col, size, split, concat_ws, upper, to_timestamp

# Local Spark session with 40g of driver memory for the comparison
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('datacompy') \
    .config("spark.driver.memory", "40g") \
    .getOrCreate()

Regarding the image being cut off in cells 18 & 19: it's nothing except the option to print vertically with truncation turned off:
compare.rows_only_compare.show(n=1, vertical=True, truncate=False)

The datasets being compared are joined on 6 different elements/columns, using:
join_columns=['fld1','fld2','fld3','fld4','fld5','fld6']

Datacompy was throwing errors while comparing. Those errors went away after I substituted any null values with a default value for fld4 and fld6. fld2 through fld5 are strings and fld6 is a datetime.

But there were discrepancies in the report, as outlined in the original post. Going through the code and manually joining and verifying the counts against datacompy, I noticed fld5 also had nulls.

So technically there were no records without a corresponding entry between the two datasets; the output/summary report was just off.

After filling the nulls with a default value for fld5 and rerunning datacompy, I get the correct Row Summary:

[screenshot: corrected Row Summary output]

SUMMARY

Ensure there are no nulls in the data elements that are used for joins, as datacompy may either throw errors or report wrong summary counts.
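A minimal sketch of that workaround, with hypothetical column names and sentinel values (fillna covers the string join columns, but a timestamp column such as fld6 needs coalesce with a sentinel instead):

from pyspark.sql.functions import coalesce, lit, to_timestamp

def fill_join_nulls(df):
    # fillna handles the string join columns
    df = df.fillna({'fld4': '<NULL>', 'fld5': '<NULL>'})
    # fillna does not cover timestamps, so coalesce fld6 with a sentinel value
    return df.withColumn(
        'fld6', coalesce('fld6', to_timestamp(lit('1900-01-01 00:00:00')))
    )

base_df = fill_join_nulls(base_df)
compare_df = fill_join_nulls(compare_df)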

Unfortunately I cannot share the data I am working with, but as time permits I will try to put together a sample dataset that reproduces the issue, and even look at how to handle nulls within the join to alleviate it. In the meantime, I figured I'd share what I found; hopefully you can get to a fix faster than I can.

Thank you

@guptaat guptaat closed this as completed Aug 16, 2022
@guptaat guptaat reopened this Aug 16, 2022
fdosani (Member) commented Aug 16, 2022

Maybe this is related to #147 and the null-safe compares. It might be worth testing the branch I have here to see whether it solves the issue: spark-compare-refactor
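For anyone following along, a toy example of the difference (hypothetical data): a plain equality join drops rows whose join key is null, while eqNullSafe treats null as equal to null, which is exactly the pattern behind the mismatched row counts above:

left = spark.createDataFrame([('a',), (None,)], ['k'])
right = spark.createDataFrame([('a',), (None,)], ['k'])

# Plain equality: null never equals null, so the null-keyed rows never match
left.join(right, left['k'] == right['k'], 'inner').count()            # -> 1

# Null-safe equality: null matches null, so both rows match
left.join(right, left['k'].eqNullSafe(right['k']), 'inner').count()   # -> 2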

fdosani (Member) commented Aug 29, 2022

@guptaat just wanted to follow up on whether you had a chance to test out my suggestion.

guptaat (Author) commented Aug 29, 2022

@fdosani apologies, I got sidetracked by other projects in between. Hopefully I will pick this back up this week and report back ASAP. Thank you.

guptaat (Author) commented Aug 29, 2022

@fdosani I installed the spark-compare-refactor branch using:
pip install --upgrade git+https://github.com/capitalone/datacompy@spark-compare-refactor

It did not seem to solve the problem, as I got the results below:

[screenshot: results after installing the spark-compare-refactor branch]

I looked at the code again for sparkcompare.py and didn't notice it being different from the version in the develop branch,

especially the join condition... where code still seems to be using '='

[screenshot: join condition in sparkcompare.py using '=']

Have the changes been committed, or am I missing something basic?

fdosani (Member) commented Sep 1, 2022

@guptaat Sorry for the delay. In that branch there are two Spark compares: one is the old way and the other is the refactor. Can you double-check you are using from datacompy.spark_core import SparkCompare to import and use the new logic?

The new logic is using eqNullSafe shown here:

(col(col_1).eqNullSafe(col(col_2)))
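For illustration, a null-safe condition over several join columns can be assembled like this; a sketch with hypothetical DataFrame names, not the exact datacompy internals:

from functools import reduce

join_columns = ['fld1', 'fld2', 'fld3', 'fld4', 'fld5', 'fld6']

# AND together one eqNullSafe comparison per join column
condition = reduce(
    lambda acc, c: acc & c,
    [base_df[col_name].eqNullSafe(compare_df[col_name]) for col_name in join_columns],
)
joined = base_df.join(compare_df, condition, 'full_outer')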

guptaat (Author) commented Sep 13, 2022

@fdosani

Yes, spark_core.py does have eqNullSafe (I was previously looking at sparkcompare.py).

[screenshot: eqNullSafe usage in the refactored compare]

I explicitly added the import statement as you suggested:

[screenshot: the added import statement]

Got the following results:

[screenshot: comparison results with the new import]

but if I specifically replace the nulls in the datasets, I get the following results, which is what I was expecting:

[screenshot: expected results after replacing nulls]

so the issue still remains. I will continue to tinker and keep you posted on any progress.

Thanks

fdosani (Member) commented Jun 22, 2023

Feel free to reopen if this is still an issue. Closing for now.

@fdosani fdosani closed this as completed Jun 22, 2023