
Mismatch between Summary Report and Actual #157

Closed · guptaat opened this issue Aug 8, 2022 · 9 comments
Labels: bug (Something isn't working) · question (Further information is requested)

Comments

guptaat commented Aug 8, 2022

The summary report shows there are records in the base and compare datasets that did not have corresponding matches:

****** Row Summary ******
Number of rows in common: 2971140
Number of rows in base but not compare: 41536
Number of rows in compare but not base: 41536
Number of duplicate rows found in base: 3721
Number of duplicate rows found in compare: 3721

but when I try to get the records in base and/or compare that do not have matches, it returns 0.

[screenshot: notebook output showing 0 records returned]

Using the latest Spark version to do the comparison.

Any thoughts/suggestions on what might be the issue? I was hoping to see 41536 records from compare.rows_only_compare.
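For reference, a minimal sketch of the kind of comparison being run, assuming the legacy datacompy SparkCompare API; the DataFrame names (base_df, compare_df) are hypothetical:

import datacompy

# Compare the two DataFrames on the six join columns named later in this thread
comparison = datacompy.SparkCompare(
    spark,        # existing SparkSession
    base_df,      # hypothetical base DataFrame
    compare_df,   # hypothetical compare DataFrame
    join_columns=['fld1', 'fld2', 'fld3', 'fld4', 'fld5', 'fld6'],
)

# Prints the Row Summary quoted above
comparison.report()

# Expected ~41536 rows here, but it returns 0
comparison.rows_only_compare.show(n=1, vertical=True, truncate=False)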

fdosani (Member) commented Aug 15, 2022

Can you provide some more details on your initialization of the SparkCompare, and also the full commands you are running in the notebook? The image is cut off in cells 18 and 19.

The code being run is here. It would be nice to know your join_columns, and some sampling of the data, to get a sense of why this might be happening.

@fdosani fdosani added bug Something isn't working question Further information is requested labels Aug 15, 2022
@fdosani fdosani self-assigned this Aug 15, 2022
guptaat (Author) commented Aug 16, 2022

Thank you for the response and pointer.

Environment variables and initialization for Spark:
%env PYSPARK_PYTHON=python
%env PYSPARK_DRIVER_PYTHON_OPTS=notebook
%env PYSPARK_DRIVER_PYTHON=jupyter
%env JAVA_HOME=C:\Program Files\Eclipse Adoptium\jdk-11.0.14.9-hotspot

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.window import Window
from pyspark.sql.functions import when, expr, row_number, rank, col, size, split, concat_ws, upper, to_timestamp

# Local Spark session with 40g of driver memory for the comparison
spark = SparkSession.builder \
    .master('local[*]') \
    .appName('datacompy') \
    .config("spark.driver.memory", "40g") \
    .getOrCreate()

Regarding the image being cut off in cells 18 & 19: it's nothing except the option to print vertically with truncation turned off:
compare.rows_only_compare.show(n=1, vertical=True, truncate=False)

The datasets being compared are joined on 6 different elements/columns, using:
join_columns=['fld1','fld2','fld3','fld4','fld5','fld6']

Datacompy was throwing errors while comparing. Those errors went away after I substituted any null values with a default value for fld4 and fld6. fld2 through fld5 are strings and fld6 is a datetime.

But there were discrepancies in the report, as outlined in the original post. Going through the code and manually joining and verifying the counts against datacompy, I noticed fld5 also had nulls.

So technically there were no records without a corresponding entry between the two datasets; the output/summary report was just off.

After filling the nulls with a default value for fld5 and rerunning datacompy, I get the correct Row Summary:

[screenshot: corrected Row Summary output]

SUMMARY

Ensure there are no nulls in the data elements that are used for joins, as datacompy may either throw errors or report wrong summary counts.
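A minimal sketch of that workaround, with hypothetical column names and sentinel values (fillna covers the string join columns, but a timestamp column such as fld6 needs coalesce with a sentinel instead):

from pyspark.sql.functions import coalesce, lit, to_timestamp

def fill_join_nulls(df):
    # fillna handles the string join columns
    df = df.fillna({'fld4': '<NULL>', 'fld5': '<NULL>'})
    # fillna does not cover timestamps, so coalesce fld6 with a sentinel value
    return df.withColumn(
        'fld6', coalesce('fld6', to_timestamp(lit('1900-01-01 00:00:00')))
    )

base_df = fill_join_nulls(base_df)
compare_df = fill_join_nulls(compare_df)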

Unfortunately I cannot share the data I am working with, but as time permits I will try to put together a sample dataset that reproduces the issue, and even look at how to handle nulls within the join to alleviate it. In the meantime, I figured I'd share what I found; hopefully you can get to a fix faster than I can.

Thank you

@guptaat guptaat closed this as completed Aug 16, 2022
@guptaat guptaat reopened this Aug 16, 2022
fdosani (Member) commented Aug 16, 2022

Maybe this is related to #147 and the null-safe compares. It might be worth testing the branch I have here to see whether it solves the issue: spark-compare-refactor
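For anyone following along, a toy example of the difference (hypothetical data): a plain equality join drops rows whose join key is null, while eqNullSafe treats null as equal to null, which is exactly the pattern behind the mismatched row counts above:

left = spark.createDataFrame([('a',), (None,)], ['k'])
right = spark.createDataFrame([('a',), (None,)], ['k'])

# Plain equality: null never equals null, so the null-keyed rows never match
left.join(right, left['k'] == right['k'], 'inner').count()            # -> 1

# Null-safe equality: null matches null, so both rows match
left.join(right, left['k'].eqNullSafe(right['k']), 'inner').count()   # -> 2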

fdosani (Member) commented Aug 29, 2022

@guptaat just wanted to follow up on whether you had a chance to test out my suggestion.

guptaat (Author) commented Aug 29, 2022

@fdosani apologies, I got sidetracked by other projects in between. Hopefully I will pick this back up this week and report back ASAP. Thank you.

guptaat (Author) commented Aug 29, 2022

@fdosani I installed the spark-compare-refactor branch using:
pip install --upgrade git+https://github.com/capitalone/datacompy@spark-compare-refactor

It did not seem to solve the problem, as I got the results below:

[screenshot: results after installing the spark-compare-refactor branch]

I looked at the code again for sparkcompare.py and didn't notice it being different from the version in the develop branch,

especially the join condition... where code still seems to be using '='

[screenshot: join condition in sparkcompare.py using '=']

Have the changes been committed, or am I missing something basic?

fdosani (Member) commented Sep 1, 2022

@guptaat Sorry for the delay. In that branch there are two Spark compares: one is the old way and the other is the refactor. Can you double-check you are using from datacompy.spark_core import SparkCompare to import and use the new logic?

The new logic is using eqNullSafe shown here:

(col(col_1).eqNullSafe(col(col_2)))
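For illustration, a null-safe condition over several join columns can be assembled like this; a sketch with hypothetical DataFrame names, not the exact datacompy internals:

from functools import reduce

join_columns = ['fld1', 'fld2', 'fld3', 'fld4', 'fld5', 'fld6']

# AND together one eqNullSafe comparison per join column
condition = reduce(
    lambda acc, c: acc & c,
    [base_df[col_name].eqNullSafe(compare_df[col_name]) for col_name in join_columns],
)
joined = base_df.join(compare_df, condition, 'full_outer')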

guptaat (Author) commented Sep 13, 2022

@fdosani

Yes, spark_core.py does have eqNullSafe (I was previously looking at sparkcompare.py).

[screenshot: eqNullSafe usage in the refactored compare]

I explicitly added the import statement as you suggested:

[screenshot: the added import statement]

Got the following results:

[screenshot: comparison results with the new import]

but if I specifically replace the nulls in the datasets, I get the following results, which is what I was expecting:

[screenshot: expected results after replacing nulls]

so the issue still remains. I will continue to tinker and keep you posted on any progress.

Thanks

fdosani (Member) commented Jun 22, 2023

Feel free to reopen if this is still an issue. Closing for now.

@fdosani fdosani closed this as completed Jun 22, 2023