Mismatch between Summary Report and Actual #157
Can you provide some more details on your initialization of the compare? The code being run here is this. It would be nice to know your environment variables and Spark initialization.
Thank you for the response and pointer. Here are the environment variables and initialization for Spark:
Regarding the image being cut off in cells 18 & 19: it's nothing except the option to print vertically with truncation turned off.

The datasets being compared are joined on 6 different elements/columns. datacompy was throwing errors while comparing; those errors went away after I substituted any null values with a default value for fld4 and fld6. fld2 through fld5 are strings and fld6 is a datetime. But there were still discrepancies in the report, as outlined in the original post.

Going through the code and manually joining and verifying the counts against datacompy, I noticed fld5 also had nulls. So technically there were no records without a corresponding entry between the two datasets; just the output/summary report was off. After filling nulls with a default value for fld5 and rerunning datacompy, I get the correct Row Summary.

SUMMARY: Ensure there are no nulls in the data elements that are used for joins, as datacompy may either throw errors or report wrong summary counts.

Unfortunately I cannot share the data I am working with, but as time permits I will try to put together a sample dataset that reproduces the issue, and also see how nulls could be handled within the join to alleviate it. In the meantime I figured I would share what I found, and hopefully you can get a fix in faster than I can. Thank you
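The effect of nulls in join keys can be demonstrated without Spark at all, using Python's built-in sqlite3 as a stand-in (the table and column names below are made up for illustration). A plain `=` join never matches a NULL key, while a null-safe comparison (SQLite's `IS`, analogous to Spark's `eqNullSafe` / SQL's `<=>`) does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE base (k TEXT, v INTEGER)")
cur.execute("CREATE TABLE comp (k TEXT, v INTEGER)")
cur.executemany("INSERT INTO base VALUES (?, ?)", [("a", 1), (None, 2)])
cur.executemany("INSERT INTO comp VALUES (?, ?)", [("a", 1), (None, 2)])

# Plain equality: NULL = NULL is not true, so the null-keyed row never matches.
eq_matches = cur.execute(
    "SELECT COUNT(*) FROM base JOIN comp ON base.k = comp.k").fetchone()[0]

# Null-safe equality: NULL keys match each other.
safe_matches = cur.execute(
    "SELECT COUNT(*) FROM base JOIN comp ON base.k IS comp.k").fetchone()[0]

print(eq_matches, safe_matches)  # 1 2
```

This is why the joined counts come out one row short per null-keyed record under `=`, even though every record has a logical match.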
Maybe this is related to #147 and the null-safe compares. It might be worth testing the branch I have here to see if it solves the issue or not:
@guptaat just wanted to follow up on this, in case you had a chance to test out my suggestion.
@fdosani apologies, I got sidetracked in between due to other projects. Hopefully I will pick it back up this week and report back ASAP. Thank you
@fdosani I installed the spark compare refactor branch. It did not seem to solve the problem, as I got the results below. I looked at the code again for sparkcompare.py and didn't notice it being different from the version on the develop branch, especially the join condition, where the code still seems to be using '='. Have the changes been committed, or am I missing something basic?
@guptaat Sorry for the delay. In that branch there are 2 spark compares: one is the old way and the other is the refactor. Can you double check which one you are using? The new logic uses a null-safe comparison; see datacompy/spark_core.py, line 846 in 59c83da.
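In plain Python terms, the null-safe equality that Spark's `Column.eqNullSafe` (SQL's `<=>` operator) implements behaves like the sketch below; the helper name is made up for illustration:

```python
def eq_null_safe(a, b):
    """Null-safe equality with the same semantics as Spark's
    Column.eqNullSafe / SQL's <=>: two nulls compare equal,
    instead of the comparison being unknown (and thus false
    for join purposes)."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

print(eq_null_safe(None, None))  # True
print(eq_null_safe(None, "x"))   # False
print(eq_null_safe("x", "x"))    # True
```

Under a plain `=` join, the `(None, None)` case is what silently fails and throws the summary counts off.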
Yes, spark_core.py does have eqNullSafe (I was previously looking at sparkcompare.py). I explicitly added the import statement as you suggested, but only if I specifically replace nulls in the datasets do I get the following results, which is what I was expecting. So the issue still remains. I will continue to tinker and keep you posted on any progress. Thanks
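The workaround that made the report come out right in this thread — substituting defaults for nulls in the join columns before comparing — can be sketched in plain Python. The helper name and sentinel values below are arbitrary choices (any value that cannot occur in the real data would do); fld4/fld5/fld6 are the join columns mentioned above:

```python
from datetime import datetime

# Arbitrary sentinel defaults for the string and datetime join columns.
DEFAULTS = {"fld4": "<missing>", "fld5": "<missing>",
            "fld6": datetime(1900, 1, 1)}

def fill_join_key_nulls(rows, defaults=DEFAULTS):
    """Replace None in the join-key columns so a plain '=' join
    no longer silently drops null-keyed rows."""
    return [
        {col: defaults[col] if val is None and col in defaults else val
         for col, val in row.items()}
        for row in rows
    ]

rows = [{"fld4": None, "fld5": "a", "fld6": None}]
filled = fill_join_key_nulls(rows)
print(filled[0]["fld4"])  # <missing>
```

In Spark the same idea is typically a `fillna` on the join columns before the compare; the sketch above just makes the substitution explicit.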
Feel free to reopen if this is still an issue. Closing for now. |
The summary report shows there are records in the base and compare datasets that did not have corresponding matches,
but when I try to get the records in base and/or compare that do not have matches, it returns 0.
Using the latest Spark version to do the comparison.
Any thoughts/suggestions on what might be the issue? I was hoping to see 41536 records for compare.rows_only_compare.
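One way to cross-check the summary counts by hand is to compute the anti-join sets directly. A minimal sketch, assuming the join keys can be collected as tuples (the helper name is made up); note that Python's set difference treats a None key as matching another None key, unlike a plain SQL `=` join, which is exactly the discrepancy this issue is about:

```python
def only_in_each(base_keys, compare_keys):
    """Rows whose join key appears on only one side (a manual anti-join)."""
    base, comp = set(base_keys), set(compare_keys)
    return base - comp, comp - base

base_keys = [("a", 1), ("b", None)]
compare_keys = [("a", 1), ("c", 2), ("b", None)]
only_base, only_compare = only_in_each(base_keys, compare_keys)
print(len(only_base), len(only_compare))  # 0 1
```

If this manual check and the summary report disagree on the null-keyed rows, the report's join is not null-safe.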