Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use OrderedSet to maintain DataFrame column order through set operations #108

Merged
merged 1 commit into from Jun 29, 2021

Conversation

gandhis1
Copy link
Contributor

@gandhis1 gandhis1 commented Apr 9, 2021

@fdosani @KrishanBhasin

Fixes #81. Iteration on #107. This aims to maintain the original order when printing out the set of common columns as well as unique columns, both which are based on set operations.

…on set operations

This aims to maintain the original order when printing out the set of common columns as well as unique columns, both which are based on set operations.
@CLAassistant
Copy link

CLAassistant commented Apr 9, 2021

CLA assistant check
All committers have signed the CLA.

compare = datacompy.Compare(df1, df2, ["join"])
assert list(compare.df1_unq_columns()) == ["f", "c"]
assert list(compare.df2_unq_columns()) == ["e", "d"]
assert list(compare.intersect_columns()) == ["join", "g", "b", "h", "a"]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example is intended to try to test edge cases (make sure the intersection is "stable" based on the left side), but doesn't necessarily reflect a realistic case. A realistic case is comparing two DataFrames where the columns are virtually identical with minimal differences, and it is in this context that maintaining the original DataFrame ordering actually becomes useful.

Copy link
Member

@fdosani fdosani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @gandhis1. @KrishanBhasin previous PR #107 partially fixes this correct. This is just a variation on the implementation from what I am seeing? In principal I'm fine with the changes, its more adding another dependency which I'd like to discuss. The library was last touched in June 2020. @jborchma @elzzhu thoughts?

@fdosani
Copy link
Member

fdosani commented Apr 13, 2021

@gandhis1 Would you be open to implementing this without an external dependency? Maybe a collections.OrderedDict which just uses the keys. Talking about this with the team and I agree the order would be helpful but I'm a bit hesitant with introducing a new dependency like ordered-set

@gandhis1
Copy link
Contributor Author

gandhis1 commented Apr 13, 2021

So we can copy in the source code here, which removes the PyPi package dependency. The library is under MIT license which as I understand means we just need a copyright and license notice included in the source file.

The challenge with using OrderedDict is it apparently does not support ordered set operations:

In [1]: x = OrderedDict([("b", None), ("a", None), ("c", None)])                                        

In [2]: y = OrderedDict([("c", None), ("b", None), ("z", None)])                                        

In [3]: x.keys()                                                                                        
Out[3]: odict_keys(['b', 'a', 'c'])

In [4]: x.keys() - y.keys()                                                                            
Out[4]: {'a'}

In [5]: type(x.keys() - y.keys())                                                                      
Out[5]: set

In [6]: x.keys() | y.keys()                                                                            
Out[6]: {'a', 'b', 'c', 'z'}

@gandhis1
Copy link
Contributor Author

Revisiting this - any issue if I literally copy the original source code, which is a single file covered under the MIT license, into this repository? MIT license does not affect the license of any other code in this repository, as long as the original license is included with the file.

@fdosani
Copy link
Member

fdosani commented Jun 29, 2021

@SanthiSridharan sorry for the extended delay on this. Been really busy.
So I think we can merge this in. the library doesn't introduce any sub-dependencies and I think that is fine in itself. If this causes issues we can revisit it at a later date.

@fdosani fdosani self-requested a review June 29, 2021 22:31
Copy link
Member

@fdosani fdosani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fdosani fdosani merged commit 47ad53c into capitalone:develop Jun 29, 2021
@fdosani fdosani mentioned this pull request Oct 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use deterministic output order for column-based reporting
3 participants