Use OrderedSet to maintain DataFrame column order through set operations #108

gandhis1 · 2021-04-09T19:13:39Z

Fixes #81. Iteration on #107. This aims to maintain the original order when printing out the set of common columns as well as unique columns, both which are based on set operations.

…on set operations This aims to maintain the original order when printing out the set of common columns as well as unique columns, both which are based on set operations.

CLAassistant · 2021-04-09T19:13:43Z

All committers have signed the CLA.

gandhis1 · 2021-04-09T19:18:00Z

tests/test_core.py

+    compare = datacompy.Compare(df1, df2, ["join"])
+    assert list(compare.df1_unq_columns()) == ["f", "c"]
+    assert list(compare.df2_unq_columns()) == ["e", "d"]
+    assert list(compare.intersect_columns()) == ["join", "g", "b", "h", "a"]


This example is intended to try to test edge cases (make sure the intersection is "stable" based on the left side), but doesn't necessarily reflect a realistic case. A realistic case is comparing two DataFrames where the columns are virtually identical with minimal differences, and it is in this context that maintaining the original DataFrame ordering actually becomes useful.

fdosani

Thanks for the PR @gandhis1. @KrishanBhasin previous PR #107 partially fixes this correct. This is just a variation on the implementation from what I am seeing? In principal I'm fine with the changes, its more adding another dependency which I'd like to discuss. The library was last touched in June 2020. @jborchma @elzzhu thoughts?

fdosani · 2021-04-13T17:57:06Z

@gandhis1 Would you be open to implementing this without an external dependency? Maybe a collections.OrderedDict which just uses the keys. Talking about this with the team and I agree the order would be helpful but I'm a bit hesitant with introducing a new dependency like ordered-set

gandhis1 · 2021-04-13T18:27:54Z

So we can copy in the source code here, which removes the PyPi package dependency. The library is under MIT license which as I understand means we just need a copyright and license notice included in the source file.

The challenge with using OrderedDict is it apparently does not support ordered set operations:

In [1]: x = OrderedDict([("b", None), ("a", None), ("c", None)])                                        

In [2]: y = OrderedDict([("c", None), ("b", None), ("z", None)])                                        

In [3]: x.keys()                                                                                        
Out[3]: odict_keys(['b', 'a', 'c'])

In [4]: x.keys() - y.keys()                                                                            
Out[4]: {'a'}

In [5]: type(x.keys() - y.keys())                                                                      
Out[5]: set

In [6]: x.keys() | y.keys()                                                                            
Out[6]: {'a', 'b', 'c', 'z'}

gandhis1 · 2021-04-26T23:44:53Z

Revisiting this - any issue if I literally copy the original source code, which is a single file covered under the MIT license, into this repository? MIT license does not affect the license of any other code in this repository, as long as the original license is included with the file.

fdosani · 2021-06-29T22:31:26Z

@SanthiSridharan sorry for the extended delay on this. Been really busy.
So I think we can merge this in. the library doesn't introduce any sub-dependencies and I think that is fine in itself. If this causes issues we can revisit it at a later date.

fdosani

LGTM

Use OrderedSet to maintain DataFrame column order throughout comparis…

837c2c6

…on set operations This aims to maintain the original order when printing out the set of common columns as well as unique columns, both which are based on set operations.

gandhis1 requested review from elzzhu, fdosani and jborchma as code owners April 9, 2021 19:13

gandhis1 commented Apr 9, 2021

View reviewed changes

fdosani reviewed Apr 12, 2021

View reviewed changes

fdosani self-requested a review June 29, 2021 22:31

fdosani approved these changes Jun 29, 2021

View reviewed changes

fdosani merged commit 47ad53c into capitalone:develop Jun 29, 2021

fdosani mentioned this pull request Oct 22, 2021

Release v0.7.3 #118

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use OrderedSet to maintain DataFrame column order through set operations #108

Use OrderedSet to maintain DataFrame column order through set operations #108

gandhis1 commented Apr 9, 2021

CLAassistant commented Apr 9, 2021 •

edited

gandhis1 Apr 9, 2021

fdosani left a comment

fdosani commented Apr 13, 2021

gandhis1 commented Apr 13, 2021 •

edited

gandhis1 commented Apr 26, 2021

fdosani commented Jun 29, 2021

fdosani left a comment

Use OrderedSet to maintain DataFrame column order through set operations #108

Use OrderedSet to maintain DataFrame column order through set operations #108

Conversation

gandhis1 commented Apr 9, 2021

CLAassistant commented Apr 9, 2021 • edited

gandhis1 Apr 9, 2021

Choose a reason for hiding this comment

fdosani left a comment

Choose a reason for hiding this comment

fdosani commented Apr 13, 2021

gandhis1 commented Apr 13, 2021 • edited

gandhis1 commented Apr 26, 2021

fdosani commented Jun 29, 2021

fdosani left a comment

Choose a reason for hiding this comment

CLAassistant commented Apr 9, 2021 •

edited

gandhis1 commented Apr 13, 2021 •

edited