feat: hide sensitive columns in PandasCompare without modifying original dfs #494

Merged
fdosani merged 18 commits into capitalone:develop from sh13m:pandas-hide-reveal-col
Mar 29, 2026

@sh13m sh13m commented Mar 6, 2026

Improves on #490. Column hashing is replaced with data hiding (censoring). Hiding no longer affects the original dataframes and should not noticeably increase the memory footprint.

  • Realized that data is already copied internally when the dataframes are merged in _dataframe_merge(), which produces df1_unq_rows, df2_unq_rows, and intersect_rows for downstream use.
  • These dataframes can be modified after _compare() is called to hide the relevant sensitive columns. The original dataframes stay untouched, and this shouldn't introduce any significant memory pressure.
  • Fixes an issue not caught previously: tolerances wouldn't have worked for floats because the hashing was done before comparing.
  • Since hiding now happens after the compare instead of before, the hashing can be replaced by simply overwriting every value in the column with ******* to achieve full data hiding.
    • Much faster and simpler this way (no need to worry about nulls).
  • Sensitive columns are now hidden through the hide_sensitive_columns(sensitive_columns) method and can be revealed again through reveal_sensitive_columns().
    • The setter and validation are combined into a single private method to prevent direct attribute modification; self.sensitive_columns should only be controlled through the hide/reveal methods.
  • Updated unit tests.
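
The points above can be illustrated with a minimal sketch. It is not the code from this PR: `CompareSketch`, `rows()`, and `abs_tol` are hypothetical names, and masking at read time (on a copy) is an assumption made here to keep the reveal step trivial, whereas the PR describes modifying the internal merged dataframes. Only `hide_sensitive_columns`, `reveal_sensitive_columns`, `sensitive_columns`, and the `*******` placeholder come from the description.

```python
import numpy as np
import pandas as pd

MASK = "*******"  # placeholder string mentioned in the PR description


class CompareSketch:
    """Hypothetical stand-in for illustration; not datacompy's PandasCompare."""

    def __init__(self, df1: pd.DataFrame, df2: pd.DataFrame,
                 join_column: str, abs_tol: float = 0.0):
        self._sensitive_columns: list[str] = []
        # merge() returns a new frame, so df1 and df2 are never mutated below.
        self._merged = df1.merge(df2, on=join_column, suffixes=("_df1", "_df2"))
        self._shared = [c for c in df1.columns
                        if c != join_column and c in df2.columns]
        # Compare while the real values are still present, so the float
        # tolerance is applied to numbers rather than to hashes or masks.
        for col in self._shared:
            self._merged[f"{col}_match"] = np.isclose(
                self._merged[f"{col}_df1"], self._merged[f"{col}_df2"],
                rtol=0.0, atol=abs_tol,
            )

    @property
    def sensitive_columns(self) -> tuple[str, ...]:
        # Read-only view: callers cannot assign to this attribute directly.
        return tuple(self._sensitive_columns)

    def _set_sensitive_columns(self, columns: list[str]) -> None:
        # Single private entry point that validates and stores the column list.
        unknown = [c for c in columns if c not in self._shared]
        if unknown:
            raise ValueError(f"unknown columns: {unknown}")
        self._sensitive_columns = list(columns)

    def hide_sensitive_columns(self, sensitive_columns: list[str]) -> None:
        self._set_sensitive_columns(sensitive_columns)

    def reveal_sensitive_columns(self) -> None:
        self._set_sensitive_columns([])

    def rows(self) -> pd.DataFrame:
        # Assumption of this sketch: mask a copy at read time.
        out = self._merged.copy()
        for col in self._sensitive_columns:
            out[f"{col}_df1"] = MASK
            out[f"{col}_df2"] = MASK
        return out


df1 = pd.DataFrame({"id": [1, 2], "salary": [100.0, 200.0]})
df2 = pd.DataFrame({"id": [1, 2], "salary": [100.0005, 250.0]})

compare = CompareSketch(df1, df2, join_column="id", abs_tol=0.001)
compare.hide_sensitive_columns(["salary"])
hidden = compare.rows()      # salary columns show the mask, match flags stay
compare.reveal_sensitive_columns()
revealed = compare.rows()    # real values are visible again
```

Because the match flags are computed before any masking, the first row still counts as a match within the tolerance, which would not be possible if the values had been hashed beforehand.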

@sh13m sh13m marked this pull request as ready for review March 6, 2026 01:09
Member

@fdosani fdosani left a comment

Initial thoughts on the review. Nice work!

Comment thread datacompy/pandas.py Outdated
Comment thread datacompy/pandas.py
Comment thread datacompy/pandas.py
Comment thread datacompy/pandas.py
Comment thread tests/test_pandas.py
@sh13m sh13m closed this Mar 12, 2026
@sh13m sh13m reopened this Mar 12, 2026
@sh13m sh13m closed this Mar 12, 2026
@sh13m sh13m reopened this Mar 12, 2026
@sh13m sh13m closed this Mar 12, 2026
@sh13m sh13m reopened this Mar 12, 2026
Author

sh13m commented Mar 12, 2026

Not sure why the tests were randomly failing; it might have been a GitHub issue.

Member

@fdosani fdosani left a comment

Some comments and thoughts. Appreciate the effort and the improvement on the hashing!

Comment thread datacompy/pandas.py Outdated
Comment thread datacompy/pandas.py Outdated
Comment thread datacompy/pandas.py Outdated
Comment thread tests/test_pandas.py Outdated
@sh13m sh13m closed this Mar 17, 2026
@sh13m sh13m reopened this Mar 17, 2026
@fdosani fdosani merged commit 323bdda into capitalone:develop Mar 29, 2026
18 checks passed