⚡️ Speed up method PolarsCompare._get_column_comparison by 47%
#48
📄 47% (0.47x) speedup for `PolarsCompare._get_column_comparison` in `datacompy/polars.py`
⏱️ Runtime: 328 microseconds → 224 microseconds (best of 5 runs)
📝 Explanation and details
The optimization replaces three separate list comprehensions and generator expressions with a single loop that processes `self.column_stats` in one pass.
Key Changes:
- Instead of iterating over `column_stats` three times (for unequal columns, equal columns, and unequal values), the optimized version processes all statistics in one loop.
- The counters `unequal_columns`, `equal_columns`, and `unequal_values` are incremented directly rather than creating intermediate lists and calling `len()` or `sum()`.
Why it's faster:
- The original builds two temporary lists (`[col for col in self.column_stats if col["unequal_cnt"] > 0]` and `[col for col in self.column_stats if col["unequal_cnt"] == 0]`), which require memory allocation and deallocation.
- Direct increments (`+=`) are faster than calling `len()` and `sum()` on collections.
Performance characteristics:
The optimization shows consistent 35-82% speedup across all test cases, with particularly strong gains on:
This optimization is especially valuable when `_get_column_comparison()` is called frequently during data comparison workflows, as it reduces both computational overhead and memory pressure with no change in functionality.
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
⏪ Replay Tests and Runtime
test_pytest_teststest_snowflake_py_teststest_polars_py_teststest_sparktest_sql_spark_py_teststest_fuguete__replay_test_0.py::test_datacompy_polars_PolarsCompare__get_column_comparison
To edit these changes, `git checkout codeflash/optimize-PolarsCompare._get_column_comparison-mi6mwy42` and push.
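For readers who want to see the shape of the change described under "Explanation and details", here is a minimal sketch of the two counting strategies. It is not the actual diff: the standalone functions `column_comparison_three_pass` and `column_comparison_single_pass` and the `sample_stats` data are hypothetical names introduced for illustration. The only assumption taken from the description is that each entry in `column_stats` is a dict with an integer `unequal_cnt` key.

```python
from typing import Any


def column_comparison_three_pass(column_stats: list[dict[str, Any]]) -> dict[str, int]:
    """Hypothetical 'before' shape: walks column_stats three times and
    builds two temporary lists just to take their length."""
    return {
        "unequal_columns": len([col for col in column_stats if col["unequal_cnt"] > 0]),
        "equal_columns": len([col for col in column_stats if col["unequal_cnt"] == 0]),
        "unequal_values": sum(col["unequal_cnt"] for col in column_stats),
    }


def column_comparison_single_pass(column_stats: list[dict[str, Any]]) -> dict[str, int]:
    """Hypothetical 'after' shape: one loop, three integer counters,
    no intermediate lists."""
    unequal_columns = 0
    equal_columns = 0
    unequal_values = 0
    for col in column_stats:
        cnt = col["unequal_cnt"]
        if cnt > 0:
            unequal_columns += 1
        elif cnt == 0:
            equal_columns += 1
        unequal_values += cnt
    return {
        "unequal_columns": unequal_columns,
        "equal_columns": equal_columns,
        "unequal_values": unequal_values,
    }


# Illustrative data only; the "column" key is an assumption, not taken from datacompy.
sample_stats = [
    {"column": "id", "unequal_cnt": 0},
    {"column": "name", "unequal_cnt": 3},
    {"column": "amount", "unequal_cnt": 5},
]

# Both strategies should agree on every input.
assert column_comparison_three_pass(sample_stats) == column_comparison_single_pass(sample_stats)
```

On the sample data both functions report two unequal columns, one equal column, and eight unequal values. The single-pass version trades a little brevity for fewer iterations and no temporary allocations, which is the trade-off the explanation above attributes the speedup to.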