⚡️ Speed up method Compare._get_row_summary by 81%
#25
📄 81% (0.81x) speedup for Compare._get_row_summary in datacompy/core.py
⏱️ Runtime: 41.3 milliseconds → 22.8 milliseconds (best of 15 runs)
📝 Explanation and details
The optimized code achieves an 81% speedup by eliminating redundant method calls and computations in two key areas:
Primary optimization in _get_row_summary():
- The original code called self.count_matching_rows() twice and self.intersect_rows.shape[0] twice when building the summary dictionary. The optimized version caches these values in variables (equal_rows and intersect_rows_shape) and reuses them.
- This matters because count_matching_rows() involves pandas DataFrame operations (all(axis=1).sum()), which are computationally expensive. The line profiler shows this method taking ~96% of execution time in the original version.

Secondary optimization in constructor:
- Instead of evaluating str(col).lower() or str(col) for each column in a complex conditional within the list comprehension, the code now separates the logic: it first casts the join_columns list once, then applies either lowercase conversion or string conversion in separate, cleaner list comprehensions.
- It only casts join_columns to List[str] when it's not None, preventing wasted work.

Performance impact analysis from test results:
- _get_row_summary() is likely called in report generation workflows

The optimizations preserve all existing behavior while significantly reducing computational overhead by caching the results of expensive operations.
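The two changes described above can be illustrated with a minimal sketch. This is not the code from datacompy/core.py: MiniCompare, row_summary_uncached, row_summary_cached, normalize_join_columns, the cast_lower parameter, and the dictionary keys are hypothetical names, and a plain list of boolean rows stands in for the pandas DataFrame so that the example runs without third-party packages. Only equal_rows and intersect_rows_shape are taken from the description.

```python
from typing import Dict, List, Optional, Sequence, Union


class MiniCompare:
    """Hypothetical stand-in for Compare; not datacompy's actual class."""

    def __init__(self, intersect_rows: List[Dict[str, bool]]) -> None:
        # Each row maps a "<column>_match" flag to True/False (assumed layout).
        self.intersect_rows = intersect_rows
        self.calls = 0  # how many times the expensive method has run

    def count_matching_rows(self) -> int:
        # Stand-in for the expensive pandas all(axis=1).sum() computation.
        self.calls += 1
        return sum(1 for row in self.intersect_rows if all(row.values()))

    def row_summary_uncached(self) -> Dict[str, int]:
        # Pattern before the change: the expensive call is evaluated twice.
        return {
            "common": len(self.intersect_rows),
            "unequal": len(self.intersect_rows) - self.count_matching_rows(),
            "equal": self.count_matching_rows(),
        }

    def row_summary_cached(self) -> Dict[str, int]:
        # Pattern after the change: compute once, then reuse the locals.
        equal_rows = self.count_matching_rows()
        intersect_rows_shape = len(self.intersect_rows)
        return {
            "common": intersect_rows_shape,
            "unequal": intersect_rows_shape - equal_rows,
            "equal": equal_rows,
        }


def normalize_join_columns(
    join_columns: Optional[Union[str, Sequence[str]]], cast_lower: bool
) -> List[str]:
    """Hypothetical helper: decide once whether to lowercase, not per column."""
    if join_columns is None:
        return []  # nothing to convert, so skip the work entirely
    columns = [join_columns] if isinstance(join_columns, str) else list(join_columns)
    if cast_lower:
        return [str(col).lower() for col in columns]
    return [str(col) for col in columns]
```

In this sketch both summary methods return the same dictionary, but the cached variant increments calls once per summary instead of twice, which is the effect the explanation attributes to the optimized _get_row_summary().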
✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
⏪ Replay Tests and Runtime
test_pytest_teststest_sparktest_helper_py_teststest_fuguetest_fugue_polars_py_teststest_fuguetest_fugue_p__replay_test_0.py::test_datacompy_core_Compare__get_row_summary
To edit these changes, git checkout codeflash/optimize-Compare._get_row_summary-mi5sq0tg and push.