⚡️ Speed up function get_merged_columns by 294%
#28
📄 294% (2.94x) speedup for `get_merged_columns` in `datacompy/core.py`
⏱️ Runtime: 10.3 milliseconds → 2.60 milliseconds (best of 34 runs)
📝 Explanation and details
The optimization achieves a 294% speedup by replacing expensive O(N) column lookups with O(1) set operations.
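The idea can be illustrated with a minimal sketch. This is not the actual diff: `merged_columns_sketch` is a hypothetical stand-in that works on plain lists of column names rather than DataFrames, and raising `ValueError` when a column is missing is an assumption. Only the names `merged_cols_set` and `suffix_str` come from this description.

```python
from typing import Iterable, List


def merged_columns_sketch(
    original_cols: Iterable[str],
    merged_cols: Iterable[str],
    suffix: str,
) -> List[str]:
    """Hypothetical stand-in for get_merged_columns using plain column names."""
    # Build the set once so each membership test below is a hash lookup.
    merged_cols_set = set(merged_cols)
    # Build the suffix once instead of formatting it for every column.
    suffix_str = "_" + suffix

    columns = []
    for col in original_cols:
        if col in merged_cols_set:
            columns.append(col)
        elif col + suffix_str in merged_cols_set:
            columns.append(col + suffix_str)
        else:
            # Assumed behavior: fail loudly if the column is absent in both forms.
            raise ValueError(f"Column not found: {col}")
    return columns
```

For example, `merged_columns_sketch(["id", "amount"], ["id", "amount_df1", "amount_df2"], "df1")` returns `["id", "amount_df1"]`: `id` is found as-is, while `amount` is found only in its suffixed form.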
Key changes:
- `merged_cols_set = set(merged_df.columns)` creates a hash table for O(1) lookups instead of linear searches through pandas Index objects
- `suffix_str = "_" + suffix` avoids repeated string concatenation in the loop

Why it's faster:
The original code performed `col in merged_df.columns` checks, which are O(N) operations on pandas Index objects. With hundreds of columns, this becomes expensive: the line profiler shows 58.7% of time was spent on the first column check and 18.9% on the suffixed column check. The optimized version converts these to O(1) set lookups, reducing lookup time from ~77% to ~24% of total execution time.

Performance by test case:
Impact on workloads:
Based on the function references, `get_merged_columns` is called twice in `_dataframe_merge`, after the pandas merge, to extract column subsets. Because dataframe merging is a core step in data comparison workflows, the gain should be most noticeable in data validation pipelines that process datasets with many columns.

✅ Correctness verification report:
🌀 Generated Regression Tests and Runtime
⏪ Replay Tests and Runtime
test_pytest_teststest_sparktest_helper_py_teststest_fuguetest_fugue_polars_py_teststest_fuguetest_fugue_p__replay_test_0.py::test_datacompy_core_get_merged_columns

To edit these changes, run `git checkout codeflash/optimize-get_merged_columns-mi5tocxc` and push.