⚡️ Speed up function sort_rows by 20%
#34
📄 20% (0.20x) speedup for `sort_rows` in `datacompy/spark/helper.py`
⏱️ Runtime: 469 milliseconds → 391 milliseconds (best of 5 runs)
📝 Explanation and details
The optimized code achieves a 20% speedup through two key performance improvements:
1. Optimized Column Membership Testing

The original code performs an O(N) linear search through the `compare_cols` list for each column in `base_cols`. The optimization creates `compare_cols_set = set(compare_cols)` once, then uses set membership testing (`x not in compare_cols_set`), which is O(1) average case. For DataFrames with many columns, this dramatically reduces the computational complexity of the check from O(N²) to O(N).

2. More Efficient DataFrame Column Addition
Replaced `df.select("*", row_number().over(w).alias("row"))` with `df.withColumn("row", row_number().over(w))`. The `withColumn` method is specifically optimized for adding a single new column and avoids the overhead of reconstructing the entire column schema that `select("*", ...)` requires.

Performance Impact Analysis
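The first change can be illustrated in plain Python, independent of Spark. This is a minimal sketch, not the actual helper; the function name `find_missing_columns` and the exact shape of the check are assumptions about what the validation inside `sort_rows` looks like:

```python
def find_missing_columns(base_cols, compare_cols):
    """Return columns of the base DataFrame absent from the compare DataFrame.

    Building the set once makes each membership test O(1) on average,
    so the whole check is O(N) instead of the O(N^2) list-scan version.
    """
    compare_cols_set = set(compare_cols)  # built once, O(N)
    return [c for c in base_cols if c not in compare_cols_set]

missing = find_missing_columns(["id", "name", "amount"], ["id", "amount"])
# missing == ["name"]
```

The list comprehension still preserves the original column order, so any error message listing the missing columns is unchanged; only the cost of each lookup drops.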
Based on the function references, `sort_rows` is called from `compare_by_row`, which is likely used in data comparison workflows. The line profiler shows that DataFrame operations (`Window.orderBy` and column additions) consume ~98% of execution time, so even small optimizations to these operations yield meaningful gains.

Test Case Applicability
The optimizations are most beneficial for DataFrames with larger numbers of columns, where the set-based membership testing provides the greatest advantage. The annotated tests show minimal impact on error cases, which is expected since exceptions are raised early before the optimized code paths are reached.
The optimizations maintain identical behavior and output while improving performance for typical use cases where DataFrames have multiple columns that need validation and sorting.
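The "identical behavior" claim for the membership change can be checked quickly with a pure-Python equivalence test; both helper names here are hypothetical stand-ins for the original and optimized code paths:

```python
import random
import string

def missing_list(base_cols, compare_cols):
    # Original style: linear scan of compare_cols for every column.
    return [c for c in base_cols if c not in compare_cols]

def missing_set(base_cols, compare_cols):
    # Optimized style: one set build, then O(1) average-case lookups.
    compare_cols_set = set(compare_cols)
    return [c for c in base_cols if c not in compare_cols_set]

random.seed(0)  # reproducible column names
cols = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(500)]
base, compare = cols, cols[::2]
assert missing_list(base, compare) == missing_set(base, compare)
```

Because `in` on a list and `in` on a set test the same equality relation, the two versions return identical results for any inputs; only the lookup cost differs.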
✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
test_spark/test_helper.py::test_sort_rows_failure
test_spark/test_helper.py::test_sort_rows_success
🌀 Generated Regression Tests and Runtime
To edit these changes, run
`git checkout codeflash/optimize-sort_rows-mi5wfuxq`
and push.