⚡️ Speed up function _generate_id_within_group by 138%
#40
📄 138% (1.38x) speedup for `_generate_id_within_group` in `datacompy/spark/sql.py`
⏱️ Runtime: 16.8 seconds → 7.07 seconds (best of 5 runs)
📝 Explanation and details
The optimization achieves a 137% speedup by consolidating multiple expensive Spark operations into a single DataFrame action, eliminating redundant data processing.
Key Optimization: Single-Pass Null/Default Value Detection
The original code performs two separate Spark actions to check for nulls and default values. The optimized version combines both checks into a single Spark action.
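The code blocks from the original report are not reproduced here, but the two approaches described above can be sketched roughly as follows. This is an illustrative sketch, not the actual datacompy implementation: the `col_name` and `default_value` parameters are hypothetical names, and the PySpark import is deferred inside each function so the sketch can be read (and its structure checked) without a Spark installation.

```python
def has_null_or_default_two_pass(df, col_name, default_value):
    """Sketch of the ORIGINAL approach: two separate Spark actions,
    each triggering its own scan of the DataFrame."""
    from pyspark.sql import functions as F  # deferred: optional dependency

    # Action 1: does the join column contain any nulls?
    any_null = df.selectExpr(
        f"max(isnull({col_name})) as any_null"
    ).first()["any_null"]
    # Action 2: does the join column contain the sentinel default value?
    any_default = df.selectExpr(
        f"max({col_name} = '{default_value}') as any_default"
    ).first()["any_default"]
    return any_null, any_default


def has_null_or_default_one_pass(df, col_name, default_value):
    """Sketch of the OPTIMIZED approach: both aggregates computed in a
    single action, using column expressions (isnull, lit) rather than
    string-based selectExpr."""
    from pyspark.sql import functions as F  # deferred: optional dependency

    row = df.select(
        F.max(F.isnull(F.col(col_name))).alias("any_null"),
        F.max(F.col(col_name) == F.lit(default_value)).alias("any_default"),
    ).limit(1).collect()[0]
    return row["any_null"], row["any_default"]
```

The key difference is that the two-pass version materializes two separate jobs, while the one-pass version folds both boolean aggregates into one `select().limit(1).collect()` job over a single scan.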
Performance Impact Analysis:
From the line profiler results, the original code spends 95% of total time (26.6 s out of 28.2 s) on the two separate `selectExpr().first()` calls. The optimized version reduces this to 86.7% of total time (11.7 s out of 13.5 s) for the single `select().limit(1).collect()` operation.
Why This Matters for Hot Path Usage:
Based on the function references, `_generate_id_within_group` is called during duplicate row processing in `_dataframe_merge()`, which appears to be part of a core data comparison workflow. When `self._any_dupes` is True, this function is called twice per comparison (once for each DataFrame), making the optimization particularly valuable.
Additional Benefits:
Using column expressions (`isnull`, `lit`) instead of string-based `selectExpr` reduces parsing overhead.
The optimization is most effective for datasets with join columns containing nulls or requiring duplicate handling, which are common scenarios in data comparison workflows.
✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
test_spark/test_sql_spark.py::test_generate_id_within_group
test_spark/test_sql_spark.py::test_generate_id_within_group_single_join
To edit these changes, run
git checkout codeflash/optimize-_generate_id_within_group-mi6f73sp
and push.