⚡️ Speed up function format_numeric_fields by 8%
#35
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
📄 8% (0.08x) speedup for
format_numeric_fieldsindatacompy/spark/helper.py⏱️ Runtime :
397 milliseconds→366 milliseconds(best of5runs)📝 Explanation and details
The optimization achieves an 8% speedup through two key changes that reduce Python overhead during DataFrame column processing:
1. Set-based membership testing: Changed
numeric_typesfrom a list to a set. The original code performedc[1] not in numeric_typeschecks 229 times according to the profiler. List membership testing is O(n) while set membership is O(1), making lookups significantly faster when checking data types.2. Generator expression instead of list building: Replaced the explicit loop that builds a
fixed_colslist with a generator expression that lazily computes column transformations. This eliminates the overhead of:list.append()calls (229 total appends in the original)The profiler results show the optimization's impact clearly - the original code spent significant time in the loop structure (lines with
c[1] not in numeric_types,fixed_cols.append()calls), while the optimized version consolidates all the column processing logic into a single generator that's consumed directly bydf.select().The generator approach is particularly effective here because PySpark's
select()method can consume the generator directly without materializing an intermediate list, reducing memory pressure and eliminating Python-level iteration overhead. This optimization is most beneficial for DataFrames with many columns, where the membership testing and list building costs accumulate significantly.✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
test_spark/test_helper.py::test_format_numeric_fieldsTo edit these changes
git checkout codeflash/optimize-format_numeric_fields-mi5xgu5tand push.