⚡️ Speed up function columns_equal by 45%
#27
📄 45% (0.45x) speedup for columns_equal in datacompy/core.py
⏱️ Runtime: 131 milliseconds → 90.5 milliseconds (best of 38 runs)
📝 Explanation and details
The optimized code delivers a roughly 45% speedup (131 ms down to 90.5 ms) through the following changes:
Core Optimizations
1. Reduced infer_dtype calls: The original code called pd.api.types.infer_dtype() twice per function call. The optimized version caches the results of these expensive calls upfront, eliminating redundant type inference.
2. Vectorized array operations: For list/array comparisons, pandas DataFrame.apply() was replaced with np.fromiter() and direct numpy operations. This eliminates the overhead of creating temporary DataFrames and applying a function row by row.
3. Direct numpy array access: The .values property is used to work directly with the underlying numpy arrays for string and numeric comparisons, bypassing pandas Series overhead in performance-critical sections.
4. Optimized string normalization: normalize_string_column() now uses chained .str operations when both ignore_spaces and ignore_case are True, and returns early to avoid unnecessary processing.

Performance Impact by Test Case
- … DataFrame.apply() with efficient np.fromiter()
- … infer_dtype results

Production Impact
Based on function_references, columns_equal is called in hot paths such as _intersect_compare(), which loops through all shared columns, and all_mismatch(), which processes intersection dataframes. The optimizations particularly benefit:
- … np.fromiter() optimization provides dramatic speedups

The optimizations preserve the function's behavior while reducing computational overhead, which makes dataframe comparisons faster in production workflows.
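The switch from DataFrame.apply() to np.fromiter() for list/array columns can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the helper names lists_equal_apply and lists_equal_fromiter are hypothetical, and the sketch assumes both columns have the same length and index.

```python
import numpy as np
import pandas as pd


def lists_equal_apply(col_1: pd.Series, col_2: pd.Series) -> pd.Series:
    """Hypothetical 'before' pattern: build a temporary DataFrame and
    compare the two cells of every row with a row-wise apply."""
    temp = pd.DataFrame({"a": col_1, "b": col_2})
    return temp.apply(lambda row: np.array_equal(row["a"], row["b"]), axis=1)


def lists_equal_fromiter(col_1: pd.Series, col_2: pd.Series) -> pd.Series:
    """Hypothetical 'after' pattern: iterate over the underlying numpy
    object arrays and collect the results straight into a bool array."""
    result = np.fromiter(
        (np.array_equal(a, b) for a, b in zip(col_1.values, col_2.values)),
        dtype=bool,
        count=len(col_1),  # length is known, so the output can be preallocated
    )
    return pd.Series(result, index=col_1.index)


left = pd.Series([[1, 2], [3, 4], [5]])
right = pd.Series([[1, 2], [3, 0], [5]])

# Both helpers should agree element by element.
print(lists_equal_apply(left, right).tolist())
print(lists_equal_fromiter(left, right).tolist())
```

The second helper still calls np.array_equal once per row, so any gain would come from skipping the temporary DataFrame and the per-row Series that apply(axis=1) constructs, not from removing the Python-level loop.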
✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime
test_core.py::test_bad_date_columns
test_core.py::test_categorical_column
test_core.py::test_columns_equal_lists
test_core.py::test_columns_equal_numpy_arrays
test_core.py::test_date_columns_equal
test_core.py::test_date_columns_equal_with_ignore_spaces
test_core.py::test_date_columns_equal_with_ignore_spaces_and_case
test_core.py::test_date_columns_unequal
test_core.py::test_decimal_columns_equal
test_core.py::test_decimal_columns_equal_rel
test_core.py::test_decimal_float_columns_equal
test_core.py::test_decimal_float_columns_equal_rel
test_core.py::test_infinity_and_beyond
test_core.py::test_mixed_column
test_core.py::test_mixed_column_with_ignore_spaces
test_core.py::test_mixed_column_with_ignore_spaces_and_case
test_core.py::test_numeric_columns_equal_abs
test_core.py::test_numeric_columns_equal_rel
test_core.py::test_rounded_date_columns
test_core.py::test_single_date_columns_equal_to_string
test_core.py::test_string_as_numeric
test_core.py::test_string_columns_equal
test_core.py::test_string_columns_equal_with_ignore_spaces
test_core.py::test_string_columns_equal_with_ignore_spaces_and_case
test_core.py::test_string_pyarrow_columns_equal
🌀 Generated Regression Tests and Runtime
⏪ Replay Tests and Runtime
test_pytest_teststest_sparktest_helper_py_teststest_fuguetest_fugue_polars_py_teststest_fuguetest_fugue_p__replay_test_0.py::test_datacompy_core_columns_equal
To edit these changes, git checkout codeflash/optimize-columns_equal-mi5te72n and push.