⚡️ Speed up function `columns_equal` by 17% #37

codeflash-ai · 2025-11-19T16:55:44Z

📄 17% (0.17x) speedup for `columns_equal` in `datacompy/spark/sql.py`

⏱️ Runtime : 230 milliseconds → 197 milliseconds (best of 5 runs)

📝 Explanation and details

The optimized code achieves a 17% speedup (230 ms → 197 ms) through three optimizations that reduce overhead when building PySpark expressions:

**1. Optimized `_get_column_dtypes()` function:**

- **Original**: Uses `next()` with generator expressions, scanning the `dataframe.dtypes` list once per column (two O(n) scans per call, each stopping at the first match)
- **Optimized**: Converts `dataframe.dtypes` to a dictionary once, then does an O(1) lookup for each of the two columns
- **Impact**: Line profiler shows this function's total time reduced from 3.03ms to 1.72ms (about 43% less time)
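
The two lookup styles can be sketched in plain Python. This is an illustration under assumptions, not the code from the PR: the function names `get_dtypes_scan` and `get_dtypes_dict` and the sample data are hypothetical, and the only thing taken from PySpark is the shape of `DataFrame.dtypes`, a list of `(column_name, type_string)` tuples.

```python
from typing import List, Tuple

# Shape of `DataFrame.dtypes` in PySpark: a list of (column_name, type_string) pairs.
Dtypes = List[Tuple[str, str]]


def get_dtypes_scan(dtypes: Dtypes, col_1: str, col_2: str) -> Tuple[str, str]:
    """Look up each column with its own linear scan (hypothetical 'before' version)."""
    base = next(t for name, t in dtypes if name == col_1)
    compare = next(t for name, t in dtypes if name == col_2)
    return base, compare


def get_dtypes_dict(dtypes: Dtypes, col_1: str, col_2: str) -> Tuple[str, str]:
    """Build a name -> type mapping once, then do two O(1) lookups."""
    by_name = dict(dtypes)
    return by_name[col_1], by_name[col_2]


# Made-up sample data for illustration only.
sample = [("id", "bigint"), ("amount", "decimal(10,2)"), ("amount_2", "double")]
print(get_dtypes_scan(sample, "amount", "amount_2"))
print(get_dtypes_dict(sample, "amount", "amount_2"))
```

Two differences are worth keeping in mind when comparing the styles: for a missing column the scan raises `StopIteration` while the dictionary raises `KeyError`, and if a name appears more than once the scan returns the first match while `dict(dtypes)` keeps the last.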

**2. Expression caching in numeric comparisons:**

- **Original**: Repeatedly constructs identical Spark expressions like `col(col_1)`, `lit(abs_tol)`, `isnan(col(col_1))` multiple times within the same logical plan
- **Optimized**: Pre-computes and caches these expressions (`col1_expr`, `abs_tol_lit`, `nan_col1`, etc.) to avoid redundant Spark expression construction
- **Impact**: Reduces the overhead of building complex Spark logical plans with duplicate sub-expressions
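
The effect of reusing sub-expressions can be shown with a small counting sketch. To keep it runnable without a Spark session, `Expr`, `col`, and `isnan` below are hypothetical stand-ins for PySpark's `Column`, `col()`, and `isnan()`; they only record how many expression objects get built. The function names and the `both_nan`/`either_nan` variables are likewise assumptions, not the PR's code.

```python
from typing import Callable, List, Tuple


class Expr:
    """Hypothetical stand-in for a PySpark Column; counts every node that is built."""

    built = 0

    def __init__(self, text: str) -> None:
        Expr.built += 1
        self.text = text

    def __and__(self, other: "Expr") -> "Expr":
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other: "Expr") -> "Expr":
        return Expr(f"({self.text} OR {other.text})")


def col(name: str) -> Expr:
    return Expr(name)


def isnan(expr: Expr) -> Expr:
    return Expr(f"isnan({expr.text})")


def nan_check_repeated(col_1: str, col_2: str) -> Tuple[Expr, Expr]:
    # Every col()/isnan() call builds a fresh node, even when the text is identical.
    both_nan = isnan(col(col_1)) & isnan(col(col_2))
    either_nan = isnan(col(col_1)) | isnan(col(col_2))
    return both_nan, either_nan


def nan_check_cached(col_1: str, col_2: str) -> Tuple[Expr, Expr]:
    # Build each sub-expression once and reuse the same object.
    nan_col1 = isnan(col(col_1))
    nan_col2 = isnan(col(col_2))
    return nan_col1 & nan_col2, nan_col1 | nan_col2


def count_built(fn: Callable[[str, str], Tuple[Expr, Expr]], a: str, b: str) -> Tuple[int, List[str]]:
    """Return how many nodes `fn` built, plus the text of the resulting expressions."""
    Expr.built = 0
    result = fn(a, b)
    return Expr.built, [e.text for e in result]


print(count_built(nan_check_repeated, "a", "b"))  # 10 nodes built
print(count_built(nan_check_cached, "a", "b"))    # 6 nodes built, same expressions
```

Both versions produce the same expression text; the cached one simply constructs fewer objects. In real PySpark each construction also involves a call into the JVM, which is presumably where the saving reported above comes from.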

**3. Consistent expression reuse in string comparisons:**

- **Optimized**: Pre-computes `col1_expr` and `col2_expr`, then builds normalized versions (`c1n`, `c2n`) based on the transformation flags, avoiding repeated `col()` calls

**Performance benefits are amplified in the hot path:** Based on the function references, `columns_equal()` is called in a loop within `_intersect_compare()`, once for every column shared between the two DataFrames. Since DataFrame comparisons often involve many columns, the per-call savings add up across invocations.

The optimization is particularly effective for workloads with DataFrames containing many columns or when performing repeated comparisons, as the expression caching and O(1) dtype lookups provide consistent performance gains without changing the function's behavior or API.

✅ Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | ✅ 63 Passed |
| 🌀 Generated Regression Tests | 🔘 None Found |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `test_spark/test_sql_spark.py::test_bad_date_columns` | 9.46ms | 7.36ms | 28.5% ✅ |
| `test_spark/test_sql_spark.py::test_columns_equal_arrays` | 29.0ms | 26.8ms | 8.09% ✅ |
| `test_spark/test_sql_spark.py::test_date_columns_equal` | 11.4ms | 11.2ms | 2.02% ✅ |
| `test_spark/test_sql_spark.py::test_date_columns_equal_with_ignore_spaces` | 15.5ms | 15.3ms | 1.80% ✅ |
| `test_spark/test_sql_spark.py::test_date_columns_equal_with_ignore_spaces_and_case` | 19.5ms | 19.4ms | 0.806% ✅ |
| `test_spark/test_sql_spark.py::test_date_columns_unequal` | 33.4ms | 30.9ms | 8.38% ✅ |
| `test_spark/test_sql_spark.py::test_decimal_columns_equal` | 13.0ms | 9.25ms | 40.0% ✅ |
| `test_spark/test_sql_spark.py::test_decimal_columns_equal_rel` | 12.9ms | 9.32ms | 37.9% ✅ |
| `test_spark/test_sql_spark.py::test_decimal_float_columns_equal` | 12.7ms | 9.52ms | 33.2% ✅ |
| `test_spark/test_sql_spark.py::test_decimal_float_columns_equal_rel` | 13.3ms | 9.43ms | 40.4% ✅ |
| `test_spark/test_sql_spark.py::test_infinity_and_beyond` | 13.0ms | 9.21ms | 41.6% ✅ |
| `test_spark/test_sql_spark.py::test_numeric_columns_equal_abs` | 12.9ms | 9.90ms | 30.5% ✅ |
| `test_spark/test_sql_spark.py::test_numeric_columns_equal_rel` | 14.7ms | 10.3ms | 42.4% ✅ |
| `test_spark/test_sql_spark.py::test_rounded_date_columns` | 3.70ms | 3.84ms | -3.54% ⚠️ |
| `test_spark/test_sql_spark.py::test_string_columns_equal` | 3.95ms | 3.88ms | 1.70% ✅ |
| `test_spark/test_sql_spark.py::test_string_columns_equal_with_ignore_spaces` | 5.38ms | 5.38ms | 0.168% ✅ |
| `test_spark/test_sql_spark.py::test_string_columns_equal_with_ignore_spaces_and_case` | 6.41ms | 6.33ms | 1.29% ✅ |

To edit these changes, run `git checkout codeflash/optimize-columns_equal-mi68u9sg`, make your changes, and push.


codeflash-ai bot requested a review from mashraf-222 November 19, 2025 16:55

codeflash-ai bot added the `⚡️ codeflash` (Optimization PR opened by Codeflash AI) and `🎯 Quality: Medium` (Optimization Quality according to Codeflash) labels Nov 19, 2025
