@codeflash-ai codeflash-ai bot commented Nov 19, 2025

📄 17% (0.17x) speedup for columns_equal in datacompy/spark/sql.py

⏱️ Runtime: 230 milliseconds → 197 milliseconds (best of 5 runs)

📝 Explanation and details

The optimized code achieves a roughly 17% speedup (230 ms to 197 ms) through three optimizations that reduce computational overhead in PySpark operations:

1. Optimized _get_column_dtypes() function:

  • Original: Uses next() with generator expressions that scan the dataframe.dtypes list twice, once per column (O(n) in the worst case for each lookup)
  • Optimized: Converts dataframe.dtypes to a dictionary once, enabling O(1) lookups for both columns
  • Impact: Line profiler shows this function's total time reduced from 3.03ms to 1.72ms (~43% faster)
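
The lookup change described above can be sketched in plain Python. PySpark's `DataFrame.dtypes` is a list of `(column_name, type_string)` tuples, so a plain list of tuples stands in for it here. The function names and the sample schema are illustrative assumptions, not the code from this PR.

```python
from typing import List, Tuple

Dtypes = List[Tuple[str, str]]  # same shape as PySpark's DataFrame.dtypes


def get_column_dtypes_scan(dtypes: Dtypes, col_1: str, col_2: str) -> Tuple[str, str]:
    """Linear-scan version: each next() walks the list until it finds a match."""
    base = next(dtype for name, dtype in dtypes if name == col_1)
    compare = next(dtype for name, dtype in dtypes if name == col_2)
    return base, compare


def get_column_dtypes_dict(dtypes: Dtypes, col_1: str, col_2: str) -> Tuple[str, str]:
    """Dictionary version: one pass to build the mapping, then two O(1) lookups."""
    dtype_map = dict(dtypes)
    return dtype_map[col_1], dtype_map[col_2]


# Hypothetical schema, used only to exercise both versions.
sample_dtypes = [("id", "bigint"), ("price_df1", "double"), ("price_df2", "decimal(10,2)")]
print(get_column_dtypes_dict(sample_dtypes, "price_df1", "price_df2"))
```

Building the dictionary is itself one O(n) pass, so the gain comes from replacing two scans with one pass plus two constant-time lookups. The two versions also differ at the edges: a missing column raises `StopIteration` in the scan version and `KeyError` in the dictionary version.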

2. Expression caching in numeric comparisons:

  • Original: Repeatedly constructs identical Spark expressions like col(col_1), lit(abs_tol), isnan(col(col_1)) multiple times within the same logical plan
  • Optimized: Pre-computes and caches these expressions (col1_expr, abs_tol_lit, nan_col1, etc.) to avoid redundant Spark expression construction
  • Impact: Reduces the overhead of building complex Spark logical plans with duplicate sub-expressions
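
The reuse pattern can be illustrated without a Spark session. The sketch below is a pure-Python stand-in, not PySpark: `Expr`, `col`, `isnan` and the combinators are hypothetical toys that only count how many expression objects get constructed. The names `col1_expr` and `col2_expr` follow the PR description; everything else is an assumption made for illustration.

```python
class Expr:
    """Hypothetical stand-in for a Spark Column; counts how many are constructed."""

    created = 0

    def __init__(self, text: str) -> None:
        Expr.created += 1
        self.text = text


def col(name: str) -> Expr:
    return Expr(name)


def isnan(e: Expr) -> Expr:
    return Expr(f"isnan({e.text})")


def equal(a: Expr, b: Expr) -> Expr:
    return Expr(f"({a.text} = {b.text})")


def both(a: Expr, b: Expr) -> Expr:
    return Expr(f"({a.text} AND {b.text})")


def either(a: Expr, b: Expr) -> Expr:
    return Expr(f"({a.text} OR {b.text})")


def nan_aware_equal_rebuild(c1: str, c2: str) -> Expr:
    # Every col() call constructs a fresh object, even for the same column.
    return either(equal(col(c1), col(c2)), both(isnan(col(c1)), isnan(col(c2))))


def nan_aware_equal_reuse(c1: str, c2: str) -> Expr:
    # Build each column expression once and reuse it in every sub-expression.
    col1_expr, col2_expr = col(c1), col(c2)
    return either(equal(col1_expr, col2_expr), both(isnan(col1_expr), isnan(col2_expr)))


def count_constructions(build, c1: str, c2: str):
    """Return the expression text and the number of Expr objects it took to build."""
    Expr.created = 0
    expr = build(c1, c2)
    return expr.text, Expr.created


print(count_constructions(nan_aware_equal_rebuild, "a", "b"))
print(count_constructions(nan_aware_equal_reuse, "a", "b"))
```

Both builders produce the same expression text; the reuse version simply constructs fewer intermediate objects. How much that saves in real PySpark depends on the cost of each `col()`, `lit()` and `isnan()` call, which this toy does not model.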

3. Consistent expression reuse in string comparisons:

  • Optimized: Pre-computes col1_expr and col2_expr, then builds normalized versions (c1n, c2n) based on the transformation flags, avoiding repeated col() calls

Performance benefits are amplified in the hot path: Based on the function references, columns_equal() is called in a loop within _intersect_compare() for every shared column between DataFrames. Since DataFrame comparisons often involve many columns, the per-call savings add up across invocations.

The optimization is particularly effective for workloads with DataFrames containing many columns or when performing repeated comparisons, as the expression caching and O(1) dtype lookups provide consistent performance gains without changing the function's behavior or API.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 63 Passed |
| 🌀 Generated Regression Tests | 🔘 None Found |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| test_spark/test_sql_spark.py::test_bad_date_columns | 9.46ms | 7.36ms | 28.5%✅ |
| test_spark/test_sql_spark.py::test_columns_equal_arrays | 29.0ms | 26.8ms | 8.09%✅ |
| test_spark/test_sql_spark.py::test_date_columns_equal | 11.4ms | 11.2ms | 2.02%✅ |
| test_spark/test_sql_spark.py::test_date_columns_equal_with_ignore_spaces | 15.5ms | 15.3ms | 1.80%✅ |
| test_spark/test_sql_spark.py::test_date_columns_equal_with_ignore_spaces_and_case | 19.5ms | 19.4ms | 0.806%✅ |
| test_spark/test_sql_spark.py::test_date_columns_unequal | 33.4ms | 30.9ms | 8.38%✅ |
| test_spark/test_sql_spark.py::test_decimal_columns_equal | 13.0ms | 9.25ms | 40.0%✅ |
| test_spark/test_sql_spark.py::test_decimal_columns_equal_rel | 12.9ms | 9.32ms | 37.9%✅ |
| test_spark/test_sql_spark.py::test_decimal_float_columns_equal | 12.7ms | 9.52ms | 33.2%✅ |
| test_spark/test_sql_spark.py::test_decimal_float_columns_equal_rel | 13.3ms | 9.43ms | 40.4%✅ |
| test_spark/test_sql_spark.py::test_infinity_and_beyond | 13.0ms | 9.21ms | 41.6%✅ |
| test_spark/test_sql_spark.py::test_numeric_columns_equal_abs | 12.9ms | 9.90ms | 30.5%✅ |
| test_spark/test_sql_spark.py::test_numeric_columns_equal_rel | 14.7ms | 10.3ms | 42.4%✅ |
| test_spark/test_sql_spark.py::test_rounded_date_columns | 3.70ms | 3.84ms | -3.54%⚠️ |
| test_spark/test_sql_spark.py::test_string_columns_equal | 3.95ms | 3.88ms | 1.70%✅ |
| test_spark/test_sql_spark.py::test_string_columns_equal_with_ignore_spaces | 5.38ms | 5.38ms | 0.168%✅ |
| test_spark/test_sql_spark.py::test_string_columns_equal_with_ignore_spaces_and_case | 6.41ms | 6.33ms | 1.29%✅ |

To edit these changes, run `git checkout codeflash/optimize-columns_equal-mi68u9sg`, commit your edits, and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 19, 2025 16:55
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: Medium Optimization Quality according to Codeflash labels Nov 19, 2025