@codeflash-ai codeflash-ai bot commented Nov 19, 2025

📄 20% (0.20x) speedup for sort_rows in datacompy/spark/helper.py

⏱️ Runtime : 469 milliseconds → 391 milliseconds (best of 5 runs)

📝 Explanation and details

The optimized code achieves a 20% speedup through two key performance improvements:

**1. Optimized Column Membership Testing**
The original code performs O(N) linear searches through the `compare_cols` list for each column in `base_cols`. The optimization creates `compare_cols_set = set(compare_cols)` once, then uses set membership testing (`x not in compare_cols_set`), which is O(1) on average. For DataFrames with many columns, this reduces the overall complexity from O(N²) to O(N).
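To make the pattern concrete, here is a minimal, self-contained sketch of the before/after membership test; the column names are invented for illustration and this is not the verbatim datacompy source:

```python
base_cols = ["a", "b", "c"]
compare_cols = ["a", "b"]

# Original pattern: each `not in` on a list is an O(N) scan, so checking
# every base column costs O(N^2) overall.
missing_slow = [x for x in base_cols if x not in compare_cols]

# Optimized pattern: build the set once, then each lookup is O(1) on average.
compare_cols_set = set(compare_cols)
missing_fast = [x for x in base_cols if x not in compare_cols_set]

assert missing_slow == missing_fast  # identical result, lower complexity
```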

**2. More Efficient DataFrame Column Addition**
Replaced `df.select("*", row_number().over(w).alias("row"))` with `df.withColumn("row", row_number().over(w))`. The `withColumn` method is specifically optimized for adding a single new column and avoids the overhead of reconstructing the entire column schema that `select("*", ...)` requires.
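In isolation, the difference between the two calls looks like the following hedged sketch; the throwaway DataFrame and session setup are for illustration only, not the helper's actual code:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "val"])
w = Window.orderBy(*df.columns)

# Original pattern: re-projects every existing column just to append one.
with_row_old = df.select("*", row_number().over(w).alias("row"))

# Optimized pattern: withColumn is the dedicated API for adding one column.
with_row_new = df.withColumn("row", row_number().over(w))

# Both produce the same rows; only the plan construction overhead differs.
assert sorted(with_row_old.collect()) == sorted(with_row_new.collect())
```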

**Performance Impact Analysis**
Based on the function references, `sort_rows` is called from `compare_by_row`, which is likely used in data comparison workflows. The line profiler shows that DataFrame operations (`Window.orderBy` and column additions) consume ~98% of execution time, so even small optimizations to these operations yield meaningful gains.

**Test Case Applicability**
The optimizations are most beneficial for DataFrames with larger numbers of columns, where set-based membership testing provides the greatest advantage. The annotated tests show minimal impact on error cases, which is expected since exceptions are raised early, before the optimized code paths are reached.

The optimizations maintain identical behavior and output while improving performance for typical use cases where DataFrames have multiple columns that need validation and sorting.
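Putting both changes together, a `sort_rows`-style helper might look roughly like the sketch below. The real signature, error type, and message in `datacompy/spark/helper.py` may differ; this only illustrates how the two optimizations compose:

```python
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import row_number


def sort_rows_sketch(base_df: DataFrame, compare_df: DataFrame):
    base_cols = base_df.columns
    compare_cols_set = set(compare_df.columns)  # built once, O(1) lookups

    # Fail fast if the compare DataFrame is missing any base columns.
    missing = [x for x in base_cols if x not in compare_cols_set]
    if missing:
        raise Exception(f"columns missing from compare df: {missing}")

    # Deterministic ordering over all base columns, then a single added
    # "row" column via withColumn rather than select("*", ...).
    w = Window.orderBy(*base_cols)
    return (
        base_df.withColumn("row", row_number().over(w)),
        compare_df.withColumn("row", row_number().over(w)),
    )
```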

**Correctness verification report:**

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 7 Passed |
| 🌀 Generated Regression Tests | 3 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `test_spark/test_helper.py::test_sort_rows_failure` | 9.56μs | 10.3μs | -6.94% ⚠️ |
| `test_spark/test_helper.py::test_sort_rows_success` | 16.2ms | 11.8ms | 37.5% ✅ |
🌀 Generated Regression Tests and Runtime

```python
import logging

# imports
import pytest  # used for our unit tests
from datacompy.spark.helper import sort_rows
from pyspark.sql import Row, SparkSession

LOG = logging.getLogger(__name__)

try:
    import pyspark.sql
    from pyspark.sql import Window
    from pyspark.sql.functions import row_number
except ImportError:
    LOG.warning(
        "Please note that you are missing the optional dependency: spark. "
        "If you need to use this functionality it must be installed."
    )

# unit tests


@pytest.fixture(scope="module")
def spark():
    # Create a SparkSession for testing
    spark = (
        SparkSession.builder.master("local[1]").appName("sort_rows_test").getOrCreate()
    )
    yield spark
    spark.stop()


# -------------------
# 1. Basic Test Cases
# -------------------


def test_column_mismatch_extra_in_base(spark):
    # base_df has a column not present in compare_df
    base = spark.createDataFrame([Row(a=1, b=2, c=3)])
    compare = spark.createDataFrame([Row(a=1, b=2)])
    with pytest.raises(Exception):
        sort_rows(base, compare)  # 8.30μs -> 8.34μs (0.456% slower)


def test_unsortable_column_type(spark):
    # DataFrame has a column with unsortable type (array)
    base = spark.createDataFrame([Row(a=[1, 2], b=2)])
    compare = spark.createDataFrame([Row(a=[1, 2], b=2)])
    # Should raise an AnalysisException due to inability to sort array type
    with pytest.raises(Exception):
        sort_rows(base, compare)
```

A second generated regression test module follows the same scaffold:

```python
import logging

# imports
import pytest
from datacompy.spark.helper import sort_rows
from pyspark.sql import Row, SparkSession

LOG = logging.getLogger(__name__)

try:
    import pyspark.sql
    from pyspark.sql import Window
    from pyspark.sql.functions import row_number
except ImportError:
    LOG.warning(
        "Please note that you are missing the optional dependency: spark. "
        "If you need to use this functionality it must be installed."
    )

# unit tests


@pytest.fixture(scope="module")
def spark():
    # Create a SparkSession for testing
    spark_session = (
        SparkSession.builder.master("local[1]").appName("pytest-spark").getOrCreate()
    )
    yield spark_session
    spark_session.stop()


# -------- BASIC TEST CASES --------


def test_sort_rows_missing_column_in_compare(spark):
    # base_df has a column not present in compare_df
    df1 = spark.createDataFrame([Row(a=1, b=2)])
    df2 = spark.createDataFrame([Row(b=2)])
    with pytest.raises(Exception):
        sort_rows(df1, df2)  # 7.81μs -> 8.57μs (8.86% slower)
```

To edit these changes, `git checkout codeflash/optimize-sort_rows-mi5wfuxq` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 19, 2025 11:08
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 19, 2025