@codeflash-ai codeflash-ai bot commented Nov 19, 2025

📄 20% (0.20x) speedup for sort_rows in datacompy/spark/helper.py

⏱️ Runtime : 469 milliseconds → 391 milliseconds (best of 5 runs)

📝 Explanation and details

The optimized code achieves a 20% speedup through two key performance improvements:

**1. Optimized Column Membership Testing**
The original code performs O(N) linear searches through the `compare_cols` list for each column in `base_cols`. The optimization creates `compare_cols_set = set(compare_cols)` once, then uses set membership testing (`x not in compare_cols_set`), which is O(1) on average. For DataFrames with many columns, this reduces the overall complexity from O(N²) to O(N).
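To make the pattern concrete, here is a minimal, self-contained sketch of the before/after membership test; the column names are invented for illustration and this is not the verbatim datacompy source:

```python
base_cols = ["a", "b", "c"]
compare_cols = ["a", "b"]

# Original pattern: each `not in` on a list is an O(N) scan, so checking
# every base column costs O(N^2) overall.
missing_slow = [x for x in base_cols if x not in compare_cols]

# Optimized pattern: build the set once, then each lookup is O(1) on average.
compare_cols_set = set(compare_cols)
missing_fast = [x for x in base_cols if x not in compare_cols_set]

assert missing_slow == missing_fast  # identical result, lower complexity
```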

**2. More Efficient DataFrame Column Addition**
Replaced `df.select("*", row_number().over(w).alias("row"))` with `df.withColumn("row", row_number().over(w))`. The `withColumn` method is specifically optimized for adding a single new column and avoids the overhead of reconstructing the entire column schema that `select("*", ...)` requires.
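In isolation, the difference between the two calls looks like the following hedged sketch; the throwaway DataFrame and session setup are for illustration only, not the helper's actual code:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "val"])
w = Window.orderBy(*df.columns)

# Original pattern: re-projects every existing column just to append one.
with_row_old = df.select("*", row_number().over(w).alias("row"))

# Optimized pattern: withColumn is the dedicated API for adding one column.
with_row_new = df.withColumn("row", row_number().over(w))

# Both produce the same rows; only the plan construction overhead differs.
assert sorted(with_row_old.collect()) == sorted(with_row_new.collect())
```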

**Performance Impact Analysis**
Based on the function references, `sort_rows` is called from `compare_by_row`, which is likely used in data comparison workflows. The line profiler shows that DataFrame operations (`Window.orderBy` and column additions) consume ~98% of execution time, so even small optimizations to these operations yield meaningful gains.

**Test Case Applicability**
The optimizations are most beneficial for DataFrames with larger numbers of columns, where set-based membership testing provides the greatest advantage. The annotated tests show minimal impact on error cases, which is expected since exceptions are raised early, before the optimized code paths are reached.

The optimizations maintain identical behavior and output while improving performance for typical use cases where DataFrames have multiple columns that need validation and sorting.
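Putting both changes together, a `sort_rows`-style helper might look roughly like the sketch below. The real signature, error type, and message in `datacompy/spark/helper.py` may differ; this only illustrates how the two optimizations compose:

```python
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import row_number


def sort_rows_sketch(base_df: DataFrame, compare_df: DataFrame):
    base_cols = base_df.columns
    compare_cols_set = set(compare_df.columns)  # built once, O(1) lookups

    # Fail fast if the compare DataFrame is missing any base columns.
    missing = [x for x in base_cols if x not in compare_cols_set]
    if missing:
        raise Exception(f"columns missing from compare df: {missing}")

    # Deterministic ordering over all base columns, then a single added
    # "row" column via withColumn rather than select("*", ...).
    w = Window.orderBy(*base_cols)
    return (
        base_df.withColumn("row", row_number().over(w)),
        compare_df.withColumn("row", row_number().over(w)),
    )
```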

**Correctness verification report:**

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 7 Passed |
| 🌀 Generated Regression Tests | 3 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

⚙️ Existing Unit Tests and Runtime

| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `test_spark/test_helper.py::test_sort_rows_failure` | 9.56μs | 10.3μs | -6.94% ⚠️ |
| `test_spark/test_helper.py::test_sort_rows_success` | 16.2ms | 11.8ms | 37.5% ✅ |
🌀 Generated Regression Tests and Runtime

```python
import logging

# imports
import pytest  # used for our unit tests
from datacompy.spark.helper import sort_rows
from pyspark.sql import Row, SparkSession

LOG = logging.getLogger(__name__)

try:
    import pyspark.sql
    from pyspark.sql import Window
    from pyspark.sql.functions import row_number
except ImportError:
    LOG.warning(
        "Please note that you are missing the optional dependency: spark. "
        "If you need to use this functionality it must be installed."
    )

# unit tests


@pytest.fixture(scope="module")
def spark():
    # Create a SparkSession for testing
    spark = (
        SparkSession.builder.master("local[1]").appName("sort_rows_test").getOrCreate()
    )
    yield spark
    spark.stop()


# -------------------
# 1. Basic Test Cases
# -------------------


def test_column_mismatch_extra_in_base(spark):
    # base_df has a column not present in compare_df
    base = spark.createDataFrame([Row(a=1, b=2, c=3)])
    compare = spark.createDataFrame([Row(a=1, b=2)])
    with pytest.raises(Exception):
        sort_rows(base, compare)  # 8.30μs -> 8.34μs (0.456% slower)


def test_unsortable_column_type(spark):
    # DataFrame has a column with unsortable type (array)
    base = spark.createDataFrame([Row(a=[1, 2], b=2)])
    compare = spark.createDataFrame([Row(a=[1, 2], b=2)])
    # Should raise an AnalysisException due to inability to sort array type
    with pytest.raises(Exception):
        sort_rows(base, compare)
```

A second generated regression test module follows the same scaffold:

```python
import logging

# imports
import pytest
from datacompy.spark.helper import sort_rows
from pyspark.sql import Row, SparkSession

LOG = logging.getLogger(__name__)

try:
    import pyspark.sql
    from pyspark.sql import Window
    from pyspark.sql.functions import row_number
except ImportError:
    LOG.warning(
        "Please note that you are missing the optional dependency: spark. "
        "If you need to use this functionality it must be installed."
    )

# unit tests


@pytest.fixture(scope="module")
def spark():
    # Create a SparkSession for testing
    spark_session = (
        SparkSession.builder.master("local[1]").appName("pytest-spark").getOrCreate()
    )
    yield spark_session
    spark_session.stop()


# -------- BASIC TEST CASES --------


def test_sort_rows_missing_column_in_compare(spark):
    # base_df has a column not present in compare_df
    df1 = spark.createDataFrame([Row(a=1, b=2)])
    df2 = spark.createDataFrame([Row(b=2)])
    with pytest.raises(Exception):
        sort_rows(df1, df2)  # 7.81μs -> 8.57μs (8.86% slower)
```

To edit these changes, `git checkout codeflash/optimize-sort_rows-mi5wfuxq` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 19, 2025 11:08
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 19, 2025