@codeflash-ai codeflash-ai bot commented Nov 19, 2025

📄 47% (0.47x) speedup for PolarsCompare._get_column_comparison in datacompy/polars.py

⏱️ Runtime : 328 microseconds → 224 microseconds (best of 5 runs)

📝 Explanation and details

The optimization replaces three separate list comprehensions and generator expressions with a single loop that processes self.column_stats in one pass.

Key Changes:

  • Single traversal: Instead of iterating through column_stats three times (for unequal columns, equal columns, and unequal values), the optimized version processes all statistics in one loop
  • Direct accumulation: Variables unequal_columns, equal_columns, and unequal_values are incremented directly rather than creating intermediate lists and calling len() or sum()
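As a rough sketch of the two shapes described above (not the actual diff): the function names `column_comparison_three_pass` and `column_comparison_single_pass` are hypothetical, `column_stats` is assumed to be a list of dicts carrying an `unequal_cnt` key, and the exact form of the `sum()` call is an assumption. The real method in `datacompy/polars.py` may differ in detail.

```python
from typing import Any


def column_comparison_three_pass(column_stats: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Original shape (assumed): two temporary lists plus a generator fed to sum()."""
    return {
        "column_comparison": {
            "unequal_columns": len([c for c in column_stats if c["unequal_cnt"] > 0]),
            "equal_columns": len([c for c in column_stats if c["unequal_cnt"] == 0]),
            "unequal_values": sum(c["unequal_cnt"] for c in column_stats),
        }
    }


def column_comparison_single_pass(column_stats: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Optimized shape (assumed): one traversal with three running counters."""
    unequal_columns = 0
    equal_columns = 0
    unequal_values = 0
    for col in column_stats:
        cnt = col["unequal_cnt"]  # raises KeyError if the key is missing, as before
        if cnt > 0:
            unequal_columns += 1
        elif cnt == 0:
            equal_columns += 1
        # Negative counts fall into neither bucket but still contribute to the sum.
        unequal_values += cnt
    return {
        "column_comparison": {
            "unequal_columns": unequal_columns,
            "equal_columns": equal_columns,
            "unequal_values": unequal_values,
        }
    }
```

Using `elif cnt == 0` rather than a bare `else` keeps the two versions equivalent for negative or otherwise unexpected counts.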

Why it's faster:

  1. Reduced memory allocations: The original code creates two temporary lists ([col for col in self.column_stats if col["unequal_cnt"] > 0] and [col for col in self.column_stats if col["unequal_cnt"] == 0]) which require memory allocation and deallocation
  2. Cache efficiency: Processing each dictionary element once likely improves CPU cache utilization compared to three separate passes
  3. Reduced per-call overhead: The single loop avoids the generator machinery behind sum() and no longer builds lists only to pass them to len(); plain += on local integers is cheaper than either
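One way to check these claims locally is a small `timeit` comparison. This is only a sketch: `three_pass` and `single_pass` are hypothetical stand-ins for the original and optimized method bodies, they return plain tuples for brevity, and the absolute timings will vary with machine and Python version.

```python
import timeit


def three_pass(stats):
    # Stand-in for the original: two temporary lists and one generator
    unequal = len([c for c in stats if c["unequal_cnt"] > 0])
    equal = len([c for c in stats if c["unequal_cnt"] == 0])
    total = sum(c["unequal_cnt"] for c in stats)
    return unequal, equal, total


def single_pass(stats):
    # Stand-in for the optimized version: one loop, three counters
    unequal = equal = total = 0
    for c in stats:
        cnt = c["unequal_cnt"]
        if cnt > 0:
            unequal += 1
        elif cnt == 0:
            equal += 1
        total += cnt
    return unequal, equal, total


# 1000 columns, alternating equal and unequal
stats = [{"unequal_cnt": i % 2, "column": f"col{i}"} for i in range(1000)]

for fn in (three_pass, single_pass):
    # Best of 5 repeats, 200 calls each, reported per call
    best = min(timeit.repeat(lambda: fn(stats), number=200, repeat=5))
    print(f"{fn.__name__}: {best / 200 * 1e6:.1f} µs per call")
```

Taking the minimum over several repeats mirrors the "best of 5 runs" convention used in the runtime figure above.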

Performance characteristics:
The optimization shows an 11-82% speedup across the generated test cases, broken down as follows:

  • Large datasets (1000 columns): 35-46% faster
  • Small datasets with simple cases: 58-82% faster
  • Edge cases with mixed values: 11-81% faster

This optimization is especially valuable when _get_column_comparison() is called frequently during data comparison workflows, as it reduces both computational overhead and memory pressure with no change in functionality.
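The per-test percentages in this report appear to follow the formula `(original / optimized - 1) * 100`. The helper below is a hypothetical illustration of that arithmetic, not part of Codeflash; the inputs are the rounded timings shown in the test listing.

```python
def speedup_percent(original: float, optimized: float) -> float:
    """Relative speedup as a percentage: (original / optimized - 1) * 100."""
    return (original / optimized - 1.0) * 100.0


# Parametrized basic/edge test: 21.9μs -> 13.8μs
print(round(speedup_percent(21.9, 13.8), 1))  # 58.7

# Large scale, all equal: 84.5μs -> 62.4μs
print(round(speedup_percent(84.5, 62.4), 1))  # 35.4
```

Applied to the rounded headline runtimes (328 and 224 microseconds) the same formula gives about 46.4%, so the reported 47% presumably comes from unrounded measurements.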

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 17 Passed |
| ⏪ Replay Tests | 16 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
from datacompy.polars import PolarsCompare

# function to test (already provided above, not repeated here)
# We assume PolarsCompare._get_column_comparison is defined as above.

# --------------------------
# Unit tests for _get_column_comparison
# --------------------------


class DummyPolarsCompare(PolarsCompare):
    """
    Helper subclass to allow easy injection of column_stats for testing.
    """

    def __init__(self, column_stats):
        # We don't want to run the full parent __init__, just set column_stats
        self.column_stats = column_stats


@pytest.mark.parametrize(
    "column_stats, expected",
    [
        # Basic: All columns equal, no unequal values
        (
            [
                {"unequal_cnt": 0, "column": "a"},
                {"unequal_cnt": 0, "column": "b"},
                {"unequal_cnt": 0, "column": "c"},
            ],
            {
                "column_comparison": {
                    "unequal_columns": 0,
                    "equal_columns": 3,
                    "unequal_values": 0,
                }
            },
        ),
        # Basic: One column with unequal values
        (
            [
                {"unequal_cnt": 0, "column": "a"},
                {"unequal_cnt": 2, "column": "b"},
                {"unequal_cnt": 0, "column": "c"},
            ],
            {
                "column_comparison": {
                    "unequal_columns": 1,
                    "equal_columns": 2,
                    "unequal_values": 2,
                }
            },
        ),
        # Basic: All columns unequal
        (
            [
                {"unequal_cnt": 1, "column": "a"},
                {"unequal_cnt": 5, "column": "b"},
                {"unequal_cnt": 3, "column": "c"},
            ],
            {
                "column_comparison": {
                    "unequal_columns": 3,
                    "equal_columns": 0,
                    "unequal_values": 9,
                }
            },
        ),
        # Edge: No columns at all
        (
            [],
            {
                "column_comparison": {
                    "unequal_columns": 0,
                    "equal_columns": 0,
                    "unequal_values": 0,
                }
            },
        ),
        # Edge: Some columns missing "column" key, but "unequal_cnt" present
        (
            [
                {"unequal_cnt": 0},
                {"unequal_cnt": 1},
                {"unequal_cnt": 0},
            ],
            {
                "column_comparison": {
                    "unequal_columns": 1,
                    "equal_columns": 2,
                    "unequal_values": 1,
                }
            },
        ),
        # Edge: Negative unequal_cnt (should not happen, but test anyway).
        # -1 is neither > 0 nor == 0, so column "a" is counted in neither bucket.
        (
            [
                {"unequal_cnt": -1, "column": "a"},
                {"unequal_cnt": 0, "column": "b"},
            ],
            {
                "column_comparison": {
                    "unequal_columns": 0,
                    "equal_columns": 1,
                    "unequal_values": -1,
                }
            },
        ),
        # Edge: Large unequal_cnt
        (
            [
                {"unequal_cnt": 0, "column": "a"},
                {"unequal_cnt": 999, "column": "b"},
            ],
            {
                "column_comparison": {
                    "unequal_columns": 1,
                    "equal_columns": 1,
                    "unequal_values": 999,
                }
            },
        ),
    ],
)
def test_get_column_comparison_basic_and_edge(column_stats, expected):
    """
    Test _get_column_comparison for basic and edge cases.
    """
    pc = DummyPolarsCompare(column_stats)
    codeflash_output = pc._get_column_comparison()
    result = codeflash_output  # 21.9μs -> 13.8μs (58.7% faster)
    # Compare against the parametrized expectation so `expected` is actually used
    assert result == expected


def test_get_column_comparison_large_scale_all_equal():
    """
    Large scale: 1000 columns, all equal.
    """
    column_stats = [{"unequal_cnt": 0, "column": f"col{i}"} for i in range(1000)]
    pc = DummyPolarsCompare(column_stats)
    codeflash_output = pc._get_column_comparison()
    result = codeflash_output  # 84.5μs -> 62.4μs (35.4% faster)


def test_get_column_comparison_large_scale_half_unequal():
    """
    Large scale: 1000 columns, half with unequal_cnt=0, half with unequal_cnt=1.
    """
    column_stats = []
    for i in range(500):
        column_stats.append({"unequal_cnt": 0, "column": f"eq{i}"})
    for i in range(500):
        column_stats.append({"unequal_cnt": 1, "column": f"neq{i}"})
    pc = DummyPolarsCompare(column_stats)
    codeflash_output = pc._get_column_comparison()
    result = codeflash_output  # 83.9μs -> 58.1μs (44.6% faster)


def test_get_column_comparison_large_scale_varied_unequal():
    """
    Large scale: 1000 columns, each with increasing unequal_cnt.
    """
    column_stats = [{"unequal_cnt": i, "column": f"col{i}"} for i in range(1000)]
    pc = DummyPolarsCompare(column_stats)
    expected_unequal_columns = 999  # Only col0 has unequal_cnt==0
    expected_equal_columns = 1
    expected_unequal_values = sum(range(1000))  # sum 0..999
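    codeflash_output = pc._get_column_comparison()
    result = codeflash_output  # 84.0μs -> 57.6μs (45.8% faster)
    # Check the result against the expected values computed above
    assert result == {
        "column_comparison": {
            "unequal_columns": expected_unequal_columns,
            "equal_columns": expected_equal_columns,
            "unequal_values": expected_unequal_values,
        }
    }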


def test_get_column_comparison_all_unequal_cnt_zero():
    """
    All columns with 'unequal_cnt' == 0, but with extra keys.
    """
    column_stats = [
        {"unequal_cnt": 0, "column": f"col{i}", "foo": "bar"} for i in range(10)
    ]
    pc = DummyPolarsCompare(column_stats)
    codeflash_output = pc._get_column_comparison()
    result = codeflash_output  # 3.34μs -> 1.98μs (69.3% faster)


def test_get_column_comparison_missing_unequal_cnt_key():
    """
    Edge: Some columns missing 'unequal_cnt' key should raise KeyError.
    """
    column_stats = [
        {"unequal_cnt": 0, "column": "a"},
        {"column": "b"},  # missing unequal_cnt
    ]
    pc = DummyPolarsCompare(column_stats)
    with pytest.raises(KeyError):
        pc._get_column_comparison()  # 1.83μs -> 1.64μs (11.4% faster)


def test_get_column_comparison_non_integer_unequal_cnt():
    """
    Edge: unequal_cnt is not an integer (should still sum as float).
    """
    column_stats = [
        {"unequal_cnt": 1.5, "column": "a"},
        {"unequal_cnt": 0, "column": "b"},
    ]
    pc = DummyPolarsCompare(column_stats)
    codeflash_output = pc._get_column_comparison()
    result = codeflash_output  # 3.19μs -> 1.77μs (80.8% faster)


def test_get_column_comparison_negative_and_large():
    """
    Edge: mix of negative, zero, and large values.
    """
    column_stats = [
        {"unequal_cnt": -10, "column": "neg"},
        {"unequal_cnt": 0, "column": "zero"},
        {"unequal_cnt": 500, "column": "big"},
    ]
    pc = DummyPolarsCompare(column_stats)
    codeflash_output = pc._get_column_comparison()
    result = codeflash_output  # 2.82μs -> 1.77μs (59.4% faster)


def test_get_column_comparison_empty_dicts():
    """
    Edge: column_stats contains empty dicts (should raise KeyError).
    """
    column_stats = [{} for _ in range(3)]
    pc = DummyPolarsCompare(column_stats)
    with pytest.raises(KeyError):
        pc._get_column_comparison()  # 1.61μs -> 1.21μs (33.9% faster)


def test_get_column_comparison_single_column():
    """
    Basic: Only one column, unequal_cnt = 0 and unequal_cnt > 0.
    """
    # unequal_cnt = 0
    pc = DummyPolarsCompare([{"unequal_cnt": 0, "column": "solo"}])
    codeflash_output = pc._get_column_comparison()
    result = codeflash_output  # 2.45μs -> 1.34μs (82.5% faster)
    # unequal_cnt = 7
    pc = DummyPolarsCompare([{"unequal_cnt": 7, "column": "solo"}])
    codeflash_output = pc._get_column_comparison()
    result = codeflash_output  # 1.47μs -> 809ns (81.5% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
⏪ Replay Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|--------------------------|-------------|--------------|---------|
| test_pytest_teststest_snowflake_py_teststest_polars_py_teststest_sparktest_sql_spark_py_teststest_fuguete__replay_test_0.py::test_datacompy_polars_PolarsCompare__get_column_comparison | 36.7μs | 21.2μs | 73.4%✅ |

To edit these changes, run git checkout codeflash/optimize-PolarsCompare._get_column_comparison-mi6mwy42 and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 19, 2025 23:29
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 19, 2025