codeflash-ai bot commented Nov 19, 2025

📄 9% (0.09x) speedup for `_aggregate_stats` in `datacompy/fugue.py`

⏱️ Runtime: 187 milliseconds → 172 milliseconds (best of 7 runs)

📝 Explanation and details

The optimized code achieves a **9% speedup** through several targeted micro-optimizations that reduce function call overhead and improve data structure operations:

**Key Optimizations Applied:**

1. **Method Localization**: The code caches `samples.__getitem__` as `samples_append` to avoid repeated attribute lookups in the inner loop, reducing overhead when accessing dictionary keys.

2. **Efficient DataFrame Concatenation**: Instead of using a list comprehension with `pd.concat(v)` inside `_aggregate_stats`, the optimization pre-processes the concatenation logic to avoid redundant operations. It uses `ignore_index=True` when concatenating multiple DataFrames and skips concatenation entirely for single DataFrames.

3. **Index Reset Optimization in `_sample`**: The function now checks whether the DataFrame already has a proper `RangeIndex` before calling `reset_index()`, avoiding unnecessary index operations when the DataFrame is already in the desired state.

4. **Variable Renaming for Clarity**: Cached `stats.append` as `stats_append` to better reflect its purpose, and cached `concat_results.append` as `concat_append` for faster method access.
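
The method-localization pattern (items 1 and 4) can be sketched generically. Everything below is an illustrative assumption, not datacompy's actual code: the function name `aggregate` and locals `samples_get`/`stats_append` are made up to show the technique of binding a bound method to a local once, outside the hot loop.

```python
# Hypothetical sketch of method localization (names are illustrative,
# not datacompy's internals).
from collections import defaultdict

def aggregate(rows):
    samples = defaultdict(list)
    stats = []
    # Bind the bound methods to locals once, so each loop iteration
    # skips a repeated attribute lookup on `samples` and `stats`.
    samples_get = samples.__getitem__
    stats_append = stats.append
    for row in rows:
        samples_get(row["column"]).append(row["value"])
        stats_append(row["column"])
    return stats, dict(samples)

stats, samples = aggregate([
    {"column": "a", "value": 1},
    {"column": "a", "value": 2},
    {"column": "b", "value": 3},
])
# stats == ["a", "a", "b"]; samples == {"a": [1, 2], "b": [3]}
```

The gain per iteration is tiny, which is why this only shows up in the large-scale test cases where the loop body runs thousands of times.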

**Performance Impact by Test Case:**

- **Large-scale tests** show the biggest gains (11.4% to 89.9% faster), particularly the sparse-column test, which benefits most from the concatenation optimizations
- **Basic operations** see modest improvements (1-4% faster)
- **Edge cases** show mixed results but generally maintain or slightly improve performance

**Why This Matters in Practice:**
Based on the function references, `_aggregate_stats` is called from the main `report()` function in datacompy's comparison workflow. Since this function processes potentially large numbers of column statistics and mismatch samples from distributed comparisons, the optimizations become more valuable as dataset size and column count increase. The improvements in large-scale test cases indicate this optimization will be most beneficial for real-world data comparison scenarios involving many columns or distributed processing contexts.

The optimizations are particularly effective for workloads with many columns and multiple comparison operations, where the reduced function call overhead and efficient DataFrame operations compound to provide meaningful performance gains.
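
The concatenation and index-reset tricks (items 2 and 3) reduce to two small guards. This is a hedged sketch under assumed helper names (`combine`, `maybe_reset_index` are invented for illustration), not the library's implementation:

```python
# Hypothetical sketch of the concatenation and index-reset guards.
import pandas as pd

def combine(frames):
    # Skip pd.concat entirely when only one frame was collected;
    # otherwise concatenate with a fresh 0..n-1 index.
    if len(frames) == 1:
        return frames[0]
    return pd.concat(frames, ignore_index=True)

def maybe_reset_index(df):
    # Only pay for reset_index() when the index is not already a
    # plain RangeIndex starting at 0 with step 1.
    idx = df.index
    if isinstance(idx, pd.RangeIndex) and idx.start == 0 and idx.step == 1:
        return df
    return df.reset_index(drop=True)

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3]})
combined = combine([a, b])          # 3 rows, fresh RangeIndex
same = maybe_reset_index(combined)  # no-op: index is already 0..2
```

The single-frame fast path explains the outsized 89.9% win on the sparse-column test, where most columns only ever collect one sample DataFrame.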

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 23 Passed |
| ⏪ Replay Tests | 6 Passed |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |

🌀 Generated Regression Tests and Runtime
from typing import Any, Dict, List

import pandas as pd

# imports
from datacompy.fugue import _aggregate_stats

# function to test
# (The function is already provided in the prompt and does not need to be rewritten here.)


# Helper: Build a compare dict for test
def make_compare(
    column_stats: List[Dict[str, Any]], mismatch_samples: Dict[str, pd.DataFrame]
) -> Dict[str, Any]:
    return {"column_stats": column_stats, "mismatch_samples": mismatch_samples}


# -------------------- BASIC TEST CASES --------------------


def test_basic_single_compare_single_column():
    # Single compare, single column, basic stats
    compare = make_compare(
        column_stats=[
            {
                "column": "a",
                "match_column": True,
                "match_cnt": 8,
                "unequal_cnt": 2,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": False,
                "max_diff": 3,
                "null_diff": 1,
            }
        ],
        mismatch_samples={"a": pd.DataFrame({"a": [1, 2]})},
    )
    stats, samples = _aggregate_stats(
        [compare], sample_count=2
    )  # 2.97ms -> 2.88ms (2.94% faster)
    s = stats[0]


def test_basic_multiple_compares_same_column():
    # Two compares, same column, stats should be aggregated
    compare1 = make_compare(
        column_stats=[
            {
                "column": "b",
                "match_column": True,
                "match_cnt": 5,
                "unequal_cnt": 1,
                "dtype1": "float",
                "dtype2": "float",
                "all_match": True,
                "max_diff": 0.2,
                "null_diff": 0,
            }
        ],
        mismatch_samples={"b": pd.DataFrame({"b": [10.1]})},
    )
    compare2 = make_compare(
        column_stats=[
            {
                "column": "b",
                "match_column": True,
                "match_cnt": 7,
                "unequal_cnt": 3,
                "dtype1": "float",
                "dtype2": "float",
                "all_match": False,
                "max_diff": 0.5,
                "null_diff": 2,
            }
        ],
        mismatch_samples={"b": pd.DataFrame({"b": [20.2, 30.3]})},
    )
    stats, samples = _aggregate_stats(
        [compare1, compare2], sample_count=3
    )  # 3.05ms -> 3.00ms (1.46% faster)
    s = stats[0]
    expected = pd.concat(
        [pd.DataFrame({"b": [10.1]}), pd.DataFrame({"b": [20.2, 30.3]})],
        ignore_index=True,
    )
    result_sample = samples[0]


def test_basic_multiple_columns():
    # Two compares, different columns
    compare1 = make_compare(
        column_stats=[
            {
                "column": "x",
                "match_column": True,
                "match_cnt": 2,
                "unequal_cnt": 0,
                "dtype1": "str",
                "dtype2": "str",
                "all_match": True,
                "max_diff": 0,
                "null_diff": 0,
            }
        ],
        mismatch_samples={"x": pd.DataFrame({"x": ["foo"]})},
    )
    compare2 = make_compare(
        column_stats=[
            {
                "column": "y",
                "match_column": False,
                "match_cnt": 1,
                "unequal_cnt": 1,
                "dtype1": "str",
                "dtype2": "int",
                "all_match": False,
                "max_diff": 1,
                "null_diff": 1,
            }
        ],
        mismatch_samples={"y": pd.DataFrame({"y": ["bar"]})},
    )
    stats, samples = _aggregate_stats(
        [compare1, compare2], sample_count=2
    )  # 3.00ms -> 2.89ms (3.74% faster)
    columns = {s["column"] for s in stats}


# -------------------- EDGE TEST CASES --------------------


def test_edge_no_mismatch_samples_but_column_stats():
    # Compares with column_stats but no mismatch_samples
    compare = make_compare(
        column_stats=[
            {
                "column": "w",
                "match_column": True,
                "match_cnt": 3,
                "unequal_cnt": 0,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": True,
                "max_diff": 0,
                "null_diff": 0,
            }
        ],
        mismatch_samples={},
    )
    stats, samples = _aggregate_stats(
        [compare], sample_count=2
    )  # 3.23ms -> 3.33ms (3.06% slower)


def test_edge_all_match_true_aggregation():
    # all_match should be True only if all are True
    compare1 = make_compare(
        column_stats=[
            {
                "column": "c",
                "match_column": True,
                "match_cnt": 2,
                "unequal_cnt": 0,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": True,
                "max_diff": 0,
                "null_diff": 0,
            }
        ],
        mismatch_samples={},
    )
    compare2 = make_compare(
        column_stats=[
            {
                "column": "c",
                "match_column": True,
                "match_cnt": 1,
                "unequal_cnt": 1,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": True,
                "max_diff": 1,
                "null_diff": 0,
            }
        ],
        mismatch_samples={},
    )
    stats, _ = _aggregate_stats(
        [compare1, compare2], sample_count=1
    )  # 3.27ms -> 3.34ms (2.16% slower)


def test_edge_all_match_false_aggregation():
    # all_match should be False if any are False
    compare1 = make_compare(
        column_stats=[
            {
                "column": "d",
                "match_column": True,
                "match_cnt": 2,
                "unequal_cnt": 0,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": True,
                "max_diff": 0,
                "null_diff": 0,
            }
        ],
        mismatch_samples={},
    )
    compare2 = make_compare(
        column_stats=[
            {
                "column": "d",
                "match_column": True,
                "match_cnt": 1,
                "unequal_cnt": 1,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": False,
                "max_diff": 1,
                "null_diff": 0,
            }
        ],
        mismatch_samples={},
    )
    stats, _ = _aggregate_stats(
        [compare1, compare2], sample_count=1
    )  # 2.94ms -> 2.93ms (0.290% faster)


def test_edge_dtype_preserved():
    # dtype1/dtype2 should be preserved as first
    compare1 = make_compare(
        column_stats=[
            {
                "column": "e",
                "match_column": True,
                "match_cnt": 1,
                "unequal_cnt": 0,
                "dtype1": "int",
                "dtype2": "float",
                "all_match": True,
                "max_diff": 0,
                "null_diff": 0,
            }
        ],
        mismatch_samples={},
    )
    compare2 = make_compare(
        column_stats=[
            {
                "column": "e",
                "match_column": True,
                "match_cnt": 1,
                "unequal_cnt": 0,
                "dtype1": "int",
                "dtype2": "float",
                "all_match": True,
                "max_diff": 0,
                "null_diff": 0,
            }
        ],
        mismatch_samples={},
    )
    stats, _ = _aggregate_stats(
        [compare1, compare2], sample_count=1
    )  # 2.91ms -> 2.90ms (0.298% faster)


def test_edge_max_diff_aggregation():
    # max_diff should be the maximum of all max_diff values
    compare1 = make_compare(
        column_stats=[
            {
                "column": "f",
                "match_column": True,
                "match_cnt": 1,
                "unequal_cnt": 0,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": True,
                "max_diff": 1,
                "null_diff": 0,
            }
        ],
        mismatch_samples={},
    )
    compare2 = make_compare(
        column_stats=[
            {
                "column": "f",
                "match_column": True,
                "match_cnt": 1,
                "unequal_cnt": 0,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": True,
                "max_diff": 10,
                "null_diff": 0,
            }
        ],
        mismatch_samples={},
    )
    stats, _ = _aggregate_stats(
        [compare1, compare2], sample_count=1
    )  # 2.88ms -> 2.93ms (1.70% slower)


def test_edge_null_diff_aggregation():
    # null_diff should be summed
    compare1 = make_compare(
        column_stats=[
            {
                "column": "g",
                "match_column": True,
                "match_cnt": 1,
                "unequal_cnt": 0,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": True,
                "max_diff": 0,
                "null_diff": 2,
            }
        ],
        mismatch_samples={},
    )
    compare2 = make_compare(
        column_stats=[
            {
                "column": "g",
                "match_column": True,
                "match_cnt": 1,
                "unequal_cnt": 0,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": True,
                "max_diff": 0,
                "null_diff": 3,
            }
        ],
        mismatch_samples={},
    )
    stats, _ = _aggregate_stats(
        [compare1, compare2], sample_count=1
    )  # 2.91ms -> 2.91ms (0.108% faster)


# -------------------- LARGE SCALE TEST CASES --------------------


def test_large_many_compares_many_columns():
    # Many compares, many columns
    num_compares = 50
    num_columns = 20
    compares = []
    for i in range(num_compares):
        column_stats = []
        mismatch_samples = {}
        for j in range(num_columns):
            colname = f"col{j}"
            column_stats.append(
                {
                    "column": colname,
                    "match_column": True,
                    "match_cnt": i + j,
                    "unequal_cnt": j,
                    "dtype1": "int",
                    "dtype2": "int",
                    "all_match": True,
                    "max_diff": j,
                    "null_diff": i,
                }
            )
            # Each mismatch sample is a DataFrame of 2 rows
            mismatch_samples[colname] = pd.DataFrame({colname: [i, i + 1]})
        compares.append(make_compare(column_stats, mismatch_samples))
    stats, samples = _aggregate_stats(
        compares, sample_count=5
    )  # 26.5ms -> 23.8ms (11.4% faster)
    # Each stat should have correct aggregation
    for s in stats:
        j = int(s["column"][3:])
        # match_cnt: sum over compares
        expected_match_cnt = sum(i + j for i in range(num_compares))
        expected_unequal_cnt = num_compares * j
        expected_max_diff = j
        expected_null_diff = sum(i for i in range(num_compares))
    for idx, sample in enumerate(samples):
        # All values should be from the set of expected values
        expected_values = set()
        for i in range(num_compares):
            expected_values.add(i)
            expected_values.add(i + 1)
        colname = f"col{idx}"


def test_large_many_mismatch_rows():
    # Single compare, one column, mismatch_samples with 1000 rows, sample_count=10
    df = pd.DataFrame({"bigcol": list(range(1000))})
    compare = make_compare(
        column_stats=[
            {
                "column": "bigcol",
                "match_column": True,
                "match_cnt": 1000,
                "unequal_cnt": 0,
                "dtype1": "int",
                "dtype2": "int",
                "all_match": True,
                "max_diff": 0,
                "null_diff": 0,
            }
        ],
        mismatch_samples={"bigcol": df},
    )
    stats, samples = _aggregate_stats(
        [compare], sample_count=10
    )  # 3.25ms -> 3.27ms (0.588% slower)
    # Should be 10 sampled rows, all in 0..999
    sample = samples[0]


def test_large_many_columns_sparse_stats():
    # Each compare only covers a subset of columns
    num_compares = 20
    num_columns = 50
    compares = []
    for i in range(num_compares):
        column_stats = []
        mismatch_samples = {}
        for j in range(
            i, num_columns, num_compares
        ):  # Each compare covers ~2-3 columns
            colname = f"col{j}"
            column_stats.append(
                {
                    "column": colname,
                    "match_column": True,
                    "match_cnt": 1,
                    "unequal_cnt": 0,
                    "dtype1": "int",
                    "dtype2": "int",
                    "all_match": True,
                    "max_diff": 0,
                    "null_diff": 0,
                }
            )
            mismatch_samples[colname] = pd.DataFrame({colname: [j]})
        compares.append(make_compare(column_stats, mismatch_samples))
    stats, samples = _aggregate_stats(
        compares, sample_count=1
    )  # 6.30ms -> 3.32ms (89.9% faster)
    # All samples should have 1 row
    for sample in samples:
        pass


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pandas as pd

# imports
import pytest
from datacompy.fugue import _aggregate_stats

# unit tests

# ------------------ BASIC TEST CASES ------------------


def test_single_compare_single_column():
    # One compare, one column, all values match
    compares = [
        {
            "column_stats": [
                {
                    "column": "A",
                    "match_column": True,
                    "match_cnt": 10,
                    "unequal_cnt": 0,
                    "dtype1": "int",
                    "dtype2": "int",
                    "all_match": True,
                    "max_diff": 0,
                    "null_diff": 0,
                }
            ],
            "mismatch_samples": {"A": pd.DataFrame([])},
        }
    ]
    stats, samples = _aggregate_stats(
        compares, sample_count=5
    )  # 3.08ms -> 3.15ms (2.46% slower)


def test_multiple_compares_multiple_columns():
    # Two compares, two columns, some mismatches
    compares = [
        {
            "column_stats": [
                {
                    "column": "A",
                    "match_column": True,
                    "match_cnt": 5,
                    "unequal_cnt": 1,
                    "dtype1": "int",
                    "dtype2": "int",
                    "all_match": False,
                    "max_diff": 2,
                    "null_diff": 0,
                },
                {
                    "column": "B",
                    "match_column": True,
                    "match_cnt": 6,
                    "unequal_cnt": 0,
                    "dtype1": "float",
                    "dtype2": "float",
                    "all_match": True,
                    "max_diff": 0.0,
                    "null_diff": 0,
                },
            ],
            "mismatch_samples": {
                "A": pd.DataFrame({"idx": [1], "left": [10], "right": [12]}),
                "B": pd.DataFrame([]),
            },
        },
        {
            "column_stats": [
                {
                    "column": "A",
                    "match_column": True,
                    "match_cnt": 4,
                    "unequal_cnt": 2,
                    "dtype1": "int",
                    "dtype2": "int",
                    "all_match": False,
                    "max_diff": 5,
                    "null_diff": 1,
                },
                {
                    "column": "B",
                    "match_column": True,
                    "match_cnt": 6,
                    "unequal_cnt": 1,
                    "dtype1": "float",
                    "dtype2": "float",
                    "all_match": False,
                    "max_diff": 1.5,
                    "null_diff": 0,
                },
            ],
            "mismatch_samples": {
                "A": pd.DataFrame({"idx": [2], "left": [15], "right": [20]}),
                "B": pd.DataFrame({"idx": [3], "left": [2.0], "right": [3.5]}),
            },
        },
    ]
    stats, samples = _aggregate_stats(
        compares, sample_count=2
    )  # 3.16ms -> 3.12ms (1.50% faster)
    # Check aggregation for column A
    stat_a = next(s for s in stats if s["column"] == "A")
    # Check aggregation for column B
    stat_b = next(s for s in stats if s["column"] == "B")
    # For A, should have two rows (from both compares)
    sample_a = (
        samples[0]
        if "idx" in samples[0].columns and 1 in samples[0]["idx"].values
        else samples[1]
    )
    # For B, should have one row (from second compare)
    sample_b = (
        samples[1]
        if "idx" in samples[1].columns and 3 in samples[1]["idx"].values
        else samples[0]
    )


# ------------------ EDGE TEST CASES ------------------


def test_missing_mismatch_samples_key():
    # compare dict missing 'mismatch_samples' key (should raise KeyError)
    compares = [
        {
            "column_stats": [
                {
                    "column": "A",
                    "match_column": True,
                    "match_cnt": 1,
                    "unequal_cnt": 0,
                    "dtype1": "int",
                    "dtype2": "int",
                    "all_match": True,
                    "max_diff": 0,
                    "null_diff": 0,
                }
            ]
            # no 'mismatch_samples'
        }
    ]
    with pytest.raises(KeyError):
        _aggregate_stats(compares, sample_count=1)  # 2.67μs -> 3.07μs (13.1% slower)


def test_column_with_only_unequal():
    # All values are unequal for a column
    compares = [
        {
            "column_stats": [
                {
                    "column": "C",
                    "match_column": True,
                    "match_cnt": 0,
                    "unequal_cnt": 5,
                    "dtype1": "str",
                    "dtype2": "str",
                    "all_match": False,
                    "max_diff": None,
                    "null_diff": 0,
                }
            ],
            "mismatch_samples": {
                "C": pd.DataFrame(
                    {
                        "idx": range(5),
                        "left": ["a", "b", "c", "d", "e"],
                        "right": ["f", "g", "h", "i", "j"],
                    }
                )
            },
        }
    ]
    stats, samples = _aggregate_stats(
        compares, sample_count=2
    )  # 3.72ms -> 3.74ms (0.667% slower)


def test_column_with_null_diff_and_none_maxdiff():
    # null_diff and max_diff are None
    compares = [
        {
            "column_stats": [
                {
                    "column": "D",
                    "match_column": True,
                    "match_cnt": 2,
                    "unequal_cnt": 3,
                    "dtype1": "object",
                    "dtype2": "object",
                    "all_match": False,
                    "max_diff": None,
                    "null_diff": None,
                }
            ],
            "mismatch_samples": {
                "D": pd.DataFrame(
                    {
                        "idx": [1, 2, 3],
                        "left": [None, "x", "y"],
                        "right": ["a", None, "z"],
                    }
                )
            },
        }
    ]
    stats, samples = _aggregate_stats(
        compares, sample_count=10
    )  # 3.19ms -> 3.16ms (0.985% faster)


def test_sample_count_larger_than_mismatches():
    # sample_count is larger than available mismatches
    compares = [
        {
            "column_stats": [
                {
                    "column": "E",
                    "match_column": True,
                    "match_cnt": 10,
                    "unequal_cnt": 2,
                    "dtype1": "int",
                    "dtype2": "int",
                    "all_match": False,
                    "max_diff": 3,
                    "null_diff": 0,
                }
            ],
            "mismatch_samples": {
                "E": pd.DataFrame({"idx": [0, 1], "left": [1, 2], "right": [4, 5]})
            },
        }
    ]
    stats, samples = _aggregate_stats(
        compares, sample_count=10
    )  # 2.94ms -> 2.85ms (3.08% faster)


def test_mismatch_samples_with_empty_dataframe():
    # mismatch_samples contains empty DataFrame for some columns
    compares = [
        {
            "column_stats": [
                {
                    "column": "F",
                    "match_column": True,
                    "match_cnt": 3,
                    "unequal_cnt": 0,
                    "dtype1": "float",
                    "dtype2": "float",
                    "all_match": True,
                    "max_diff": 0.0,
                    "null_diff": 0,
                }
            ],
            "mismatch_samples": {"F": pd.DataFrame([])},
        }
    ]
    stats, samples = _aggregate_stats(
        compares, sample_count=3
    )  # 2.94ms -> 2.91ms (1.22% faster)


def test_multiple_columns_with_overlap_and_different_types():
    # Overlapping columns with different dtypes and aggregation
    compares = [
        {
            "column_stats": [
                {
                    "column": "G",
                    "match_column": True,
                    "match_cnt": 2,
                    "unequal_cnt": 1,
                    "dtype1": "int",
                    "dtype2": "float",
                    "all_match": False,
                    "max_diff": 1.0,
                    "null_diff": 0,
                }
            ],
            "mismatch_samples": {
                "G": pd.DataFrame({"idx": [0], "left": [1], "right": [2.0]})
            },
        },
        {
            "column_stats": [
                {
                    "column": "G",
                    "match_column": True,
                    "match_cnt": 3,
                    "unequal_cnt": 2,
                    "dtype1": "int",
                    "dtype2": "float",
                    "all_match": False,
                    "max_diff": 2.0,
                    "null_diff": 1,
                }
            ],
            "mismatch_samples": {
                "G": pd.DataFrame({"idx": [1, 2], "left": [3, 4], "right": [5.0, 6.0]})
            },
        },
    ]
    stats, samples = _aggregate_stats(
        compares, sample_count=10
    )  # 3.13ms -> 3.11ms (0.676% faster)
    stat = stats[0]


# ------------------ LARGE SCALE TEST CASES ------------------


def test_large_number_of_columns_and_compares():
    # 100 columns, 10 compares, each with 2 stats per column
    num_cols = 100
    num_compares = 10
    compares = []
    for i in range(num_compares):
        col_stats = []
        mismatch_samples = {}
        for j in range(num_cols):
            col = f"col_{j}"
            col_stats.append(
                {
                    "column": col,
                    "match_column": True,
                    "match_cnt": 5 + i,
                    "unequal_cnt": 2 + i,
                    "dtype1": "int",
                    "dtype2": "int",
                    "all_match": (i % 2 == 0),
                    "max_diff": i,
                    "null_diff": i,
                }
            )
            # Each compare has 2 mismatches per column
            mismatch_samples[col] = pd.DataFrame(
                {"idx": [i * 2, i * 2 + 1], "left": [i, i + 1], "right": [i + 2, i + 3]}
            )
        compares.append(
            {
                "column_stats": col_stats,
                "mismatch_samples": mismatch_samples,
            }
        )
    stats, samples = _aggregate_stats(
        compares, sample_count=5
    )  # 50.9ms -> 44.8ms (13.4% faster)
    for df in samples:
        pass


def test_large_sample_count_smaller_than_total():
    # 20 columns, 5 compares, each with 10 mismatches per column
    num_cols = 20
    num_compares = 5
    compares = []
    for i in range(num_compares):
        col_stats = []
        mismatch_samples = {}
        for j in range(num_cols):
            col = f"col_{j}"
            col_stats.append(
                {
                    "column": col,
                    "match_column": True,
                    "match_cnt": 10,
                    "unequal_cnt": 10,
                    "dtype1": "int",
                    "dtype2": "int",
                    "all_match": False,
                    "max_diff": 1,
                    "null_diff": 0,
                }
            )
            mismatch_samples[col] = pd.DataFrame(
                {
                    "idx": list(range(i * 10, (i + 1) * 10)),
                    "left": [i] * 10,
                    "right": [j] * 10,
                }
            )
        compares.append(
            {
                "column_stats": col_stats,
                "mismatch_samples": mismatch_samples,
            }
        )
    stats, samples = _aggregate_stats(
        compares, sample_count=7
    )  # 11.0ms -> 9.89ms (11.0% faster)
    # Each sample DataFrame should have 7 rows (sample_count)
    for df in samples:
        pass


def test_performance_with_mixed_types_and_nulls():
    # 50 columns, 3 compares, with mixed types and some nulls
    num_cols = 50
    num_compares = 3
    compares = []
    for i in range(num_compares):
        col_stats = []
        mismatch_samples = {}
        for j in range(num_cols):
            col = f"col_{j}"
            col_stats.append(
                {
                    "column": col,
                    "match_column": True,
                    "match_cnt": 3,
                    "unequal_cnt": 2,
                    "dtype1": "float" if j % 2 == 0 else "str",
                    "dtype2": "float" if j % 2 == 0 else "str",
                    "all_match": i == 0,
                    "max_diff": None if j % 10 == 0 else float(j),
                    "null_diff": None if i == 2 else j % 3,
                }
            )
            mismatch_samples[col] = pd.DataFrame(
                {
                    "idx": [i * 2, i * 2 + 1],
                    "left": [None, j] if j % 2 == 0 else ["a", None],
                    "right": [j, None] if j % 2 == 0 else [None, "b"],
                }
            )
        compares.append(
            {
                "column_stats": col_stats,
                "mismatch_samples": mismatch_samples,
            }
        )
    stats, samples = _aggregate_stats(
        compares, sample_count=4
    )  # 23.1ms -> 20.4ms (13.5% faster)
    # Check that for a column with null_diff None, result is None
    for s in stats:
        if s["column"].endswith("0"):
            pass
        # For columns with null_diff None, should be None
        if s["column"].endswith("1") and num_compares == 3:
            pass  # just a check for code coverage


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
from datacompy.fugue import _aggregate_stats


def test__aggregate_stats():
    with pytest.raises(
        TypeError, match="'SymbolicBoundedInt'\\ object\\ is\\ not\\ subscriptable"
    ):
        _aggregate_stats([0], 0)


def test__aggregate_stats_2():
    with pytest.raises(
        ValueError, match="This\\ flag's\\ object\\ has\\ been\\ deleted\\."
    ):
        _aggregate_stats([], 0)
⏪ Replay Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| `test_pytest_teststest_sparktest_helper_py_teststest_fuguetest_fugue_polars_py_teststest_fuguetest_fugue_p__replay_test_0.py::test_datacompy_fugue__aggregate_stats` | 17.2ms | 17.1ms | 0.942% ✅ |

To edit these changes, run `git checkout codeflash/optimize-_aggregate_stats-mi6jgv60` and push.


@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 19, 2025 21:53
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: Medium Optimization Quality according to Codeflash labels Nov 19, 2025