
@codeflash-ai codeflash-ai bot commented Nov 19, 2025

📄 37% (0.37x) speedup for temp_column_name in datacompy/snowflake.py

⏱️ Runtime : 1.40 milliseconds → 1.03 milliseconds (best of 15 runs)
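The percentages in this report appear to follow the convention (original − optimized) / optimized. That is inferred from the figures shown here, not from Codeflash documentation, and `speedup_pct` is a hypothetical helper name used only for illustration. A minimal sketch of the arithmetic:

```python
def speedup_pct(original: float, optimized: float) -> float:
    """Percent speedup, assuming the (original - optimized) / optimized convention."""
    return (original - optimized) / optimized * 100.0


# Headline runtimes from this report, in milliseconds.
print(f"{speedup_pct(1.40, 1.03):.1f}%")
# Largest per-test gain in this report, in microseconds.
print(f"{speedup_pct(211, 58.1):.1f}%")
```

With the rounded runtimes above, the headline works out to about 36%, which matches the explanation; the 37% in the title presumably comes from unrounded timings.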

📝 Explanation and details

The optimized code achieves a 36% speedup through two key optimizations:

1. Eliminated inefficient list concatenation
The original code used columns = columns + list(dataframe.columns) inside a loop, which builds a new list on every iteration and copies every previously collected name each time. With n total column names spread across many DataFrames, that is O(n²) in the worst case. The optimized version initializes columns = set() and calls columns.update(dataframe.columns), which updates the set in place and is O(n) overall.
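The two collection patterns can be compared side by side. This is a sketch under assumptions: `DummyDF`, `collect_by_concat`, and `collect_by_update` are illustrative names, not code from datacompy, and the concatenated list is converted to a set here only so the two results can be compared.

```python
class DummyDF:
    """Stand-in for a DataFrame: only the `columns` attribute is used."""

    def __init__(self, columns):
        self.columns = list(columns)


def collect_by_concat(*dataframes):
    # Pattern described as the original: each `+` builds a brand-new list
    # and copies everything collected so far.
    columns = []
    for dataframe in dataframes:
        columns = columns + list(dataframe.columns)
    return set(columns)  # converted only for comparison in this sketch


def collect_by_update(*dataframes):
    # Pattern described as the optimized version: one set, grown in place.
    columns = set()
    for dataframe in dataframes:
        columns.update(dataframe.columns)
    return columns


# Same shape as the "100 dataframes, 10 columns each" test case.
dfs = [DummyDF([f"C{i}_{j}" for j in range(10)]) for i in range(100)]
assert collect_by_concat(*dfs) == collect_by_update(*dfs)
```

Both functions yield the same set of names; only the amount of copying differs.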

2. Streamlined loop logic
Removed the unnecessary unique variable and simplified the while loop from a two-step check (if temp_column in columns: ... if unique:) to a single negated condition (if temp_column not in columns: return temp_column). This reduces the number of operations per iteration.
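The loop change described above can be sketched as follows. The function names are hypothetical and the bodies are reconstructed from the description in this comment, so they may differ from the actual datacompy source.

```python
def temp_name_two_step(columns: set) -> str:
    # Shape described as the original: a flag plus two separate checks.
    i = 0
    while True:
        temp_column = f"_TEMP_{i}"
        unique = True
        if temp_column in columns:
            i += 1
            unique = False
        if unique:
            return temp_column


def temp_name_single_check(columns: set) -> str:
    # Shape described as the optimized version: return on the first free name.
    i = 0
    while True:
        temp_column = f"_TEMP_{i}"
        if temp_column not in columns:
            return temp_column
        i += 1


taken = {"A", "_TEMP_0", "_TEMP_1"}
assert temp_name_two_step(taken) == temp_name_single_check(taken) == "_TEMP_2"
```

The two versions return the same name for the same input; the second simply does one comparison per iteration instead of two.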

Why this matters for performance:

  • Set membership testing (in operator) is O(1) on average vs O(n) for lists
  • set.update() is more efficient than repeated list concatenation
  • Fewer conditional checks per loop iteration reduces overhead
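To see the membership-cost difference in isolation, a rough timing sketch with the standard `timeit` module is shown here. The column names and repeat count are arbitrary choices for illustration, and absolute timings will vary by machine.

```python
import timeit

names = [f"COL_{i}" for i in range(1000)]
as_list = names
as_set = set(names)
probe = "_TEMP_0"  # absent, so the list scan has to visit every element

# Time the same membership test against each container.
list_time = timeit.timeit(lambda: probe in as_list, number=10_000)
set_time = timeit.timeit(lambda: probe in as_set, number=10_000)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

The list lookup is a linear scan, while the set lookup is a hash probe, so the gap widens as the number of column names grows.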

Impact on workloads:
Based on the function reference, temp_column_name() is called during DataFrame merging in _dataframe_merge(), specifically when handling duplicate rows. If that path runs often during data comparison, the improvement should reduce merge overhead, most visibly when many DataFrames are involved. The absolute saving per call is on the order of microseconds in the tests below.

Test case insights:
Most test cases improve by roughly 10-25%, with the largest gains (199-263% faster) on large-scale scenarios with many DataFrames, where the O(n²) → O(n) column collection improvement is most pronounced. A few cases run slower: the error-path tests that raise AttributeError or TypeError are 7-12% slower, and one empty-DataFrame test is about 0.5% slower.
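The scaling effect in the many-DataFrame cases can be reproduced with a small benchmark of the collection step alone. This is a sketch with hypothetical helper names and arbitrary sizes, not the harness Codeflash used.

```python
import timeit


def concat_collect(column_lists):
    # Repeated concatenation: every step copies all names gathered so far.
    columns = []
    for cols in column_lists:
        columns = columns + list(cols)
    return columns


def update_collect(column_lists):
    # In-place set update: each name is inserted once.
    columns = set()
    for cols in column_lists:
        columns.update(cols)
    return columns


for n_frames in (10, 100, 1000):
    # Ten columns per simulated DataFrame, as in the large-scale test above.
    column_lists = [[f"C{i}_{j}" for j in range(10)] for i in range(n_frames)]
    t_concat = timeit.timeit(lambda: concat_collect(column_lists), number=5)
    t_update = timeit.timeit(lambda: update_collect(column_lists), number=5)
    print(f"{n_frames:>5} frames  concat {t_concat:.4f}s  update {t_update:.4f}s")
```

As the number of frames grows, the concatenation time should grow faster than linearly while the in-place update stays roughly proportional to the total number of names.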

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 58 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | ✅ 1 Passed |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
import pytest  # used for our unit tests
from datacompy.snowflake import temp_column_name


# Helper class to mimic a DataFrame with a 'columns' attribute
class DummyDF:
    def __init__(self, columns):
        self.columns = columns


# unit tests

# ---- Basic Test Cases ----


def test_no_dataframes_returns_temp_0():
    # No dataframes passed in, should return '_TEMP_0'
    codeflash_output = temp_column_name()  # 1.35μs -> 1.16μs (16.2% faster)


def test_empty_dataframe_returns_temp_0():
    # One empty dataframe, should return '_TEMP_0'
    df = DummyDF([])
    codeflash_output = temp_column_name(df)  # 1.81μs -> 1.52μs (19.4% faster)


def test_single_dataframe_with_no_temp_columns():
    # DataFrame with unrelated columns, should return '_TEMP_0'
    df = DummyDF(["A", "B", "C"])
    codeflash_output = temp_column_name(df)  # 1.88μs -> 1.55μs (21.0% faster)


def test_single_dataframe_with_temp_0_column():
    # DataFrame already has '_TEMP_0', should return '_TEMP_1'
    df = DummyDF(["A", "_TEMP_0", "B"])
    codeflash_output = temp_column_name(df)  # 2.38μs -> 2.13μs (11.7% faster)


def test_multiple_dataframes_with_some_temp_columns():
    # Multiple dataframes, some with temp columns
    df1 = DummyDF(["A", "_TEMP_0"])
    df2 = DummyDF(["B", "C", "_TEMP_1"])
    df3 = DummyDF(["D"])
    # Should skip _TEMP_0 and _TEMP_1, return '_TEMP_2'
    codeflash_output = temp_column_name(
        df1, df2, df3
    )  # 3.40μs -> 2.79μs (22.1% faster)


def test_multiple_dataframes_no_temp_columns():
    # Multiple dataframes, none with temp columns
    df1 = DummyDF(["A"])
    df2 = DummyDF(["B"])
    df3 = DummyDF(["C"])
    codeflash_output = temp_column_name(
        df1, df2, df3
    )  # 2.09μs -> 1.67μs (25.1% faster)


# ---- Edge Test Cases ----


def test_dataframe_with_all_temp_columns_up_to_5():
    # DataFrame has _TEMP_0 to _TEMP_5, should return '_TEMP_6'
    df = DummyDF([f"_TEMP_{i}" for i in range(6)])
    codeflash_output = temp_column_name(df)  # 3.37μs -> 3.15μs (6.88% faster)


def test_dataframe_with_non_sequential_temp_columns():
    # DataFrame has _TEMP_0, _TEMP_2, _TEMP_4, should return '_TEMP_1'
    df = DummyDF(["A", "_TEMP_0", "_TEMP_2", "_TEMP_4"])
    codeflash_output = temp_column_name(df)  # 2.44μs -> 2.23μs (9.41% faster)


def test_dataframe_with_similar_but_not_exact_temp_column_names():
    # DataFrame has '_TEMP_00', '_TEMP_01', should still return '_TEMP_0'
    df = DummyDF(["_TEMP_00", "_TEMP_01"])
    codeflash_output = temp_column_name(df)  # 1.80μs -> 1.55μs (16.0% faster)


def test_dataframe_with_case_sensitive_column_names():
    # DataFrame has '_temp_0' (lowercase), should return '_TEMP_0'
    df = DummyDF(["_temp_0"])
    codeflash_output = temp_column_name(df)  # 1.72μs -> 1.33μs (29.5% faster)


def test_dataframe_with_columns_named_temp_without_underscore():
    # DataFrame has 'TEMP_0', should return '_TEMP_0'
    df = DummyDF(["TEMP_0"])
    codeflash_output = temp_column_name(df)  # 1.62μs -> 1.40μs (15.7% faster)


def test_dataframe_with_negative_temp_column():
    # DataFrame has '_TEMP_-1', should return '_TEMP_0'
    df = DummyDF(["_TEMP_-1"])
    codeflash_output = temp_column_name(df)  # 1.61μs -> 1.48μs (9.08% faster)


def test_dataframe_with_large_temp_column_number():
    # DataFrame has '_TEMP_0', '_TEMP_1', ..., '_TEMP_999'
    df = DummyDF([f"_TEMP_{i}" for i in range(1000)])
    codeflash_output = temp_column_name(df)  # 162μs -> 143μs (13.0% faster)


def test_dataframes_with_overlapping_columns():
    # DataFrames share some columns, but only one has a temp column
    df1 = DummyDF(["A", "B", "_TEMP_0"])
    df2 = DummyDF(["B", "C"])
    codeflash_output = temp_column_name(df1, df2)  # 2.85μs -> 2.29μs (24.5% faster)


def test_dataframe_with_non_string_columns():
    # DataFrame has integer column names, should still return '_TEMP_0'
    df = DummyDF([1, 2, 3])
    codeflash_output = temp_column_name(df)  # 1.84μs -> 1.63μs (12.5% faster)


def test_dataframe_with_mixed_type_columns():
    # DataFrame has string and int columns plus '_TEMP_0', so should return '_TEMP_1'
    df = DummyDF(["A", 2, "_TEMP_0"])
    codeflash_output = temp_column_name(df)  # 2.26μs -> 1.97μs (14.4% faster)


def test_dataframe_with_none_column():
    # DataFrame has None as a column name, should still return '_TEMP_0'
    df = DummyDF([None])
    codeflash_output = temp_column_name(df)  # 1.76μs -> 1.45μs (21.5% faster)


def test_dataframe_with_empty_string_column():
    # DataFrame has empty string as a column name, should still return '_TEMP_0'
    df = DummyDF([""])
    codeflash_output = temp_column_name(df)  # 1.67μs -> 1.47μs (13.5% faster)


def test_dataframe_with_temp_column_as_substring():
    # DataFrame has column '_TEMP_01_extra', should return '_TEMP_0'
    df = DummyDF(["_TEMP_01_extra"])
    codeflash_output = temp_column_name(df)  # 1.62μs -> 1.35μs (19.5% faster)


# ---- Large Scale Test Cases ----


def test_many_dataframes_with_no_temp_columns():
    # 100 dataframes, each with 10 columns, none are temp columns
    dfs = [DummyDF([f"C{i}_{j}" for j in range(10)]) for i in range(100)]
    codeflash_output = temp_column_name(*dfs)  # 144μs -> 48.3μs (199% faster)


def test_many_dataframes_with_sequential_temp_columns():
    # 10 dataframes, each with _TEMP_0 to _TEMP_99, should return '_TEMP_100'
    dfs = [DummyDF([f"_TEMP_{j}" for j in range(100)]) for i in range(10)]
    codeflash_output = temp_column_name(*dfs)  # 53.2μs -> 38.9μs (36.6% faster)


def test_many_dataframes_with_sparse_temp_columns():
    # 50 dataframes, each has _TEMP_0, _TEMP_2, ..., up to _TEMP_98
    dfs = [DummyDF([f"_TEMP_{j}" for j in range(0, 100, 2)]) for i in range(50)]
    # Should return '_TEMP_1' (since _TEMP_0 is present, _TEMP_1 is missing)
    codeflash_output = temp_column_name(*dfs)  # 211μs -> 58.1μs (263% faster)


def test_large_dataframe_with_all_possible_temp_columns():
    # One dataframe with _TEMP_0 to _TEMP_999, should return '_TEMP_1000'
    df = DummyDF([f"_TEMP_{i}" for i in range(1000)])
    codeflash_output = temp_column_name(df)  # 161μs -> 145μs (10.9% faster)


def test_large_dataframe_with_non_temp_columns():
    # One dataframe with 1000 non-temp columns
    df = DummyDF([f"COL_{i}" for i in range(1000)])
    codeflash_output = temp_column_name(df)  # 44.0μs -> 39.1μs (12.5% faster)


def test_large_dataframe_with_mixed_columns():
    # DataFrame with 500 temp columns and 500 non-temp columns
    df = DummyDF([f"_TEMP_{i}" for i in range(500)] + [f"COL_{i}" for i in range(500)])
    codeflash_output = temp_column_name(df)  # 100μs -> 89.8μs (11.5% faster)


# ---- Determinism and Uniqueness ----


def test_determinism_multiple_calls():
    # Multiple calls with same input should return same result
    df = DummyDF(["A", "_TEMP_0"])
    codeflash_output = temp_column_name(df)
    result1 = codeflash_output  # 2.45μs -> 2.09μs (16.9% faster)
    codeflash_output = temp_column_name(df)
    result2 = codeflash_output  # 1.03μs -> 856ns (20.6% faster)
    assert result1 == result2


def test_uniqueness_with_added_column():
    # If we add the returned temp column to the dataframe, next call should increment
    df = DummyDF(["A", "_TEMP_0"])
    codeflash_output = temp_column_name(df)
    first_temp = codeflash_output  # 2.00μs -> 1.74μs (14.9% faster)
    df.columns.append(first_temp)
    codeflash_output = temp_column_name(df)
    second_temp = codeflash_output  # 1.52μs -> 1.25μs (22.1% faster)
    assert second_temp != first_temp


# ---- Error Handling ----


def test_dataframe_with_no_columns_attribute():
    # Object without 'columns' attribute should raise AttributeError
    class NoColumns:
        pass

    obj = NoColumns()
    with pytest.raises(AttributeError):
        temp_column_name(obj)  # 1.85μs -> 2.00μs (7.45% slower)


def test_dataframe_with_columns_as_none():
    # DataFrame with columns set to None should raise TypeError
    class NoneColumns:
        def __init__(self):
            self.columns = None

    obj = NoneColumns()
    with pytest.raises(TypeError):
        temp_column_name(obj)  # 1.76μs -> 2.00μs (12.0% slower)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
from datacompy.snowflake import temp_column_name


# Helper: Minimal DataFrame-like mock class
class DummyDF:
    def __init__(self, columns):
        # Accepts any iterable of column names
        self.columns = list(columns)


#########################################
# Basic Test Cases
#########################################


def test_no_dataframes_returns_temp_0():
    # No dataframes passed, should return '_TEMP_0'
    codeflash_output = temp_column_name()  # 1.95μs -> 1.80μs (8.50% faster)


def test_one_empty_dataframe_returns_temp_0():
    # One dataframe, no columns
    df = DummyDF([])
    codeflash_output = temp_column_name(df)  # 1.93μs -> 1.94μs (0.515% slower)


def test_one_dataframe_without_temp_columns():
    # DataFrame with unrelated columns
    df = DummyDF(["a", "b", "c"])
    codeflash_output = temp_column_name(df)  # 2.10μs -> 1.84μs (14.0% faster)


def test_one_dataframe_with_temp_0_taken():
    # DataFrame with '_TEMP_0' already present
    df = DummyDF(["_TEMP_0", "foo"])
    codeflash_output = temp_column_name(df)  # 2.53μs -> 2.15μs (17.9% faster)


def test_one_dataframe_with_multiple_temp_taken():
    # DataFrame with '_TEMP_0', '_TEMP_1', '_TEMP_2' taken
    df = DummyDF(["_TEMP_0", "_TEMP_1", "_TEMP_2"])
    codeflash_output = temp_column_name(df)  # 2.76μs -> 2.31μs (19.2% faster)


def test_two_dataframes_with_overlap():
    # Two dataframes, both with some _TEMP columns
    df1 = DummyDF(["foo", "_TEMP_0"])
    df2 = DummyDF(["bar", "_TEMP_1"])
    # '_TEMP_0' and '_TEMP_1' taken, next is '_TEMP_2'
    codeflash_output = temp_column_name(df1, df2)  # 3.04μs -> 2.25μs (34.9% faster)


def test_multiple_dataframes_with_varied_columns():
    df1 = DummyDF(["a", "b"])
    df2 = DummyDF(["_TEMP_0", "c"])
    df3 = DummyDF(["d", "_TEMP_1"])
    # '_TEMP_0' and '_TEMP_1' are taken, so '_TEMP_2'
    codeflash_output = temp_column_name(
        df1, df2, df3
    )  # 3.21μs -> 2.71μs (18.7% faster)


#########################################
# Edge Test Cases
#########################################


def test_column_names_case_sensitive():
    # Should be case-sensitive: '_TEMP_0' vs '_temp_0'
    df = DummyDF(["_temp_0", "_TEMP_1"])
    # Only '_TEMP_1' is taken, so '_TEMP_0' is available
    codeflash_output = temp_column_name(df)  # 1.87μs -> 1.53μs (22.4% faster)


def test_non_string_column_names():
    # DataFrame with non-string column names mixed in (they never equal a '_TEMP_n' string)
    df = DummyDF([1, 2, "_TEMP_0"])
    # Only '_TEMP_0' is taken, so '_TEMP_1'
    codeflash_output = temp_column_name(df)  # 2.25μs -> 1.95μs (15.5% faster)


def test_column_names_with_similar_prefix():
    # Columns like '_TEMP_', '_TEMP_00', etc. should not block '_TEMP_0'
    df = DummyDF(["_TEMP_", "_TEMP_00", "_TEMP_000"])
    codeflash_output = temp_column_name(df)  # 1.79μs -> 1.51μs (18.5% faster)


def test_large_gap_in_taken_temp_columns():
    # Only '_TEMP_0' and '_TEMP_100' are taken; should return '_TEMP_1'
    df = DummyDF(["_TEMP_0", "_TEMP_100"])
    codeflash_output = temp_column_name(df)  # 2.20μs -> 1.87μs (17.9% faster)


def test_all_possible_temp_columns_taken_up_to_10():
    # '_TEMP_0' through '_TEMP_10' are taken
    df = DummyDF([f"_TEMP_{i}" for i in range(11)])
    codeflash_output = temp_column_name(df)  # 4.42μs -> 4.11μs (7.39% faster)


def test_column_names_are_not_strings():
    # Columns are integers and floats, but none match '_TEMP_0'
    df = DummyDF([1, 2.5, 3])
    codeflash_output = temp_column_name(df)  # 2.42μs -> 1.99μs (21.9% faster)


def test_dataframe_with_duplicate_column_names():
    # Duplicate column names in a DataFrame
    df = DummyDF(["a", "a", "_TEMP_0", "_TEMP_0"])
    codeflash_output = temp_column_name(df)  # 2.54μs -> 2.08μs (21.6% faster)


def test_dataframe_with_none_column_names():
    # None as a column name (should not interfere)
    df = DummyDF([None, "_TEMP_0"])
    codeflash_output = temp_column_name(df)  # 2.22μs -> 1.78μs (24.3% faster)


def test_dataframe_with_numeric_string_column_names():
    # Columns named '0', '1', etc.
    df = DummyDF(["0", "1", "2"])
    codeflash_output = temp_column_name(df)  # 1.88μs -> 1.60μs (17.4% faster)


def test_dataframe_with_negative_temp_columns():
    # '_TEMP_-1' and '_TEMP_0' present
    df = DummyDF(["_TEMP_-1", "_TEMP_0"])
    codeflash_output = temp_column_name(df)  # 2.40μs -> 1.91μs (25.5% faster)


#########################################
# Large Scale Test Cases
#########################################


def test_large_number_of_columns():
    # DataFrame with 1000 columns, none are temp columns
    cols = [f"col{i}" for i in range(1000)]
    df = DummyDF(cols)
    codeflash_output = temp_column_name(df)  # 43.6μs -> 39.6μs (10.3% faster)


def test_large_number_of_temp_columns_taken():
    # DataFrame with '_TEMP_0' through '_TEMP_999' taken
    cols = [f"_TEMP_{i}" for i in range(1000)]
    df = DummyDF(cols)
    codeflash_output = temp_column_name(df)  # 160μs -> 143μs (11.7% faster)


def test_many_dataframes_with_disjoint_temp_columns():
    # 10 dataframes, each with 100 unique _TEMP columns
    dfs = [DummyDF([f"_TEMP_{i * 100 + j}" for j in range(100)]) for i in range(10)]
    # All _TEMP_0 through _TEMP_999 are taken
    codeflash_output = temp_column_name(*dfs)  # 169μs -> 144μs (17.1% faster)


def test_performance_with_large_non_temp_columns():
    # DataFrame with 1000 non-temp columns, and '_TEMP_0' taken
    cols = [f"col{i}" for i in range(1000)] + ["_TEMP_0"]
    df = DummyDF(cols)
    codeflash_output = temp_column_name(df)  # 42.7μs -> 38.3μs (11.3% faster)


def test_large_sparse_temp_columns():
    # DataFrame with only every 10th _TEMP_ column taken up to 990
    cols = [f"_TEMP_{i}" for i in range(0, 1000, 10)]
    df = DummyDF(cols)
    # '_TEMP_0' is taken, so '_TEMP_1' is available
    codeflash_output = temp_column_name(df)  # 8.86μs -> 7.98μs (11.0% faster)


#########################################
# Miscellaneous/Robustness
#########################################


def test_works_with_iterable_columns():
    # DataFrame with columns as a tuple
    df = DummyDF(("a", "_TEMP_0"))
    codeflash_output = temp_column_name(df)  # 2.19μs -> 1.97μs (11.3% faster)


def test_works_with_set_columns():
    # DataFrame with columns as a set
    df = DummyDF({"a", "_TEMP_0"})
    codeflash_output = temp_column_name(df)  # 2.05μs -> 1.90μs (8.05% faster)


def test_works_with_generator_columns():
    # DataFrame with columns as a generator
    df = DummyDF(f"col{i}" for i in range(10))
    codeflash_output = temp_column_name(df)  # 2.43μs -> 2.14μs (13.7% faster)


def test_non_dataframe_object_raises_attribute_error():
    # Passing an object without 'columns' attribute should raise AttributeError
    class NoColumns:
        pass

    obj = NoColumns()
    with pytest.raises(AttributeError):
        temp_column_name(obj)  # 1.91μs -> 2.06μs (7.27% slower)


def test_dataframe_with_columns_property():
    # DataFrame with columns as a property
    class DFWithProperty:
        @property
        def columns(self):
            return ["a", "_TEMP_0"]

    df = DFWithProperty()
    codeflash_output = temp_column_name(df)  # 2.93μs -> 2.55μs (14.8% faster)


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
import pytest
from datacompy.snowflake import temp_column_name


def test_temp_column_name():
    temp_column_name()


def test_temp_column_name_2():
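    # Outside the concolic run, 0 is a plain int rather than a SymbolicBoundedInt,
    # so match only the stable part of the error message.
    with pytest.raises(
        AttributeError,
        match="object has no attribute 'columns'",
    ):
        temp_column_name(0)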
🔎 Concolic Coverage Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup |
|---|---|---|---|
| codeflash_concolic_8h8xtkx8/tmpmch8wy6g/test_concolic_coverage.py::test_temp_column_name | 1.99μs | 1.71μs | 16.8%✅ |

To edit these changes, git checkout codeflash/optimize-temp_column_name-mi6kvq29 and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 November 19, 2025 22:32
@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Nov 19, 2025