
Conversation

@codeflash-ai
Contributor

@codeflash-ai codeflash-ai bot commented Jun 5, 2025

⚡️ This pull request contains optimizations for PR #275

If you approve this dependent PR, these changes will be merged into the original PR branch `dont-optimize-repeatedly-gh-actions`.

This PR will be automatically closed if the original PR is merged.


📄 15% (0.15x) speedup for `FunctionToOptimize.get_code_context_hash` in `codeflash/discovery/functions_to_optimize.py`

⏱️ Runtime: 3.67 milliseconds → 3.20 milliseconds (best of 72 runs)

📝 Explanation and details

Here is an optimized version of your code, targeting the areas highlighted as slowest in your line profiling.

### Key Optimizations

1. **Read only the necessary lines:**
   - When `starting_line` and `ending_line` are provided, read only the lines needed instead of reading the entire file and calling `.splitlines()`. This drastically lowers memory use and speeds up file operations for large files.
   - Uses `itertools.islice` to efficiently pluck only the relevant lines.

2. **Reduced string manipulation:**
   - Cuts the number of intermediate string allocations by reusing objects where possible and joining lines only once.
   - Applies `strip()` only where it is actually needed (the function's code content).

3. **Fewer variable lookups:**
   - Hoists attribute lookups out of loops.

The function semantics are preserved exactly. All comments are retained, and those attached to changed code were improved for clarity.
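The line-slicing idea above can be sketched as follows. This is a hedged illustration only: the function name and signature are assumptions for this example, not the actual codeflash implementation.

```python
from itertools import islice
from pathlib import Path


def read_function_lines(path: Path, starting_line: int, ending_line: int) -> str:
    """Read only lines starting_line..ending_line (1-based, inclusive) of a file.

    islice skips the first starting_line - 1 lines of the file iterator, then
    yields lines through ending_line, so the rest of the file is never read
    into memory.
    """
    with open(path, encoding="utf-8") as f:
        return "".join(islice(f, starting_line - 1, ending_line))
```

The lines are joined with `""` because each line yielded by the file iterator keeps its trailing newline.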

### Rationale

- The main bottleneck was reading and splitting entire files when only a small region is needed. By slicing only the relevant lines from the file, the function becomes much faster for large files or high call counts.
- All behaviors, including the missing-file fallback and the hash calculation, are unchanged.
- The import of `islice` is local and lightweight.

**This should significantly improve both the runtime and the memory usage of `get_code_context_hash`.**
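For reference, the hash computation itself can be reconstructed from the regression tests below: three parts (file name, qualified function name, stripped function source) joined by a `'\n---\n'` separator and SHA-256 hashed. This sketch mirrors what the tests check, not necessarily the verbatim implementation.

```python
import hashlib


def code_context_hash(file_name: str, qualified_name: str, function_content: str) -> str:
    # Join the three context parts with a '\n---\n' separator, as the
    # generated regression tests below do, then hash with SHA-256.
    context_string = "\n---\n".join([file_name, qualified_name, function_content.strip()])
    return hashlib.sha256(context_string.encode("utf-8")).hexdigest()
```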

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 135 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests Details
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from pathlib import Path

# imports
import pytest  # used for our unit tests
from codeflash.discovery.functions_to_optimize import FunctionToOptimize


# Minimal FunctionParent mock for testing
@dataclass(frozen=True)
class FunctionParent:
    name: str

# Minimal logger mock for testing (since the original is not available)
class DummyLogger:
    def warning(self, msg):
        pass

logger = DummyLogger()

# unit tests

@pytest.fixture
def temp_code_file(tmp_path):
    """Fixture to create and clean up a temporary code file."""
    def _make_file(content: str) -> Path:
        file_path = tmp_path / "testfile.py"
        file_path.write_text(content, encoding="utf-8")
        return file_path
    return _make_file

# 1. BASIC TEST CASES

def test_hash_changes_with_code_content(temp_code_file):
    """Test that the hash changes when the function code changes."""
    code1 = "def foo():\n    return 1\n"
    code2 = "def foo():\n    return 2\n"
    file1 = temp_code_file(code1)
    file2 = temp_code_file(code2)
    fto1 = FunctionToOptimize("foo", file1, [], 1, 2)
    fto2 = FunctionToOptimize("foo", file2, [], 1, 2)
    codeflash_output = fto1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = fto2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2

def test_hash_same_for_identical_code_and_context(temp_code_file):
    """Test that the hash is the same for identical code and context."""
    code = "def bar():\n    return 42\n"
    file1 = temp_code_file(code)
    file2 = temp_code_file(code)
    fto1 = FunctionToOptimize("bar", file1, [], 1, 2)
    fto2 = FunctionToOptimize("bar", file2, [], 1, 2)
    codeflash_output = fto1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = fto2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 == hash2

def test_hash_changes_with_function_name(temp_code_file):
    """Test that the hash changes if the function name changes."""
    code = "def baz():\n    return 0\n"
    file = temp_code_file(code)
    fto1 = FunctionToOptimize("baz", file, [], 1, 2)
    fto2 = FunctionToOptimize("qux", file, [], 1, 2)
    codeflash_output = fto1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = fto2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2


def test_hash_changes_with_file_name(temp_code_file, tmp_path):
    """Test that the hash changes if the file name changes (even with same content)."""
    code = "def foo():\n    return 1\n"
    file1 = tmp_path / "file1.py"
    file2 = tmp_path / "file2.py"
    file1.write_text(code, encoding="utf-8")
    file2.write_text(code, encoding="utf-8")
    fto1 = FunctionToOptimize("foo", file1, [], 1, 2)
    fto2 = FunctionToOptimize("foo", file2, [], 1, 2)
    codeflash_output = fto1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = fto2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2

def test_hash_entire_file_if_no_line_numbers(temp_code_file):
    """Test that the entire file content is used if no line numbers are provided."""
    code = "def foo():\n    return 1\n\ndef bar():\n    return 2\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [])
    # Should use the whole file content
    codeflash_output = fto.get_code_context_hash(); hash_full = codeflash_output
    # Now, use only the first two lines
    fto_partial = FunctionToOptimize("foo", file, [], 1, 2)
    codeflash_output = fto_partial.get_code_context_hash(); hash_partial = codeflash_output
    assert hash_full != hash_partial

# 2. EDGE TEST CASES

def test_hash_with_empty_file(tmp_path):
    """Test hash generation for an empty file."""
    file = tmp_path / "empty.py"
    file.write_text("", encoding="utf-8")
    fto = FunctionToOptimize("foo", file, [], 1, 1)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    assert isinstance(hash_val, str) and len(hash_val) == 64


def test_hash_with_unicode_content(temp_code_file):
    """Test hash with Unicode characters in the function code."""
    code = "def foo():\n    return '你好, мир, hello'\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [], 1, 2)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    assert isinstance(hash_val, str) and len(hash_val) == 64

def test_hash_with_large_function(temp_code_file):
    """Test hash with a function that spans many lines."""
    code = "def foo():\n" + "\n".join([f"    x{i} = {i}" for i in range(100)]) + "\n    return x99\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [], 1, 102)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    assert isinstance(hash_val, str) and len(hash_val) == 64


def test_hash_with_starting_line_greater_than_ending_line(temp_code_file):
    """Test that an empty string is used if starting_line > ending_line."""
    code = "def foo():\n    return 1\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [], 2, 1)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    # The function_content should be empty, so the hash should be of the context with empty function content
    context_parts = [
        file.name,
        fto.qualified_name,
        ""
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash_val == expected_hash

def test_hash_with_starting_line_out_of_bounds(temp_code_file):
    """Test that out-of-bounds line numbers result in empty function content."""
    code = "def foo():\n    return 1\n"
    file = temp_code_file(code)
    # starting_line is beyond file length
    fto = FunctionToOptimize("foo", file, [], 100, 200)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    context_parts = [
        file.name,
        fto.qualified_name,
        ""
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash_val == expected_hash

def test_hash_with_only_starting_line(temp_code_file):
    """Test that if only starting_line is provided, the whole file is used."""
    code = "def foo():\n    return 1\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [], 1, None)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    # Should use the whole file content, not just a single line
    context_parts = [
        file.name,
        fto.qualified_name,
        code.strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash_val == expected_hash

def test_hash_with_only_ending_line(temp_code_file):
    """Test that if only ending_line is provided, the whole file is used."""
    code = "def foo():\n    return 1\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [], None, 2)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    # Should use the whole file content, not just up to ending_line
    context_parts = [
        file.name,
        fto.qualified_name,
        code.strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash_val == expected_hash

# 3. LARGE SCALE TEST CASES

def test_hash_with_large_file_and_small_function(tmp_path):
    """Test that the hash is computed efficiently even for a large file."""
    # Create a file with 1000 lines
    lines = [f"line {i}" for i in range(1, 1001)]
    code = "\n".join(lines)
    file = tmp_path / "largefile.py"
    file.write_text(code, encoding="utf-8")
    # Function is just lines 500-510
    fto = FunctionToOptimize("foo", file, [], 500, 510)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    # The function content should be lines 499-509 (0-indexed)
    function_content = "\n".join(lines[499:510])
    context_parts = [
        file.name,
        fto.qualified_name,
        function_content.strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash_val == expected_hash


def test_hash_performance_on_multiple_calls(tmp_path):
    """Test that repeated calls are deterministic and performant."""
    code = "\n".join([f"def foo{i}():\n    return {i}" for i in range(100)])
    file = tmp_path / "multi.py"
    file.write_text(code, encoding="utf-8")
    fto = FunctionToOptimize("foo99", file, [], 199, 200)
    codeflash_output = fto.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = fto.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 == hash2



from __future__ import annotations

import hashlib

# imports
import pytest  # used for our unit tests
from codeflash.discovery.functions_to_optimize import FunctionToOptimize


# Helper class for parent scope simulation
class DummyParent:
    def __init__(self, name):
        self.name = name

# --------------------------
# UNIT TESTS START HERE
# --------------------------

# Basic Test Cases

def test_basic_hash_different_code(tmp_path):
    # Test that different code contents produce different hashes
    file1 = tmp_path / "a.py"
    file2 = tmp_path / "b.py"
    file1.write_text("def foo():\n    return 1\n")
    file2.write_text("def foo():\n    return 2\n")
    func1 = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    func2 = FunctionToOptimize(
        function_name="foo",
        file_path=file2,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    codeflash_output = func1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = func2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2

def test_basic_hash_same_code_same_hash(tmp_path):
    # Test that identical code and context produce the same hash
    file1 = tmp_path / "c.py"
    file1.write_text("def bar():\n    return 3\n")
    func1 = FunctionToOptimize(
        function_name="bar",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    func2 = FunctionToOptimize(
        function_name="bar",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    codeflash_output = func1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = func2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 == hash2

def test_basic_hash_different_function_names(tmp_path):
    # Test that different function names produce different hashes
    file1 = tmp_path / "d.py"
    file1.write_text("def baz():\n    return 4\n")
    func1 = FunctionToOptimize(
        function_name="baz",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    func2 = FunctionToOptimize(
        function_name="qux",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    codeflash_output = func1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = func2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2


def test_basic_hash_file_path_affects_hash(tmp_path):
    # Test that file path affects the hash even if code and names are the same
    file1 = tmp_path / "f1.py"
    file2 = tmp_path / "f2.py"
    code = "def foo():\n    return 42\n"
    file1.write_text(code)
    file2.write_text(code)
    func1 = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    func2 = FunctionToOptimize(
        function_name="foo",
        file_path=file2,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    codeflash_output = func1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = func2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2

# Edge Test Cases

def test_edge_no_starting_or_ending_line(tmp_path):
    # Test fallback to using the entire file content when line numbers are not given
    file1 = tmp_path / "g.py"
    code = "def foo():\n    return 'hello'\n"
    file1.write_text(code)
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=None,
        ending_line=None
    )
    codeflash_output = func.get_code_context_hash(); hash1 = codeflash_output
    # The hash should be the same as if we explicitly used the entire file content
    func2 = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    # However, this will only be true if the function covers the whole file
    # So we check that the hash is equal to the hash of the whole file content
    context_parts = [
        file1.name,
        "foo",
        code.strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash1 == expected_hash

def test_edge_file_not_found(monkeypatch, tmp_path):
    # Test fallback hash generation when file is missing
    file1 = tmp_path / "missing.py"
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    # Patch logger to avoid real logging in test
    class DummyLogger:
        def warning(self, msg): pass
    monkeypatch.setattr("codeflash.cli_cmds.console.logger", DummyLogger())
    codeflash_output = func.get_code_context_hash(); hash1 = codeflash_output
    # The fallback hash should be the hash of "missing.py:foo"
    fallback_string = f"{file1.name}:foo"
    expected_hash = hashlib.sha256(fallback_string.encode('utf-8')).hexdigest()
    assert hash1 == expected_hash

def test_edge_empty_file(tmp_path):
    # Test behavior with an empty file
    file1 = tmp_path / "empty.py"
    file1.write_text("")
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=1
    )
    codeflash_output = func.get_code_context_hash(); hash1 = codeflash_output
    # The hash should be based on filename, qualified name, and empty content
    context_parts = [
        file1.name,
        "foo",
        ""
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash1 == expected_hash

def test_edge_starting_line_greater_than_ending_line(tmp_path):
    # Test with starting_line > ending_line (should produce empty content)
    file1 = tmp_path / "weird.py"
    file1.write_text("def foo():\n    return 1\n    return 2\n")
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=3,
        ending_line=2
    )
    codeflash_output = func.get_code_context_hash(); hash1 = codeflash_output
    # The content will be empty string
    context_parts = [
        file1.name,
        "foo",
        ""
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash1 == expected_hash

def test_edge_non_ascii_characters(tmp_path):
    # Test handling of non-ASCII (unicode) characters in code
    file1 = tmp_path / "unicode.py"
    code = "def foo():\n    return '你好, мир, hello'\n"
    file1.write_text(code)
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    codeflash_output = func.get_code_context_hash(); hash1 = codeflash_output
    context_string = '\n---\n'.join([file1.name, "foo", code.strip()])
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash1 == expected_hash


def test_large_scale_many_functions(tmp_path):
    # Test that many different functions produce unique hashes
    file1 = tmp_path / "large.py"
    # Create a file with 100 functions, each returning its index
    code = ""
    for i in range(100):
        code += f"def func_{i}():\n    return {i}\n"
    file1.write_text(code)
    hashes = set()
    lines = code.splitlines()
    for i in range(100):
        func = FunctionToOptimize(
            function_name=f"func_{i}",
            file_path=file1,
            parents=[],
            starting_line=2*i+1,
            ending_line=2*i+2
        )
        codeflash_output = func.get_code_context_hash(); h = codeflash_output
        hashes.add(h)
    assert len(hashes) == 100  # every function should hash uniquely

def test_large_scale_long_function(tmp_path):
    # Test a function with a very large body
    file1 = tmp_path / "longfunc.py"
    # Function with 1000 lines
    code = "def big():\n"
    code += "\n".join([f"    x{i} = {i}" for i in range(1000)])
    code += "\n    return x999\n"
    file1.write_text(code)
    func = FunctionToOptimize(
        function_name="big",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=1002
    )
    codeflash_output = func.get_code_context_hash(); h = codeflash_output
    # The hash should be deterministic and not crash
    context_parts = [
        file1.name,
        "big",
        "\n".join(code.splitlines()[0:1002]).strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert h == expected_hash


def test_large_scale_file_with_many_lines(tmp_path):
    # Test with a file with 1000 lines, extracting a function from the middle
    file1 = tmp_path / "bigfile.py"
    code = "\n".join([f"# line {i}" for i in range(500)])
    code += "\ndef foo():\n    return 123\n"
    code += "\n".join([f"# line {i}" for i in range(500, 1000)])
    file1.write_text(code)
    # Function is at line 501 and 502 (1-based)
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=501,
        ending_line=502
    )
    codeflash_output = func.get_code_context_hash(); h = codeflash_output
    lines = file1.read_text().splitlines()
    function_content = '\n'.join(lines[500:502])
    context_parts = [
        file1.name,
        "foo",
        function_content.strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert h == expected_hash
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-pr275-2025-06-05T20.41.59` and push.

Codeflash

…in PR #275 (`dont-optimize-repeatedly-gh-actions`)

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jun 5, 2025
@misrasaurabh1
Contributor

looks good in concept

@codeflash-ai codeflash-ai bot closed this Jun 9, 2025
@codeflash-ai
Contributor Author

codeflash-ai bot commented Jun 9, 2025

This PR has been automatically closed because the original PR #275 by dasarchan was closed.

@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr275-2025-06-05T20.41.59 branch June 9, 2025 06:27