
Conversation

@codeflash-ai
Contributor

@codeflash-ai codeflash-ai bot commented Jun 5, 2025

⚡️ This pull request contains optimizations for PR #275

If you approve this dependent PR, these changes will be merged into the original PR branch `dont-optimize-repeatedly-gh-actions`.

This PR will be automatically closed if the original PR is merged.


📄 15% (0.15x) speedup for `FunctionToOptimize.get_code_context_hash` in `codeflash/discovery/functions_to_optimize.py`

⏱️ Runtime: 3.67 milliseconds → 3.20 milliseconds (best of 72 runs)

📝 Explanation and details

Here is an optimized version of your code, targeting the areas highlighted as slowest in your line profiling.

### Key Optimizations

1. **Read only the necessary lines:**
   - When `starting_line` and `ending_line` are provided, read only the lines needed instead of reading the entire file and calling `.splitlines()`. This drastically lowers memory use and speeds up file operations for large files.
   - Uses `itertools.islice` to efficiently pluck only the relevant lines.

2. **Reduced string manipulation:**
   - Cuts the number of intermediate string allocations by reusing objects where possible and joining lines only once.
   - Applies `strip()` only where it is actually needed (the function's code content).

3. **Fewer variable lookups:**
   - Hoists attribute lookups out of loops.

The function semantics are preserved exactly. All comments are retained, and those attached to changed code were improved for clarity.
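The line-slicing idea above can be sketched as follows. This is a hedged illustration only: the function name and signature are assumptions for this example, not the actual codeflash implementation.

```python
from itertools import islice
from pathlib import Path


def read_function_lines(path: Path, starting_line: int, ending_line: int) -> str:
    """Read only lines starting_line..ending_line (1-based, inclusive) of a file.

    islice skips the first starting_line - 1 lines of the file iterator, then
    yields lines through ending_line, so the rest of the file is never read
    into memory.
    """
    with open(path, encoding="utf-8") as f:
        return "".join(islice(f, starting_line - 1, ending_line))
```

The lines are joined with `""` because each line yielded by the file iterator keeps its trailing newline.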

### Rationale

- The main bottleneck was reading and splitting entire files when only a small region is needed. By slicing only the relevant lines from the file, the function becomes much faster for large files or high call counts.
- All behaviors, including the missing-file fallback and the hash calculation, are unchanged.
- The import of `islice` is local and lightweight.

**This should significantly improve both the runtime and the memory usage of `get_code_context_hash`.**
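For reference, the hash computation itself can be reconstructed from the regression tests below: three parts (file name, qualified function name, stripped function source) joined by a `'\n---\n'` separator and SHA-256 hashed. This sketch mirrors what the tests check, not necessarily the verbatim implementation.

```python
import hashlib


def code_context_hash(file_name: str, qualified_name: str, function_content: str) -> str:
    # Join the three context parts with a '\n---\n' separator, as the
    # generated regression tests below do, then hash with SHA-256.
    context_string = "\n---\n".join([file_name, qualified_name, function_content.strip()])
    return hashlib.sha256(context_string.encode("utf-8")).hexdigest()
```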

Correctness verification report:

| Test | Status |
|------|--------|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 135 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests Details
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from pathlib import Path

# imports
import pytest  # used for our unit tests
from codeflash.discovery.functions_to_optimize import FunctionToOptimize


# Minimal FunctionParent mock for testing
@dataclass(frozen=True)
class FunctionParent:
    name: str

# Minimal logger mock for testing (since the original is not available)
class DummyLogger:
    def warning(self, msg):
        pass

logger = DummyLogger()

# unit tests

@pytest.fixture
def temp_code_file(tmp_path):
    """Fixture to create and clean up a temporary code file."""
    def _make_file(content: str) -> Path:
        file_path = tmp_path / "testfile.py"
        file_path.write_text(content, encoding="utf-8")
        return file_path
    return _make_file

# 1. BASIC TEST CASES

def test_hash_changes_with_code_content(temp_code_file):
    """Test that the hash changes when the function code changes."""
    code1 = "def foo():\n    return 1\n"
    code2 = "def foo():\n    return 2\n"
    file1 = temp_code_file(code1)
    file2 = temp_code_file(code2)
    fto1 = FunctionToOptimize("foo", file1, [], 1, 2)
    fto2 = FunctionToOptimize("foo", file2, [], 1, 2)
    codeflash_output = fto1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = fto2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2

def test_hash_same_for_identical_code_and_context(temp_code_file):
    """Test that the hash is the same for identical code and context."""
    code = "def bar():\n    return 42\n"
    file1 = temp_code_file(code)
    file2 = temp_code_file(code)
    fto1 = FunctionToOptimize("bar", file1, [], 1, 2)
    fto2 = FunctionToOptimize("bar", file2, [], 1, 2)
    codeflash_output = fto1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = fto2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 == hash2

def test_hash_changes_with_function_name(temp_code_file):
    """Test that the hash changes if the function name changes."""
    code = "def baz():\n    return 0\n"
    file = temp_code_file(code)
    fto1 = FunctionToOptimize("baz", file, [], 1, 2)
    fto2 = FunctionToOptimize("qux", file, [], 1, 2)
    codeflash_output = fto1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = fto2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2


def test_hash_changes_with_file_name(temp_code_file, tmp_path):
    """Test that the hash changes if the file name changes (even with same content)."""
    code = "def foo():\n    return 1\n"
    file1 = tmp_path / "file1.py"
    file2 = tmp_path / "file2.py"
    file1.write_text(code, encoding="utf-8")
    file2.write_text(code, encoding="utf-8")
    fto1 = FunctionToOptimize("foo", file1, [], 1, 2)
    fto2 = FunctionToOptimize("foo", file2, [], 1, 2)
    codeflash_output = fto1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = fto2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2

def test_hash_entire_file_if_no_line_numbers(temp_code_file):
    """Test that the entire file content is used if no line numbers are provided."""
    code = "def foo():\n    return 1\n\ndef bar():\n    return 2\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [])
    # Should use the whole file content
    codeflash_output = fto.get_code_context_hash(); hash_full = codeflash_output
    # Now, use only the first two lines
    fto_partial = FunctionToOptimize("foo", file, [], 1, 2)
    codeflash_output = fto_partial.get_code_context_hash(); hash_partial = codeflash_output
    assert hash_full != hash_partial

# 2. EDGE TEST CASES

def test_hash_with_empty_file(tmp_path):
    """Test hash generation for an empty file."""
    file = tmp_path / "empty.py"
    file.write_text("", encoding="utf-8")
    fto = FunctionToOptimize("foo", file, [], 1, 1)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    assert isinstance(hash_val, str) and len(hash_val) == 64


def test_hash_with_unicode_content(temp_code_file):
    """Test hash with Unicode characters in the function code."""
    code = "def foo():\n    return '你好, мир, hello'\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [], 1, 2)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    assert isinstance(hash_val, str) and len(hash_val) == 64

def test_hash_with_large_function(temp_code_file):
    """Test hash with a function that spans many lines."""
    code = "def foo():\n" + "\n".join([f"    x{i} = {i}" for i in range(100)]) + "\n    return x99\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [], 1, 102)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    assert isinstance(hash_val, str) and len(hash_val) == 64


def test_hash_with_starting_line_greater_than_ending_line(temp_code_file):
    """Test that an empty string is used if starting_line > ending_line."""
    code = "def foo():\n    return 1\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [], 2, 1)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    # The function_content should be empty, so the hash should be of the context with empty function content
    context_parts = [
        file.name,
        fto.qualified_name,
        ""
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash_val == expected_hash

def test_hash_with_starting_line_out_of_bounds(temp_code_file):
    """Test that out-of-bounds line numbers result in empty function content."""
    code = "def foo():\n    return 1\n"
    file = temp_code_file(code)
    # starting_line is beyond file length
    fto = FunctionToOptimize("foo", file, [], 100, 200)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    context_parts = [
        file.name,
        fto.qualified_name,
        ""
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash_val == expected_hash

def test_hash_with_only_starting_line(temp_code_file):
    """Test that if only starting_line is provided, the whole file is used."""
    code = "def foo():\n    return 1\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [], 1, None)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    # Should use the whole file content, not just a single line
    context_parts = [
        file.name,
        fto.qualified_name,
        code.strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash_val == expected_hash

def test_hash_with_only_ending_line(temp_code_file):
    """Test that if only ending_line is provided, the whole file is used."""
    code = "def foo():\n    return 1\n"
    file = temp_code_file(code)
    fto = FunctionToOptimize("foo", file, [], None, 2)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    # Should use the whole file content, not just up to ending_line
    context_parts = [
        file.name,
        fto.qualified_name,
        code.strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash_val == expected_hash

# 3. LARGE SCALE TEST CASES

def test_hash_with_large_file_and_small_function(tmp_path):
    """Test that the hash is computed efficiently even for a large file."""
    # Create a file with 1000 lines
    lines = [f"line {i}" for i in range(1, 1001)]
    code = "\n".join(lines)
    file = tmp_path / "largefile.py"
    file.write_text(code, encoding="utf-8")
    # Function is just lines 500-510
    fto = FunctionToOptimize("foo", file, [], 500, 510)
    codeflash_output = fto.get_code_context_hash(); hash_val = codeflash_output
    # The function content should be lines 499-509 (0-indexed)
    function_content = "\n".join(lines[499:510])
    context_parts = [
        file.name,
        fto.qualified_name,
        function_content.strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash_val == expected_hash


def test_hash_performance_on_multiple_calls(tmp_path):
    """Test that repeated calls are deterministic and performant."""
    code = "\n".join([f"def foo{i}():\n    return {i}" for i in range(100)])
    file = tmp_path / "multi.py"
    file.write_text(code, encoding="utf-8")
    fto = FunctionToOptimize("foo99", file, [], 199, 200)
    codeflash_output = fto.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = fto.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 == hash2



from __future__ import annotations

import hashlib

# imports
import pytest  # used for our unit tests
from codeflash.discovery.functions_to_optimize import FunctionToOptimize


# Helper class for parent scope simulation
class DummyParent:
    def __init__(self, name):
        self.name = name

# --------------------------
# UNIT TESTS START HERE
# --------------------------

# Basic Test Cases

def test_basic_hash_different_code(tmp_path):
    # Test that different code contents produce different hashes
    file1 = tmp_path / "a.py"
    file2 = tmp_path / "b.py"
    file1.write_text("def foo():\n    return 1\n")
    file2.write_text("def foo():\n    return 2\n")
    func1 = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    func2 = FunctionToOptimize(
        function_name="foo",
        file_path=file2,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    codeflash_output = func1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = func2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2

def test_basic_hash_same_code_same_hash(tmp_path):
    # Test that identical code and context produce the same hash
    file1 = tmp_path / "c.py"
    file1.write_text("def bar():\n    return 3\n")
    func1 = FunctionToOptimize(
        function_name="bar",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    func2 = FunctionToOptimize(
        function_name="bar",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    codeflash_output = func1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = func2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 == hash2

def test_basic_hash_different_function_names(tmp_path):
    # Test that different function names produce different hashes
    file1 = tmp_path / "d.py"
    file1.write_text("def baz():\n    return 4\n")
    func1 = FunctionToOptimize(
        function_name="baz",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    func2 = FunctionToOptimize(
        function_name="qux",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    codeflash_output = func1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = func2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2


def test_basic_hash_file_path_affects_hash(tmp_path):
    # Test that file path affects the hash even if code and names are the same
    file1 = tmp_path / "f1.py"
    file2 = tmp_path / "f2.py"
    code = "def foo():\n    return 42\n"
    file1.write_text(code)
    file2.write_text(code)
    func1 = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    func2 = FunctionToOptimize(
        function_name="foo",
        file_path=file2,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    codeflash_output = func1.get_code_context_hash(); hash1 = codeflash_output
    codeflash_output = func2.get_code_context_hash(); hash2 = codeflash_output
    assert hash1 != hash2

# Edge Test Cases

def test_edge_no_starting_or_ending_line(tmp_path):
    # Test fallback to using the entire file content when line numbers are not given
    file1 = tmp_path / "g.py"
    code = "def foo():\n    return 'hello'\n"
    file1.write_text(code)
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=None,
        ending_line=None
    )
    codeflash_output = func.get_code_context_hash(); hash1 = codeflash_output
    # The hash should be the same as if we explicitly used the entire file content
    func2 = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    # However, this will only be true if the function covers the whole file
    # So we check that the hash is equal to the hash of the whole file content
    context_parts = [
        file1.name,
        "foo",
        code.strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash1 == expected_hash

def test_edge_file_not_found(monkeypatch, tmp_path):
    # Test fallback hash generation when file is missing
    file1 = tmp_path / "missing.py"
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    # Patch logger to avoid real logging in test
    class DummyLogger:
        def warning(self, msg): pass
    monkeypatch.setattr("codeflash.cli_cmds.console.logger", DummyLogger())
    codeflash_output = func.get_code_context_hash(); hash1 = codeflash_output
    # The fallback hash should be the hash of "missing.py:foo"
    fallback_string = f"{file1.name}:foo"
    expected_hash = hashlib.sha256(fallback_string.encode('utf-8')).hexdigest()
    assert hash1 == expected_hash

def test_edge_empty_file(tmp_path):
    # Test behavior with an empty file
    file1 = tmp_path / "empty.py"
    file1.write_text("")
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=1
    )
    codeflash_output = func.get_code_context_hash(); hash1 = codeflash_output
    # The hash should be based on filename, qualified name, and empty content
    context_parts = [
        file1.name,
        "foo",
        ""
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash1 == expected_hash

def test_edge_starting_line_greater_than_ending_line(tmp_path):
    # Test with starting_line > ending_line (should produce empty content)
    file1 = tmp_path / "weird.py"
    file1.write_text("def foo():\n    return 1\n    return 2\n")
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=3,
        ending_line=2
    )
    codeflash_output = func.get_code_context_hash(); hash1 = codeflash_output
    # The content will be empty string
    context_parts = [
        file1.name,
        "foo",
        ""
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash1 == expected_hash

def test_edge_non_ascii_characters(tmp_path):
    # Test handling of non-ASCII (unicode) characters in code
    file1 = tmp_path / "unicode.py"
    code = "def foo():\n    return '你好, мир, hello'\n"
    file1.write_text(code)
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=2
    )
    codeflash_output = func.get_code_context_hash(); hash1 = codeflash_output
    context_string = '\n---\n'.join([file1.name, "foo", code.strip()])
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert hash1 == expected_hash


def test_large_scale_many_functions(tmp_path):
    # Test that many different functions produce unique hashes
    file1 = tmp_path / "large.py"
    # Create a file with 100 functions, each returning its index
    code = ""
    for i in range(100):
        code += f"def func_{i}():\n    return {i}\n"
    file1.write_text(code)
    hashes = set()
    lines = code.splitlines()
    for i in range(100):
        func = FunctionToOptimize(
            function_name=f"func_{i}",
            file_path=file1,
            parents=[],
            starting_line=2*i+1,
            ending_line=2*i+2
        )
        codeflash_output = func.get_code_context_hash(); h = codeflash_output
        hashes.add(h)
    assert len(hashes) == 100  # every function should hash uniquely

def test_large_scale_long_function(tmp_path):
    # Test a function with a very large body
    file1 = tmp_path / "longfunc.py"
    # Function with 1000 lines
    code = "def big():\n"
    code += "\n".join([f"    x{i} = {i}" for i in range(1000)])
    code += "\n    return x999\n"
    file1.write_text(code)
    func = FunctionToOptimize(
        function_name="big",
        file_path=file1,
        parents=[],
        starting_line=1,
        ending_line=1002
    )
    codeflash_output = func.get_code_context_hash(); h = codeflash_output
    # The hash should be deterministic and not crash
    context_parts = [
        file1.name,
        "big",
        "\n".join(code.splitlines()[0:1002]).strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert h == expected_hash


def test_large_scale_file_with_many_lines(tmp_path):
    # Test with a file with 1000 lines, extracting a function from the middle
    file1 = tmp_path / "bigfile.py"
    code = "\n".join([f"# line {i}" for i in range(500)])
    code += "\ndef foo():\n    return 123\n"
    code += "\n".join([f"# line {i}" for i in range(500, 1000)])
    file1.write_text(code)
    # Function is at line 501 and 502 (1-based)
    func = FunctionToOptimize(
        function_name="foo",
        file_path=file1,
        parents=[],
        starting_line=501,
        ending_line=502
    )
    codeflash_output = func.get_code_context_hash(); h = codeflash_output
    lines = file1.read_text().splitlines()
    function_content = '\n'.join(lines[500:502])
    context_parts = [
        file1.name,
        "foo",
        function_content.strip()
    ]
    context_string = '\n---\n'.join(context_parts)
    expected_hash = hashlib.sha256(context_string.encode('utf-8')).hexdigest()
    assert h == expected_hash
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, run `git checkout codeflash/optimize-pr275-2025-06-05T20.41.59` and push.

Codeflash

…in PR #275 (`dont-optimize-repeatedly-gh-actions`)

@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Jun 5, 2025
@misrasaurabh1
Contributor

looks good in concept

@codeflash-ai codeflash-ai bot closed this Jun 9, 2025
@codeflash-ai
Contributor Author

codeflash-ai bot commented Jun 9, 2025

This PR has been automatically closed because the original PR #275 by dasarchan was closed.

@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr275-2025-06-05T20.41.59 branch June 9, 2025 06:27