[FEAT] Feedback loop for unmatched test results #945

mohammedahmed18 · 2025-11-27T14:20:31Z

PR Type

Enhancement, Tests

Description

Add structured test diff generation
Capture pytest failures from stdout
Improve Levenshtein performance optimizations
Integrate mismatch handling in optimizer

Diagram Walkthrough

flowchart LR
  ParseStdout["Parse pytest stdout failures"] -- "annotate" --> TestResults["TestResults.test_failures"]
  Compare["compare_test_results returns (match, diffs)"] -- "used by" --> Optimizer["FunctionOptimizer.run_optimized_candidate"]
  Optimizer -- "on mismatch" --> Feedback["Handle >50% mismatches or attempt fix"]
  CSTExtract["Extract test source via libcst"] -- "enrich" --> Compare
  LevOpt["Levenshtein optimizations"] -- "faster distance" --> Core["Core utilities"]

File Walkthrough

Relevant files

Enhancement

functions_to_optimize.py `Optimize Levenshtein distance implementation` codeflash/discovery/functions_to_optimize.py	+30/-12
  Add early exits for empty strings
  Use lists for fast indexed access
  Reuse arrays and local vars for cache locality
  Simplify min computations and swap buffers

models.py `Enrich models with test source and failures map` codeflash/models/models.py	+29/-0
  Import libcst for CST parsing
  Add methods to locate tests and get source
  Extend TestResults with test_failures mapping

function_optimizer.py `Integrate diff-aware comparison into optimizer flow` codeflash/optimization/function_optimizer.py	+21/-4
  Compare behavior with detailed diffs
  Add helper to return unified mismatch Failure
  Gate feedback loop by mismatch percentage

equivalence.py `Produce structured diffs from test result comparison` codeflash/verification/equivalence.py	+61/-24
  Introduce TestDiffScope enum and TestDiff dataclass
  Return (match, diffs) from comparison
  Capture return/stdout/pass mismatches with context
  Include pytest error and test source in diffs

parse_test_output.py `Extract pytest failure details from stdout` codeflash/verification/parse_test_output.py	+42/-0
  Parse pytest stdout to map test failures
  Attach failures map to TestResults
  Safe parsing with exception handling

Tests

test_codeflash_capture.py `E2E test for structured diffs and mismatch handling` tests/test_codeflash_capture.py	+233/-0
  Add end-to-end test covering diff output
  Validate non-matching results return diffs
  Exercise optimizer-like flow with instrumented tests
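
For readers who want a concrete picture of the "(match, diffs)" return shape described in the walkthrough, here is a minimal sketch. Only the names `TestDiffScope` and `TestDiff` come from this PR; the fields, the `Outcome` record, and `compare_outcomes` are hypothetical stand-ins and are not the code in codeflash/verification/equivalence.py. The sketch also compares only tests present on both sides, which is a simplification.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class TestDiffScope(Enum):
    # Which aspect of a test invocation differed (member names are assumed).
    RETURN_VALUE = "return_value"
    STDOUT = "stdout"
    DID_PASS = "did_pass"


@dataclass(frozen=True)
class TestDiff:
    # Field names are illustrative, not the PR's actual dataclass fields.
    scope: TestDiffScope
    test_name: str
    original_value: Any
    candidate_value: Any
    pytest_error: Optional[str] = None


@dataclass(frozen=True)
class Outcome:
    # Hypothetical, simplified record of one test invocation.
    return_value: Any
    stdout: str
    did_pass: bool


def compare_outcomes(
    original: dict[str, Outcome],
    candidate: dict[str, Outcome],
    failures: Optional[dict[str, str]] = None,
) -> tuple[bool, list[TestDiff]]:
    """Return (match, diffs) instead of a bare bool."""
    if not original or not candidate:
        return False, []  # empty test results are not equal
    failures = failures or {}  # tolerate a missing failures map
    diffs: list[TestDiff] = []
    for name in sorted(original.keys() & candidate.keys()):
        o, c = original[name], candidate[name]
        err = failures.get(name)
        if o.return_value != c.return_value:
            diffs.append(TestDiff(TestDiffScope.RETURN_VALUE, name, o.return_value, c.return_value, err))
        elif o.stdout != c.stdout:
            diffs.append(TestDiff(TestDiffScope.STDOUT, name, o.stdout, c.stdout, err))
        elif o.did_pass != c.did_pass:
            diffs.append(TestDiff(TestDiffScope.DID_PASS, name, o.did_pass, c.did_pass, err))
    return not diffs, diffs
```

Returning the list alongside the boolean lets a caller count mismatches or forward them to a repair step without re-running the comparison.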

github-actions · 2025-11-27T14:21:32Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Robustness

Accessing candidate_results.test_failures without guarding against None may raise an AttributeError; ensure test_failures is initialized or defaulted before use.

    def compare_test_results(original_results: TestResults, candidate_results: TestResults) -> tuple[bool, list[TestDiff]]:
        # This is meant to be only called with test results for the first loop index
        if len(original_results) == 0 or len(candidate_results) == 0:
            return False, []  # empty test results are not equal
        original_recursion_limit = sys.getrecursionlimit()
        if original_recursion_limit < INCREASED_RECURSION_LIMIT:
            sys.setrecursionlimit(INCREASED_RECURSION_LIMIT)  # Increase recursion limit to avoid RecursionError
        test_ids_superset = original_results.get_all_unique_invocation_loop_ids().union(
            set(candidate_results.get_all_unique_invocation_loop_ids())
        )
        test_diffs: list[TestDiff] = []
        did_all_timeout: bool = True
        for test_id in test_ids_superset:
            original_test_result = original_results.get_by_unique_invocation_loop_id(test_id)
            cdd_test_result = candidate_results.get_by_unique_invocation_loop_id(test_id)
            candidate_pytest_error = candidate_results.test_failures.get(original_test_result.id.test_function_name)
            if cdd_test_result is not None and original_test_result is None:
                continue

API Change

compare_test_results now returns (match, diffs) but some other call sites may still expect a bool; verify all usages are updated to handle the tuple and the diffs list.

    match, diffs = compare_test_results(baseline_results.behavior_test_results, candidate_behavior_results)
    if match:
        logger.info("h3\|Test results matched ✅")
        console.rule()
    else:
        result_unmatched_perc = len(diffs) / len(candidate_behavior_results)
        if result_unmatched_perc > 0.5:
            # if the test unmatched percentage is greater than 50%, we can't fix it
            return self.get_results_not_matched_error()

        # with the parsed test results diff ask the llm to fix the candidate to match the test results of the original code, and run again
        # self.run_optimized_candidate(
        #     optimization_candidate_index=optimization_candidate_index,
        #     baseline_results=baseline_results,
        #     original_helper_code=original_helper_code,
        #     file_path_to_helper_classes=file_path_to_helper_classes,
        # )
        print(f"should try to fix it, diffs: {diffs}")
        return self.get_results_not_matched_error()

Parsing Errors

libcst-based get_src_code lacks error handling for parse failures and invalid files; consider try/except and returning None with logging to avoid crashes.

    def get_src_code(self, test_path: Path) -> Optional[str]:
        test_src = test_path.read_text(encoding="utf-8")
        module_node = cst.parse_module(test_src)

        if self.test_class_name:
            for stmt in module_node.body:
                if isinstance(stmt, cst.ClassDef) and stmt.name.value == self.test_class_name:
                    func_node = self.find_func_in_class(stmt, self.test_function_name)
                    if func_node:
                        return module_node.code_for_node(func_node).strip()
            # class not found
            return None

        # Otherwise, look for a top level function
        for stmt in module_node.body:
            if isinstance(stmt, cst.FunctionDef) and stmt.name.value == self.test_function_name:
                return module_node.code_for_node(stmt).strip()
        return None
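
The 50% gate in the optimizer snippet can be isolated into a small pure function, which makes the threshold easy to test. This is a sketch under assumptions: `MismatchAction` and `decide_on_mismatch` are hypothetical names, and the zero-results guard is an addition that the quoted snippet does not contain.

```python
from enum import Enum


class MismatchAction(Enum):
    # Hypothetical outcomes; the PR itself returns an error object instead.
    ACCEPT = "accept"
    ATTEMPT_REPAIR = "attempt_repair"
    REJECT = "reject"


def decide_on_mismatch(
    num_diffs: int,
    num_candidate_results: int,
    max_unmatched_ratio: float = 0.5,
) -> MismatchAction:
    """Decide what to do with a candidate given its mismatch count."""
    if num_candidate_results == 0:
        # Guard the division; empty results are treated as a non-match.
        return MismatchAction.REJECT
    if num_diffs == 0:
        return MismatchAction.ACCEPT
    unmatched_ratio = num_diffs / num_candidate_results
    if unmatched_ratio > max_unmatched_ratio:
        # Too many mismatches to be worth a repair attempt.
        return MismatchAction.REJECT
    return MismatchAction.ATTEMPT_REPAIR
```

Note that a ratio of exactly 0.5 still qualifies for a repair attempt, mirroring the strict `>` comparison in the snippet above.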

github-actions · 2025-11-27T14:21:55Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Suggestion: Safely access optional mapping
Impact: Medium

Guard access to `test_failures` since it can be `None` and avoid AttributeError. Also handle missing keys safely to keep comparison robust when no failures were parsed.

codeflash/verification/equivalence.py [43]

    -candidate_pytest_error = candidate_results.test_failures.get(original_test_result.id.test_function_name)
    +candidate_pytest_error = None
    +if getattr(candidate_results, "test_failures", None):
    +    candidate_pytest_error = candidate_results.test_failures.get(original_test_result.id.test_function_name)

Suggestion importance[1-10]: 8

Why: `test_failures` is declared Optional in `TestResults`, so direct `.get` can raise if None; guarding prevents an AttributeError and aligns with new parsing logic.

Category: Possible issue
Suggestion: Ensure recursion limit restoration
Impact: Low

Preserve the recursion limit restoration even on early returns to avoid leaving the process with a higher limit. Move recursion limit increase before any early return or ensure restoration in all paths.

codeflash/verification/equivalence.py [30-35]

     if len(original_results) == 0 or len(candidate_results) == 0:
    -    return False, []  # empty test results are not equal
    +    return False, []
    +original_recursion_limit = sys.getrecursionlimit()
    +try:
    +    if original_recursion_limit < INCREASED_RECURSION_LIMIT:
    +        sys.setrecursionlimit(INCREASED_RECURSION_LIMIT)
    +    # ... rest of the function body unchanged ...
    +finally:
    +    sys.setrecursionlimit(original_recursion_limit)

Suggestion importance[1-10]: 6

Why: Early return before saving/restoring the recursion limit can skip restoration if that logic ever moves; wrapping with try/finally improves robustness though current early return happens before any change.

Category: General
Suggestion: Use logger instead of print
Impact: Medium

Replace `print` with the existing logger to keep consistent output handling and avoid noisy stdout in library code. Log the exception with traceback for better diagnostics.

codeflash/verification/equivalence.py [76-87]

     try:
    -    print(
    -        f"File Name: {original_test_result.file_name}\n"
    -        f"Test Type: {original_test_result.test_type}\n"
    -        f"Verification Type: {original_test_result.verification_type}\n"
    -        f"Invocation ID: {original_test_result.id}\n"
    -        f"Original return value: {original_test_result.return_value}\n"
    -        f"Candidate return value: {cdd_test_result.return_value}\n"
    +    logger.debug(
    +        "File Name: %s\nTest Type: %s\nVerification Type: %s\nInvocation ID: %s\nOriginal return value: %r\nCandidate return value: %r",
    +        original_test_result.file_name,
    +        original_test_result.test_type,
    +        original_test_result.verification_type,
    +        original_test_result.id,
    +        original_test_result.return_value,
    +        cdd_test_result.return_value,
         )
    -except Exception as e:
    -    logger.error(e)
    +except Exception:
    +    logger.exception("Failed to log return value comparison details")
     break

Suggestion importance[1-10]: 7

Why: Replacing `print` with `logger.debug/exception` keeps output consistent and avoids noisy stdout; the improved code accurately mirrors the existing block’s intent with better diagnostics.
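
The try/finally suggestion for the recursion limit can also be packaged as a context manager, so every exit path (early return, exception, normal completion) restores the original value. This is a sketch using only the standard library; `raised_recursion_limit` is a hypothetical helper name, not something in the PR.

```python
import sys
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def raised_recursion_limit(minimum: int) -> Iterator[None]:
    """Temporarily ensure the recursion limit is at least `minimum`."""
    original = sys.getrecursionlimit()
    try:
        if original < minimum:
            sys.setrecursionlimit(minimum)
        yield
    finally:
        # Runs on normal exit, on return inside the with-block, and on exceptions.
        sys.setrecursionlimit(original)
```

A comparison function could then wrap its body in `with raised_recursion_limit(INCREASED_RECURSION_LIMIT):` and return from anywhere inside the block.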

codeflash-ai · 2025-11-27T14:39:27Z

codeflash/discovery/functions_to_optimize.py

+                x = prev[index1]
+                y = prev[index1 + 1]
+                z = curr[index1]
+                min_xy = min(x, y)
+                min_xyz = min(z, min_xy)
+                curr[index1 + 1] = 1 + min_xyz


⚡️Codeflash found 73% (0.73x) speedup for levenshtein_distance in codeflash/discovery/functions_to_optimize.py

⏱️ Runtime : 2.04 seconds → 1.18 seconds (best of 8 runs)

📝 Explanation and details

The optimized version achieves a 73% speedup by eliminating Python's built-in min() function calls and replacing them with direct comparisons. This is a targeted micro-optimization that addresses one of the most expensive operations in the Levenshtein distance algorithm.

Key optimization:

Replaced min() calls with direct comparisons: The original code used min(x, y) and min(z, min_xy) which create temporary tuples and invoke Python's generic minimum function. The optimized version uses nested if statements to find the minimum value directly, avoiding function call overhead and tuple creation.

Why this provides a speedup:

The min() function in Python has significant overhead for small numbers of arguments, especially when called millions of times in nested loops

Direct comparisons (if x < y) are primitive operations that execute much faster than function calls

Eliminates temporary tuple creation that min() uses internally

Reduces the call stack depth in the inner loop

Performance impact by test case type:

Identical/similar strings: 55-65% faster - benefits from reduced overhead in character matching paths

Completely different strings: 109-121% faster - maximizes benefit since every character comparison triggers the min() replacement logic

Large strings with many differences: 83-93% faster - compounds the per-operation savings across many iterations

Small strings: 15-50% faster - still benefits but overhead reduction is less pronounced

The optimization is particularly effective for the Levenshtein algorithm because the min() operation occurs in the innermost loop that executes O(n×m) times, making even small per-call improvements significant when multiplied across all iterations.
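
To show where the replaced `min()` calls sit, here is a self-contained two-row Levenshtein sketch with the direct-comparison ladder in the inner loop. It is an illustration under assumptions, not the function in codeflash/discovery/functions_to_optimize.py: `levenshtein_sketch` and its loop variable names are hypothetical.

```python
def levenshtein_sketch(s1: str, s2: str) -> int:
    """Edit distance using two reusable rows and no min() calls."""
    # Early exits for empty inputs.
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)

    width = len(s1) + 1
    prev = list(range(width))  # distances from s1 prefixes to the empty prefix of s2
    curr = [0] * width

    for j, c2 in enumerate(s2):
        curr[0] = j + 1
        for i, c1 in enumerate(s1):
            if c1 == c2:
                curr[i + 1] = prev[i]  # characters match: no extra edit
            else:
                x = prev[i]      # substitution
                y = prev[i + 1]  # edit that consumes a character of s2
                z = curr[i]      # edit that consumes a character of s1
                # Direct comparisons instead of min(x, y, z).
                if x < y:
                    smallest = x if x < z else z
                elif y < z:
                    smallest = y
                else:
                    smallest = z
                curr[i + 1] = 1 + smallest
        prev, curr = curr, prev  # swap buffers instead of allocating a new row

    return prev[len(s1)]
```

The ladder runs once per mismatching cell, which is why inputs with many differing characters would be expected to benefit most.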

✅ Correctness verification report:

Test Status

⚙️ Existing Unit Tests 🔘 None Found

🌀 Generated Regression Tests ✅ 148 Passed

⏪ Replay Tests 🔘 None Found

🔎 Concolic Coverage Tests 🔘 None Found

📊 Tests Coverage 96.6%

🌀 Generated Regression Tests and Runtime

from __future__ import annotations # imports import pytest # used for our unit tests from codeflash.discovery.functions_to_optimize import levenshtein_distance # unit tests # 1. Basic Test Cases def test_identical_strings(): # Levenshtein distance between identical strings should be 0 codeflash_output = levenshtein_distance("kitten", "kitten") # 15.8μs -> 9.90μs (60.0% faster) codeflash_output = levenshtein_distance("", "") # 480ns -> 441ns (8.84% faster) codeflash_output = levenshtein_distance("a", "a") # 2.09μs -> 1.95μs (7.22% faster) def test_single_insertion(): # Inserting one character codeflash_output = levenshtein_distance("kitten", "kitte") # 13.0μs -> 8.69μs (49.4% faster) codeflash_output = levenshtein_distance("kitte", "kitten") # 10.1μs -> 6.12μs (65.6% faster) codeflash_output = levenshtein_distance("", "a") # 421ns -> 421ns (0.000% faster) codeflash_output = levenshtein_distance("a", "") # 360ns -> 361ns (0.277% slower) def test_single_deletion(): # Deleting one character codeflash_output = levenshtein_distance("kitten", "kittn") # 12.6μs -> 8.52μs (48.5% faster) codeflash_output = levenshtein_distance("kittn", "kitten") # 10.0μs -> 5.96μs (67.9% faster) def test_single_substitution(): # Substituting one character codeflash_output = levenshtein_distance("kitten", "sitten") # 14.8μs -> 9.26μs (60.2% faster) codeflash_output = levenshtein_distance("kitten", "kitteb") # 11.7μs -> 6.81μs (72.4% faster) codeflash_output = levenshtein_distance("a", "b") # 2.22μs -> 1.89μs (17.5% faster) def test_multiple_operations(): # Multiple edits required codeflash_output = levenshtein_distance("kitten", "sitting") # 16.4μs -> 10.3μs (58.8% faster) codeflash_output = levenshtein_distance("flaw", "lawn") # 6.70μs -> 4.47μs (50.0% faster) def test_empty_and_nonempty(): # One string empty, one non-empty codeflash_output = levenshtein_distance("", "abc") # 751ns -> 751ns (0.000% faster) codeflash_output = levenshtein_distance("abc", "") # 431ns -> 451ns (4.43% slower) # 
2. Edge Test Cases def test_both_empty(): # Both strings are empty codeflash_output = levenshtein_distance("", "") # 781ns -> 761ns (2.63% faster) def test_one_char_vs_empty(): # One string is a single character, other is empty codeflash_output = levenshtein_distance("a", "") # 771ns -> 781ns (1.28% slower) codeflash_output = levenshtein_distance("", "z") # 431ns -> 441ns (2.27% slower) def test_case_sensitivity(): # Case should matter codeflash_output = levenshtein_distance("abc", "Abc") # 7.70μs -> 5.87μs (31.1% faster) codeflash_output = levenshtein_distance("ABC", "abc") # 5.14μs -> 3.73μs (37.9% faster) def test_unicode_characters(): # Unicode characters codeflash_output = levenshtein_distance("café", "cafe") # 9.39μs -> 6.81μs (37.8% faster) codeflash_output = levenshtein_distance("naïve", "naive") # 9.85μs -> 5.75μs (71.3% faster) codeflash_output = levenshtein_distance("你好", "你") # 3.12μs -> 2.81μs (10.7% faster) codeflash_output = levenshtein_distance("你好", "您好") # 3.10μs -> 2.71μs (14.5% faster) def test_completely_different_strings(): # No characters in common codeflash_output = levenshtein_distance("abc", "xyz") # 7.45μs -> 5.61μs (32.9% faster) codeflash_output = levenshtein_distance("123", "abc") # 5.14μs -> 3.46μs (48.7% faster) def test_prefix_and_suffix(): # One string is a prefix or suffix of the other codeflash_output = levenshtein_distance("abc", "abcd") # 7.88μs -> 6.11μs (29.0% faster) codeflash_output = levenshtein_distance("abcd", "abc") # 5.18μs -> 3.78μs (37.1% faster) codeflash_output = levenshtein_distance("abc", "zabc") # 5.23μs -> 3.41μs (53.6% faster) codeflash_output = levenshtein_distance("abc", "abcz") # 4.87μs -> 3.19μs (52.8% faster) def test_repeated_characters(): # Strings with repeated characters codeflash_output = levenshtein_distance("aaa", "aaaa") # 4.89μs -> 4.79μs (2.11% faster) codeflash_output = levenshtein_distance("aaaa", "aaa") # 2.92μs -> 3.06μs (4.89% slower) codeflash_output = levenshtein_distance("aaa", "bbb") # 
5.54μs -> 3.56μs (55.7% faster) def test_numbers_and_symbols(): # Strings with digits and symbols codeflash_output = levenshtein_distance("1234", "1243") # 8.68μs -> 6.73μs (28.9% faster) codeflash_output = levenshtein_distance("!@#$", "!@#") # 5.76μs -> 4.13μs (39.6% faster) codeflash_output = levenshtein_distance("!@#$", "$#@!") # 6.25μs -> 4.45μs (40.5% faster) def test_long_identical_strings(): # Long identical strings (edge, but also performance) s = "a" * 100 codeflash_output = levenshtein_distance(s, s) # 519μs -> 535μs (2.86% slower) def test_long_strings_one_difference(): # Long strings with one difference at the end s1 = "a" * 999 + "b" s2 = "a" * 1000 codeflash_output = levenshtein_distance(s1, s2) # 60.1ms -> 59.3ms (1.27% faster) codeflash_output = levenshtein_distance(s2, s1) # 60.3ms -> 59.7ms (1.11% faster) def test_long_strings_completely_different(): # Long completely different strings s1 = "a" * 500 s2 = "b" * 500 codeflash_output = levenshtein_distance(s1, s2) # 67.1ms -> 30.4ms (121% faster) # 3. 
Large Scale Test Cases def test_large_equal_strings(): # Large identical strings s = "abcde" * 200 # length 1000 codeflash_output = levenshtein_distance(s, s) # 242ms -> 114ms (111% faster) def test_large_one_insertion(): # Large string with one insertion s1 = "a" * 500 + "b" + "a" * 499 # length 1000 s2 = "a" * 1000 codeflash_output = levenshtein_distance(s1, s2) # 58.2ms -> 56.2ms (3.59% faster) def test_large_one_substitution(): # Large string with one substitution in the middle s1 = "a" * 499 + "b" + "a" * 500 s2 = "a" * 1000 codeflash_output = levenshtein_distance(s1, s2) # 57.9ms -> 57.2ms (1.16% faster) def test_large_completely_different(): # Large strings, all substitutions s1 = "a" * 1000 s2 = "b" * 1000 codeflash_output = levenshtein_distance(s1, s2) # 274ms -> 129ms (112% faster) def test_large_half_and_half(): # Half the string is the same, half is different s1 = "a" * 500 + "b" * 500 s2 = "a" * 1000 codeflash_output = levenshtein_distance(s1, s2) # 171ms -> 93.5ms (83.5% faster) def test_large_with_unicode(): # Large string with unicode characters s1 = "你" * 500 + "好" * 500 s2 = "你" * 1000 codeflash_output = levenshtein_distance(s1, s2) # 174ms -> 96.3ms (81.0% faster) # 4. Additional Robustness Cases @pytest.mark.parametrize( "s1,s2,expected", [ ("", "", 0), ("", "abc", 3), ("abc", "", 3), ("abc", "abc", 0), ("abc", "ab", 1), ("a", "b", 1), ("", "a", 1), ("a", "", 1), ("kitten", "sitting", 3), ("flaw", "lawn", 2), ("intention", "execution", 5), ("distance", "difference", 5), ("abcdef", "azced", 3), ("short", "ports", 3), ], ) def test_various_cases(s1, s2, expected): # Parametrized test for various scenarios codeflash_output = levenshtein_distance(s1, s2) # 130μs -> 85.5μs (52.5% faster) # 5. 
Commutativity property (Levenshtein distance is symmetric) def test_commutativity(): pairs = [ ("kitten", "sitting"), ("flaw", "lawn"), ("abc", "xyz"), ("", "abc"), ("a" * 500, "b" * 500), ("abcde" * 100, "edcba" * 100), ] for s1, s2 in pairs: codeflash_output = levenshtein_distance(s1, s2) d1 = codeflash_output # 126ms -> 58.6ms (116% faster) codeflash_output = levenshtein_distance(s2, s1) d2 = codeflash_output # 126ms -> 58.8ms (115% faster) # 6. Triangle inequality property def test_triangle_inequality(): # For Levenshtein distance, d(x,z) <= d(x,y) + d(y,z) triples = [("kitten", "sitting", "sittin"), ("abc", "abd", "ab"), ("a" * 100, "a" * 99 + "b", "a" * 99 + "c")] for x, y, z in triples: codeflash_output = levenshtein_distance(x, z) d_xz = codeflash_output # 557μs -> 537μs (3.89% faster) codeflash_output = levenshtein_distance(x, y) d_xy = codeflash_output # 553μs -> 532μs (3.98% faster) codeflash_output = levenshtein_distance(y, z) d_yz = codeflash_output # codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations # imports import pytest # used for our unit tests from codeflash.discovery.functions_to_optimize import levenshtein_distance # unit tests # 1. Basic Test Cases def test_identical_strings(): # Identical strings should have distance 0 codeflash_output = levenshtein_distance("kitten", "kitten") # 14.4μs -> 9.29μs (55.1% faster) codeflash_output = levenshtein_distance("", "") # 611ns -> 521ns (17.3% faster) codeflash_output = levenshtein_distance("a", "a") # 2.03μs -> 1.98μs (2.52% faster) def test_single_insertion(): # One insertion required codeflash_output = levenshtein_distance("kitten", "kittena") # 16.1μs -> 9.74μs (65.7% faster) codeflash_output = levenshtein_distance("abc", "abcd") # 5.73μs -> 3.86μs (48.6% faster) def test_single_deletion(): # One deletion required codeflash_output = levenshtein_distance("kitten", "kittn") # 12.9μs -> 8.69μs (49.0% faster) codeflash_output = levenshtein_distance("abcd", "abc") # 5.71μs -> 4.03μs (41.8% faster) def test_single_substitution(): # One substitution required codeflash_output = levenshtein_distance("kitten", "kittan") # 14.5μs -> 9.22μs (57.3% faster) codeflash_output = levenshtein_distance("abc", "adc") # 4.67μs -> 3.47μs (34.7% faster) def test_multiple_operations(): # Multiple operations needed codeflash_output = levenshtein_distance("kitten", "sitting") # 16.6μs -> 10.1μs (65.1% faster) codeflash_output = levenshtein_distance("flaw", "lawn") # 6.70μs -> 4.50μs (49.0% faster) codeflash_output = levenshtein_distance("gumbo", "gambol") # 10.7μs -> 6.22μs (72.6% faster) def test_case_sensitivity(): # Should be case-sensitive codeflash_output = levenshtein_distance("a", "A") # 4.12μs -> 3.55μs (16.1% faster) codeflash_output = levenshtein_distance("Python", "python") # 13.1μs -> 7.71μs (69.8% faster) def test_completely_different_strings(): # All characters different codeflash_output = levenshtein_distance("abc", "xyz") # 7.57μs -> 5.60μs (35.2% faster) codeflash_output = 
levenshtein_distance("aaa", "bbb") # 4.95μs -> 3.26μs (52.0% faster) # 2. Edge Test Cases def test_empty_strings(): # One or both strings empty codeflash_output = levenshtein_distance("", "abc") # 822ns -> 751ns (9.45% faster) codeflash_output = levenshtein_distance("abc", "") # 441ns -> 460ns (4.13% slower) codeflash_output = levenshtein_distance("", "") # 290ns -> 321ns (9.66% slower) def test_one_character_strings(): # Single character to/from empty or another char codeflash_output = levenshtein_distance("a", "") # 742ns -> 771ns (3.76% slower) codeflash_output = levenshtein_distance("", "a") # 431ns -> 411ns (4.87% faster) codeflash_output = levenshtein_distance("a", "b") # 3.80μs -> 3.29μs (15.5% faster) def test_unicode_strings(): # Unicode and multi-byte characters codeflash_output = levenshtein_distance("café", "cafe") # 9.28μs -> 6.86μs (35.2% faster) codeflash_output = levenshtein_distance("你好", "你们好") # 4.51μs -> 3.69μs (22.3% faster) codeflash_output = levenshtein_distance("🙂", "🙃") # 2.33μs -> 2.08μs (12.0% faster) codeflash_output = levenshtein_distance("a🙂b", "a🙃b") # 4.81μs -> 3.54μs (36.0% faster) def test_whitespace_and_special_chars(): # Strings with whitespace and special characters codeflash_output = levenshtein_distance("a b", "ab") # 6.26μs -> 5.17μs (21.1% faster) codeflash_output = levenshtein_distance("a_b", "a-b") # 5.12μs -> 3.48μs (47.3% faster) codeflash_output = levenshtein_distance("hello!", "hello") # 10.1μs -> 5.99μs (68.2% faster) def test_long_repeated_chars(): # Strings with repeated characters codeflash_output = levenshtein_distance("aaaaa", "aaaa") # 5.47μs -> 5.39μs (1.48% faster) codeflash_output = levenshtein_distance("aaaaa", "bbbbb") # 10.9μs -> 6.39μs (71.0% faster) def test_palindromes_and_reverses(): # Palindrome and reversed strings codeflash_output = levenshtein_distance("abcde", "edcba") # 11.9μs -> 7.68μs (54.8% faster) def test_large_difference_in_length(): # One string much longer than the other codeflash_output 
= levenshtein_distance("a", "a" * 100) # 25.4μs -> 25.7μs (1.09% slower) codeflash_output = levenshtein_distance("b" * 100, "b") # 23.3μs -> 23.4μs (0.474% slower) def test_strings_with_numbers(): # Strings with numbers codeflash_output = levenshtein_distance("abc123", "abc124") # 14.5μs -> 9.02μs (60.9% faster) codeflash_output = levenshtein_distance("12345", "54321") # 9.13μs -> 5.82μs (56.8% faster) # 3. Large Scale Test Cases def test_large_identical_strings(): # Large identical strings should have distance 0 s = "a" * 500 codeflash_output = levenshtein_distance(s, s) # 13.9ms -> 13.5ms (2.37% faster) def test_large_one_insertion(): # Large string with one insertion s1 = "a" * 499 s2 = "a" * 250 + "b" + "a" * 249 codeflash_output = levenshtein_distance(s1, s2) # 13.8ms -> 13.6ms (1.61% faster) def test_large_one_deletion(): # Large string with one deletion s1 = "a" * 500 s2 = "a" * 499 codeflash_output = levenshtein_distance(s1, s2) # 13.7ms -> 13.5ms (1.69% faster) def test_large_one_substitution(): # Large string with one substitution in the middle s1 = "a" * 250 + "b" + "a" * 249 s2 = "a" * 500 codeflash_output = levenshtein_distance(s1, s2) # 13.9ms -> 13.5ms (2.27% faster) def test_large_completely_different(): # Large strings, all characters different s1 = "a" * 500 s2 = "b" * 500 codeflash_output = levenshtein_distance(s1, s2) # 67.2ms -> 30.7ms (119% faster) def test_large_partial_overlap(): # Large strings with partial overlap s1 = "a" * 250 + "b" * 250 s2 = "a" * 200 + "b" * 300 # 50 a's replaced with b's codeflash_output = levenshtein_distance(s1, s2) # 41.7ms -> 21.7ms (92.6% faster) def test_large_strings_with_unicode(): # Large strings with unicode characters s1 = "é" * 500 s2 = "e" * 500 codeflash_output = levenshtein_distance(s1, s2) # 67.2ms -> 30.4ms (121% faster) def test_large_strings_with_alternating_chars(): # Alternating characters s1 = "ab" * 250 s2 = "ba" * 250 # Each position is different except for the middle if even length 
codeflash_output = levenshtein_distance(s1, s2) # 41.5ms -> 21.5ms (92.9% faster) # 4. Additional Edge Cases def test_nonequivalent_lengths_and_content(): # Both length and content differ codeflash_output = levenshtein_distance("abcdefg", "xyz") # 12.9μs -> 8.40μs (53.8% faster) def test_substring(): # One string is a substring of the other codeflash_output = levenshtein_distance("abcdef", "abc") # 9.93μs -> 7.42μs (33.7% faster) codeflash_output = levenshtein_distance("abc", "abcdef") # 7.66μs -> 4.98μs (53.7% faster) def test_strings_with_tabs_and_newlines(): # Special whitespace characters codeflash_output = levenshtein_distance("abc\tdef", "abcdef") # 16.8μs -> 10.3μs (62.8% faster) codeflash_output = levenshtein_distance("abc\ndef", "abcdef") # 13.7μs -> 7.80μs (76.0% faster) def test_zero_length_and_long_string(): # One empty, one long codeflash_output = levenshtein_distance("", "a" * 999) # 912ns -> 811ns (12.5% faster) codeflash_output = levenshtein_distance("b" * 999, "") # 631ns -> 541ns (16.6% faster) # 5. Determinism and Symmetry @pytest.mark.parametrize( "s1,s2", [ ("kitten", "sitting"), ("flaw", "lawn"), ("", "abc"), ("abc", ""), ("abc", "cba"), ("abc", "abc"), ("", ""), ("a", "b"), ("abc123", "abc124"), ("a" * 500, "a" * 500), ], ) def test_symmetry(s1, s2): # Levenshtein distance is symmetric codeflash_output = levenshtein_distance(s1, s2) # 13.8ms -> 13.5ms (1.90% faster) # 6. Type robustness def test_non_string_inputs(): # Should raise TypeError if input is not string with pytest.raises(TypeError): levenshtein_distance(123, "abc") with pytest.raises(TypeError): levenshtein_distance("abc", None) with pytest.raises(TypeError): levenshtein_distance(["a", "b"], "ab") with pytest.raises(TypeError): levenshtein_distance("ab", ["a", "b"]) # 7. 
Stress test: Large but feasible within constraints def test_large_strings_max_size(): # Both strings at the upper limit (1000 chars) s1 = "a" * 1000 s2 = "b" * 1000 codeflash_output = levenshtein_distance(s1, s2) # 272ms -> 130ms (109% faster) def test_large_strings_one_char_difference(): # 999 identical, 1 different s1 = "a" * 999 + "b" s2 = "a" * 1000 codeflash_output = levenshtein_distance(s1, s2) # 58.4ms -> 57.5ms (1.56% faster) # codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To test or edit this optimization locally, run: git merge codeflash/optimize-pr945-2025-11-27T14.39.26

Suggested change

-                x = prev[index1]
-                y = prev[index1 + 1]
-                z = curr[index1]
-                min_xy = min(x, y)
-                min_xyz = min(z, min_xy)
-                curr[index1 + 1] = 1 + min_xyz
+                # Avoid min() function call overhead by using direct comparisons
+                x = prev[index1]
+                y = prev[index1 + 1]
+                z = curr[index1]
+                if x < y:
+                    if x < z:
+                        curr[index1 + 1] = 1 + x
+                    else:
+                        curr[index1 + 1] = 1 + z
+                elif y < z:
+                    curr[index1 + 1] = 1 + y
+                else:
+                    curr[index1 + 1] = 1 + z
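
One way to gain confidence that the comparison ladder in the suggested change selects the same value as the two `min()` calls is an exhaustive check over a small range of integers. This is a sketch: `min3_branching` is a hypothetical wrapper around the same branch structure, written so it can be tested in isolation.

```python
from itertools import product


def min3_branching(x: int, y: int, z: int) -> int:
    """Same branch structure as the suggested change, returning the chosen value."""
    if x < y:
        if x < z:
            return x
        return z
    elif y < z:
        return y
    return z


def ladder_matches_min(limit: int = 6) -> bool:
    """Compare the ladder with min(z, min(x, y)) for all triples below `limit`."""
    return all(
        min3_branching(x, y, z) == min(z, min(x, y))
        for x, y, z in product(range(limit), repeat=3)
    )
```

Ties are the cases worth checking by hand: when two of the values are equal, either branch yields the same number, so the result is unaffected.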

The optimized code achieves a **15% speedup** through several targeted micro-optimizations that reduce computational overhead in the parsing loop:

**Key Optimizations:**

1. **Single-pass boundary search**: Instead of checking both conditions (`start_line != -1 and end_line != -1`) on every iteration, the optimized version uses `None` values and breaks immediately when both markers are found, eliminating redundant condition checks.

2. **Fast-path string matching**: Before calling the expensive `.startswith("_______")` method, it first checks if `line[0] == "_"`, avoiding the method call for most lines that don't start with underscores.

3. **Method lookup optimization**: Pulls `current_failure_lines.append` into a local variable to avoid repeated attribute lookups in the hot loop where failure lines are processed.

4. **Memory-efficient list management**: Uses `current_failure_lines.clear()` instead of creating new list objects (`current_failure_lines = []`), reducing object allocation pressure.

**Performance Impact:**

The optimizations show the most significant gains in large-scale scenarios:

- **Large failure sets**: 14.2% faster with 500 failures, 14.0% faster with 999 failures
- **Large output**: 29.2% faster for single failures with 1000 lines of output
- **Complex scenarios**: 22.3% faster with 50 cases having 10 lines each

**Hot Path Context:**

Based on the function reference, `parse_test_failures_from_stdout` is called from `parse_test_results`, which appears to be part of a test optimization pipeline. The function processes pytest stdout to extract failure information, making it performance-critical when dealing with large test suites or verbose test outputs. The 15% improvement becomes meaningful when processing hundreds of test failures in CI/CD environments or during iterative code optimization workflows.

codeflash-ai · 2025-11-27T14:49:11Z

⚡️ Codeflash found optimizations for this PR

📄 16% (0.16x) speedup for `parse_test_failures_from_stdout` in `codeflash/verification/parse_test_output.py`

⏱️ Runtime : 2.76 milliseconds → 2.39 milliseconds (best of 250 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function parse_test_failures_from_stdout by 16% in PR #945 (feat/feedback-loop-for-unmatched-test-results) #946

If you approve, it will be merged into this PR (branch feat/feedback-loop-for-unmatched-test-results).

…25-11-27T14.49.01 ⚡️ Speed up function `parse_test_failures_from_stdout` by 16% in PR #945 (`feat/feedback-loop-for-unmatched-test-results`)

codeflash-ai · 2025-11-27T16:01:33Z

This PR is now faster! 🚀 @mohammedahmed18 accepted my optimizations from:

⚡️ Speed up function parse_test_failures_from_stdout by 16% in PR #945 (feat/feedback-loop-for-unmatched-test-results) #946

codeflash-ai · 2025-11-27T18:27:01Z

⚡️ Codeflash found optimizations for this PR

📄 655% (6.55x) speedup for `compare_test_results` in `codeflash/verification/equivalence.py`

⏱️ Runtime : 90.0 milliseconds → 11.9 milliseconds (best of 5 runs)

A dependent PR with the suggested changes has been created. Please review:

⚡️ Speed up function compare_test_results by 655% in PR #945 (feat/feedback-loop-for-unmatched-test-results) #947

If you approve, it will be merged into this PR (branch feat/feedback-loop-for-unmatched-test-results).

CLAassistant · 2025-11-30T22:11:47Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Codeflash Bot does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

codeflash-ai · 2025-12-02T11:48:09Z

This PR is now faster! 🚀 mohammed ahmed accepted my code suggestion above.

quick and dirty

5830a70

github-actions bot added the Review effort 3/5 label Nov 27, 2025

safter

3e0440b

mohammedahmed18 marked this pull request as draft November 27, 2025 14:27

codeflash-ai bot reviewed Nov 27, 2025

View reviewed changes

codeflash-ai bot mentioned this pull request Nov 27, 2025

⚡️ Speed up function parse_test_failures_from_stdout by 16% in PR #945 (feat/feedback-loop-for-unmatched-test-results) #946

Merged

Merge pull request #946 from codeflash-ai/codeflash/optimize-pr945-20…

168118a

…25-11-27T14.49.01 ⚡️ Speed up function `parse_test_failures_from_stdout` by 16% in PR #945 (`feat/feedback-loop-for-unmatched-test-results`)

mohammedahmed18 added 2 commits November 27, 2025 19:51

fix tests

a7f8816

linting

4e9f894

codeflash-ai bot mentioned this pull request Nov 27, 2025

⚡️ Speed up function compare_test_results by 655% in PR #945 (feat/feedback-loop-for-unmatched-test-results) #947

Open

mohammedahmed18 and others added 5 commits November 28, 2025 10:49

did it pass ?

1c9abaf

revert test optimization

0b2d894

cleaner

ecfa89f

test: try to fix the candidate and see if the diff is empty

6ea2545

capture all test discrepancies

fe68772

Codeflash Bot and others added 8 commits November 30, 2025 18:28

do the repair in main loop

ed39ec8

todo write backend endpoint

142da4c

need to test now

5a7c356

Merge branch 'feat/feedback-loop-for-unmatched-test-results' of githu…

8a28d0d

…b.com:codeflash-ai/codeflash into feat/feedback-loop-for-unmatched-test-results

works, figure out logging

5ed5dfc

local db logging

fe33c82

ready to run experiments

83814be

logging fix

0325444

mohammedahmed18 added 2 commits December 1, 2025 16:48

handle test class methods for the test diff

9f7ed90

Merge branch 'feat/feedback-loop-for-unmatched-test-results' of githu…

1ddc87c

…b.com:codeflash-ai/codeflash into feat/feedback-loop-for-unmatched-test-results

codeflash-ai bot reviewed Dec 1, 2025

View reviewed changes

codeflash/models/models.py Outdated Show resolved Hide resolved

codeflash suggestion

6060ffb

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>

mohammedahmed18 added 3 commits December 2, 2025 13:59

safer parsing

1120d64

better parsing for pytest stdout

c2e037a

Merge branch 'feat/feedback-loop-for-unmatched-test-results' of githu…

5703889

…b.com:codeflash-ai/codeflash into feat/feedback-loop-for-unmatched-test-results

3 participants

Test	Status
⚙️ Existing Unit Tests	🔘 None Found
🌀 Generated Regression Tests	✅ 148 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	96.6%

github-actions bot commented Nov 27, 2025

PR Reviewer Guide 🔍

github-actions bot commented Nov 27, 2025

PR Code Suggestions ✨

codeflash-ai bot Nov 27, 2025

⚡️ Codeflash found 73% (0.73x) speedup for `levenshtein_distance` in `codeflash/discovery/functions_to_optimize.py`
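For context, the PR description lists early exits for empty strings, list-based rows, and buffer swapping as the Levenshtein optimizations. The sketch below illustrates those general techniques only; it is not the code from this PR or from the bot's suggestion, and the name `levenshtein_distance_sketch` is hypothetical.

```python
def levenshtein_distance_sketch(s1: str, s2: str) -> int:
    """Two-row Levenshtein distance (illustrative, not the PR's code)."""
    # Early exits: identical strings, or one side empty.
    if s1 == s2:
        return 0
    if not s1:
        return len(s2)
    if not s2:
        return len(s1)
    # Keep the shorter string on the inner axis so the rows stay small.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    n = len(s2)
    previous = list(range(n + 1))  # distances for the previous row
    current = [0] * (n + 1)        # reused buffer for the current row
    for i, c1 in enumerate(s1, start=1):
        current[0] = i
        for j in range(1, n + 1):
            cost = 0 if c1 == s2[j - 1] else 1
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            substitution = previous[j - 1] + cost
            # Inline comparisons instead of min() on a tuple.
            best = deletion if deletion < insertion else insertion
            current[j] = best if best < substitution else substitution
        # Swap buffers rather than allocating a new row each iteration.
        previous, current = current, previous
    return previous[n]
```

Only two rows are kept in memory, so space is proportional to the shorter string; whether this matches the measured 73% figure depends on the actual implementation, which is not shown in this thread.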
