Skip to content

Conversation

@mohammedahmed18
Copy link
Contributor

@mohammedahmed18 mohammedahmed18 commented Dec 12, 2025

PR Type

Enhancement, Tests


Description

  • Add consistent-loop break based on durations

  • Track per-loop test durations via hook

  • Introduce dynamic tolerance by runtime scale

  • New config: CONSISTENT_LOOP_COUNT


Diagram Walkthrough

flowchart LR
  CFG["Add CONSISTENT_LOOP_COUNT const"]
  HOOK["pytest_runtest_logreport collects durations"]
  LOOP["runtestloop aggregates loop durations"]
  TOL["dynamic_tolerance by avg runtime"]
  BREAK["Break when CONSISTENT_LOOP_COUNT consistent"]

  CFG -- "config value" --> LOOP
  HOOK -- "per-test duration" --> LOOP
  LOOP -- "median compare" --> TOL
  TOL -- "consistency check" --> BREAK
Loading

File Walkthrough

Relevant files
Enhancement
config_consts.py
Introduce loop consistency count constant                               

codeflash/code_utils/config_consts.py

  • Add CONSISTENT_LOOP_COUNT default to 3.
  • Document intent alongside timing configs.
+1/-0     
pytest_plugin.py
Duration-based consistent loop termination logic                 

codeflash/verification/pytest_plugin.py

  • Track passed test durations per loop.
  • Add dynamic_tolerance based on avg runtime.
  • Break loop after N consistent total durations.
  • Use deque with CONSISTENT_LOOP_COUNT.
+50/-5   
Formatting
env_utils.py
Minor formatting cleanup                                                                 

codeflash/code_utils/env_utils.py

  • Minor cleanup: remove stray blank line.
+0/-1     

@github-actions
Copy link

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Hardcoded Path

Writing a file to a hardcoded absolute path under the user's home directory can fail on different environments and pollutes user filesystems; consider using a temp dir, pytest's artifacts, or a configurable path and add error handling.

Path(f"/home/mohammed/Documents/test-results/break-{session.name}.txt").write_text(str(count))
break
Loop Timeout Logic

The updated loop no longer refreshes total_time within the loop and relies solely on timed_out/start_time; verify that SHORTEST_AMOUNT_OF_TIME and timed_out still guarantee at least one run and correct termination across configs.

while total_time >= SHORTEST_AMOUNT_OF_TIME:
    count += 1
    self.current_loop_durations_in_seconds.clear()

    for index, item in enumerate(session.items):
Zero-Duration Handling

Clearing durations when the total duration is zero may mask scenarios with flaky fast tests; consider tracking zeros explicitly or logging a reason to avoid unintended endless loops.

if total_duration_in_seconds > 0:
    durations.append(total_duration_in_seconds)
else:
    durations.clear()

@github-actions
Copy link

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Remove hard-coded file path

Avoid writing to a hard-coded, user-specific absolute path which can fail on other
environments and introduces side effects during tests. Use a temp directory under
the pytest cache dir to persist optional diagnostics, and guard it behind a
debug/verbose flag.

codeflash/verification/pytest_plugin.py [348]

-Path(f"/home/mohammed/Documents/test-results/break-{session.name}.txt").write_text(str(count))
+if session.config.option.verbose > 1:
+    out_dir = Path(session.config.rootpath, ".pytest_cache", "codeflash")
+    out_dir.mkdir(parents=True, exist_ok=True)
+    (out_dir / f"break-{session.name}.txt").write_text(str(count))
Suggestion importance[1-10]: 8

__

Why: Replacing a hard-coded absolute path with a cache-directory path guarded by verbosity prevents environment-specific failures and unintended side effects, and the improved_code correctly reflects this change at the referenced line.

Medium
General
Preserve stability history across loops

Clearing durations on a zero-duration loop resets accumulated stability history and
can cause non-termination flaps under very fast tests. Instead, append zero and let
the consistency check handle zeros; only clear durations when test items were not
executed.

codeflash/verification/pytest_plugin.py [314-339]

 self.current_loop_durations_in_seconds.clear()
 ...
 total_duration_in_seconds = sum(self.current_loop_durations_in_seconds)
 
-if total_duration_in_seconds > 0:
+if session.items:
     durations.append(total_duration_in_seconds)
 else:
     durations.clear()
Suggestion importance[1-10]: 6

__

Why: The suggestion reasonably avoids clearing durations on zero totals to reduce flapping and updates logic in the correct region, though it slightly changes semantics and assumes session.items presence as the clearing condition.

Low
Harden consistency evaluation

Guard against division by zero and ensure dynamic_tolerance is only used with
positive avg; also handle NaN durations defensively. Short-circuit consistency
evaluation if any duration is non-finite.

codeflash/verification/pytest_plugin.py [312-353]

 while total_time >= SHORTEST_AMOUNT_OF_TIME:
     count += 1
     self.current_loop_durations_in_seconds.clear()
 ...
     if len(durations) == CONSISTENT_LOOP_COUNT:
-        avg = statistics.median(durations)
-        if avg == 0:
-            consistent = all(d == 0 for d in durations)
-        else:
-            consistent = all(abs(d - avg) / avg <= self.dynamic_tolerance(avg) for d in durations)
-        if consistent:
-            Path(f"/home/mohammed/Documents/test-results/break-{session.name}.txt").write_text(str(count))
-            break
+        if all((d == 0 or (d > 0 and d != float("inf"))) for d in durations):
+            avg = statistics.median(durations)
+            if avg <= 0:
+                consistent = all(d == 0 for d in durations)
+            else:
+                tol = self.dynamic_tolerance(avg)
+                consistent = all(abs(d - avg) / avg <= tol for d in durations if d >= 0)
+            if consistent:
+                if session.config.option.verbose > 1:
+                    out_dir = Path(session.config.rootpath, ".pytest_cache", "codeflash")
+                    out_dir.mkdir(parents=True, exist_ok=True)
+                    (out_dir / f"break-{session.name}.txt").write_text(str(count))
+                break
 
-    if self._timed_out(session, start_time, count):
-        break
-
Suggestion importance[1-10]: 5

__

Why: Adding guards for non-positive and non-finite durations is a defensive improvement and aligns with the surrounding code, but the original code already avoids divide-by-zero when avg == 0, so the impact is moderate.

Low

@mohammedahmed18 mohammedahmed18 requested review from KRRT7 and misrasaurabh1 and removed request for misrasaurabh1 December 12, 2025 04:13
@mohammedahmed18 mohammedahmed18 changed the title [Enhancement] Stop looping when runtime is stable [Enhancement] Stop looping when runtime is stable (CF-934) Dec 15, 2025
return 0.2
if avg < 0.1: # < 100 ms
return 0.1
return 0.03 # > 0.1 s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mohammedahmed18 how did you set these numbers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aseembits93
there are just experiment numbers, the whole idea is very small runtimes should have high tolerance value because the noise is relatively large

mohammedahmed18 and others added 9 commits December 15, 2025 20:17
The optimization achieves a **437% speedup** by eliminating expensive function calls and using more efficient algorithms for median calculation and min/max operations.

**Key optimizations applied:**

1. **Custom median calculation**: Replaced `statistics.median(recent)` with a custom implementation using `sorted(recent)` and direct indexing. This eliminates the overhead of the statistics module's generic median function.

2. **Reused sorted array**: The sorted array from median calculation is reused for min/max operations (`recent_sorted[0]`, `recent_sorted[-1]`) instead of calling `min(recent)` and `max(recent)` separately, eliminating redundant iterations.

3. **Replaced `statistics.mean()` calls**: Substituted `mean(recent[:half])` and `mean(recent[half:])` with direct `sum()/length` calculations, removing function call overhead.

4. **Early termination optimization**: Changed the `all()` generator expression to an explicit loop with early `break`, avoiding unnecessary iterations when the first non-conforming value is found.

**Performance impact analysis:**
The line profiler shows dramatic improvements in the most expensive operations:
- `first = mean(recent[:half])`: 2.07ms → 47.7μs (98% reduction)
- `second = mean(recent[half:])`: 1.54ms → 35.4μs (98% reduction)  
- `m = median(recent)`: 220μs → eliminated, replaced with ~55μs custom implementation

**Hot path significance:**
Based on the function reference, `should_stop()` is called in the main test loop (`pytest_runtestloop`) after every iteration to determine if benchmarking should continue. This makes it a critical hot path function where the 437% speedup directly translates to faster benchmark completion times.

**Test case performance:**
The optimization performs exceptionally well across all test scenarios, with speedups ranging from 400-500% for typical cases involving stable data, fluctuations, and large datasets. Even edge cases with insufficient data show 10-25% improvements.
@codeflash-ai
Copy link
Contributor

codeflash-ai bot commented Dec 22, 2025

⚡️ Codeflash found optimizations for this PR

📄 438% (4.38x) speedup for should_stop in codeflash/verification/pytest_plugin.py

⏱️ Runtime : 1.23 milliseconds 229 microseconds (best of 11 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch exp/consistent-loop-break).

Static Badge

@claude
Copy link

claude bot commented Dec 23, 2025

Code Review for PR #967: Stop looping when runtime is stable

Thanks for this enhancement! Here is my detailed review:


✅ Strengths

  1. Well-structured logic with multiple statistical checks for stability
  2. Good separation of concerns - extracting calculate_best_summed_runtime() to its own module
  3. Clear documentation with mermaid diagram

🔴 Critical Issues - MUST FIX

1. Type Hint Error (best_summed_runtime.py:1)

  • any should be Any (capital A) from typing module
  • This will cause linting/type checking failures

2-4. Division by Zero Vulnerabilities (pytest_plugin.py)

  • Line 305: abs(r - m) / m - crashes if median is 0
  • Line 309: (r_max - r_min) / r_min - crashes if r_min is 0
  • Line 316: (first - second) / first - crashes if first is 0
  • Add guard conditions before each division

⚠️ High Priority Issues - SHOULD FIX

5. Unused Method (line 331)

  • dynamic_tolerance() is defined but never called
  • Either use it or remove it

6. Unused Constant (config_consts.py:11)

  • CONSISTENT_LOOP_COUNT = 3 is defined but never used
  • PR description mentions it but code doesn't reference it

7. Empty Dictionary Edge Case (line 390)

  • calculate_best_summed_runtime() with empty dict returns 0
  • Could accumulate zeros in runtimes list causing issues
  • Add check to skip loop break logic if no data exists

📝 Medium Priority

  1. Hardcoded magic numbers in should_stop() - consider making configurable
  2. Import comment placement inconsistency (line 13)
  3. Regex compiled in hot loop (line 353) - compile once at module level

🟡 Low Priority

  1. Unrelated formatting change in env_utils.py
  2. models.py refactoring could use explanatory comment
  3. should_stop() needs docstring
  4. Missing unit tests for new functions

Summary

MUST FIX: Issues 1-4 (type error and division by zero bugs)
SHOULD FIX: Issues 5-7 and 14 (unused code, edge cases, tests)
NICE TO HAVE: Issues 8-13 (code quality)

Great work on the feature! Core logic is sound but critical issues need addressing before merge.

mohammedahmed18 and others added 3 commits December 23, 2025 17:44
…25-12-22T14.11.46

⚡️ Speed up function `should_stop` by 438% in PR #967 (`exp/consistent-loop-break`)
@codeflash-ai
Copy link
Contributor

codeflash-ai bot commented Dec 23, 2025

This PR is now faster! 🚀 @mohammedahmed18 accepted my optimizations from:

@claude
Copy link

claude bot commented Dec 23, 2025

Code Review: Stop looping when runtime is stable (CF-934)

Thank you for this enhancement! The dynamic loop termination based on runtime stability is a smart optimization. Below is my detailed review:


Strengths

  1. Well-designed stability detection: The should_stop() function uses multiple metrics (median deviation, spread, slope) which provides robust convergence detection.
  2. Dynamic tolerance system: The dynamic_tolerance() method appropriately scales tolerance based on test duration—tighter thresholds for longer tests.
  3. Good refactoring: Extracting calculate_best_summed_runtime() into a reusable module improves code organization.

🐛 Critical Issues

1. Type annotation bug in best_summed_runtime.py:1

def calculate_best_summed_runtime(grouped_runtime_info: dict[any, list[int]]) -> int:

Issue: any should be Any (capitalized) from typing.

Fix: Add import and use proper type:

from typing import Any

def calculate_best_summed_runtime(grouped_runtime_info: dict[Any, list[int]]) -> int:

2. Potential division by zero in should_stop() at line 316

spread_ok = (r_max - r_min) / r_min <= spread_rel_tol

Issue: If r_min == 0, this will raise ZeroDivisionError.

Fix: Add guard condition:

spread_ok = r_min > 0 and (r_max - r_min) / r_min <= spread_rel_tol

3. Potential division by zero in should_stop() at line 323

slope_ok = (first - second) / first <= slope_rel_tol

Issue: If first == 0, this will crash.

Fix:

slope_ok = first == 0 or (first - second) / first <= slope_rel_tol

4. Potential division by zero in pytest_plugin.py:310

if abs(r - m) / m > center_rel_tol:

Issue: If the median m == 0, division will fail.

Fix: Add guard:

if m == 0 or abs(r - m) / m > center_rel_tol:
    centered = False
    break

5. get_runtime_from_stdout() missing error handling

Issue: int(payload[last_colon + 1 :]) at line 286 can raise ValueError if the payload is not a valid integer.

Fix: Wrap in try-except:

try:
    return int(payload[last_colon + 1 :])
except ValueError:
    return None

⚠️ Bugs & Logic Issues

6. Unused dynamic_tolerance() method

The dynamic_tolerance() method is defined at line 338 but never called anywhere in the codebase. This appears to be incomplete functionality.

Questions:

  • Should this be integrated into should_stop()?
  • Was this intended to replace the hardcoded tolerances in should_stop()?

Recommendation: Either integrate it or remove it to avoid dead code.

7. Empty runtime dict causes crash

At line 397:

best_runtime_until_now = calculate_best_summed_runtime(self.usable_runtime_data_by_test_case)

If usable_runtime_data_by_test_case is empty (no passing tests yet), calculate_best_summed_runtime() returns 0, which then gets appended to runtimes. This causes cascading zero-division errors in should_stop().

Fix: Add check before calling:

if self.usable_runtime_data_by_test_case:
    best_runtime_until_now = calculate_best_summed_runtime(self.usable_runtime_data_by_test_case)
    runtimes.append(best_runtime_until_now)

    if should_stop(runtimes):
        break

8. CONSISTENT_LOOP_COUNT is defined but unused

The new constant CONSISTENT_LOOP_COUNT = 3 is added in config_consts.py:11 but is never imported or used anywhere in the code. The PR description mentions it, but it's not actually leveraged.

Fix: Either use it in the should_stop() logic or remove it to avoid confusion.


🏗️ Code Quality Issues

9. Inconsistent import ordering

Lines 11-18 in pytest_plugin.py have the import of calculate_best_summed_runtime separated from other imports by a blank line, violating PEP 8 import grouping conventions.

Fix: Move it with other local imports or follow the existing pattern.

10. Magic numbers in should_stop()

The hardcoded values (warmup=4, window=6, tolerances) should be configurable or at least documented as constants at module level for easier tuning.

Recommendation:

# At module level
STABILITY_WARMUP_LOOPS = 4
STABILITY_WINDOW_SIZE = 6
STABILITY_CENTER_TOLERANCE = 0.01
STABILITY_SPREAD_TOLERANCE = 0.02
STABILITY_SLOPE_TOLERANCE = 0.01

11. Missing docstrings

The new functions get_runtime_from_stdout(), should_stop(), and dynamic_tolerance() lack docstrings explaining their purpose, parameters, and return values.


🔒 Security & Performance

12. Regex performance concern

Line 360 uses regex for every test report:

clean_id = re.sub(r"\s*\[\s*\d+\s*\]\s*$", "", report.nodeid)

Consider compiling the regex once at module level:

# At module level
_NODEID_BRACKET_PATTERN = re.compile(r"\s*\[\s*\d+\s*\]\s*$")

# In method
clean_id = _NODEID_BRACKET_PATTERN.sub("", report.nodeid)

13. Runtime data can grow unbounded

usable_runtime_data_by_test_case grows indefinitely as tests loop. For long-running sessions, this could consume significant memory.

Recommendation: Consider limiting the list size per test or documenting the expected behavior.


🧪 Test Coverage

14. No tests for new functionality

The new functions get_runtime_from_stdout(), should_stop(), and calculate_best_summed_runtime() have zero test coverage.

Recommendation: Add unit tests covering:

  • Edge cases (empty inputs, zero values, single element)
  • Convergence detection logic
  • Runtime extraction from various stdout formats
  • Error handling paths

📝 Documentation

15. Unclear behavior change

The removal of total_time = self._get_total_time(session) from inside the loop (line 386 in original) fundamentally changes the timeout logic. Now total_time is only checked once at loop start.

Question: Is this intentional? The condition at line 378 while total_time >= SHORTEST_AMOUNT_OF_TIME now uses a stale value.

Recommendation: Clarify in PR description or add comment explaining this change.


🎯 Recommendations Summary

Must Fix (Critical):

  1. Fix type annotation: anyAny
  2. Add zero-division guards in all division operations
  3. Add error handling in get_runtime_from_stdout()
  4. Handle empty usable_runtime_data_by_test_case before calling calculate_best_summed_runtime()

Should Fix:
5. Remove or integrate unused dynamic_tolerance() method
6. Remove unused CONSISTENT_LOOP_COUNT or implement its usage
7. Clarify the total_time behavior change

Nice to Have:
8. Add unit tests for new functions
9. Extract magic numbers to named constants
10. Add docstrings
11. Compile regex pattern once


Overall, this is a valuable enhancement, but the critical division-by-zero bugs and unused code need to be addressed before merging. Would you like help implementing any of these fixes?

@claude
Copy link

claude bot commented Dec 23, 2025

PR Review: Stop Looping When Runtime is Stable (CF-934)

Summary

This PR introduces intelligent loop termination based on runtime stability, replacing fixed iteration counts with dynamic detection. The implementation tracks test durations and stops when measurements stabilize, improving efficiency while maintaining accuracy.


🟢 Strengths

  1. Well-designed stability algorithm: The should_stop() function uses three complementary metrics (center clustering, spread tolerance, and slope detection) to robustly detect stability.

  2. Performance optimization integrated: PR ⚡️ Speed up function should_stop by 438% in PR #967 (exp/consistent-loop-break) #984 was merged, providing a 437% speedup to the should_stop() function - excellent proactive optimization!

  3. Configurable parameters: Stability thresholds are well-documented constants in config_consts.py with clear comments explaining their purpose.

  4. Clean refactoring: Extracting calculate_best_summed_runtime() to a separate module improves code reusability.


🟡 Issues & Recommendations

Critical: Type Annotation Bug

Location: codeflash/result/best_summed_runtime.py:1

def calculate_best_summed_runtime(grouped_runtime_info: dict[any, list[int]]) -> int:

Issue: any is not a valid type - should be Any from typing.

Fix:

from typing import Any

def calculate_best_summed_runtime(grouped_runtime_info: dict[Any, list[int]]) -> int:
    return sum([min(usable_runtime_data) for _, usable_runtime_data in grouped_runtime_info.items()])

High Priority: Edge Case Handling

1. Division by Zero in should_stop() (Line 320, 333)

Location: codeflash/verification/pytest_plugin.py:320,333

If m (median) is 0 or first is 0, division will raise ZeroDivisionError.

Recommendation:

# Line 320 - handle zero median
if m == 0 or abs(r - m) / m > center_rel_tol:
    centered = False
    break

# Line 333 - handle zero first half mean
if first == 0:
    slope_ok = second == 0  # both zero means stable
else:
    slope_ok = (first - second) / first <= slope_rel_tol

2. Empty Runtime Data in calculate_best_summed_runtime()

Location: codeflash/result/best_summed_runtime.py:2

If any test has an empty runtime list, min() will raise ValueError.

Recommendation: Add filtering for empty lists.


Medium Priority: Code Quality

1. Unnecessary List Comprehension

Location: codeflash/result/best_summed_runtime.py:2

The list comprehension brackets are unnecessary for sum().

Recommendation:

return sum(min(usable_runtime_data) for _, usable_runtime_data in grouped_runtime_info.items())

2. Missing Documentation

Add docstrings for new functions: get_runtime_from_stdout(), should_stop(), and calculate_best_summed_runtime().

3. Import Organization

The "# System Imports" comment on line 13 is misplaced - move to line 14 before from pathlib import Path.


Low Priority: Minor Improvements

  1. Extract magic strings !###### and ######! as module-level constants
  2. Add comment explaining manual median calculation (optimized per PR ⚡️ Speed up function should_stop by 438% in PR #967 (exp/consistent-loop-break) #984)

🔒 Security Assessment

No security concerns identified. The code performs only mathematical operations on runtime measurements.


🧪 Test Coverage

Major Gap: No unit tests for the new stability detection logic.

Recommendations:

  1. Add tests for should_stop() covering stable/unstable scenarios and edge cases
  2. Add tests for get_runtime_from_stdout() with valid/invalid input
  3. Add tests for calculate_best_summed_runtime() with empty lists

Suggested test file: tests/test_pytest_stability_detection.py


📊 Performance Considerations

Positive: Minimal overhead - runs once per loop iteration with O(n log n) complexity where n=6 (window size). Already optimized by PR #984 (437% speedup).


✅ Summary

Well-designed enhancement with intelligent stability detection. Main concerns:

  1. Critical: Fix anyAny type annotation bug
  2. High: Add edge case handling for division by zero and empty lists
  3. Medium: Add comprehensive unit tests
  4. Low: Add documentation and minor improvements

Overall Assessment: ✅ Approve with required changes

Great work on the stability algorithm and proactive optimization! 🚀

@aseembits93
Copy link
Contributor

@misrasaurabh1 @mohammedahmed18 claude reviews are way too frequent, let's do something about it

N_CANDIDATES_LP = 6

# pytest loop stability
STABILITY_WARMUP_LOOPS = 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how did you determine these magic numbers @mohammedahmed18 ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants