Avoid formatting that causes large diffs (CF-637) #274

mohammedahmed18 · 2025-06-03T15:18:45Z

User description

Details

Applied the user-provided formatter commands to the file with the optimized function, calculated the unified diff, and skipped formatting if the number of changed lines exceeds 100 (configurable).

PR Type

Enhancement, Tests

Description

Add diff-count check to skip large formatting
Integrate should_format_file into formatter logic
Introduce sample files for formatting tests
Add tests for skip/format behavior based on diff size

Changes walkthrough 📝

Relevant files

Enhancement

formatter.py `Add diff-based formatting skip` codeflash/code_utils/formatter.py Introduce should_format_file detecting black diff count Use skip logic in format_code when diff too large Add debug and warning logs for skip and errors	+34/-2

Tests

test_formatter.py `Add tests for formatting skip logic` tests/test_formatter.py Add _run_formatting_test helper for formatting validation Add tests for few vs many diff scenarios Update imports and test setup for new functionality	+68/-0

Miscellaneous

few_formatting_errors.py `Add few-formatting-errors sample file` code_to_optimize/few_formatting_errors.py Add sample file with minor formatting issues Define BadlyFormattedClass and process_data function	+47/-0
many_formatting_errors.py `Add many-formatting-errors sample file` code_to_optimize/many_formatting_errors.py Add extensive sample file with major formatting errors Include classes, utility functions, and nested logic	+147/-0

Need help?
Type /help how to ... in the comments thread for any questions about PR-Agent usage.
Check out the documentation for more information.

…r optimizing)

github-actions · 2025-06-03T15:20:10Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review Performance Overhead `should_format_file` invokes `black --version` and `black --diff` for every file, which can be slow when formatting multiple files. Consider caching the version check or batching diff operations. def should_format_file(filepath, max_lines_changed=100): try: # check if black is installed subprocess.run(['black', '--version'], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) result = subprocess.run( ['black', '--diff', filepath], capture_output=True, text=True Hardcoded Threshold The `max_lines_changed` threshold is hardcoded to 100. Making this value configurable (e.g., via CLI argument or config file) would allow teams to adjust the limit per project needs. def should_format_file(filepath, max_lines_changed=100): try: # check if black is installed subprocess.run(['black', '--version'], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) result = subprocess.run( ['black', '--diff', filepath], capture_output=True, text=True ) diff_lines = [line for line in result.stdout.split('\n') if line.startswith(('+', '-')) and not line.startswith(('+++', '---'))] changes_count = len(diff_lines) if changes_count > max_lines_changed: logger.debug(f"Skipping {filepath}: {changes_count} lines would change (max: {max_lines_changed})") return False return True Unused Imports `parse_config_file` and `sort_imports` are imported but not used in the new tests. Removing these imports will eliminate unnecessary dependencies and lint warnings. import argparse import os import tempfile from pathlib import Path import pytest import shutil from codeflash.code_utils.config_parser import parse_config_file from codeflash.code_utils.formatter import format_code, sort_imports

github-actions · 2025-06-03T15:21:29Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
Possible issue	Fix invalid attribute assignment Remove the stray spaces before the attribute name to correct the syntax and assign to `self.address` instead of an invalid attribute. code_to_optimize/few_formatting_errors.py [19] -self. address = address +self.address = address Suggestion importance[1-10]: 10 __ Why: The line `self. address = address` is a syntax error due to stray spaces before `address`, so fixing it to `self.address` is critical for the code to run.	High
Possible issue	Remove undefined import Remove the unused import `sort_imports` (which is not defined in `formatter.py`) to prevent ImportError since tests do not invoke it. tests/test_formatter.py [10] -from codeflash.code_utils.formatter import format_code, sort_imports +from codeflash.code_utils.formatter import format_code Suggestion importance[1-10]: 9 __ Why: The test imports `sort_imports` which does not exist in `formatter.py`, causing an `ImportError`; removing it is essential for the tests to execute.	High
General	Use string path for subprocess Convert the `filepath` argument to a string (e.g. via `filepath.as_posix()`) before passing to `subprocess.run` to avoid type errors on some platforms. codeflash/code_utils/formatter.py [22-25] subprocess.run( - ['black', '--diff', filepath], + ['black', '--diff', filepath.as_posix()], capture_output=True, text=True ) Suggestion importance[1-10]: 3 __ Why: Converting `filepath` to a string via `filepath.as_posix()` can avoid potential type issues, but modern `subprocess.run` accepts `Path` objects so this is a minor enhancement.	Low

codeflash/code_utils/formatter.py

tests/test_formatter.py

…formatting-for-large-diffs`) Here is an optimized version of your program. Key Improvements. - Avoids splitting all lines and list allocation; instead, iterates only as needed and sums matches (saves both memory and runtime). - Eliminates the inner function and replaces it with a fast inline check. **Why this is faster:** - Uses a simple for-loop instead of building a list. - Checks first character directly—less overhead than calling `startswith` multiple times. - Skips the closure. - No intermediate list storage. The function result and behavior are identical.

codeflash-ai · 2025-06-03T20:48:36Z

⚡️ Codeflash found optimizations for this PR

📄 15% (0.15x) speedup for `get_diff_lines_count` in `codeflash/code_utils/formatter.py`

⏱️ Runtime : 3.67 milliseconds → 3.19 milliseconds (best of 257 runs)

I created a new dependent PR with the suggested changes. Please review:

⚡️ Speed up function get_diff_lines_count by 15% in PR #274 (skip-formatting-for-large-diffs) #276

If you approve, it will be merged into this PR (branch skip-formatting-for-large-diffs).

…formatting-for-large-diffs`) Here’s a much faster rewrite. The main overhead is in the list comprehension, the function call for every line, and building the temporary list (`diff_lines`). Instead, use a generator expression (which avoids building the list in memory) and inline the test logic. **Explanation of changes:** - Removed the nested function to avoid repeated function call overhead. - Converted the list comprehension to a generator expression fed to `sum()`, so only the count is accumulated (no intermediate list). - Inlined the test logic inside the generator for further speed. This version will be significantly faster and lower on memory usage, especially for large diff outputs. If you have profile results after this, you’ll see the difference is dramatic!

codeflash-ai · 2025-06-03T20:55:55Z

⚡️ Codeflash found optimizations for this PR

📄 10% (0.10x) speedup for `get_diff_lines_count` in `codeflash/code_utils/formatter.py`

⏱️ Runtime : 2.28 milliseconds → 2.06 milliseconds (best of 309 runs)

I created a new dependent PR with the suggested changes. Please review:

⚡️ Speed up function get_diff_lines_count by 10% in PR #274 (skip-formatting-for-large-diffs) #277

If you approve, it will be merged into this PR (branch skip-formatting-for-large-diffs).

…formatting-for-large-diffs`) Here is a **much faster** rewrite. The biggest bottleneck was constructing the entire `diff_lines` list just to count its length. Instead, loop directly through the lines and count matching lines, avoiding extra memory and function call overhead. This also removes the small overhead of the nested function. ### Optimizations made. - **No internal list allocation:** Now iterating and counting in one pass with no extra list. - **No inner function call:** Faster, via direct string checks. - **Short-circuit on empty:** Avoids string indexing on empty lines. - **Direct char compare for '+', '-':** Faster than using tuple membership or `startswith` with a tuple. This reduces both runtime **and** memory usage by avoiding unnecessary data structures!

…formatting-for-large-diffs`) Here's an optimized version of your program. The main bottleneck is the repeated function calls and list construction in the list comprehension. **Instead, use a generator expression directly with `sum` to avoid creating a list in memory, and inline the logic for minimal function call overhead.** The manual string check logic is **inlined for speed**. **Why this is faster:** - Eliminates the creation of an intermediate list. - Eliminates repeated function call overhead by inlining conditions. - Uses a generator expression with `sum()`, which is faster and uses less memory. The output is **identical** to before. All comments are preserved in their original spirit.

codeflash/code_utils/formatter.py

…h-ai/codeflash into skip-formatting-for-large-diffs

…-formatting-for-large-diffs`) Here is an optimized version of your program. Key improvements. - Remove the regular expression and use the built-in `splitlines(keepends=True)`, which is **significantly** faster for splitting text into lines, especially on large files. - Use `extend` instead of repeated `append` calls for cases with two appends. - Minor local optimizations (localize function, reduce attribute lookups). **Performance explanation**. - The regex-based splitting was responsible for a significant portion of time. `str.splitlines(keepends=True)` is implemented in C and avoids unnecessary regex matching. - Using local variable lookups (e.g. `append = diff_output.append`) is slightly faster inside loops that append frequently. - `extend` is ever-so-slightly faster (in CPython) than multiple `append` calls for the rare "no newline" case. --- **This code produces exactly the same output as your original, but should be much faster (especially for large inputs).**

codeflash-ai · 2025-06-04T23:11:06Z

⚡️ Codeflash found optimizations for this PR

📄 99% (0.99x) speedup for `generate_unified_diff` in `codeflash/code_utils/formatter.py`

⏱️ Runtime : 15.3 milliseconds → 7.66 milliseconds (best of 269 runs)

I created a new dependent PR with the suggested changes. Please review:

⚡️ Speed up function generate_unified_diff by 99% in PR #274 (skip-formatting-for-large-diffs) #288

If you approve, it will be merged into this PR (branch skip-formatting-for-large-diffs).

…rmatting-for-large-diffs

Saga4 · 2025-06-09T22:02:15Z

codeflash/code_utils/formatter.py

+            )
+            diff_lines_count = get_diff_lines_count(diff_output)
+
+            max_diff_lines = min(int(original_code_lines * 0.3), 50)


why are we hardcoding this 30% or 50 lines logic?

The overall formatting other than the optimized function could be too small but annoying if its around formatting.
And when we are only looking at code other than optimzied function, there could be changes in helper functions or imports or global variables too, so how are we sure of this count of 50?

Saga4 · 2025-06-10T10:07:18Z

codeflash/code_utils/formatter.py

+                    f"Skipping formatting {path}: {diff_lines_count} lines would change (max: {max_diff_lines})"
+                )
+                return original_code
+


nit: Just add a small ToDO to optimize the optimized function alone and avoid optimizing the whole file again.

add a todo comment

Fix ruff lint

* check large diffs with black, and skipp formatting in such case (after optimizing) * new line * better log messages * remove unnecessary check * new line * remove unused comment * the max lines for formatting changes to 100 * refactoring * refactoring and improvements * added black as dev dependency * made some refactor changes that codeflash suggested * remove unused function * formatting & using internal black dep * fix black import issue * handle formatting files with no formatting issues * use user pre-defined formatting commands, instead of using black * make sure format_code recieves file path as path type not as str * formatting and linting * typo * revert lock file changes * remove comment * pass helper functions source code to the formatter for diff checking * more unit tests * enhancements * Update formatter.py add a todo comment * Update formatter.py Fix ruff lint --------- Co-authored-by: Sarthak Agarwal <sarthak.saga@gmail.com>

mohammedahmed18 added 7 commits June 3, 2025 11:56

check large diffs with black, and skipp formatting in such case (afte…

c75bbf6

…r optimizing)

new line

5cd13ad

better log messages

1522227

remove unnecessary check

d3ca1cb

new line

dcb084a

remove unused comment

689a2d9

the max lines for formatting changes to 100

44c0f85

github-actions bot added the Review effort 4/5 label Jun 3, 2025

refactoring

73ef518

mohammedahmed18 requested review from KRRT7, aseembits93 and misrasaurabh1 and removed request for KRRT7 and aseembits93 June 3, 2025 16:39

KRRT7 requested changes Jun 3, 2025

View reviewed changes

codeflash/code_utils/formatter.py Outdated Show resolved Hide resolved

tests/test_formatter.py Outdated Show resolved Hide resolved

tests/test_formatter.py Outdated Show resolved Hide resolved

refactoring and improvements

a5343fd

codeflash-ai bot mentioned this pull request Jun 3, 2025

⚡️ Speed up function get_diff_lines_count by 15% in PR #274 (skip-formatting-for-large-diffs) #276

Closed

added black as dev dependency

395855d

codeflash-ai bot mentioned this pull request Jun 3, 2025

⚡️ Speed up function get_diff_lines_count by 10% in PR #274 (skip-formatting-for-large-diffs) #277

Closed

made some refactor changes that codeflash suggested

822d6cc

mohammedahmed18 requested a review from KRRT7 June 3, 2025 20:58

codeflash-ai bot mentioned this pull request Jun 3, 2025

⚡️ Speed up function get_diff_lines_count by 19% in PR #274 (skip-formatting-for-large-diffs) #278

Closed

codeflash-ai bot reviewed Jun 4, 2025

View reviewed changes

codeflash/code_utils/formatter.py Outdated Show resolved Hide resolved

Saga4 and others added 3 commits June 5, 2025 02:35

Merge branch 'main' into skip-formatting-for-large-diffs

aed490d

use user pre-defined formatting commands, instead of using black

82a4ee1

Merge branch 'skip-formatting-for-large-diffs' of github.com:codeflas…

90014bd

…h-ai/codeflash into skip-formatting-for-large-diffs

codeflash-ai bot mentioned this pull request Jun 4, 2025

⚡️ Speed up function generate_unified_diff by 99% in PR #274 (skip-formatting-for-large-diffs) #288

Closed

mohammedahmed18 added 5 commits June 5, 2025 02:25

make sure format_code recieves file path as path type not as str

caeda49

formatting and linting

6967fcb

typo

8248c8e

revert lock file changes

15aacdb

remove comment

c24fc90

mohammedahmed18 requested a review from misrasaurabh1 June 5, 2025 00:09

mohammedahmed18 added 4 commits June 5, 2025 03:12

pass helper functions source code to the formatter for diff checking

b48e9e6

Merge branch 'main' of github.com:codeflash-ai/codeflash into skip-fo…

93070a9

…rmatting-for-large-diffs

more unit tests

64f2dd9

enhancements

a1510a3

Saga4 reviewed Jun 9, 2025

View reviewed changes

Saga4 reviewed Jun 10, 2025

View reviewed changes

Saga4 previously approved these changes Jun 10, 2025

View reviewed changes

Saga4 added 2 commits June 10, 2025 15:37

Merge branch 'main' into skip-formatting-for-large-diffs

6f97004

Update formatter.py

6cb8469

add a todo comment

Saga4 dismissed their stale review via 6cb8469 June 10, 2025 10:10

Saga4 previously approved these changes Jun 10, 2025

View reviewed changes

Update formatter.py

94e64d3

Fix ruff lint

Saga4 dismissed their stale review via 94e64d3 June 10, 2025 10:12

Saga4 approved these changes Jun 10, 2025

View reviewed changes

Saga4 merged commit 10e8a13 into main Jun 10, 2025
16 checks passed

Avoid formatting that causes large diffs (CF-637) #274

Avoid formatting that causes large diffs (CF-637) #274

Uh oh!

Conversation

mohammedahmed18 commented Jun 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

User description

Details

PR Type

Description

Changes walkthrough 📝

Uh oh!

github-actions bot commented Jun 3, 2025

PR Reviewer Guide 🔍

Uh oh!

github-actions bot commented Jun 3, 2025

PR Code Suggestions ✨

Uh oh!

Uh oh!

Uh oh!

Uh oh!

codeflash-ai bot commented Jun 3, 2025

⚡️ Codeflash found optimizations for this PR

📄 15% (0.15x) speedup for get_diff_lines_count in codeflash/code_utils/formatter.py

I created a new dependent PR with the suggested changes. Please review:

⚡️ Speed up function get_diff_lines_count by 15% in PR #274 (skip-formatting-for-large-diffs) #276

Uh oh!

codeflash-ai bot commented Jun 3, 2025

⚡️ Codeflash found optimizations for this PR

📄 10% (0.10x) speedup for get_diff_lines_count in codeflash/code_utils/formatter.py

I created a new dependent PR with the suggested changes. Please review:

⚡️ Speed up function get_diff_lines_count by 10% in PR #274 (skip-formatting-for-large-diffs) #277

Uh oh!

Uh oh!

codeflash-ai bot commented Jun 4, 2025

⚡️ Codeflash found optimizations for this PR

📄 99% (0.99x) speedup for generate_unified_diff in codeflash/code_utils/formatter.py

I created a new dependent PR with the suggested changes. Please review:

⚡️ Speed up function generate_unified_diff by 99% in PR #274 (skip-formatting-for-large-diffs) #288

Uh oh!

Saga4 Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

Saga4 Jun 9, 2025

Choose a reason for hiding this comment

Uh oh!

Saga4 Jun 10, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mohammedahmed18 commented Jun 3, 2025 •

edited

Loading

📄 15% (0.15x) speedup for `get_diff_lines_count` in `codeflash/code_utils/formatter.py`

⚡️ Speed up function `get_diff_lines_count` by 15% in PR #274 (`skip-formatting-for-large-diffs`) #276

📄 10% (0.10x) speedup for `get_diff_lines_count` in `codeflash/code_utils/formatter.py`

⚡️ Speed up function `get_diff_lines_count` by 10% in PR #274 (`skip-formatting-for-large-diffs`) #277

📄 99% (0.99x) speedup for `generate_unified_diff` in `codeflash/code_utils/formatter.py`

⚡️ Speed up function `generate_unified_diff` by 99% in PR #274 (`skip-formatting-for-large-diffs`) #288