Avoid formatting that causes large diffs (CF-637) #274
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
PR Code Suggestions ✨
Explore these optional code suggestions:
…formatting-for-large-diffs`) Here is an optimized version of your program.
Key Improvements.
- Avoids splitting all lines and list allocation; instead, iterates only as needed and sums matches (saves both memory and runtime).
- Eliminates the inner function and replaces it with a fast inline check.
**Why this is faster:**
- Uses a simple for-loop instead of building a list.
- Checks the first character directly, with less overhead than calling `startswith` multiple times.
- Skips the closure.
- No intermediate list storage.
The function result and behavior are identical.
⚡️ Codeflash found optimizations for this PR 📄 15% (0.15x) speedup for
…formatting-for-large-diffs`) Here’s a much faster rewrite. The main overhead is in the list comprehension, the function call for every line, and building the temporary list (`diff_lines`). Instead, use a generator expression (which avoids building the list in memory) and inline the test logic.
**Explanation of changes:**
- Removed the nested function to avoid repeated function call overhead.
- Converted the list comprehension to a generator expression fed to `sum()`, so only the count is accumulated (no intermediate list).
- Inlined the test logic inside the generator for further speed.
This version will be significantly faster and lower on memory usage, especially for large diff outputs. If you have profile results after this, you’ll see the difference is dramatic!
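The generator-based counting described in this comment can be sketched roughly as follows. The PR's actual function body is not shown in this thread, so the name `count_changed_lines_sum` and the rule for what counts as a changed line (a `+` or `-` prefix, excluding the `+++`/`---` file headers) are assumptions made for illustration.

```python
def count_changed_lines_sum(diff_output: str) -> int:
    """Count added/removed lines in a unified diff (hypothetical helper).

    Assumption: a changed line starts with '+' or '-', and the '+++'/'---'
    file-header lines are not counted.
    """
    return sum(
        1
        for line in diff_output.splitlines()
        # line[:1] is "" for an empty line, so no IndexError is possible.
        if line[:1] in ("+", "-") and line[:3] not in ("+++", "---")
    )


sample_diff = (
    "--- a/example.py\n"
    "+++ b/example.py\n"
    "@@ -1,2 +1,2 @@\n"
    "-x=1\n"
    "+x = 1\n"
    " y = 2\n"
)
changed = count_changed_lines_sum(sample_diff)
```

Note that `splitlines()` still builds a list of lines; only the second, filtered list is avoided. Measured gains from this kind of change tend to be modest, in line with the 10% figure reported for this suggestion.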
⚡️ Codeflash found optimizations for this PR 📄 10% (0.10x) speedup for
…formatting-for-large-diffs`) Here is a **much faster** rewrite. The biggest bottleneck was constructing the entire `diff_lines` list just to count its length. Instead, loop directly through the lines and count matching lines, avoiding extra memory and function call overhead. This also removes the small overhead of the nested function.
### Optimizations made.
- **No internal list allocation:** Now iterating and counting in one pass with no extra list.
- **No inner function call:** Faster, via direct string checks.
- **Short-circuit on empty:** Avoids string indexing on empty lines.
- **Direct char compare for '+', '-':** Faster than using tuple membership or `startswith` with a tuple.
This reduces both runtime **and** memory usage by avoiding unnecessary data structures!
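A minimal sketch of the single-pass loop this comment describes, with the short-circuit on empty lines and the direct first-character comparison. The function name `count_changed_lines_loop` and the header-skipping rule are assumptions, since the original code is not reproduced here.

```python
def count_changed_lines_loop(diff_output: str) -> int:
    """Count '+'/'-' lines of a unified diff in one pass (hypothetical helper)."""
    count = 0
    for line in diff_output.splitlines():
        if not line:
            # Short-circuit: never index into an empty line.
            continue
        first = line[0]
        if first == "+":
            # Assumption: '+++ b/file' is a header, not a change.
            if not line.startswith("+++"):
                count += 1
        elif first == "-":
            # Assumption: '--- a/file' is a header, not a change.
            if not line.startswith("---"):
                count += 1
    return count


sample_diff = (
    "--- a/example.py\n"
    "+++ b/example.py\n"
    "@@ -1,3 +1,3 @@\n"
    "-import os,sys\n"
    "+import os\n"
    "+import sys\n"
    " print(os.getcwd())\n"
)
changed = count_changed_lines_loop(sample_diff)
```

One limitation of any prefix-based header check: a removed source line that itself begins with `--` appears in the diff as `---…` and would be skipped.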
…formatting-for-large-diffs`) Here's an optimized version of your program. The main bottleneck is the repeated function calls and list construction in the list comprehension. **Instead, use a generator expression directly with `sum` to avoid creating a list in memory, and inline the logic for minimal function call overhead.** The manual string check logic is **inlined for speed**.
**Why this is faster:**
- Eliminates the creation of an intermediate list.
- Eliminates repeated function call overhead by inlining conditions.
- Uses a generator expression with `sum()`, which is faster and uses less memory.
The output is **identical** to before. All comments are preserved in their original spirit.
…h-ai/codeflash into skip-formatting-for-large-diffs
…-formatting-for-large-diffs`) Here is an optimized version of your program.
Key improvements.
- Remove the regular expression and use the built-in `splitlines(keepends=True)`, which is **significantly** faster for splitting text into lines, especially on large files.
- Use `extend` instead of repeated `append` calls for cases with two appends.
- Minor local optimizations (localize function, reduce attribute lookups).
**Performance explanation**.
- The regex-based splitting was responsible for a significant portion of time. `str.splitlines(keepends=True)` is implemented in C and avoids unnecessary regex matching.
- Using local variable lookups (e.g. `append = diff_output.append`) is slightly faster inside loops that append frequently.
- `extend` is ever-so-slightly faster (in CPython) than multiple `append` calls for the rare "no newline" case.
---
**This code produces exactly the same output as your original, but should be much faster (especially for large inputs).**
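To make the regex-versus-`splitlines` swap concrete, here is a small sketch. The original regular expression is not shown in this thread, so `_LINE_RE` below is a hypothetical stand-in that splits on `\n` only. The sketch also shows one caveat worth checking before relying on the "exactly the same output" claim: `str.splitlines` treats several other characters (for example form feed, `\x0c`) as line boundaries.

```python
import re

# Hypothetical stand-in for the regex the PR removed: each match is one line,
# keeping its trailing "\n" when present.
_LINE_RE = re.compile(r"[^\n]*\n|[^\n]+")


def split_with_regex(text: str) -> list[str]:
    return _LINE_RE.findall(text)


def split_with_splitlines(text: str) -> list[str]:
    # keepends=True preserves the line terminators, like the regex above.
    return text.splitlines(keepends=True)


plain = "a = 1\nb = 2\nno newline at end"
same_on_plain_text = split_with_regex(plain) == split_with_splitlines(plain)

# A form feed is a line boundary for str.splitlines but not for a "\n"-only regex.
with_form_feed = "a\x0cb\n"
regex_parts = split_with_regex(with_form_feed)
splitlines_parts = split_with_splitlines(with_form_feed)
```

For text that only uses `\n` terminators the two approaches agree; for input that may contain other separators, the results can differ.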
⚡️ Codeflash found optimizations for this PR 📄 99% (0.99x) speedup for
)
diff_lines_count = get_diff_lines_count(diff_output)

max_diff_lines = min(int(original_code_lines * 0.3), 50)
why are we hardcoding this 30% or 50 lines logic?
Formatting changes outside the optimized function may be small, but they are still annoying when they are purely formatting changes.
And when we look only at code outside the optimized function, there could also be changes in helper functions, imports, or global variables, so how can we be sure that 50 is the right count?
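To make the threshold under discussion easier to reason about, here is a small worked sketch of the expression `min(int(original_code_lines * 0.3), 50)` from the reviewed line. The wrapper function `max_allowed_diff_lines` and its parameter names are hypothetical; only the formula comes from the diff.

```python
def max_allowed_diff_lines(
    original_code_lines: int, ratio: float = 0.3, cap: int = 50
) -> int:
    """Hypothetical wrapper around the reviewed formula: 30% of the file, capped at 50."""
    return min(int(original_code_lines * ratio), cap)


small_file = max_allowed_diff_lines(3)      # int(0.9) truncates to 0
medium_file = max_allowed_diff_lines(25)    # int(7.5) truncates to 7
large_file = max_allowed_diff_lines(1000)   # 30% would be 300, so the cap applies
```

Two consequences follow from the arithmetic: files of three lines or fewer get a limit of 0, so any formatting change would be skipped, and the cap of 50 takes over once 30% of the file reaches 50 lines.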
codeflash/code_utils/formatter.py
Outdated
f"Skipping formatting {path}: {diff_lines_count} lines would change (max: {max_diff_lines})"
)
return original_code
nit: Just add a small TODO to optimize the optimized function alone and avoid optimizing the whole file again.
Fix ruff lint
* check large diffs with black, and skip formatting in such case (after optimizing)
* new line
* better log messages
* remove unnecessary check
* new line
* remove unused comment
* the max lines for formatting changes to 100
* refactoring
* refactoring and improvements
* added black as dev dependency
* made some refactor changes that codeflash suggested
* remove unused function
* formatting & using internal black dep
* fix black import issue
* handle formatting files with no formatting issues
* use user pre-defined formatting commands, instead of using black
* make sure format_code receives file path as path type not as str
* formatting and linting
* typo
* revert lock file changes
* remove comment
* pass helper functions source code to the formatter for diff checking
* more unit tests
* enhancements
* Update formatter.py add a todo comment
* Update formatter.py Fix ruff lint
---------
Co-authored-by: Sarthak Agarwal <sarthak.saga@gmail.com>
User description
Details
Applies the user-provided formatter commands to the file containing the optimized function, computes the unified diff, and skips formatting if the number of changed lines exceeds 100 (configurable).
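The flow described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the names `count_diff_lines` and `format_unless_large_diff` are hypothetical, and a plain Python callable stands in for the user-provided formatter commands, which the real code presumably runs as external commands.

```python
import difflib
from typing import Callable


def count_diff_lines(original: str, formatted: str) -> int:
    """Count added/removed lines in the unified diff of two sources (hypothetical)."""
    diff = difflib.unified_diff(
        original.splitlines(), formatted.splitlines(), lineterm=""
    )
    # Assumption: '+'/'-' lines are changes; '+++'/'---' lines are file headers.
    return sum(
        1
        for line in diff
        if line[:1] in ("+", "-") and line[:3] not in ("+++", "---")
    )


def format_unless_large_diff(
    original: str,
    formatter: Callable[[str], str],  # stand-in for the user's formatter commands
    max_diff_lines: int = 100,        # the description's default; configurable
) -> str:
    """Return the formatted code, or the original if too many lines would change."""
    formatted = formatter(original)
    if count_diff_lines(original, formatted) > max_diff_lines:
        return original
    return formatted


source = "x=1\ny=2\n"
toy_formatter = lambda code: code.replace("=", " = ")  # toy formatter for the demo

kept = format_unless_large_diff(source, toy_formatter)                     # small diff: formatted
skipped = format_unless_large_diff(source, toy_formatter, max_diff_lines=3)  # over limit: original
```

Returning the original text unchanged when the limit is exceeded keeps the resulting PR diff focused on the optimization itself, which is the motivation stated in the PR title.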
PR Type
Enhancement, Tests
Description
Add diff-count check to skip large formatting
Integrate should_format_file into formatter logic
Introduce sample files for formatting tests
Add tests for skip/format behavior based on diff size
Changes walkthrough 📝
formatter.py: Add diff-based formatting skip (codeflash/code_utils/formatter.py)
test_formatter.py: Add tests for formatting skip logic (tests/test_formatter.py)
few_formatting_errors.py: Add few-formatting-errors sample file (code_to_optimize/few_formatting_errors.py)
many_formatting_errors.py: Add many-formatting-errors sample file (code_to_optimize/many_formatting_errors.py)