Skip to content

Conversation

@mohammedahmed18
Copy link
Contributor

@mohammedahmed18 mohammedahmed18 commented Jun 3, 2025

User description

Details

Applied the user-provided formatter commands to the file with the optimized function, calculated the unified diff, and skipped formatting if the number of changed lines exceeds 100 (configurable).


PR Type

Enhancement, Tests


Description

  • Add diff-count check to skip large formatting

  • Integrate should_format_file into formatter logic

  • Introduce sample files for formatting tests

  • Add tests for skip/format behavior based on diff size


Changes walkthrough 📝

Relevant files
Enhancement
formatter.py
Add diff-based formatting skip                                                     

codeflash/code_utils/formatter.py

  • Introduce should_format_file detecting black diff count
  • Use skip logic in format_code when diff too large
  • Add debug and warning logs for skip and errors
  • +34/-2   
    Tests
    test_formatter.py
    Add tests for formatting skip logic                                           

    tests/test_formatter.py

  • Add _run_formatting_test helper for formatting validation
  • Add tests for few vs many diff scenarios
  • Update imports and test setup for new functionality
  • +68/-0   
    Miscellaneous
    few_formatting_errors.py
    Add few-formatting-errors sample file                                       

    code_to_optimize/few_formatting_errors.py

  • Add sample file with minor formatting issues
  • Define BadlyFormattedClass and process_data function
  • +47/-0   
    many_formatting_errors.py
    Add many-formatting-errors sample file                                     

    code_to_optimize/many_formatting_errors.py

  • Add extensive sample file with major formatting errors
  • Include classes, utility functions, and nested logic
  • +147/-0 

    Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • @github-actions
    Copy link

    github-actions bot commented Jun 3, 2025

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Performance Overhead

    should_format_file invokes black --version and black --diff for every file, which can be slow when formatting multiple files. Consider caching the version check or batching diff operations.

    def should_format_file(filepath, max_lines_changed=100):
            try:
                # check if black is installed
                subprocess.run(['black', '--version'], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    
                result = subprocess.run(
                    ['black', '--diff', filepath], 
                    capture_output=True, 
                    text=True
    Hardcoded Threshold

    The max_lines_changed threshold is hardcoded to 100. Making this value configurable (e.g., via CLI argument or config file) would allow teams to adjust the limit per project needs.

    def should_format_file(filepath, max_lines_changed=100):
            try:
                # check if black is installed
                subprocess.run(['black', '--version'], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    
                result = subprocess.run(
                    ['black', '--diff', filepath], 
                    capture_output=True, 
                    text=True
                )
    
                diff_lines = [line for line in result.stdout.split('\n') 
                            if line.startswith(('+', '-')) and not line.startswith(('+++', '---'))]
    
                changes_count = len(diff_lines)
    
                if changes_count > max_lines_changed:
                    logger.debug(f"Skipping {filepath}: {changes_count} lines would change (max: {max_lines_changed})")
                    return False
    
                return True
    Unused Imports

    parse_config_file and sort_imports are imported but not used in the new tests. Removing these imports will eliminate unnecessary dependencies and lint warnings.

    import argparse
    import os
    import tempfile
    from pathlib import Path
    
    import pytest
    import shutil
    
    from codeflash.code_utils.config_parser import parse_config_file
    from codeflash.code_utils.formatter import format_code, sort_imports

    @github-actions
    Copy link

    github-actions bot commented Jun 3, 2025

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix invalid attribute assignment

    Remove the stray spaces before the attribute name to correct the syntax and assign
    to self.address instead of an invalid attribute.

    code_to_optimize/few_formatting_errors.py [19]

    -self.   address = address
    +self.address = address
    Suggestion importance[1-10]: 10

    __

    Why: The line self. address = address is a syntax error due to stray spaces before address, so fixing it to self.address is critical for the code to run.

    High
    Remove undefined import

    Remove the unused import sort_imports (which is not defined in formatter.py) to
    prevent ImportError since tests do not invoke it.

    tests/test_formatter.py [10]

    -from codeflash.code_utils.formatter import format_code, sort_imports
    +from codeflash.code_utils.formatter import format_code
    Suggestion importance[1-10]: 9

    __

    Why: The test imports sort_imports which does not exist in formatter.py, causing an ImportError; removing it is essential for the tests to execute.

    High
    General
    Use string path for subprocess

    Convert the filepath argument to a string (e.g. via filepath.as_posix()) before
    passing to subprocess.run to avoid type errors on some platforms.

    codeflash/code_utils/formatter.py [22-25]

     subprocess.run(
    -    ['black', '--diff', filepath], 
    +    ['black', '--diff', filepath.as_posix()], 
         capture_output=True, 
         text=True
     )
    Suggestion importance[1-10]: 3

    __

    Why: Converting filepath to a string via filepath.as_posix() can avoid potential type issues, but modern subprocess.run accepts Path objects so this is a minor enhancement.

    Low

    @mohammedahmed18 mohammedahmed18 requested review from KRRT7, aseembits93 and misrasaurabh1 and removed request for KRRT7 and aseembits93 June 3, 2025 16:39
    codeflash-ai bot added a commit that referenced this pull request Jun 3, 2025
    …formatting-for-large-diffs`)
    
    Here is an optimized version of your program.  
    Key Improvements.
    - Avoids splitting all lines and list allocation; instead, iterates only as needed and sums matches (saves both memory and runtime).
    - Eliminates the inner function and replaces it with a fast inline check.
    
    
    
    **Why this is faster:**  
    - Uses a simple for-loop instead of building a list.
    - Checks first character directly—less overhead than calling `startswith` multiple times.
    - Skips the closure.  
    - No intermediate list storage.
    
    The function result and behavior are identical.
    @codeflash-ai
    Copy link
    Contributor

    codeflash-ai bot commented Jun 3, 2025

    ⚡️ Codeflash found optimizations for this PR

    📄 15% (0.15x) speedup for get_diff_lines_count in codeflash/code_utils/formatter.py

    ⏱️ Runtime : 3.67 milliseconds 3.19 milliseconds (best of 257 runs)

    I created a new dependent PR with the suggested changes. Please review:

    If you approve, it will be merged into this PR (branch skip-formatting-for-large-diffs).

    codeflash-ai bot added a commit that referenced this pull request Jun 3, 2025
    …formatting-for-large-diffs`)
    
    Here’s a much faster rewrite. The main overhead is in the list comprehension, the function call for every line, and building the temporary list (`diff_lines`). Instead, use a generator expression (which avoids building the list in memory) and inline the test logic.
    
    
    
    **Explanation of changes:**
    - Removed the nested function to avoid repeated function call overhead.
    - Converted the list comprehension to a generator expression fed to `sum()`, so only the count is accumulated (no intermediate list).
    - Inlined the test logic inside the generator for further speed.
    
    This version will be significantly faster and lower on memory usage, especially for large diff outputs.  
    
    If you have profile results after this, you’ll see the difference is dramatic!
    @codeflash-ai
    Copy link
    Contributor

    codeflash-ai bot commented Jun 3, 2025

    ⚡️ Codeflash found optimizations for this PR

    📄 10% (0.10x) speedup for get_diff_lines_count in codeflash/code_utils/formatter.py

    ⏱️ Runtime : 2.28 milliseconds 2.06 milliseconds (best of 309 runs)

    I created a new dependent PR with the suggested changes. Please review:

    If you approve, it will be merged into this PR (branch skip-formatting-for-large-diffs).

    @mohammedahmed18 mohammedahmed18 requested a review from KRRT7 June 3, 2025 20:58
    codeflash-ai bot added a commit that referenced this pull request Jun 3, 2025
    …formatting-for-large-diffs`)
    
    Here is a **much faster** rewrite. The biggest bottleneck was constructing the entire `diff_lines` list just to count its length. Instead, loop directly through the lines and count matching lines, avoiding extra memory and function call overhead. This also removes the small overhead of the nested function.
    
    
    
    ### Optimizations made.
    - **No internal list allocation:** Now iterating and counting in one pass with no extra list.
    - **No inner function call:** Faster, via direct string checks.
    - **Short-circuit on empty:** Avoids string indexing on empty lines.
    - **Direct char compare for '+', '-':** Faster than using tuple membership or `startswith` with a tuple.
    
    This reduces both runtime **and** memory usage by avoiding unnecessary data structures!
    codeflash-ai bot added a commit that referenced this pull request Jun 4, 2025
    …formatting-for-large-diffs`)
    
    Here's an optimized version of your program. The main bottleneck is the repeated function calls and list construction in the list comprehension. **Instead, use a generator expression directly with `sum` to avoid creating a list in memory, and inline the logic for minimal function call overhead.**  
    The manual string check logic is **inlined for speed**.
    
    
    
    **Why this is faster:**
    - Eliminates the creation of an intermediate list.
    - Eliminates repeated function call overhead by inlining conditions.
    - Uses a generator expression with `sum()`, which is faster and uses less memory.
    
    The output is **identical** to before. All comments are preserved in their original spirit.
    codeflash-ai bot added a commit that referenced this pull request Jun 4, 2025
    …-formatting-for-large-diffs`)
    
    Here is an optimized version of your program.  
    Key improvements.
    - Remove the regular expression and use the built-in `splitlines(keepends=True)`, which is **significantly** faster for splitting text into lines, especially on large files.
    - Use `extend` instead of repeated `append` calls for cases with two appends.
    - Minor local optimizations (localize function, reduce attribute lookups).
    
    
    
    **Performance explanation**.
    - The regex-based splitting was responsible for a significant portion of time. `str.splitlines(keepends=True)` is implemented in C and avoids unnecessary regex matching.
    - Using local variable lookups (e.g. `append = diff_output.append`) is slightly faster inside loops that append frequently.
    - `extend` is ever-so-slightly faster (in CPython) than multiple `append` calls for the rare "no newline" case.
    
    ---
    **This code produces exactly the same output as your original, but should be much faster (especially for large inputs).**
    @codeflash-ai
    Copy link
    Contributor

    codeflash-ai bot commented Jun 4, 2025

    ⚡️ Codeflash found optimizations for this PR

    📄 99% (0.99x) speedup for generate_unified_diff in codeflash/code_utils/formatter.py

    ⏱️ Runtime : 15.3 milliseconds 7.66 milliseconds (best of 269 runs)

    I created a new dependent PR with the suggested changes. Please review:

    If you approve, it will be merged into this PR (branch skip-formatting-for-large-diffs).

    )
    diff_lines_count = get_diff_lines_count(diff_output)

    max_diff_lines = min(int(original_code_lines * 0.3), 50)
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    why are we hardcoding this 30% or 50 lines logic?

    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    The overall formatting other than the optimized function could be too small but annoying if its around formatting.
    And when we are only looking at code other than optimzied function, there could be changes in helper functions or imports or global variables too, so how are we sure of this count of 50?

    f"Skipping formatting {path}: {diff_lines_count} lines would change (max: {max_diff_lines})"
    )
    return original_code

    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    nit: Just add a small ToDO to optimize the optimized function alone and avoid optimizing the whole file again.

    Saga4
    Saga4 previously approved these changes Jun 10, 2025
    Saga4
    Saga4 previously approved these changes Jun 10, 2025
    Fix ruff lint
    @Saga4 Saga4 merged commit 10e8a13 into main Jun 10, 2025
    16 checks passed
    KRRT7 pushed a commit that referenced this pull request Jun 10, 2025
    * check large diffs with black, and skipp formatting in such case (after optimizing)
    
    * new line
    
    * better log messages
    
    * remove unnecessary check
    
    * new line
    
    * remove unused comment
    
    * the max lines for formatting changes to 100
    
    * refactoring
    
    * refactoring and improvements
    
    * added black as dev dependency
    
    * made some refactor changes that codeflash suggested
    
    * remove unused function
    
    * formatting & using internal black dep
    
    * fix black import issue
    
    * handle formatting files with no formatting issues
    
    * use user pre-defined formatting commands, instead of using black
    
    * make sure format_code recieves file path as path type not as str
    
    * formatting and linting
    
    * typo
    
    * revert lock file changes
    
    * remove comment
    
    * pass helper functions source code to the formatter for diff checking
    
    * more unit tests
    
    * enhancements
    
    * Update formatter.py
    
    add a todo comment
    
    * Update formatter.py
    
    Fix ruff lint
    
    ---------
    
    Co-authored-by: Sarthak Agarwal <sarthak.saga@gmail.com>
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    4 participants