⚡️ Speed up function group_broken_paragraphs by 30%
#4088
Here’s an optimized version of your code, preserving all function signatures, return values, and comments.

**Key improvements:**
- **Precompile regexes** inside the functions where they are used repeatedly.
- **Avoid repeated `.strip()` and `.split()`** calls in tight loops by working with stripped data directly.
- **Reduce intermediate allocations** (like unnecessary list comps).
- **Optimize `all_lines_short` computation** by short-circuiting iteration (`any` instead of `all` and negating logic).
- Minimize calls to regex replace by using direct substitution when possible.

**Summary of key speedups:**
- Precompiled regex references up-front, so there is no repeated compile.
- Reordered bullet-matching logic for early fast-path continue.
- Short-circuit `all_lines_short`: break on the first long line.
- Avoids unnecessary double stripping/splitting.
- Uses precompiled regexes even when constants may be strings.

This version will be noticeably faster, especially for large documents or tight loops.
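A minimal sketch of two of the techniques listed above, precompiled regexes and a short-circuiting line-length check, may help make them concrete. This is not the code from `unstructured/cleaners/core.py`: the patterns, the `MAX_SHORT_WORDS` threshold, and the names `has_long_line` and `group_paragraphs_sketch` are hypothetical and chosen only for illustration, and the bullet handling is deliberately simplified.

```python
import re

# Hypothetical patterns and threshold, for illustration only.
# Compiling once at import time avoids re-resolving the pattern on every call.
PARAGRAPH_SPLIT_RE = re.compile(r"\n\s*\n")
BULLET_RE = re.compile(r"^\s*[-*•]\s+")
MAX_SHORT_WORDS = 5


def has_long_line(lines, max_words=MAX_SHORT_WORDS):
    """Return True as soon as one line has at least `max_words` words.

    `any` with a generator stops at the first long line, so the remaining
    lines are never split or counted.
    """
    return any(len(line.split()) >= max_words for line in lines)


def group_paragraphs_sketch(text):
    """Join soft-wrapped lines inside each paragraph.

    Paragraphs that start with a bullet, or that contain only short lines,
    keep their line breaks (simplified relative to any real implementation).
    """
    result = []
    for paragraph in PARAGRAPH_SPLIT_RE.split(text):
        stripped = paragraph.strip()  # strip once and reuse the result
        if not stripped:
            continue
        lines = stripped.split("\n")
        if BULLET_RE.match(stripped) or not has_long_line(lines):
            # Fast path: keep each non-empty line as its own paragraph.
            result.extend(line.strip() for line in lines if line.strip())
        else:
            result.append(" ".join(line.strip() for line in lines))
    return "\n\n".join(result)
```

With these assumed settings, a block such as `"Apache License\nVersion 2.0"` keeps its two lines separate, while a wrapped sentence is merged into a single line.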
Co-authored-by: qued <64741807+qued@users.noreply.github.com>
@qued That was an error on our part, I have committed the suggested change. I can confirm that the existing tests pass after this change. The PR is ready to merge.

@qued everything passing except the docker CI, logs seem to indicate
syncing changelog across PRs
Yes, this was an external issue, and is now resolved.
📄 30% (0.30x) speedup for `group_broken_paragraphs` in `unstructured/cleaners/core.py`

⏱️ Runtime: 21.2 milliseconds → 16.3 milliseconds (best of 66 runs)
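The reported figures are consistent with a speedup computed as the ratio of the two runtimes minus one: 21.2 ms / 16.3 ms ≈ 1.30, i.e. about 30% faster. The sketch below shows one way to take a best-of-N timing and derive that percentage. The helper names `best_of` and `percent_speedup` are hypothetical, and this is not necessarily how the benchmarking tool measures runtime.

```python
import timeit


def best_of(func, *args, repeat=5, number=10):
    """Return the fastest per-call time in seconds across `repeat` timing runs."""
    totals = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    # Taking the minimum reduces noise from other processes on the machine.
    return min(totals) / number


def percent_speedup(before, after):
    """Percentage speedup when runtime drops from `before` to `after`."""
    return (before / after - 1.0) * 100.0


# Using the runtimes reported in this PR (milliseconds).
reported = percent_speedup(21.2, 16.3)
```

Here `reported` rounds to 30, which matches the headline number.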
✅ Correctness verification report:

⚙️ Existing Unit Tests and Runtime
- `cleaners/test_core.py::test_group_broken_paragraphs`
- `cleaners/test_core.py::test_group_broken_paragraphs_non_default_settings`
- `partition/test_text.py::test_partition_text_groups_broken_paragraphs`
- `test_tracer_py__replay_test_0.py::test_unstructured_cleaners_core_group_broken_paragraphs`

🌀 Generated Regression Tests and Runtime
To edit these changes, `git checkout codeflash/optimize-group_broken_paragraphs-mcg8s57e` and push.