⚡️ Speed up function `group_broken_paragraphs` by 30% #4088

aseembits93 · 2025-08-26T16:59:04Z

📄 30% (0.30x) speedup for `group_broken_paragraphs` in `unstructured/cleaners/core.py`

⏱️ Runtime : 21.2 milliseconds → 16.3 milliseconds (best of 66 runs)

📝 Explanation and details

Here’s an optimized version of your code, preserving all function signatures, return values, and comments.
Key improvements:

Precompile regexes inside the functions where they are used repeatedly.
Avoid repeated .strip() and .split() calls in tight loops by working with stripped data directly.
Reduce intermediate allocations (like unnecessary list comps).
Optimize all_lines_short computation by short-circuiting iteration (any instead of all and negating logic).
Minimize calls to regex replace by using direct substitution when possible.

Summary of key speedups.

Precompiled regex references up-front—no repeated compile.
Reordered bullet-matching logic for early fast-path continue.
Short-circuit all_lines_short: break on the first long line.
Avoids unnecessary double stripping/splitting.
Uses precompiled regexes even when constants may be strings.

This version will be noticeably faster, especially for large documents or tight loops.

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	✅ 58 Passed
🌀 Generated Regression Tests	✅ 49 Passed
⏪ Replay Tests	✅ 6 Passed
🔎 Concolic Coverage Tests	🔘 None Found
📊 Tests Coverage	100.0%

⚙️ Existing Unit Tests and Runtime

Test File::Test Function	Original ⏱️	Optimized ⏱️	Speedup
`cleaners/test_core.py::test_group_broken_paragraphs`	19.5μs	16.1μs	✅21.0%
`cleaners/test_core.py::test_group_broken_paragraphs_non_default_settings`	23.9μs	21.7μs	✅10.2%
`partition/test_text.py::test_partition_text_groups_broken_paragraphs`	1.97ms	1.96ms	✅0.347%
`test_tracer_py__replay_test_0.py::test_unstructured_cleaners_core_group_broken_paragraphs`	161μs	119μs	✅34.9%

🌀 Generated Regression Tests and Runtime

from __future__ import annotations

import re

# imports
import pytest  # used for our unit tests
from unstructured.cleaners.core import group_broken_paragraphs

# Dummy patterns for testing (since unstructured.nlp.patterns is unavailable)
# These are simplified versions for the sake of testing
DOUBLE_PARAGRAPH_PATTERN_RE = re.compile(r"\n\s*\n")
E_BULLET_PATTERN = re.compile(r"^\s*e\s+", re.MULTILINE)
PARAGRAPH_PATTERN = re.compile(r"\n")
PARAGRAPH_PATTERN_RE = re.compile(r"\n")
# Unicode bullets for test
UNICODE_BULLETS_RE = re.compile(r"^\s*[•○·]", re.MULTILINE)
from unstructured.cleaners.core import group_broken_paragraphs

# unit tests

# -------------------- BASIC TEST CASES --------------------

def test_empty_string():
    # Test that empty input returns empty string
    codeflash_output = group_broken_paragraphs('') # 1.38μs -> 2.69μs (48.7% slower)

def test_single_line():
    # Test that a single line is returned unchanged
    codeflash_output = group_broken_paragraphs('Hello world.') # 6.58μs -> 6.83μs (3.68% slower)

def test_two_paragraphs_with_double_newline():
    # Test that two paragraphs separated by double newline are preserved
    text = "First paragraph.\nSecond line.\n\nSecond paragraph.\nAnother line."
    expected = "First paragraph. Second line.\n\nSecond paragraph. Another line."
    codeflash_output = group_broken_paragraphs(text) # 13.7μs -> 14.2μs (3.07% slower)

def test_paragraphs_with_single_line_breaks():
    # Test that lines in a paragraph are joined with spaces
    text = "The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
    expected = "The big red fox is walking down the lane.\n\nAt the end of the lane the fox met a bear."
    codeflash_output = group_broken_paragraphs(text) # 18.8μs -> 16.2μs (15.7% faster)

def test_bullet_points():
    # Test bullet points are handled and line breaks inside bullets are joined
    text = "• The big red fox\nis walking down the lane.\n\n• At the end of the lane\nthe fox met a bear."
    expected = [
        "• The big red fox is walking down the lane.",
        "• At the end of the lane the fox met a bear."
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 33.4μs -> 19.7μs (69.7% faster)

def test_e_bullet_points():
    # Test pytesseract e-bullet conversion is handled
    text = "e The big red fox\nis walking down the lane.\n\ne At the end of the lane\nthe fox met a bear."
    # e should be converted to ·
    expected = [
        "· The big red fox is walking down the lane.",
        "· At the end of the lane the fox met a bear."
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 27.8μs -> 16.9μs (64.3% faster)

def test_short_lines_not_grouped():
    # Test that lines with <5 words are not grouped
    text = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    expected = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    codeflash_output = group_broken_paragraphs(text) # 10.5μs -> 11.5μs (8.37% slower)

def test_mixed_bullet_and_normal():
    # Test that a mix of bullets and normal paragraphs works
    text = (
        "• First bullet\nis split\n\n"
        "A normal paragraph\nwith line break.\n\n"
        "• Second bullet\nis also split"
    )
    expected = [
        "• First bullet is split",
        "A normal paragraph with line break.",
        "• Second bullet is also split"
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 31.2μs -> 21.3μs (46.3% faster)

# -------------------- EDGE TEST CASES --------------------

def test_all_whitespace():
    # Test input of only whitespace returns empty string
    codeflash_output = group_broken_paragraphs('   \n   ') # 3.52μs -> 4.19μs (16.1% slower)

def test_only_newlines():
    # Test input of only newlines returns empty string
    codeflash_output = group_broken_paragraphs('\n\n\n') # 2.44μs -> 3.46μs (29.7% slower)

def test_single_bullet_with_no_linebreaks():
    # Test bullet point with no line breaks is preserved
    text = "• A bullet point with no line breaks."
    codeflash_output = group_broken_paragraphs(text) # 15.3μs -> 8.46μs (81.1% faster)

def test_paragraph_with_multiple_consecutive_newlines():
    # Test that multiple consecutive newlines are treated as paragraph breaks
    text = "First para.\n\n\nSecond para.\n\n\n\nThird para."
    expected = "First para.\n\nSecond para.\n\nThird para."
    codeflash_output = group_broken_paragraphs(text) # 11.4μs -> 11.6μs (1.56% slower)

def test_leading_and_trailing_newlines():
    # Test that leading and trailing newlines are ignored
    text = "\n\nFirst para.\nSecond line.\n\nSecond para.\n\n"
    expected = "First para. Second line.\n\nSecond para."
    codeflash_output = group_broken_paragraphs(text) # 11.9μs -> 12.5μs (4.58% slower)

def test_bullet_point_with_leading_spaces():
    # Test bullet with leading whitespace is handled
    text = "   • Bullet with leading spaces\nand a line break."
    expected = "• Bullet with leading spaces and a line break."
    codeflash_output = group_broken_paragraphs(text) # 18.4μs -> 10.6μs (73.3% faster)

def test_unicode_bullets():
    # Test that various unicode bullets are handled
    text = "○ Unicode bullet\nline two.\n\n· Another unicode bullet\nline two."
    expected = [
        "○ Unicode bullet line two.",
        "· Another unicode bullet line two."
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 27.7μs -> 15.7μs (75.8% faster)

def test_short_lines_with_blank_lines():
    # Test that short lines with blank lines are preserved and not grouped
    text = "Title\n\nSubtitle\n\n2024"
    expected = "Title\n\nSubtitle\n\n2024"
    codeflash_output = group_broken_paragraphs(text) # 9.66μs -> 10.1μs (4.73% slower)

def test_mixed_short_and_long_lines():
    # Test a paragraph with both short and long lines
    text = "Title\nThis is a long line that should be grouped with the next.\nAnother long line."
    expected = "Title This is a long line that should be grouped with the next. Another long line."
    codeflash_output = group_broken_paragraphs(text) # 14.9μs -> 13.2μs (13.3% faster)

def test_bullet_point_with_inner_blank_lines():
    # Test bullet points with inner blank lines
    text = "• Bullet one\n\n• Bullet two\n\n• Bullet three"
    expected = [
        "• Bullet one",
        "• Bullet two",
        "• Bullet three"
    ]
    codeflash_output = group_broken_paragraphs(text); result = codeflash_output # 24.9μs -> 13.7μs (81.4% faster)

def test_paragraph_with_tabs_and_spaces():
    # Test paragraphs with tabs and spaces are grouped correctly
    text = "First\tparagraph\nis here.\n\n\tSecond paragraph\nis here."
    expected = "First\tparagraph is here.\n\n\tSecond paragraph is here."
    codeflash_output = group_broken_paragraphs(text) # 12.4μs -> 12.4μs (0.314% slower)

# -------------------- LARGE SCALE TEST CASES --------------------

def test_large_number_of_paragraphs():
    # Test function with 500 paragraphs
    paras = ["Paragraph {} line 1\nParagraph {} line 2".format(i, i) for i in range(500)]
    text = "\n\n".join(paras)
    expected = "\n\n".join(["Paragraph {} line 1 Paragraph {} line 2".format(i, i) for i in range(500)])
    codeflash_output = group_broken_paragraphs(text) # 1.79ms -> 1.69ms (5.66% faster)

def test_large_number_of_bullets():
    # Test function with 500 bullet points, each split over two lines
    bullets = ["• Bullet {} part 1\nBullet {} part 2".format(i, i) for i in range(500)]
    text = "\n\n".join(bullets)
    expected = "\n\n".join(["• Bullet {} part 1 Bullet {} part 2".format(i, i) for i in range(500)])
    codeflash_output = group_broken_paragraphs(text) # 3.72ms -> 1.88ms (97.3% faster)

def test_large_mixed_content():
    # Test function with 200 normal paragraphs and 200 bullet paragraphs
    paras = ["Normal para {} line 1\nNormal para {} line 2".format(i, i) for i in range(200)]
    bullets = ["• Bullet {} part 1\nBullet {} part 2".format(i, i) for i in range(200)]
    # Interleave them
    text = "\n\n".join([item for pair in zip(paras, bullets) for item in pair])
    expected = "\n\n".join([
        "Normal para {} line 1 Normal para {} line 2".format(i, i)
        for i in range(200)
    ] + [
        "• Bullet {} part 1 Bullet {} part 2".format(i, i)
        for i in range(200)
    ])
    # Since we interleaved, need to interleave expected as well
    expected = "\n\n".join([
        val for pair in zip(
            ["Normal para {} line 1 Normal para {} line 2".format(i, i) for i in range(200)],
            ["• Bullet {} part 1 Bullet {} part 2".format(i, i) for i in range(200)]
        ) for val in pair
    ])
    codeflash_output = group_broken_paragraphs(text) # 2.48ms -> 1.59ms (55.8% faster)

def test_performance_on_large_text():
    # Test that the function can handle a large block of text efficiently (not a correctness test)
    big_text = "This is a line in a very big paragraph.\n" * 999
    # Should be grouped into a single paragraph with spaces
    expected = " ".join(["This is a line in a very big paragraph."] * 999)
    codeflash_output = group_broken_paragraphs(big_text) # 2.62ms -> 2.62ms (0.161% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from __future__ import annotations

import re

# imports
import pytest  # used for our unit tests
from unstructured.cleaners.core import group_broken_paragraphs

# Dummy regexes for test purposes (since we don't have unstructured.nlp.patterns)
DOUBLE_PARAGRAPH_PATTERN_RE = re.compile(r"\n\s*\n")
E_BULLET_PATTERN = re.compile(r"^e\s")
PARAGRAPH_PATTERN = re.compile(r"\n")
PARAGRAPH_PATTERN_RE = re.compile(r"\n")
UNICODE_BULLETS_RE = re.compile(r"^[\u2022\u2023\u25E6\u2043\u2219\u25AA\u25CF\u25CB\u25A0\u25A1\u25B2\u25B3\u25BC\u25BD\u25C6\u25C7\u25C9\u25CB\u25D8\u25D9\u25E6\u2605\u2606\u2765\u2767\u29BE\u29BF\u25A0-\u25FF]")
from unstructured.cleaners.core import group_broken_paragraphs

# unit tests

# -------------------------------
# 1. Basic Test Cases
# -------------------------------

def test_single_paragraph_joined():
    # Should join lines in a single paragraph into one line
    text = "The big red fox\nis walking down the lane."
    expected = "The big red fox is walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 11.2μs -> 9.78μs (14.9% faster)

def test_multiple_paragraphs():
    # Should join lines in each paragraph, and keep paragraphs separate
    text = "The big red fox\nis walking down the lane.\n\nAt the end of the lane\nthe fox met a bear."
    expected = "The big red fox is walking down the lane.\n\nAt the end of the lane the fox met a bear."
    codeflash_output = group_broken_paragraphs(text) # 17.7μs -> 15.7μs (13.0% faster)

def test_preserve_double_newlines():
    # Double newlines should be preserved as paragraph breaks
    text = "Para one line one\nPara one line two.\n\nPara two line one\nPara two line two."
    expected = "Para one line one Para one line two.\n\nPara two line one Para two line two."
    codeflash_output = group_broken_paragraphs(text) # 13.8μs -> 14.0μs (1.43% slower)

def test_short_lines_not_joined():
    # Short lines (less than 5 words) should not be joined, but kept as separate lines
    text = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    expected = "Apache License\nVersion 2.0, January 2004\nhttp://www.apache.org/licenses/"
    codeflash_output = group_broken_paragraphs(text) # 10.7μs -> 11.2μs (4.59% slower)

def test_bullet_points_grouped():
    # Bullet points with line breaks should be joined into single lines per bullet
    text = "• The big red fox\nis walking down the lane.\n\n• At the end of the lane\nthe fox met a bear."
    expected = "• The big red fox is walking down the lane.\n\n• At the end of the lane the fox met a bear."
    codeflash_output = group_broken_paragraphs(text) # 35.4μs -> 21.1μs (68.0% faster)

def test_e_bullet_points_grouped():
    # 'e' as bullet should be replaced and grouped
    text = "e The big red fox\nis walking down the lane."
    expected = "· The big red fox is walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 17.5μs -> 10.9μs (61.7% faster)

# -------------------------------
# 2. Edge Test Cases
# -------------------------------

def test_empty_string():
    # Empty string should return empty string
    codeflash_output = group_broken_paragraphs("") # 1.13μs -> 2.03μs (44.3% slower)

def test_only_newlines():
    # String of only newlines should return empty string
    codeflash_output = group_broken_paragraphs("\n\n\n") # 2.70μs -> 3.52μs (23.1% slower)

def test_spaces_and_newlines():
    # String of spaces and newlines should return empty string
    codeflash_output = group_broken_paragraphs("   \n  \n\n  ") # 2.91μs -> 3.90μs (25.4% slower)

def test_single_word():
    # Single word should be returned as is
    codeflash_output = group_broken_paragraphs("Hello") # 5.77μs -> 6.09μs (5.24% slower)

def test_single_line_paragraphs():
    # Multiple single-line paragraphs separated by double newlines
    text = "First para.\n\nSecond para.\n\nThird para."
    expected = "First para.\n\nSecond para.\n\nThird para."
    codeflash_output = group_broken_paragraphs(text) # 11.3μs -> 12.0μs (5.89% slower)

def test_paragraph_with_trailing_newlines():
    # Paragraph with trailing newlines should be handled
    text = "The big red fox\nis walking down the lane.\n\n"
    expected = "The big red fox is walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 12.7μs -> 11.1μs (13.6% faster)

def test_bullet_with_extra_spaces():
    # Bullet with extra spaces and newlines
    text = "  •   The quick brown\nfox jumps over\n  the lazy dog.  "
    expected = "•   The quick brown fox jumps over   the lazy dog.  "
    codeflash_output = group_broken_paragraphs(text) # 22.5μs -> 12.6μs (78.1% faster)

def test_mixed_bullets_and_normal():
    # Mixed bullet and non-bullet paragraphs
    text = "• Bullet one\ncontinues here.\n\nNormal para\ncontinues here."
    expected = "• Bullet one continues here.\n\nNormal para continues here."
    codeflash_output = group_broken_paragraphs(text) # 22.0μs -> 15.6μs (40.8% faster)

def test_multiple_bullet_styles():
    # Multiple Unicode bullet styles
    text = "• Bullet A\nline two.\n\n◦ Bullet B\nline two."
    expected = "• Bullet A line two.\n\n◦ Bullet B line two."
    codeflash_output = group_broken_paragraphs(text) # 23.7μs -> 12.4μs (90.4% faster)

def test_short_and_long_lines_mixed():
    # A paragraph with both short and long lines
    text = "Short\nThis is a much longer line that should be joined\nAnother short"
    # Only the first and last lines are short, but the presence of a long line means the paragraph will be joined
    expected = "Short This is a much longer line that should be joined Another short"
    codeflash_output = group_broken_paragraphs(text) # 14.1μs -> 12.7μs (10.9% faster)

def test_paragraph_with_tabs():
    # Paragraph with tabs instead of spaces
    text = "The big red fox\tis walking down the lane."
    expected = "The big red fox\tis walking down the lane."
    codeflash_output = group_broken_paragraphs(text) # 9.45μs -> 7.96μs (18.7% faster)

def test_bullet_with_leading_newline():
    # Bullet point with a leading newline
    text = "\n• Bullet with leading newline\ncontinues here."
    expected = "• Bullet with leading newline continues here."
    codeflash_output = group_broken_paragraphs(text) # 18.7μs -> 9.98μs (87.2% faster)

def test_bullet_with_trailing_newline():
    # Bullet point with a trailing newline
    text = "• Bullet with trailing newline\ncontinues here.\n"
    expected = "• Bullet with trailing newline continues here."
    codeflash_output = group_broken_paragraphs(text) # 17.2μs -> 9.58μs (79.6% faster)

def test_unicode_bullet_variants():
    # Test with a variety of Unicode bullets
    text = "● Unicode bullet one\ncontinues\n\n○ Unicode bullet two\ncontinues"
    expected = "● Unicode bullet one continues\n\n○ Unicode bullet two continues"
    codeflash_output = group_broken_paragraphs(text) # 24.3μs -> 13.8μs (76.7% faster)

def test_multiple_empty_paragraphs():
    # Multiple empty paragraphs between text
    text = "First para.\n\n\n\nSecond para."
    expected = "First para.\n\nSecond para."
    codeflash_output = group_broken_paragraphs(text) # 9.26μs -> 9.85μs (6.00% slower)

# -------------------------------
# 3. Large Scale Test Cases
# -------------------------------

def test_large_number_of_paragraphs():
    # 500 paragraphs, each with two lines to be joined
    paras = ["Line one {}\nLine two {}".format(i, i) for i in range(500)]
    text = "\n\n".join(paras)
    expected = "\n\n".join(["Line one {} Line two {}".format(i, i) for i in range(500)])
    codeflash_output = group_broken_paragraphs(text) # 1.36ms -> 1.29ms (5.79% faster)

def test_large_number_of_bullets():
    # 300 bullet points, each with two lines
    paras = ["• Bullet {}\ncontinues here.".format(i) for i in range(300)]
    text = "\n\n".join(paras)
    expected = "\n\n".join(["• Bullet {} continues here.".format(i) for i in range(300)])
    codeflash_output = group_broken_paragraphs(text) # 1.98ms -> 969μs (104% faster)

def test_large_mixed_content():
    # Mix of 200 normal paras and 200 bullets
    normal_paras = ["Normal {}\ncontinues".format(i) for i in range(200)]
    bullet_paras = ["• Bullet {}\ncontinues".format(i) for i in range(200)]
    all_paras = []
    for i in range(200):
        all_paras.append(normal_paras[i])
        all_paras.append(bullet_paras[i])
    text = "\n\n".join(all_paras)
    expected = "\n\n".join([
        "Normal {} continues".format(i) if j % 2 == 0 else "• Bullet {} continues".format(i//2)
        for j, i in enumerate(range(400))
    ])
    # Fix expected to match the correct sequence
    expected = "\n\n".join(
        ["Normal {} continues".format(i) for i in range(200)] +
        ["• Bullet {} continues".format(i) for i in range(200)]
    )
    # The function will process in order, so we need to interleave
    interleaved = []
    for i in range(200):
        interleaved.append("Normal {} continues".format(i))
        interleaved.append("• Bullet {} continues".format(i))
    expected = "\n\n".join(interleaved)
    codeflash_output = group_broken_paragraphs(text)

def test_large_short_lines():
    # 1000 short lines, all should be preserved as is (not joined)
    text = "\n".join(["A {}".format(i) for i in range(1000)])
    expected = "\n".join(["A {}".format(i) for i in range(1000)])
    codeflash_output = group_broken_paragraphs(text) # 605μs -> 565μs (7.11% faster)

def test_large_paragraph_with_long_lines():
    # One paragraph with 1000 long lines (should be joined into one)
    text = "\n".join(["This is a long line number {}".format(i) for i in range(1000)])
    expected = " ".join(["This is a long line number {}".format(i) for i in range(1000)])
    codeflash_output = group_broken_paragraphs(text) # 2.11ms -> 2.09ms (1.10% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes git checkout codeflash/optimize-group_broken_paragraphs-mcg8s57e and push.

Here’s an optimized version of your code, preserving all function signatures, return values, and comments. **Key improvements:** - **Precompile regexes** inside the functions where they are used repeatedly. - **Avoid repeated `.strip()` and `.split()`** calls in tight loops by working with stripped data directly. - **Reduce intermediate allocations** (like unnecessary list comps). - **Optimize `all_lines_short` computation** by short-circuiting iteration (`any` instead of `all` and negating logic). - Minimize calls to regex replace by using direct substitution when possible. **Summary of key speedups**. - Precompiled regex references up-front—no repeated compile. - Reordered bullet-matching logic for early fast-path continue. - Short-circuit `all_lines_short`: break on the first long line. - Avoids unnecessary double stripping/splitting. - Uses precompiled regexes even when constants may be strings. This version will be noticeably faster, especially for large documents or tight loops.

…cg8s57e

unstructured/cleaners/core.py

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

aseembits93 · 2025-08-28T20:50:37Z

@qued That was an error on our part, I have committed the suggested change. I can confirm that the existing tests pass after this change. The PR is ready to merge.

unstructured/cleaners/core.py

…cg8s57e

aseembits93 · 2025-09-05T21:17:05Z

@qued everything passing except the docker CI, logs seem to indicate poppler-utils not present in the system. I am facing this issue in another PR of mine (#4087 (comment)). Hope to get this resolved! let me know if there's anything I could do from my end, Best,

syncing changelog across PRs

…cg8s57e

qued · 2025-09-09T22:56:38Z

everything passing except the docker CI, logs seem to indicate poppler-utils not present in the system.

Yes, this was an external issue, and is now resolved.

codeflash-ai bot and others added 5 commits June 28, 2025 12:53

revert some changes

13b8651

cleanup diff

d544434

cleaning up

b357515

Merge branch 'main' into codeflash/optimize-group_broken_paragraphs-m…

dce5386

…cg8s57e

cragwolfe requested a review from qued August 26, 2025 23:58

qued reviewed Aug 28, 2025

View reviewed changes

unstructured/cleaners/core.py Outdated Show resolved Hide resolved

Update unstructured/cleaners/core.py

225b96e

Co-authored-by: qued <64741807+qued@users.noreply.github.com>

ruff check fix

e799eb8

qued reviewed Aug 29, 2025

View reviewed changes

unstructured/cleaners/core.py Outdated Show resolved Hide resolved

reverting to all()

e57b8ed

aseembits93 requested a review from qued September 2, 2025 20:30

aseembits93 and others added 2 commits September 2, 2025 17:18

Merge branch 'main' into codeflash/optimize-group_broken_paragraphs-m…

fb4db35

…cg8s57e

Update changelog and version

cd3f422

aseembits93 and others added 3 commits September 5, 2025 16:48

Update CHANGELOG.md

d80d138

syncing changelog across PRs

Merge branch 'main' into codeflash/optimize-group_broken_paragraphs-m…

5305036

…cg8s57e

Update CHANGELOG.md

2ea504c

qued enabled auto-merge September 9, 2025 22:56

qued approved these changes Sep 9, 2025

View reviewed changes

qued added this pull request to the merge queue Sep 9, 2025

Merged via the queue into Unstructured-IO:main with commit 6aee131 Sep 10, 2025
38 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

⚡️ Speed up function `group_broken_paragraphs` by 30% #4088

⚡️ Speed up function `group_broken_paragraphs` by 30% #4088

aseembits93 commented Aug 26, 2025

Uh oh!

Uh oh!

aseembits93 commented Aug 28, 2025

Uh oh!

Uh oh!

aseembits93 commented Sep 5, 2025

Uh oh!

qued commented Sep 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

⚡️ Speed up function group_broken_paragraphs by 30% #4088

⚡️ Speed up function group_broken_paragraphs by 30% #4088

Conversation

aseembits93 commented Aug 26, 2025

📄 30% (0.30x) speedup for group_broken_paragraphs in unstructured/cleaners/core.py

📝 Explanation and details

Uh oh!

Uh oh!

aseembits93 commented Aug 28, 2025

Uh oh!

Uh oh!

aseembits93 commented Sep 5, 2025

Uh oh!

qued commented Sep 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

⚡️ Speed up function `group_broken_paragraphs` by 30% #4088

⚡️ Speed up function `group_broken_paragraphs` by 30% #4088

📄 30% (0.30x) speedup for `group_broken_paragraphs` in `unstructured/cleaners/core.py`