⚡️ Speed up function _assign_hash_ids by 34%
#4089
Conversation
The optimization replaces `itertools.groupby` with a simple dictionary-based counting approach in the `_assign_hash_ids` function.

**Key change:** Instead of creating intermediate lists (`page_numbers` and `page_seq_numbers`) and using `itertools.groupby`, the optimized version uses a dictionary `page_seq_counts` to track sequence numbers for each page in a single pass.

**Why it's faster:**

- **Eliminates list comprehensions:** The original code creates a full `page_numbers` list upfront, then processes it with `groupby`. The optimized version processes elements directly without intermediate collections.
- **Removes `itertools.groupby` overhead:** `groupby` builds a per-group iterator and performs key comparisons for every element, which adds interpreter overhead. The dictionary lookup `page_seq_counts.get(page_number, 0)` is a single O(1) operation per element.
- **Single-pass processing:** Instead of two passes (first to collect page numbers, then to generate sequences), the optimization does everything in one loop through the elements.

**Performance characteristics:** The optimization is particularly effective for documents with many pages or elements, as shown in the test results where empty lists see 300%+ speedups. The 34% overall speedup demonstrates the efficiency gain from eliminating the `itertools.groupby` bottleneck, which consumed 19.5% + 6.3% of the original runtime according to the line profiler.
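The counting idea described above can be sketched as follows. This is a minimal illustration of the two strategies, not the exact `unstructured` implementation; the function names are made up for the comparison, and only the per-page sequence numbering is shown:

```python
from itertools import groupby


def seq_numbers_groupby(page_numbers):
    """Original-style approach: group consecutive equal page numbers,
    then enumerate within each group (two conceptual passes)."""
    return [i for _, group in groupby(page_numbers) for i, _ in enumerate(group)]


def seq_numbers_counting(page_numbers):
    """Optimized approach: one pass with a dict of per-page counters,
    each lookup and update being O(1)."""
    page_seq_counts = {}
    seq = []
    for page_number in page_numbers:
        n = page_seq_counts.get(page_number, 0)
        seq.append(n)
        page_seq_counts[page_number] = n + 1
    return seq


pages = [1, 1, 1, 2, 2, 3]
assert seq_numbers_groupby(pages) == [0, 1, 2, 0, 1, 0]
assert seq_numbers_counting(pages) == [0, 1, 2, 0, 1, 0]
```

Note one subtle design point: the two variants agree only when elements of the same page arrive contiguously (the typical partition order). If a page number reappeared later, `groupby` would restart its count while the dict keeps counting, which is arguably the safer behavior for generating unique IDs.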
@qued hope you could review it :) Best,

I noticed

Hi Aseem. In your fork, under Actions, do you have a workflow called "Ingest Test Fixtures Update PR"? If so, you should run that manually targeting your branch (the one you are trying to merge into). Let me know if that doesn't work.

@qued on my fork, the

Closing as I'm pulling the changes in via #4101
In-repo duplicate of #4089.

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>

📄 34% (0.34x) speedup for `_assign_hash_ids` in `unstructured/partition/common/metadata.py`

⏱️ Runtime: 88.4 microseconds → 65.8 microseconds (best of 15 runs)

📝 Explanation and details
(Identical to the PR description above.)

✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime

- `partition/common/test_metadata.py::test_assign_hash_ids_produces_unique_and_deterministic_SHA1_ids_even_for_duplicate_elements`

🌀 Generated Regression Tests and Runtime
🔎 Concolic Coverage Tests and Runtime
- `codeflash_concolic_ktxbqhta/tmpb11w96m9/test_concolic_coverage.py::test__assign_hash_ids`

To edit these changes, `git checkout codeflash/optimize-_assign_hash_ids-memtfran` and push.
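The existing unit test listed above verifies that duplicate elements still receive unique, deterministic SHA-1 IDs. The idea can be illustrated with a hypothetical helper; the function name and the exact inputs fed into the hash are assumptions for illustration, not the library's actual scheme:

```python
import hashlib


def element_hash_id(text, page_number, page_seq_number):
    """Hypothetical sketch: a deterministic SHA-1 based id where the
    (page_number, page_seq_number) pair disambiguates duplicate text."""
    data = f"{text}-{page_number}-{page_seq_number}".encode()
    return hashlib.sha1(data).hexdigest()[:32]


# Duplicate text on the same page gets distinct ids via the sequence number,
# and repeating the call with the same inputs reproduces the same id.
a = element_hash_id("Hello", 1, 0)
b = element_hash_id("Hello", 1, 1)
assert a != b
assert a == element_hash_id("Hello", 1, 0)
```

This is why the per-page sequence numbers computed by `_assign_hash_ids` matter: without them, two identical elements on the same page would hash to the same ID.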