⚡️ Speed up function _assign_hash_ids by 34%
#4089
Conversation
The optimization replaces `itertools.groupby` with a simple dictionary-based counting approach in the `_assign_hash_ids` function.

**Key change:** Instead of creating intermediate lists (`page_numbers` and `page_seq_numbers`) and using `itertools.groupby`, the optimized version uses a dictionary `page_seq_counts` to track sequence numbers for each page in a single pass.

**Why it's faster:**

- **Eliminates list comprehensions:** The original code creates a full `page_numbers` list upfront, then processes it with `groupby`. The optimized version processes elements directly without intermediate collections.
- **Removes `itertools.groupby` overhead:** `groupby` builds a per-group iterator and performs key comparisons for every element, which adds interpreter overhead. The dictionary lookup `page_seq_counts.get(page_number, 0)` is a single O(1) operation per element.
- **Single-pass processing:** Instead of two passes (first to collect page numbers, then to generate sequences), the optimization does everything in one loop through the elements.

**Performance characteristics:** The optimization is particularly effective for documents with many pages or elements, as shown in the test results where empty lists see 300%+ speedups. The 34% overall speedup demonstrates the efficiency gain from eliminating the `itertools.groupby` bottleneck, which consumed 19.5% + 6.3% of the original runtime according to the line profiler.
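The counting idea described above can be sketched as follows. This is a minimal illustration of the two strategies, not the exact `unstructured` implementation; the function names are made up for the comparison, and only the per-page sequence numbering is shown:

```python
from itertools import groupby


def seq_numbers_groupby(page_numbers):
    """Original-style approach: group consecutive equal page numbers,
    then enumerate within each group (two conceptual passes)."""
    return [i for _, group in groupby(page_numbers) for i, _ in enumerate(group)]


def seq_numbers_counting(page_numbers):
    """Optimized approach: one pass with a dict of per-page counters,
    each lookup and update being O(1)."""
    page_seq_counts = {}
    seq = []
    for page_number in page_numbers:
        n = page_seq_counts.get(page_number, 0)
        seq.append(n)
        page_seq_counts[page_number] = n + 1
    return seq


pages = [1, 1, 1, 2, 2, 3]
assert seq_numbers_groupby(pages) == [0, 1, 2, 0, 1, 0]
assert seq_numbers_counting(pages) == [0, 1, 2, 0, 1, 0]
```

Note one subtle design point: the two variants agree only when elements of the same page arrive contiguously (the typical partition order). If a page number reappeared later, `groupby` would restart its count while the dict keeps counting, which is arguably the safer behavior for generating unique IDs.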
@qued hope you could review it :) Best,

I noticed

Hi Aseem. In your fork, under Actions, do you have a workflow called "Ingest Test Fixtures Update PR"? If so, you should run that manually targeting your branch (the one you are trying to merge into). Let me know if that doesn't work.

@qued on my fork, the

Closing as I'm pulling the changes in via #4101
In-repo duplicate of #4089.

Co-authored-by: codeflash-ai[bot] <148906541+codeflash-ai[bot]@users.noreply.github.com>
Co-authored-by: Aseem Saxena <aseem.bits@gmail.com>

📄 34% (0.34x) speedup for `_assign_hash_ids` in `unstructured/partition/common/metadata.py`

⏱️ Runtime: 88.4 microseconds → 65.8 microseconds (best of 15 runs)

📝 Explanation and details
(Identical to the PR description above.)

✅ Correctness verification report:
⚙️ Existing Unit Tests and Runtime

- `partition/common/test_metadata.py::test_assign_hash_ids_produces_unique_and_deterministic_SHA1_ids_even_for_duplicate_elements`

🌀 Generated Regression Tests and Runtime
🔎 Concolic Coverage Tests and Runtime
- `codeflash_concolic_ktxbqhta/tmpb11w96m9/test_concolic_coverage.py::test__assign_hash_ids`

To edit these changes, `git checkout codeflash/optimize-_assign_hash_ids-memtfran` and push.
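The existing unit test listed above verifies that duplicate elements still receive unique, deterministic SHA-1 IDs. The idea can be illustrated with a hypothetical helper; the function name and the exact inputs fed into the hash are assumptions for illustration, not the library's actual scheme:

```python
import hashlib


def element_hash_id(text, page_number, page_seq_number):
    """Hypothetical sketch: a deterministic SHA-1 based id where the
    (page_number, page_seq_number) pair disambiguates duplicate text."""
    data = f"{text}-{page_number}-{page_seq_number}".encode()
    return hashlib.sha1(data).hexdigest()[:32]


# Duplicate text on the same page gets distinct ids via the sequence number,
# and repeating the call with the same inputs reproduces the same id.
a = element_hash_id("Hello", 1, 0)
b = element_hash_id("Hello", 1, 1)
assert a != b
assert a == element_hash_id("Hello", 1, 0)
```

This is why the per-page sequence numbers computed by `_assign_hash_ids` matter: without them, two identical elements on the same page would hash to the same ID.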