@dependabot dependabot bot commented on behalf of github Jan 20, 2025

Bumps unstructured from 0.10.27 to 0.16.14.

Release notes

Sourced from unstructured's releases.

0.16.14

Enhancements

Features

Fixes

  • Fix an issue with multiple values for infer_table_structure when partitioning email with image attachments: the kwargs passed into partition to partition the image already contain infer_table_structure. The partition function now checks whether the kwargs already include infer_table_structure.
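
The failure mode described in this fix can be reproduced in plain Python. The sketch below is illustrative only: `partition_image`, `partition_attachment_buggy` and `partition_attachment_fixed` are hypothetical stand-ins, not functions from unstructured, and the default of `True` is an assumption.

```python
def partition_image(filename, infer_table_structure=False, **kwargs):
    """Hypothetical stand-in for an image partitioner."""
    return {"filename": filename, "infer_table_structure": infer_table_structure}


def partition_attachment_buggy(filename, **kwargs):
    # The flag is passed explicitly AND may also be present in kwargs,
    # which raises TypeError (multiple values for the keyword argument).
    return partition_image(filename, infer_table_structure=True, **kwargs)


def partition_attachment_fixed(filename, **kwargs):
    # Only supply a default when the caller has not already set the flag.
    kwargs.setdefault("infer_table_structure", True)
    return partition_image(filename, **kwargs)
```

Calling `partition_attachment_buggy("a.png", infer_table_structure=False)` raises `TypeError`, while the fixed variant passes the caller's value through unchanged.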

0.16.13

Enhancements

  • Add character-level filtering for tesseract output. It is controllable via TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD environment variable.
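
As a rough illustration of character-level confidence filtering, the sketch below reads the environment variable named above and drops low-confidence characters. Only the variable name comes from the release notes; the float parsing, the 0 to 1 confidence scale, the `(character, confidence)` pair shape and both helper names are assumptions.

```python
import os

ENV_VAR = "TESSERACT_CHARACTER_CONFIDENCE_THRESHOLD"


def read_threshold(default: float = 0.0) -> float:
    # Assumption: the value parses as a float; the real scale may differ.
    return float(os.environ.get(ENV_VAR, default))


def filter_characters(chars, threshold: float) -> str:
    # chars is an iterable of (character, confidence) pairs.
    # Keep only characters whose confidence meets the threshold.
    return "".join(ch for ch, conf in chars if conf >= threshold)
```

With the variable set to `0.5`, `filter_characters([("a", 0.9), ("?", 0.2)], read_threshold())` keeps only the `a`.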

Features

Fixes

  • Fix NLTK download to use the nltk assets in the Docker image.
  • Remove the ability to automatically download the nltk package if missing.

0.16.12

Enhancements

  • Prepare auto-partitioning for pluggable partitioners. Move toward a uniform partitioner call signature so a custom or override partitioner can be registered without code changes.
  • Add NDJSON file type support.

Features

Fixes

  • Base image has been updated.
  • Upgrade ruff to latest. Previously the ruff version was pinned to <0.5. Remove that pin and fix the handful of lint items that resulted.
  • CSV with asserted XLS content-type is correctly identified as CSV. Resolves a bug where a CSV file with an asserted content-type of application/vnd.ms-excel was incorrectly identified as an XLS file.
  • Improve element-type mapping for Chinese text. Fixes bug where Chinese text would produce large numbers of false-positive Title elements.
  • Improve element-type mapping for HTML. Fixes bug where certain non-title elements were classified as Title.

0.16.11

Enhancements

  • Enhance quote standardization tests with additional Unicode scenarios
  • Relax table segregation rule in chunking. Previously a Table element was always segregated into its own pre-chunk such that the Table appeared alone in a chunk or was split into multiple TableChunk elements, but never combined with Text-subtype elements. Allow table elements to be combined with other elements in the same chunk when space allows.
  • Compute chunk length based solely on element.text. Previously .metadata.text_as_html was also considered, and since it is always longer than the text (due to HTML tag overhead) it was the effective length criterion. Remove text-as-html from the length calculation such that text-length is the sole criterion for sizing a chunk.
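
The two chunking changes above can be pictured with a small self-contained sketch. It does not use unstructured's chunking code: `Element`, `legacy_length`, `text_length` and `combine` are hypothetical names, the legacy rule is modelled as a simple maximum, and separators between elements are ignored for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    text: str
    text_as_html: Optional[str] = None  # set for table-like elements


def legacy_length(el: Element) -> int:
    # Assumed model of the old rule: the longer HTML form dominates.
    return max(len(el.text), len(el.text_as_html or ""))


def text_length(el: Element) -> int:
    # New rule: only the plain text counts toward chunk size.
    return len(el.text)


def combine(elements: List[Element], max_characters: int) -> List[List[Element]]:
    # Greedy packing: tables and text share a chunk when space allows.
    chunks, current, size = [], [], 0
    for el in elements:
        n = text_length(el)
        if current and size + n > max_characters:
            chunks.append(current)
            current, size = [], 0
        current.append(el)
        size += n
    if current:
        chunks.append(current)
    return chunks
```

For a table whose text is `a b`, the HTML form is many times longer than the three characters of text, so sizing by text alone leaves room to pack the table alongside neighbouring text elements.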

Features

Fixes

  • Fix ipv4 regex to correctly include up to three digit octets.
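
For illustration, a pattern that accepts one to three digits per octet might look like the sketch below. This is not the expression used in unstructured; `IPV4_PATTERN` and `find_ipv4` are hypothetical names, and the pattern does not validate that each octet is in the 0 to 255 range.

```python
import re
from typing import List

# Four dot-separated groups of one to three digits each.
IPV4_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_ipv4(text: str) -> List[str]:
    # The group is non-capturing, so findall returns the full matches.
    return IPV4_PATTERN.findall(text)
```

A pattern limited to fewer digits per octet would miss or truncate addresses such as `192.168.100.200`, which this version matches whole.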

0.16.10

... (truncated)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot dependabot bot added the chore label Jan 20, 2025
@github-actions github-actions bot added the dependencies label Jan 20, 2025
Bumps [unstructured](https://github.com/Unstructured-IO/unstructured) from 0.10.27 to 0.16.14.
- [Release notes](https://github.com/Unstructured-IO/unstructured/releases)
- [Changelog](https://github.com/Unstructured-IO/unstructured/blob/main/CHANGELOG.md)
- [Commits](Unstructured-IO/unstructured@0.10.27...0.16.14)

---
updated-dependencies:
- dependency-name: unstructured
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot dependabot bot force-pushed the dependabot/pip/unstructured-0.16.14 branch from 8ec02f1 to 24730a6 Compare January 22, 2025 02:18

dependabot bot commented on behalf of github Jan 23, 2025

Superseded by #255.

@dependabot dependabot bot closed this Jan 23, 2025
@dependabot dependabot bot deleted the dependabot/pip/unstructured-0.16.14 branch January 23, 2025 16:27