Merged
45 commits
897f97c
feat: add functionality to merge `inferred` with `extracted` when `fi…
christinestraub Nov 22, 2023
57eba40
feat: add functionality to merge `inferred` with `extracted` when `fi…
christinestraub Nov 22, 2023
17ec904
Merge branch 'main' into refactor/pdf_text_extraction_for_hi_res
christinestraub Nov 24, 2023
e888047
feat: sort extracted layout by deterministic ordering
christinestraub Nov 24, 2023
c637ed0
chore: add force `pip install -e .`
christinestraub Nov 24, 2023
68aacd3
chore: update changelog & version
christinestraub Nov 24, 2023
8669d8b
fix: lint
christinestraub Nov 24, 2023
bb6a16a
chore: update `flake8` config to exclude `unstructured-inference` di…
christinestraub Nov 24, 2023
f66cda6
feat: reflect added `Source.PDFMINER` constant
christinestraub Nov 24, 2023
0e8e466
chore: update ci
christinestraub Nov 24, 2023
646a29d
refactor: import `order_layout` within function
christinestraub Nov 24, 2023
f8f004d
test: fix lint errors
christinestraub Nov 24, 2023
8cd75a4
Merge branch 'main' into refactor/pdf_text_extraction_for_hi_res
christinestraub Nov 27, 2023
d2e2e07
test: fix unit test errors
christinestraub Nov 27, 2023
4d2d190
refactor: organize files for partitioning pdf/image
christinestraub Nov 27, 2023
4abefa9
refactor: add a new module `pdfminer_processing`
christinestraub Nov 28, 2023
5978279
feat: update `_merge_inferred_with_extracted()` to get image size fro…
christinestraub Nov 28, 2023
1a4083a
refactor: `_merge_inferred_with_extracted()`
christinestraub Nov 28, 2023
f0be24c
test: update module import
christinestraub Nov 28, 2023
fbfe8de
Merge branch 'main' into refactor/pdf_text_extraction_for_hi_res
christinestraub Nov 28, 2023
62b1513
chore: update version
christinestraub Nov 28, 2023
dff68e6
feat: use elements returned by `inference.PageLayout.get_elements_fro…
christinestraub Nov 28, 2023
149444a
fix: lint errors
christinestraub Nov 28, 2023
0219040
refactor: move code related to `pdfminer` patch from `unstructured-in…
christinestraub Nov 29, 2023
a42b7e6
test: fix unit test errors
christinestraub Nov 29, 2023
383c496
refactor: move `_merge_inferred_with_extracted()` to pdfminer_process…
christinestraub Nov 29, 2023
d148950
Merge branch 'main' into refactor/pdf_text_extraction_for_hi_res
christinestraub Nov 29, 2023
c204cf5
test: fix lint errors
christinestraub Nov 29, 2023
2603374
feat: import modules depend on `unstructured_inference` library only …
christinestraub Nov 29, 2023
62ea5e8
Merge branch 'main' into refactor/pdf_text_extraction_for_hi_res
christinestraub Nov 29, 2023
7de19d5
Merge branch 'main' into refactor/pdf_text_extraction_for_hi_res
christinestraub Nov 29, 2023
e6f6511
refactor: use `init_pdfminer()` in `_open_pdfminer_pages_generator()`
christinestraub Nov 29, 2023
d8bf20c
Merge branch 'main' into refactor/pdf_text_extraction_for_hi_res
christinestraub Nov 30, 2023
dd88999
chore: update changelog & version
christinestraub Nov 30, 2023
1327055
chore: update ci
christinestraub Nov 30, 2023
fe29e79
feat: use the `open_pdfminer_pages_generator()` procedure in the `hi_…
christinestraub Nov 30, 2023
4126e87
chore: revert all CI yaml changes
christinestraub Dec 1, 2023
d801ed9
chore: bump unstructured-inference==0.7.17
christinestraub Dec 1, 2023
d2fa91f
Merge branch 'main' into refactor/pdf_text_extraction_for_hi_res
christinestraub Dec 1, 2023
651221f
chore: make pip-compile
christinestraub Dec 1, 2023
6bf43d7
fix: dependency path error when running pip-compile
christinestraub Dec 1, 2023
f0f07ab
chore: make pip-compile
christinestraub Dec 1, 2023
4fba0b6
Merge branch 'main' into refactor/pdf_text_extraction_for_hi_res
christinestraub Dec 1, 2023
bcea80f
chore: make pip-compile
christinestraub Dec 1, 2023
f2e5128
chore: update version
christinestraub Dec 1, 2023
6 changes: 4 additions & 2 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,9 @@
## 0.11.4-dev0
## 0.11.4-dev1

### Enhancements

* **Refactor pdfminer code.** The pdfminer code is moved from `unstructured-inference` to `unstructured`.

### Features

### Fixes
@@ -23,8 +25,8 @@
## 0.11.1

### Enhancements
* **Use `pikepdf` to repair invalid PDF structure** for PDFminer when we see error `PSSyntaxError` when PDFminer opens the document and creates the PDFminer pages object or processes a single PDF page.

* **Use `pikepdf` to repair invalid PDF structure** for PDFminer when we see error `PSSyntaxError` when PDFminer opens the document and creates the PDFminer pages object or processes a single PDF page.
* **Batch Source Connector support** For instances where it is more optimal to read content from a source connector in batches, a new batch ingest doc is added which created multiple ingest docs after reading them in in batches per process.

### Features
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -36,7 +36,7 @@ idna==3.6
# requests
imagesize==1.4.1
# via sphinx
importlib-metadata==6.8.0
importlib-metadata==6.9.0
# via sphinx
jinja2==3.1.2
# via
@@ -8,7 +8,7 @@
from PIL import Image

from unstructured.documents.elements import PageBreak
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.pdf_image.pdf import partition_pdf
from unstructured.partition.utils.constants import SORT_MODE_BASIC, SORT_MODE_DONT, SORT_MODE_XY_CUT
from unstructured.partition.utils.xycut import (
bbox2points,
2 changes: 1 addition & 1 deletion examples/layout-analysis/visualization.py
@@ -7,7 +7,7 @@
from unstructured_inference.visualize import draw_bbox

from unstructured.documents.elements import PageBreak
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.pdf_image.pdf import partition_pdf

CUR_DIR = pathlib.Path(__file__).parent.resolve()
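
Review note: the two example scripts above switch `partition_pdf` from `unstructured.partition.pdf` to `unstructured.partition.pdf_image.pdf`. For downstream code that has to run against releases on either side of this move, a tolerant import is one option. The sketch below is only an illustration: `resolve_callable` is a hypothetical helper, not part of `unstructured`, and it assumes nothing about whether the old path remains importable after this PR.

```python
import importlib


def resolve_callable(name, module_paths):
    """Return (module_path, attribute) from the first importable module.

    `resolve_callable` is a hypothetical helper for illustration only.
    Modules that cannot be imported, or that lack `name`, are skipped.
    """
    for module_path in module_paths:
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # Covers ModuleNotFoundError too (it subclasses ImportError).
            continue
        attribute = getattr(module, name, None)
        if attribute is not None:
            return module_path, attribute
    raise ImportError(f"{name!r} not found in any of: {list(module_paths)}")


# Intended use (requires `unstructured` to be installed):
# _, partition_pdf = resolve_callable(
#     "partition_pdf",
#     [
#         "unstructured.partition.pdf_image.pdf",  # path used in this PR
#         "unstructured.partition.pdf",            # path used before this PR
#     ],
# )
```

Listing the new path first means the fallback is only consulted on versions where the new module does not exist.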

2 changes: 1 addition & 1 deletion requirements/build.txt
@@ -36,7 +36,7 @@ idna==3.6
# requests
imagesize==1.4.1
# via sphinx
importlib-metadata==6.8.0
importlib-metadata==6.9.0
# via sphinx
jinja2==3.1.2
# via
12 changes: 6 additions & 6 deletions requirements/dev.txt
@@ -91,7 +91,7 @@ idna==3.6
# anyio
# jsonschema
# requests
importlib-metadata==6.8.0
importlib-metadata==6.9.0
# via
# build
# jupyter-client
@@ -138,7 +138,7 @@ jsonschema[format-nongpl]==4.20.0
# jupyter-events
# jupyterlab-server
# nbformat
jsonschema-specifications==2023.11.1
jsonschema-specifications==2023.11.2
# via jsonschema
jupyter==1.0.0
# via -r dev.in
@@ -301,7 +301,7 @@ qtconsole==5.5.1
# via jupyter
qtpy==2.4.1
# via qtconsole
referencing==0.31.0
referencing==0.31.1
# via
# jsonschema
# jsonschema-specifications
@@ -319,7 +319,7 @@ rfc3986-validator==0.1.1
# via
# jsonschema
# jupyter-events
rpds-py==0.13.1
rpds-py==0.13.2
# via
# jsonschema
# referencing
@@ -354,7 +354,7 @@ tomli==2.0.1
# jupyterlab
# pip-tools
# pyproject-hooks
tornado==6.3.3
tornado==6.4
# via
# ipykernel
# jupyter-client
@@ -395,7 +395,7 @@ urllib3==1.26.18
# -c constraints.in
# -c test.txt
# requests
virtualenv==20.24.7
virtualenv==20.25.0
# via pre-commit
wcwidth==0.2.12
# via prompt-toolkit
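
Review note: most of this PR's requirements churn is regenerated `pip-compile` output, so it can help to reduce two versions of a compiled file to just the pins that moved. The following is a minimal sketch using only the standard library. `parse_pins` and `changed_pins` are hypothetical helper names, and the parser assumes the simple `name==version` / `name[extra]==version` lines seen in these files (no environment markers or hashes).

```python
import re

# Matches "name==version" and "name[extra]==version" at the start of a line.
PIN_RE = re.compile(r"^([A-Za-z0-9_.\-]+)(?:\[[^\]]*\])?==(\S+)")


def parse_pins(text):
    """Map package name (lowercased, extras dropped) to its pinned version."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blanks, "# via ..." comments, and option lines such as "-c base.txt".
        if not line or line.startswith("#") or line.startswith("-"):
            continue
        match = PIN_RE.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def changed_pins(old_text, new_text):
    """Return {name: (old_version, new_version)} for pins that differ.

    A version of None means the package is absent on that side.
    """
    old, new = parse_pins(old_text), parse_pins(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


old = "tornado==6.3.3\n    # via ipykernel\nvirtualenv==20.24.7\nwcwidth==0.2.12\n"
new = "tornado==6.4\n    # via ipykernel\nvirtualenv==20.25.0\nwcwidth==0.2.12\n"
print(changed_pins(old, new))
```

The sample strings reuse pins from the `requirements/dev.txt` hunks above; unchanged packages such as `wcwidth` are left out of the result.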
2 changes: 1 addition & 1 deletion requirements/extra-markdown.txt
@@ -4,7 +4,7 @@
#
# pip-compile --output-file=extra-markdown.txt extra-markdown.in
#
importlib-metadata==6.8.0
importlib-metadata==6.9.0
# via markdown
markdown==3.5.1
# via -r extra-markdown.in
2 changes: 1 addition & 1 deletion requirements/extra-msg.txt
@@ -6,5 +6,5 @@
#
msg-parser==1.2.0
# via -r extra-msg.in
olefile==0.46
olefile==0.47
# via msg-parser
2 changes: 1 addition & 1 deletion requirements/extra-paddleocr.txt
@@ -59,7 +59,7 @@ imageio==2.33.0
# scikit-image
imgaug==0.4.0
# via unstructured-paddleocr
importlib-metadata==6.8.0
importlib-metadata==6.9.0
# via flask
importlib-resources==6.1.1
# via matplotlib
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
@@ -8,7 +8,7 @@ pikepdf
pypdf
# Do not move to contsraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
unstructured-inference==0.7.15
unstructured-inference==0.7.17
# unstructured fork of pytesseract that provides an interface to allow for multiple output formats
# from one tesseract call
unstructured.pytesseract>=0.3.12
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.txt
@@ -250,7 +250,7 @@ typing-extensions==4.8.0
# torch
tzdata==2023.3
# via pandas
unstructured-inference==0.7.15
unstructured-inference==0.7.17
# via -r extra-pdf-image.in
unstructured-pytesseract==0.3.12
# via
2 changes: 1 addition & 1 deletion requirements/ingest/airtable.txt
@@ -19,7 +19,7 @@ idna==3.6
# requests
inflection==0.5.1
# via pyairtable
pyairtable==2.2.0
pyairtable==2.2.1
# via -r ingest/airtable.in
pydantic==1.10.13
# via
4 changes: 1 addition & 3 deletions requirements/ingest/azure.txt
@@ -76,9 +76,7 @@ portalocker==2.8.2
pycparser==2.21
# via cffi
pyjwt[crypto]==2.8.0
# via
# msal
# pyjwt
# via msal
requests==2.31.0
# via
# -c ingest/../base.txt
4 changes: 1 addition & 3 deletions requirements/ingest/box.txt
@@ -9,9 +9,7 @@ attrs==23.1.0
boxfs==0.2.1
# via -r ingest/box.in
boxsdk[jwt]==3.9.2
# via
# boxfs
# boxsdk
# via boxfs
certifi==2023.11.17
# via
# -c ingest/../base.txt
2 changes: 1 addition & 1 deletion requirements/ingest/confluence.txt
@@ -4,7 +4,7 @@
#
# pip-compile --output-file=ingest/confluence.txt ingest/confluence.in
#
atlassian-python-api==3.41.3
atlassian-python-api==3.41.4
# via -r ingest/confluence.in
certifi==2023.11.17
# via
8 changes: 5 additions & 3 deletions requirements/ingest/embed-aws-bedrock.txt
@@ -46,6 +46,8 @@ frozenlist==1.4.0
# via
# aiohttp
# aiosignal
greenlet==3.0.1
# via sqlalchemy
idna==3.6
# via
# -c ingest/../base.txt
@@ -62,11 +64,11 @@ jsonpatch==1.33
# langchain-core
jsonpointer==2.4
# via jsonpatch
langchain==0.0.341
langchain==0.0.344
# via -r ingest/embed-aws-bedrock.in
langchain-core==0.0.6
langchain-core==0.0.8
# via langchain
langsmith==0.0.67
langsmith==0.0.68
# via
# langchain
# langchain-core
8 changes: 5 additions & 3 deletions requirements/ingest/embed-huggingface.txt
@@ -51,6 +51,8 @@ fsspec==2023.9.1
# -c ingest/../constraints.in
# huggingface-hub
# torch
greenlet==3.0.1
# via sqlalchemy
huggingface==0.0.1
# via -r ingest/embed-huggingface.in
huggingface-hub==0.19.4
@@ -77,11 +79,11 @@ jsonpatch==1.33
# langchain-core
jsonpointer==2.4
# via jsonpatch
langchain==0.0.341
langchain==0.0.344
# via -r ingest/embed-huggingface.in
langchain-core==0.0.6
langchain-core==0.0.8
# via langchain
langsmith==0.0.67
langsmith==0.0.68
# via
# langchain
# langchain-core
11 changes: 7 additions & 4 deletions requirements/ingest/embed-openai.txt
@@ -43,6 +43,8 @@ frozenlist==1.4.0
# via
# aiohttp
# aiosignal
greenlet==3.0.1
# via sqlalchemy
h11==0.14.0
# via httpcore
httpcore==1.0.2
@@ -62,11 +64,11 @@ jsonpatch==1.33
# langchain-core
jsonpointer==2.4
# via jsonpatch
langchain==0.0.341
langchain==0.0.344
# via -r ingest/embed-openai.in
langchain-core==0.0.6
langchain-core==0.0.8
# via langchain
langsmith==0.0.67
langsmith==0.0.68
# via
# langchain
# langchain-core
@@ -87,7 +89,7 @@ numpy==1.24.4
# -c ingest/../base.txt
# -c ingest/../constraints.in
# langchain
openai==1.3.5
openai==1.3.7
# via -r ingest/embed-openai.in
packaging==23.2
# via
@@ -116,6 +118,7 @@ sniffio==1.3.0
# via
# anyio
# httpx
# openai
sqlalchemy==2.0.23
# via langchain
tenacity==8.2.3
2 changes: 1 addition & 1 deletion requirements/ingest/gcs.txt
@@ -46,7 +46,7 @@ google-api-core==2.14.0
# via
# google-cloud-core
# google-cloud-storage
google-auth==2.23.4
google-auth==2.24.0
# via
# gcsfs
# google-api-core
4 changes: 1 addition & 3 deletions requirements/ingest/github.txt
@@ -30,9 +30,7 @@ pycparser==2.21
pygithub==2.1.1
# via -r ingest/github.in
pyjwt[crypto]==2.8.0
# via
# pygithub
# pyjwt
# via pygithub
pynacl==1.5.0
# via pygithub
python-dateutil==2.8.2
4 changes: 2 additions & 2 deletions requirements/ingest/google-drive.txt
@@ -17,9 +17,9 @@ charset-normalizer==3.3.2
# requests
google-api-core==2.14.0
# via google-api-python-client
google-api-python-client==2.108.0
google-api-python-client==2.109.0
# via -r ingest/google-drive.in
google-auth==2.23.4
google-auth==2.24.0
# via
# google-api-core
# google-api-python-client
10 changes: 5 additions & 5 deletions requirements/ingest/hubspot.txt
@@ -2,19 +2,19 @@
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/ingest-hubspot.in
# pip-compile --output-file=ingest/hubspot.txt ingest/hubspot.in
#
certifi==2023.7.22
certifi==2023.11.17
# via hubspot-api-client
hubspot-api-client==8.1.1
# via -r requirements/ingest-hubspot.in
# via -r ingest/hubspot.in
python-dateutil==2.8.2
# via hubspot-api-client
six==1.16.0
# via
# hubspot-api-client
# python-dateutil
urllib3==1.26.17
urllib3==2.1.0
# via
# -r requirements/ingest-hubspot.in
# -r ingest/hubspot.in
# hubspot-api-client
2 changes: 1 addition & 1 deletion requirements/ingest/jira.txt
@@ -4,7 +4,7 @@
#
# pip-compile --output-file=ingest/jira.txt ingest/jira.in
#
atlassian-python-api==3.41.3
atlassian-python-api==3.41.4
# via -r ingest/jira.in
certifi==2023.11.17
# via
2 changes: 1 addition & 1 deletion requirements/ingest/mongodb.txt
@@ -6,5 +6,5 @@
#
dnspython==2.4.2
# via pymongo
pymongo==4.6.0
pymongo==4.6.1
# via -r ingest/mongodb.in
4 changes: 1 addition & 3 deletions requirements/ingest/onedrive.txt
@@ -40,9 +40,7 @@ office365-rest-python-client==2.4.2
pycparser==2.21
# via cffi
pyjwt[crypto]==2.8.0
# via
# msal
# pyjwt
# via msal
pytz==2023.3.post1
# via office365-rest-python-client
requests==2.31.0
4 changes: 1 addition & 3 deletions requirements/ingest/outlook.txt
@@ -34,9 +34,7 @@ office365-rest-python-client==2.4.2
pycparser==2.21
# via cffi
pyjwt[crypto]==2.8.0
# via
# msal
# pyjwt
# via msal
pytz==2023.3.post1
# via office365-rest-python-client
requests==2.31.0
4 changes: 2 additions & 2 deletions requirements/ingest/pinecone.in
@@ -1,3 +1,3 @@
-c constraints.in
-c base.txt
-c ../constraints.in
-c ../base.txt
pinecone-client
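
Review note: the `pinecone.in` change above only makes sense once the `-c` paths are read relative to the directory of the file that contains them, which is how the compiled `ingest/*.txt` files in this PR reference `ingest/../base.txt`. The sketch below illustrates that resolution rule with the standard library. `resolve_constraints` is a hypothetical helper, and the relative-to-containing-file rule is stated here as an assumption about the tooling rather than something this diff confirms.

```python
import posixpath
from pathlib import PurePosixPath


def resolve_constraints(requirements_file, text):
    """List constraint files named by `-c` lines, normalized to plain paths.

    Assumption: relative `-c` paths are resolved against the directory
    of `requirements_file`, not against the current working directory.
    """
    base = PurePosixPath(requirements_file).parent
    resolved = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("-c", "--constraint"):
            # normpath collapses "ingest/.." so the target is easy to read.
            resolved.append(posixpath.normpath(str(base / parts[1])))
    return resolved


before = "-c constraints.in\n-c base.txt\npinecone-client\n"
after = "-c ../constraints.in\n-c ../base.txt\npinecone-client\n"

print(resolve_constraints("requirements/ingest/pinecone.in", before))
print(resolve_constraints("requirements/ingest/pinecone.in", after))
```

Under that assumption, the old lines point into `requirements/ingest/`, while the new lines point at `requirements/constraints.in` and `requirements/base.txt`.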