Merged
29 commits
7d84877
Dockerfile updated with a conditional to decide which paddlepaddle wh…
tabossert Sep 8, 2023
c3181c9
Bump version, add makefile command for non docker installs
tabossert Sep 8, 2023
550866e
Removing dockerfile command and adding script to install
tabossert Sep 8, 2023
5d175b1
Revert pip-compile to use python3.8
tabossert Sep 8, 2023
d593e37
add to setup.py extra
tabossert Sep 8, 2023
acb1d51
Merge branch 'main' into trevor/paddleocr
tabossert Sep 8, 2023
fad1e46
add nl
tabossert Sep 8, 2023
2876387
Resolve PR comments
tabossert Sep 11, 2023
4478c88
Merge branch 'main' into trevor/paddleocr
tabossert Sep 11, 2023
384e6f8
bump version
tabossert Sep 11, 2023
0fcc952
Update deps due to conflicts
tabossert Sep 11, 2023
a9cdfe0
Upgrade opencv-python version due to conflict
tabossert Sep 11, 2023
4012b81
Merge branch 'main' into trevor/paddleocr
tabossert Sep 11, 2023
0ca4da5
bump version
tabossert Sep 12, 2023
3c4aa99
Merge branch 'main' into trevor/paddleocr
tabossert Sep 12, 2023
16b6b69
Merge branch 'main' into trevor/paddleocr
tabossert Sep 13, 2023
954e1ae
linting fix
tabossert Sep 13, 2023
52dc30c
Merge branch 'main' into trevor/paddleocr
tabossert Sep 14, 2023
1d29d35
Merge branch 'main' into trevor/paddleocr
tabossert Sep 14, 2023
f5b4c69
Merge branch 'main' into trevor/paddleocr
tabossert Sep 15, 2023
8aa28c4
Merge branch 'main' into trevor/paddleocr
yuming-long Sep 15, 2023
308ffce
pin matplotlib==3.7.2 for paddle install
yuming-long Sep 15, 2023
d6f1c73
Merge branch 'main' into trevor/paddleocr
tabossert Sep 15, 2023
f432d19
Merge branch 'main' into trevor/paddleocr
yuming-long Sep 15, 2023
779d5d0
compile with pinned fsspec version
tabossert Sep 15, 2023
79a9f8e
Merge branch 'main' into trevor/paddleocr
yuming-long Sep 15, 2023
ead3748
changelog nit
yuming-long Sep 15, 2023
05cb727
Merge branch 'main' into trevor/paddleocr
yuming-long Sep 15, 2023
c3e9222
Update CHANGELOG.md
cragwolfe Sep 16, 2023
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,6 @@
## 0.10.15-dev15


### Enhancements

* **Support for better element categories from the next-generation image-to-text model ("chipper").** Previously, not all of the classifications from Chipper were being mapped to proper `unstructured` element categories, so the consumer of the library would see many `UncategorizedText` elements. This fixes the issue, improving the granularity of the element category outputs for better downstream processing and chunking. The mapping update is:
@@ -24,6 +25,7 @@
* **Add delta table destination connector** New delta table destination connector added to ingest CLI. Users may now use `unstructured-ingest` to write partitioned data from over 20 data sources (so far) to a Delta Table.
* **Rename to Source and Destination Connectors in the Documentation.** Maintain naming consistency between Connectors codebase and documentation with the first addition to a destination connector.
* **Non-HTML text files now return unstructured-elements as opposed to HTML-elements.** Previously, text-based files that went through `partition_html` would return HTML-elements, but now we preserve the format from the input using the `source_format` argument in the partition call.
* **Adds `PaddleOCR` as an optional alternative to `Tesseract`** for OCR when processing PDF or image files. It is installable via the `Makefile` target `install-paddleocr`. For experimental purposes only.

### Features

@@ -47,7 +49,7 @@
* Update all connectors to use new downstream architecture
* New click type added to parse comma-delimited string inputs
* Some CLI options renamed

### Features

### Fixes
@@ -66,6 +68,7 @@
* Add Jira Connector to be able to pull issues from a Jira organization
* Add `clean_ligatures` function to expand ligatures in text


### Fixes

* `partition_html` breaks on `<br>` elements.
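
Since the changelog entry describes `PaddleOCR` as an optional alternative to `Tesseract`, a caller may want to check whether the optional dependency is importable before choosing a backend. The following is only a minimal sketch of that idea: the function name `pick_ocr_backend`, the returned labels, and the default module name `"paddleocr"` are assumptions for illustration and are not taken from this PR.

```python
import importlib.util


def pick_ocr_backend(optional_module: str = "paddleocr") -> str:
    """Return "paddle" if the optional module is importable, else "tesseract".

    The default module name is a hypothetical placeholder; the diff does not
    show the real import name of the PaddleOCR package.
    """
    try:
        # find_spec returns None when a top-level module cannot be located.
        found = importlib.util.find_spec(optional_module) is not None
    except (ImportError, ValueError):
        # Raised for some malformed or partially initialised module names.
        found = False
    return "paddle" if found else "tesseract"


if __name__ == "__main__":
    print(pick_ocr_backend())
```

Probing with `find_spec` avoids importing a heavy OCR stack just to learn whether it is installed.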
3 changes: 3 additions & 0 deletions Makefile
@@ -214,6 +214,9 @@ install-local-inference: install install-all-docs
install-pandoc:
ARCH=${ARCH} ./scripts/install-pandoc.sh

.PHONY: install-paddleocr
install-paddleocr:
ARCH=${ARCH} ./scripts/install-paddleocr.sh

## pip-compile: compiles all base/dev/test requirements
.PHONY: pip-compile
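
The new `install-paddleocr` target forwards `ARCH` to `./scripts/install-paddleocr.sh`. As a rough sketch of how the target could be driven from Python, the helper below builds the `make` command line. The helper name `make_install_command` is hypothetical, and using `platform.machine()` as the default is an assumption: the diff does not show which `ARCH` values the script expects.

```python
import platform
from typing import List, Optional


def make_install_command(arch: Optional[str] = None) -> List[str]:
    """Build the argv for `make install-paddleocr`, overriding ARCH.

    A `NAME=value` argument on the make command line overrides the
    Makefile variable of the same name.
    """
    # Assumption: the machine type reported by Python is an acceptable ARCH.
    arch = arch or platform.machine()
    return ["make", "install-paddleocr", f"ARCH={arch}"]


if __name__ == "__main__":
    # To actually run it from the repository root:
    #   import subprocess
    #   subprocess.run(make_install_command(), check=True)
    print(" ".join(make_install_command()))
```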
2 changes: 2 additions & 0 deletions requirements/constraints.in
@@ -32,3 +32,5 @@ safetensors<=0.3.2
# use the known compatible version of weaviate and unstructured.pytesseract
unstructured.pytesseract>=0.3.12
weaviate-client==3.23.2
# Note(yuming) - pinning to avoid conflict with paddle install
matplotlib==3.7.2
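
The `matplotlib==3.7.2` constraint only helps if every compiled requirements file ends up on the same version (in this PR, `requirements/extra-pdf-image.txt` moves from `3.7.3` to `3.7.2`). A small sketch of such a check follows. The helper names are hypothetical, and it deliberately compares only exact `==` pins, ignoring range specifiers such as `<=` and `>=`.

```python
import re
from typing import Dict, List, Tuple

# Matches "name==version", allowing an optional "[extras]" part after the name.
PIN = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)(?:\[[^\]]*\])?==(\S+)$")


def exact_pins(text: str) -> Dict[str, str]:
    """Collect exact pins from requirements-style text, skipping comments."""
    pins: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        match = PIN.match(line)
        if match:
            # Normalise names so "a.b", "a-b" and "a_b" compare equal.
            name = re.sub(r"[-_.]+", "-", match.group(1)).lower()
            pins[name] = match.group(2)
    return pins


def pin_mismatches(constraints: str, compiled: str) -> List[Tuple[str, str, str]]:
    """Return (name, constrained, compiled) for pins that disagree."""
    want, got = exact_pins(constraints), exact_pins(compiled)
    return [(n, want[n], got[n]) for n in sorted(want) if n in got and got[n] != want[n]]


if __name__ == "__main__":
    constraints = "weaviate-client==3.23.2\nmatplotlib==3.7.2\n"
    print(pin_mismatches(constraints, "matplotlib==3.7.3\n    # via pycocotools\n"))
```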
4 changes: 2 additions & 2 deletions requirements/dev.txt
@@ -80,7 +80,7 @@ filelock==3.12.4
# via virtualenv
fqdn==1.5.1
# via jsonschema
identify==2.5.28
identify==2.5.29
# via pre-commit
idna==3.4
# via
@@ -176,7 +176,7 @@ jupyter-server==2.7.3
# notebook-shim
jupyter-server-terminals==0.4.4
# via jupyter-server
jupyterlab==4.0.5
jupyterlab==4.0.6
# via notebook
jupyterlab-pygments==0.2.2
# via nbconvert
4 changes: 4 additions & 0 deletions requirements/extra-paddleocr.in
@@ -0,0 +1,4 @@
-c constraints.in
-c base.txt

unstructured.paddleocr==2.6.1.3
219 changes: 219 additions & 0 deletions requirements/extra-paddleocr.txt
@@ -0,0 +1,219 @@
#
# This file is autogenerated by pip-compile with Python 3.8
# by the following command:
#
# pip-compile requirements/extra-paddleocr.in
#
attrdict==2.0.1
# via unstructured-paddleocr
babel==2.12.1
# via flask-babel
bce-python-sdk==0.8.90
# via visualdl
blinker==1.6.2
# via flask
cachetools==5.3.1
# via premailer
certifi==2023.7.22
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
charset-normalizer==3.2.0
# via
# -c requirements/base.txt
# requests
click==8.1.7
# via
# -c requirements/base.txt
# flask
contourpy==1.1.0
# via matplotlib
cssselect==1.2.0
# via premailer
cssutils==2.7.1
# via premailer
cycler==0.11.0
# via matplotlib
cython==3.0.2
# via unstructured-paddleocr
et-xmlfile==1.1.0
# via openpyxl
flask==2.3.3
# via
# flask-babel
# visualdl
flask-babel==3.1.0
# via visualdl
fonttools==4.42.1
# via matplotlib
future==0.18.3
# via bce-python-sdk
idna==3.4
# via
# -c requirements/base.txt
# requests
imageio==2.31.3
# via
# imgaug
# scikit-image
imgaug==0.4.0
# via unstructured-paddleocr
importlib-metadata==6.8.0
# via flask
importlib-resources==6.0.1
# via matplotlib
itsdangerous==2.1.2
# via flask
jinja2==3.1.2
# via
# flask
# flask-babel
kiwisolver==1.4.5
# via matplotlib
lanms-neo==1.0.2
# via unstructured-paddleocr
lazy-loader==0.3
# via scikit-image
lmdb==1.4.1
# via unstructured-paddleocr
lxml==4.9.3
# via
# -c requirements/base.txt
# premailer
# unstructured-paddleocr
markupsafe==2.1.3
# via
# jinja2
# werkzeug
matplotlib==3.7.2
# via
# -c requirements/constraints.in
# imgaug
# visualdl
networkx==3.1
# via scikit-image
numpy==1.24.4
# via
# -c requirements/constraints.in
# contourpy
# imageio
# imgaug
# matplotlib
# opencv-contrib-python
# opencv-python
# pandas
# pywavelets
# scikit-image
# scipy
# shapely
# tifffile
# unstructured-paddleocr
# visualdl
opencv-contrib-python==4.8.0.76
# via unstructured-paddleocr
opencv-python==4.8.0.76
# via
# imgaug
# unstructured-paddleocr
openpyxl==3.1.2
# via unstructured-paddleocr
packaging==23.1
# via
# -c requirements/base.txt
# matplotlib
# scikit-image
# visualdl
pandas==2.0.3
# via visualdl
pdf2image==1.16.3
# via unstructured-paddleocr
pillow==10.0.1
# via
# imageio
# imgaug
# matplotlib
# pdf2image
# scikit-image
# visualdl
polygon3==3.0.9.1
# via unstructured-paddleocr
premailer==3.10.0
# via unstructured-paddleocr
protobuf==4.23.4
# via
# -c requirements/constraints.in
# visualdl
psutil==5.9.5
# via visualdl
pyclipper==1.3.0.post5
# via unstructured-paddleocr
pycryptodome==3.18.0
# via bce-python-sdk
pyparsing==3.0.9
# via
# -c requirements/constraints.in
# matplotlib
python-dateutil==2.8.2
# via
# matplotlib
# pandas
pytz==2023.3.post1
# via
# babel
# flask-babel
# pandas
pywavelets==1.4.1
# via scikit-image
rapidfuzz==3.3.0
# via unstructured-paddleocr
rarfile==4.0
# via visualdl
requests==2.31.0
# via
# -c requirements/base.txt
# premailer
# visualdl
scikit-image==0.21.0
# via
# imgaug
# unstructured-paddleocr
scipy==1.10.1
# via
# -c requirements/constraints.in
# imgaug
# scikit-image
shapely==2.0.1
# via
# imgaug
# unstructured-paddleocr
six==1.16.0
# via
# attrdict
# bce-python-sdk
# imgaug
# python-dateutil
# visualdl
tifffile==2023.7.10
# via scikit-image
tqdm==4.66.1
# via
# -c requirements/base.txt
# unstructured-paddleocr
tzdata==2023.3
# via pandas
unstructured-paddleocr==2.6.1.3
# via -r requirements/extra-paddleocr.in
urllib3==1.26.16
# via
# -c requirements/base.txt
# -c requirements/constraints.in
# requests
visualdl==2.5.3
# via unstructured-paddleocr
werkzeug==2.3.7
# via flask
zipp==3.16.2
# via
# importlib-metadata
# importlib-resources
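
The `# via` annotations in the compiled file record which package pulled in each pin, which is useful when tracing why a dependency such as `matplotlib` appears at all. The sketch below parses that annotation format into a mapping. The function name `parse_via` is hypothetical, and it assumes the layout produced by `pip-compile` as seen in this file (a pin line followed by `# via` comment lines), dropping `-c` constraint references.

```python
from typing import Dict, List, Optional


def parse_via(compiled: str) -> Dict[str, List[str]]:
    """Map each pinned package to the entries listed under its "# via" comment."""
    via: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in compiled.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not line.startswith("#"):
            # A pin line such as "matplotlib==3.7.2" starts a new entry.
            current = line.split("==", 1)[0]
            via[current] = []
            continue
        if current is None:
            continue  # header comment before the first pin
        body = line[1:].strip()
        if body == "via" or body.startswith("via "):
            body = body[3:].strip()
        # Skip empty markers and "-c" constraint-file references.
        if body and not body.startswith("-c "):
            via[current].append(body)
    return via


if __name__ == "__main__":
    sample = "pandas==2.0.3\n    # via visualdl\n"
    print(parse_via(sample))
```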
12 changes: 7 additions & 5 deletions requirements/extra-pdf-image.txt
@@ -37,7 +37,7 @@ flatbuffers==23.5.26
# via onnxruntime
fonttools==4.42.1
# via matplotlib
fsspec==2023.9.0
fsspec==2023.9.1
# via huggingface-hub
huggingface-hub==0.17.1
# via
@@ -62,8 +62,10 @@ layoutparser[layoutmodels,tesseract]==0.3.4
# via unstructured-inference
markupsafe==2.1.3
# via jinja2
matplotlib==3.7.3
# via pycocotools
matplotlib==3.7.2
# via
# -c requirements/constraints.in
# pycocotools
mpmath==1.3.0
# via sympy
networkx==3.1
@@ -113,7 +115,7 @@ pdfminer-six==20221105
# pdfplumber
pdfplumber==0.10.2
# via layoutparser
pillow==10.0.0
pillow==10.0.1
# via
# layoutparser
# matplotlib
@@ -202,7 +204,7 @@ tqdm==4.66.1
# huggingface-hub
# iopath
# transformers
transformers==4.33.1
transformers==4.33.2
# via unstructured-inference
typing-extensions==4.7.1
# via
2 changes: 1 addition & 1 deletion requirements/extra-pptx.txt
@@ -6,7 +6,7 @@
#
lxml==4.9.3
# via python-pptx
pillow==10.0.0
pillow==10.0.1
# via python-pptx
python-pptx==0.6.21
# via -r requirements/extra-pptx.in
4 changes: 2 additions & 2 deletions requirements/huggingface.txt
@@ -22,7 +22,7 @@ filelock==3.12.4
# huggingface-hub
# torch
# transformers
fsspec==2023.9.0
fsspec==2023.9.1
# via huggingface-hub
huggingface-hub==0.17.1
# via transformers
@@ -91,7 +91,7 @@ tqdm==4.66.1
# huggingface-hub
# sacremoses
# transformers
transformers==4.33.1
transformers==4.33.2
# via -r requirements/huggingface.in
typing-extensions==4.7.1
# via
2 changes: 1 addition & 1 deletion requirements/ingest-azure.in
@@ -1,4 +1,4 @@
-c constraints.in
-c base.txt
adlfs
fsspec
fsspec==2023.9.1
2 changes: 1 addition & 1 deletion requirements/ingest-azure.txt
@@ -49,7 +49,7 @@ frozenlist==1.4.0
# via
# aiohttp
# aiosignal
fsspec==2023.9.0
fsspec==2023.9.1
# via
# -r requirements/ingest-azure.in
# adlfs
2 changes: 1 addition & 1 deletion requirements/ingest-box.in
@@ -1,4 +1,4 @@
-c constraints.in
-c base.txt
boxfs
fsspec
fsspec==2023.9.1
2 changes: 1 addition & 1 deletion requirements/ingest-box.txt
@@ -23,7 +23,7 @@ charset-normalizer==3.2.0
# requests
cryptography==41.0.3
# via boxsdk
fsspec==2023.9.0
fsspec==2023.9.1
# via
# -r requirements/ingest-box.in
# boxfs
2 changes: 1 addition & 1 deletion requirements/ingest-delta-table.in
@@ -1,4 +1,4 @@
-c constraints.in
-c base.txt
deltalake
fsspec
fsspec==2023.9.1