Merged
60 commits
19ffbdc
First commit
ajjimeno Mar 23, 2023
ac95ddb
Table processing in document layout
ajjimeno Mar 24, 2023
7b4320d
Platform x86_64 check
ajjimeno Mar 24, 2023
6f78f65
PaddleOCR integrated
Mar 24, 2023
e3c8c72
Deactivate show_log in paddleocr
Mar 24, 2023
8b194d4
Merge branch 'main' into table-processing
qued Mar 25, 2023
04689ee
Utilize layout updates
qued Mar 26, 2023
d8b3068
Formatting and linting
qued Mar 26, 2023
a696a12
Correct how linting ignores are accumulated
qued Mar 26, 2023
fea4d8c
Removed fitz
ajjimeno Mar 26, 2023
1f0b617
Merged table processing
ajjimeno Mar 26, 2023
b88e3f5
Bug fixed in intersect_rect
Mar 27, 2023
8c1d9fe
Bump to default 200 dpi
qued Mar 27, 2023
7987bd5
Updated README with instructions to install paddleocr
Mar 27, 2023
ea8489c
Fixed typo
Mar 27, 2023
f96d33a
Deal with empty case
qued Mar 27, 2023
9bd7133
Formatting
qued Mar 27, 2023
3ef7963
Updates to pass flake8
Mar 27, 2023
52371ed
Merged
Mar 27, 2023
3b5ee44
Added table test
Mar 27, 2023
f398f35
Fixed test
ajjimeno Mar 27, 2023
bd5671d
Typing changes
qued Mar 27, 2023
25547c1
Merge branch 'table-processing' of github.com:Unstructured-IO/unstruc…
qued Mar 27, 2023
b6df905
formatting for large fixture
qued Mar 27, 2023
51f5b52
Up pixel to reflect new dpi
qued Mar 27, 2023
ab9bd1f
Make table extraction opt-in
qued Mar 27, 2023
3d625f0
Change content to check for
qued Mar 27, 2023
e011a4c
Add install targets for paddleocr
qued Mar 27, 2023
a511cc8
Add optional pip install for paddleocr
qued Mar 27, 2023
698df93
Update README.md
ajjimeno Mar 27, 2023
6ec858c
Merge branch 'table-processing' of github.com:Unstructured-IO/unstruc…
qued Mar 27, 2023
e118a2a
Remove unused functions
qued Mar 27, 2023
ec1fddb
New image for table testing
ajjimeno Mar 27, 2023
5cef516
Remove non-unique assignment case
qued Mar 27, 2023
c343c5b
Merge branch 'table-processing' of github.com:Unstructured-IO/unstruc…
qued Mar 27, 2023
c934bd9
Correct slot_into_contains arguments
qued Mar 27, 2023
b9f97d6
Test for nms
ajjimeno Mar 27, 2023
7409021
update fixtures
qued Mar 27, 2023
2640ac3
Added test for nms
ajjimeno Mar 27, 2023
731e9a4
Added test for nms
ajjimeno Mar 27, 2023
7023da0
fix for disable table extraction by default
qued Mar 27, 2023
33b1c88
Revised test
ajjimeno Mar 27, 2023
b08322e
Update old tests
qued Mar 27, 2023
bd017c8
reuse postprocess
qued Mar 27, 2023
779c461
Additional tests
ajjimeno Mar 28, 2023
666b634
More rect tests, extract_text_from_spans
qued Mar 28, 2023
1947715
Remove unused code
qued Mar 28, 2023
4fcb541
Merge branch 'table-processing' of github.com:Unstructured-IO/unstruc…
qued Mar 28, 2023
7ab3c4c
Align supercells test
ajjimeno Mar 28, 2023
5b9bbb3
Updated removal supercell test
ajjimeno Mar 28, 2023
af5d73f
header_supercell_tree test
qued Mar 28, 2023
58a445b
Merge branch 'table-processing' of github.com:Unstructured-IO/unstruc…
qued Mar 28, 2023
2f82b65
name change to forked paddleocr
qued Mar 28, 2023
cf1a946
Updated installation and removal of a print statement
ajjimeno Mar 28, 2023
e71310b
tidied file
ajjimeno Mar 28, 2023
32812e4
Version update
ajjimeno Mar 28, 2023
7f1712a
linting
qued Mar 29, 2023
4c8d061
Updated test
ajjimeno Mar 29, 2023
b79e65b
Merge branch 'table-processing' of github.com:Unstructured-IO/unstruc…
ajjimeno Mar 29, 2023
9fe7b50
Changed Makefile
ajjimeno Mar 29, 2023
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,6 @@
-## 0.2.13-dev0
+## 0.2.13

+* Add table processing
 * Change OCR logic to be aware of PDF image elements

 ## 0.2.12
6 changes: 5 additions & 1 deletion Makefile
@@ -20,7 +20,7 @@ install-base: install-base-pip-packages
 install: install-base-pip-packages install-dev install-detectron2 install-test

 .PHONY: install-ci
-install-ci: install-base-pip-packages install-test
+install-ci: install-base-pip-packages install-test install-paddleocr

 .PHONY: install-base-pip-packages
 install-base-pip-packages:
@@ -31,6 +31,10 @@ install-base-pip-packages:
 install-detectron2:
 	pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@78d5b4f335005091fe0364ce4775d711ec93566e"

+.PHONY: install-paddleocr
+install-paddleocr:
+	pip install "unstructured.PaddleOCR"
+
 .PHONY: install-test
 install-test:
 	pip install -r requirements/test.txt
11 changes: 11 additions & 0 deletions README.md
@@ -34,6 +34,17 @@ Windows is not officially supported by Detectron2, but some users are able to in
 See discussion [here](https://layout-parser.github.io/tutorials/installation#for-windows-users) for
 tips on installing Detectron2 on Windows.

+### PaddleOCR
+
+[PaddleOCR](https://github.com/Unstructured-IO/unstructured.PaddleOCR) is required for table processing on `x86_64` architectures.
+It should not be installed on macOS with an Apple Silicon CPU.
+
+Install PaddleOCR with the following command:
+
+```shell
+pip install "unstructured.PaddleOCR"
+```
+
 ### Repository

 To install the repository for development, clone the repo and run `make install` to install dependencies.
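
Note on the platform restriction in the README hunk above: the sketch below shows one way calling code might guard the optional dependency at runtime. It is an illustration, not code from this PR. The helper names `paddleocr_supported` and `can_extract_tables` are hypothetical, and the default module name `paddleocr` is an assumption about how the fork is imported. The check treats only `x86_64` as supported, so a Windows machine reporting `AMD64` would be rejected by this sketch.

```python
import importlib.util
import platform
from typing import Optional


def paddleocr_supported(arch: Optional[str] = None) -> bool:
    """Return True when the architecture is x86_64, per the README note.

    `arch` defaults to the current machine; pass a value to test other cases.
    """
    arch = platform.machine() if arch is None else arch
    return arch == "x86_64"


def can_extract_tables(module_name: str = "paddleocr", arch: Optional[str] = None) -> bool:
    """Return True only if the platform is supported and the module is installed.

    `module_name` is an assumption; check which import name the fork exposes.
    """
    if not paddleocr_supported(arch):
        return False
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("table extraction available:", can_extract_tables())
```

A caller could use `can_extract_tables()` to decide whether to enable table extraction, which fits the opt-in behaviour the commit list describes ("Make table extraction opt-in").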
Binary file added sample-docs/example_table.jpg
2 changes: 1 addition & 1 deletion setup.cfg
@@ -3,7 +3,7 @@ license_files = LICENSE.md

 [flake8]
 max-line-length = 100
-ignore = D100, D101, D104, D105, D107, D2, D4
+extend-ignore = D100, D101, D104, D105, D107, D2, D4
 per-file-ignores =
     test_*/**: D

3 changes: 2 additions & 1 deletion setup.py
@@ -18,6 +18,7 @@
 limitations under the License.
 """
 from setuptools import setup, find_packages
+from platform import machine

 from unstructured_inference.__version__ import __version__

@@ -60,5 +61,5 @@
"onnxruntime",
"transformers",
],
-extras_require={},
+extras_require={"paddle-ocr": "unstructured.PaddleOCR"},
)
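
Note on the `setup.py` change: the hunk imports `machine`, but the lines shown never use it and register the `paddle-ocr` extra unconditionally. The sketch below is one hypothetical way the import could gate the extra by architecture. `build_extras` is an illustrative name, not part of this PR, and the list-valued extra is a stylistic choice (setuptools accepts either a string or a list of strings).

```python
from platform import machine
from typing import Dict, List, Optional


def build_extras(arch: Optional[str] = None) -> Dict[str, List[str]]:
    """Build an extras_require mapping, offering PaddleOCR only on x86_64.

    `arch` defaults to the machine running setup.py; pass a value to test.
    """
    arch = machine() if arch is None else arch
    extras: Dict[str, List[str]] = {}
    if arch == "x86_64":
        extras["paddle-ocr"] = ["unstructured.PaddleOCR"]
    return extras


if __name__ == "__main__":
    # In setup.py this would be passed as: setup(..., extras_require=build_extras())
    print(build_extras())
```

A check like this runs on the machine that builds the distribution, which may differ from the machine that installs it. A PEP 508 environment marker such as `unstructured.PaddleOCR; platform_machine == "x86_64"` would be evaluated at install time instead. Either way, users would opt in with `pip install "<package>[paddle-ocr]"`, where `<package>` is the distribution name, which this diff does not show.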
3 changes: 2 additions & 1 deletion test_unstructured_inference/inference/test_layout.py
@@ -186,11 +186,12 @@ def points(self):


 class MockPageLayout(layout.PageLayout):
-    def __init__(self, layout=None, model=None, ocr_strategy="auto"):
+    def __init__(self, layout=None, model=None, ocr_strategy="auto", extract_tables=False):
         self.image = None
         self.layout = layout
         self.model = model
         self.ocr_strategy = ocr_strategy
+        self.extract_tables = extract_tables

     def ocr(self, text_block: MockTextRegion):
         return text_block.ocr_text
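
Note on the mock change: the new `extract_tables` argument defaults to `False`, so existing tests that build `MockPageLayout()` keep their behaviour. The sketch below shows how a test might exercise both settings. `PageLayout` and `MockTextRegion` here are minimal stand-ins written for this example, not the real classes from the repository.

```python
class PageLayout:
    """Stand-in for layout.PageLayout; the real class is not reproduced here."""


class MockTextRegion:
    """Stand-in text region carrying pre-set OCR text."""

    def __init__(self, ocr_text: str):
        self.ocr_text = ocr_text


class MockPageLayout(PageLayout):
    def __init__(self, layout=None, model=None, ocr_strategy="auto", extract_tables=False):
        self.image = None
        self.layout = layout
        self.model = model
        self.ocr_strategy = ocr_strategy
        self.extract_tables = extract_tables

    def ocr(self, text_block: MockTextRegion):
        # Return the canned text instead of running a real OCR engine.
        return text_block.ocr_text


# Table extraction is off unless a test explicitly opts in.
default_page = MockPageLayout()
tables_page = MockPageLayout(extract_tables=True)
```

Keeping the default at `False` matches the "Make table extraction opt-in" commit in the list above.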