This tool performs simple information extraction from PDFs and images based on provided keywords.
Key features:
- PDFPlumber for digital documents
- OCR for scanned documents
- PaddleOCR as default OCR engine
- EasyOCR for multilingual support
- TrOCR model for handwriting recognition
- Cosine similarity to locate the most similar keyword
- Fuzziness is 65% by default, and adjustable
- Locate information by defining a relative position to the given element:
  - above, below, left, right
  - on the same row, on the same column
- Perform NER analysis for given texts (spaCy)
- Merge long sentences on the same line into 1 bbox
Road ahead:
- Table detection
- Table structure analysis
Example code:
from EasyDoc import EasyDoc
doc = EasyDoc(r"Test.pdf")
ocr_result = doc.extract_words()
doc.on_the_same_column(text='Name (as shown', relation='below')
doc.set_region(text='Business name', relation='above')
Name = doc.extract_text(engine='TrOCR-handwritten') #Bruce Wayne
doc.draw_region('Name', show_image=True)
Example code:
from EasyDoc import EasyDoc
doc = EasyDoc(r"doc/Test.pdf")
ocr_result = doc.extract_words(apply_ocr=False)
doc.on_the_same_row(text='Cum Income')
doc.on_the_same_column(text='Pence')
NAV = doc.extract_text() #30.65
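The example above narrows the region twice: first to the row of 'Cum Income', then to the column of 'Pence', so only the cell at their intersection remains. A minimal sketch of that idea follows, assuming word boxes are (x0, top, x1, bottom) tuples and that "same row" and "same column" mean overlapping vertical and horizontal extents. The `Word`, `same_row`, and `same_column` names and the sample coordinates are hypothetical and not part of EasyDoc.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x0: float
    top: float
    x1: float
    bottom: float


def overlaps(a0, a1, b0, b1):
    """True when the 1-D intervals (a0, a1) and (b0, b1) overlap."""
    return a0 < b1 and b0 < a1


def same_row(words, anchor):
    """Words whose vertical extent overlaps the anchor's (hypothetical rule)."""
    return [w for w in words
            if w is not anchor and overlaps(w.top, w.bottom, anchor.top, anchor.bottom)]


def same_column(words, anchor):
    """Words whose horizontal extent overlaps the anchor's (hypothetical rule)."""
    return [w for w in words
            if w is not anchor and overlaps(w.x0, w.x1, anchor.x0, anchor.x1)]


# Made-up coordinates for illustration only.
words = [
    Word("Pence", 300, 100, 350, 112),
    Word("Cum Income", 50, 140, 130, 152),
    Word("30.65", 305, 140, 345, 152),
    Word("29.10", 305, 160, 345, 172),
]
col_anchor, row_anchor = words[0], words[1]

in_row = same_row(words, row_anchor)              # keep the 'Cum Income' row
in_row_and_col = same_column(in_row, col_anchor)  # then keep the 'Pence' column
print([w.text for w in in_row_and_col])
```

Applying the two filters in the opposite order gives the same cell, since each filter only removes words.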
Prepare env: Python 3.9, PyTorch 1.12.1, CUDA 11.6, cuDNN 8.4
pip install pandas sentence-transformers pypdfium2 easyocr pdfplumber Pillow 'spacy[cuda-autodetect]'
pip install paddlepaddle-gpu==2.4.1.post116 -f https://www.paddlepaddle.org.cn/whl/windows/mkl/avx/stable.html
pip uninstall opencv-python opencv-python-headless
pip install "paddleocr>=2.0.1"
python -m spacy download en_core_web_trf
doc = EasyDoc(r"Test.pdf")
ocr_result = doc.extract_words()
Parameters | Default value |
---|---|
apply_OCR | True |
lang | en, ch, cht |
page | 1 |
temp_folder | tmp |
tmp_prefix | image |
element = doc.find_text(keyword='Name').iloc[0][:]
Parameters | Values | Default |
---|---|---|
keyword | ||
fuzzy | 0-1 | 0.65 |
position | None, top, bottom, left, right | None |
nth | 0: return all matches; >=1: return the nth match | 1 |
sort_by | any columns in self.ocr_result | fuzzy_matching_lower_trim |
element = doc.find_text(keyword='Name',fuzzy=0.65, position='top').iloc[0][:]
doc.set_region(element, relation='above', offset=-30)
#or directly search by text
doc.set_region(text='Employer Identification number', relation='above', offset=-30)
Fuzziness is calculated using cosine similarity, based on the model multi-qa-mpnet-base-cos-v1.
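To illustrate how a cosine-similarity threshold such as `fuzzy=0.65` and the `nth` parameter could interact, here is a minimal sketch. It uses toy 3-dimensional vectors in place of real sentence embeddings, and the `pick_matches` helper, its ranking by score, and its return shape are assumptions for illustration, not EasyDoc's actual implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_matches(keyword_vec, candidates, fuzzy=0.65, nth=1):
    """Hypothetical helper: keep candidates scoring >= fuzzy, best first.

    nth=0 returns every kept match; nth>=1 returns a one-item list with the
    nth best match, or an empty list when there are fewer than nth matches.
    """
    scored = [(text, cosine_similarity(keyword_vec, vec)) for text, vec in candidates]
    kept = sorted((s for s in scored if s[1] >= fuzzy), key=lambda s: s[1], reverse=True)
    if nth == 0:
        return kept
    return kept[nth - 1:nth]


# Toy vectors standing in for embeddings of the keyword and of the OCR words.
keyword = [1.0, 0.0, 0.0]
candidates = [
    ("Name", [1.0, 0.0, 0.0]),
    ("Names", [0.8, 0.6, 0.0]),
    ("Address", [0.0, 1.0, 0.0]),
]

best = pick_matches(keyword, candidates)             # nth=1: best match only
all_matches = pick_matches(keyword, candidates, nth=0)
print(best, all_matches)
```

Raising `fuzzy` makes matching stricter; with `fuzzy=0.9` only the exact "Name" vector would pass in this toy data.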
relation | position | default | Note |
---|---|---|---|
above | whole | Yes | Above the entire element |
above | bottom | | Above the bottom of the element |
below | whole | Yes | Below the entire element |
below | top | | Below the top of the element |
left | whole | Yes | On the left of the entire element |
left | right | | On the left of the right edge of the element |
right | whole | Yes | On the right of the entire element |
right | left | | On the right of the left edge of the element |
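The relation and position combinations in the table can be read as rectangles on the page. The sketch below is one possible interpretation, assuming image coordinates with the origin at the top-left, element boxes given as (x0, top, x1, bottom), and a region that extends to the page edge. The `region_for` function and the default page size are hypothetical, and the `offset` parameter is deliberately left out.

```python
def region_for(bbox, relation, position="whole", page_size=(1000, 1000)):
    """Hypothetical mapping from (relation, position) to a page rectangle.

    bbox is (x0, top, x1, bottom); the result uses the same layout.
    """
    x0, top, x1, bottom = bbox
    width, height = page_size
    regions = {
        ("above", "whole"): (0, 0, width, top),      # above the entire element
        ("above", "bottom"): (0, 0, width, bottom),  # above the element's bottom
        ("below", "whole"): (0, bottom, width, height),
        ("below", "top"): (0, top, width, height),
        ("left", "whole"): (0, 0, x0, height),
        ("left", "right"): (0, 0, x1, height),       # left of the right edge
        ("right", "whole"): (x1, 0, width, height),
        ("right", "left"): (x0, 0, width, height),   # right of the left edge
    }
    try:
        return regions[(relation, position)]
    except KeyError:
        raise ValueError(f"unsupported combination: {relation}/{position}") from None


element = (100, 200, 300, 250)
print(region_for(element, "above"))         # region above the entire element
print(region_for(element, "below", "top"))  # region that still includes the element
```

Under this reading, the non-default positions produce a region that contains the element itself, which is useful when the wanted text sits on the same line as the keyword.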
Reset the region before working on a new extraction area:
doc.reset_region()
Search by element
element = doc.find_text(keyword='Name (as shown',fuzzy=0.65, position='top').iloc[0][:]
doc.on_the_same_column(element, relation='below')
Search by text
doc.on_the_same_column(text='Name (as shown', relation='below')
- offset = (a,b)
- relation = 'above' or 'below'
Search by element
element = doc.find_text(keyword='Social security number',fuzzy=0.65, position='top').iloc[0][:]
doc.on_the_same_column(element, relation='below', offset=(0, 420))
Search by text
doc.on_the_same_column(text='Social security number', relation='below', offset=(0, 420))
- offset = (a,b)
- relation = 'above' or 'below'
Available OCR engines:
- PaddleOCR
- EasyOCR
- TrOCR-handwritten (English only)
Layout analysis merges long sentences on the same line into a single bbox:
doc.analyze_layout(w=2, h=1.0)
Parameter | Meaning | Default |
---|---|---|
w | text indent | 2 (2 characters) |
h | vertical merging | 1.0 (no merging) |
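As a rough illustration of horizontal merging, the sketch below joins neighbouring words on one line when the gap between them is at most `w` average character widths. This is an assumed reading of the `w` parameter, not EasyDoc's actual algorithm; the `merge_line` helper and the sample coordinates are hypothetical, and the `h` parameter is not modelled.

```python
def merge_line(words, w=2):
    """Merge words on one text line into longer boxes (hypothetical rule).

    words is a list of (text, x0, x1). Two neighbours are merged when the
    horizontal gap between them is at most w average character widths.
    """
    if not words:
        return []
    words = sorted(words, key=lambda item: item[1])
    total_chars = sum(len(text) for text, _, _ in words)
    total_width = sum(x1 - x0 for _, x0, x1 in words)
    char_width = total_width / total_chars if total_chars else 0.0

    merged = [list(words[0])]
    for text, x0, x1 in words[1:]:
        last = merged[-1]
        if x0 - last[2] <= w * char_width:
            last[0] = f"{last[0]} {text}"  # extend the current box
            last[2] = max(last[2], x1)
        else:
            merged.append([text, x0, x1])  # gap too wide: start a new box
    return [tuple(item) for item in merged]


# Made-up coordinates: three close words, then a value far to the right.
line = [
    ("Employer", 10, 90),
    ("Identification", 100, 240),
    ("number", 250, 310),
    ("12-3456789", 500, 600),
]
print(merge_line(line, w=2))
```

With these sample boxes the average character width is 10 units, so gaps of up to 20 units are merged and the distant value stays in its own box.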
Draw the region for debugging; the image is saved to tmp/output.png:
doc.draw_region(label='Name', show_image=True)
Draw the bboxes for debugging; the image is saved to tmp/output.png:
doc.draw_region(show_image=True)
Return the text in the current region:
text = doc.extract_text()
Parameters | Values |
---|---|
apply_OCR | True |
engine | PaddleOCR, EasyOCR, TrOCR-handwritten |
separator | ' ' |
offset | 5 |
NER_analysis = doc.NER(text='')
print(NER_analysis)
Returns the NER analysis produced by the spaCy transformer model:
Text | Label | Start | End |
---|---|---|---|
01 November 2022 | Date | 85 | 101 |
NAV_date = doc.get_entity_by_label(text='', labels=['DATE'])
print(NAV_date)
Available labels: CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART
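Filtering entities by label amounts to a simple selection over rows like the one in the table above. The sketch below shows the idea on plain tuples; the `Entity` type and `entities_by_label` helper are hypothetical stand-ins, the second sample row is made up, and the return type of the real `get_entity_by_label` may differ.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    text: str
    label: str
    start: int
    end: int


def entities_by_label(entities, labels):
    """Return the text of every entity whose label is in labels (case-insensitive)."""
    wanted = {label.upper() for label in labels}
    return [e.text for e in entities if e.label.upper() in wanted]


entities = [
    Entity("01 November 2022", "DATE", 85, 101),
    Entity("30.65", "CARDINAL", 120, 125),  # made-up row for illustration
]

dates = entities_by_label(entities, ["DATE"])
print(dates)
```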
pip uninstall opencv-python opencv-python-headless
pip install "paddleocr>=2.0.1"
- Locate zlib.dll in C:\Program Files\NVIDIA Corporation
- Copy zlib.dll to the corresponding CUDA folder, C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin, and rename it to zlibwapi.dll