# LayoutParser

A Python library for Document Image Analysis (DIA)

References:

Official Notebook
https://github.com/Layout-Parser/layout-parser/blob/master/examples/OCR%20Tables%20and%20Parse%20the%20Output.ipynb

GitHub repo
https://github.com/Layout-Parser/layout-parser

Research paper
https://arxiv.org/abs/2103.15348
https://arxiv.org/pdf/2103.15348.pdf


Dataset
https://arxiv.org/abs/2004.08686



For a more detailed walkthrough, refer to [this article](https://analyticsindiamag.com/guide-to-layoutparser-a-document-image-analysis-python-library/).

# OCR from Table Document Image

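Install the LayoutParser library from PyPI along with its OCR extra, which provides the Python bindings for the Tesseract OCR engine, then install the other dependencies. The Tesseract engine itself must also be available on the system.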

In [None]:
!python -m pip install pip --upgrade --user -q --no-warn-script-location
!python -m pip install numpy pandas seaborn matplotlib scipy statsmodels scikit-learn nltk gensim tensorflow keras torch torchvision \
    tqdm scikit-image pillow --user -q --no-warn-script-location

In [None]:
%%bash
python -m pip install -U layoutparser --user -q
python -m pip install 'git+https://github.com/facebookresearch/detectron2.git@v0.4#egg=detectron2' --user -q
# quote the extra so the shell does not treat the brackets as a glob pattern
python -m pip install 'layoutparser[ocr]' --user -q




In [None]:
# restart the kernel so the newly installed packages can be imported
import IPython
IPython.Application.instance().kernel.do_shutdown(True)

Import the libraries

In [None]:
import layoutparser as lp

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline
import cv2

Read a sample image to run OCR on

In [None]:
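import urllib.request

url = 'https://raw.githubusercontent.com/Layout-Parser/layout-parser/master/examples/data/example-table.jpeg'
# cv2.imread only reads local files, so download the bytes and decode them
with urllib.request.urlopen(url) as response:
    buffer = np.frombuffer(response.read(), dtype=np.uint8)
image = cv2.imdecode(buffer, cv2.IMREAD_COLOR)
# OpenCV decodes to BGR; convert to RGB for matplotlib
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# display image
plt.figure(figsize=(12,16))
plt.imshow(image)
plt.xticks([])
plt.yticks([])
plt.show()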

Load the TesseractAgent OCR Engine

In [None]:
model = lp.TesseractAgent()

Detect the text and its locations in the sample image.

In [None]:
res = model.detect(image, return_response=True)

Collect the text and its bounding boxes into a data structure that is easy to process.

In [None]:
# feature type 4 is the WORD level, so each entry is one word with its box
ocr = model.gather_data(res, lp.TesseractFeatureType(4))


In [None]:
ocr

Display the recognized text along with its bounding boxes

In [None]:
lp.draw_text(image, ocr, font_size=12, with_box_on_text=True,
             text_box_width=1)

The output redraws each recognized word at its detected position, in the font and size set by the drawing call rather than those of the original image. Comparing it with the input shows that the engine has recognized the text and its locations accurately. The words can then be post-processed row-wise or column-wise as needed.
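As a rough illustration of row-wise and column-wise post-processing, the sketch below groups word boxes into table rows by their vertical centre and then transposes the rows into columns. The `words` list, the `group_into_rows` helper, and the `tolerance` value are all hypothetical and chosen for illustration; in practice the text and pixel coordinates would come from the blocks in `ocr`, and the tolerance would need tuning to the line height of the document.

```python
# Hypothetical word records: (text, x_1, y_1, x_2, y_2) in pixels.
# These stand in for the text and coordinates of OCR word boxes.
words = [
    ("Name", 10, 12, 60, 30),
    ("Age", 120, 10, 150, 29),
    ("Alice", 10, 52, 62, 70),
    ("31", 120, 50, 140, 69),
    ("Bob", 10, 91, 45, 110),
    ("27", 120, 92, 140, 111),
]


def group_into_rows(words, tolerance=10):
    """Group word boxes into rows by vertical centre (illustrative helper).

    A word joins the current row when its centre lies within `tolerance`
    pixels of that row's running mean centre; otherwise it starts a new row.
    Each row is returned as a list of texts sorted left to right.
    """
    rows = []  # each row: {"y": mean vertical centre, "items": [word, ...]}
    for word in sorted(words, key=lambda w: (w[2] + w[4]) / 2):
        y_center = (word[2] + word[4]) / 2
        if rows and abs(rows[-1]["y"] - y_center) <= tolerance:
            row = rows[-1]
            row["items"].append(word)
            # incrementally update the row's mean vertical centre
            row["y"] += (y_center - row["y"]) / len(row["items"])
        else:
            rows.append({"y": y_center, "items": [word]})
    return [
        [w[0] for w in sorted(row["items"], key=lambda w: w[1])]
        for row in rows
    ]


table_rows = group_into_rows(words)

# Column-wise view: transpose the rows. zip() stops at the shortest row,
# so this assumes every row has the same number of cells.
table_columns = [list(column) for column in zip(*table_rows)]

print(table_rows)
print(table_columns)
```

Grouping on the vertical centre rather than the top edge makes the rows a little more tolerant of words with different heights. Real tables with empty cells or multi-word cells would need column assignment by x-position instead of a plain transpose.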

