![title](./pics/dd_logo.png)

# Getting started

This notebook introduces the Deep Doctection Analyzer. The analyzer is a built-in pipeline that offers a rudimentary framework for identifying layout structures in business documents and extracting their text.

We assume that the installation was carried out using one of the described options. If a virtual environment and a kernel have been created using the make files, the deep-doc kernel can be selected via the kernel selection in the notebook.

![title](./pics/dd_kernel.png) 

In [1]:
import os
from matplotlib import pyplot as plt
from IPython.display import HTML

## Analyzer

The factory function `get_dd_analyzer` returns a preconfigured version of the analyzer.

We start with the standard configuration, which we will explain in more detail below. 

In [2]:
from deep_doctection.analyzer import get_dd_analyzer
from deep_doctection.utils.systools import get_package_path

In [3]:
path = os.path.join(get_package_path(),"notebooks/pics/samples/sample_2")

Knowing the language in advance improves the text output significantly. As the document is German, we pass `language='deu'` as the only configuration.

In [None]:
analyzer = get_dd_analyzer(language='deu')

## Pipeline components

We will not go into the details of the entire configuration. However, the top line of the output is important: it states that all optional components are set to True by default.


`Will establish with table: True and ocr: True`

The analyzer therefore performs layout recognition, table segmentation and OCR extraction. These are the components that are carried out by neural networks.

In addition, there are rule-based components. These include, for example, a refinement process for table segmentation, the matching of OCR-determined words to the layout results, and a heuristic determination of the reading order.
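To make the rule-based steps more tangible, here is a minimal sketch of word-to-layout matching and a simple reading order. It is not the analyzer's implementation: the classes `Word` and `LayoutBlock`, the center-in-box matching rule and the top-to-bottom, left-to-right ordering are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels
Box = Tuple[float, float, float, float]


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class LayoutBlock:
    category: str
    box: Box
    words: List[Word] = field(default_factory=list)


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of a box."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2, (y_min + y_max) / 2


def contains(box: Box, point: Tuple[float, float]) -> bool:
    """Check whether a point lies inside a box (borders included)."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def match_words(blocks: List[LayoutBlock], words: List[Word]) -> None:
    """Assign each word to the first block that contains its center."""
    for word in words:
        for block in blocks:
            if contains(block.box, center(word.box)):
                block.words.append(word)
                break


def reading_order(blocks: List[LayoutBlock]) -> List[LayoutBlock]:
    """Order blocks top to bottom, then left to right (a simple heuristic)."""
    return sorted(blocks, key=lambda b: (b.box[1], b.box[0]))


def block_text(block: LayoutBlock) -> str:
    """Join the words of a block, sorted by vertical, then horizontal position."""
    ordered = sorted(block.words, key=lambda w: (w.box[1], w.box[0]))
    return " ".join(w.text for w in ordered)


# Made-up layout blocks and OCR words, deliberately given out of order
blocks = [
    LayoutBlock("text", (50, 200, 500, 300)),
    LayoutBlock("title", (50, 50, 500, 100)),
]
words = [
    Word("world", (120, 210, 180, 230)),
    Word("Hello", (60, 210, 110, 230)),
    Word("Report", (60, 60, 150, 90)),
]

match_words(blocks, words)
for block in reading_order(blocks):
    print(block.category, "->", block_text(block))
```

A single sort key like this breaks down on multi-column pages, which is one reason why a reading order can only be determined heuristically.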

## Analyze methods

The `analyze` method accepts several parameters. The `path` parameter passes a path to a directory to the analyzer. All individual pages in this directory that have the ending .png or .jpg are then processed.

It is also possible to pass a path to a PDF document using the `doc_path` parameter. When the analyzer is started, the individual pages of the document are analyzed successively.
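For the directory case, the following sketch shows which files such a selection would pick up. The helper `list_page_images` is hypothetical and not part of the analyzer. The case-insensitive suffix check and the sorted order are assumptions as well.

```python
import tempfile
from pathlib import Path
from typing import List


def list_page_images(directory: str) -> List[Path]:
    """Return the .png and .jpg files of a directory in sorted order.

    Hypothetical helper: it only mirrors the file selection described in
    the text and is not part of the analyzer's API.
    """
    suffixes = {".png", ".jpg"}
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in suffixes
    )


# Try it on a temporary directory with two images and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page_2.png", "page_1.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in list_page_images(tmp)]

print(found)
```

Files with other endings, such as `notes.txt`, are skipped.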

In [5]:
df = analyzer.analyze(path=path)

[1119 17:38:23 @common.py:558] [JoinData] Size check failed for the list of dataflow to be joined!


Running the cell shows that not much has happened. Indeed, the `analyze` method returns a generator, which allows processing to be started via a for-loop.

We use `iter` / `next` here. The image is only processed when `next` is called.
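The lazy behaviour can be reproduced with a plain Python generator. The function `lazy_analyze` below is a hypothetical stand-in, not the analyzer. It only shows that no work is done until `next` is called.

```python
from typing import Iterator, List

processed: List[str] = []  # records which files were actually processed


def lazy_analyze(file_names: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a lazy pipeline: yield one result per file."""
    for name in file_names:
        processed.append(name)  # the "work" happens here, one file at a time
        yield f"page for {name}"


df = lazy_analyze(["sample_1.png", "sample_2.png"])
print(processed)  # still empty: creating the generator processes nothing

doc = iter(df)
page = next(doc)  # only now is the first file processed
print(page, processed)
```

A for-loop over `df` would call `next` repeatedly until all files are consumed.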

In [6]:
doc = iter(df)
page = next(doc)

processing sample_2.png


## Page object

A Page object is returned, which holds the page components that were determined. We can also get meta information.

In [7]:
page.height, page.width, page.file_name

(2339, 1654, 'sample_2.png')

In [8]:
image = page.viz()

The `viz` method draws the identified layout components as bounding boxes onto the page image, which gives a good first impression of the result. The returned image can be displayed with common visualization tools.

The layout analysis identifies titles, text blocks and tables, as well as lists and figures if present. Here, a table with its cells was recognized on the page. The segmentation into rows and columns is framed as well, and the row and column positions can be seen in the cell names.

In [None]:
plt.figure(figsize = (20,10))
plt.axis('off')
plt.imshow(image)

We can use the `get_text` method to output the running text. Only the text layout components that can be assigned to the running text are output. Table contents are not included.

In [None]:
print(page.get_text())

In [11]:
len(page.tables)

1

In [12]:
print(page.tables[0])

______________ row: 1 ______________
______________ row: 2 ______________
row: 1, col: 1, rs: 1, cs: 1, text: Jahresdurchschnitt der Mitarbeiterzahl 
row: 1, col: 2, rs: 1, cs: 1, text: 139 
______________ row: 3 ______________
row: 2, col: 1, rs: 1, cs: 1, text: Gesamtvergütung ? 
row: 2, col: 2, rs: 1, cs: 1, text: EUR 15.315.952 
______________ row: 4 ______________
row: 3, col: 1, rs: 1, cs: 1, text: Fixe Vergütung 
row: 3, col: 2, rs: 1, cs: 1, text: EUR 13.151.856 
______________ row: 5 ______________
row: 4, col: 1, rs: 1, cs: 1, text: Variable Vergütung 
row: 4, col: 2, rs: 1, cs: 1, text: EUR 2.164.096 
______________ row: 6 ______________
row: 5, col: 1, rs: 1, cs: 1, text: davon: Carried Interest 
row: 5, col: 2, rs: 1, cs: 1, text: EURO 
______________ row: 7 ______________
row: 6, col: 1, rs: 1, cs: 1, text: Gesamtvergütung für Senior Management ® 
row: 6, col: 2, rs: 1, cs: 1, text: EUR 1.468.434 
______________ row: 8 ______________
row: 7, col: 1, rs: 1, cs: 1, text: Ge

The print function can be used to display an output of the table that includes the segmentation. In addition, an HTML version is generated that visually reproduces the recognized structure well.

In [13]:
HTML(page.tables[0].html)

| 0 | 1 |
|---|---|
| 139 | Jahresdurchschnitt der Mitarbeiterzahl |
| EUR 15.315.952 | Gesamtvergütung ? |
| EUR 13.151.856 | Fixe Vergütung |
| EUR 2.164.096 | Variable Vergütung |
| EURO | davon: Carried Interest |
| EUR 1.468.434 | Gesamtvergütung für Senior Management ® |
| EUR 324.229 | Gesamtvergütung für sonstige Risikoträger |
| EUR 554.046 | Gesamtvergütung für Mitarbeiter mit Kontrollfunktionen |
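To illustrate how cell attributes such as `row`, `col`, `rs` (row span) and `cs` (column span) can be turned into an HTML table, here is a minimal sketch. The `Cell` class and the `cells_to_html` function are hypothetical and do not reflect how the analyzer generates its `html` attribute. The sample values are taken from the printed table above.

```python
from dataclasses import dataclass
from html import escape
from typing import List


@dataclass
class Cell:
    row: int   # 1-based row index
    col: int   # 1-based column index
    rs: int    # row span
    cs: int    # column span
    text: str


def cells_to_html(cells: List[Cell]) -> str:
    """Render cells as an HTML table, one <tr> per distinct row number."""
    rows = sorted({cell.row for cell in cells})
    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        row_cells = sorted((c for c in cells if c.row == row), key=lambda c: c.col)
        for cell in row_cells:
            parts.append(
                f'<td rowspan="{cell.rs}" colspan="{cell.cs}">{escape(cell.text)}</td>'
            )
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


# Cells may arrive in any order; they are sorted by row and column
cells = [
    Cell(1, 2, 1, 1, "139"),
    Cell(1, 1, 1, 1, "Jahresdurchschnitt der Mitarbeiterzahl"),
    Cell(3, 1, 1, 1, "Fixe Vergütung"),
    Cell(3, 2, 1, 1, "EUR 13.151.856"),
]
html_table = cells_to_html(cells)
print(html_table)
```

In a notebook, the resulting string could be passed to `HTML(...)` for display, just like the `html` attribute above.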


Let's have a look at a second example.

In [14]:
path = os.path.join(get_package_path(),"notebooks/pics/samples/sample_1")
df = analyzer.analyze(path=path)
doc = iter(df)
page = next(doc)

[1119 17:39:04 @common.py:558] [JoinData] Size check failed for the list of dataflow to be joined!
processing sample.png


A quick look shows that the segmentation of the table into cells is incorrect in places. This result is to be expected: on the one hand, the segmentation approach has many weaknesses; on the other hand, the dataset used for training is not varied enough to deliver better results without further measures.

A substantial part of this repo is intended to improve on the present results significantly.

The table contents are treated separately from the running text. Since no text block was identified on this page (the header was not recognized), the output of `get_text` is empty.

Tables are stored in the `tables` attribute of the page. The print function can be used to display a readable output of a table.

In [None]:
plt.figure(figsize = (20,10))
plt.axis('off')
image = page.viz()
plt.imshow(image)

In [16]:
print(page.get_text())




With the `save` method, the page object can be saved as a JSON file. Conversely, a file can be loaded into a page object with the `load_page` function.
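The exact signatures of `save` and `load_page` are not shown in this notebook, so the following sketch does not call them. It only illustrates the round-trip idea with Python's standard `json` module. The dictionary layout is hypothetical, and the real JSON structure may differ.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, minimal page representation; the real JSON layout may differ.
page_dict = {
    "file_name": "sample_2.png",
    "height": 2339,
    "width": 1654,
    "tables": [{"row": 1, "col": 2, "rs": 1, "cs": 1, "text": "139"}],
}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample_2.json"
    # ensure_ascii=False keeps non-ASCII characters such as umlauts readable
    target.write_text(
        json.dumps(page_dict, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    # Read the file back and rebuild the dictionary
    restored = json.loads(target.read_text(encoding="utf-8"))

print(restored == page_dict)
```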