![title](./pics/dd_logo.png) 

# Getting started

This notebook introduces the **deep**doctection analyzer, a built-in pipeline that offers a rudimentary framework for identifying layout structures in documents and extracting their text. We start by analyzing a business document.

Before starting, a caveat:

All pipeline components of the analyzer were trained on standard datasets for document layout analysis (Publaynet, Pubtabnet). These datasets contain document pages and tables from medical research articles. The training data is therefore biased, and the layout analysis should not be expected to reach the same precision on other document types as it does on documents from medical studies. To improve the results, we refer to the Fine Tuning Tutorial.

## Choosing the kernel

We assume that the installation was carried out using the options described. If a virtual environment and a kernel have been created using the make files, the deep-doc kernel can be selected via the kernel selection in the notebook.

![title](./pics/dd_kernel.png) 

You can check whether the installation was successful by running the next cell.
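A quick way to sketch such a check is to test whether the package can be located at all. The snippet below assumes the package is importable under the name `deep_doctection`, as in the imports used in this notebook; the helper `is_installed` is hypothetical and not part of the library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# `deep_doctection` is the package name used by the imports in this notebook.
print("deep_doctection available:", is_installed("deep_doctection"))
```

A result of `False` would suggest that the wrong kernel is selected or that the installation did not complete.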

## Running this notebook on Colab

You can run this notebook on Colab. To install the package, run the following cell.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/deepdoctection/deepdoctection/blob/master/notebooks/Get_Started.ipynb)

In [None]:
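!git clone https://github.com/deepdoctection/deepdoctection.git
# Use the %cd magic: `!cd` runs in a separate subshell and does not change the notebook's working directory
%cd deepdoctection
!make up-reqs-dev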

In [1]:
import os
import cv2
from matplotlib import pyplot as plt
from IPython.display import HTML
from deep_doctection.utils.systools import get_package_path

## Sample

Let's first look at a sample page we want to process.

In [None]:
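image_path = os.path.join(get_package_path(), "notebooks/pics/samples/sample_2/sample_2.png")
# OpenCV loads images in BGR channel order, while matplotlib expects RGB
image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(25, 17))
plt.axis('off')
plt.imshow(image)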

![title](./pics/samples/sample_2/sample_2.png)

## Analyzer

We now introduce the **deep**doctection analyzer. The factory function `get_dd_analyzer` returns a preconfigured version.

In [3]:
from deep_doctection.analyzer import get_dd_analyzer

Knowing the language in advance significantly improves the text output. As the document is German, we pass the setting `language='deu'`.

In [None]:
analyzer = get_dd_analyzer(language='deu')

## Pipeline components

We will not go into the details of the entire configuration. However, the top line of the output is important: it says that all optional components are set to `True` by default.


`Will establish with table: True and ocr: True`

The analyzer therefore performs layout recognition, table segmentation and OCR extraction. These are the components that are carried out by neural networks.

In addition, there are rule-based components. These include, for example, a refinement process for table segmentation, the matching of OCR-determined words to the layout results, and a heuristic determination of the reading order.
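To give an idea of what a rule-based matching step can look like, here is a minimal sketch that assigns each word to the first layout box containing the word's center. This is an illustration only and not the analyzer's actual matching rule; the function names, the box format `(x_min, y_min, x_max, y_max)` and the toy coordinates are all assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x_min, y_min, x_max, y_max)


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of a box."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2, (y_min + y_max) / 2


def contains(box: Box, point: Tuple[float, float]) -> bool:
    """Return True if the point lies inside the box (borders included)."""
    x_min, y_min, x_max, y_max = box
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def match_words_to_layout(
    words: List[Tuple[str, Box]], layouts: Dict[str, Box]
) -> Dict[str, List[str]]:
    """Assign each word to the first layout box that contains the word's center."""
    assignment: Dict[str, List[str]] = {name: [] for name in layouts}
    for text, word_box in words:
        point = center(word_box)
        for name, layout_box in layouts.items():
            if contains(layout_box, point):
                assignment[name].append(text)
                break  # a word belongs to at most one layout box in this sketch
    return assignment


# Toy data: two layout boxes and three OCR words with made-up coordinates.
layouts = {"title": (0, 0, 100, 20), "text": (0, 30, 100, 90)}
words = [
    ("Festlegung", (5, 5, 40, 15)),
    ("Die", (5, 35, 15, 45)),
    ("VV-Pools", (20, 35, 50, 45)),
]
matched = match_words_to_layout(words, layouts)
print(matched)
```

Words whose center falls outside every layout box are simply left unassigned here; a real pipeline would need a policy for those.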

## Analyze methods

The `analyze` method accepts several parameters. The `path` parameter passes the path of a directory to the analyzer. All individual pages in this directory are processed successively, provided their file names have the suffix '.png' or '.jpg'.

It is also possible to pass the path of a PDF document using the `doc_path` parameter. When the analyzer is started, the individual pages of the entire document are analyzed successively.
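For the directory case, the file selection can be pictured roughly as follows. This is a sketch under the assumption that files are picked by exact suffix and visited in sorted order; `list_page_images` is a hypothetical helper, not a function of the library.

```python
import tempfile
from pathlib import Path
from typing import List


def list_page_images(directory: str) -> List[Path]:
    """Return the '.png' and '.jpg' files of a directory in sorted order."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix in {".png", ".jpg"}
    )


# Build a throwaway directory with two page images and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("page_2.jpg", "page_1.png", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in list_page_images(tmp)]

print(found)
```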

In [5]:
path = os.path.join(get_package_path(),"notebooks/pics/samples/sample_2")
df = analyzer.analyze(path=path)

[1214 10:40:05 @common.py:558] [JoinData] Size check failed for the list of dataflow to be joined!


When running the cell, you can see that not much has happened. Indeed, the `analyze` method returns a generator, which allows processing to be started via a for-loop.

Here we use `iter` and `next` instead. The image is only processed when `next` is called.

In [6]:
doc = iter(df)
page = next(doc)

processing sample_2.png
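The lazy behaviour seen above is a general property of Python generators. The following toy example does not use the analyzer; `toy_analyze` is a hypothetical stand-in that records a log entry whenever a page is actually processed.

```python
from typing import Dict, Iterator, List


def toy_analyze(file_names: List[str], log: List[str]) -> Iterator[Dict[str, str]]:
    """Toy stand-in for a lazy pipeline: work happens only when a page is requested."""
    for name in file_names:
        log.append(f"processing {name}")  # side effect marks the moment work is done
        yield {"file_name": name}


log: List[str] = []
df = toy_analyze(["sample_2.png"], log)
log_before = list(log)  # still empty: calling a generator function runs none of its body

doc = iter(df)
page = next(doc)  # the first page is processed only now
print(log_before, log, page)
```

A for-loop over `df` would trigger the same processing, one page per iteration.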


## Page object

A Page object is returned, which has some handy tools for visualizing and retrieving the detected results. It also has some attributes that store metadata.

In [7]:
page.height, page.width, page.file_name

(2339, 1654, 'sample_2.png')

In [8]:
image = page.viz()

The `viz` method draws the bounding boxes of the identified layout components into the image. The result can be displayed with common visualization tools.

The layout analysis reproduces the title, text and tables. In addition, lists and figures, if any, are identified. We can see here that a table with table cells was recognized on the page. In addition, the segmentations such as rows and columns were framed. The row and column positions can be seen in the cell names.

In [None]:
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(image)

![title](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/pics/output_16_1.png)

The next methods are devoted to the text output.

We can use the get_text method to output the running text only. Table contents are not included here. 

In [10]:
print(page.get_text())


Festlegung der VV und angemessene Risikoadjustierung
Die VV-Pools der DWS Gruppe werden einer angemessenen Anpassung der Risiken unterzogen, die die Adjustierung ex ante als auch ex post umfasst. Die angewandte robuste Methode soll sicherstellen, dass bei der Festlegung der VV sowohl der risikoadjustierten Leistung als auch der Kapital- und Liquiditätsausstattung der DWS Gruppe Rechnung getragen wird. Die Er- mittlung des Gesamtbetrags der VV orientiert sich primär an (i) der Tragfähigkeit für die DWS Gruppe (das heißt, was „kann” die DWS Gruppe langfristig an VV im Einklang mit regulatorischen ‚Anforderungen gewähren) und (il) der Leistung (das heißt, was „sollte” die DWS Gruppe an VV gewähren, um für eine angemessene leistungsbezogene Vergütung zu sorgen und gleichzeitig den langfristigen Erfolg des Unternehmens zu sichern)
Die DWS Gruppe hat für die Festlegung der auf Ebene der individuellen Mitarbeiter die „Grundsätze für die Festlegung der variablen Vergütung” eingeführt. Diese e

Tables are stored in `page.tables`, which is a Python list of table objects. Here, only one table has been detected.

In [15]:
len(page.tables)

1

In [16]:
print(page.tables[0])

______________ row: 1 ______________
______________ row: 2 ______________
row: 1, col: 1, rs: 1, cs: 1, text: Jahresdurchschnitt der Mitarbeiterzahl 
row: 1, col: 2, rs: 1, cs: 1, text: 139 
______________ row: 3 ______________
row: 2, col: 1, rs: 1, cs: 1, text: Gesamtvergütung ? 
row: 2, col: 2, rs: 1, cs: 1, text: EUR 15.315.952 
______________ row: 4 ______________
row: 3, col: 1, rs: 1, cs: 1, text: Fixe Vergütung 
row: 3, col: 2, rs: 1, cs: 1, text: EUR 13.151.856 
______________ row: 5 ______________
row: 4, col: 1, rs: 1, cs: 1, text: Variable Vergütung 
row: 4, col: 2, rs: 1, cs: 1, text: EUR 2.164.096 
______________ row: 6 ______________
row: 5, col: 1, rs: 1, cs: 1, text: davon: Carried Interest 
row: 5, col: 2, rs: 1, cs: 1, text: EURO 
______________ row: 7 ______________
row: 6, col: 1, rs: 1, cs: 1, text: Gesamtvergütung für Senior Management ® 
row: 6, col: 2, rs: 1, cs: 1, text: EUR 1.468.434 
______________ row: 8 ______________
row: 7, col: 1, rs: 1, cs: 1, text: Ge

The print function can be used to display an output of the table that includes the segmentation. In addition, an HTML version is generated that visually reproduces the recognized structure well.

In [17]:
HTML(page.tables[0].html)

| 0 | 1 |
|---|---|
| 139 | Jahresdurchschnitt der Mitarbeiterzahl |
| EUR 15.315.952 | Gesamtvergütung ? |
| EUR 13.151.856 | Fixe Vergütung |
| EUR 2.164.096 | Variable Vergütung |
| EURO | davon: Carried Interest |
| EUR 1.468.434 | Gesamtvergütung für Senior Management ® |
| EUR 324.229 | Gesamtvergütung für sonstige Risikoträger |
| EUR 554.046 | Gesamtvergütung für Mitarbeiter mit Kontrollfunktionen |
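To illustrate how a segmentation into rows and columns can be turned into HTML, here is a minimal sketch. It assumes cells are given as `(row, col, text)` tuples and that every cell has a row span and column span of 1, as in the printed output above. The function `cells_to_html` is hypothetical and does not reflect how the library builds its HTML.

```python
from html import escape
from typing import Dict, List, Tuple

Cell = Tuple[int, int, str]  # assumed format: (row, col, text), indices starting at 1


def cells_to_html(cells: List[Cell]) -> str:
    """Render cells with span 1 as a plain HTML table, ordered by row and column."""
    rows: Dict[int, Dict[int, str]] = {}
    for row, col, text in cells:
        rows.setdefault(row, {})[col] = text
    lines = ["<table>"]
    for row in sorted(rows):
        tds = "".join(f"<td>{escape(rows[row][col])}</td>" for col in sorted(rows[row]))
        lines.append(f"<tr>{tds}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# A few cells taken from the printed table above.
cells = [
    (1, 1, "Jahresdurchschnitt der Mitarbeiterzahl"),
    (1, 2, "139"),
    (3, 1, "Fixe Vergütung"),
    (3, 2, "EUR 13.151.856"),
]
table_html = cells_to_html(cells)
print(table_html)
```

Supporting spans greater than 1 would require emitting `rowspan` and `colspan` attributes, which this sketch leaves out.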


Finally, you can save the full results to a JSON file.

In [19]:
page.save(path)
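The schema of the saved file is not shown in this notebook. Purely as an illustration of the idea, the sketch below writes a small page record to a JSON file named after the image and reads it back. The helper `save_page` and the `tables` field are assumptions; the other fields mirror the attributes shown earlier (`file_name`, `height`, `width`).

```python
import json
import tempfile
from pathlib import Path


def save_page(page: dict, directory: str) -> Path:
    """Write a page record to '<image stem>.json' in the given directory."""
    target = Path(directory) / (Path(page["file_name"]).stem + ".json")
    target.write_text(json.dumps(page, ensure_ascii=False, indent=2), encoding="utf-8")
    return target


# Hypothetical record; the real output of `page.save` may look different.
page_record = {"file_name": "sample_2.png", "height": 2339, "width": 1654, "tables": 1}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_page(page_record, tmp)
    restored = json.loads(saved.read_text(encoding="utf-8"))

print(saved.name, restored)
```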