![title](./pics/dd_logo.png) 

# Getting started

**deep**doctection is a package for extracting text from complex structured documents. These can be native PDFs but also scans. In contrast to various text miners, **deep**doctection uses deep learning models to solve OCR, vision and language embedding problems. Neural networks and object detectors have proven able not only to identify objects in images, but also to detect structures such as titles, tables, figures or lists. Another advantage is that deep learning models can be trained on your own data to improve accuracy.

This introductory notebook showcases the **deep**doctection analyzer. The analyzer is an example of a built-in pipeline, which offers a rudimentary framework to identify layout structures in documents and to extract text and tables. We start with a text extraction task on a business document.

Before starting, however, a word of caution:

All models used when invoking the analyzer were trained on publicly available datasets for document layout analysis (Publaynet, Pubtabnet). These datasets contain document pages and tables from medical research articles. The training data is therefore biased, and layout analysis cannot be expected to deliver precise results on documents from other domains. To improve precision we refer to the **Fine Tuning Tutorial**, where we deal with improving the parsing results of business reports.

See also this [Huggingface space](https://huggingface.co/spaces/deepdoctection/deepdoctection), where models have been trained on a more diverse dataset.

## Choosing the kernel

We assume that the installation was carried out as described in the guidelines. If a virtual environment and a kernel have been created, the deep-doc kernel can be chosen using the kernel selection at the upper right corner.

![title](./pics/dd_kernel.png) 

You can check whether the installation was successful by running the next cell.

In [None]:
import os
import cv2
from matplotlib import pyplot as plt
from IPython.display import HTML  # public import location for HTML

import deepdoctection as dd

## Sample

Let's first look at a sample page we want to process. (You will probably need to change ```image_path```.)

In [None]:
image_path = os.path.join(dd.get_package_path(),"notebooks/pics/samples/sample_2/sample_2.png")
image = cv2.imread(image_path)
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(image)

![title](./pics/samples/sample_2/sample_2.png)

## Analyzer

We now introduce the **deep**doctection analyzer. The factory function `get_dd_analyzer` returns a pre-configured version.

Knowing the language in advance improves the quality of the text output significantly. As the language of the sample is German, we pass `language='deu'`.

In [None]:
analyzer = dd.get_dd_analyzer(language='deu')

## Pipeline components

The analyzer is an example of a pipeline that can be built depending on the problem you want to tackle. The pipeline is made up of the building blocks shown in the following diagram:

![title](./pics/dd_pipeline.png) 


The default setting performs layout recognition, table segmentation and OCR extraction. You can turn table segmentation and OCR off to get fewer but faster results.

Besides detection and OCR tasks, some other components are needed, e.g. text matching and reading order. Text matching, for instance, tries to assign words to detected layout regions by measuring intersection over area.

Both matching and reading order are purely rule-based components.
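
To make the matching rule more concrete, here is a minimal sketch of an intersection-over-area check. It is not the code used inside **deep**doctection: the function names, the `(x_min, y_min, x_max, y_max)` box format and the threshold of `0.6` are assumptions chosen for illustration only.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def intersection_over_area(word: Box, region: Box) -> float:
    """Return the fraction of the word box area that lies inside the region box."""
    ix_min = max(word[0], region[0])
    iy_min = max(word[1], region[1])
    ix_max = min(word[2], region[2])
    iy_max = min(word[3], region[3])
    # Width or height becomes negative when the boxes do not overlap; clamp to 0.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    word_area = (word[2] - word[0]) * (word[3] - word[1])
    return inter / word_area if word_area > 0 else 0.0


def match_words(words: List[Box], regions: List[Box], threshold: float = 0.6) -> List[int]:
    """For each word, return the index of the best matching region, or -1 if none qualifies."""
    assignments = []
    for word in words:
        scores = [intersection_over_area(word, region) for region in regions]
        best = max(range(len(regions)), key=scores.__getitem__, default=-1)
        # The threshold is a hypothetical value, not a library default.
        assignments.append(best if best >= 0 and scores[best] >= threshold else -1)
    return assignments


words = [(10, 10, 20, 20), (200, 200, 210, 210)]
regions = [(0, 0, 50, 50), (15, 0, 100, 100)]
print(match_words(words, regions))  # [0, -1]
```

The first word lies completely inside the first region and is assigned to it. The second word overlaps no region and stays unassigned.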

## Analyze methods

The `analyze` method accepts several parameters. The `path` parameter takes either a path to a directory or a path to a PDF document. If the path points to a directory, all individual pages are processed successively, provided their file names have the suffix '.png', '.jpg' or '.tif'.
If the path points to a PDF document with multiple pages, the analyzer works through all document pages iteratively.
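
Before passing a directory, it can help to check which files carry one of the suffixes mentioned above. The following sketch uses only the standard library. The helper name `list_page_images` is hypothetical, and the sorting is for display only, since the order in which the analyzer visits files is not stated here.

```python
from pathlib import Path
from typing import List

# Suffixes named in the text above. Whether upper-case variants such as
# '.PNG' are accepted by the analyzer is not stated, so we match exactly.
IMAGE_SUFFIXES = {".png", ".jpg", ".tif"}


def list_page_images(directory: str) -> List[str]:
    """Return the sorted names of files in `directory` with an accepted suffix."""
    return sorted(
        entry.name
        for entry in Path(directory).iterdir()
        if entry.is_file() and entry.suffix in IMAGE_SUFFIXES
    )
```

Calling `list_page_images(path)` on the sample directory used in the next cell should list the page images found there.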

In [None]:
path = os.path.join(dd.get_package_path(),"notebooks/pics/samples/sample_2")
df = analyzer.analyze(path=path)
df.reset_state()

Running the cell shows that not much has happened yet. Indeed, the `analyze` method returns a generator, and you need to create an iterator so that you can loop over the pages you want to process.

We use the built-in `iter` and `next` functions here. The image is only processed when `next` is called.

In [None]:
doc=iter(df)
page = next(doc)
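
The lazy behaviour described above can be reproduced with a plain Python generator. The sketch below does not use **deep**doctection at all. `analyze_pages` is a hypothetical stand-in that records when each "page" is actually processed.

```python
from typing import Iterator, List

processed: List[str] = []  # records which pages have actually been processed


def analyze_pages(file_names: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a lazy pipeline: work happens per `next` call."""
    for name in file_names:
        processed.append(name)  # the "expensive" step runs only on demand
        yield f"page:{name}"


pages = iter(analyze_pages(["a.png", "b.png"]))
processed_before = list(processed)       # still empty: nothing has run yet

first = next(pages)                      # processes the first page only
processed_after_first = list(processed)

remaining = list(pages)                  # drains the iterator, processing the rest
```

After creating the iterator, `processed_before` is empty. Only the call to `next` triggers work on the first page, and a `for` loop or `list` call consumes the remaining pages one by one.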

## Page object

A Page object is returned, which has some handy tools for visualizing and retrieving the detected results. Some of its attributes store metadata about the input.

In [6]:
page.height, page.width, page.file_name

(2339, 1654, 'sample_2.png')

In [7]:
image = page.viz()

The `viz` method draws the bounding boxes of the identified layout components onto the image. The result can be displayed with matplotlib.

The layout analysis reproduces the title, text and tables. In addition, lists and figures, if any, are identified. We can see here that a table with table cells was recognized on the page. In addition, the segmentations such as rows and columns were framed. The row and column positions can be seen in the cell names.

In [None]:
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(image)

![title](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/pics/output_16_1.png)

We can use the `get_text` method to output the running text only. Table content is not included in the output.

In [9]:
print(page.get_text())


Festlegung der VV und angemessene Risikoadjustierung
Die VV-Pools der DWS Gruppe werden einer angemessenen Anpassung der Risiken unterzogen, die die Adjustierung ex ante als auch ex post umfasst. Die angewandte robuste Methode soll sicherstellen, dass bei der Festlegung der VV sowohl der risikoadjustierten Leistung als auch der Kapital- und Liquiditätsausstattung der DWS Gruppe Rechnung getragen wird. Die Er- mittlung des Gesamtbetrags der VV orientiert sich primär an (i) der Tragfähigkeit für die DWS Gruppe (das heißt, was „kann” die DWS Gruppe langfristig an VV im Einklang mit regulatorischen ‚Anforderungen gewähren) und (il) der Leistung (das heißt, was „sollte” die DWS Gruppe an VV gewähren, um für eine angemessene leistungsbezogene Vergütung zu sorgen und gleichzeitig den langfristigen Erfolg des Unternehmens zu sichern)
Die DWS Gruppe hat für die Festlegung der VV auf Ebene der individuellen Mitarbeiter die „Grundsätze für die Festlegung der variablen Vergütung” eingeführt. Dies

Tables are stored in `page.tables`, which is a Python list of table objects. Here, only one table has been detected.

In [10]:
len(page.tables)

1

In [11]:
page.tables[0].text

' Jahresdurchschnitt der Mitarbeiterzahl 139\n Gesamtvergütung ? EUR 15.315.952\n Fixe Vergütung EUR 13.151.856\n Variable Vergütung EUR 2.164.096\n davon: Carried Interest EURO\n Gesamtvergütung für Senior Management ® EUR 1.468.434\n Gesamtvergütung für sonstige Risikoträger EUR 324.229\n Gesamtvergütung für Mitarbeiter mit Kontrollfunktionen EUR 554.046\n'

In addition, an HTML version is generated that visually reproduces the recognized structure.

In [12]:
HTML(page.tables[0].html)

| 0 | 1 |
|---|---|
| Jahresdurchschnitt der Mitarbeiterzahl | 139 |
| Gesamtvergütung ? | EUR 15.315.952 |
| Fixe Vergütung | EUR 13.151.856 |
| Variable Vergütung | EUR 2.164.096 |
| davon: Carried Interest | EURO |
| Gesamtvergütung für Senior Management ® | EUR 1.468.434 |
| Gesamtvergütung für sonstige Risikoträger | EUR 324.229 |
| Gesamtvergütung für Mitarbeiter mit Kontrollfunktionen | EUR 554.046 |
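
If you want to work with the cell values rather than render them, an HTML table string can be turned into a list of rows with the standard library. The sketch below assumes a plain `<table>` made of `<tr>` and `<td>`/`<th>` elements. The `sample_html` string is hand-written from two rows of the output above and is not the exact string returned by `page.tables[0].html`.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears without a <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data  # text may arrive in several chunks


# Hand-written sample (assumption), mirroring two rows of the table above.
sample_html = (
    "<table>"
    "<tr><td>Fixe Vergütung</td><td>EUR 13.151.856</td></tr>"
    "<tr><td>Variable Vergütung</td><td>EUR 2.164.096</td></tr>"
    "</table>"
)

parser = TableRows()
parser.feed(sample_html)
rows = [[cell.strip() for cell in row] for row in parser.rows]
```

Each entry of `rows` is one table row as a list of cell strings, which is convenient for further processing or export.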


Finally, you can save the full results to a JSON file.

In [13]:
page.save(image_path)

# How to continue

In this notebook we have shown how to use the built-in analyzer for text extraction from image/PDF documents. 

As a next step, we recommend exploring the notebook **Custom_Pipeline**. There we go into more detail about the composition of pipelines and explain with an example how you can build a pipeline yourself.