![title](./pics/dd_logo.png) 

# Getting started

**deep**doctection is a package for extracting text from complex, structured documents, whether native PDFs or scans. Unlike many text miners, **deep**doctection relies on deep learning models from powerful third-party libraries to solve OCR, vision and language embedding problems.

This notebook will give you a quick tour so that you can get started straight away.

In [1]:
import cv2
from pathlib import Path
from matplotlib import pyplot as plt
from IPython.display import HTML  # importing HTML from IPython.core.display is deprecated

import deepdoctection as dd

## Sample

Take an image (e.g. .png, .jpg, ...). If you use the example below, you may need to adjust ```image_path```.

In [None]:
image_path = Path(dd.get_package_path()) / "notebooks/pics/samples/sample_2/sample_2.png"
image = cv2.imread(image_path.as_posix())
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(image[:, :, ::-1])  # cv2.imread returns BGR, matplotlib expects RGB

![title](./pics/samples/sample_2/sample_2.png)

## Analyzer

Next, we instantiate the **deep**doctection analyzer, a built-in pipeline you can use out of the box. It is one example of a pipeline that can be assembled for the problem you want to tackle, and it is composed of the building blocks shown in the diagram.

There is a lot going on under the hood. The analyzer calls three object detectors to structure the page and an OCR engine to extract the text. This alone is not enough: words also have to be mapped to layout structures and a reading order has to be inferred.

![title](./pics/dd_pipeline.png)  

In [None]:
analyzer = dd.get_dd_analyzer(language='deu')

The language of the sample is German. Passing the argument `language='deu'` selects a Tesseract model that has been trained on a German corpus, which gives much better results than the default English model.

## Analyze methods

Now that all models have been loaded, we can process single pages or whole documents. Set `path=path/to/dir` if you have a folder of scans, or `path=path/to/my/doc.pdf` if you have a single PDF document.

In [5]:
path = Path(dd.get_package_path()) / "notebooks/pics/samples/sample_2"

df = analyzer.analyze(path=path)
df.reset_state()  # This method must be called just before starting the iteration. It is part of the API.

|                                                                                                                                                                                                 |1/?[00:00<00:00,1247.93it/s]


When you run the cell, you will see that not much has happened yet. The reason is that `analyze` works like a generator: nothing is processed until we iterate over the result. We need a `for` loop or `next` to start the process.

In [None]:
doc=iter(df)
page = next(doc)
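
The lazy behaviour described above can be reproduced with a plain Python generator. The following is only a sketch: `analyze_stub` and the `processed` list are hypothetical stand-ins and are not part of **deep**doctection. It merely shows that the work is done when an item is requested, not when the generator is created.

```python
from typing import Iterator, List

processed: List[str] = []  # records which files were actually processed


def analyze_stub(file_names: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a lazy pipeline: work happens only on iteration."""
    for name in file_names:
        processed.append(name)  # the "expensive" step would run here
        yield f"page:{name}"


pages = analyze_stub(["sample_1.png", "sample_2.png"])
print(processed)  # still empty: creating the generator does no work

first = next(pages)  # pulls exactly one item through the pipeline
print(first, processed)
```

Only the first file has been touched after the single `next` call. The second one is processed when the next item is requested.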

## Page

Let's see what we got back. We start with some header information about the page. With `get_attribute_names()` you get a list of all attributes. 

In [7]:
page.height, page.width, page.file_name, page.location

(2339.0,
 1654.0,
 'sample_2.png',
 '/home/janis/Public/deepdoctection_pt/deepdoctection/notebooks/pics/samples/sample_2/sample_2.png')

In [8]:
page.get_attribute_names()

{<PageType.document_type>, <PageType.language>, 'layouts', 'tables', 'text'}

`page.document_type` returns None. The reason is that this pipeline is not built for document classification. You can easily build a pipeline containing a document classifier, though. Check the docs for further information.

In [10]:
print(page.document_type)

None


We can visualize the detected segments. If you set `interactive=True`, a viewer will pop up. Use + and - to zoom in/out. Use q to close the page.

Alternatively, you can visualize the output with matplotlib.

In [None]:
image = page.viz()
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(image)

![title](https://github.com/deepdoctection/deepdoctection/raw/master/docs/tutorials/pics/output_16_1.png)

Let's have a look at other attributes. We can use the `text` property to get the content of the document. You will notice that the table is not included, so tables are kept separate from the other content. In fact, you can filter on any layout category.

In [11]:
print(page.text)


Festlegung der VV und angemessene Risikoadjustierung
Die VV-Pools der DWS Gruppe werden einer angemessenen Anpassung der Risiken unterzogen, die die Adjustierung ex ante als auch ex post umfasst. Die angewandte robuste Methode soll sicherstellen, dass bei der Festlegung der VV sowohl der risikoadjustierten Leistung als auch der Kapital- und Liquiditätsausstattung der DWS Gruppe Rechnung getragen wird. Die Er- mittlung des Gesamtbetrags der VV orientiert sich primär an (i) der Tragfähigkeit für die DWS Gruppe (das heißt, was „kann” die DWS Gruppe langfristig an VV im Einklang mit regulatorischen ‚Anforderungen gewähren) und (il) der Leistung (das heißt, was „sollte” die DWS Gruppe an VV gewähren, um für eine angemessene leistungsbezogene Vergütung zu sorgen und gleichzeitig den langfristigen Erfolg des Unternehmens zu sichern)
Die DWS Gruppe hat für die Festlegung der VV auf Ebene der individuellen Mitarbeiter die „Grundsätze für die Festlegung der variablen Vergütung” eingeführt. Dies

In [13]:
for layout in page.layouts:
    if layout.category_name=="title":
        print(f"Title: {layout.text}")

Title: Identifi ierung von Risikoträgern
Title: Vergütung für das Jahr 2018
Title: Festlegung der VV und angemessene Risikoadjustierung
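
The same filtering idea can be generalised to group all layout texts by category. This is a minimal sketch that relies only on the `category_name` and `text` attributes used above. The helper `text_by_category` is hypothetical, and the `SimpleNamespace` objects are made-up stand-ins for real layout objects so that the snippet runs without any models.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable, List


def text_by_category(layouts: Iterable) -> Dict[str, List[str]]:
    """Group the text of layout-like objects by their category_name."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for layout in layouts:
        grouped[layout.category_name].append(layout.text)
    return dict(grouped)


# Made-up stand-ins; with a real page you would pass page.layouts instead.
layouts = [
    SimpleNamespace(category_name="title", text="Vergütung für das Jahr 2018"),
    SimpleNamespace(category_name="text", text="Ein Absatz."),
    SimpleNamespace(category_name="title", text="Festlegung der VV"),
]

grouped = text_by_category(layouts)
print(grouped["title"])
```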


Tables are stored in `page.tables`, which is a Python list of table objects. As the output below shows, only one table has been detected. Let's have a closer look at it. Most attributes are hopefully self-explanatory. If you `print(page.tables)` you will get a very cryptic `__repr__` output.

In [14]:
len(page.tables)

1

In [15]:
table = page.tables[0]
table.get_attribute_names()

{'bbox',
 'cells',
 'columns',
 <TableType.html>,
 <TableType.item>,
 <TableType.max_col_span>,
 <TableType.max_row_span>,
 <TableType.number_of_columns>,
 <TableType.number_of_rows>,
 <Relationships.reading_order>,
 'rows',
 'text',
 'words'}

In [16]:
table.number_of_rows, table.number_of_columns

(8, 2)

In [17]:
HTML(table.html)

0,1
Jahresdurchschnitt der Mitarbeiterzahl,139
Gesamtvergütung ?,EUR 15.315.952
Fixe Vergütung,EUR 13.151.856
Variable Vergütung,EUR 2.164.096
davon: Carried Interest,EURO
Gesamtvergütung für Senior Management ®,EUR 1.468.434
Gesamtvergütung für sonstige Risikoträger,EUR 324.229
Gesamtvergütung für Mitarbeiter mit Kontrollfunktionen,EUR 554.046
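
If you need the table as plain Python data rather than rendered HTML, the standard library is enough to parse an HTML string. The sketch below is an assumption-laden example: the exact markup of `table.html` is not shown here, so `sample_html` is a hand-written stand-in, and `TableRows` is a hypothetical helper, not part of **deep**doctection.

```python
from html.parser import HTMLParser
from typing import List


class TableRows(HTMLParser):
    """Collect the cell text of an HTML table, row by row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears before any <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


# Hand-written stand-in for table.html (assumed structure).
sample_html = (
    "<table>"
    "<tr><td>Fixe Vergütung</td><td>EUR 13.151.856</td></tr>"
    "<tr><td>Variable Vergütung</td><td>EUR 2.164.096</td></tr>"
    "</table>"
)

parser = TableRows()
parser.feed(sample_html)
parser.close()
rows = parser.rows
print(rows)
```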


Let's go deeper into the rabbit hole. A `Table` has cells and we can even get the text of one particular cell. Note that the list of cells is not sorted by row or column; you have to sort it yourself.

In [18]:
cell = table.cells[0]
cell.get_attribute_names()

{'bbox',
 <CellType.body>,
 <CellType.column_number>,
 <CellType.column_span>,
 <CellType.header>,
 <Relationships.reading_order>,
 <CellType.row_number>,
 <CellType.row_span>,
 'text',
 'words'}

In [19]:
cell.column_number, cell.row_number, cell.text, cell.annotation_id  # every object comes with a unique annotation_id

(1,
 8,
 'Gesamtvergütung für Mitarbeiter mit Kontrollfunktionen',
 'afb3c667-5d58-3689-a82b-69a8a5f71cbd')
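
Because the list of cells is unordered, one way to restore the layout is to place each cell into a grid by its row and column number. The output above suggests that both numbers are 1-based. The sketch below assumes this and ignores row and column spans. `cells_to_grid` is a hypothetical helper, and the cells are made-up stand-ins so that the snippet runs on its own.

```python
from types import SimpleNamespace
from typing import Iterable, List


def cells_to_grid(cells: Iterable, n_rows: int, n_cols: int) -> List[List[str]]:
    """Arrange cell texts in a row-major grid (1-based numbers, spans ignored)."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        grid[cell.row_number - 1][cell.column_number - 1] = cell.text
    return grid


# Made-up stand-ins in shuffled order. With a real table you would call
# cells_to_grid(table.cells, table.number_of_rows, table.number_of_columns).
cells = [
    SimpleNamespace(row_number=2, column_number=1, text="Variable Vergütung"),
    SimpleNamespace(row_number=1, column_number=2, text="EUR 13.151.856"),
    SimpleNamespace(row_number=2, column_number=2, text="EUR 2.164.096"),
    SimpleNamespace(row_number=1, column_number=1, text="Fixe Vergütung"),
]

grid = cells_to_grid(cells, n_rows=2, n_cols=2)
for row in grid:
    print(row)
```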

We are not done yet: each cell holds a list of words from which its text string is generated.

In [20]:
word = cell.words[0]
word.get_attribute_names()

{'bbox',
 <WordType.block>,
 <WordType.characters>,
 <Relationships.reading_order>,
 <WordType.tag>,
 <WordType.text_line>,
 <WordType.token_class>,
 <WordType.token_tag>}

The reading order determines the position of a word in the string. OCR engines generally provide some heuristics to infer a reading order. This library, however, follows the approach of disentangling every processing step.

In [21]:
word.characters, word.reading_order, word.token_class

('Gesamtvergütung', 1, None)
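
To illustrate how a reading order turns a bag of words into a string, here is a minimal sketch. It assumes that every word has an integer `reading_order` and a `characters` string, as in the output above. `words_to_text` is a hypothetical helper, not the library's own implementation, and the words are made-up stand-ins.

```python
from types import SimpleNamespace
from typing import Iterable


def words_to_text(words: Iterable) -> str:
    """Join word strings in ascending reading order."""
    ordered = sorted(words, key=lambda w: w.reading_order)
    return " ".join(w.characters for w in ordered)


# Made-up stand-ins in shuffled order.
words = [
    SimpleNamespace(characters="für", reading_order=2),
    SimpleNamespace(characters="Gesamtvergütung", reading_order=1),
    SimpleNamespace(characters="Mitarbeiter", reading_order=3),
]

text = words_to_text(words)
print(text)
```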

The `Page` object is read-only: even though you can assign a new value, the change will not be persisted.

In [22]:
word.token_class = "ORG"

In [23]:
word #  __repr__ of the base object does carry <WordType.token_class> information.  

Word(active=True, _annotation_id='f35f5c53-8af3-3ed9-971a-4cd65c0a37ce', category_name=<LayoutType.word>, _category_name=<LayoutType.word>, category_id='1', score=0.91, sub_categories={<WordType.characters>: ContainerAnnotation(active=True, _annotation_id='fa28e8c0-5883-392f-b23b-92adb8537b8a', category_name=<WordType.characters>, _category_name=<WordType.characters>, category_id='None', score=0.91, sub_categories={}, relationships={}, value='Gesamtvergütung'), <WordType.block>: CategoryAnnotation(active=True, _annotation_id='8a40178f-1dff-3a02-81be-2b5f5b6d6250', category_name=<WordType.block>, _category_name=<WordType.block>, category_id='47', score=None, sub_categories={}, relationships={}), <WordType.text_line>: CategoryAnnotation(active=True, _annotation_id='34bd3cdf-0048-3647-af75-b43532688418', category_name=<WordType.text_line>, _category_name=<WordType.text_line>, category_id='1', score=None, sub_categories={}, relationships={}), <Relationships.reading_order>: CategoryAnnotati

You can save your result in a big `.json` file. The default `save` configuration stores the image as a base64 string, so be aware: the `.json` file with that image has a size of 6.2 MB!

In [24]:
page.save()
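
Part of the file size comes from the encoding itself: base64 represents every 3 bytes as 4 ASCII characters, so the encoded string is about one third larger than the bytes it encodes. The sketch below demonstrates this with the standard library only. The byte string is arbitrary stand-in data, and which bytes `save` actually encodes is not covered here.

```python
import base64
import json

raw = bytes(range(256)) * 12  # 3072 bytes of arbitrary stand-in data
encoded = base64.b64encode(raw).decode("ascii")
payload = json.dumps({"image": encoded})  # what ends up in a JSON file

ratio = len(encoded) / len(raw)
print(len(raw), len(encoded), round(ratio, 3))

# The round trip restores the original bytes.
restored = base64.b64decode(json.loads(payload)["image"])
```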

Once the results are saved, you can easily parse the file back into a `Page` object.

In [25]:
path = Path(dd.get_package_path()) / "notebooks/pics/samples/sample_2/sample_2.json"

df = dd.SerializerJsonlines.load(path)
page = dd.Page.from_dict(**next(iter(df)))