![title](./pics/dd_logo.png) 


# Getting started

**deep**doctection is a package for extracting text from documents with complex structure. It can also run multi-modal models (text + vision) in an end-to-end pipeline. Inputs can be native PDFs or images. In contrast to various text miners, **deep**doctection uses deep learning models from powerful third-party libraries to solve OCR, layout detection, classification and entity recognition problems, and brings everything together in one unified output structure.

This notebook gives you a quick introduction to using **deep**doctection for extracting text information from complex documents. In other words, we are going to attack the **document parsing problem**.

We assume that you have successfully installed the basic setup with PyTorch, timm, Transformers, docTR and **deep**doctection.

In [1]:
from pathlib import Path
from matplotlib import pyplot as plt
from IPython.display import HTML

import deepdoctection as dd

## Sample

Take an image (e.g. .png, .jpg, ...). If you use the example below, you may need to adjust `image_path`.

In [None]:
image_path = Path.cwd() / "pics/samples/sample_2/sample_2.png"

# viz_handler is a tool for loading and processing images 
image = dd.viz_handler.read_image(image_path)  
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(image)

![title](./pics/samples/sample_2/sample_2.png)

## Analyzer

Next, we instantiate the **deep**doctection analyzer with its default config. During instantiation, various models need to be downloaded. Because they are cached, only the first invocation takes some time.

In [None]:
analyzer = dd.get_dd_analyzer()

## Analyze methods

Once all models have been loaded, we can process single pages, multi-page PDF documents or `Dataflow`s. Leaving `Dataflow`s aside for now, you can either set `path='path/to/dir'` if you have a **directory with images** or `path='path/to/my/doc.pdf'` if you have a **PDF document**.

You will receive an error if your path points to a single image. To process images, pass the path to the directory that contains them.
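
The rule above can be checked before calling the analyzer. The following is a minimal sketch using only the standard library; `resolve_input_path` is a hypothetical helper and not part of **deep**doctection, and the exact error the analyzer raises for a single image is not reproduced here.

```python
import tempfile
from pathlib import Path


def resolve_input_path(path: Path) -> Path:
    """Hypothetical pre-check, not part of deepdoctection.

    Accepts a directory of images or a PDF file and rejects anything else,
    in particular a path that points to a single image.
    """
    if path.is_dir():
        return path
    if path.is_file() and path.suffix.lower() == ".pdf":
        return path
    raise ValueError(
        f"{path} is neither a directory nor a PDF file. "
        "For a single image, pass the directory that contains it."
    )


# Demonstration on a throwaway directory with one (empty) image file.
with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp)
    image_file = image_dir / "sample.png"
    image_file.touch()

    checked = resolve_input_path(image_dir)  # a directory is accepted
    try:
        resolve_input_path(image_file)  # a single image is rejected
        rejected = False
    except ValueError:
        rejected = True
```

Keep in mind that passing a directory means every image in it will be processed, so a single image is best placed in a directory of its own.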

In [None]:
path = Path.cwd() / "pics/samples/sample_2"

df = analyzer.analyze(path=path)
df.reset_state()  # Must be called right before iterating; it is part of deepdoctection's data processing API.

As you can see when running the cell, not much has happened yet. The reason is that `analyze` is a [generator function](https://wiki.python.org/moin/Generators). It does not return any results immediately. Instead, it returns a `Dataflow`.

A `Dataflow` is an object that creates iterators for data loading and data processing. You can traverse all the values of a `Dataflow` with a `for` loop, `next` or `list`. Let's go!
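
The lazy behaviour described above can be illustrated with a plain Python generator. This is only a sketch of the principle: `fake_analyze` is a hypothetical stand-in and does not call **deep**doctection or reproduce the `Dataflow` class.

```python
from typing import Iterator, List

processed: List[str] = []  # records which items have actually been processed


def fake_analyze(file_names: List[str]) -> Iterator[str]:
    """Hypothetical stand-in for a lazy pipeline: work happens only on demand."""
    for name in file_names:
        processed.append(name)  # side effect marks the moment of processing
        yield f"page:{name}"


pages = fake_analyze(["a.png", "b.png", "c.png"])
assert processed == []  # creating the generator has not processed anything

first = next(pages)  # processes exactly one item
assert processed == ["a.png"]

rest = list(pages)  # drains the remaining items
```

Each call to `next` triggers the work for one item only, which is why the pipeline logs appear during iteration rather than when `analyze` is called.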

The logs will show you how the image traverses the analyzer.

In [5]:
doc = iter(df)
page = next(doc)

[1222 12:38.59 @doctectionpipe.py:118]  INF  Processing sample_2.png
[1222 12:39.02 @context.py:156]  INF  ImageLayoutService total: 2.0588 sec.
[1222 12:39.02 @context.py:156]  INF  AnnotationNmsService total: 0.0019 sec.
[1222 12:39.02 @context.py:156]  INF  SubImageLayoutService total: 0.1899 sec.
[1222 12:39.02 @maputils.py:119]  WRN  MappingContextManager error. Will filter bounding box
[1222 12:39.02 @maputils.py:119]  WRN  MappingContextManager error. Will filter bounding box
[1222 12:39.02 @context.py:156]  INF  PubtablesSegmentationService total: 0.0076 sec.
[1222 12:39.02 @context.py:156]  INF  ImageLayoutService total: 0.6173 sec.
[1222 12:39.03 @context.py:156]  INF  TextExtractionService total: 0.7271 sec.

## Page

Let's see what we got back. For each iteration we receive a `Page` object. This object stores all the information that has been collected from a document page while running through the pipeline.

In [6]:
type(page)

dd_core.datapoint.view.Page

Let's also have a look at some top-level information.

In [7]:
print(f" height: {page.height} \n width: {page.width} \n file_name: {page.file_name} \n document_id: {page.document_id} \n image_id: {page.image_id}\n")

 height: 2339 
 width: 1654 
 file_name: sample_2.png 
 document_id: c1776412-857f-3102-af7c-1869139a278d 
 image_id: c1776412-857f-3102-af7c-1869139a278d



`document_id` and `image_id` are the same. The reason is that we only process a single image. The naming convention silently assumes that we are dealing with a one-page document. Once we process multi-page PDFs, `document_id` and `image_id` differ.

## Layout segments

We can visualize the detected layout segments. If you set `interactive=True`, a viewer will pop up. Use `+` and `-` to zoom in and out. Use `q` to close the viewer.

Alternatively, you can visualize the output with matplotlib.

In [None]:
image = page.viz()
plt.figure(figsize = (25,17))
plt.axis('off')
plt.imshow(image)

![title](./pics/output_16_1.png)

Let's have a look at other attributes. We can use the `text` property to get the content of the document. You will notice that the table is not included. You can therefore separate tables from the other content. In fact, you can filter on any layout segment.

In [9]:
print(page.text)

Diel W-Pools der DWS Gruppe werden einer angemessenen/ Anpassung der Risiken unterzogen, die die Adjustierung ex ante als auch ex post umfasst. Die angewandte robuste Methode soll sicherstellen, dass bei der Festlegung der Ws sowohl derr risikoadjustierten Leistung als auch der Kapital- undl Liquiditatsausstattung der DWS Gruppe Rechnung getragen wird. Die Er mittlung des Gesamtbetrags derW orientierts sich primar ran()der Tragfahigkeit fur diel DWS Gruppe (dash heilst, was, kann" 'die DWS Gruppe langfristig an W im Einklangmitr regulatorischen Anforderungen gewahren) und (ii) der Leistung (das heilst, was, sollte" die DWS Gruppe anW gewahren, umf fur eine angemessene leistungsbezogene Vergutung zu sorgen und gleichzeitig denl langfristigen Erfolg des Unternehmens zu sichern).
Diel DWS Gruppel hati fur dief Festlegung derWa auf Ebene deri individuellen Mitarbeiter die, Grundsatze fur dief Festlegung der variablenVergutung" eingefuhrt. Diese enthalten Informationen uber die Faktoren und

You can get the individual layout segments like `text`, `title` or `line`. 

In [10]:
for layout in page.layouts:
    print(f"{layout.category_name}: {layout.text}")

text: Nach der hervorragenden Entwicklung im Jahr 2017 hatte die globale Vermogensverwaltungsbranche 20181 mit einigen Schwierigkeiten zul kâmpfen. Grunde waren ungunstige Marktbedin- gungen, stàrkere geopolitische Spannungen und dier negative Stimmung unter den Anlegern, vora allem am europaischen Retail-Markt. Auch die DWS Gruppe blieb von dieser Entwicklung nicht verschont.
text: GemaB Gesetz vom 17. Dezember 2010 uber die Organismen fur gemeinsame Anlagen (in seiner jeweils gultigen Fassung) sowie dent ESMA-Leitlinien unter Beruicksichtigung der OGAW Richtlinie hat die Gesellschaft Mitarbeiter mity wesentlichem Einfluss auf das Risikoprofil der Gesellschaft ermittelt .Risikotrager"). Dasl Identifiaierungsverfahren basierta auf der Bewertung des Einflusses folgender Kategorien von Mitarbeitern auf das Risikoprofil der Gesellschaft oder einen von ihr verwalteten Fonds: (a) Geschaftsfuhrung/Senior Management, (b) Portfolio-/ Investmentmanager (c) Kontrollfunktionen, (d) Mitarbeiter mi

In [11]:
for layout in page.residual_layouts:
    print(f"{layout.category_name}: {layout.text}")

page_footer: 22
page_header: Festlegung derVV und angemessenel Risikoadjustierung


You can also get the layout segments from the `chunks` attribute. The output is a list of tuples with the essential metadata for each layout segment, namely `document_id`, `image_id`, `page_number`, `annotation_id`, `reading_order`, `category_name` and `text`.

In [12]:
page.chunks[0]

('c1776412-857f-3102-af7c-1869139a278d',
 'c1776412-857f-3102-af7c-1869139a278d',
 0,
 '1cbf803e-32fb-3ffe-aa77-5a18ed324e99',
 1,
 <LayoutType.TEXT>,
 'Diel W-Pools der DWS Gruppe werden einer angemessenen/ Anpassung der Risiken unterzogen, die die Adjustierung ex ante als auch ex post umfasst. Die angewandte robuste Methode soll sicherstellen, dass bei der Festlegung der Ws sowohl derr risikoadjustierten Leistung als auch der Kapital- undl Liquiditatsausstattung der DWS Gruppe Rechnung getragen wird. Die Er mittlung des Gesamtbetrags derW orientierts sich primar ran()der Tragfahigkeit fur diel DWS Gruppe (dash heilst, was, kann" \'die DWS Gruppe langfristig an W im Einklangmitr regulatorischen Anforderungen gewahren) und (ii) der Leistung (das heilst, was, sollte" die DWS Gruppe anW gewahren, umf fur eine angemessene leistungsbezogene Vergutung zu sorgen und gleichzeitig denl langfristigen Erfolg des Unternehmens zu sichern).')
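
Positional tuples are easy to mix up, so it can help to give the seven fields names. The sketch below wraps the tuple shown above in a `NamedTuple`; the `Chunk` class is a hypothetical convenience and not part of **deep**doctection, the category is written as a plain string instead of the `LayoutType` member, and the text is shortened.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    """Hypothetical wrapper that names the seven chunk fields in order."""

    document_id: str
    image_id: str
    page_number: int
    annotation_id: str
    reading_order: int
    category_name: str
    text: str


# Values copied from the output above; the category is a plain string here
# and the text is cut short for readability.
raw_chunk = (
    "c1776412-857f-3102-af7c-1869139a278d",
    "c1776412-857f-3102-af7c-1869139a278d",
    0,
    "1cbf803e-32fb-3ffe-aa77-5a18ed324e99",
    1,
    "text",
    "Diel W-Pools der DWS Gruppe werden einer angemessenen/ Anpassung ...",
)

chunk = Chunk(*raw_chunk)
print(chunk.category_name, chunk.reading_order, chunk.page_number)
```

With named fields, downstream code can refer to `chunk.text` or `chunk.reading_order` instead of relying on index positions.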

Tables cannot be retrieved from `page.layouts`. They are available through the dedicated `page.tables` attribute, which is a Python list of table objects. In our case, only one table has been detected.

In [13]:
len(page.tables)

1

Let's have a closer look at the table. 

In [14]:
table = page.tables[0]

In [15]:
print(f" number of rows: {table.number_of_rows} \n number of columns: {table.number_of_columns} \n reading order: {table.reading_order}")

 number of rows: 8 
 number of columns: 2 
 reading order: None


There is no reading order. This is different from the layout sections in `page.layouts`.

The reason is, again, that tables are not part of the narrative text. Since they are not part of the narrative text, they do not need to be sorted to form a consistent text block. Whether a layout section belongs to the narrative text is purely a matter of configuration, and it can be configured so that tables are part of `page.text`.

You can get an HTML, CSV or text version of your table.

In [16]:
HTML(table.html)

| 0 | 1 |
|---|---|
| Jahresdurchschnitt der Mitarbeiterzahl | 139 |
| Gesamtvergutung? | EUR 15.315.952 |
| FixeVergûtung | EUR1 13.151.856 |
| Variable Vergûtung | EUR2.164.096 |
| davon: Carried Interest | EURO |
| Gesamtvergutung fur Senior Management3 | EUR1 1.468.434 |
| Gesamtvergutung fur sonstige Risikotràger | EUR3 324.229 |
| Gesamtvergutung fur Mitarbeiter mit Kontrolfunktionen | EUR554.046 |


Use `table.csv` to load the table into a pandas `DataFrame`.

In [17]:
table.csv  #pd.DataFrame(table.csv, columns=["Key", "Value"])

[['Jahresdurchschnitt der Mitarbeiterzahl ', '139 '],
 ['Gesamtvergutung? ', 'EUR 15.315.952 '],
 ['FixeVergûtung ', 'EUR1 13.151.856 '],
 ['Variable Vergûtung ', 'EUR2.164.096 '],
 ['davon: Carried Interest ', 'EURO '],
 ['Gesamtvergutung fur Senior Management3 ', 'EUR1 1.468.434 '],
 ['Gesamtvergutung fur sonstige Risikotràger ', 'EUR3 324.229 '],
 ['Gesamtvergutung fur Mitarbeiter mit Kontrolfunktionen ', 'EUR554.046 ']]
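
The cell texts in this output carry trailing whitespace, which is worth removing before further processing. A minimal sketch using only the standard library is shown below; it copies the first three rows from the output above, and the column names `Key` and `Value` are taken from the commented-out hint in the cell, not from the table itself.

```python
import csv
import io

# Rows copied from the `table.csv` output above (first three rows only).
rows = [
    ["Jahresdurchschnitt der Mitarbeiterzahl ", "139 "],
    ["Gesamtvergutung? ", "EUR 15.315.952 "],
    ["FixeVergûtung ", "EUR1 13.151.856 "],
]

# Strip the trailing whitespace that the cell texts carry.
clean_rows = [[cell.strip() for cell in row] for row in rows]

# Serialize to CSV text with an explicit header row.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Key", "Value"])
writer.writerows(clean_rows)
csv_text = buffer.getvalue()
```

The cleaned list of lists can be passed to `pd.DataFrame(clean_rows, columns=["Key", "Value"])` in the same way as the raw one.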

There is also a string representation of the table.

In [18]:
table.text

'Jahresdurchschnitt der Mitarbeiterzahl 139 Gesamtvergutung? EUR 15.315.952 FixeVergûtung EUR1 13.151.856 Variable Vergûtung EUR2.164.096 davon: Carried Interest EURO Gesamtvergutung fur Senior Management3 EUR1 1.468.434 Gesamtvergutung fur sonstige Risikotràger EUR3 324.229 Gesamtvergutung fur Mitarbeiter mit Kontrolfunktionen EUR554.046'

The method `kv_header_rows(row_number)` allows returning column headers and cell contents as key-value pairs for entire rows. Admittedly, the example here is flawed because the table lacks column headers. In fact, the table recognition model determines whether and where a column has a header. In this case, the prediction was incorrect.

However, the principle becomes clear: we receive a dictionary with the schema 

```{(column_number, column_header(column_number)): cell(row_number, column_number).text}```.
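
To make the schema concrete, here is a small pure-Python sketch that builds such a dictionary from a grid of cell texts. `kv_header_row_sketch` is a hypothetical re-implementation for illustration, not the library code: it assumes the first row is the header row, uses 1-based row and column numbers, and keeps the keys as tuples, whereas the library may format them differently.

```python
from typing import Dict, List, Tuple


def kv_header_row_sketch(
    grid: List[List[str]], row_number: int
) -> Dict[Tuple[int, str], str]:
    """Hypothetical illustration of the schema, not the library implementation.

    Treats the first row of `grid` as the header row and uses 1-based
    row and column numbers.
    """
    header = grid[0]
    row = grid[row_number - 1]
    return {
        (column_number, header[column_number - 1]): row[column_number - 1]
        for column_number in range(1, len(header) + 1)
    }


# First two rows of the table above. The first row is treated as the header,
# mirroring the incorrect header prediction mentioned in the text.
grid = [
    ["Jahresdurchschnitt der Mitarbeiterzahl", "139"],
    ["Gesamtvergutung?", "EUR 15.315.952"],
]
result = kv_header_row_sketch(grid, row_number=2)
```

Each key pairs a column number with that column's header text, and the value is the text of the cell in the requested row.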

In [19]:
table.kv_header_rows(2)

{"(1, 'Jahresdurchschnitt der Mitarbeiterzahl')": 'Gesamtvergutung?',
 "(2, '139')": 'EUR 15.315.952'}

Let's go deeper down the rabbit hole. A `Table` has cells and we can even get the text of one particular cell.

In [20]:
cell = table.cells[0]

In [22]:
print(f"column number: {cell.column_number} \n row_number: {cell.row_number}  \n bounding_box: {cell.bbox} \n text: {cell.text} \n annotation_id: {cell.annotation_id}")

column number: 1 
 row_number: 1  
 bounding_box: [137, 1292, 776, 1317] 
 text: Jahresdurchschnitt der Mitarbeiterzahl 
 annotation_id: 15aeda57-6508-3f0b-8a04-fac35db66369


We have not reached the bottom yet: each cell has a list of words from which its text string is generated.

In [23]:
word = cell.words[0]

As already mentioned, the reading order determines the position of text in a larger text block. There are two levels of reading orders: 

- Reading order at the level of words in layout sections. 
- Reading order at the level of layout sections in a page.

Ordering text is a huge challenge, especially when ordering layout sections. Documents can have a very complex layout structure, and if you use a heuristic ordering approach you need to compromise to some extent. Reading order at the level of words, in contrast, basically amounts to ordering words in a rectangle. This is easier.
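
Ordering words inside a rectangle can be sketched with a simple heuristic: group words into lines by their vertical position, then sort each line from left to right. The code below is a naive illustration under made-up coordinates; `WordBox`, `order_words` and `line_tolerance` are hypothetical names, and this is not the algorithm **deep**doctection uses.

```python
from typing import List, NamedTuple


class WordBox(NamedTuple):
    """Hypothetical word with the upper-left corner of its bounding box."""

    text: str
    x: float
    y: float


def order_words(words: List[WordBox], line_tolerance: float = 10.0) -> List[WordBox]:
    """Naive heuristic: group words into lines by y, then sort each line by x."""
    lines: List[List[WordBox]] = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current line if it is vertically close to the
        # first word of that line; otherwise it starts a new line.
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [word for line in lines for word in sorted(line, key=lambda w: w.x)]


# Made-up coordinates for three words that sit on one line.
words = [
    WordBox("Mitarbeiterzahl", x=420, y=1294),
    WordBox("Jahresdurchschnitt", x=137, y=1292),
    WordBox("der", x=370, y=1295),
]
ordered = [w.text for w in order_words(words)]
```

A fixed tolerance like this breaks down for skewed scans or mixed font sizes, which hints at why ordering whole layout sections across columns is much harder.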

Let's look at some more attributes.

In [24]:
print(f" characters: {word.characters} \n reading order: {word.reading_order} \n token class: {word.token_class}")

 characters: Mitarbeiterzahl 
 reading order: 3 
 token class: None


## Saving and reading

You can use the `save` method to save the result of the analyzer in a `.json` file. If you set `image_to_json=True`, the image is also stored in the file as a base64-encoded string. Beware that the files can then become quite large.

In [None]:
page.save(image_to_json=True, path="/path/to/dir/test.json")
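
Why the files grow can be seen with the standard library alone. The sketch below stores stand-in image bytes as base64 text inside a JSON record and restores them; the record layout and the key `image_b64` are hypothetical and do not reflect the actual file format written by `save`.

```python
import base64
import json

# Stand-in for raw image bytes; a real page image would be far larger.
image_bytes = bytes(range(256)) * 12  # 3072 bytes

# Binary data cannot be stored in JSON directly, so it is encoded as text.
encoded = base64.b64encode(image_bytes).decode("ascii")

# Hypothetical, heavily simplified record; the real file layout differs.
record = {"file_name": "sample_2.png", "image_b64": encoded}
serialized = json.dumps(record)

# Decoding restores the original bytes exactly.
restored = base64.b64decode(json.loads(serialized)["image_b64"])

# base64 emits 4 characters for every 3 input bytes.
overhead = len(encoded) / len(image_bytes)
```

The encoding is lossless, but the image takes roughly a third more space than the raw bytes, on top of the annotation data stored alongside it.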

Having saved the results, you can easily parse the file back into the `Page` format without losing any information.

In [33]:
page = dd.Page.from_file(file_path="/path/to/dir/test.json")