![title](./pics/dd_logo.png) 

# Pipelines

In the [**configuration notebook**](./Analyzer_Configuration.ipynb), we have already touched on the structure of a pipeline. In this section, we will take a deeper look into its architecture.

As a first step, we will take a closer look at the **deep**doctection analyzer and lay the groundwork for developing custom pipelines tailored to specific use cases. After all, document parsing is just one possible application.

You might, for example, want to build a pipeline that classifies pages from a scanned binder and extracts entities from them. For such scenarios, **deep**doctection provides powerful components and tools that help you achieve your goals with a manageable amount of code.

If you open the get started notebook and scroll to the cell where the image is processed, you will see some log output:


```
[0523 22:14.35 @doctectionpipe.py:103]  INF  Processing sample_2.png
[0523 22:14.37 @context.py:133]         INF  ImageLayoutService total: 2.3095 sec.
[0523 22:14.37 @context.py:133]         INF  AnnotationNmsService total: 0.002 sec.
[0523 22:14.38 @context.py:133]         INF  SubImageLayoutService total: 0.3684 sec.
[0523 22:14.38 @context.py:133]         INF  PubtablesSegmentationService total: 0.0066 sec.
[0523 22:14.38 @context.py:133]         INF  ImageLayoutService total: 0.4052 sec.
[0523 22:14.39 @context.py:133]         INF  TextExtractionService total: 0.9374 sec.
[0523 22:14.39 @context.py:133]         INF  MatchingService total: 0.0059 sec.
[0523 22:14.39 @context.py:133]         INF  TextOrderService total: 0.0301 sec.
```

The logs show which pipeline components the image passed through and how long each component took to process it. Some pipeline components have been deactivated by configuration.
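To see where the time goes, the timing lines can be parsed with a few lines of standard-library Python. The following is a minimal sketch that works on the log text copied from above; the helper name `parse_timings` and the regular expression are illustrative assumptions, not part of **deep**doctection. Results are kept as a list of pairs rather than a dictionary because the same service name (`ImageLayoutService`) is logged twice.

```python
import re

# Log lines copied from the output above.
LOG = """\
[0523 22:14.35 @doctectionpipe.py:103]  INF  Processing sample_2.png
[0523 22:14.37 @context.py:133]         INF  ImageLayoutService total: 2.3095 sec.
[0523 22:14.37 @context.py:133]         INF  AnnotationNmsService total: 0.002 sec.
[0523 22:14.38 @context.py:133]         INF  SubImageLayoutService total: 0.3684 sec.
[0523 22:14.38 @context.py:133]         INF  PubtablesSegmentationService total: 0.0066 sec.
[0523 22:14.38 @context.py:133]         INF  ImageLayoutService total: 0.4052 sec.
[0523 22:14.39 @context.py:133]         INF  TextExtractionService total: 0.9374 sec.
[0523 22:14.39 @context.py:133]         INF  MatchingService total: 0.0059 sec.
[0523 22:14.39 @context.py:133]         INF  TextOrderService total: 0.0301 sec.
"""

# Matches "<ServiceName> total: <seconds> sec." (hypothetical pattern for this log format).
TIMING = re.compile(r"(\w+) total: ([\d.]+) sec\.")


def parse_timings(log_text):
    """Return (service_name, seconds) pairs in the order they were logged."""
    return [(name, float(sec)) for name, sec in TIMING.findall(log_text)]


timings = parse_timings(LOG)
total = sum(sec for _, sec in timings)
slowest = max(timings, key=lambda item: item[1])

# Print each component with its absolute time and its share of the total.
for name, sec in timings:
    print(f"{name:<30} {sec:7.4f} s  {sec / total:6.1%}")
```

In this run the first `ImageLayoutService` entry accounts for more than half of the total processing time.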

A pipeline is built as a sequence of tasks. These tasks are called pipeline components or services.

![pipelines](./pics/dd_overview_pipeline.png)
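The idea of a pipeline as an ordered sequence of components can be sketched in plain Python. This is a toy model only: the datapoint is a simple dictionary and the functions `detect_layout`, `extract_text` and `run_pipeline` are hypothetical stand-ins, not the **deep**doctection classes.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

# In this sketch a "datapoint" is just a dict and a component is any
# callable that takes a datapoint and returns the enriched datapoint.
Datapoint = Dict[str, Any]
Component = Callable[[Datapoint], Datapoint]


def detect_layout(dp: Datapoint) -> Datapoint:
    # Hypothetical stand-in for a layout detection model.
    dp["layouts"] = ["title", "text"]
    return dp


def extract_text(dp: Datapoint) -> Datapoint:
    # Hypothetical stand-in for OCR; it builds on the previous component's output.
    dp["words"] = [f"word_in_{layout}" for layout in dp["layouts"]]
    return dp


def run_pipeline(components: List[Component], pages: Iterable[Datapoint]) -> Iterator[Datapoint]:
    """Pass every page through all components, in order, one page at a time."""
    for page in pages:
        for component in components:
            page = component(page)
        yield page


pages = [{"file_name": "sample_1.png"}, {"file_name": "sample_2.png"}]
results = list(run_pipeline([detect_layout, extract_text], pages))
print(results[0])
```

The order of the components matters: `extract_text` relies on the `layouts` key written by `detect_layout`, just as later services in a real pipeline depend on the results of earlier ones.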

Once a pipeline is defined, images or documents can be processed. These are either pure image files (like JPG, PNG, TIFF) or PDF files. PDF files are read and processed page by page. Each page is converted into a numpy array because arrays are the input for vision and OCR models. 


In [1]:
import deepdoctection as dd
from pathlib import Path

[0525 21:16.21 @file_utils.py:31]  INF  PyTorch version 2.2.2 available.
[0525 21:16.21 @file_utils.py:69]  INF  Disabling Tensorflow because USE_TORCH is set


In [2]:
analyzer = dd.get_dd_analyzer()

[0525 21:16.23 @dd.py:131]  INF  Config: 
 {'DEVICE': device(type='mps'),
 'LANGUAGE': None,
 'LAYOUT_LINK': {'CHILD_CATEGORIES': [<LayoutType.CAPTION>],
                 'PARENTAL_CATEGORIES': [<LayoutType.FIGURE>, <LayoutType.TABLE>]},
 'LAYOUT_NMS_PAIRS': {'COMBINATIONS': [[<LayoutType.TABLE>, <LayoutType.TITLE>],
                                       [<LayoutType.TABLE>, <LayoutType.TEXT>],
                                       [<LayoutType.TABLE>, <LayoutType.KEY_VALUE_AREA>],
                                       [<LayoutType.TABLE>, <LayoutType.LIST_ITEM>],
                                       [<LayoutType.TABLE>, <LayoutType.LIST>],
                                       [<LayoutType.TABLE>, <LayoutType.FIGURE>],
                                       [<LayoutType.TITLE>, <LayoutType.TEXT>],
                                       [<LayoutType.TEXT>, <LayoutType.KEY_VALUE_AREA>],
                                       [<LayoutType.TEXT>, <LayoutType.L

Let's take a closer look at the **deep**doctection analyzer. 

![pipeline](./pics/dd_pipeline_250525.png)

The architecture is modular, and a pipeline consists of individual components, each typically performing a single processing step. We have already explored the configuration options. When the analyzer is instantiated, a dictionary is printed to the logs, which begins approximately like this:

```
{'DEVICE': device(type='mps'),
 'LANGUAGE': None,
 'LAYOUT_LINK': {'CHILD_CATEGORIES': ['caption'], 'PARENTAL_CATEGORIES': ['figure', 'table']},
 'LAYOUT_NMS_PAIRS': {'COMBINATIONS': [['table', 'title'], ['table', 'text'],
                                       ['table', 'key_value_area'], ['table', 'list_item'],
                                       ['table', 'list'], ['table', 'figure'], ['title', 'text'],
                                       ['text', 'key_value_area'], ['text', 'list_item'],
                                       ['key_value_area', 'list_item']],
                      'PRIORITY': ['table', 'table', 'table', 'table', 'table', 'table', 'text',
                                   'text', None, 'key_value_area'],
                      'THRESHOLDS': [0.001, 0.01, 0.01, 0.001, 0.01, 0.01, 0.05, 0.01, 0.01, 0.01]},
 ...
}
```

Once you have a pipeline, you can list its components with `get_pipeline_info()`. It returns a dictionary that maps each `service_id` to the name of the component. Note that the name of the component depends not only on the service itself but also on the model that has been chosen.

In [3]:
analyzer.get_pipeline_info()


{'5497d92c': 'image_Transformers_Tatr_deformable-detr-DocLayNet_model.safetensors',
 '3b56c997': 'nms',
 '8c23055e': 'sub_image_Transformers_Tatr_tatr_tab_struct_v2_pytorch_model.bin',
 '03844ddb': 'table_transformer_segment',
 '01a15bff': 'image_doctr_db_resnet50pt_db_resnet50-ac60cadc.pt',
 '1cedc14d': 'text_extract_doctr_crnn_vgg16_bnpt_crnn_vgg16_bn-9762b0b0.pt',
 'd6219eba': 'matching',
 'f10aa678': 'text_order'}

In [4]:
component = analyzer.get_pipeline_component(service_id="5497d92c")
component.name

'image_Transformers_Tatr_deformable-detr-DocLayNet_model.safetensors'
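Because `get_pipeline_info()` returns an ordinary dictionary, looking up a `service_id` by part of the component name needs no special API. The sketch below works on a copy of the output shown above; the helper `find_service_ids` is a hypothetical convenience function, and the ids will differ if other models are configured.

```python
# Copied from the output of analyzer.get_pipeline_info() above.
pipeline_info = {
    "5497d92c": "image_Transformers_Tatr_deformable-detr-DocLayNet_model.safetensors",
    "3b56c997": "nms",
    "8c23055e": "sub_image_Transformers_Tatr_tatr_tab_struct_v2_pytorch_model.bin",
    "03844ddb": "table_transformer_segment",
    "01a15bff": "image_doctr_db_resnet50pt_db_resnet50-ac60cadc.pt",
    "1cedc14d": "text_extract_doctr_crnn_vgg16_bnpt_crnn_vgg16_bn-9762b0b0.pt",
    "d6219eba": "matching",
    "f10aa678": "text_order",
}


def find_service_ids(info, fragment):
    """Return the service_ids whose component name contains `fragment`."""
    return [service_id for service_id, name in info.items() if fragment in name]


doctr_ids = find_service_ids(pipeline_info, "doctr")
print(doctr_ids)
```

The returned ids can then be passed to `analyzer.get_pipeline_component(service_id=...)` as in the cell above.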

Each pipeline component has a `DatapointManager`, which manages the dataclass responsible for collecting all information related to a page. The results are then provided through the `Page` object, which generates the corresponding `JSON` output. Each component is identified by a `service_id`; if a service uses a model, the model is additionally identified by a `model_id`. If you want to process a batch of documents with a pipeline, you can pass a `session_id` to the `analyze` method, which allows you to version the run. The `session_id` will be added to your JSON output. Note that the analyzer will not generate a `session_id` by itself.

The `get_meta_annotation()` method allows you to see which elements are being detected. We already mentioned this method in the [**More on parsing**](./Analyzer_More_On_Parsing.ipynb) notebook. It exists both at the level of individual pipeline components and at the level of the entire pipeline.
At the pipeline level, all `get_meta_annotation()` outputs from the individual components are aggregated.

In [5]:
component.dp_manager.model_id, component.dp_manager.service_id, component.dp_manager.session_id

('af516519', '5497d92c', None)

In [6]:
component.get_meta_annotation()

MetaAnnotation(image_annotations=(<DefaultType.DEFAULT_TYPE>, <LayoutType.CAPTION>, <LayoutType.TEXT>, <LayoutType.TITLE>, <LayoutType.FOOTNOTE>, <LayoutType.FORMULA>, <LayoutType.LIST_ITEM>, <LayoutType.PAGE_FOOTER>, <LayoutType.PAGE_HEADER>, <LayoutType.FIGURE>, <LayoutType.SECTION_HEADER>, <LayoutType.TABLE>), sub_categories={}, relationships={}, summaries=())

## Pipeline operations and parsing results

Let’s now take a look at the parsed results from a technical perspective.


In [7]:
pdf_path = Path.cwd() / "sample/2312.13560.pdf"

df = analyzer.analyze(path=pdf_path, session_id="9999z99z", max_datapoints=3) # max_datapoints limits the number of samples to at most 3
df.reset_state()
all_results = [dp for dp in df]

[0525 21:16.39 @doctectionpipe.py:103]  INF  Processing 2312.13560_0.pdf
Unused or unrecognized kwargs: masks_path, annotations.
[0525 21:16.42 @context.py:133]  INF  ImageLayoutService total: 2.3687 sec.
[0525 21:16.42 @context.py:133]  INF  AnnotationNmsService total: 0.0027 sec.
[0525 21:16.42 @context.py:133]  INF  SubImageLayoutService total: 0.0 sec.
[0525 21:16.42 @context.py:133]  INF  PubtablesSegmentationService total: 0.0 sec.
[0525 21:16.42 @context.py:133]  INF  ImageLayoutService total: 0.6443 sec.
[0525 21:16.43 @context.py:133]  INF  TextExtractionService total: 1.0304 sec.
[0525 21:16.43 @context.py:133]  INF  MatchingService total: 0.0065 sec.
[0525 21:16.43 @context.py:133]  INF  TextOrderService total: 0.0319 sec.
[0525 21

If we select any layout section or word, the object itself tells us which component generated it.


In [8]:
page_2 = all_results[1]

sample_layout_section = page_2.layouts[0] # get the first layout section
print(f"service_id: {sample_layout_section.service_id}, model_id: {sample_layout_section.model_id}, session_id: {sample_layout_section.session_id}")


service_id: 5497d92c, model_id: af516519, session_id: 9999z99z


In [9]:
sample_word = sample_layout_section.get_ordered_words()[2]
print(f"layout section text: {sample_layout_section.text} \n \n word text: {sample_word.characters}  service_id: {sample_word.service_id}, model_id: {sample_word.model_id}, session_id: {sample_word.session_id}")

layout section text: When processing audio at the frame level, an immense vol- ume of entries is generated, where a considerable portion of the frames are assigned to the <blank * symbol due to the characteristic peak behavior of CTC. We propose skip-blank strategy to prune the datastore and accelerat KNN retrieval. During datastore construction, this strateg omits frames whose CTC pseudo labels correspond to th <blank " symbol, thereby reducing the size of the data store. This process is indicated by the blue dashed lines 11 
 
 word text: audio  service_id: 01a15bff, model_id: 65b358ea, session_id: 9999z99z


A pipeline does not only generate layout sections; it also determines other elements, such as reading order or relationships between layout sections. The associated `service_id` must be read from the container that stores the reading order information.

You can find more information about the data structure in the [**Data structure notebook**](deprecated/Data_structure.ipynb).


In [10]:
reading_order = sample_layout_section.get_sub_category("reading_order")
print(type(reading_order))

<class 'deepdoctection.datapoint.annotation.CategoryAnnotation'>


In [11]:
print(f"position: {reading_order.category_id} \n \n service_id: {reading_order.service_id}  model_id: {reading_order.model_id}, session_id: {reading_order.session_id}")

position: 25 
 
 service_id: f10aa678  model_id: None, session_id: 9999z99z


We can get an overview of all `service_id`s and the `annotation_id`s they generated.

In [12]:
service_id_to_annotation_id = page_2.get_service_id_to_annotation_id()
service_id_to_annotation_id.keys(), service_id_to_annotation_id["01a15bff"][:10]

(dict_keys(['5497d92c', 'f10aa678', '01a15bff', '1cedc14d', 'd6219eba']),
 ['a0ef728f-57c7-304e-9d98-903a492b6dea',
  '9340f60a-8088-3b7c-bf5b-f41e472e61b1',
  '14bdcabc-49a6-37e7-8890-4c03f6b6bfed',
  '2c119bf0-d344-3c6c-9ec2-60f7f07c3092',
  '48e5f8f8-90d4-3d2d-bbc1-a8793961d3e1',
  '811addaf-6388-345c-97b4-0def4114822f',
  '589cbd78-dfbc-3af6-9de7-0c2a9ee31956',
  '9b4474c8-7c14-3b97-8533-3998aa832ed3',
  '29440674-f9d3-3611-942b-1f9370ac213a',
  '1b44b477-8c20-3fd2-8f14-accc04abe601'])

Conversely, for a given `annotation_id`, we can use the `AnnotationMap` to locate the position where the object with that specific `annotation_id` can be found.
In the case below, the object with `annotation_id="966e6cc7-8b2c-38c5-9416-cfe114af1cc1"` is located within the layout section with `annotation_id="6ac8cd0b-8425-392c-ae8a-76c1f198ecf0"` and represents the `sub_category = <Relationships.READING_ORDER>`.


In [13]:
annotation_id_to_annotation_maps = page_2.get_annotation_id_to_annotation_maps()
annotation_id_to_annotation_maps["966e6cc7-8b2c-38c5-9416-cfe114af1cc1"]

[AnnotationMap(image_annotation_id='6ac8cd0b-8425-392c-ae8a-76c1f198ecf0', sub_category_key=<Relationships.READING_ORDER>, relationship_key=None, summary_key=None)]

We can retrieve the object in two steps:

In [14]:
image_annotation = page_2.get_annotation(annotation_ids="6ac8cd0b-8425-392c-ae8a-76c1f198ecf0")[0] # get the layout section
reading_order_ann = image_annotation.get_sub_category("reading_order") # get the reading order annotation
reading_order_ann.annotation_id # reading_order_ann is the object we were looking for

'966e6cc7-8b2c-38c5-9416-cfe114af1cc1'
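The reverse lookup can be pictured as an index from every `annotation_id` to the place where the object lives. The following toy model illustrates the idea with the ids from the example above; `ToyAnnotationMap` and `build_reverse_index` are hypothetical names and a simplification for illustration, not the library's implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ToyAnnotationMap:
    """Simplified location record: which layout section, and under which key."""
    image_annotation_id: str
    sub_category_key: Optional[str] = None


def build_reverse_index(layouts: Dict[str, Dict[str, str]]) -> Dict[str, List[ToyAnnotationMap]]:
    """Build annotation_id -> locations.

    `layouts` maps a layout section's annotation_id to its sub categories,
    given as {sub_category_key: annotation_id of the sub category}.
    """
    index: Dict[str, List[ToyAnnotationMap]] = {}
    for layout_id, sub_categories in layouts.items():
        # The layout section itself is found at the top level.
        index.setdefault(layout_id, []).append(ToyAnnotationMap(layout_id))
        # Each sub category is found inside its layout section under its key.
        for key, sub_id in sub_categories.items():
            index.setdefault(sub_id, []).append(ToyAnnotationMap(layout_id, key))
    return index


layouts = {
    "6ac8cd0b-8425-392c-ae8a-76c1f198ecf0": {
        "reading_order": "966e6cc7-8b2c-38c5-9416-cfe114af1cc1",
    }
}
index = build_reverse_index(layouts)
print(index["966e6cc7-8b2c-38c5-9416-cfe114af1cc1"])
```

With such an index, the two-step retrieval above becomes clear: first fetch the layout section named by `image_annotation_id`, then read the sub category stored under `sub_category_key`.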

## Undoing a pipeline component operation

Not only the creation of objects but also the revision of a parsing structure can be important. In particular, the output of a component can be reverted using `undo`. In this example, we remove all word positions that were identified by the docTR text detection model.

In [15]:
text_detection_component = analyzer.get_pipeline_component(service_id="01a15bff")

In [16]:
df = dd.DataFromList([dp.image_orig for dp in all_results]) # Check the notebook Data_structure  in order to understand why we use dp.image_orig
df = text_detection_component.undo(df)
df.reset_state()

all_results_modified = [dp for dp in df]

[0525 21:40.21 @context.py:133]  INF  ImageLayoutService total: 0.1084 sec.
[0525 21:40.21 @context.py:133]  INF  ImageLayoutService total: 0.1151 sec.
[0525 21:40.21 @context.py:133]  INF  ImageLayoutService total: 0.1776 sec.


Note that `undo` affects not only the objects generated by the component itself, but also those from all pipeline components that build upon its results. In this case, only the objects related to layout segments and text ordering remain. The fact that text ordering objects persist might be surprising, but the text order service orders not only the words but also the layout segments themselves. The reading order determined for these segments is therefore retained.


In [17]:
page_2_modified = all_results_modified[1]
service_id_to_annotation_id_modified = page_2_modified.get_service_id_to_annotation_id()
service_id_to_annotation_id_modified.keys(), service_id_to_annotation_id_modified.get("01a15bff")

(dict_keys(['5497d92c', 'f10aa678']), None)
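Comparing the keys before and after `undo` shows which services lost their objects. The sketch below uses only values copied from the outputs in this notebook; nothing in it calls the library.

```python
# Copied from the output of analyzer.get_pipeline_info().
pipeline_info = {
    "5497d92c": "image_Transformers_Tatr_deformable-detr-DocLayNet_model.safetensors",
    "3b56c997": "nms",
    "8c23055e": "sub_image_Transformers_Tatr_tatr_tab_struct_v2_pytorch_model.bin",
    "03844ddb": "table_transformer_segment",
    "01a15bff": "image_doctr_db_resnet50pt_db_resnet50-ac60cadc.pt",
    "1cedc14d": "text_extract_doctr_crnn_vgg16_bnpt_crnn_vgg16_bn-9762b0b0.pt",
    "d6219eba": "matching",
    "f10aa678": "text_order",
}

# Keys of get_service_id_to_annotation_id() before and after undo (page 2).
before = ["5497d92c", "f10aa678", "01a15bff", "1cedc14d", "d6219eba"]
after = ["5497d92c", "f10aa678"]

# Services that had objects on the page before undo but have none afterwards.
removed = [service_id for service_id in before if service_id not in after]

for service_id in removed:
    print(service_id, pipeline_info[service_id])
```

Besides the text detection component `01a15bff` itself, the objects of the text extraction and matching services are gone as well, since both build on the detected words.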

## Analyzer Factory


**How is an Analyzer constructed and where does the configuration come from?**

The configuration is provided by a default instance called `cfg`, which can be modified by defrosting it.

In [2]:
from deepdoctection.analyzer import cfg, ServiceFactory

cfg.USE_OCR

True

In [7]:
cfg.freeze(False)
cfg.USE_OCR=False  # After defrosting we can change values and add new attributes
cfg.freeze(True)
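To make sure the configuration is always frozen again, even if an error occurs in between, the defrost and freeze calls can be wrapped in a context manager. This is a minimal sketch: `unfrozen` is a hypothetical helper that relies only on the `freeze(False)` / `freeze(True)` calls shown above, and `ToyConfig` is a stand-in used here so that the example runs without **deep**doctection installed.

```python
from contextlib import contextmanager


@contextmanager
def unfrozen(config):
    """Temporarily defrost `config`; always re-freeze, even if the body raises."""
    config.freeze(False)
    try:
        yield config
    finally:
        config.freeze(True)


class ToyConfig:
    """Hypothetical stand-in that mimics only the freeze behaviour used above."""

    def __init__(self, **values):
        object.__setattr__(self, "_frozen", False)
        for key, value in values.items():
            setattr(self, key, value)
        self.freeze(True)

    def freeze(self, frozen=True):
        # Bypass our own __setattr__ so the flag can be changed while frozen.
        object.__setattr__(self, "_frozen", frozen)

    def __setattr__(self, key, value):
        if self._frozen:
            raise AttributeError(f"config is frozen, cannot set {key}")
        object.__setattr__(self, key, value)


toy_cfg = ToyConfig(USE_OCR=True)

with unfrozen(toy_cfg) as c:
    c.USE_OCR = False  # allowed only inside the with-block

print(toy_cfg.USE_OCR)
```

Under the assumption that `cfg` behaves as shown in the cell above, the same helper could be used as `with unfrozen(cfg): cfg.USE_OCR = False`.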


In [8]:
cfg

{'DEVICE': None,
 'LANGUAGE': None,
 'LAYOUT_LINK': {'CHILD_CATEGORIES': [<LayoutType.CAPTION>],
                 'PARENTAL_CATEGORIES': [<LayoutType.FIGURE>, <LayoutType.TABLE>]},
 'LAYOUT_NMS_PAIRS': {'COMBINATIONS': [[<LayoutType.TABLE>, <LayoutType.TITLE>],
                                       [<LayoutType.TABLE>, <LayoutType.TEXT>],
                                       [<LayoutType.TABLE>, <LayoutType.KEY_VALUE_AREA>],
                                       [<LayoutType.TABLE>, <LayoutType.LIST_ITEM>],
                                       [<LayoutType.TABLE>, <LayoutType.LIST>],
                                       [<LayoutType.TABLE>, <LayoutType.FIGURE>],
                                       [<LayoutType.TITLE>, <LayoutType.TEXT>],
                                       [<LayoutType.TEXT>, <LayoutType.KEY_VALUE_AREA>],
                                       [<LayoutType.TEXT>, <LayoutType.LIST_ITEM>],
                                       [<LayoutType.TEXT>, <LayoutTy

For constructing predictors (layout, table segmentation, OCR, etc.), pipeline components, and the default pipeline, a `ServiceFactory` offers a variety of methods.
We will not cover all the methods provided by the factory here, but rather give just one example. For a complete overview, please refer to the documentation.

In [5]:
# A very simple example of a pipeline: a rotation detector determines the rotation angle
# of a page (in multiples of 90 degrees), and the transform service rotates each page
# so that the text is in its correct, readable orientation.

rotation_detector = ServiceFactory.build_rotation_detector()
transform_service = ServiceFactory.build_transform_service(transform_predictor=rotation_detector)
pipeline = dd.DoctectionPipe(pipeline_component_list=[transform_service])

<deepdoctection.extern.tessocr.TesseractRotationTransformer at 0x1d11af7c0>