![title](./pics/dd_logo.png) 

# Analyzer Configuration

**deep**doctection's analyzer comes with an extensive configuration that allows you to:

- add and remove processing steps
- swap models for layout analysis or OCR
- adjust configuration settings for rule-based processes
- configure the output structure

In this notebook, we take a closer look at the configuration and cover its most important features.

We assume familiarity with the [Get started notebook](./Analyzer_Get_Started.ipynb). If you want to understand the architecture of a pipeline, we recommend having a look at the [pipeline notebook](./Pipelines.ipynb).

## How to change configuration

There are essentially two ways to adjust the configuration: by modifying the configuration file or by explicitly setting parameters.

### Explicit parameter adjustment

Pass the adjustments in a list:

In [5]:
import deepdoctection as dd

In [None]:
# Each entry has the form "KEY=value". Do not put spaces around "=":
# an entry such as "USE_TABLE_SEGMENTATION = False" may cause parsing errors.
# String values inside a list need quotation marks, e.g. ['title'];
# omitting them may cause parsing errors as well.
config_overwrite = ["USE_TABLE_SEGMENTATION=False",
                    "USE_OCR=False",
                    "TEXT_ORDERING.BROKEN_LINE_TOLERANCE=0.01",
                    "LAYOUT.FILTER=['title']"]

analyzer = dd.get_dd_analyzer(config_overwrite=config_overwrite)
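
The formatting rules above can be checked before the list is handed to the analyzer. The following is a minimal sketch and not **deep**doctection's own parser: `parse_overrides` is a hypothetical helper, and splitting at the first `=` and evaluating the value with `ast.literal_eval` is an assumption about how such entries could be interpreted.

```python
import ast


def parse_overrides(entries):
    """Turn ["A.B=value", ...] into a nested dict (hypothetical helper)."""
    config = {}
    for entry in entries:
        key, sep, raw_value = entry.partition("=")
        # Reject entries without "=" or with spaces around it.
        if not sep or key != key.strip() or raw_value != raw_value.strip():
            raise ValueError(f"expected 'KEY=value' without spaces: {entry!r}")
        try:
            value = ast.literal_eval(raw_value)  # True, 0.01, ['title'], ...
        except (ValueError, SyntaxError):
            value = raw_value  # keep bare strings such as a model path
        node = config
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


overrides = parse_overrides([
    "USE_OCR=False",
    "TEXT_ORDERING.BROKEN_LINE_TOLERANCE=0.01",
    "LAYOUT.FILTER=['title']",
])
print(overrides)
```

An entry such as `"USE_OCR = False"` raises a `ValueError` in this sketch, which makes the "no spaces" rule easy to catch early.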

### Configuration file

**deep**doctection creates a cache directory the first time `dd.get_dd_analyzer()` is called. This cache directory stores all models and configurations that are used at any point in time. The configuration file can be found at `os.environ["DD_ONE_CONFIG"]`. You can adjust the config file and load the new config.

In [None]:
analyzer = dd.get_dd_analyzer(load_default_config_file=True)

You can also specify your own config file path.

In [None]:
analyzer = dd.get_dd_analyzer(path_config_file="path/to/your/config")
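
Before editing the file, it can help to confirm where it actually lives. The sketch below uses only the standard library; `locate_config` is a hypothetical helper that reads the `DD_ONE_CONFIG` variable mentioned above and returns `None` when the variable is not set or the file does not exist.

```python
import os
from pathlib import Path


def locate_config(env=os.environ):
    """Return the config path from DD_ONE_CONFIG, or None if unavailable."""
    raw = env.get("DD_ONE_CONFIG")
    if not raw:
        return None
    path = Path(raw).expanduser()
    return path if path.is_file() else None


print(locate_config())
```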

## High-level configuration

The analyzer consists of various processing steps that can be switched on and off.

```python
# Enables the initial pipeline component using TesseractRotationTransformer to auto-rotate pages
# by 90-degree increments. All subsequent components process the rotated page.
USE_ROTATOR = False

# Enables layout analysis component (second in the pipeline) for either full document layout analysis (DLA)
# or single-object detection. Additional configurations via LAYOUT.* and ENFORCE_WEIGHTS.LAYOUT.
USE_LAYOUT = True

# Enables optional fine-grained Non-Maximum Suppression (NMS) after layout detection.
# Configure via LAYOUT_NMS_PAIRS.* settings.
USE_LAYOUT_NMS = True

# Enables table segmentation (third and later pipeline components).
# Applies row/column detection, optional cell detection, and segmentation services.
# Configure sub-services via ITEM.*, CELL.*, and SEGMENTATION.*
USE_TABLE_SEGMENTATION = True

# Enables optional refinement of table structure to ensure valid HTML generation.
# Should be set to False when using the TableTransformer models: 
# ITEM.WEIGHTS = deepdoctection/tatr_tab_struct_v2/model.safetensors
USE_TABLE_REFINEMENT = False

# Enables text extraction using PDFPlumber. Only works on PDFs with embedded text layers.
# Configure additional behavior using PDF_MINER.*
USE_PDF_MINER = False

# Enables OCR functionality using Tesseract, DocTr, or Textract.
# Also activates MatchingService and TextOrderingService to associate text with layout elements.
# Further configurations via OCR.*, WORD_MATCHING.*, TEXT_CONTAINER, and TEXT_ORDERING.*
USE_OCR = True

# Enables MatchingService to associate nearby layout elements (e.g., figures and captions).
USE_LAYOUT_LINK = False

# Enables line matching in post-processing. Useful when synthetic line elements are created
# (e.g., by grouping orphan text containers). Only applicable if list items were previously grouped.
USE_LINE_MATCHER = False

# Enables a sequence classification pipeline component, e.g. a LayoutLM or a Bert-like model.
USE_LM_SEQUENCE_CLASS = False

# Enables a token classification pipeline component, e.g. a LayoutLM or Bert-like model
USE_LM_TOKEN_CLASS = False
```
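
Because these switches are plain booleans, they can be collected in a dictionary and converted into the `config_overwrite` format. This is a small sketch: `to_config_overwrite` is a hypothetical helper, not part of **deep**doctection, and the chosen combination of switches is only an example.

```python
def to_config_overwrite(switches):
    """Format {"KEY": bool} pairs as "KEY=True" / "KEY=False" strings."""
    entries = []
    for key, enabled in switches.items():
        if not isinstance(enabled, bool):
            raise TypeError(f"{key} must be True or False, got {enabled!r}")
        entries.append(f"{key}={enabled}")  # no spaces around "="
    return entries


# Example: read text from the PDF text layer instead of running OCR,
# and skip table segmentation.
config_overwrite = to_config_overwrite({
    "USE_TABLE_SEGMENTATION": False,
    "USE_OCR": False,
    "USE_PDF_MINER": True,
})
print(config_overwrite)

# Assumes deepdoctection is installed and imported as dd:
# analyzer = dd.get_dd_analyzer(config_overwrite=config_overwrite)
```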

## Rotator models

Two approaches are available: one that uses `Tesseract` and one based on `DocTr`. Set `ROTATOR.MODEL=tesseract` or `ROTATOR.MODEL=doctr`.
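
Put together, enabling the rotator with the `DocTr`-based model could look like the following override list. The value is passed unquoted, as in the setting names above; the analyzer call is left as a comment because it requires an installed **deep**doctection.

```python
# Switch the rotator on and pick the DocTr-based model.
config_overwrite = ["USE_ROTATOR=True", "ROTATOR.MODEL=doctr"]
print(config_overwrite)

# Assumes deepdoctection is installed and imported as dd:
# analyzer = dd.get_dd_analyzer(config_overwrite=config_overwrite)
```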

## Layout models

Once `USE_LAYOUT=True` is set, you can configure the layout pipeline component further. You can choose between `layout/d2_model_0829999_layout_inf_only.pt` and `microsoft/table-transformer-detection/model.safetensors`, or keep the default model `Aryn/deformable-detr-DocLayNet/model.safetensors`. These models are members of a model registry, which is why they are easy to use.

Note that these models have been trained on different datasets and will therefore vary in accuracy depending on your use case.

Use `layout/d2_model_0829999_layout_inf_only.pt` for scientific articles, `microsoft/table-transformer-detection/model.safetensors` if you are only interested in table detection. `Aryn/deformable-detr-DocLayNet/model.safetensors` is more general and can be used for financial reports, patents, manuals or laws and regulation documents.

You can also add custom models, but this requires adding them to the model registry and possibly writing a **deep**doctection wrapper.
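
The recommendations above can be captured in a small lookup. This is a sketch under one explicit assumption: the configuration key `LAYOUT.WEIGHTS` is not named in this notebook and is chosen here by analogy with `ITEM.WEIGHTS`, so check it against your configuration file. `layout_override` and the use-case labels are hypothetical.

```python
# Registered layout models mentioned above, keyed by a hypothetical use-case label.
LAYOUT_WEIGHTS = {
    "scientific": "layout/d2_model_0829999_layout_inf_only.pt",
    "tables_only": "microsoft/table-transformer-detection/model.safetensors",
    "general": "Aryn/deformable-detr-DocLayNet/model.safetensors",
}


def layout_override(use_case="general"):
    """Build a config_overwrite entry for the chosen layout model."""
    # Assumption: the key is LAYOUT.WEIGHTS, by analogy with ITEM.WEIGHTS.
    return [f"LAYOUT.WEIGHTS={LAYOUT_WEIGHTS[use_case]}"]


print(layout_override("scientific"))
```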

## Layout Non-Maximum Suppression

This is relevant if `USE_LAYOUT_NMS=True`.

Layout models often produce overlapping layout sections. These can be removed using Non-Maximum Suppression (NMS). 
Suppose a large and complex table is detected — it's not uncommon for a text block or a title to be mistakenly recognized within the table as well, potentially even with a high confidence score. In such cases, you may still want to retain the table at all costs.

Using the `LAYOUT_NMS_PAIRS` configuration, you can define pairs of layout sections that should be subjected to NMS once a certain overlap threshold is exceeded. Additionally, you can set a priority to specify which category should be favored when overlaps occur.

In `.yaml`-terms, the configuration consists of three parts:

```yaml
LAYOUT_NMS_PAIRS:
  COMBINATIONS:  # Pairs of layout categories to be checked for NMS
    - - table
      - title
  PRIORITY:  # Preferred category when overlap occurs. If set to `None`, NMS uses the confidence score.
    - table
  THRESHOLDS:  # IoU overlap threshold. Pairs with lower IoU will be ignored.
    - 0.001
```

Using Python, the config looks as follows:

```python
config_overwrite=["LAYOUT_NMS_PAIRS.COMBINATIONS=[['table','title']]",
                  "LAYOUT_NMS_PAIRS.PRIORITY=['table']",
                  "LAYOUT_NMS_PAIRS.THRESHOLDS=[0.001]"]
```

This allows fine-grained control over which layout sections should be retained and which should be suppressed during postprocessing.
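
To make the interplay of combination, priority, and threshold concrete, here is a simplified sketch of pairwise suppression in plain Python. It is not **deep**doctection's implementation; `iou`, `pairwise_nms`, and the detection dictionaries are hypothetical and only illustrate the idea.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pairwise_nms(detections, pair, priority, threshold):
    """Suppress one detection of each overlapping pair of categories.

    priority: category to keep, or None to keep the higher score.
    """
    suppressed = set()
    for i, first in enumerate(detections):
        for j in range(i + 1, len(detections)):
            second = detections[j]
            if i in suppressed or j in suppressed:
                continue
            if {first["category"], second["category"]} != set(pair):
                continue
            if iou(first["box"], second["box"]) <= threshold:
                continue  # overlap too small, pair is ignored
            if priority is None:
                loser = i if first["score"] < second["score"] else j
            else:
                loser = j if first["category"] == priority else i
            suppressed.add(loser)
    return [d for k, d in enumerate(detections) if k not in suppressed]


detections = [
    {"category": "table", "box": (0, 0, 100, 100), "score": 0.80},
    {"category": "title", "box": (10, 10, 60, 30), "score": 0.95},
    {"category": "text", "box": (0, 120, 100, 160), "score": 0.90},
]
kept = pairwise_nms(detections, ("table", "title"), "table", 0.001)
print([d["category"] for d in kept])
```

With `priority="table"` the title inside the table is dropped even though it has the higher score. With `priority=None` the score decides and the table would be removed instead.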

## Table segmentation models

To infer rows, columns, and simple and multi-spanning cells of a detected table layout segment, a segmentation step is required. Once `USE_TABLE_SEGMENTATION=True` is set, **deep**doctection provides two approaches that depend on the choice of model.

### [Table transformer](https://github.com/microsoft/table-transformer)

This is the default setting and the most general approach.

```python
config_overwrite=["ITEM.WEIGHTS=deepdoctection/tatr_tab_struct_v2/model.safetensors"]
```

We have observed that the recognition of multi-spanning cells is **less reliable** for non-scientific tables. If multi-spanning cells or headers are not essential, we recommend filtering them out. The result is a table structure consisting only of simple cells, i.e., cells with `row_span=1` and `column_span=1`.

### Filtering Redundant Detections

The Table Transformer may detect redundant table structures and headers. To filter those out, apply the following:

```yaml
ITEM:
   WEIGHTS: microsoft/table-transformer-structure-recognition/pytorch_model.bin
   FILTER:
      - table
      - column_header
      - projected_row_header
      - spanning
```

This ensures that only relevant cell structures are retained, improving clarity and reducing noise in the segmentation output.
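
The same filter can be expressed as a `config_overwrite` list. This sketch assumes that the list syntax shown earlier for `LAYOUT.FILTER` carries over to `ITEM.FILTER`; the value is assembled without spaces to follow the formatting rule for override entries.

```python
filtered = ["table", "column_header", "projected_row_header", "spanning"]

# Build "['a','b',...]" by hand so that the entry contains no spaces.
filter_value = "[" + ",".join(f"'{name}'" for name in filtered) + "]"

config_overwrite = [
    "ITEM.WEIGHTS=microsoft/table-transformer-structure-recognition/pytorch_model.bin",
    f"ITEM.FILTER={filter_value}",
]
print(config_overwrite[1])
```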


### deepdoctection's built-in approach

```python
config_overwrite=["ITEM.WEIGHTS=item/d2_model_1639999_item_inf_only.pt",
                  "CELL.WEIGHTS=cell/d2_model_1849999_cell_inf_only.pt"]
```

These models have been trained on scientific reports and are more specialized. 

`USE_TABLE_REFINEMENT=True` will improve the table structure but only works in conjunction with **deep**doctection's built-in approach.

### Configuration

The segmentation configuration is extensive, and we cannot cover every setting in detail. For a comprehensive description of all parameters, we refer to the source code. We will focus on the parameters that have the most significant impact on the segmentation results.


```python
# Specifies the rule used to assign detected cells to rows and columns.
# Can be either 'iou' (Intersection over Union) or 'ioa' (Intersection over Area).
# In the Table Transformer approach, this also applies to special cell types like spanning or header cells.
SEGMENTATION.ASSIGNMENT_RULE = "ioa"

# Threshold for assigning a (special) cell to a row based on the chosen rule (IOU or IOA).
# The row assignment is based on the highest-overlapping row.
# Multiple overlaps can lead to increased rowspan.
SEGMENTATION.THRESHOLD_ROWS = 0.4

# Threshold for assigning a (special) cell to a column based on the chosen rule (IOU or IOA).
# The column assignment is based on the highest-overlapping column.
SEGMENTATION.THRESHOLD_COLS = 0.4

# Removes overlapping rows based on an IoU threshold.
# Helps to prevent multiple row spans caused by overlapping detections.
# Note: for better alignment, SEGMENTATION.FULL_TABLE_TILING can be enabled.
# Using a low threshold here may result in a very coarse grid.
SEGMENTATION.REMOVE_IOU_THRESHOLD_ROWS = 0.2

# Same as above, but applied to columns.
SEGMENTATION.REMOVE_IOU_THRESHOLD_COLS = 0.2

# Ensures that predicted rows and columns fully cover the table region.
# When enabled, rows will be stretched horizontally and vertically to fit the full region.
# For rows, the first row will be stretched to the top, and the space to the second row is used to estimate the
# bottom edge. This rule applies similarly to columns.
SEGMENTATION.FULL_TABLE_TILING = True

# Defines how row and column boundaries are stretched when tiling is enabled.
# Options:
# - "left": lower edge equals the upper edge of the next row
# - "equal": lower edge is halfway between two adjacent rows
SEGMENTATION.STRETCH_RULE = "equal"

# Defines the threshold values for matching column/row header cells to their respective rows/columns
# in the Table Transformer approach. The matching rule is defined in SEGMENTATION.ASSIGNMENT_RULE.
SEGMENTATION.PUBTABLES_ITEM_HEADER_THRESHOLDS = [0.6, 0.0001]
```
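
The difference between the two assignment rules is easiest to see on a cell that spans two rows. The following is an illustrative sketch, not the library's code: the helper names are hypothetical, and normalizing `ioa` by the area of the cell (rather than the row) is an assumption.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; empty boxes count as 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection(a, b):
    return area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))


def overlap(cell, item, rule):
    inter = intersection(cell, item)
    if rule == "iou":
        denom = area(cell) + area(item) - inter
    elif rule == "ioa":
        denom = area(cell)  # assumption: normalized by the cell's own area
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    return inter / denom if denom > 0 else 0.0


def assign_rows(cell, rows, rule="ioa", threshold=0.4):
    """Indices of all rows whose overlap with the cell exceeds the threshold."""
    return [idx for idx, row in enumerate(rows) if overlap(cell, row, rule) > threshold]


rows = [(0, 0, 100, 20), (0, 20, 100, 40), (0, 40, 100, 60)]
cell = (10, 2, 40, 38)  # narrow cell covering the first two rows

print(assign_rows(cell, rows, rule="ioa"))
print(assign_rows(cell, rows, rule="iou"))
```

Under `ioa` the cell is matched to both rows (half of its area lies in each), which corresponds to a row span of 2. Under `iou` the wide rows dominate the union, the overlap stays below 0.4, and the cell is not assigned at all. This is one reason why `ioa` is a natural choice when cells are much smaller than rows and columns.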

## Text extraction

There are two ways to extract text:

- If the document is a PDF, the text can be extracted from its text layer. This requires native PDF documents where the text can be extracted from the byte encoding.
- Otherwise, the text is extracted by running OCR.

### PDFPlumber

To activate the PDF miner:

```yaml
USE_PDF_MINER: True
```

PDF miners need to group character-like objects into words. This task is rule-based and can be further customized:

```python
# Horizontal tolerance when merging characters into words.
# Characters that are horizontally closer than this value will be grouped into a single word.
PDF_MINER.X_TOLERANCE=3

# Vertical tolerance when grouping characters into lines.
# Characters within this vertical range will be considered part of the same line.
PDF_MINER.Y_TOLERANCE=3
```
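
The effect of the horizontal tolerance can be illustrated with a few characters on a single line. This is a simplified sketch, not PDFPlumber's implementation: `group_chars` is a hypothetical helper, and a new word is started whenever the gap to the previous character exceeds `x_tolerance`.

```python
def group_chars(chars, x_tolerance=3):
    """Merge characters of one line into words.

    chars: list of (text, x0, x1) tuples.
    """
    words, current, last_x1 = [], "", None
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        # A gap larger than the tolerance closes the current word.
        if last_x1 is not None and x0 - last_x1 > x_tolerance:
            words.append(current)
            current = ""
        current += text
        last_x1 = x1
    if current:
        words.append(current)
    return words


chars = [("H", 0, 5), ("i", 5.5, 7), ("y", 12, 17), ("o", 17.2, 22), ("u", 22.1, 27)]
print(group_chars(chars))                   # default tolerance
print(group_chars(chars, x_tolerance=0.3))  # tighter tolerance splits "Hi"
```

A tolerance that is too small breaks words apart, while a tolerance that is too large glues neighbouring words together.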

### OCR

Three OCR engines are available.

```yaml
OCR:
  USE_TESSERACT: False
  USE_DOCTR: True
  USE_TEXTRACT: False
```

**To activate one OCR engine, you must also deactivate the other two.**
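
Since exactly one engine should be active, it is convenient to generate all three switches together. `select_ocr_engine` below is a hypothetical helper that produces `config_overwrite` entries for the keys shown above.

```python
OCR_ENGINES = ("TESSERACT", "DOCTR", "TEXTRACT")


def select_ocr_engine(engine):
    """Return overrides that enable one OCR engine and disable the others."""
    engine = engine.upper()
    if engine not in OCR_ENGINES:
        raise ValueError(f"unknown OCR engine: {engine!r}")
    return [f"OCR.USE_{name}={name == engine}" for name in OCR_ENGINES]


config_overwrite = select_ocr_engine("tesseract")
print(config_overwrite)
```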


#### DocTr

DocTr is a powerful library that provides small but very efficient models. What makes it particularly valuable is that it includes training scripts and allows models to be trained with custom vocabularies. This makes it possible to build OCR models for highly specialized scripts where standard OCR solutions easily fail.

A DocTr OCR pipeline consists of two steps: spatial word detection and character recognition within the region of interest.

For word detection, there is currently one model available:

```python
OCR.WEIGHTS.DOCTR_WORD="doctr/db_resnet50/db_resnet50-ac60cadc.pt"
```

For text recognition, the default model is:

```python
OCR.WEIGHTS.DOCTR_RECOGNITION="doctr/crnn_vgg16_bn/crnn_vgg16_bn-0417f351.pt"
```

but you can also use: 


 * `Felix92/doctr-torch-parseq-multilingual-v1/pytorch_model.bin`
 * `doctr/crnn_vgg16_bn/pt/master-fde31e4a.pt`


#### Tesseract

```yaml
OCR:
  USE_TESSERACT: True
  USE_DOCTR: False
  USE_TEXTRACT: False
```

Besides DocTr, Tesseract is supported. It is arguably the most widely known open-source OCR solution and provides pre-trained models for a large number of languages. However, Tesseract must be installed separately; please refer to the official Tesseract documentation.

Tesseract comes with its own configuration file, which is located alongside other configuration files under `~/.cache/deepdoctection/configs/dd/conf_dd_one.yaml`. 


#### AWS Textract

Textract is the AWS OCR solution that can be accessed via API. It is superior to many open-source solutions. It is a paid service and requires an AWS account. You also need to install `boto3`. Please refer to the official documentation for accessing the service via API.

To use the API, credentials must be provided. You can use the AWS CLI with its built-in secret management or use an `.env` file with

```
AWS_ACCESS_KEY_ID=your-aws-key
AWS_SECRET_ACCESS_KEY=your-secret-key
AWS_REGION=your-region
```
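
A quick check that the three variables are visible to the Python process can save a failed API call. This sketch only inspects environment variables and does not load an `.env` file itself; `missing_aws_settings` is a hypothetical helper. If you rely on the AWS CLI credential store instead, these variables may legitimately be absent.

```python
import os

REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_REGION")


def missing_aws_settings(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


missing = missing_aws_settings()
if missing:
    print("Not set in the environment:", ", ".join(missing))
else:
    print("All AWS settings found in the environment.")
```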





The following two pipeline configurations take effect automatically once you set `USE_OCR=True` or `USE_PDF_MINER=True`.



## Word matching

Word matching merges the results of layout analysis (including table structure) with those of OCR (words and possibly lines). Up to this point, all layout segments and words are independent elements of a page, with no established relationships between them. Word matching creates a link between each word and the appropriate layout segment.

The most effective way to establish this link is by evaluating the spatial overlap between a word and layout sections. It must be specified which layout sections are eligible for such associations — not all segments are suitable. For example, words that are part of a table should not be linked to the table's outer frame, but rather to the individual cell identified during table segmentation.

The following configuration uses a class `IMAGE_DEFAULTS` that contains default values for word matching.
Some OCR engines return text and bounding boxes at word level, while others return text at line level with bounding boxes. `IMAGE_DEFAULTS.TEXT_CONTAINER` is therefore the element type that holds the lowest-level text elements. The default value is `word`.

```python
# Specifies the annotation type used as a text container.
# A text container is typically an ImageAnnotation generated by the OCR engine or PDF mining tool.
# It contains a sub-category of type `characters`.
# Most commonly, text containers are of type `word`, but `line` may also be used.
# It is recommended to align this value with IMAGE_DEFAULTS.TEXT_CONTAINER
# rather than modifying it directly in the config.
TEXT_CONTAINER = IMAGE_DEFAULTS.TEXT_CONTAINER

# Configuration for matching text containers (e.g., words or lines) to layout elements
# such as titles, paragraphs, tables, etc., using spatial overlap.
# When a match occurs, a parent-child relationship (Relationships.CHILD) is assigned.

# Specifies the layout categories considered as potential parents of text containers.
WORD_MATCHING.PARENTAL_CATEGORIES = IMAGE_DEFAULTS.TEXT_BLOCK_CATEGORIES

# Rule used for matching: either 'iou' (intersection over union) or 'ioa' (intersection over area).
WORD_MATCHING.RULE = "ioa"

# Threshold for the selected matching rule (IOU or IOA).
# Text containers must exceed this threshold to be assigned to a layout section.
WORD_MATCHING.THRESHOLD = 0.3

# If a text container overlaps with multiple layout sections,
# setting this to True will assign it only to the best-matching (i.e., highest-overlapping) section.
# Prevents duplication of text in the output.
WORD_MATCHING.MAX_PARENT_ONLY = True
```
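
The effect of `WORD_MATCHING.THRESHOLD` and `WORD_MATCHING.MAX_PARENT_ONLY` can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the `MatchingService` itself: `ioa` is taken as the share of the word's area inside the layout block, and `match_words` is a hypothetical helper.

```python
def ioa(word, block):
    """Share of the word's area that lies inside the block."""
    w = max(0.0, min(word[2], block[2]) - max(word[0], block[0]))
    h = max(0.0, min(word[3], block[3]) - max(word[1], block[1]))
    word_area = (word[2] - word[0]) * (word[3] - word[1])
    return (w * h) / word_area if word_area > 0 else 0.0


def match_words(words, blocks, threshold=0.3, max_parent_only=True):
    """Map each word name to the names of its parent blocks.

    words / blocks: dicts of name -> (x0, y0, x1, y1).
    """
    result = {}
    for word_name, word_box in words.items():
        scored = [(ioa(word_box, box), name) for name, box in blocks.items()]
        parents = [(score, name) for score, name in scored if score > threshold]
        if max_parent_only and parents:
            parents = [max(parents)]  # keep only the best-matching block
        result[word_name] = [name for _, name in parents]
    return result


blocks = {"text": (0, 0, 100, 50), "figure": (80, 0, 200, 50)}
words = {"hello": (70, 10, 100, 20), "orphan": (300, 300, 320, 310)}

print(match_words(words, blocks))
print(match_words(words, blocks, max_parent_only=False))
```

The word `hello` overlaps both blocks above the threshold. With `max_parent_only=True` it is linked to `text` only; otherwise it would appear under both parents and be duplicated in the output. The word `orphan` has no parent in either case.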


## Reading Order

Next, words and layout segments must be arranged to form coherent, continuous text. This task is handled by the `TextOrderService` component.

As mentioned on several occasions, establishing a reading order is a difficult task. It becomes even harder when the document has a complex layout with many different elements.

To obtain narrative text, you first need to decide which layout sections should be part of it.
This depends on the layout sections that are extracted and therefore ultimately on the layout model.

Layout sections of narrative text are given by `IMAGE_DEFAULTS.FLOATING_TEXT_BLOCK_CATEGORIES`. The default is:

```python
IMAGE_DEFAULTS.FLOATING_TEXT_BLOCK_CATEGORIES = ['text', 'title', 'list', 'key_value_area']
```

Besides narrative text, you need to specify which layout sections may contain text at all. These are the sections that take part in word matching, and they are given by `IMAGE_DEFAULTS.TEXT_BLOCK_CATEGORIES`.

```python
IMAGE_DEFAULTS.TEXT_BLOCK_CATEGORIES = ['text', 'title', 'list_item', 'list', 'caption', 'page_header',
                                        'page_footer', 'page_number', 'mark', 'key_value_area',
                                        'figure', 'spanning', 'cell']
```

Using the terminology of `IMAGE_DEFAULTS`:

- `IMAGE_DEFAULTS.TEXT_CONTAINER` elements that have been assigned to one of the `IMAGE_DEFAULTS.TEXT_BLOCK_CATEGORIES` are ordered within their text block.
- `IMAGE_DEFAULTS.FLOATING_TEXT_BLOCK_CATEGORIES`, which should in general be a subset of `IMAGE_DEFAULTS.TEXT_BLOCK_CATEGORIES`, are ordered to generate the narrative text of the page. We do not go into how the ordering works here. For further details on layout parsing and text ordering, please refer to the [**documentation**](https://deepdoctection.readthedocs.io/en/latest/tutorials/layout_parsing_structure).

```python

# Specifies which layout categories must be ordered (e.g., paragraphs, list items).
# These are layout blocks that will be processed by the TextOrderingService.
cfg.TEXT_ORDERING.TEXT_BLOCK_CATEGORIES = IMAGE_DEFAULTS.TEXT_BLOCK_CATEGORIES

# Specifies which text blocks are considered floating (not aligned with strict columns or grids).
# These will be linked with a subcategory of type Relationships.READING_ORDER.
cfg.TEXT_ORDERING.FLOATING_TEXT_BLOCK_CATEGORIES = IMAGE_DEFAULTS.FLOATING_TEXT_BLOCK_CATEGORIES

# In the word matching process it is possible that some words do not overlap with any layout segment.
# If `INCLUDE_RESIDUAL_TEXT_CONTAINER` is set to `False`, these words will not receive a `reading_order` and will be excluded from the text output.
# If set to `True`, orphan words are grouped into `line`s and included in the output, ensuring no text is lost. This setting is often crucial and may
# need to be adjusted depending on your use case.
cfg.TEXT_ORDERING.INCLUDE_RESIDUAL_TEXT_CONTAINER = True

# Tolerance used to determine whether a text block's left/right coordinate lies within a column’s boundary.
# Helps with assigning text blocks to columns based on horizontal alignment.
cfg.TEXT_ORDERING.STARTING_POINT_TOLERANCE = 0.005

# Horizontal distance threshold for grouping words into the same line.
# If the gap between words exceeds this value, they will be treated as belonging to separate lines or columns.
cfg.TEXT_ORDERING.BROKEN_LINE_TOLERANCE = 0.003

# Used for ordering vertically broken floating text blocks into coherent columns.
# Defines vertical alignment tolerance between adjacent text blocks.
cfg.TEXT_ORDERING.HEIGHT_TOLERANCE = 2.0

# Defines the spacing threshold that indicates a paragraph break in vertically arranged text blocks.
# Helps determine reading order in multi-column, broken layouts.
cfg.TEXT_ORDERING.PARAGRAPH_BREAK = 0.035
```
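
The impact of `INCLUDE_RESIDUAL_TEXT_CONTAINER` can be illustrated with a deliberately simplified sketch. It is not the actual text ordering logic: `page_text` is a hypothetical helper, orphan words are grouped into lines by their rounded vertical position, and the resulting lines are simply appended after the ordered blocks, whereas the real component assigns them a position in the reading order.

```python
def page_text(blocks, orphans, include_residual=True):
    """Join ordered text blocks; optionally keep words without a parent.

    blocks:  list of (reading_order, text) for floating text blocks.
    orphans: list of (y_center, x_left, word) for unmatched words.
    """
    parts = [text for _, text in sorted(blocks)]
    if include_residual and orphans:
        lines = {}
        for y, x, word in sorted(orphans):
            # Words with (almost) the same vertical position form one line.
            lines.setdefault(round(y, 2), []).append((x, word))
        parts.extend(" ".join(w for _, w in sorted(line)) for line in lines.values())
    return "\n".join(parts)


blocks = [(2, "Second paragraph."), (1, "Title")]
orphans = [(0.95, 0.60, "3"), (0.95, 0.50, "Page")]

print(page_text(blocks, orphans, include_residual=True))
print("---")
print(page_text(blocks, orphans, include_residual=False))
```

With the setting enabled, the unmatched words `Page 3` survive as a synthetic line; with it disabled, they silently disappear from the text output.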