# Evaluating and Fine Tuning

## Introduction

We show how a model can be evaluated and fine-tuned on a specific dataset.

As an example, we want to improve table extraction in the **deep**doctection analyzer. To make clear what we are trying to address, we first need to say a little more about how table extraction works.


![title](./pics/dd_table.png)


Table extraction is carried out in different stages:

- Table detection
- Cell detection
- Row and column detection
- Segmentation / cell labeling

Tables, cells and rows / columns are recognized with object detectors (Cascade-RCNN with FPN).
The segmentation is rule-based: it is carried out by determining the coverage of cells by rows and columns.
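
To illustrate what a coverage rule of this kind can look like, here is a minimal sketch. It is not the analyzer's actual implementation: the function names, the box format `(x_min, y_min, x_max, y_max)` and the threshold of 0.5 are assumptions made for this example only.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def area(box: Box) -> float:
    """Area of a box; zero if the box is empty or inverted."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def coverage(cell: Box, item: Box) -> float:
    """Fraction of the cell area that lies inside a row or column box."""
    intersection = (
        max(cell[0], item[0]),
        max(cell[1], item[1]),
        min(cell[2], item[2]),
        min(cell[3], item[3]),
    )
    cell_area = area(cell)
    return area(intersection) / cell_area if cell_area > 0 else 0.0


def assign(cell: Box, items: List[Box], threshold: float = 0.5) -> Optional[int]:
    """Return the 1-based index of the best covering item, or None below the threshold."""
    scores = [coverage(cell, item) for item in items]
    if not scores or max(scores) < threshold:
        return None
    return scores.index(max(scores)) + 1


# Two rows and two columns of a toy table, and one detected cell.
rows = [(0, 0, 100, 20), (0, 20, 100, 40)]
columns = [(0, 0, 50, 40), (50, 0, 100, 40)]
cell = (55, 22, 95, 38)

row_number = assign(cell, rows)
column_number = assign(cell, columns)
print(row_number, column_number)  # 2 2
```

With a rule like this, the quality of the final table structure depends directly on how precisely the cell boxes are located, which is why the cell detector is the component we look at here.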

The cell recognition model was trained on the [**PubTabNet**](https://github.com/ibm-aur-nlp/PubTabNet) dataset. PubTabNet contains approx. 500K tables from the field of medical research.

We want to fine-tune the cell recognition on Fintabnet, a dataset that contains pages from business reports. Before doing that, we want to check whether the inference results of the cell recognition model stay similar when we switch the domain. If they do not, this supports the hypothesis that tables from different domains (e.g. medical reports / business reports) have a different intrinsic structure, so that fine-tuning would actually make sense.

In [1]:
import os

from deepdoctection.datasets.instances import pubtabnet as pt
from deepdoctection.utils import get_weights_dir_path, get_configs_dir_path
from deepdoctection.datasets import DatasetRegistry
from deepdoctection.eval import MetricRegistry, Evaluator
from deepdoctection.extern.tpdetect import TPFrcnnDetector
from deepdoctection.pipe.layout import ImageLayoutService
from deepdoctection.extern import ModelCatalog

## Dataset

We will make use of the built-in datasets. A DatasetRegistry gives direct access to them. Note that there is no automated way to download, extract and save the datasets. We will show you how to get the required details.

In [2]:
DatasetRegistry.print_dataset_names()

['fintabnet', 'funsd', 'iiitar13k', 'testlayout', 'publaynet', 'pubtables1m', 'pubtabnet', 'xfund']


In [3]:
pubtabnet = DatasetRegistry.get_dataset("pubtabnet")
pubtabnet.dataset_info.description

"PubTabNet is a large dataset for image-based table recognition, containing 568k+ images of tabular data annotated with the corresponding HTML representation of the tables. The table images are extracted from the scientific publications included in the PubMed Central Open Access Subset (commercial use collection). Table regions are identified by matching the PDF format and the XML format of the articles in the PubMed Central Open Access Subset. More details are available in our paper 'Image-based table recognition: data, model, and evaluation'. Pubtabnet can be used for training cell detection models as well as for semantic table understanding algorithms. For detection it has cell bounding box annotations as well as precisely described table semantics like row - and column numbers and row and col spans. Moreover, every cell can be classified as header or non-header cell. The dataflow builder can also return captions of bounding boxes of rows and columns. Moreover, various filter condit

To install the dataset, go to the URL below and download the archive.

In [4]:
pubtabnet.dataset_info.url

'https://dax-cdn.cdn.appdomain.cloud/dax-pubtabnet/2.0.0/pubtabnet.tar.gz?_ga=2.267291150.146828643.1629125962-1173244232.1625045842'

You will have to extract the archive and place the dataset in your local .cache directory. Once extracted, the dataset ought to be in a format that requires no further rearranging. However, if you are unsure, you can get some additional information about the physical structure by printing the dataset module's docstring.

In [5]:
pubtabnet.dataflow.get_workdir()

'/home/janis/.cache/deepdoctection/datasets/pubtabnet'

In [6]:
print(pt.__doc__)


Module for Pubtabnet dataset. Place the dataset as follows

|    pubtabnet
|    ├── test
|    │ ├── PMC1.png
|    ├── train
|    │ ├── PMC2.png
|    ├── val
|    │ ├── PMC3.png
|    ├── PubTabNet_2.0.0.jsonl
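
Since extraction is a manual step, a small helper based on the standard library can do it. This is only a sketch: `extract_archive` is a hypothetical helper, not part of **deep**doctection, and the paths in the usage comment are placeholders. After extracting, compare the result with the directory tree above, because the top-level folder inside the archive is not guaranteed to match it.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path: str, target_dir: str) -> Path:
    """Extract a .tar.gz archive into target_dir and return the target path."""
    archive = Path(archive_path).expanduser()
    target = Path(target_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, mode="r:gz") as tar:
        tar.extractall(path=target)
    return target


# Example call with placeholder paths (adjust to your download location):
# extract_archive("~/Downloads/pubtabnet.tar.gz", "~/.cache/deepdoctection/datasets")
```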



Let's display a tiny fraction of the annotations that are available for each datapoint. `df_dict["annotations"][0]` displays all the information that is available for one cell, i.e. sub categories like row and column number, header information and bounding boxes.

In [7]:
df = pubtabnet.dataflow.build(split="train")
df.reset_state()
df_iter = iter(df)
df_dict = next(df_iter).as_dict()
df_dict["file_name"],df_dict["location"],df_dict["image_id"], df_dict["annotations"][0]

('PMC4840965_004_00.png',
 '/home/janis/.cache/deepdoctection/datasets/pubtabnet/train/PMC4840965_004_00.png',
 'c87ee674-4ddc-3efe-a74e-dfe25da5d7b3',
 {'active': True,
  'annotation_id': '84cbfafb-c878-323a-afcf-6159206f2e49',
  'category_name': 'CELL',
  'category_id': '1',
  'score': None,
  'sub_categories': {'ROW_NUMBER': {'active': True,
    'annotation_id': '37cd395e-a09d-3f73-b7e5-98c0d284c75f',
    'category_name': 'ROW_NUMBER',
    'category_id': '28',
    'score': None,
    'sub_categories': {},
    'relationships': {}},
   'COLUMN_NUMBER': {'active': True,
    'annotation_id': '626c0980-5a45-3223-b7c8-39bc3648722c',
    'category_name': 'COLUMN_NUMBER',
    'category_id': '3',
    'score': None,
    'sub_categories': {},
    'relationships': {}},
   'ROW_SPAN': {'active': True,
    'annotation_id': '02458dd5-e774-3cf6-a299-5546d9c63880',
    'category_name': 'ROW_SPAN',
    'category_id': '1',
    'score': None,
    'sub_categories': {},
    'relationships': {}},
   'COLUM
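
The nested output is hard to scan. As a rough sketch, the sub categories of one cell annotation can be condensed into a flat dict. The helper `sub_category_ids` is hypothetical, and the sample dict is abridged from the output above; with a real datapoint, `df_dict["annotations"][0]` could be passed instead.

```python
from typing import Any, Dict


def sub_category_ids(annotation: Dict[str, Any]) -> Dict[str, str]:
    """Map each sub category name of an annotation dict to its category_id."""
    sub_categories = annotation.get("sub_categories", {})
    return {name: sub["category_id"] for name, sub in sub_categories.items()}


# Abridged copy of the annotation printed above (only the keys used here).
sample_annotation = {
    "category_name": "CELL",
    "category_id": "1",
    "sub_categories": {
        "ROW_NUMBER": {"category_name": "ROW_NUMBER", "category_id": "28"},
        "COLUMN_NUMBER": {"category_name": "COLUMN_NUMBER", "category_id": "3"},
        "ROW_SPAN": {"category_name": "ROW_SPAN", "category_id": "1"},
    },
}

summary = sub_category_ids(sample_annotation)
print(summary)  # {'ROW_NUMBER': '28', 'COLUMN_NUMBER': '3', 'ROW_SPAN': '1'}
```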

## Models and weights

All pre-trained models are cataloged in the ModelCatalog. You can get a list of all pre-trained models. For a specific model, its profile gives more information about the model type and the Huggingface repo.

To instantiate a predictor we need to pass the configs and the weights. Which weights to use depends on the DL framework you are currently using; we assume this to be Tensorflow. Hence we use `cell/model-1800000.data-00000-of-00001`. If your framework is PyTorch, however, you must choose the Detectron2 weights, which the catalog below lists as `cell/d2_model-1800000-cell.pkl`.
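
The choice of weights can also be made programmatically. The following is a sketch under assumptions: `detect_framework` and `select_cell_weights` are hypothetical helpers written for this example, and the weight names are taken from the `ModelCatalog.get_weights_names()` listing in this section.

```python
import importlib.util
from typing import Optional

# Weight names as they appear in the catalog listing in this section.
CELL_WEIGHTS = {
    "tensorflow": "cell/model-1800000.data-00000-of-00001",
    "pytorch": "cell/d2_model-1800000-cell.pkl",
}


def detect_framework() -> Optional[str]:
    """Return the first installed framework, preferring Tensorflow."""
    if importlib.util.find_spec("tensorflow") is not None:
        return "tensorflow"
    if importlib.util.find_spec("torch") is not None:
        return "pytorch"
    return None


def select_cell_weights(framework: str) -> str:
    """Map a framework name to the matching cell detector weights."""
    try:
        return CELL_WEIGHTS[framework]
    except KeyError:
        raise ValueError(f"unsupported framework: {framework!r}") from None


print(select_cell_weights("tensorflow"))  # cell/model-1800000.data-00000-of-00001
```

Keep in mind that the rest of this notebook uses the Tensorpack predictor, so the PyTorch weights would require a different detector class.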

We expect the model to be available locally already. If you haven't downloaded anything yet, you can do this using the ModelDownloadManager:

`ModelDownloadManager.maybe_download_weights_and_configs("cell/model-1800000.data-00000-of-00001")`

We then specify the local path to the config file and the weights.

In [8]:
ModelCatalog.get_weights_names()

['layout/model-800000_inf_only.data-00000-of-00001',
 'cell/model-1800000_inf_only.data-00000-of-00001',
 'item/model-1620000_inf_only.data-00000-of-00001',
 'item/model-1620000.data-00000-of-00001',
 'layout/model-800000.data-00000-of-00001',
 'cell/model-1800000.data-00000-of-00001',
 'layout/d2_model-800000-layout.pkl',
 'cell/d2_model-1800000-cell.pkl',
 'item/d2_model-1620000-item.pkl']

In [9]:
profile = ModelCatalog.get_profile("cell/model-1800000.data-00000-of-00001")
profile

{'config': 'dd/tp/conf_frcnn_cell',
 'size': [823509160, 25905],
 'hf_repo_id': 'deepdoctection/tp_casc_rcnn_X_32xd4_50_FPN_GN_2FC_pubtabnet_c',
 'hf_model_name': 'model-1800000',
 'hf_config_file': ['conf_frcnn_cell.yaml'],
 'tp_model': True,
 'urls': []}

In [10]:
path_config_yaml = os.path.join(get_configs_dir_path(), profile["config"] + ".yaml")
path_weights = os.path.join(get_weights_dir_path(), "cell/model-1800000.data-00000-of-00001")

## Evaluation

An evaluator needs a dataset on which to run the evaluation, as well as a predictor and a metric. The predictor must be wrapped into a pipeline component, which is why we use the ImageLayoutService.

We take the COCO metric for the problem, but define settings that deviate from the standard. We have to consider the following issues, which differ from ordinary object detection tasks:

- The objects to be identified are generally smaller.
- There are many objects to identify.

Therefore, we change the maximum number of detections to consider when calculating the mean average precision and also choose different area ranges for dividing the cells into the categories small, medium and large.

We then set up the predictor, the pipeline component and the evaluator.

In [11]:
coco_metric = MetricRegistry.get_metric("coco")
coco_metric.set_params(max_detections=[50,200,600], area_range=[[0,1000000],[0,200],[200,800],[800,1000000]])
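
To get a feeling for these area ranges, the sketch below assigns a box to the size categories. It assumes the usual COCO conventions: the ranges are ordered all, small, medium, large, areas are measured in square pixels, and both bounds are inclusive. For comparison, the standard COCO thresholds are 32² = 1024 and 96² = 9216 square pixels, which would put almost every table cell into the small category. The helper `area_buckets` is hypothetical.

```python
from typing import List

# Same values as passed to set_params above; the order [all, small, medium, large] is assumed.
AREA_RANGE = [[0, 1000000], [0, 200], [200, 800], [800, 1000000]]
LABELS = ["all", "small", "medium", "large"]


def area_buckets(width: float, height: float) -> List[str]:
    """Return the labels of all area ranges that contain the box area."""
    box_area = width * height
    return [
        label
        for label, (low, high) in zip(LABELS, AREA_RANGE)
        if low <= box_area <= high
    ]


print(area_buckets(20, 8))   # ['all', 'small']   (area 160)
print(area_buckets(30, 20))  # ['all', 'medium']  (area 600)
print(area_buckets(60, 20))  # ['all', 'large']   (area 1200)
```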

In [12]:
categories = pubtabnet.dataflow.categories.get_categories(filtered=True)
cell_detector = TPFrcnnDetector(path_config_yaml, path_weights, categories)

layout_service = ImageLayoutService(cell_detector)

[0309 17:52:12 @varmanip.py:214] Checkpoint path /home/janis/.cache/deepdoctection/weights/cell/model-1800000.data-00000-of-00001 is auto-corrected to /home/janis/.cache/deepdoctection/weights/cell/model-1800000.
[0309 17:52:12 @registry.py:80] 'conv0' input: [1, 3, ?, ?]
[0309 17:52:12 @registry.py:90]   'conv0/gn': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0309 17:52:12 @registry.py:93] 'conv0' output: [1, 64, ?, ?]
[0309 17:52:12 @registry.py:90] 'pool0': [1, 64, ?, ?] --> [1, 64, ?, ?]
[0309 17:52:12 @registry.py:80] 'group0/block0/conv1' input: [1, 64, ?, ?]
[0309 17:52:12 @registry.py:90]   'group0/block0/conv1/gn': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0309 17:52:12 @registry.py:93] 'group0/block0/conv1' output: [1, 128, ?, ?]
[0309 17:52:12 @registry.py:80] 'group0/block0/conv2' input: [1, 128, ?, ?]
[0309 17:52:12 @registry.py:90]   'group0/block0/conv2/gn': [1, 128, ?, ?] --> [1, 128, ?, ?]
[03

[0309 17:52:13 @registry.py:80] 'group0/block1/conv1' input: [1, 256, ?, ?]
[0309 17:52:13 @registry.py:90]   'group0/block1/conv1/gn': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0309 17:52:13 @registry.py:93] 'group0/block1/conv1' output: [1, 128, ?, ?]
[0309 17:52:13 @registry.py:80] 'group0/block1/conv2' input: [1, 128, ?, ?]
[0309 17:52:13 @registry.py:90]   'group0/block1/conv2/gn': [1, 128, ?, ?] --> [1, 128, ?, ?]
[0309 17:52:13 @registry.py:93] 'group0/block1/conv2' output: [1, 128, ?, ?]
[0309 17:52:13 @registry.py:80] 'group0/block1/conv3' input: [1, 128, ?, ?]
[0309 17:52:13 @registry.py:90]   'group0/block1/conv3/gn': [1, 256, ?, ?] --> [1, 256, ?, ?]
[0309 17:52:13 @registry.py:93] 'group0/block1/conv3' output: [1, 256, ?, ?]
[0309 17:52:13 @registry.py:80] 'group0/block2/conv1' input: [1, 256, ?, ?]
[0309 17:52:13 @registry.py:90]   'group0/block2/conv1/gn': [1, 128, ?, ?] --> [1, 1

We start the evaluation with the `run` method. `max_datapoints` limits the number of samples in the evaluation to 100. The val split is used by default. If this split is not available, another one must be given as an argument along with other possible build configurations.

In [13]:
evaluator = Evaluator(pubtabnet, layout_service, coco_metric)
output = evaluator.run(category_names=["CELL"], max_datapoints=100)

[0309 17:52.17 @eval.py:67] INF Building multi threading pipeline component to increase prediction throughput. Using 2 threads
[0309 17:52:17 @varmanip.py:214] Checkpoint path /home/janis/.cache/deepdoctection/weights/cell/model-1800000.data-00000-of-00001 is auto-corrected to /home/janis/.cache/deepdoctection/weights/cell/model-1800000.
[0309 17:52:19 @sessinit.py:86] WRN The following variables are in the checkpoint, but not found in the graph: global_step, learning_rate
[0309 17:52:20 @sessinit.py:114] Restoring checkpoint from /home/janis/.cache/deepdoctection/weights/cell/model-1800000 ...
INFO:tensorflow:Restoring parameters from /home/janis/.cache/deepdoctection/weights/cell/model-1800000
[0309 17:52.20 @logger.py:193] INF Loading annotations for 'val' split from Pubtabnet will take some time...
[0309 17:53.02 @logger.py:193] INF dp: 549232 is malformed, err: IndexError,
            msg

100%|██████████| 99/99 [00:11<00:00,  8.86it/s]

[0309 17:53.14 @eval.py:121] INF Starting evaluation...





creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=8.14s).
Accumulating evaluation results...
DONE (t=0.10s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=600 ] = 0.950
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=600 ] = 0.938
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=600 ] = 0.802
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=600 ] = 0.845
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=600 ] = 0.828
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 50 ] = 0.532
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=200 ] = 0.850
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=600 ] = 0.859
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=600 ] = 0.838
 A

As mentioned, we are now going to evaluate the cell predictor on tables from business documents. One difference from the previous evaluation is the representation of the dataset. Unlike Pubtabnet, where tables are already cropped from their surrounding document, the images of Fintabnet are complete document pages with embedded tables. In order to get tables only, we can change the build mode, which is a specific implementation for some datasets. In this case we set `build_mode = "table"`. Under the hood, this crops the table from the image and adjusts the bounding boxes to the sub image, so that the dataset's dataflow looks like that of the Pubtabnet dataset. Those looking closer at the configuration will also notice a second parameter `load_image=True`. We will not go into the details of this setting and only note that an AssertionError is raised otherwise when using this `build_mode`.
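
The coordinate adjustment that comes with cropping can be sketched as follows. This is an illustration of the idea only, not the dataset's actual code: the function name, the box format `(x_min, y_min, x_max, y_max)` and the clipping to the table borders are assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def to_table_coordinates(table: Box, cells: List[Box]) -> List[Box]:
    """Shift cell boxes from page coordinates into the cropped table image."""
    tx0, ty0, tx1, ty1 = table
    width, height = tx1 - tx0, ty1 - ty0
    shifted = []
    for x0, y0, x1, y1 in cells:
        shifted.append(
            (
                min(max(x0 - tx0, 0), width),   # clip to the left/right table border
                min(max(y0 - ty0, 0), height),  # clip to the top/bottom table border
                min(max(x1 - tx0, 0), width),
                min(max(y1 - ty0, 0), height),
            )
        )
    return shifted


# A table located at (100, 200) on the page with one cell inside it.
table_box = (100, 200, 500, 400)
page_cells = [(120, 210, 220, 230)]
print(to_table_coordinates(table_box, page_cells))  # [(20, 10, 120, 30)]
```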

We only need to re-instantiate the evaluator.

In [15]:
fintabnet = DatasetRegistry.get_dataset("fintabnet")
fintabnet.dataflow.categories.filter_categories(categories="CELL")

evaluator = Evaluator(fintabnet, layout_service, coco_metric)
output = evaluator.run(category_names=["CELL"], max_datapoints=100, build_mode="table", load_image=True, use_multi_proc=False)

[0309 17:54.22 @eval.py:67] INF Building multi threading pipeline component to increase prediction throughput. Using 2 threads
[0309 17:54:22 @varmanip.py:214] Checkpoint path /home/janis/.cache/deepdoctection/weights/cell/model-1800000.data-00000-of-00001 is auto-corrected to /home/janis/.cache/deepdoctection/weights/cell/model-1800000.

[0309 17:54:25 @sessinit.py:86] WRN The following variables are in the checkpoint, but not found in the graph: global_step, learning_rate
[0309 17:54:25 @sessinit.py:114] Restoring checkpoint from /home/janis/.cache/deepdoctection/weights/cell/model-1800000 ...
INFO:tensorflow:Restoring parameters from /home/janis/.cache/deepdoctection/weights/cell/model-1800000
[0309 17:54.47 @eval.py:116] INF Predicting objects...


100%|██████████| 100/100 [00:07<00:00, 13.32it/s]

[0309 17:54.55 @eval.py:121] INF Starting evaluation...





creating index...
index created!
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=1.70s).
Accumulating evaluation results...
DONE (t=0.06s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=600 ] = 0.902
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=600 ] = 0.701
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=600 ] = 0.555
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=600 ] = 0.559
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=600 ] = 0.690
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 50 ] = 0.587
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=200 ] = 0.648
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=600 ] = 0.648
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=600 ] = 0.631
 A

We observe a certain precision decrease for cell detection, especially at higher IoU thresholds. Note that the AP at IoU=0.75 decreases from 0.938 to 0.701!
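
The sensitivity to the IoU threshold is easy to reproduce with a toy example. The sketch below computes the intersection over union of two boxes in `(x_min, y_min, x_max, y_max)` format; the box values are made up for illustration. A prediction that is off by only a few pixels on a flat, cell-sized box still counts as a hit at IoU=0.50 but as a miss at IoU=0.75.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 100, 20)   # a flat cell, 100 x 20 pixels
prediction = (10, 4, 110, 24)    # same size, shifted by 10 px right and 4 px down

score = iou(ground_truth, prediction)
print(score)  # 0.5625: a match at IoU=0.50, but not at IoU=0.75
```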

## Training Tensorpack Predictor

The following steps only work for Tensorpack models and not for Detectron2 models. We currently do not provide built-in training scripts for Detectron2. Also note that for training/fine-tuning an already pre-trained model we must not use the inference-only weights, as these do not include important checkpoint information for resuming training.

For training, we use a script that stems from the training of the Faster-RCNN model from Tensorpack. We use the same model as above.

We recommend restarting the kernel if you have worked through this notebook from the beginning, and therefore re-importing all necessary modules.

In [3]:
import os
from deepdoctection.utils import get_weights_dir_path, get_configs_dir_path
from deepdoctection.datasets import DatasetRegistry
from deepdoctection.eval import MetricRegistry
from deepdoctection.extern import ModelCatalog
from deepdoctection.train import train_faster_rcnn

Fintabnet has a train, val and test split, from which we use the first two. For each split, we need to define the dataflow build configuration. Even though it is not necessary, as it is already set by default within the training script, we explicitly pass the split.

In [4]:
profile = ModelCatalog.get_profile("cell/model-1800000.data-00000-of-00001")
path_config_yaml = os.path.join(get_configs_dir_path(), profile["config"] + ".yaml")
path_weights = os.path.join(get_weights_dir_path(), "cell/model-1800000.data-00000-of-00001")

fintabnet = DatasetRegistry.get_dataset("fintabnet")
fintabnet.dataflow.categories.filter_categories(categories="CELL")

dataset_train = fintabnet
build_train_config=["max_datapoints=500","build_mode='table'","load_image=True", "use_multi_proc_strict=True","split='train'"]

dataset_val = fintabnet
build_val_config = ["max_datapoints=10","build_mode='table'","load_image=True", "use_multi_proc_strict=True","split='val'"]

coco_metric = MetricRegistry.get_metric("coco")
coco_metric.set_params(max_detections=[50,200,600], area_range=[[0,1000000],[0,200],[200,800],[800,1000000]])
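
The build configurations are passed as lists of `key=value` strings. How the training script turns them into keyword arguments is not shown here; the sketch below is one plausible reading, using `ast.literal_eval`, and `parse_build_config` is a hypothetical helper. Under this assumed scheme, the inner quotes in `"build_mode='table'"` are what make the value a string rather than a name.

```python
import ast
from typing import Any, Dict, List


def parse_build_config(entries: List[str]) -> Dict[str, Any]:
    """Turn ["key=value", ...] strings into a dict of Python values."""
    kwargs: Dict[str, Any] = {}
    for entry in entries:
        key, separator, raw_value = entry.partition("=")
        if not separator:
            raise ValueError(f"expected 'key=value', got {entry!r}")
        # literal_eval only accepts Python literals (numbers, strings, booleans, ...)
        kwargs[key.strip()] = ast.literal_eval(raw_value.strip())
    return kwargs


example_config = ["max_datapoints=500", "build_mode='table'", "load_image=True", "split='train'"]
parsed = parse_build_config(example_config)
print(parsed)
# {'max_datapoints': 500, 'build_mode': 'table', 'load_image': True, 'split': 'train'}
```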


Next, we overwrite some settings of the training configuration.

In [6]:
config_overwrite=["LR_SCHEDULE=50000","TRAIN.EVAL_PERIOD=20","TRAIN.CHECKPOINT_PERIOD=20","BACKBONE.FREEZE_AT=0","TRAIN.BASE_LR=1e-3"]

We can now start training. Make sure that the log directory is set correctly. If such a directory already exists, it will be deleted and created again!
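
One way to avoid losing earlier runs is to generate a new directory name for every run. The helper below is a hypothetical convenience, not part of the training script; it only builds a path that does not exist yet and leaves the creation of the directory to the training call.

```python
from datetime import datetime
from pathlib import Path


def fresh_log_dir(base_dir: str, prefix: str = "sample_train") -> str:
    """Return a path below base_dir that does not exist yet."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    candidate = Path(base_dir).expanduser() / f"{prefix}_{stamp}"
    counter = 1
    while candidate.exists():
        # Same second as an earlier run: append a counter to keep the name unique.
        candidate = candidate.with_name(f"{prefix}_{stamp}_{counter}")
        counter += 1
    return str(candidate)


# Example: log_dir = fresh_log_dir("~/Documents")
```

The returned string can then be passed as `log_dir`.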

In [None]:
train_faster_rcnn(path_config_yaml=path_config_yaml,
                  dataset_train= dataset_train,
                  path_weights=path_weights,
                  config_overwrite=config_overwrite,
                  log_dir="/home/janis/Documents/sample_train",
                  build_train_config=build_train_config,
                  dataset_val=dataset_val,
                  build_val_config=build_val_config,
                  metric=coco_metric,
                  pipeline_component_name="ImageLayoutService"
                 )

In [None]:
train_faster_rcnn(path_config_yaml=path_config_yaml,
                  dataset_train=dataset_train,
                  path_weights=path_weights,
                  config_overwrite=config_overwrite,
                  log_dir="/path/to/log_dir",
                  build_train_config=build_train_config,
                  dataset_val=dataset_val,
                  build_val_config=build_val_config,
                  metric=coco_metric,
                  pipeline_component_name="ImageLayoutService"
                  )