# Fine Tuning

## Introduction

This tutorial shows how to fine-tune a model for a specific task and how to compare its performance with that of the pre-trained model.

As an example, we try to improve table extraction in the Deep-Doctection Analyzer. To make clear what we are addressing, we first need to describe how table extraction works.


![title](./pics/dd_table.png)


Table extraction is carried out in different stages:

- Table detection
- Cell detection
- Row and column detection
- Segmentation / cell labeling

Tables, cells, rows and columns are detected with object detectors (Cascade-RCNN with FPN).
The segmentation is rule-based: each cell is assigned to rows and columns according to how much of the cell they cover.
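The rule-based segmentation step can be illustrated with a minimal sketch. It is not the implementation used in Deep-Doctection: the helper names `coverage` and `assign`, the box layout `(x_min, y_min, x_max, y_max)` and the threshold of 0.5 are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def coverage(cell: Box, region: Box) -> float:
    """Return the fraction of the cell's area that lies inside the region."""
    x0 = max(cell[0], region[0])
    y0 = max(cell[1], region[1])
    x1 = min(cell[2], region[2])
    y1 = min(cell[3], region[3])
    intersection = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    cell_area = (cell[2] - cell[0]) * (cell[3] - cell[1])
    return intersection / cell_area if cell_area > 0 else 0.0


def assign(cell: Box, regions: List[Box], threshold: float = 0.5) -> List[int]:
    """Return the 1-based indices of all regions covering the cell by at least `threshold`."""
    return [i + 1 for i, region in enumerate(regions) if coverage(cell, region) >= threshold]


# Hypothetical 2x2 table layout
rows = [(0, 0, 100, 20), (0, 20, 100, 40)]
cols = [(0, 0, 50, 40), (50, 0, 100, 40)]

simple_cell = (52, 22, 98, 38)    # lies in the second row and second column
spanning_cell = (0, 0, 50, 40)    # covers the first column across both rows

simple_rows, simple_cols = assign(simple_cell, rows), assign(simple_cell, cols)
spanning_rows, spanning_cols = assign(spanning_cell, rows), assign(spanning_cell, cols)
```

In this sketch a cell that is assigned to more than one row or column has a row or column span equal to the number of assigned indices.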

The cell detector was trained on the [**PubTabNet**](https://github.com/ibm-aur-nlp/PubTabNet) dataset. PubTabNet contains approx. 500K tables from the field of medical research.

We want to fine-tune cell detection on a dataset from a completely different domain, namely financial reports. For this we use [**FinTabNet**](https://arxiv.org/pdf/2005.00589.pdf), a dataset that contains around 100K tables from that domain.

This example shows what is needed to fine-tune a detector that is part of a pipeline component.

## Dataset

To fine-tune on your own data quickly, build your dataset along the lines of the existing datasets. We use the DatasetRegistry to access the dataset directly. Before we start fine-tuning, let's take a look at the dataset.

In [1]:
import os

from matplotlib import pyplot as plt

from deep_doctection.utils import get_weights_dir_path, get_configs_dir_path
from deep_doctection.datasets import DatasetRegistry
from deep_doctection.mapper import to_page
from deep_doctection.dataflow import MapData
from deep_doctection.train import train_faster_rcnn

In [2]:
DatasetRegistry.print_dataset_names()

['fintabnet', 'testlayout', 'publaynet', 'pubtabnet', 'xfund']


In [3]:
fintabnet = DatasetRegistry.get_dataset("fintabnet")
fintabnet.dataset_info.description

'FinTabNet dataset contains complex tables from the annual reports of S&P 500 companies with detailed table structure annotations to help train and test structure recognition. To generate the cell structure labels, one uses token matching between the PDF and HTML version of each article from public records and filings. Financial tables often have diverse styles when compared to ones in scientific and government documents, with fewer graphical lines and larger gaps within each table and more colour variations. Fintabnet can be used for training cell detection models as well as for semantic table understanding algorithms. For detection it has cell bounding box annotations as well as precisely described table semantics like row - and column numbers and row and col spans. The dataflow builder can also return captions of bounding boxes of rows and columns. Moreover, various filter conditions on the table structure are available: maximum cell numbers, maximal row and column numbers and their

We refer to the in-depth tutorial for more details about the construction of datasets and the architecture of Deep-Doctection. Nevertheless, we briefly go through the individual steps needed to display a sample from FinTabNet.

In [4]:
df = fintabnet.dataflow.build(split="train", load_image=True, use_multi_proc=False)
df = MapData(df, to_page)
df.reset_state()
df_iter = iter(df)

[1129 12:17.21 @fintabnet.py:158] INF Logic will currently display only ONE table per page, even if there are more !!


1.) Every dataset has a dataflow component. Dataflows load and map data efficiently and are the lifeblood of Deep-Doctection. The build method creates a dataflow for the dataset. Its parameters determine, for example, which split is selected or whether the underlying image is loaded.

2.) In the second line, the core data model is mapped to an easily consumable page object, in which parsed results can be queried and visualized.

3.) The reset_state() method belongs to the Dataflow API and must be called before iterating over the dataflow.

4.) We want to use next to look at individual samples, so we create an iterator.
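The protocol behind these four steps can be sketched in plain Python. The classes `ListFlow` and `MapFlow` below are hypothetical stand-ins written for this illustration; they are not the Deep-Doctection or Tensorpack classes and only mimic the order of calls: build a flow, wrap it with a mapping, call `reset_state()`, then iterate.

```python
from typing import Any, Callable, Iterable, Iterator


class ListFlow:
    """Hypothetical dataflow that yields datapoints from an in-memory list."""

    def __init__(self, items: Iterable[Any]) -> None:
        self._items = list(items)
        self._ready = False

    def reset_state(self) -> None:
        # A real dataflow would initialise workers or random state here.
        self._ready = True

    def __iter__(self) -> Iterator[Any]:
        if not self._ready:
            raise RuntimeError("call reset_state() before iterating")
        yield from self._items


class MapFlow:
    """Hypothetical analogue of a mapping dataflow: applies func lazily to each datapoint."""

    def __init__(self, flow: ListFlow, func: Callable[[Any], Any]) -> None:
        self._flow = flow
        self._func = func

    def reset_state(self) -> None:
        # Resetting the wrapper resets the wrapped flow as well.
        self._flow.reset_state()

    def __iter__(self) -> Iterator[Any]:
        for datapoint in self._flow:
            yield self._func(datapoint)


flow = MapFlow(ListFlow(["page_1", "page_2"]), str.upper)
flow.reset_state()
flow_iter = iter(flow)
first = next(flow_iter)  # the mapping is applied only when a datapoint is requested
```

Because the mapping is applied lazily, creating the iterator is cheap and each sample is transformed only when `next` asks for it.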

Once we have a page object, we draw the annotations onto the image with viz() and display the result.

In [None]:
table = next(df_iter)
image = table.viz()
plt.figure(figsize=(20, 10))
plt.axis('off')
plt.imshow(image)

As the diagram shows, the cell detector does not work on tables embedded in a page but on cropped tables. A corresponding dataflow can be generated by passing "table" as the build_mode parameter. The coordinates of the cells are then converted to the coordinate system of the cropped image.
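The coordinate conversion for cropped tables amounts to a shift by the table's upper-left corner. The following is a minimal sketch under the assumption that boxes are given as `(x_min, y_min, x_max, y_max)` in page pixels; the function `to_crop_coords` and the sample values are hypothetical and not part of the Deep-Doctection API.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in page pixel coordinates
Box = Tuple[float, float, float, float]


def to_crop_coords(cell: Box, table: Box) -> Box:
    """Shift a cell box into the coordinate system of the table crop and clip it to the crop."""
    table_x0, table_y0, table_x1, table_y1 = table
    width, height = table_x1 - table_x0, table_y1 - table_y0
    x0, y0, x1, y1 = cell
    return (
        min(max(x0 - table_x0, 0.0), width),
        min(max(y0 - table_y0, 0.0), height),
        min(max(x1 - table_x0, 0.0), width),
        min(max(y1 - table_y0, 0.0), height),
    )


table_box = (100.0, 200.0, 500.0, 400.0)  # hypothetical table position on the page
cell_box = (120.0, 210.0, 180.0, 230.0)   # hypothetical cell position on the page
cell_in_crop = to_crop_coords(cell_box, table_box)  # (20.0, 10.0, 80.0, 30.0)
```

Clipping keeps boxes that slightly exceed the table boundary inside the cropped image.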

For training, we use a script that corresponds to the Faster-RCNN training script from Tensorpack but is adapted to the Deep-Doctection API. Let's collect all necessary inputs.

In [6]:
path_config_yaml = os.path.join(get_configs_dir_path(), "dd/conf_frcnn_cell.yaml")
dataset_train = fintabnet
path_weights = os.path.join(get_weights_dir_path(), "cell/model-2840000.data-00000-of-00001")
config_overwrite = [
    "TRAIN.STEPS_PER_EPOCH=500",
    "TRAIN.EVAL_PERIOD=20",
    "TRAIN.BASE_LR=1e-3",
    "TRAIN.LR_SCHEDULE=[50000]",
    "PREPROC.TRAIN_SHORT_EDGE_SIZE=[600,1200]",
]
build_train_config = ["build_mode=table", "load_image=True", "max_datapoints=15000", "use_multi_proc=False"]
dataset_val = fintabnet
build_val_config = ["build_mode=table", "load_image=True", "max_datapoints=1000", "use_multi_proc=False"]

We choose the learning rate and the preprocessing resize range as in the Tensorpack example at https://github.com/tensorpack/tensorpack/blob/master/examples/FasterRCNN/BALLOON.md.
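The entries of config_overwrite are "KEY=VALUE" strings with dotted keys. How such strings could be merged into a nested configuration is sketched below. This is an illustration only and not the parser used by the training script; the function `apply_overrides` is hypothetical, and parsing values with `ast.literal_eval` is an assumption of this sketch.

```python
import ast
from typing import Any, Dict, List


def apply_overrides(config: Dict[str, Any], overrides: List[str]) -> Dict[str, Any]:
    """Write 'A.B=value' strings into a nested dict, parsing values as Python literals."""
    for item in overrides:
        dotted_key, raw_value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            # Create intermediate sections on demand
            node = node.setdefault(name, {})
        try:
            node[leaf] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Keep values that are not Python literals as plain strings
            node[leaf] = raw_value
    return config


parsed = apply_overrides(
    {},
    [
        "TRAIN.STEPS_PER_EPOCH=500",
        "TRAIN.BASE_LR=1e-3",
        "TRAIN.LR_SCHEDULE=[50000]",
        "PREPROC.TRAIN_SHORT_EDGE_SIZE=[600,1200]",
    ],
)
```

In this sketch, numbers and lists arrive as typed values (for example `parsed["TRAIN"]["BASE_LR"]` is a float), which is why the overrides can be written without extra quoting.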

Note that because we crop the tables from the document images, the cropped images have to be kept in memory. Choose max_datapoints carefully according to your available RAM. The training log will contain a warning like the following:

[1129 10:51.23 @logger.py:193] WRN Datapoint have images as np arrays stored and they will be loaded into memory. To avoid OOM set 'load_image'=False in dataflow build config. This will load images when needed and reduce memory costs!!!
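A rough estimate of the memory needed for the cropped images helps when choosing max_datapoints. The sketch below assumes uncompressed 8-bit RGB arrays and an average crop size of 400 x 800 pixels. Both the crop size and the helper `estimate_image_memory_gib` are assumptions for illustration, not measurements from FinTabNet.

```python
def estimate_image_memory_gib(
    num_images: int,
    height: int,
    width: int,
    channels: int = 3,
    bytes_per_value: int = 1,
) -> float:
    """Estimate the memory in GiB needed to hold num_images uncompressed image arrays."""
    total_bytes = num_images * height * width * channels * bytes_per_value
    return total_bytes / 2**30


# Assumed average crop size of 400 x 800 pixels (hypothetical value)
train_gib = estimate_image_memory_gib(15000, 400, 800)  # about 13.4 GiB
val_gib = estimate_image_memory_gib(1000, 400, 800)     # about 0.9 GiB
```

Under these assumptions, 15000 training crops already need more than 13 GiB, so max_datapoints should be reduced on machines with less RAM.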


In [None]:
train_faster_rcnn(path_config_yaml=path_config_yaml,
                  dataset_train=dataset_train,
                  path_weights=path_weights,
                  config_overwrite=config_overwrite,
                  log_dir="/home/janis/Documents/test_train",
                  build_train_config=build_train_config,
                  dataset_val=dataset_val,
                  build_val_config=build_val_config,
                  metric_name="coco",
                  pipeline_component_name="ImageLayoutService"
                  )