# Historical document: page and line extraction

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17k4FlJlwxVht7QhT9GKmitQni_e9evce)

Open this Jupyter notebook in Colab to get access to GPUs.

In this tutorial you will use the [dhSegment package](https://github.com/dhlab-epfl/dhSegment), which is a tool for segmentation of historical documents.

## Installation and downloads

Before running this notebook, you need to install the dhSegment package:

In [0]:
!pip install git+https://github.com/dhlab-epfl/dhSegment


The models that you will use have already been trained (on the READ-BAD dataset), so you only need to download them.

Download and unzip model.zip from https://github.com/dhlab-epfl/dhSegment/releases/tag/v0.2 for a page extraction model.

In [0]:
!wget https://github.com/dhlab-epfl/dhSegment/releases/download/v0.2/model.zip
!unzip model.zip

Then download and unzip line_model.zip for a text line detection model:

In [0]:
!wget https://github.com/dhlab-epfl/fdh-tutorials/releases/download/v0.1/line_model.zip
!unzip line_model.zip

Download a few images to process:

In [0]:
!mkdir images
!wget https://raw.githubusercontent.com/dhlab-epfl/fdh-tutorials/master/computer-vision-deep-learning/3-applications/dl-document-processing-textlines/images/002_451_001.jpg -P images
!wget https://raw.githubusercontent.com/dhlab-epfl/fdh-tutorials/master/computer-vision-deep-learning/3-applications/dl-document-processing-textlines/images/0167.jpg -P images
!wget https://raw.githubusercontent.com/dhlab-epfl/fdh-tutorials/master/computer-vision-deep-learning/3-applications/dl-document-processing-textlines/images/RT_Aigen_am_Inn_013_0127.jpg -P images

## Page extraction

In [0]:
from skimage import io
import numpy as np
import matplotlib.pyplot as plt
from skimage.transform import resize
import tensorflow as tf

In [0]:
import dh_segment
from dh_segment.inference import LoadedModel
from dh_segment.io import PAGE
from dh_segment.post_processing import thresholding, cleaning_binary, cleaning_probs

Show one image:

In [0]:
image_filename = '/content/images/0167.jpg'
img = io.imread(image_filename)
plt.figure(figsize=(10,10))
plt.imshow(img)

Load the page extraction model:

In [0]:
sess1 = tf.InteractiveSession()
with sess1.graph.as_default():
    model_page = LoadedModel("/content/model/")


You'll feed the image to the network:


In [0]:
output_page = model_page.predict(image_filename)

The ``predict`` method returns a dictionary with the following keys:

- ``probs`` : the probability maps
- ``original_shape`` : the shape of the original image
- ``labels`` : the labels (the binary prediction, i.e. the thresholded probs)

The probability map of interest to us is in channel 1 (``output_page['probs'][0,:,:,1]``):
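
As a rough illustration of that indexing, the sketch below builds a small synthetic stand-in for the prediction dictionary with NumPy. The array sizes, the two-class layout, the ``original_shape`` value and the ``fake_output`` name are all assumptions made for the example; they are not values produced by the model.

```python
import numpy as np

# Synthetic stand-in for a prediction: batch of 1, 4x6 pixels, 2 classes (assumed).
rng = np.random.default_rng(0)
height, width, n_classes = 4, 6, 2
logits = rng.normal(size=(1, height, width, n_classes))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

fake_output = {
    "probs": probs,                   # shape (batch, height, width, classes)
    "original_shape": (400, 600),     # made-up size of the input image
    "labels": probs.argmax(axis=-1),  # assumption: most probable class per pixel
}

# Drop the batch dimension and keep only channel 1 (the "page" class here).
page_probs = fake_output["probs"][0, :, :, 1]
print(page_probs.shape)  # (4, 6)
```

The result is a plain 2D array with one probability per pixel, which is what the plotting and post-processing cells work on.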


In [0]:
page_probs = output_page['probs'][0,:,:,1]
plt.figure(figsize=(10,10))
plt.imshow(page_probs)

### Page probability map post-processing


First we threshold the probabilities to obtain a binary image (``thresholding``), then we clean the binary image to remove small elements (``cleaning_binary``).
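
The idea behind these two steps can be sketched with NumPy alone on a toy probability map. The cleaning step here is a plain morphological opening with a square window, chosen for illustration; it is an assumption and is not claimed to match what ``cleaning_binary`` does internally. The helper names (``erode``, ``dilate``, ``opening``) are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def _windows(mask, k):
    """View every k x k neighbourhood of a boolean mask (k odd, False outside)."""
    padded = np.pad(mask, k // 2, mode="constant", constant_values=False)
    return sliding_window_view(padded, (k, k))

def erode(mask, k):
    # A pixel survives only if its whole neighbourhood is True.
    return _windows(mask, k).all(axis=(-2, -1))

def dilate(mask, k):
    # A pixel becomes True if any pixel in its neighbourhood is True.
    return _windows(mask, k).any(axis=(-2, -1))

def opening(mask, k):
    # Erosion removes small specks, dilation restores the surviving shapes.
    return dilate(erode(mask, k), k)

# Toy probability map: one confident 7x7 "page", one isolated speck,
# and a low-confidence area that stays below the threshold.
probs = np.zeros((12, 12))
probs[2:9, 2:9] = 0.9
probs[10, 10] = 0.95
probs[0:2, 0:5] = 0.5

mask = probs > 0.7          # thresholding step
cleaned = opening(mask, 3)  # cleaning step (assumed: opening)
print(mask.sum(), cleaned.sum())  # 50 49
```

The isolated pixel passes the threshold but disappears after the opening, while the large block is left unchanged.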

We then need to obtain the coordinates of the page.


In [0]:
page_mask = thresholding(page_probs, threshold=0.7)
page_mask = cleaning_binary(page_mask, kernel_size=7).astype(np.uint8)*255
plt.figure(figsize=(10,10))
plt.imshow(page_mask)

### Find and save the coordinates of the page

In [0]:
from dh_segment.post_processing import boxes_detection
page_coords = boxes_detection.find_boxes(resize(page_mask, img.shape[:2]).astype(np.uint8), n_max_boxes=1)
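
The cell above resizes the mask to the original image size before looking for the box. An alternative, sketched below under simplifying assumptions, is to find the box on the small mask and rescale its corners afterwards. The ``bounding_box`` and ``rescale`` helpers are hypothetical, and the box is axis-aligned for simplicity, which is not necessarily what ``find_boxes`` returns.

```python
import numpy as np

def bounding_box(mask):
    """Return the 4 corners (x, y) of the axis-aligned box around the True pixels."""
    ys, xs = np.nonzero(mask)
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]])

def rescale(points, from_shape, to_shape):
    """Map (x, y) points from an image of from_shape to one of to_shape, both (H, W)."""
    scale = np.array([to_shape[1] / from_shape[1], to_shape[0] / from_shape[0]])
    return np.round(points * scale).astype(int)

# Toy mask at prediction resolution, and a made-up original size 10x larger.
small_mask = np.zeros((10, 20), dtype=bool)
small_mask[2:8, 5:15] = True

box_small = bounding_box(small_mask)
box_full = rescale(box_small, small_mask.shape, (100, 200))
print(box_full.tolist())  # [[50, 20], [140, 20], [140, 70], [50, 70]]
```

Points are stored as (x, y) while NumPy shapes are (height, width), hence the swapped indices in ``rescale``.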

``PAGE.Page`` is a class representing a historical document page with its ``TextRegions``, ``TextLines``, ``Border``, etc.

In [0]:
PAGE_info = PAGE.Page(image_width=img.shape[1], image_height=img.shape[0],
                      page_border=PAGE.Border(PAGE.Point.list_to_point(list(page_coords))))

In [0]:
plot_img = img.copy()
PAGE_info.draw_page_border(plot_img, autoscale=True, fill=False, thickness=15)
plt.figure(figsize=(10,10))
plt.imshow(plot_img)

## Textline detection
Now that we have extracted the page, we'd like to extract the text lines, so we call the ``predict`` method again, this time on the text line model.

In [0]:
sess2 = tf.InteractiveSession(graph=tf.Graph()) # The second session needs its own, newly created graph
with sess2.graph.as_default():
    model_textline = LoadedModel("/content/polylines/")

In [0]:
output_textline = model_textline.predict(image_filename)

In [0]:
textline_probs = output_textline['probs'][0,:,:,1]
plt.figure(figsize=(10,10))
plt.imshow(textline_probs)

### Line probability map post-processing
We will use the binary mask we extracted in the previous section to eliminate the lines belonging to the left page that were wrongly detected.

In [0]:
from dh_segment.post_processing import hysteresis_thresholding

textline_probs2 = cleaning_probs(textline_probs, 2)
extracted_page_mask = np.zeros(textline_probs.shape, dtype=np.uint8)
PAGE_info.draw_page_border(extracted_page_mask, color=(255,))
textline_mask = hysteresis_thresholding(textline_probs2, low_threshold=0.3, high_threshold=0.6,
                                        candidates_mask=extracted_page_mask>0
                                       )
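
To make the role of the two thresholds and of the candidates mask concrete, here is a minimal NumPy sketch of hysteresis thresholding on a toy array. It is written from the general definition of the technique (keep weak pixels only when they are connected to a strong pixel) and is an assumption about the behaviour, not a copy of the dhSegment implementation. The ``hysteresis`` function name is hypothetical.

```python
import numpy as np

def hysteresis(probs, low, high, candidates=None):
    """Keep pixels above `low` that are 8-connected to a pixel above `high`."""
    low_mask = probs > low
    if candidates is not None:
        low_mask &= candidates            # pixels outside the mask are never kept
    keep = low_mask & (probs > high)      # strong pixels act as seeds
    h, w = keep.shape
    while True:
        # Grow the kept area by one pixel in all 8 directions...
        padded = np.pad(keep, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(keep)
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                grown |= padded[dy:dy + h, dx:dx + w]
        grown &= low_mask                 # ...but only into weak pixels
        if np.array_equal(grown, keep):
            return keep
        keep = grown

# Two weak segments on the middle row; only the left one contains a strong pixel.
probs = np.zeros((3, 6))
probs[1] = [0.4, 0.7, 0.4, 0.0, 0.4, 0.4]

print(hysteresis(probs, 0.3, 0.6)[1].tolist())
# [True, True, True, False, False, False]

# Restricting candidates (here: everything except column 0) trims the result.
inside = np.ones_like(probs, dtype=bool)
inside[:, 0] = False
print(hysteresis(probs, 0.3, 0.6, candidates=inside)[1].tolist())
# [False, True, True, False, False, False]
```

In the notebook, the candidates mask plays the part of ``inside``: line pixels that fall outside the extracted page are discarded regardless of their probability.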

In [0]:
plt.figure(figsize=(10,10))
plt.imshow(textline_mask)

### Find the lines, vectorize and save them

In [0]:
from dh_segment.post_processing import line_vectorization

lines = line_vectorization.find_lines(resize(textline_mask, img.shape[:2]))
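
"Vectorizing" means turning a blob of mask pixels into a short list of points. The sketch below shows one very simplified way to do this for a single blob, by following its vertical centre column by column. This is an illustrative assumption and not the algorithm used by ``find_lines``; the ``centerline`` helper and its ``step`` parameter are hypothetical.

```python
import numpy as np

def centerline(blob_mask, step=2):
    """Return (x, y) points following the vertical centre of a single blob."""
    points = []
    # Columns that contain at least one pixel of the blob.
    columns = np.nonzero(blob_mask.any(axis=0))[0]
    for x in columns[::step]:
        ys = np.nonzero(blob_mask[:, x])[0]
        points.append((int(x), float(ys.mean())))
    return points

# Toy blob: a two-pixel-thick line that steps down by one row halfway through.
blob = np.zeros((6, 8), dtype=bool)
blob[2:4, 1:4] = True
blob[3:5, 4:7] = True

print(centerline(blob))  # [(1, 2.5), (3, 2.5), (5, 3.5)]
```

A real mask contains many lines, so each connected component would first have to be separated before a step like this could be applied to it.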

In the ``Page`` structure, ``TextLines`` are always contained in a ``TextRegion`` object, so we first create a ``TextRegion`` whose ``TextLines`` are the ones we just extracted.
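
The containment described here (page, then regions, then lines) can be pictured with a few dataclasses. The classes below are hypothetical stand-ins written only to show the nesting; they are not the ``PAGE`` classes and carry none of their methods.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in image coordinates

@dataclass
class SketchTextLine:
    coords: List[Point]

@dataclass
class SketchTextRegion:
    text_lines: List[SketchTextLine] = field(default_factory=list)

@dataclass
class SketchPage:
    image_width: int
    image_height: int
    text_regions: List[SketchTextRegion] = field(default_factory=list)

    def all_lines(self) -> List[SketchTextLine]:
        # Lines are only reachable through the regions that contain them.
        return [line for region in self.text_regions for line in region.text_lines]

page = SketchPage(image_width=600, image_height=400)
region = SketchTextRegion()
region.text_lines = [
    SketchTextLine(coords=[(10, 20), (200, 22)]),
    SketchTextLine(coords=[(10, 60), (200, 61)]),
]
page.text_regions.append(region)
print(len(page.all_lines()))  # 2
```

Because a line cannot be attached to the page directly, a single region holding all lines is the simplest valid structure when no region detection has been done.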

In [0]:
text_region = PAGE.TextRegion()
text_region.text_lines = [PAGE.TextLine.from_array(line) for line in lines]
PAGE_info.text_regions.append(text_region)

## Visualize the final result

In [0]:
plot_img = img.copy()
PAGE_info.draw_page_border(plot_img, autoscale=True, fill=False, thickness=15)
PAGE_info.draw_lines(plot_img, autoscale=True, fill=False, thickness=15, color=(0,255,0))
plt.figure(figsize=(15,15))
plt.imshow(plot_img)