# Transforming archives into data-driven analyses

[CHR 2023](https://2023.computational-humanities-research.org/) Workshop

Florian Cafiero, Jean-Luc Falcone and Simon Gabay

<img alt="Licence Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" align="right"/>

### Installations

We will use two principal tools for information extraction:

- To segment the pages, we are going to use [YALTAi](https://github.com/PonteIneptique/YALTAi) developed by Thibault Clérice (more info: [arXiv.2207.11230](https://doi.org/10.48550/arXiv.2207.11230)).
- To extract the text we use [Kraken](https://github.com/mittagessen/kraken) developed by Benjamin Kiessling (more info: [10.34894/Z9G2EX](https://doi.org/10.34894/Z9G2EX)).

⚠️ YALTAi contains Kraken, so there is no need to install Kraken separately.

In [None]:
# Installing YALTAi also installs Kraken as a dependency
!pip install YALTAi

## Image Segmentation

We download a resolution (`A_RES_45_212-EN`) on the _Protection of global climate for present and future generations of mankind_ [from the Digital Library of the UN](https://digitallibrary.un.org/record/196769).

In [None]:
!mkdir -p content
!wget https://digitallibrary.un.org/record/196769/files/A_RES_45_212-EN.pdf  -P content
# Change the name to simplify manipulations
!mv content/A_RES_45_212-EN.pdf content/resolution.pdf
# Convert pdf into images
!pip install pypdfium2
import pypdfium2 as pdfium
#path to file
pdf = pdfium.PdfDocument("content/resolution.pdf")
#number of pages
n_pages = len(pdf)
#turn into png
for page_number in range(n_pages):
    page = pdf.get_page(page_number)
    pil_image = page.render(
        scale=5, #1=72dpi, increase for a better resolution
        rotation=0,
        crop=(0, 0, 0, 0),
    ).to_pil()
    pil_image.save(f"content/image_{page_number+1}.png")
# A bit of cleaning (-p avoids an error if the folder already exists)
!rm content/resolution.pdf
!mkdir -p content/images
!mv content/image_*.png content/images/
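
The `scale` argument used above multiplies the 72 dpi baseline of a PDF page, so `scale=5` renders at 360 dpi. The following is a minimal sketch of that arithmetic; `scale_for_dpi` and `dpi_for_scale` are hypothetical helpers written for this illustration, not functions of pypdfium2.

```python
# PDF user space is defined at 72 points per inch, so scale=1 renders at 72 dpi.
PDF_BASE_DPI = 72


def scale_for_dpi(target_dpi: float) -> float:
    """Return the render scale that yields `target_dpi` (hypothetical helper)."""
    if target_dpi <= 0:
        raise ValueError("target_dpi must be positive")
    return target_dpi / PDF_BASE_DPI


def dpi_for_scale(scale: float) -> float:
    """Return the resolution in dpi obtained with a given render scale."""
    return scale * PDF_BASE_DPI


print(dpi_for_scale(5))    # resolution obtained with scale=5
print(scale_for_dpi(300))  # scale needed for a 300 dpi render
```

A lower scale produces smaller images and faster processing, at the cost of detail for the later text recognition.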

Let's have a look at this resolution now. Here is the first page:

In [None]:
from matplotlib import pyplot as plt
from matplotlib import image as mpimg

image = mpimg.imread("content/images/image_1.png")
plt.figure(figsize=(30, 12), dpi=100)
plt.imshow(image)
plt.gca().axes.get_yaxis().set_visible(False)
plt.show()

Some models are already available. We are going to use a model for historical French prints (16th c.-18th c.) trained at the University of Geneva by Maxime Humeau. This model is used for layout analysis, using the controlled vocabulary [SegmOnto](https://segmonto.github.io).

SegmOnto aims to model the page in a way that is as universal as possible, so that the same vocabulary can describe very different documents.

<table>
  <tr>
    <th>Historical Print</th>
    <th>Medieval manuscript</th>
  </tr>
  <tr>
    <td><img src="images/btv1b86070385_f140_ann.jpg" width="300px"></td>
    <td><img src="images/btv1b84259980_f29_ann.jpg" width="340px"></td>
  </tr>
</table>
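
SegmOnto labels are plain strings. Assuming the `Type:subtype#number` pattern described in the SegmOnto guidelines (for instance `MainZone:column#1`), a label can be split into its parts as sketched below. `parse_segmonto_label` and `SegmOntoLabel` are hypothetical names introduced for this illustration; they are not part of any SegmOnto tooling.

```python
import re
from typing import NamedTuple, Optional


class SegmOntoLabel(NamedTuple):
    name: str                 # e.g. "MainZone"
    subtype: Optional[str]    # e.g. "column", or None if absent
    number: Optional[int]     # e.g. 1, or None if absent


# Assumed pattern: Type, optional ":subtype", optional "#number"
_LABEL_RE = re.compile(
    r"^(?P<name>[A-Za-z]+)(?::(?P<subtype>[^#]+))?(?:#(?P<number>\d+))?$"
)


def parse_segmonto_label(label: str) -> SegmOntoLabel:
    """Split a label such as 'MainZone:column#1' into its components."""
    match = _LABEL_RE.match(label.strip())
    if match is None:
        raise ValueError(f"not a recognised label: {label!r}")
    number = match.group("number")
    return SegmOntoLabel(
        match.group("name"),
        match.group("subtype"),
        int(number) if number is not None else None,
    )


print(parse_segmonto_label("MainZone:column#1"))
print(parse_segmonto_label("MainZone"))
```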

Data have been prepared under the supervision of Ariane Pinche (CNRS) and Simon Gabay (UniGE) with [eScriptorium](https://ieeexplore.ieee.org/document/8893029), an open source web app to prepare data.

<img src="images/escriptorium.png" width="600px">

The University of Geneva is contributing via its own instance called [FoNDUE](https://www.unige.ch/lettres/humanites-numeriques/recherche/projets-de-la-chaire/fondue). The FoNDUE project aims at interfacing eScriptorium with HPC clusters using Slurm (right), rather than with a single machine as in other instances (left).

<img src="https://raw.githubusercontent.com/gabays/CHR_2023/main/images/Fondue.png"  width="600px">


In [None]:
# Download the model
!wget https://github.com/rayondemiel/Yolov8-Segmonto/releases/download/yolov8/remaining_goat_6779_best.pt -P content
!mv content/remaining_goat_6779_best.pt content/seg_model.pt
# Load the model
from ultralytics import YOLO
model = YOLO("content/seg_model.pt")
# Use a GPU if you have one
#model.to('cuda')
model.info()
# Fuse PyTorch Conv2d and BatchNorm2d layers. This improves inference time and therefore execution time.
model.fuse()

Let's use it now!

In [None]:
from PIL import Image
# Load the image
img = "content/images/image_1.png"
# Prediction
results = model(img)
# Plot the result
for r in results:
    im_array = r.plot(conf=True)  # plot a BGR numpy array of predictions
    im = Image.fromarray(im_array[..., ::-1])  # RGB PIL image
    plt.figure(figsize=(30, 12), dpi=100)
    plt.imshow(im)
    plt.gca().axes.get_yaxis().set_visible(False)
    plt.show()
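
Beyond the plot, the detections can also be handled as data. In ultralytics, each result exposes the box coordinates (`r.boxes.xyxy`), the confidence scores (`r.boxes.conf`), the class indices (`r.boxes.cls`) and the class names (`r.names`). The sketch below works on plain values so that it runs without a model; the `Zone` class, the `reading_order` helper, the threshold and the sample values are illustrative assumptions, not output of the model above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Zone:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float
    conf: float


def reading_order(zones: Iterable[Zone], min_conf: float = 0.5) -> List[Zone]:
    """Drop low-confidence zones, then sort top-to-bottom and left-to-right.

    This is a naive ordering: it does not handle multi-column layouts.
    """
    kept = [z for z in zones if z.conf >= min_conf]
    return sorted(kept, key=lambda z: (z.y1, z.x1))


# With ultralytics, the zones could be built from a result `r` like this
# (not executed here):
# zones = [Zone(r.names[int(c)], *xyxy, p)
#          for xyxy, p, c in zip(r.boxes.xyxy.tolist(),
#                                r.boxes.conf.tolist(),
#                                r.boxes.cls.tolist())]

# Invented sample values, for illustration only
sample = [
    Zone("MainZone", 100, 400, 900, 1500, 0.93),
    Zone("NumberingZone", 480, 1550, 520, 1580, 0.31),
    Zone("RunningTitleZone", 300, 80, 700, 120, 0.88),
]

for zone in reading_order(sample):
    print(zone.label, zone.conf)
```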

## Optical character recognition

We now need a Kraken model for text recognition:

In [None]:
!cp UN_ft.mlmodel content
!mv content/UN_ft.mlmodel content/htr_model.mlmodel

First, we segment the pages.

⚠️ It takes a bit of time, approx. 1 minute / image

In [None]:
# Image Segmentation
!yaltai kraken --device cpu -I "content/images/*.png" --suffix ".xml" segment --yolo content/seg_model.pt
print("pages have been segmented 🥳")
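
The segmentation is written as one ALTO XML file per page. To check what was detected, the zones can be counted per label. The sketch below assumes that zone types are declared as `OtherTag` elements and referenced from `TextBlock` elements through the `TAGREFS` attribute, as the ALTO 4 schema allows; the sample document is hand-written for illustration and `count_zone_labels` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip the XML namespace from an element tag."""
    return tag.rsplit("}", 1)[-1]


def count_zone_labels(alto_xml: str) -> Counter:
    """Count TextBlock elements per label declared in the Tags section."""
    root = ET.fromstring(alto_xml)
    # Map tag IDs to their human-readable labels
    labels = {
        el.get("ID"): el.get("LABEL")
        for el in root.iter()
        if local_name(el.tag) == "OtherTag"
    }
    counts = Counter()
    for el in root.iter():
        if local_name(el.tag) == "TextBlock":
            for ref in (el.get("TAGREFS") or "").split():
                counts[labels.get(ref, "unknown")] += 1
    return counts


# Hand-written sample, not actual output of the segmentation above
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="TYPE_1" LABEL="MainZone"/>
    <OtherTag ID="TYPE_2" LABEL="NumberingZone"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="TYPE_1"/>
    <TextBlock ID="b2" TAGREFS="TYPE_1"/>
    <TextBlock ID="b3" TAGREFS="TYPE_2"/>
  </PrintSpace></Page></Layout>
</alto>"""

print(count_zone_labels(SAMPLE))
```

To apply it to a real page, read the file first, for example with `Path("content/images/image_1.xml").read_text(encoding="utf-8")`.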

We need to correct the path of the image in the ALTO files, so that it is relative to the folder that contains them.

In [None]:
# For macOS (BSD sed expects an explicit, here empty, backup suffix after -i)
!sed -i '' -e "s/content\/images\/image_/image_/g" content/images/*.xml

# For Linux (GNU sed)
#!sed -i "s/content\/images\/image_/image_/g" content/images/*.xml
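
Because `sed -i` behaves differently on macOS and Linux, the same substitution can also be done in Python. This is a minimal sketch under the assumption that the only change needed is dropping the `content/images/` prefix in front of `image_`; `strip_image_dir` and `fix_alto_paths` are hypothetical helpers.

```python
from pathlib import Path

# Folder prefix used in this notebook
PREFIX = "content/images/"


def strip_image_dir(xml_text: str, prefix: str = PREFIX) -> str:
    """Turn 'content/images/image_1.png' into 'image_1.png' in the given text."""
    return xml_text.replace(prefix + "image_", "image_")


def fix_alto_paths(folder: str = "content/images") -> int:
    """Rewrite every XML file in `folder` in place; return how many changed."""
    changed = 0
    for path in sorted(Path(folder).glob("*.xml")):
        text = path.read_text(encoding="utf-8")
        fixed = strip_image_dir(text)
        if fixed != text:
            path.write_text(fixed, encoding="utf-8")
            changed += 1
    return changed


sample = "<fileName>content/images/image_1.png</fileName>"
print(strip_image_dir(sample))
```

Calling `fix_alto_paths()` applies the substitution to every ALTO file of the notebook's image folder, on any platform.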

Then we run the text recognition (OCR).

In [None]:
# HTR
# Alternative: write the result as ALTO instead of plain text
#!kraken --alto --suffix ".xml" -I "content/images/image*.xml" -f alto ocr -m "content/htr_model.mlmodel"
!kraken --suffix ".txt" -I "content/images/image*.xml" -f alto ocr -m "content/htr_model.mlmodel"
# Move the ALTO files to data/ (the folder must exist, otherwise mv fails)
!mkdir -p data
!mv content/images/*.xml data/
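
Pages are numbered `image_1`, `image_2`, and so on, so a plain alphabetical sort puts `image_10` before `image_2`. When the per-page text files are merged into one document, sorting on the page number avoids that. The sketch below relies only on the `image_<n>` naming used in this notebook, not on a particular file extension; `page_number` and `sort_pages` are hypothetical helpers.

```python
import re
from pathlib import Path
from typing import Iterable, List

_PAGE_RE = re.compile(r"image_(\d+)")


def page_number(path) -> int:
    """Extract the page number from a name such as 'image_12.txt'."""
    match = _PAGE_RE.search(Path(path).name)
    if match is None:
        raise ValueError(f"no page number in {path!r}")
    return int(match.group(1))


def sort_pages(paths: Iterable) -> List:
    """Sort file names by page number rather than alphabetically."""
    return sorted(paths, key=page_number)


names = ["image_10.txt", "image_2.txt", "image_1.txt"]
print(sort_pages(names))
```

The sorted list can then be read page by page and joined, for instance with `"\n".join(Path(p).read_text(encoding="utf-8") for p in sort_pages(files))`.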

In [None]:
# Convert the ALTO files into TEI
!python scripts/alto2tei.py --config config.yml --version "4.1.3" --sourcedoc --body