[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gabays/CHR_2023/blob/main/CHR_digital_diplomacy.ipynb)

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/gabays/CHR_2023/HEAD)


# Transforming archives into data-driven analyses

[CHR 2023](https://2023.computational-humanities-research.org/) Workshop

Florian Cafiero, Jean-Luc Falcone and Simon Gabay

<img alt="Licence Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" align="right"/>

## Installations

We will use two principal tools for information extraction:

- To segment the pages, we are going to use [YALTAi](https://github.com/PonteIneptique/YALTAi) developed by Thibault Clérice (more info: [arXiv.2207.11230](https://doi.org/10.48550/arXiv.2207.11230)).
- To extract the text we use [Kraken](https://github.com/mittagessen/kraken) developed by Benjamin Kiessling (more info: [10.34894/Z9G2EX](https://doi.org/10.34894/Z9G2EX)).

⚠️ YALTAi installs Kraken as a dependency, so there is no need to install it separately.

In [None]:
!pip install --root-user-action=ignore --upgrade setuptools
print("setuptools")
!pip install --root-user-action=ignore --upgrade pip
print("pip")
# Version specifiers are quoted so that the shell does not treat ">" as a redirection
!pip install --root-user-action=ignore fastapi kaleido python-multipart uvicorn "tabulate>=0.9" "jedi>=0.16"
print("fastapi")
!pip install --root-user-action=ignore YALTAi torch==2.1.0

## Image Segmentation

We download [from the Digital Library of the UN](https://digitallibrary.un.org/record/196769) a resolution (`A_RES_45_212-EN`) on the _Protection of global climate for present and future generations of mankind_.

In [None]:
# Download the PDF under a simpler name (the URL is quoted because it contains "?")
!wget -O resolution.pdf "https://digitallibrary.un.org/record/196769/files/A_RES_45_212-EN.pdf?ln=fr"
# Convert pdf into images
!pip install --root-user-action=ignore pdf2image
!apt-get install -y poppler-utils
from pdf2image import convert_from_path
# Choose resolution
dpi = 500 # dots per inch
pdf_file = '/content/resolution.pdf'
pages = convert_from_path(pdf_file, dpi=dpi)
# Save each page as a JPEG image
for i, page in enumerate(pages):
    page.save('output_{}.jpg'.format(i), 'JPEG')
# Move the images into their own folder and delete the PDF
!mkdir -p /content/images
!mv output_*.jpg /content/images
!rm /content/resolution.pdf
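
The `dpi` value chosen above directly determines how large each page image is. As a rough sketch of the arithmetic, the snippet below computes pixel dimensions and the approximate uncompressed size for a few resolutions. The page format (US Letter, 8.5 x 11 inches) is an assumption for illustration, not something read from the PDF, and the helper names are ours.

```python
def pixel_size(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page measured in inches."""
    return round(width_in * dpi), round(height_in * dpi)


def rgb_megabytes(width_px, height_px):
    """Approximate in-memory size of an uncompressed 8-bit RGB image, in MB."""
    return width_px * height_px * 3 / 1_000_000


# Assumption: a US Letter page (8.5 x 11 inches); the actual PDF may differ.
for dpi in (150, 300, 500):
    w, h = pixel_size(8.5, 11, dpi)
    print(f"{dpi} dpi: {w} x {h} px, about {rgb_megabytes(w, h):.0f} MB uncompressed")
```

Higher resolutions give the segmentation and recognition models more detail to work with, at the cost of memory and processing time.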

Let's have a look at this resolution now. Here is the first page:

In [None]:
from matplotlib import pyplot as plt
from matplotlib import image as mpimg

image = mpimg.imread("/content/images/output_0.jpg")
plt.figure(figsize=(30, 12), dpi=100)
plt.imshow(image)
plt.gca().axes.get_yaxis().set_visible(False)
plt.show()

Some models are already available. We are going to use a model for historical French prints (16th c.-18th c.) trained at the University of Geneva by Maxime Humeau. This model performs layout analysis, using the controlled vocabulary [SegmOnto](https://segmonto.github.io).

SegmOnto aims to model the page in a way that is as universal as possible, so that the same categories can describe a historical print as well as a medieval manuscript.

<table>
  <tr>
    <th>Historical Print</th>
    <th>Medieval manuscript</th>
  </tr>
  <tr>
    <td><img src="https://raw.githubusercontent.com/gabays/CHR_2023/main/images/btv1b86070385_f140_ann.jpg" height="300px"></td>
    <td><img src="https://raw.githubusercontent.com/gabays/CHR_2023/main/images/btv1b84259980_f29_ann.jpg" height="300px"></td>

  </tr>
</table>
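
To make the vocabulary more concrete, here is a small sketch of how a SegmOnto label could be split into its parts. It assumes that labels follow the `Type:subtype#number` pattern (for instance `MainZone:column#1`) as we understand it from the SegmOnto guidelines; the class `SegmOntoLabel` and the function `parse_label` are hypothetical helpers, not part of any library.

```python
import re
from typing import NamedTuple, Optional


class SegmOntoLabel(NamedTuple):
    zone_type: str
    subtype: Optional[str]
    number: Optional[int]


# Assumed pattern: a type, then an optional ":subtype", then an optional "#number".
LABEL_PATTERN = re.compile(
    r"^(?P<type>[A-Za-z]+)(?::(?P<subtype>[A-Za-z]+))?(?:#(?P<number>\d+))?$"
)


def parse_label(label):
    """Split a label such as 'MainZone:column#1' into its three components."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Not a recognised label: {label!r}")
    number = match.group("number")
    return SegmOntoLabel(
        match.group("type"),
        match.group("subtype"),
        int(number) if number else None,
    )


print(parse_label("MainZone"))
print(parse_label("MainZone:column#1"))
```

Only the first component is mandatory in this sketch, which is why a plain `MainZone` is accepted.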

The data were prepared under the supervision of Ariane Pinche (CNRS) and Simon Gabay (UniGE) with [eScriptorium](https://ieeexplore.ieee.org/document/8893029), an open-source web application for data preparation.

<img src="https://raw.githubusercontent.com/gabays/CHR_2023/main/images/escriptorium.png" height="300px">

The University of Geneva is contributing via its own instance called [FoNDUE](https://www.unige.ch/lettres/humanites-numeriques/recherche/projets-de-la-chaire/fondue). The FoNDUE project aims at interfacing eScriptorium with HPC clusters using Slurm (right), rather than with a single machine like other instances (left).

<img src="https://raw.githubusercontent.com/gabays/CHR_2023/main/images/Fondue.png" height="250px">


In [None]:
# Download the model
!wget https://github.com/rayondemiel/Yolov8-Segmonto/releases/download/yolov8/remaining_goat_6779_best.pt
!mv remaining_goat_6779_best.pt seg_model.pt
# Load the model
from ultralytics import YOLO
model = YOLO("/content/seg_model.pt")
# Use GPU
model.to('cuda')
model.info()
# Fuse PyTorch Conv2d and BatchNorm2d layers. This improves inference time and therefore execution time.
model.fuse()
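
The cell above assumes that a GPU is available, which is not always the case (for instance on Binder, or on a Colab runtime without an accelerator). A minimal sketch of a fallback, where `pick_device` is a hypothetical helper of ours:

```python
def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to move to a GPU anyway.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
# In the cell above, `model.to(device)` could then replace `model.to('cuda')`.
```

Inference on CPU works but is noticeably slower, especially on images rendered at a high resolution.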

Let's use it now!

In [None]:
from PIL import Image
# Load the image
img = "/content/images/output_0.jpg"
# Prediction
results = model(img)
# Plot the result
for r in results:
    im_array = r.plot(conf=True)  # plot a BGR numpy array of predictions
    im = Image.fromarray(im_array[..., ::-1])  # RGB PIL image
    plt.figure(figsize=(30, 12), dpi=100)
    plt.imshow(im)
    plt.gca().axes.get_yaxis().set_visible(False)
    plt.show()
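
Besides the plot, the predictions can also be inspected as data. The sketch below counts the detected zones per class above a confidence threshold. It is written against plain Python lists so that it runs on its own; the example values are invented for illustration, and the function name `summarise_detections` and the default threshold of 0.5 are our assumptions.

```python
from collections import Counter


def summarise_detections(names, class_ids, confidences, threshold=0.5):
    """Count detected zones per class name, keeping only confident detections."""
    counts = Counter()
    for class_id, confidence in zip(class_ids, confidences):
        if confidence >= threshold:
            counts[names[int(class_id)]] += 1
    return dict(counts)


# Illustrative values only (not real model output).
example_names = {0: "MainZone", 1: "NumberingZone"}
print(summarise_detections(example_names, [0, 0, 1], [0.91, 0.32, 0.77]))

# With the `results` object from the cell above, the inputs could be read as:
#   r = results[0]
#   summarise_detections(r.names, r.boxes.cls.tolist(), r.boxes.conf.tolist())
```

Such a summary is a quick way to spot a page where the model found no main text zone at all.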

## Optical character recognition

We now need a Kraken model for text recognition:

In [None]:
!wget https://github.com/gabays/CHR_2023/raw/main/19thcenturyprint.mlmodel
!mv /content/19thcenturyprint.mlmodel /content/htr_model.mlmodel

In [None]:
# Image Segmentation
!yaltai kraken --device cuda:0 -I "/content/images/output_0.jpg" --suffix ".xml" segment --yolo /content/seg_model.pt
!mkdir segmented
!mv /content/images/*.xml segmented
# HTR
!kraken --alto --suffix ".xml" -I "/content/segmented/*.xml" -f alto ocr -m "/content/htr_model.mlmodel"
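
The recognised text is stored in ALTO XML, where each `TextLine` element contains one or more `String` elements whose `CONTENT` attribute holds the text. As a minimal sketch, the function below collects the text line by line. The sample document is hand-written for illustration and is not actual Kraken output; the helper names are ours, and the exact segmentation into `String` elements may differ in the real files.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def alto_lines(xml_text):
    """Return the text of each TextLine, joining its String elements."""
    root = ET.fromstring(xml_text)
    lines = []
    for element in root.iter():
        if local_name(element.tag) == "TextLine":
            words = [
                child.get("CONTENT", "")
                for child in element.iter()
                if local_name(child.tag) == "String"
            ]
            lines.append(" ".join(words))
    return lines


# Tiny hand-written ALTO fragment for illustration (not actual Kraken output).
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="General"/><SP/><String CONTENT="Assembly"/></TextLine>
    <TextLine><String CONTENT="Resolution"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""
print(alto_lines(sample))
```

To apply it to a file written by the command above, the file contents can be read as bytes and passed to `alto_lines`, since `ET.fromstring` accepts both `str` and `bytes`.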