# Transforming archives into data-driven analyses

[CHR 2023](https://2023.computational-humanities-research.org/) Workshop

Florian Cafiero, Jean-Luc Falcone and Simon Gabay

<img alt="Licence Creative Commons" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" align="right"/>

## Image Segmentation

We download a resolution (`A_RES_45_212-EN`) on the _Protection of global climate for present and future generations of mankind_ from [the Digital Library of the UN](https://digitallibrary.un.org/record/196769) and convert each page into an image.

In [None]:
# Download the PDF; the URL is quoted because it contains "?",
# and -O saves it directly under a simpler name
!wget -O resolution.pdf "https://digitallibrary.un.org/record/196769/files/A_RES_45_212-EN.pdf?ln=fr"
# Install the PDF-to-image converter and its poppler backend
!pip install pdf2image
!apt-get install -y poppler-utils
from pdf2image import convert_from_path
# Choose resolution
dpi = 500 # dots per inch
pdf_file = '/content/resolution.pdf'
pages = convert_from_path(pdf_file, dpi=dpi)
# Save each page as a JPEG
for i, page in enumerate(pages):
    page.save('output_{}.jpg'.format(i), 'JPEG')
# Move the images to a dedicated folder and remove the PDF
!mkdir -p /content/images
!mv output_*.jpg /content/images
!rm /content/resolution.pdf
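
The `dpi` value sets the pixel size of each page image: the page's width and height in inches are multiplied by the resolution. Here is a minimal sketch of that arithmetic, assuming a US Letter page (8.5 × 11 in). That page size is an illustrative assumption, not a measurement of this PDF, and `expected_pixels` is a hypothetical helper.

```python
def expected_pixels(width_in, height_in, dpi):
    """Return the (width, height) in pixels of a page rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)

# Assumed page size: US Letter (8.5 x 11 inches); the actual PDF may differ.
for dpi in (100, 300, 500):
    w, h = expected_pixels(8.5, 11, dpi)
    print(f"{dpi} dpi -> {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Higher resolutions give the segmentation and OCR steps more detail to work with, at the cost of larger files and slower processing.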

Let's have a look at this resolution now. Here is the first page:

In [None]:
from matplotlib import pyplot as plt
from matplotlib import image as mpimg
image = mpimg.imread("/content/images/output_0.jpg")
plt.figure(figsize=(30, 12), dpi=100)
plt.imshow(image)
plt.gca().axes.get_yaxis().set_visible(False)
plt.show()

To segment the image, we are going to use [YALTAi](https://github.com/PonteIneptique/YALTAi), developed by Thibault Clérice (more info: [arXiv.2207.11230](https://doi.org/10.48550/arXiv.2207.11230)).

In [None]:
!pip install YALTAi

Some models are already available. We are going to use a model for historical French prints (16th c.-18th c.) trained at the University of Geneva by Maxime Humeau.

In [None]:
# Download the model
!wget https://github.com/rayondemiel/Yolov8-Segmonto/releases/download/yolov8/remaining_goat_6779_best.pt
!mv remaining_goat_6779_best.pt seg_model.pt
# Load the model
from ultralytics import YOLO
model = YOLO("/content/seg_model.pt")
model.to('cuda')  # use the GPU (requires a GPU runtime)
model.info()  # print a summary of the model
model.fuse()  # fuse layers for faster inference

Let's apply it to the page images:

In [None]:
from PIL import Image
# Path to a single page image...
#img = "/content/images/output_0.jpg"
# ...or to the whole folder, to process every page
img = "/content/images/"
# Prediction
results = model(img)
# Plot the result
for r in results:
    im_array = r.plot(conf=True)  # plot a BGR numpy array of predictions
    im = Image.fromarray(im_array[..., ::-1])  # RGB PIL image
    plt.figure(figsize=(30, 12), dpi=100)
    plt.imshow(im)
    plt.gca().axes.get_yaxis().set_visible(False)
    plt.show()
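
The plots show each predicted zone with its confidence score. If the detections are needed as data rather than as a picture, a common next step is to discard low-confidence zones and order the rest from top to bottom. The sketch below illustrates that step on plain tuples. The zone labels, coordinates, scores and the 0.5 threshold are invented for illustration and are not output from the model; `Zone` and `reading_order` are hypothetical names.

```python
from typing import NamedTuple

class Zone(NamedTuple):
    label: str
    conf: float   # confidence score between 0 and 1
    x1: float     # left edge, in pixels
    y1: float     # top edge, in pixels (y grows downwards)
    x2: float     # right edge
    y2: float     # bottom edge

def reading_order(zones, min_conf=0.5):
    """Keep zones with conf >= min_conf, sorted top-to-bottom, then left-to-right."""
    kept = [z for z in zones if z.conf >= min_conf]
    return sorted(kept, key=lambda z: (z.y1, z.x1))

# Invented example values, not real predictions
zones = [
    Zone("MainZone", 0.91, 400, 900, 3800, 5000),
    Zone("NumberingZone", 0.35, 3900, 200, 4100, 300),
    Zone("TitlePageZone", 0.78, 500, 300, 3700, 800),
]
for z in reading_order(zones):
    print(z.label, z.conf)
```

Sorting on the top-left corner is a deliberately naive ordering: it suits single-column pages, but multi-column layouts would need a more careful rule.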

## Optical character recognition

Now we extract the text with Tesseract.

In [None]:
!sudo apt install -y tesseract-ocr
!pip install pytesseract
import pytesseract
import shutil
import os
import random

In [None]:
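image_path_in_colab = "/content/images/output_0.jpg"
# Open the page once (Image was imported from PIL in an earlier cell)
page_image = Image.open(image_path_in_colab)
# Plain text recognised on the page
extractedInformation = pytesseract.image_to_string(page_image)
print(extractedInformation)
# One line per recognised character: character, box coordinates, page number
print(pytesseract.image_to_boxes(page_image))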
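
`image_to_boxes` returns plain text in Tesseract's box format: one line per character, giving the character, its left, bottom, right and top coordinates, and the page number, with the origin at the bottom-left corner of the image. Below is a minimal sketch of turning that text into structured records and flipping the y axis to the top-left origin used by most image libraries. The sample string and image height are invented, and `parse_boxes` is a hypothetical helper.

```python
def parse_boxes(box_text, image_height):
    """Parse Tesseract box-format text into dicts with a top-left origin."""
    records = []
    for line in box_text.splitlines():
        # Split from the right so the five numeric fields are always isolated
        parts = line.rsplit(" ", 5)
        if len(parts) != 6:
            continue  # skip empty or malformed lines
        char, left, bottom, right, top, page = parts
        records.append({
            "char": char,
            "left": int(left),
            "right": int(right),
            # Box format measures y from the bottom edge; flip to a top-left origin
            "top": image_height - int(top),
            "bottom": image_height - int(bottom),
            "page": int(page),
        })
    return records

# Invented sample in box format, not real OCR output
sample = "U 100 900 130 940 0\nN 135 900 170 940 0\n"
for record in parse_boxes(sample, image_height=1000):
    print(record)
```

With a real page, `image_height` would come from the opened image, for example the second value of PIL's `Image.size`.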


In [None]:
!pip install kraken

In [None]:
!yaltai kraken --device cuda:0 --device cpu -I "/content/images/output_0.jpg" --suffix ".xml" segment --yolo /content/seg_model.pt
!mkdir segmented
!mv /content/images/*.xml segmented
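
A quick way to check what the segmenter wrote is to count the elements in an XML file by tag name. The sketch below assumes the output follows an ALTO-like schema. This is an assumption to verify against the files in `segmented`, and the sample document is invented and heavily simplified. `count_tags` is a hypothetical helper and works on any well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def count_tags(xml_text):
    """Count elements by tag name, ignoring any XML namespace prefix."""
    root = ET.fromstring(xml_text)
    # Namespaced tags look like "{namespace}Name"; keep only "Name"
    return Counter(el.tag.split("}")[-1] for el in root.iter())

# Invented, simplified ALTO-like sample; real files will be richer
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1"><TextLine ID="l1"/><TextLine ID="l2"/></TextBlock>
    <TextBlock ID="b2"><TextLine ID="l3"/></TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""
print(count_tags(sample))
```

For a file on disk, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.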

In [None]:
!pip install torch==2.1.0