# Collecting Data for Training the Model

In this notebook, we download images of pages, with or without illustrations, which will later be annotated manually and used to train the Detectron2 model.

In [57]:
# Basic imports
import requests
import pandas as pd
import os
from tqdm import tqdm

We want training images that share the overall structure of the future test images, without being those exact images. Recall that our test images will be the data obtained in the TODO notebook.

With this goal in mind, we draw our training data from `data/DFKV_Master.csv`, retaining only the documents that:
- Have a Gallica IIIF link (because these are easy to download)
- Are not present in the test data, which is the Gallica subset from [this notebook](https://github.com/dfk-paris/DFKV-illustrations/blob/main/2_gallica_subset/Gallica_subset.ipynb)

We begin by gathering this dataset in a dataframe.

In [58]:
# Master Dataset
master_df = pd.read_csv('data/DFKV_Master.csv')
# Drop rows with IIIF link unknown
master_df = master_df.dropna(subset=['liens iiif'])
# Only keep Gallica IIIF links
master_df = master_df[master_df['liens iiif'].str.contains("https://gallica.bnf.fr/iiif/", regex=False)] # regex=False: match the prefix literally
master_df.sample()

| | ID | Volume_ID | _journal-id | liens iiif | liens de citation (page) | liens de citation (volume) | bibliographie |
|---|---|---|---|---|---|---|---|
| 5795 | 14477 | 8071 | 1602.0 | https://gallica.bnf.fr/iiif/ark:/12148/bpt6k43... | https://gallica.bnf.fr/ark:/12148/bpt6k431799g... | | p. 457-468 |


In [59]:
# Gallica test subset 
gallica_df = pd.read_csv('data/DFKV_gallica_subset.csv')
# Gallica data that is not in the test subset (test IDs absent from master_df are ignored)
not_test_gallica_df = master_df.set_index('ID').drop(list(gallica_df['ID']), errors='ignore')
not_test_gallica_df.sample()

| ID | Volume_ID | _journal-id | liens iiif | liens de citation (page) | liens de citation (volume) | bibliographie |
|---|---|---|---|---|---|---|
| 15708 | 8378 | 1479.0 | https://gallica.bnf.fr/iiif/ark:/12148/bpt6k58... | https://gallica.bnf.fr/ark:/12148/bpt6k5864957... | | Tome II, oct. 1905-mars 1906, p. 131-132 |


In the `liens iiif` column, each link points to the canvas of the document. We need a link that points directly to the image, so that it is easy to download. In the following cells, we rewrite the URLs accordingly.

In [60]:
PREFIX_URL = 'https://gallica.bnf.fr/iiif/ark:/'
SUFFIX_URL_IMAGE = '/full/pct:50/0/native.jpg' # pct:50: download the images at 50% size to save storage space

# Build the direct image link from the canvas link
def modify_url(link):
    try:
        parts = link.split('/')
        # parts[5] and parts[6] form the ark identifier, parts[8] is the page segment (e.g. 'f398')
        return PREFIX_URL + parts[5] + '/' + parts[6] + '/' + parts[8] + SUFFIX_URL_IMAGE
    except (AttributeError, IndexError):
        return '' # when the URL does not have the expected form, ignore it (this only happens twice)
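
The same transformation can be written with an explicit pattern, which makes the expected link shape visible. This is a minimal sketch, not the notebook's method: it assumes canvas links look like `.../iiif/ark:/<naan>/<doc>/canvas/f<page>` (inferred from the split indices used above), and `canvas_to_image_url`, `CANVAS_RE`, and the example canvas link are hypothetical.

```python
import re

# Assumed canvas-link shape: .../iiif/ark:/<naan>/<doc>/canvas/f<page>
CANVAS_RE = re.compile(
    r"^https://gallica\.bnf\.fr/iiif/ark:/(?P<naan>\d+)/(?P<doc>[^/]+)/canvas/(?P<page>f\d+)$"
)

def canvas_to_image_url(canvas_link, scale_pct=50):
    """Return the direct image URL for a canvas link, or None if it does not match."""
    match = CANVAS_RE.match(canvas_link)
    if match is None:
        return None
    return (
        "https://gallica.bnf.fr/iiif/ark:/"
        f"{match['naan']}/{match['doc']}/{match['page']}"
        f"/full/pct:{scale_pct}/0/native.jpg"
    )

# Hypothetical canvas link, built to match the assumed shape
example = canvas_to_image_url(
    "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k57228330/canvas/f398"
)
print(example)
```

Returning `None` for links that do not match plays the same role as the empty string in `modify_url`: such rows can be dropped afterwards.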

In [61]:
# Modifying the urls
not_test_gallica_df['link_image'] = [modify_url(link) for link in not_test_gallica_df['liens iiif']]
# Dropping the rows that failed
not_test_gallica_df = not_test_gallica_df.drop(not_test_gallica_df[not_test_gallica_df['link_image']==''].index)

In [62]:
print('Example of image url : ')
list(not_test_gallica_df.sample()['link_image'])

Example of image url : 


['https://gallica.bnf.fr/iiif/ark:/12148/bpt6k57228330/f398/full/pct:50/0/native.jpg']

Let's actually download the images into the `data/training_images` folder. Each image is named `DFKV_<DOC_ID>_<PAGE_NUMBER>.jpg`.

In [44]:
# Get the image from the link and save it under the right name in the right folder
def download_image(link, doc_id, page):
    response = requests.get(link, timeout=60) # Request the image, giving up if the server stops responding
    if response.status_code == 200:
        # If the request is successful, save the file
        im_path = "./data/training_images/DFKV_" + str(doc_id) + "_" + str(page) + ".jpg"
        with open(im_path, "wb") as file:
            file.write(response.content)
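
Because the full download takes many hours, it can help to skip files that are already on disk when a run is restarted. The sketch below only illustrates the naming scheme `DFKV_<DOC_ID>_<PAGE_NUMBER>.jpg` and such a check. `image_path` and `needs_download` are hypothetical helpers, not part of the notebook, and the document ID and page number are illustrative values.

```python
from pathlib import Path

def image_path(doc_id, page, folder="data/training_images"):
    """Build the target path DFKV_<DOC_ID>_<PAGE_NUMBER>.jpg inside `folder`."""
    return Path(folder) / f"DFKV_{doc_id}_{page}.jpg"

def needs_download(path):
    """Return True when the file is missing or empty, i.e. still has to be fetched."""
    return not (path.exists() and path.stat().st_size > 0)

# Illustrative values only
target = image_path(15708, 398)
print(target.name)
```

A download function could call `needs_download(target)` before issuing the request, and create the folder beforehand with `target.parent.mkdir(parents=True, exist_ok=True)`.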

In [48]:
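# Iterate over all the documents to download the images
for doc in tqdm(not_test_gallica_df.iterrows()):
    try:
        page_segment = doc[1]['link_image'].split('/')[7] # e.g. 'f398'
        page = int(page_segment[1:])
        doc_id = doc[0]

        # For each document, we also take the page before and the page after the linked one,
        # in the hope of getting more images with illustrations in them
        for i in range(page-1, page+2):
            # Change only the page segment: replacing the bare number would also alter
            # the same digits elsewhere in the URL (e.g. in the identifier or in 'pct:50')
            link = doc[1]['link_image'].replace('/' + page_segment + '/', '/f' + str(i) + '/')
            download_image(link, doc_id, i)
    except (ValueError, IndexError, OSError): # malformed link, or failed request / file write
        continue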

752it [17:00:27, 81.42s/it]   
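
The page window (the linked page plus one page on each side) can also be isolated into a small helper. This is a minimal sketch, assuming image links have the shape shown in the example above (`.../f<page>/full/...`). `neighbour_page_urls` is a hypothetical name, the example link uses a made-up page number, and skipping pages below 1 assumes that page numbering starts at `f1`. The helper rewrites only the `/f<page>/` segment, since the same digits can also occur in the document identifier or in `pct:50`.

```python
import re

# Assumed image-link shape: .../<doc>/f<page>/full/...
PAGE_SEGMENT_RE = re.compile(r"/f(\d+)/")

def neighbour_page_urls(image_link, before=1, after=1):
    """Return (page, url) pairs for the linked page and its neighbours."""
    match = PAGE_SEGMENT_RE.search(image_link)
    if match is None:
        return []
    page = int(match.group(1))
    urls = []
    for i in range(page - before, page + after + 1):
        if i < 1:
            continue  # assumption: page numbering starts at f1
        # Replace only the first /f<page>/ segment, leaving the rest of the URL intact
        urls.append((i, PAGE_SEGMENT_RE.sub(f"/f{i}/", image_link, count=1)))
    return urls

# Hypothetical link whose page number (5) also appears elsewhere in the URL
link = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k57228330/f5/full/pct:50/0/native.jpg"
for page, url in neighbour_page_urls(link):
    print(page, url)
```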


All done! The images are in `data/training_images`.