# Subset Data with Images from Gallica

In this notebook we use the available tables to find all the documents that have a IIIF manifest at Gallica and that contain images of interest.

To know which images are interesting, we use the manual annotations gathered in `data/DFKV_id_illustration.csv`. All the IDs present in this dataframe have been marked as containing images.

We will then filter the database (`data/DFKV_Master.csv`) to keep only these documents, and also only those that have a Gallica IIIF link.

Let's load the data.

In [1]:
# Basic imports
import requests
import pandas as pd
import os
from tqdm import tqdm

In [2]:
docs_illus_df = pd.read_csv('data/DFKV_id_illustration.csv')

In [3]:
docs_illus_df.head()

Unnamed: 0,id,PW_bemerkung_extern,PW_Abbildung,project_id
0,10314,,-1,2
1,10327,,-1,2
2,10331,,-1,2
3,10332,,-1,2
4,10340,,-1,2


In [4]:
master_df = pd.read_csv('data/DFKV_Master.csv')

In [5]:
master_df.head()

Unnamed: 0,ID,Volume_ID,_journal-id,liens iiif,liens de citation (page),liens de citation (volume),bibliographie
0,15573,8640,1411.0,,,https://gallica.bnf.fr/ark:/12148/bpt6k7522165...,supplément
1,14385,8640,1518.0,,x,,
2,14389,8641,1568.0,,,https://gallica.bnf.fr/ark:/12148/bpt6k360915?...,
3,14390,8642,1568.0,,,https://gallica.bnf.fr/ark:/12148/bpt6k36087x?...,
4,14394,8643,1568.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k36...,https://gallica.bnf.fr/ark:/12148/bpt6k36008s/...,,


## Filtering 

We now filter the master dataframe to keep only the desired documents (the ones that contain illustrations and that have a Gallica IIIF link).

In [6]:
docs_illus_df.rename(columns={'id':'ID'}, inplace=True)

In [7]:
# Merging dataframes on ID, inner join
illus = pd.merge(docs_illus_df, master_df, on=["ID"])
illus = illus.drop(columns=['Volume_ID', '_journal-id', 'liens de citation (page)', 'liens de citation (volume)', 'PW_Abbildung', 'project_id'])
illus = illus.dropna(subset=['liens iiif'])
illus.sample(5)

Unnamed: 0,ID,PW_bemerkung_extern,liens iiif,bibliographie
439,11426,,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,"19.1927.12, S. 367-376"
1105,13376,,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,"64.1929, S. 204-207"
243,10976,,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,"44.1928/1929.8, S. 240-242"
982,13059,,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,"25.1909/10, S. 126-135"
801,12431,,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,"68.1931, S. 76-80"


In [8]:
gallica_iiif_df = illus[illus['liens iiif'].str.contains("https://gallica.bnf.fr/iiif")]
gallica_iiif_df.sample(3)

Unnamed: 0,ID,PW_bemerkung_extern,liens iiif,bibliographie
1345,15503,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k62...,p. 147
1390,15698,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,"XVI, oct. 1912-mars 1913, p. 130-133"
1327,15285,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,p. 311-324


In [9]:
print('Size of the Gallica with illustrations subset : ', len(gallica_iiif_df.index))

Size of the Gallica with illustrations subset :  210


## Working links for images

The dataframe contains the URLs of the canvases, which look like this:

https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4226263w/canvas/f76

We want to request images, so we need to rewrite each URL into something like:

https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4226263w/f76/full/full/0/native.jpg
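The rewrite from canvas URL to image URL can also be sketched as a small function that checks the shape of the URL before rebuilding it. This is only an illustration: `canvas_to_image_url` and its `size` parameter are hypothetical names, not part of the notebook.

```python
from urllib.parse import urlparse

def canvas_to_image_url(canvas_url, size="full"):
    """Turn a Gallica canvas URL into an image URL (hypothetical helper)."""
    parts = urlparse(canvas_url).path.strip("/").split("/")
    # Expected path: iiif / ark: / <authority number> / <ark name> / canvas / f<page>
    if len(parts) != 6 or parts[0] != "iiif" or parts[4] != "canvas":
        raise ValueError("not a Gallica canvas URL: " + canvas_url)
    authority, ark_name, folio = parts[2], parts[3], parts[5]
    return (
        "https://gallica.bnf.fr/iiif/ark:/"
        + authority + "/" + ark_name + "/" + folio
        + "/full/" + size + "/0/native.jpg"
    )

print(canvas_to_image_url("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4226263w/canvas/f76"))
```

Raising an error on an unexpected URL makes malformed links visible early instead of producing a broken image link.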

In [10]:
# Modifying the urls
PREFIX_URL = 'https://gallica.bnf.fr/iiif/ark:/'
SUFFIX_URL_IMAGE = '/full/full/0/native.jpg'
gallica_iiif_df['link_image'] = [PREFIX_URL + link.split('/')[5] + '/' + link.split('/')[6] + '/' + link.split('/')[8] + SUFFIX_URL_IMAGE for link in gallica_iiif_df['liens iiif']]
gallica_iiif_df.sample(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gallica_iiif_df['link_image'] = [PREFIX_URL + link.split('/')[5] + '/' + link.split('/')[6] + '/' + link.split('/')[8] + SUFFIX_URL_IMAGE for link in gallica_iiif_df['liens iiif']]


Unnamed: 0,ID,PW_bemerkung_extern,liens iiif,bibliographie,link_image
1266,14920,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k30...,"3e année, n° 46, 15.4.1918, p. 361-362",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k30...
1185,14324,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,p. 218-220,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...
1649,16435,"Article publié dans la rubrique ""Les Arts"".\n\...",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,1950,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...


## Length of documents

We might want the full document, not only one page of it, so we need to know how many pages it has. To find out, we look into the manifest.json of the document and take the length of its "canvases" list. This is an example of a manifest URL: https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9795256m/manifest.json
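To make the structure concrete, here is a minimal sketch that counts the canvases of a manifest that has already been parsed into a Python dict. The `toy_manifest` below is hand-made and reduced to the keys used here; it is not real Gallica data, and `count_canvases` is a hypothetical helper name.

```python
def count_canvases(manifest):
    """Number of page images listed in the first sequence of a manifest dict."""
    sequences = manifest.get("sequences") or []
    if not sequences:
        return 0  # no sequence at all: nothing to count
    return len(sequences[0].get("canvases", []))

# Hand-made stand-in for a parsed manifest.json (assumption, for illustration only)
toy_manifest = {
    "@id": "https://example.org/iiif/toy/manifest.json",
    "sequences": [{"canvases": [{"@id": "f1"}, {"@id": "f2"}, {"@id": "f3"}]}],
}

print(count_canvases(toy_manifest))  # 3
```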


### Length of one document

Let's try to make a request for an example manifest.

In [11]:
import requests

In [12]:
response = requests.get("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9795256m/manifest.json")

In [13]:
print(response.status_code)

200


The response status code is 200, which means that the request succeeded!

How many images does this specific document contain?

In [15]:
len(response.json()['sequences'][0]['canvases']) # number of images for the document

577

### On all documents

Let's repeat the operation for all the documents from Gallica, and add a new column to our dataframe giving the length of each document. We start by building the URLs of the manifests:

In [16]:
PREFIX_URL = 'https://gallica.bnf.fr/iiif/ark:/'
SUFFIX_URL_MANIFEST = '/manifest.json'
gallica_iiif_df['link_manifest'] = [PREFIX_URL + link.split('/')[5] + '/' + link.split('/')[6]  + SUFFIX_URL_MANIFEST for link in gallica_iiif_df['liens iiif']]
gallica_iiif_df.sample(3)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gallica_iiif_df['link_manifest'] = [PREFIX_URL + link.split('/')[5] + '/' + link.split('/')[6]  + SUFFIX_URL_MANIFEST for link in gallica_iiif_df['liens iiif']]


Unnamed: 0,ID,PW_bemerkung_extern,liens iiif,bibliographie,link_image,link_manifest
1234,14549,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,p. 201 - 217,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...
1383,15678,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,"XV, avril-sept. 1912, p. 17-32",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...
1203,14375,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,"n° 1, p. 34-38 [Texte : p. 35]",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...


In [17]:
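def get_length_document(request_url):
    """Return the number of canvases (page images) listed in a manifest, or -1 if the request fails."""
    response = requests.get(request_url)
    if response.status_code != 200:
        return -1
    return len(response.json()['sequences'][0]['canvases'])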

In [18]:
doc_len = [get_length_document(url) for url in gallica_iiif_df['link_manifest']]

In [19]:
gallica_iiif_df['length'] = doc_len

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gallica_iiif_df['length'] = doc_len


Let's drop the `liens iiif` column, as we won't need it anymore, and save the dataframe as a CSV file.

In [20]:
gallica_iiif_df = gallica_iiif_df.drop(columns=['liens iiif'])

In [21]:
gallica_iiif_df.head()

Unnamed: 0,ID,PW_bemerkung_extern,bibliographie,link_image,link_manifest,length
1160,14254,,"9e année, 1934, n° 5-8, p. 178-184",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,312
1161,14256,,"9e année, 1934, n° 5-8, p. 193-196",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,312
1162,14267,,"n° 4, janvier 1930, s.p.",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,62
1163,14268,"Bemerkenswerter Text, als Kopie vorhanden","n° 6, mai 1930, p. 6-10",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,68
1164,14279,,p. 105-118,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,445


In [22]:
gallica_iiif_df.to_csv('data/DFKV_Gallica_subset.csv', index=False)

### Statistics on document length

In [12]:
link_docs_df = pd.read_csv('data/DFKV_Gallica_subset.csv')
link_docs_df

Unnamed: 0,ID,PW_bemerkung_extern,bibliographie,link_image,link_manifest,length
0,14254,,"9e année, 1934, n° 5-8, p. 178-184",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,312
1,14256,,"9e année, 1934, n° 5-8, p. 193-196",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,312
2,14267,,"n° 4, janvier 1930, s.p.",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,62
3,14268,"Bemerkenswerter Text, als Kopie vorhanden","n° 6, mai 1930, p. 6-10",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,68
4,14279,,p. 105-118,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,445
...,...,...,...,...,...,...
205,16382,"Illustrations : \n- Kandinsky, esquisse.",1950,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,8
206,16383,"Illustrations : \n- Kandinsky, tableau (détail).",1949,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,6
207,16432,"Article publié dans la rubrique ""Les Arts"".\n\...",1949,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,6
208,16435,"Article publié dans la rubrique ""Les Arts"".\n\...",1950,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,8


Some basic statistics:

In [13]:
link_docs_df['length'].describe()

count     210.000000
mean      431.019048
std       280.180512
min         4.000000
25%       300.000000
50%       456.000000
75%       627.000000
max      1235.000000
Name: length, dtype: float64

In [14]:
print('We would have a total number of images of : ', link_docs_df['length'].sum())

We would have a total number of images of :  90514


This is a lot of images. Several options could reduce the volume:

- Take N random pages from each book. This has the added advantage of not giving too much weight to a particular book (e.g. a book focusing on Van Gogh would contribute only illustrations of his works, which would then be over-represented in the dataset).
- Take a random subset of the whole dataset.
- Take the subset of documents that appear in only one of the three projects.
- Ask for images of lower quality, which is easy to do with IIIF. The risk is that, since we will later use segments of the images, their resolution will be even lower, but lowering it a bit should be fine.

UPDATE: we actually have notes for some of the documents, which indicate which pages interested the researchers. Our next task is therefore to find these specific pages.
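As a rough sketch of the "N random pages at lower quality" idea, the function below draws page positions at random and builds half-size image URLs from a manifest URL. It assumes that the images of a document are numbered `f1` to `f<length>`, which this notebook does not verify; `sample_page_urls` and its parameters are hypothetical names.

```python
import random

MANIFEST_SUFFIX = "/manifest.json"

def sample_page_urls(link_manifest, length, n_pages, pct=50, seed=0):
    """Build image URLs for n_pages random pages of one document."""
    if not link_manifest.endswith(MANIFEST_SUFFIX):
        raise ValueError("not a manifest URL: " + link_manifest)
    base = link_manifest[: -len(MANIFEST_SUFFIX)]
    rng = random.Random(seed)  # fixed seed so that the selection is reproducible
    n = min(n_pages, length)   # short documents are taken in full
    # Assumption: page images are numbered f1 ... f<length>
    folios = sorted(rng.sample(range(1, length + 1), n))
    return [f"{base}/f{folio}/full/pct:{pct}/0/native.jpg" for folio in folios]

urls = sample_page_urls(
    "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9795256m/manifest.json", 577, 3
)
for url in urls:
    print(url)
```

Sampling the same number of pages from every document keeps a very long volume from dominating the dataset.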

## Parsing the pages indications

The 'bibliographie' column actually indicates which pages are interesting. Our main problem is that no two indications are written the same way, so we need to process these strings to find page numbers that we can actually use. Let's make a few observations on them:

- If the document has fewer than 10 pages (and no page indication), we keep all of them.
- If there is an 'S.' or a 'p.' (upper or lower case), a page number follows.
- If a '-' comes after it, there is a first page and a last page.
- If 'p.' appears twice, we take the larger interval.
- Otherwise, if there is no indication at all (this happens only twice in the dataset), we just take the image from the link.
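The rules above, including the "larger interval" one, can be sketched as follows. This is one possible reading of the rules, not the function used in the rest of the notebook; `largest_page_range` is a hypothetical name.

```python
import re

# 'start-end' with optional spaces around the hyphen, e.g. '415 - 436'
PAGE_RANGE = re.compile(r"(\d+)\s*-\s*(\d+)")

def largest_page_range(text):
    """Return (first_page, last_page) found after a 'p.' or 'S.' marker, or None."""
    marker = re.search(r"[pPsS]\.", text)
    if marker is None:
        return None  # no page marker at all
    tail = text[marker.end():]
    ranges = [(int(a), int(b)) for a, b in PAGE_RANGE.findall(tail)]
    ranges = [r for r in ranges if r[1] >= r[0]]
    if ranges:
        # Several ranges (e.g. article and text): keep the widest one
        return max(ranges, key=lambda r: r[1] - r[0])
    single = re.search(r"\d+", tail)
    if single is None:
        return None
    page = int(single.group())
    return (page, page)  # a single page, e.g. 'n° 1, p. 35'

print(largest_page_range("n° 9, p. 368-371 [texte p. 369]"))  # (368, 371)
print(largest_page_range("p. 415 - 436"))                     # (415, 436)
print(largest_page_range("n° 1, p. 35"))                      # (35, 35)
print(largest_page_range("22.1907"))                          # None
```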

In [15]:
link_docs_df.sample(10)

Unnamed: 0,ID,PW_bemerkung_extern,bibliographie,link_image,link_manifest,length
155,15722,Copie de l'article dans les annexes de DEA de ...,p. 93-96,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,782
183,16014,,"n° 9, p. 368-371 [texte p. 369]",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,602
136,15505,,"Tome LXX, 1er semestre 1892, p. 38",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k62...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k62...,16
106,15055,,"5e année, n° 14, p. 314-317",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,24
84,14977,Hommage an frz. Malerei und an Hugo von Tschudi,"jan.-déc. 1926, p. 269-276 [texte p. 269-272, ...",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,628
123,15289,,p. 415 - 436,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,624
24,14323,,"n° 1, p. 35",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,602
129,15399,,"59.1931, p. 21-33",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k54...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k54...,250
25,14324,,p. 218-220,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,602
121,15280,,p.133-138,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,627


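Here is an example: the indication says 'p. 317-320', but the link points to 'f329', which is indeed page 317. The number in the link is the position of the image in the document, not the printed page number, so what we need to extract is the number of pages, not the page numbers themselves.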
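A short sketch of why the page count is enough: starting from the image position found in the link, we simply step forward once per page. The URL below is a made-up placeholder (the `EXAMPLE` identifier is not a real document) and `folios_for_article` is a hypothetical helper.

```python
def folios_for_article(link_image, n_pages):
    """List the image positions ('f329', 'f330', ...) covering n_pages from the linked one."""
    for segment in link_image.split("/"):
        # The position segment is an 'f' followed only by digits ('full' does not match)
        if segment[:1] == "f" and segment[1:].isdigit():
            start = int(segment[1:])
            return ["f" + str(start + k) for k in range(n_pages)]
    raise ValueError("no 'f<number>' segment in " + link_image)

# 'p. 317-320' covers 320 - 317 + 1 = 4 pages, starting at image f329
example_link = "https://gallica.bnf.fr/iiif/ark:/12148/EXAMPLE/f329/full/full/0/native.jpg"
print(folios_for_article(example_link, 320 - 317 + 1))  # ['f329', 'f330', 'f331', 'f332']
```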

In [16]:
import re

def extract_nb_pages(text):
    if pd.isna(text):
        return 1 # no page indication, we will only take the one from the link
    # extract anything after 'p.' or 'S.' (the dot is escaped so that it only matches a literal '.')
    after_page = re.findall(r"(?<=[pPsS]\.).*", text)
    if len(after_page) == 0:
        return 0 # means that we didn't find any page indication
    # first 'beginning-end' pair, with optional spaces around '-' (e.g. 'p. 415 - 436')
    page_range = re.search(r"(\d+)\s*-\s*(\d+)", after_page[0])
    if page_range:
        return int(page_range.group(2)) - int(page_range.group(1)) + 1
    return 1 # means that there is only one page (e.g. 'n° 25, p. 279')

In [17]:
link_docs_df['pages_to_extract'] = link_docs_df.apply(lambda row: extract_nb_pages(row['bibliographie']), axis=1)

In [18]:
link_docs_df['pages_to_extract'] = link_docs_df.apply(lambda row: row['length'] if row['length']<10 and row['pages_to_extract'] == 0 else row['pages_to_extract'], axis = 1)

Let's see which rows escaped our vigilance. We will enter their values manually, as there are just a few of them.

In [19]:
link_docs_df[link_docs_df['pages_to_extract']==0]

Unnamed: 0,ID,PW_bemerkung_extern,bibliographie,link_image,link_manifest,length,pages_to_extract
66,14728,,22.1907,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k57...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k57...,662,0
67,14728,,23.1908,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k57...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k57...,662,0
124,15300,,"Jg.5, Bd.16, 314-316",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k10...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k10...,427,0
125,15302,,"5.Jg., Bd.17, 81-92",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k10...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k10...,405,0
186,16031,,"n° 27, 1924-1925",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k10...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k10...,1137,0


In [20]:
link_docs_df.at[66,'pages_to_extract'] = 1
link_docs_df.at[67,'pages_to_extract'] = 1
link_docs_df.at[124,'pages_to_extract'] = 3
link_docs_df.at[125,'pages_to_extract'] = 2
link_docs_df.at[186,'pages_to_extract'] = 1

In [21]:
PREFIX_URL = 'https://gallica.bnf.fr/iiif/ark:/'
SUFFIX_URL_IMAGE = '/full/pct:50/0/native.jpg' # pct:50: we download the images at 50% of their size, to save storage space

# Build the reduced-size image link from the full-size image link stored in 'link_image'
def modify_url(link):
    try:
        parts = link.split('/')
        return PREFIX_URL + parts[5] + '/' + parts[6] + '/' + parts[7] + SUFFIX_URL_IMAGE
    except (AttributeError, IndexError):
        return '' # the URL does not have the expected form: just ignore it (it only happens twice)

## Downloading the images

We now know exactly which pages we want, so let's download them!

In [22]:
# Function to get the image from the link and save it with the right name at the right place
def download_image(link, doc_id, page):
    response = requests.get(link) # Request the image

    if response.status_code == 200:
        # If the request is successful, save the file
        im_path = "./data/test_images/DFKV_" + str(doc_id) + "_" + str(page) + ".jpg"
        os.makedirs(os.path.dirname(im_path), exist_ok=True) # create the folder if it does not exist yet
        with open(im_path, "wb") as file: # the with block closes the file for us
            file.write(response.content)

In [23]:
link_docs_df['link_image'] = link_docs_df.apply(lambda row: modify_url(row['link_image']), axis=1)

In [24]:
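# Iterate over all the documents to download the images
for _, doc in tqdm(link_docs_df.iterrows()):
    try:
        page = int(doc['link_image'].split('/')[7][1:]) # 'f76' -> 76
        doc_id = doc['ID']

        # For each document, go through all the desired pages
        # (pages_to_extract is a count, so the range stops at page + count - 1)
        for i in range(page, page + int(doc['pages_to_extract'])):
            # Only swap the '/f<page>/' segment: replacing the bare number could
            # also alter the ark identifier or 'pct:50'
            link = doc['link_image'].replace('/f' + str(page) + '/', '/f' + str(i) + '/')
            download_image(link, doc_id, i)
    except (IndexError, ValueError, requests.RequestException):
        continue # empty or malformed link, or failed request: skip this document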

210it [21:02,  6.01s/it]
