# Subset Data with Images from Gallica

In this notebook, we use the available tables to find all the documents that have an IIIF manifest on Gallica and that contain images of interest.

To know which documents contain interesting images, we use the manual annotations gathered in `data/DFKV_id_illustration.csv`. All the IDs present in that dataframe have been marked as containing images.

We then filter the database (`data/DFKV_Master.csv`) to keep only these documents, and among them only those that have a Gallica IIIF link.

Let's load the data.

In [1]:
import pandas as pd

In [2]:
docs_illus_df = pd.read_csv('data/DFKV_id_illustration.csv')

In [3]:
docs_illus_df.head()

Unnamed: 0,id,PW_bemerkung_extern,PW_Abbildung,project_id
0,10314,,-1,2
1,10327,,-1,2
2,10331,,-1,2
3,10332,,-1,2
4,10340,,-1,2


In [4]:
master_df = pd.read_csv('data/DFKV_Master.csv')

In [5]:
master_df.head()

Unnamed: 0,ID,Volume_ID,_journal-id,liens iiif,liens de citation (page),liens de citation (volume),bibliographie
0,15573,8640,1411.0,,,https://gallica.bnf.fr/ark:/12148/bpt6k7522165...,supplément
1,14385,8640,1518.0,,x,,
2,14389,8641,1568.0,,,https://gallica.bnf.fr/ark:/12148/bpt6k360915?...,
3,14390,8642,1568.0,,,https://gallica.bnf.fr/ark:/12148/bpt6k36087x?...,
4,14394,8643,1568.0,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k36...,https://gallica.bnf.fr/ark:/12148/bpt6k36008s/...,,


## Filtering 

We now filter the master dataframe to keep only the desired documents: those that contain illustrations and have a Gallica IIIF link.

In [6]:
docs_illus_df.rename(columns={'id':'ID'}, inplace=True)

In [7]:
# Merging dataframes on ID, inner join
illus = pd.merge(docs_illus_df, master_df, on=["ID"])
illus = illus.drop(columns=['Volume_ID', '_journal-id', 'liens de citation (page)', 'liens de citation (volume)', 'PW_Abbildung', 'project_id'])
illus = illus.dropna(subset=['liens iiif'])
illus.sample(5)

Unnamed: 0,ID,PW_bemerkung_extern,liens iiif,bibliographie
452,11447,,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,"22.1930.4, S. 105-109"
1283,14981,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,"jan.-déc. 1927, p. 367-372"
972,13045,,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,"7.1900/1901, S. 90-96"
370,11322,,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,"12.1920.8, S. 315-317"
389,11358,,https://digi.ub.uni-heidelberg.de/diglit/iiif/...,"17.1925.13, S. 648-652"


In [8]:
# regex=False matches the prefix literally (otherwise '.' would match any character)
gallica_iiif_df = illus[illus['liens iiif'].str.contains("https://gallica.bnf.fr/iiif", regex=False)]
gallica_iiif_df.sample(3)

Unnamed: 0,ID,PW_bemerkung_extern,liens iiif,bibliographie
1401,15734,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,"n° 13, p. 50"
1326,15280,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,p.133-138
1186,14326,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k42...,"n° 11, p. 462-463 [texte p. 463]"


In [9]:
print('Size of the Gallica with illustrations subset : ', len(gallica_iiif_df.index))

Size of the Gallica with illustrations subset :  210


## Working links for images

The dataframe contains the canvas URLs, which look like this:

https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4226263w/canvas/f76

We want to request images, so we need to turn each URL into something like:

https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4226263w/f76/full/full/0/native.jpg

In [10]:
# Build the image URLs from the canvas URLs
PREFIX_URL = 'https://gallica.bnf.fr/iiif/ark:/'
SUFFIX_URL_IMAGE = '/full/full/0/native.jpg'
# After split('/'): [5] is '12148', [6] the document identifier, [8] the page (e.g. 'f76')
link_image = [PREFIX_URL + link.split('/')[5] + '/' + link.split('/')[6] + '/' + link.split('/')[8] + SUFFIX_URL_IMAGE for link in gallica_iiif_df['liens iiif']]
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
gallica_iiif_df = gallica_iiif_df.assign(link_image=link_image)
gallica_iiif_df.sample(3)


Unnamed: 0,ID,PW_bemerkung_extern,liens iiif,bibliographie,link_image
1233,14548,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,p.105 -124,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...
1167,14286,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,"3e année, 1928, p. 228-230",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...
1397,15717,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k58...,"Tome IV ; Octobre 1906-mars 1907, p. 443-446",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k58...
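
The index-based `split('/')` used above can produce a wrong URL, or an opaque `IndexError`, if a link does not have the expected shape. As a hedged alternative, the sketch below uses a regular expression and fails loudly on unexpected input. `CANVAS_RE` and `canvas_to_image_url` are hypothetical names, and the pattern assumes that every canvas URL looks like the example shown earlier.

```python
import re

# Assumed shape of a Gallica canvas URL, e.g.
# https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4226263w/canvas/f76
CANVAS_RE = re.compile(
    r"^https://gallica\.bnf\.fr/iiif/ark:/(?P<naan>[^/]+)/(?P<doc>[^/]+)/canvas/(?P<page>f\d+)$"
)


def canvas_to_image_url(canvas_url, size="full"):
    """Turn a Gallica canvas URL into an image URL (hypothetical helper)."""
    match = CANVAS_RE.match(canvas_url.strip())
    if match is None:
        raise ValueError(f"Unexpected canvas URL: {canvas_url!r}")
    return (
        "https://gallica.bnf.fr/iiif/ark:/"
        f"{match['naan']}/{match['doc']}/{match['page']}/full/{size}/0/native.jpg"
    )


print(canvas_to_image_url("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4226263w/canvas/f76"))
```

The `size` argument is passed straight into the URL, so `size="pct:50"` would request the image at half its original size.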


## Length of documents

We may want the full document rather than a single page, in which case we need to know how many pages it has. To find out, we look into the document's manifest.json and take the length of its "canvases" list. Here is an example of a manifest URL: https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9795256m/manifest.json


### Length of one document

Let's try to make a request for an example manifest.

In [11]:
import requests

In [12]:
response = requests.get("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9795256m/manifest.json")

In [13]:
print(response.status_code)

200


The response status code is 200, which means that the request succeeded! Let's have a look at the response.

In [14]:
response.json()

{'@id': 'https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9795256m/manifest.json',
 'label': 'BnF, département Sciences et techniques, FOL-V-5953',
 'attribution': 'Bibliothèque nationale de France',
 'license': 'https://gallica.bnf.fr/html/und/conditions-dutilisation-des-contenus-de-gallica',
 'logo': 'https://gallica.bnf.fr/mbImage/logos/logo-bnf.png',
 'related': 'https://gallica.bnf.fr/ark:/12148/bpt6k9795256m',
 'seeAlso': ['http://oai.bnf.fr/oai2/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:bnf.fr:gallica/ark:/12148/bpt6k9795256m'],
 'description': "Cahiers d'art : bulletin mensuel d'actualité artistique / [directeur Christian Zervos]",
 'metadata': [{'label': 'Repository',
   'value': 'Bibliothèque nationale de France'},
  {'label': 'Digitised by', 'value': 'Bibliothèque nationale de France'},
  {'label': 'Source Images',
   'value': 'https://gallica.bnf.fr/ark:/12148/bpt6k9795256m'},
  {'label': 'Metadata Source',
   'value': 'http://oai.bnf.fr/oai2/OAIHandler?verb

How many images does this specific document contain?

In [15]:
len(response.json()['sequences'][0]['canvases']) # number of images for the document

577

### On all documents

Let's repeat the operation for all the Gallica documents, and add a new column to our dataframe giving the length of each document. We start by building the manifest URLs:

In [16]:
PREFIX_URL = 'https://gallica.bnf.fr/iiif/ark:/'
SUFFIX_URL_MANIFEST = '/manifest.json'
# After split('/'): [5] is '12148' and [6] is the document identifier
link_manifest = [PREFIX_URL + link.split('/')[5] + '/' + link.split('/')[6] + SUFFIX_URL_MANIFEST for link in gallica_iiif_df['liens iiif']]
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
gallica_iiif_df = gallica_iiif_df.assign(link_manifest=link_manifest)
gallica_iiif_df.sample(3)


Unnamed: 0,ID,PW_bemerkung_extern,liens iiif,bibliographie,link_image,link_manifest
1234,14549,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,p. 201 - 217,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...
1383,15678,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,"XV, avril-sept. 1912, p. 17-32",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...
1203,14375,,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,"n° 1, p. 34-38 [Texte : p. 35]",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...


In [17]:
def get_length_document(request_url):
    """Return the number of canvases (images) in a IIIF manifest, or -1 if the request fails."""
    response = requests.get(request_url)
    if response.status_code != 200:
        return -1
    return len(response.json()['sequences'][0]['canvases'])

In [18]:
doc_len = [get_length_document(url) for url in gallica_iiif_df['link_manifest']]
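
When several articles come from the same volume, they share a manifest, so the list comprehension above requests the same URL more than once. The sketch below, offered as one possible variant rather than the notebook's method, fetches each distinct URL a single time. `count_canvases`, `lengths_for` and the injected `fetch_manifest` callable are hypothetical names; the fake fetcher only stands in for the network call so that the example runs offline.

```python
def count_canvases(manifest):
    """Number of canvases in the first sequence of a IIIF manifest dict."""
    sequences = manifest.get("sequences") or [{}]
    return len(sequences[0].get("canvases", []))


def lengths_for(urls, fetch_manifest):
    """Return one length per URL, calling fetch_manifest once per distinct URL.

    fetch_manifest is supplied by the caller (an assumption of this sketch)
    and returns the parsed manifest dict, or None if the request failed.
    """
    cache = {}
    for url in urls:
        if url not in cache:
            manifest = fetch_manifest(url)
            cache[url] = count_canvases(manifest) if manifest is not None else -1
    return [cache[url] for url in urls]


# Offline demonstration with made-up manifests
fake_manifests = {
    "vol-a": {"sequences": [{"canvases": [{}, {}, {}]}]},
    "vol-b": {"sequences": [{"canvases": [{}]}]},
}
calls = []


def fake_fetch(url):
    calls.append(url)
    return fake_manifests.get(url)  # None simulates a failed request


print(lengths_for(["vol-a", "vol-a", "vol-b", "missing"], fake_fetch))  # [3, 3, 1, -1]
print(len(calls))  # 3
```

In the notebook, `fetch_manifest` could wrap `requests.get(url, timeout=30)` and return `response.json()` when the status code is 200, and `None` otherwise.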

In [19]:
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
gallica_iiif_df = gallica_iiif_df.assign(length=doc_len)


Let's drop the `liens iiif` column, as we won't need it anymore, and save the dataframe as a CSV file.

In [20]:
gallica_iiif_df = gallica_iiif_df.drop(columns=['liens iiif'])

In [21]:
gallica_iiif_df.head()

Unnamed: 0,ID,PW_bemerkung_extern,bibliographie,link_image,link_manifest,length
1160,14254,,"9e année, 1934, n° 5-8, p. 178-184",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,312
1161,14256,,"9e année, 1934, n° 5-8, p. 193-196",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,312
1162,14267,,"n° 4, janvier 1930, s.p.",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,62
1163,14268,"Bemerkenswerter Text, als Kopie vorhanden","n° 6, mai 1930, p. 6-10",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,68
1164,14279,,p. 105-118,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,445


In [22]:
gallica_iiif_df.to_csv('data/DFKV_Gallica_subset.csv', index=False)

### Statistics on document length

In [23]:
link_docs_df = pd.read_csv('data/DFKV_Gallica_subset.csv')
link_docs_df

Unnamed: 0,ID,PW_bemerkung_extern,bibliographie,link_image,link_manifest,length
0,14254,,"9e année, 1934, n° 5-8, p. 178-184",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,312
1,14256,,"9e année, 1934, n° 5-8, p. 193-196",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k97...,312
2,14267,,"n° 4, janvier 1930, s.p.",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,62
3,14268,"Bemerkenswerter Text, als Kopie vorhanden","n° 6, mai 1930, p. 6-10",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,68
4,14279,,p. 105-118,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k61...,445
...,...,...,...,...,...,...
205,16382,"Illustrations : \n- Kandinsky, esquisse.",1950,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,8
206,16383,"Illustrations : \n- Kandinsky, tableau (détail).",1949,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,6
207,16432,"Article publié dans la rubrique ""Les Arts"".\n\...",1949,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,6
208,16435,"Article publié dans la rubrique ""Les Arts"".\n\...",1950,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k47...,8


Some basic statistics:

In [24]:
link_docs_df['length'].describe()

count     210.000000
mean      431.019048
std       280.180512
min         4.000000
25%       300.000000
50%       456.000000
75%       627.000000
max      1235.000000
Name: length, dtype: float64

In [25]:
print('We would have a total number of images of : ', link_docs_df['length'].sum())

We would have a total number of images of :  90514


This is a lot of images. There are several ways to reduce the number:

- Take N random pages from each document. This also avoids giving too much weight to a single document (for example, a book devoted to Van Gogh would contribute only illustrations of his works, which would then be over-represented in the dataset).
- Take a random subset of the whole dataset.
- Keep only the documents that only appear in one of the three projects.
- Request images at a lower quality, which is easy to do with IIIF. The risk is that, since we will later work on segments of the images, their resolution will be even lower, but lowering it a little should be fine.

UPDATE: we actually have notes for some of the documents, which indicate the pages that interested the researchers. Our next task is therefore to find these specific pages.
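
Taking N random pages per document and requesting a lower resolution can be combined in a few lines. The sketch below is illustrative only: `sample_page_urls` is a hypothetical helper, and it assumes that the images of a document are numbered consecutively from `f1` to `f<length>`, which the notebook does not verify.

```python
import random


def sample_page_urls(manifest_url, length, n_pages, size="pct:50", seed=0):
    """Build image URLs for up to n_pages randomly chosen pages of one document."""
    # Strip the manifest suffix to recover the document's base URL
    base = manifest_url.rsplit("/manifest.json", 1)[0]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    # A length of -1 (failed manifest request) yields an empty sample
    k = max(0, min(n_pages, length))
    pages = sorted(rng.sample(range(1, length + 1), k))
    return [f"{base}/f{page}/full/{size}/0/native.jpg" for page in pages]


urls = sample_page_urls(
    "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k9795256m/manifest.json",
    length=577,
    n_pages=5,
)
for url in urls:
    print(url)
```

The `size` value goes into the size segment of the image URL, so `"pct:50"` asks for half the original dimensions and `"full"` for the original ones.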

## Parsing the pages indications

- If the document has fewer than 10 pages (and no page indication), we keep all of them.
- If the entry contains 'S.' or 'p.' (upper or lower case), it is followed by a page number.
- If a '-' follows that number, there is a first page and a last page.
- If 'p.' appears twice, we take the larger interval.
- Otherwise, if there is no indication (this happens only twice in the dataset), we just take the image from the link.
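
The rules above can be sketched with a regular expression. This is a minimal illustration, not a validated parser: `PAGE_RE` and `parse_page_interval` are hypothetical names, "larger interval" is interpreted as the interval covering the most pages, and the rule about short documents is left to the caller because it depends on the `length` column.

```python
import re

# 'p.' or 'S.' (any case), not preceded by a letter or digit, followed by a
# page number and optionally by '-' (or an en dash) and a second page number
PAGE_RE = re.compile(r"(?<!\w)[ps]\.\s*(\d+)(?:\s*[-–]\s*(\d+))?", re.IGNORECASE)


def parse_page_interval(bibliography):
    """Return (first_page, last_page), or None if there is no page indication."""
    if not isinstance(bibliography, str):
        return None  # e.g. NaN for an empty cell
    intervals = []
    for match in PAGE_RE.finditer(bibliography):
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) else start
        # Guard against an end page smaller than the start page
        intervals.append((start, max(start, end)))
    if not intervals:
        return None
    # With several indications, keep the one spanning the most pages
    return max(intervals, key=lambda interval: interval[1] - interval[0])


for entry in [
    "22.1930.4, S. 105-109",
    "p.105 -124",
    "n° 13, p. 50",
    "p. 181-198 ; p. 295-310",
    "n° 11, p. 462-463 [texte p. 463]",
    "n° 4, janvier 1930, s.p.",
]:
    print(entry, "->", parse_page_interval(entry))
```

Entries such as 's.p.' or a bare year return `None`, which corresponds to the last rule: fall back to the image from the link.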

In [61]:
link_docs_df.loc[50:100]

Unnamed: 0,ID,PW_bemerkung_extern,bibliographie,link_image,link_manifest,length
50,14473,,"t. 16, p. 489-508",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,612
51,14474,,"t.15, p.80-95, 2e article",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,653
52,14480,,"t. 37, p. 418-435, 3e article",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,631
53,14485,,"t.6, p. 405-422.",https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,553
54,14498,,p. 147-156,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,1164
55,14548,,p.105 -124,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,658
56,14549,,p. 201 - 217,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,684
57,14550,,p. 512-522,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,637
58,14551,,p. 129-138,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k20...,614
59,14552,,p. 181-198 ; p. 295-310,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k43...,https://gallica.bnf.fr/iiif/ark:/12148/bpt6k43...,558


In [55]:
list(link_docs_df[link_docs_df['ID']==14728]['link_image'])

['https://gallica.bnf.fr/iiif/ark:/12148/bpt6k57805404/f506/full/full/0/native.jpg',
 'https://gallica.bnf.fr/iiif/ark:/12148/bpt6k57805404/f506/full/full/0/native.jpg']

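Here is an example: the bibliography says 'p. 317-320', but the link contains 'f329', which does display page 317. The image index in the URL is therefore not the printed page number, so what we need is the number of pages in the interval, counted from the linked image, rather than the page numbers themselves.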
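
One way to act on this observation is to count forward from the linked image. The sketch below assumes that the link points to the first page of the interval and that each printed page corresponds to exactly one consecutive image; neither assumption is checked in the notebook. `image_urls_for_interval` is a hypothetical helper and `EXAMPLE` is a placeholder, not a real document identifier.

```python
import re


def image_urls_for_interval(link_image, first_page, last_page, length=None):
    """List the image URLs covering a page interval, starting at the linked image."""
    match = re.search(r"/f(\d+)/", link_image)
    if match is None:
        raise ValueError(f"No image index found in {link_image!r}")
    first_image = int(match.group(1))
    # Number of pages in the interval, e.g. 317-320 covers 4 pages
    n_pages = max(1, last_page - first_page + 1)
    last_image = first_image + n_pages - 1
    if length is not None and length > 0:
        last_image = min(last_image, length)  # stay inside the document
    prefix, suffix = link_image[:match.start()], link_image[match.end():]
    return [f"{prefix}/f{i}/{suffix}" for i in range(first_image, last_image + 1)]


for url in image_urls_for_interval(
    "https://gallica.bnf.fr/iiif/ark:/12148/EXAMPLE/f329/full/full/0/native.jpg",
    first_page=317,
    last_page=320,
):
    print(url)
```

For the interval 317-320 this yields the images f329 to f332. Unnumbered plates or inserts inside the interval would break the one-image-per-page assumption, so the result is best checked on a few documents by hand.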

## Downloading the images

In [36]:
response = requests.get("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4226263w/f76/full/full/0/native.jpg")
# Save the raw bytes; the with block closes the file automatically
with open("sample_image.jpg", "wb") as file:
    file.write(response.content)

In [6]:
# Test with a lower quality: 'pct:50' requests the image at 50% of its original size
response = requests.get("https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4226263w/f76/full/pct:50/0/native.jpg")
with open("sample_image.jpg", "wb") as file:
    file.write(response.content)