**This Python 2 notebook extracts images of a Gallica document (using the IIIF protocol), and then applies an IBM Watson classification model to the images**
1. Extract the document bibliographical metadata from the Gallica OAI-PMH repository
2. Extract the document technical image metadata from its IIIF manifest, and then the images
3. Classify the images with a Watson Visual Recognition model (the model must already be trained and available)

In [2]:
# insert here the Gallica document ID you want to process
docID = '12148/btv1b103365619'

In [3]:
import sys
print("Python version")
print(sys.version)

Python version
2.7.14 (default, Sep 25 2017, 09:53:22) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)]


### 1. PyGallica (https://github.com/ian-nai/PyGallica) is used to access the Gallica OAI. The call returns metadata as a dictionary

In [16]:
# we import the Document class from the PyGallica package (https://github.com/ian-nai/PyGallica)
from document_api import Document

In [17]:
# we get the document metadata with the Gallica OAI API wrapped within the PyGallica package
json_dict4doc = Document.OAI(docID)

https://gallica.bnf.fr/services/OAIRecord?ark=ark:/12148/btv1b103365619
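
When only a few Dublin Core fields are needed, the XML returned by the URL above could also be read with the standard library instead of a wrapper. The sketch below is an illustration only: `dc_fields` is a hypothetical helper, and `SAMPLE_XML` is a hand-written excerpt shaped like the record (the real response carries more elements and attributes).

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, as declared in the OAI record
DC_NS = '{http://purl.org/dc/elements/1.1/}'


def dc_fields(xml_text):
    """Map each Dublin Core element name to the list of its text values."""
    fields = {}
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # keep only the dc:* elements that carry text
        if element.tag.startswith(DC_NS) and element.text:
            name = element.tag[len(DC_NS):]
            fields.setdefault(name, []).append(element.text.strip())
    return fields


# hypothetical, shortened record (assumption: same nesting as the dictionary keys
# results / notice / record / metadata / oai_dc:dc)
SAMPLE_XML = """<results>
  <notice><record><metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>[Recueil. Portraits de Joseph Caillaux]</dc:title>
      <dc:format>31 doc. iconogr.</dc:format>
      <dc:format>image/jpeg</dc:format>
    </oai_dc:dc>
  </metadata></record></notice>
</results>"""

print(dc_fields(SAMPLE_XML))
```

Repeated elements such as `dc:format` end up in the same list, so every field can be handled the same way.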


In [18]:
# do whatever you want with the document bibliographic metadata
# get the title
print json_dict4doc['results']['title']
# get the Dublin Core metadata
print json_dict4doc['results']['notice']['record']['metadata']['oai_dc:dc']

[Recueil. Portraits de Joseph Caillaux]
OrderedDict([(u'@xmlns:dc', u'http://purl.org/dc/elements/1.1/'), (u'@xmlns:oai_dc', u'http://www.openarchives.org/OAI/2.0/oai_dc/'), (u'@xmlns:xsi', u'http://www.w3.org/2001/XMLSchema-instance'), (u'@xsi:schemaLocation', u'http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd'), (u'dc:identifier', u'https://gallica.bnf.fr/ark:/12148/btv1b103365619'), (u'dc:title', u'[Recueil. Portraits de Joseph Caillaux]'), (u'dc:subject', [OrderedDict([(u'@xml:lang', u'fre'), ('#text', u'Caillaux, Joseph (1863-1944) -- Portraits')]), OrderedDict([(u'@xml:lang', u'fre'), ('#text', u'Portraits')])]), (u'dc:format', [u'31 doc. iconogr.', u'image/jpeg', u'Nombre total de vues :  33']), (u'dc:language', [u'fre', u'fran\xe7ais']), (u'dc:relation', u'Notice du catalogue : http://catalogue.bnf.fr/ark:/12148/cb43604814s'), (u'dc:type', [OrderedDict([(u'@xml:lang', u'fre'), ('#text', u'image fixe')]), OrderedDict([(u'@xml:lang', u'en
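
In the output above, a Dublin Core field can be a single string (`dc:title`), a list of strings (`dc:format`) or a list of language-tagged dictionaries with the text under `#text` (`dc:subject`). The sketch below flattens the three shapes into a list of strings. `dc_values` is a hypothetical helper, not part of PyGallica, and `sample` is a shortened copy of the record.

```python
def dc_values(dc_record, field):
    """Return the values of a Dublin Core field as a flat list of strings."""
    raw = dc_record.get(field)
    if raw is None:
        return []
    # a single value is not wrapped in a list: normalize it
    if not isinstance(raw, list):
        raw = [raw]
    values = []
    for item in raw:
        # language-tagged values are dicts holding the text under '#text'
        if isinstance(item, dict):
            item = item.get('#text')
        if item is not None:
            values.append(item)
    return values


# shortened excerpt of the record printed above
sample = {
    'dc:title': u'[Recueil. Portraits de Joseph Caillaux]',
    'dc:format': [u'31 doc. iconogr.', u'image/jpeg'],
    'dc:subject': [
        {'@xml:lang': u'fre', '#text': u'Caillaux, Joseph (1863-1944) -- Portraits'},
        {'@xml:lang': u'fre', '#text': u'Portraits'},
    ],
}
print(dc_values(sample, 'dc:subject'))
```

With the notebook's own data, the call would be `dc_values(json_dict4doc['results']['notice']['record']['metadata']['oai_dc:dc'], 'dc:subject')`.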

### 2. We request the document's IIIF manifest, which gives access to the image files

In [19]:
# we build the IIIF URL
import requests
METADATA_BASEURL = 'https://gallica.bnf.fr/iiif/ark:/'
req_url = "".join([METADATA_BASEURL, docID, '/manifest.json'])
print req_url

https://gallica.bnf.fr/iiif/ark:/12148/btv1b103365619/manifest.json
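
The `dc:identifier` field of the OAI record holds the full ARK URL, while the manifest URL is built from the short document ID. A minimal sketch of a conversion between the two, assuming the identifier contains the `ark:/` marker; `ark_to_doc_id` and `manifest_url` are hypothetical helper names.

```python
METADATA_BASEURL = 'https://gallica.bnf.fr/iiif/ark:/'


def ark_to_doc_id(identifier):
    """Extract '12148/btv1b103365619' from a Gallica ARK URL or a bare ID."""
    marker = 'ark:/'
    pos = identifier.find(marker)
    if pos >= 0:
        # drop everything up to and including the 'ark:/' marker
        identifier = identifier[pos + len(marker):]
    return identifier.strip('/')


def manifest_url(identifier):
    """Build the IIIF manifest URL of a Gallica document."""
    return METADATA_BASEURL + ark_to_doc_id(identifier) + '/manifest.json'


print(manifest_url('https://gallica.bnf.fr/ark:/12148/btv1b103365619'))
```

The helper does not strip a trailing view suffix such as `/f1`, so it expects a document-level identifier.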


In [20]:
# we ask for the manifest. The call returns a dictionary
r = requests.get(req_url)
r.raise_for_status()
json_4img = r.json()
print json_4img.keys()

[u'@context', u'attribution', u'description', u'license', u'@type', u'related', u'label', u'sequences', u'logo', u'metadata', u'@id', u'thumbnail', u'seeAlso']


In [21]:
# get the sequences of the manifest. It's a list
sequences = json_4img.get('sequences')
# get the first sequence. It's a dict which holds the canvases (one per image)
sequence = sequences[0]
print sequence.keys()
# parse the canvas data of each image
# each canvas has these keys: [u'height', u'width', u'@type', u'images', u'label', u'@id', u'thumbnail']
nImages = 0
print "--- getting image metadata from the IIIF manifest..."
for c in sequence.get('canvases'):
    nImages += 1
    print " label:",c.get('label')," width:",c.get('width'), " height:",c.get('height')
    # we also get a Gallica thumbnail (it's not an IIIF image)
    thumbnail = c.get('thumbnail')
    print " thumbnail: ",thumbnail.get('@id')
print "-------"
print "images:", nImages
print

[u'canvases', u'@id', u'@type', u'label']
--- getting image metadata from the IIIF manifest...
 label: NP  width: 2289  height: 3510
 thumbnail:  https://gallica.bnf.fr/ark:/12148/btv1b103365619/f1.thumbnail
 label: NP  width: 1508  height: 2485
 thumbnail:  https://gallica.bnf.fr/ark:/12148/btv1b103365619/f2.thumbnail
 label: NP  width: 1830  height: 3076
 thumbnail:  https://gallica.bnf.fr/ark:/12148/btv1b103365619/f3.thumbnail
 label: NP  width: 3576  height: 1850
 thumbnail:  https://gallica.bnf.fr/ark:/12148/btv1b103365619/f4.thumbnail
 label: NP  width: 4376  height: 3449
 thumbnail:  https://gallica.bnf.fr/ark:/12148/btv1b103365619/f5.thumbnail
 label: NP  width: 4083  height: 5506
 thumbnail:  https://gallica.bnf.fr/ark:/12148/btv1b103365619/f6.thumbnail
 label: NP  width: 3847  height: 5365
 thumbnail:  https://gallica.bnf.fr/ark:/12148/btv1b103365619/f7.thumbnail
 label: NP  width: 3062  height: 2507
 thumbnail:  https://gallica.bnf.fr/ark:/12148/btv1b103365619/f8.thumbnail
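
The same walk through the manifest can be wrapped in a function that returns the data instead of printing it. This is a sketch under assumptions: `list_canvases` is a hypothetical helper, `sample_manifest` is a hand-written one-canvas excerpt, and the `images[0].resource.service` path follows the IIIF Presentation 2 layout, which is not shown in the output above.

```python
def list_canvases(manifest):
    """Collect one summary dict per canvas of the first sequence."""
    sequences = manifest.get('sequences') or []
    if not sequences:
        return []
    summaries = []
    for canvas in sequences[0].get('canvases', []):
        thumbnail = canvas.get('thumbnail') or {}
        # assumption: IIIF Presentation 2 layout, images[0].resource.service
        images = canvas.get('images') or [{}]
        service = images[0].get('resource', {}).get('service', {})
        summaries.append({
            'label': canvas.get('label'),
            'width': canvas.get('width'),
            'height': canvas.get('height'),
            'thumbnail': thumbnail.get('@id'),
            'service': service.get('@id'),
        })
    return summaries


# hypothetical one-canvas manifest, shaped like the first entry printed above
sample_manifest = {
    'sequences': [{
        'canvases': [{
            'label': 'NP', 'width': 2289, 'height': 3510,
            'thumbnail': {'@id': 'https://gallica.bnf.fr/ark:/12148/btv1b103365619/f1.thumbnail'},
            'images': [{'resource': {'service': {
                '@id': 'https://gallica.bnf.fr/iiif/ark:/12148/btv1b103365619/f1'}}}],
        }],
    }],
}
for summary in list_canvases(sample_manifest):
    print('{label} {width}x{height}'.format(**summary))
```

Returning a list makes it easy to count the images with `len()` or to pick the views to download.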
 

In [22]:
# Now we'd like to get the image files with the IIIF Image API (PyGallica package again)
from iiif_api import IIIF

# IIIF export factor (%)
docExportFactor = 25
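
To get a rough idea of what an export factor means in pixels, the canvas dimensions can be scaled by hand. A minimal sketch, assuming the factor is passed to the IIIF `pct:` size parameter; `scaled_size` is a hypothetical helper and the server may round differently by a pixel.

```python
def scaled_size(width, height, percent):
    """Approximate pixel size of an image requested with size 'pct:<percent>'."""
    # integer arithmetic; the server may round differently by a pixel
    return (width * percent // 100, height * percent // 100)


# first canvas of the manifest above: 2289 x 3510 pixels
print(scaled_size(2289, 3510, 25))
```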

In [23]:
# get the image files #12 to #16 (we only process 5 images)
for i in range(12, 17):
    print "--- getting image..."
    # we build the IIIF URL. We ask for the full image with a size factor of docExportFactor
    IIIF.iiif("".join([docID,'/f',str(i)]), 'full', "".join(['pct:',str(docExportFactor)]), '0', 'native', 'jpg')
print
# the files are stored under the current folder (.), in a subfolder named after the document ID

--- getting image...
https://gallica.bnf.fr/iiif/ark:/12148/btv1b103365619/f12/full/pct:25/0/native.jpg
--- getting image...
https://gallica.bnf.fr/iiif/ark:/12148/btv1b103365619/f13/full/pct:25/0/native.jpg
--- getting image...
https://gallica.bnf.fr/iiif/ark:/12148/btv1b103365619/f14/full/pct:25/0/native.jpg
--- getting image...
https://gallica.bnf.fr/iiif/ark:/12148/btv1b103365619/f15/full/pct:25/0/native.jpg
--- getting image...
https://gallica.bnf.fr/iiif/ark:/12148/btv1b103365619/f16/full/pct:25/0/native.jpg
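
The URLs printed above follow the IIIF Image API pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below rebuilds them without PyGallica. `iiif_image_url` is a hypothetical helper that only builds the URL and does not download anything.

```python
IIIF_BASEURL = 'https://gallica.bnf.fr/iiif/ark:/'


def iiif_image_url(doc_id, view, region='full', size='full',
                   rotation='0', quality='native', fmt='jpg'):
    """Build {base}{id}/f{view}/{region}/{size}/{rotation}/{quality}.{format}."""
    identifier = '{0}/f{1}'.format(doc_id, view)
    path = '/'.join([identifier, region, size, str(rotation)])
    return '{0}{1}/{2}.{3}'.format(IIIF_BASEURL, path, quality, fmt)


# same views and export factor as in the cell above
urls = [iiif_image_url('12148/btv1b103365619', i, size='pct:25')
        for i in range(12, 17)]
for url in urls:
    print(url)
```

Keeping the URL construction separate from the download makes it possible to pass the URL straight to a service that accepts image URLs.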



### 3. Now we have to call the Watson classification model on the local image files

In [1]:
import os
import requests
from PIL import Image

# this cell relies on docID, defined in the first cell: run the notebook from the top

# Watson parameters:
WATSON_BASEURL = 'https://gateway.watsonplatform.net/visual-recognition/api/v3/classify'
WATSON_VERSION = (('version', '2018-03-19'),)
# insert your Watson key here
WATSON_KEY = '***'
# insert your Watson visual recognition model ID here
WATSON_MODEL = 'DefaultCustomModel_1457318034'

# first we read the images
# the images have been stored in a folder based on the document ID
# like 12148/btv1b103365619
entries = os.listdir(docID)
for i, entry in enumerate(entries, 1):
    print "--- inferring image ",i," ..."
    fileName = os.path.join(docID, entry)
    print(fileName)
    # we display the image
    img = Image.open(fileName)
    img.show()
    # we use the requests package; the equivalent curl call would be:
    # curl -X POST -u 'apikey:<key>' -F 'images_file=@<file>' -F 'classifier_ids=<model>' '<WATSON_BASEURL>?version=2018-03-19'
    with open(fileName, 'rb') as f:
        files = {
            'images_file': (fileName, f),
            'classifier_ids': (None, WATSON_MODEL),
        }
        # calling the Watson API
        response = requests.post(WATSON_BASEURL, params=WATSON_VERSION, files=files, auth=('apikey', WATSON_KEY))
    json_watson = response.json()
    # Watson returns a JSON with classification and confidence score information
    print json_watson

NameError: name 'docID' is not defined
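
`os.listdir` returns every entry of the folder, including hidden files and subfolders, and each of them would be opened and sent to Watson. A minimal sketch of a filter to apply first. `list_image_files` is a hypothetical helper and the extension list is an assumption.

```python
import os

# assumption: the file formats worth sending to the classifier
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png')


def list_image_files(folder):
    """Return the sorted paths of the image files found in folder."""
    paths = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        # skip subfolders and anything that does not look like an image
        if os.path.isfile(path) and name.lower().endswith(IMAGE_EXTENSIONS):
            paths.append(path)
    return paths
```

In the cell above, `list_image_files(docID)` could then replace `os.listdir(docID)`, with the returned paths used directly as file names.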

### 4. We could do the same on an IIIF image URL

In [12]:
import requests
from PIL import Image
# Wellcome collection
iiifURL = "https://iiif.wellcomecollection.org/image/L0009407.jpg/1,1,1568,1213/1000,/0/default.jpg"
# Gallica
#iiifURL = "https://gallica.bnf.fr/iiif/ark:/12148/bpt6k4628326j/f1/4317.695641814265,2899.28514719721,1006.9642711679644,774.944853848157/217,167/0/native.jpg"
# the image URL, the model ID and the API version are sent as query parameters
# (WATSON_MODEL and WATSON_KEY are defined in the previous cell)
curlParams = {
        'url': iiifURL,
        'classifier_ids': WATSON_MODEL,
        'version': '2018-03-19'
    }
print "--- inferring image ",iiifURL," ..."
img = Image.open(requests.get(iiifURL, stream=True).raw)
img.show()
# call to the Watson API
response = requests.post('https://gateway.watsonplatform.net/visual-recognition/api/v3/classify',params=curlParams, auth=('apikey', WATSON_KEY))
json_watson = response.json()
print json_watson


--- inferring image  https://iiif.wellcomecollection.org/image/L0009407.jpg/1,1,1568,1213/1000,/0/default.jpg  ...
{u'images': [{u'classifiers': [{u'classes': [{u'score': 0.902, u'class': u'Photo_30'}], u'classifier_id': u'DefaultCustomModel_1457318034', u'name': u'Default Custom Model'}], u'resolved_url': u'https://iiif.wellcomecollection.org/image/L0009407.jpg/1,1,1568,1213/1000,/0/default.jpg', u'source_url': u'https://iiif.wellcomecollection.org/image/L0009407.jpg/1,1,1568,1213/1000,/0/default.jpg'}], u'custom_classes': 4, u'images_processed': 1}


In [32]:
images = json_watson.get('images')
#print images[0].keys() # dict
print "-> classification:",images[0].get('classifiers')[0].get('classes')[0].get('class')
print "-> confidence score:",images[0].get('classifiers')[0].get('classes')[0].get('score')

-> classification: Photo_30
-> confidence score: 0.902
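
The chained `.get(...)[0]` calls above fail with an exception as soon as the response contains no image or no class, for example when the service returns an error. A more defensive sketch, assuming the response keeps the structure printed above; `best_class` is a hypothetical helper and `sample_response` is a shortened copy of that output.

```python
def best_class(watson_json, image_index=0):
    """Return (class_name, score) of the top-scoring class, or None."""
    images = watson_json.get('images') or []
    if image_index >= len(images):
        return None
    candidates = []
    # gather the classes of every classifier applied to the image
    for classifier in images[image_index].get('classifiers', []):
        for cls in classifier.get('classes', []):
            candidates.append((cls.get('class'), cls.get('score', 0.0)))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[1])


# shortened copy of the response printed above
sample_response = {
    'images': [{
        'classifiers': [{
            'classes': [{'score': 0.902, 'class': 'Photo_30'}],
            'classifier_id': 'DefaultCustomModel_1457318034',
            'name': 'Default Custom Model',
        }],
    }],
    'custom_classes': 4,
    'images_processed': 1,
}
print(best_class(sample_response))
```

Returning `None` instead of raising lets the calling loop skip an image and carry on with the next one.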
