# Viewing Pages with Images from Internet Archive
Adapted from Stephen Krewson, “Extracting Illustrated Pages from Digital Libraries with Python,” Programming Historian, January 14, 2019, https://programminghistorian.org/en/lessons/extracting-illustrated-pages.

- Displays pages of Internet Archive books that contain pictures or illustrations, using the ABBYY FineReader OCR data hosted by Internet Archive.
- The main difference from Krewson's lesson is that, instead of downloading pages from Internet Archive, this notebook calls IIIF URLs to (a) preview the results of Internet Archive queries and (b) display IA-hosted IIIF images in tabs in the notebook itself via ```ipyplot```.
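
The IIIF URLs requested by this notebook follow the IIIF Image API pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below shows how such a URL can be assembled, using the same `iiif.archivelab.org` base and `item_id$page` identifier that appear in the functions later in this notebook. The helper name `iiif_page_url` and its `height` parameter are hypothetical, introduced only for illustration.

```python
IIIF_BASE = "https://iiif.archivelab.org/iiif"  # base URL used later in this notebook


def iiif_page_url(item_id, page, height=None):
    """Build an IIIF Image API URL for one page of an IA volume.

    height=None requests the full-size image; an integer requests an
    image scaled to that height in pixels (IIIF size syntax ",h").
    """
    size = "full" if height is None else ",{}".format(height)
    return "{}/{}${}/full/{}/0/default.jpg".format(IIIF_BASE, item_id, page, size)


# full-size page image, then a 450-pixel-high version of the same page
print(iiif_page_url("bub_gb_AuRss2enn0gC", 5))
print(iiif_page_url("bub_gb_AuRss2enn0gC", 5, height=450))
```

Because only the size segment changes, a smaller preview can be requested from the server without downloading and resizing the full image locally.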

## Import Libraries and Enter Credentials


In [None]:

import requests
import ipyplot
from IPython.display import Image
from IPython.display import display, HTML
import gzip
import os
import sys
import xml.etree.ElementTree as ET
import internetarchive as ia

# Enter IA credentials
ia_email = input("Please enter IA account email: ")
ia_password = input("Please enter IA account password: ")

# add these credentials to the API's configuration object
ia.configure(ia_email, ia_password)

## to use more of the browser window for display
display(HTML(data="""
<style> div#notebook-container { width: 95%; } div#menubar-container { width: 95%; } div#maintoolbar-container { width: 99%; } </style>

"""))


## Define Functions

In [8]:
# define a function that finds the illustrated pages of a given IA volume
def ia_picture_urls(item_id, out_dir=None, out_urls=False):
    """
    :param item_id: unique Internet Archive volume identifier
    :param out_dir: currently unused; this function does not download images
    :param out_urls: if True, print the list of IIIF URLs

    :return: list of IIIF URLs for pages with one or more
        blockType=Picture in the Abbyy OCR data, or None if the
        volume has no Abbyy file
    """

    print("[{}] Starting processing".format(item_id))
    
    # Use the command-line client to see a volume's available file formats:
    # `ia metadata --formats VOLUME_ID`
    
    # for this lesson, only the Abbyy file is needed
    returned_files = list(ia.get_files(item_id, formats=["Abbyy GZ"]))
    
    # make sure something got returned
    if len(returned_files) > 0:
        abbyy_file = returned_files[0].name
    else:
        print("[{}] Could not get Abbyy file".format(item_id))
        return None
    
    # download the abbyy file to CWD
    ia.download(item_id, formats=["Abbyy GZ"], ignore_existing=True, destdir=os.getcwd(), no_directory=True)
    
    # collect the pages with at least one picture block
    img_pages = []
    
    with gzip.open(abbyy_file) as fp:
        tree = ET.parse(fp)
        document = tree.getroot()
        for i, page in enumerate(document):
            for block in page:
                try:
                    if block.attrib['blockType'] == 'Picture':
                        img_pages.append(i)
                        break
                except KeyError:
                    continue
    
    # 0 is not a valid page for making GET requests to IA,
    # yet sometimes it's in the zipped Abbyy file
    img_pages = [page for page in img_pages if page > 0]
    
    # report how many illustrated pages were found
    total_pages = len(img_pages)
    print("total pages", total_pages)

    # OCR files are huge, so just delete once we have pagelist
    os.remove(abbyy_file)
    

    urls = ["https://iiif.archivelab.org/iiif/{}${}/full/full/0/default.jpg".format(item_id, page) for page in img_pages]
    
    if out_urls:
        print(urls)

    # return IIIF URLs for the pages with 1+ picture blocks
    return urls

## Grid plotting function
def look(urls, labels=None):
    if len(urls) > 0:
        smaller = [x.replace('/full/full', '/full/,450') for x in urls]
        ipyplot.plot_images(smaller, max_images=None, labels=labels)
    else:
        print("No results for query")
        
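def tabs(vol_ids):
    """Collect IIIF URLs for every volume, plus a parallel list of volume IDs used as tab labels."""
    labels = []
    urls = []
    for item_id in vol_ids:
        try:
            bookwise_urls = ia_picture_urls(item_id)
        except Exception as err:
            # skip volumes that fail, but report why instead of failing silently
            print("[{}] Skipped: {}".format(item_id, err))
            continue
        # ia_picture_urls returns None when a volume has no Abbyy file
        if not bookwise_urls:
            continue
        urls = urls + bookwise_urls
        labels = labels + [item_id] * len(bookwise_urls)
    return urls, labels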
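
The core of `ia_picture_urls` is the scan for pages containing a `blockType="Picture"` block. The sketch below isolates that step so it can be tried without an Internet Archive account or a download. The XML sample is a toy stand-in written for this example (real Abbyy files are far larger and carry more attributes), and the helper name `picture_page_indexes` is hypothetical.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Toy stand-in for an Abbyy file: three pages, only the second has a Picture block.
SAMPLE = b"""<document>
  <page><block blockType="Text"/></page>
  <page><block blockType="Text"/><block blockType="Picture"/></page>
  <page><block/></page>
</document>"""


def picture_page_indexes(fileobj):
    """Return 0-based indexes of pages with at least one Picture block."""
    root = ET.parse(fileobj).getroot()
    return [
        i for i, page in enumerate(root)
        # .get() returns None for blocks without a blockType attribute
        if any(block.get("blockType") == "Picture" for block in page)
    ]


# compress the sample in memory to mimic the gzipped file read by the notebook
compressed = gzip.compress(SAMPLE)
with gzip.open(io.BytesIO(compressed)) as fp:
    print(picture_page_indexes(fp))
```

Using `block.get("blockType")` avoids the `try`/`except KeyError` needed with `block.attrib[...]`, and `any(...)` stops at the first picture block on a page, as the `break` does in the function above.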

## Query Internet Archive & Look at Thumbnails of Results

- query guidelines at https://archive.org/advancedsearch.php

In [11]:
query = "milton date:[1650 TO 1655] mediatype:texts"
vol_ids = [result['identifier'] for result in ia.search_items(query)]
thumbs = ['https://archive.org/services/img/{}'.format(im) for im in vol_ids]
look(thumbs, labels=vol_ids)
print(vol_ids)

['bub_gb_AuRss2enn0gC', 'bub_gb_E7UdY_fFAq0C', 'bub_gb_dDFhAAAAcAAJ', 'bub_gb_dPwJjf4BGU0C', 'bub_gb_mHom52vCg4wC', 'bub_gb_qdUYvmyfLiEC', 'ned-kbn-all-00002983-001']
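
To vary the search without retyping the query syntax, the query string and thumbnail URLs can be composed by small helpers. This is a sketch only: `build_query` and `thumbnail_urls` are hypothetical names, not part of the `internetarchive` library, and they reproduce the string patterns used in the cell above without making any network calls.

```python
def build_query(term, start_year, end_year, mediatype="texts"):
    """Compose an Internet Archive search string with a date range."""
    return "{} date:[{} TO {}] mediatype:{}".format(
        term, start_year, end_year, mediatype
    )


def thumbnail_urls(vol_ids):
    """Map volume identifiers to archive.org thumbnail URLs."""
    return ["https://archive.org/services/img/{}".format(v) for v in vol_ids]


print(build_query("milton", 1650, 1655))
print(thumbnail_urls(["bub_gb_AuRss2enn0gC"]))
```

The string returned by `build_query` can be passed to `ia.search_items` in place of the hard-coded `query`.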


In [12]:
# loop over our search results and call the function


urls, labels = tabs(vol_ids)
ipyplot.plot_class_tabs(urls, labels=labels)

[bub_gb_AuRss2enn0gC] Starting processing
bub_gb_AuRss2enn0gC: d - success
total pages 26
[bub_gb_E7UdY_fFAq0C] Starting processing
bub_gb_E7UdY_fFAq0C: d - success
total pages 83
[bub_gb_dDFhAAAAcAAJ] Starting processing
bub_gb_dDFhAAAAcAAJ: d - success
total pages 41
[bub_gb_dPwJjf4BGU0C] Starting processing
bub_gb_dPwJjf4BGU0C: d - success
total pages 93
[bub_gb_mHom52vCg4wC] Starting processing
bub_gb_mHom52vCg4wC: d - success
total pages 215
[bub_gb_qdUYvmyfLiEC] Starting processing
bub_gb_qdUYvmyfLiEC: d - success
total pages 85
[ned-kbn-all-00002983-001] Starting processing
ned-kbn-all-00002983-001: d - success
total pages 119
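
Since `tabs` returns one label per URL, the number of illustrated pages per volume can be tallied from `labels` alone. A minimal sketch, using a toy `labels` list with made-up identifiers in place of the real output:

```python
from collections import Counter

# Toy stand-in for the `labels` list returned by tabs(); values are illustrative only.
labels = ["vol_a", "vol_a", "vol_b", "vol_a", "vol_c"]

# count how many URLs (illustrated pages) each volume contributed
pages_per_volume = Counter(labels)
for item_id, n_pages in pages_per_volume.most_common():
    print("{}: {} illustrated page(s)".format(item_id, n_pages))
```

Sorting by count makes it easier to spot volumes where the OCR data flags an unusually large share of pages as pictures.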
