# Scraping data from web sites meant for human readers #

So far, we've focused on downloading data from the web that was published with automated querying and processing in mind. But sometimes you'll need to get information off of the web that's presented with the assumption that a user is going to sit in front of the screen and read, perhaps clicking on links to go elsewhere or download files. In some situations, it makes more sense to automate that process so that you can assemble the data or materials you need.

*Note*: Web scraping can raise ethical questions: What's okay to scrape and what's not? How much scraping is okay, and how much is abusive? People can get up to all sorts of nefarious things with web scraping. But people can also preserve and better disseminate important information by scraping it and repurposing it. You'll have to decide what is and isn't appropriate to scrape from the web.

## The problem ##

The British Library has made available on Flickr millions of images taken from scans of books in the public domain. They have published metadata directories of those images on GitHub and encourage anyone who wants to reuse those images to do so as they see fit.

The British Library has also made PDFs of the volumes from which those images were taken freely available in their entirety. But they have not provided metadata for those volumes to point users to them for bulk download (that I can find so far, at least). Each page for an image on Flickr, however, includes a link to the full PDF, so it's certainly possible to download scans of entire volumes from the British Library. But you have to go through Flickr first.

## Our solution ##

This script will search a .csv file of metadata for the British Library's Flickr image set and find images that appear in books by James Thomson (author of *Sophonisba*). We'll figure out how many distinct books those images are taken from, then get a link to one image from each book (since we only need one--they all point to the same PDF). With that link, we'll be able to download the otherwise-unindexed PDF file from the British Library's servers.
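The "one link per book" step can be sketched on a tiny in-memory sample before we run it on the real file. The rows and URLs below are made up for illustration (only the column names match what the script uses), and `one_link_per_book` is a hypothetical helper name, not part of the script itself.

```python
import csv
import io

# Hypothetical sample rows, using the column names the script relies on
SAMPLE_CSV = '''book_identifier,first_author,title,flickr_url
001,"Thomson, James",The Seasons,https://example.org/image/1
001,"Thomson, James",The Seasons,https://example.org/image/2
002,"Thomson, James",Sophonisba,https://example.org/image/3
003,"Gray, Thomas",An Elegy,https://example.org/image/4
'''

def one_link_per_book(csv_file, wanted_author):
    """Return {book_identifier: details}, keeping the first row seen for each book."""
    books = {}
    for row in csv.DictReader(csv_file):
        if row['first_author'] != wanted_author:
            continue
        # setdefault only stores the value if the key is not already present,
        # so later images from the same book are ignored
        books.setdefault(row['book_identifier'], {
            'flickr_url': row['flickr_url'],
            'author': row['first_author'],
            'title': row['title'],
        })
    return books

sample_books = one_link_per_book(io.StringIO(SAMPLE_CSV), 'Thomson, James')
print(len(sample_books), 'distinct books')
for identifier, details in sample_books.items():
    print(identifier, details['flickr_url'])
```

Because the dictionary is keyed on `book_identifier`, its length is the number of distinct books, and each entry keeps the first image link we came across for that book.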

## New tricks ##

Much of what we're doing in this script is just like what we've done already, but there are a few new wrinkles.
* Filename mangling, including some mildly gnarly regular expressions. We'll need to provide filenames for the PDFs we're downloading (the British Library's own filenames are just arbitrary strings of characters, so we'll want something more descriptive). We'll use a few words from each title to construct a filename, but we want to avoid a bunch of pointless "The"s, "A"s, "An"s, etc. (i.e., what are commonly called "stopwords").
* Streaming large file downloads and saving iteratively. The PDFs provided by the British Library are high quality, with correspondingly large file sizes. We don't want to wait to load the entire file into memory. We'll use the `requests` module's streaming feature to begin saving the file as we get it, rather than waiting for the whole thing to arrive.



In [None]:
# Import libraries
import csv
import re
import requests
from bs4 import BeautifulSoup

# Create an empty dictionary to hold information about books by James Thomson in the BL Flickr set
books = {}
# Open our .csv file of metadata, initiate a csv DictReader, and begin reading the file a line at a time
with open('/media/sf_RBSDigitalApproaches/data/BL-Flickr-C18.csv', newline='') as flickr_file :
    reader = csv.DictReader(flickr_file, delimiter=',', quotechar='"')
    for row in reader :
        # Check to see if the contents of the first_author cell are "Thomson, James"
        if row['first_author'] == 'Thomson, James' :
            # Save information from this row as variables
            book_identifier = row['book_identifier']
            flickr_url = row['flickr_url']
            author = row['first_author']
            title = row['title']
            # Check to see if we already have a link for an image from this book--we only need one link per book
            if book_identifier not in books :
                books[book_identifier] = {'flickr_url': flickr_url, 'author': author, 'title': title}


### Progress check ###
Let's have a look at the information we've saved about these Flickr links.

In [None]:
for book in books.items() :
    print(book)

### Scraping the Flickr page for the download link at the British Library ###

Now that we have URLs to take us to pages at Flickr, we need to use the `requests` module to get the contents of those pages, then use `BeautifulSoup` to find the links to the PDFs at the British Library, which all include "access.bl.uk" in their URLs. We'll then add those new URLs to our `books` dictionary.
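To see what "find the first link whose href matches a pattern" amounts to, here is a minimal sketch that uses only the standard library's `html.parser` instead of `BeautifulSoup`, so it runs without a network connection or extra installs. The HTML fragment and the British Library URL in it are invented for illustration, and `FirstMatchingLink` is a hypothetical class name; real Flickr markup will look different.

```python
import re
from html.parser import HTMLParser

class FirstMatchingLink(HTMLParser):
    """Remember the first href attribute that matches a compiled pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.match = None

    def handle_starttag(self, tag, attrs):
        # Once we have a match, ignore the rest of the page
        if self.match is not None:
            return
        for name, value in attrs:
            if name == 'href' and value and self.pattern.search(value):
                self.match = value
                return

# Hypothetical page fragment (assumption: not real Flickr markup)
SAMPLE_HTML = '''
<p><a href="https://www.flickr.com/photos/example/">More photos</a></p>
<p><a href="http://access.bl.uk/item/viewer/EXAMPLE-ID">View the book</a></p>
'''

# The dots are escaped so that they match only literal dots
finder = FirstMatchingLink(re.compile(r'access\.bl\.uk'))
finder.feed(SAMPLE_HTML)

if finder.match is None:
    print('No British Library link on this page')
else:
    print(finder.match)
```

The check for `None` matters in practice: a page without a matching link gives us nothing to download, and it's better to find that out here than halfway through a batch.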

In [None]:
for book_identifier, values in books.items() :
    # Get the flickr_url for an image from each book
    scrape_url = values['flickr_url']
    
    # Use the requests module to retrieve the content at that URL
    r = requests.get(scrape_url)
    
    # Take the text that requests brings back and pass it over to BeautifulSoup, using the html parser
    soup = BeautifulSoup(r.text, 'html.parser')
    
    # Use BeautifulSoup's find to look for a link with an href attribute that contains the pattern "access.bl.uk"
    # (with the dots escaped so that they match only literal dots), then get the value of the href attribute
    # itself (i.e., just the link, not the whole tag). Save this link as 'bl_book_url'.
    bl_book_url = soup.find(href=re.compile(r'access\.bl\.uk'))['href']
    
    # Use setdefault to add a new key/value pair for this new URL to our books dictionary
    values.setdefault('bl_book_url', bl_book_url)
    
    # Close the requests module's connection to the server
    r.close()

### Progress check ###

Let's see the current state of our `books` dictionary, now with links to the British Library's PDFs.

In [None]:
for book in books.items() :
    print(book)

### Constructing a filename for the PDF we're about to download ###

We've got the links to the PDFs at the British Library, but we have to call them something when we download them: I can't tell "lsidyv345e48ed" from "lsidyv345e7a47" at a glance. We saved some basic information to our `books` dictionary, but we wouldn't want to use some of those long titles as filenames for our PDFs. Warning: things are about to get a little weird.

The filename I want to arrive at will consist of the author's last name, a few selected words of the title (eliminating any punctuation), and the British Library's identifier, all separated by underscores. So we need to:
* Isolate the author's last name
* Get a list of the words in the title
* Eliminate stopwords
* Get a manageably small number of words to use
* Get just the identifier from the British Library's link
* Combine all of this together with underscores
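The steps above can be sketched as one small function. This is a simplified variant rather than the exact code in the next cell: the stopword pattern is condensed with a case-insensitive flag, `build_filename` is a hypothetical helper name, and the title and URL path in the example call are invented (only the identifier "lsidyv345e48ed" comes from the text above).

```python
import re

# Stopwords, optionally preceded by an opening square bracket, plus a bare ellipsis
STOPWORDS = re.compile(r'^\[?(a|an|and|the|of|with)$|^\.\.\.$', re.IGNORECASE)
# Punctuation we don't want in a filename
PUNCTUATION = re.compile(r'[\[\].,:;?!]')

def build_filename(author, title, bl_book_url, max_words=4):
    """Combine last name, a few title words, and the BL identifier with underscores."""
    last_name = author.partition(',')[0]
    # Build a new list rather than removing items from the list we're looping over
    words = [word for word in title.split() if not STOPWORDS.search(word)]
    short_title = PUNCTUATION.sub('', '_'.join(words[:max_words]))
    bl_id = bl_book_url.rpartition('/')[-1]
    return '_'.join([last_name, short_title, bl_id]) + '.pdf'

# Hypothetical title and URL, for illustration only
example = build_filename(
    'Thomson, James',
    '[The Seasons ... With a life of the author.]',
    'http://access.bl.uk/item/pdf/lsidyv345e48ed',
)
print(example)  # → Thomson_Seasons_life_author_lsidyv345e48ed.pdf
```

Filtering into a new list is the important design choice here: removing words from a list while looping over that same list can cause Python to skip the word that follows each removal, so two stopwords in a row would leave the second one behind.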

In [None]:
for book, data in books.items() :
    # First, let's get the author's last name, using partition on the comma and taking the first item in the 
    # resulting tuple of strings: index [0]
    author_name = data['author'].partition(',')[0]
    
    # Next, several steps for dealing with the words in the title:
    # 1) split the title string into a list of its constituent words, splitting on any run of whitespace.
    title_words = data['title'].split()
    
    # 2) Define a regular expression to use for searching for and eliminating stop words (written as a raw
    # string, so that the backslashes reach the regular expression engine unchanged)
    stopwords = re.compile(r'^\[?[Aa]$|^\[?[Aa]n$|^[Aa]nd$|^\[?[Tt]he$|^[Oo]f$|^\[?[Ww]ith$|^\.\.\.$')
    
    # 3) Iterate through the list of title_words, searching for the stopwords regular expression pattern
    # in each word, and build a new list of only the words that don't match. (Removing items from a list
    # while looping over it makes Python skip the item that follows each removal.)
    title_words = [word for word in title_words if not re.search(stopwords, word)]
    
    # 4) Use the first four words that remain in the list, joining them together with underscores
    prep_title = '_'.join(title_words[:4])
    
    # 5) Define a regular expression to look for any unwanted punctuation in our concatenated title words:
    punct = re.compile(r'[\[\]\.,:;?!]')
    
    # 6) Look for instances of that regular expression in our prep_title and, when we find one, delete it
    # (by substituting nothing: '').
    trunc_title = re.sub(punct,'',prep_title)
    
    # To get the BL identifier, we use a variation on partition--rpartition--to work from the right end of the string
    bl_id = data['bl_book_url'].rpartition('/')[-1]
    
    # Put it all together to get a filename
    filename = author_name + '_' + trunc_title + '_' + bl_id + '.pdf'
    
    # Add the filename to our dictionary
    data.setdefault('filename', filename)
    
for book in books.items() :
    print(book)

### Think about downloading the books ###

The first of the two download cells below will download the seven books in our set, two megabytes at a time. But the seven books together total over 300 megabytes, which is not only a lot of storage but will also take quite a while to download. The second cell downloads only the smallest of the PDFs (ca. 16.1 MB), so that you can see that it actually works.
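Saving a file piece by piece can be tried out without touching the network. In this minimal sketch, a short list of byte strings stands in for what the `requests` module's `iter_content` would hand us; `save_in_chunks` is a hypothetical helper name, and the chunk sizes are tiny on purpose.

```python
import os
import tempfile

def save_in_chunks(chunks, path, total_size=None):
    """Write an iterable of byte chunks to path and return the number of bytes written."""
    written = 0
    with open(path, 'wb') as out:
        for chunk in chunks:
            if not chunk:
                continue
            out.write(chunk)
            # Count the bytes actually received: the final chunk is usually short
            written += len(chunk)
            if total_size:
                print('Downloaded {} of {} ({:.1f}%)'.format(
                    written, total_size, written / total_size * 100))
            else:
                print('Downloaded {} bytes'.format(written))
    return written

# Stand-in for a streamed response: three chunks, the last one shorter than the rest
fake_chunks = [b'abcd', b'efgh', b'ij']
target = os.path.join(tempfile.mkdtemp(), 'sample.pdf')
bytes_written = save_in_chunks(fake_chunks, target, total_size=10)
```

With a real streamed response `r`, the same function could be fed `r.iter_content(chunk_size=2000000)` for the chunks and `int(r.headers.get('Content-Length', 0))` for the total, assuming the server reports a length. Adding `len(chunk)` rather than the nominal chunk size keeps the percentage from running past 100% on the last chunk.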

In [None]:
####### THINK BEFORE RUNNING THIS CELL: DO YOU REALLY WANT SEVEN LARGE PDFs OF JAMES THOMSON? ######

for book_identifier, values in books.items() :
    with open('/media/sf_RBSDigitalApproaches/output/' + values['filename'], 'wb') as download :
        r2 = requests.get(values['bl_book_url'], stream=True)
        file_size = r2.headers['Content-Length']
        # This is part of a very crude download timer. You can find better ideas on Stackoverflow...
        progress = 0
        print('Downloading ' + values['filename'] + '...')
        for chunk in r2.iter_content(chunk_size=2000000) : 
            if chunk :
                download.write(chunk)
                # Count the bytes actually received; the final chunk is usually smaller than chunk_size
                progress += len(chunk)
                print('Downloaded ' + str(progress) + ' of ' + str(file_size) + ' (' + \
                      str(float(progress)/float(file_size) * 100) + '%)')
            
        print(values['filename'] + ' complete.')
        r2.close()

In [None]:
sample = books['3626484']
with open('/media/sf_RBSDigitalApproaches/output/' + sample['filename'], 'wb') as download :
    r2 = requests.get(sample['bl_book_url'], stream=True)
    file_size = r2.headers['Content-Length']
    progress=0
    print('Downloading ' + sample['filename'] + '...')
    for chunk in r2.iter_content(chunk_size=2000000) :
        if chunk :
            download.write(chunk)
            # Count the bytes actually received; the final chunk is usually smaller than chunk_size
            progress += len(chunk)
            print('Downloaded ' + str(progress) + ' of ' + str(file_size) + \
                  ' (' + str(float(progress)/float(file_size) * 100) + '%)')
    print(sample['filename'] + ' complete.')
    r2.close()