# Bulk downloading ECCO-TCP texts #

Earlier, we identified which texts printed by William Bowyer are present in the ECCO-TCP corpus of TEI-encoded transcriptions and figured out the ECCO-TCP id that corresponds to each ESTC record for a work printed by Bowyer. Now, let's download those texts. 

Perverse as it will seem to anybody who's ever labored over tagging a text following the TEI guidelines, we're going to throw away all of the markup to leave ourselves with a collection of plain text files. While we're at it, we'll modernize all of the long-s characters. 

In addition to the modules we've used before, we'll import the os module to create a directory in our file system.

In [1]:
# Import libraries
import csv
import requests
from bs4 import BeautifulSoup
import re
import os

# Create a string variable with the beginning of the url we'll need to download our TCP texts from GitHub
baseurl = 'https://raw.githubusercontent.com/textcreationpartnership/'
# Create an empty list to hold the URLs we want to download
urls = []
# Open our .csv file with TCP ids, initiate the csv DictReader, and read the file a line at a time
with open('/media/sf_RBSDigitalApproaches/data/Bowyer_TCP_texts.csv', 'r') as infile :
    reader = csv.DictReader(infile, delimiter=',', quotechar='"')
    for row in reader :
        # Get the TCP id
        tcp_id = row['tcp_id']
        # Fill in the remainder of our URL with information based on the TCP id
        url = baseurl + tcp_id + '/master/' + tcp_id + '.xml'
        # Add our completed URL to our list of URLs
        urls.append(url)

# Let's see what we have
print(urls)

['https://raw.githubusercontent.com/textcreationpartnership/K025327.000/master/K025327.000.xml', 'https://raw.githubusercontent.com/textcreationpartnership/K020914.000/master/K020914.000.xml', 'https://raw.githubusercontent.com/textcreationpartnership/K029653.000/master/K029653.000.xml', 'https://raw.githubusercontent.com/textcreationpartnership/K036148.000/master/K036148.000.xml', 'https://raw.githubusercontent.com/textcreationpartnership/K036380.000/master/K036380.000.xml', 'https://raw.githubusercontent.com/textcreationpartnership/K036706.000/master/K036706.000.xml', 'https://raw.githubusercontent.com/textcreationpartnership/K039324.000/master/K039324.000.xml', 'https://raw.githubusercontent.com/textcreationpartnership/K022124.000/master/K022124.000.xml', 'https://raw.githubusercontent.com/textcreationpartnership/K040172.000/master/K040172.000.xml', 'https://raw.githubusercontent.com/textcreationpartnership/K041028.000/master/K041028.000.xml', 'https://raw.githubusercontent.com/text

### Make our filenames for saving the files ###

We saved the URLs, which include a filename ending in .xml, but we're going to save these files as plain text (.txt) files. Note that, because we're outside of our earlier "for" loop now, the tcp_id variable holds only the last id that loop processed (inside the loop, we could have simply combined each id with ".txt"), so we'll have to get our filenames a different way. 

(*Note*: If I were writing this as a freestanding script, rather than in a Jupyter Notebook, I'd probably do this differently, taking care of the URL construction, filename mangling, downloading, XML parsing, and file writing all inside that loop. That is, as I understand it, a relatively unsophisticated way to do things, but it certainly works...) 
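To make that single-loop idea concrete, here is a minimal sketch that builds each URL and its matching .txt filename in the same pass over the CSV. It is only an illustration under stated assumptions: the in-memory `sample_csv` stands in for Bowyer_TCP_texts.csv, its `title` column is made up, and the `targets` list is a hypothetical name. Downloading and parsing are left out so the sketch runs without a network connection.

```python
import csv
import io

BASEURL = 'https://raw.githubusercontent.com/textcreationpartnership/'

# Stand-in for Bowyer_TCP_texts.csv (assumption: only the tcp_id column
# matches the real file; the title column is invented for illustration)
sample_csv = io.StringIO(
    'tcp_id,title\n'
    'K025327.000,Example one\n'
    'K020914.000,Example two\n'
)

# Collect (url, filename) pairs while tcp_id is still in hand
targets = []
for row in csv.DictReader(sample_csv):
    tcp_id = row['tcp_id']
    url = BASEURL + tcp_id + '/master/' + tcp_id + '.xml'
    filename = tcp_id + '.txt'
    targets.append((url, filename))

for url, filename in targets:
    print(filename, url)
```

Keeping the URL and the filename together as a pair means nothing has to be recovered from the URL string later on.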

In [2]:
for url in urls :
    # First, use rpartition to split our url on the '/' character, starting from the right end and keeping 
    # the last bit. Then, take a portion of that string, beginning at character 0 and stopping four characters
    # from the end of the string (this eliminates our .xml file extension). Finally, add the new file extension
    # '.txt'.
    filename = url.rpartition('/')[-1][0:-4] + '.txt'
    print(filename)

K025327.000.txt
K020914.000.txt
K029653.000.txt
K036148.000.txt
K036380.000.txt
K036706.000.txt
K039324.000.txt
K022124.000.txt
K040172.000.txt
K041028.000.txt
K022939.001.txt
K022979.000.txt
K023013.000.txt
K023048.000.txt
K023195.000.txt
K020308.000.txt
K020310.000.txt
K132743.000.txt
K048041.000.txt
K052227.006.txt
K056178.002.txt
K060329.000.txt
K066805.000.txt
K075014.000.txt
K080528.000.txt
K081295.000.txt
K084859.000.txt
K086750.000.txt
K088839.000.txt
K089370.001.txt
K092785.004.txt
K109292.001.txt
K111725.000.txt
K114616.000.txt
K043848.000.txt
K044031.000.txt
K046245.000.txt
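The slicing above relies on the extension being exactly four characters long. As an alternative sketch (not what the notebook uses), the standard library can split the URL path and the extension for us. The helper name `txt_name_from_url` is hypothetical.

```python
import os
import posixpath
from urllib.parse import urlparse

def txt_name_from_url(url):
    """Return the last path segment of url with its extension swapped for .txt."""
    # urlparse isolates the path; posixpath.basename keeps the part after the last '/'
    basename = posixpath.basename(urlparse(url).path)
    # splitext splits on the final dot only, so 'K025327.000.xml' gives 'K025327.000'
    stem, _ext = os.path.splitext(basename)
    return stem + '.txt'

name = txt_name_from_url(
    'https://raw.githubusercontent.com/textcreationpartnership/K025327.000/master/K025327.000.xml'
)
print(name)
```

Because `splitext` only removes the final extension, the dotted TCP ids survive intact.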


In [3]:
# Check to be sure that we don't already have this folder (we'd raise an error if we tried running the code in this
# cell a second time, because the directory would already exist)
if not os.path.exists('/media/sf_RBSDigitalApproaches/data/Bowyer_TCP/') :
    # If not, create the directory
    os.mkdir('/media/sf_RBSDigitalApproaches/data/Bowyer_TCP/')
for url in urls :
    filename = url.rpartition('/')[-1][:-4] + '.txt'
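    # Pass the URL to the requests module and get the resource
    r = requests.get(url)
    # Stop with a clear error if the download failed (for example, a 404 for a missing repository)
    r.raise_for_status()
    # Pass the text that requests brings back over to the BeautifulSoup module, using the xml parser from lxml. 
    soup = BeautifulSoup(r.text,'xml')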
    # Find the "text" element of our TEI document, then get_text() to get the text content, throwing away all 
    # the markup.
    stripped = soup.find('text').get_text()
    # Use re.sub() to find all the long-s's and replace them with short-s's. (The u prefix on the pattern
    # marks it as a unicode string; Python 2 needs it, and Python 3 accepts it but doesn't require it.)
    modernized = re.sub(u'ſ','s',stripped)
    # Open a new text file in our target directory and write our modernized text to it, encoding as utf-8
    with open('/media/sf_RBSDigitalApproaches/data/Bowyer_TCP/' + filename, 'wb') as file :
        file.write(modernized.encode('utf-8'))
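To try the strip-and-modernize step without downloading anything, here is a minimal offline sketch. It makes several assumptions: it swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it needs no third-party packages, the `sample` document is a made-up stand-in for an ECCO-TCP file, the function name `strip_and_modernize` is hypothetical, and the output goes to a temporary directory rather than the notebook's data folder.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# TEI elements live in this namespace; ElementTree needs it spelled out
TEI_NS = '{http://www.tei-c.org/ns/1.0}'

# A tiny invented TEI document standing in for a downloaded file
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>'
    '<text><body><p>The <hi>Printer</hi> to the Reader. A ſhort Eſſay.</p></body></text>'
    '</TEI>'
)

def strip_and_modernize(xml_string):
    """Return the text content of the TEI <text> element with long-s modernized."""
    root = ET.fromstring(xml_string)
    text_el = root.find(TEI_NS + 'text')
    if text_el is None:
        raise ValueError('no <text> element found')
    # itertext() walks every text node under <text>, dropping the markup
    stripped = ''.join(text_el.itertext())
    # A plain string replace is enough for a single literal character
    return stripped.replace('ſ', 's')

modernized = strip_and_modernize(sample)
print(modernized)

# makedirs with exist_ok=True is safe to run more than once
out_dir = os.path.join(tempfile.mkdtemp(), 'Bowyer_TCP')
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, 'sample.txt')
# Opening in text mode with an explicit encoding avoids the manual .encode() call
with open(out_path, 'w', encoding='utf-8') as outfile:
    outfile.write(modernized)
```

Note that the header's title is not part of the result, because only the `text` element is read, just as in the cell above.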