# Gathering a Corpus

**<< Previous module: [What Is An Algorithm?](01-AlgorithmsOfResistance-WhatIsAnAlgorithm.ipynb) <<**

*30-60 minutes*

<div class="alert alert-block alert-info">
    <strong>By the end of this module, you should be able to</strong>
    <ul>
        <li>define the term "corpus" as it relates to digital scholarship;</li>
        <li>describe the purpose of a corpus in text/data analysis;</li>
        <li>describe the process of creating corpora, and identify possible resources for locating existing corpora;</li>
        <li>describe the purpose of automated downloading, its benefits and limitations, and ethical considerations when gathering texts via automated downloading;</li>
        <li>identify and implement steps needed to automate downloads from the Internet Archive.</li>
    </ul>
</div>

## Table of Contents

- [What is a corpus?](#corpus)
- [Corpus Structures & Formats](#corpus-formats)
- [How can I find a corpus?](#create-corpus)
- [Automated Downloading](#downloading)
- [How to download from the Internet Archive](#how-to-download)
    - [Option 1: Automated downloading from scratch](#option-1)
    - [Option 2: The Internet Archive's Python Library](#option-2)
- [Resources](#resources)

## What is a Corpus? <a class="anchor" id="corpus"></a>

It turns out there are multiple meanings for the word "corpus." If you're a logophile (word nerd), check out its long entry in the [Oxford English Dictionary](https://www-oed-com.libproxy.lib.unc.edu/). In academic research, though, "corpus" usually refers to:  

<div class="alert alert-block alert-danger">
    <p><strong>A collection of texts</strong>. Typically, a corpus is made up of texts <strong>that are all the same type and are all connected to each other in some way.</strong> They might be</p> 
    <ul>
        <li>books by the same author (such as <a href="https://www.octaviabutler.com/work" target="blank">Octavia Butler's novels</a>);</li>
        <li>magazine articles from the same publication (such as <a href="https://time.com/vault/" target="blank">all the articles ever published in <em>TIME Magazine</em></a>);</li>
        <li><a href="https://dc.lib.unc.edu/cdm/landingpage/collection/sohp" target="blank">interview transcripts</a>;</li>
        <li>emails from a company (such as the <a href="https://www.cs.cmu.edu/~enron/" target="blank">Enron corpus</a>);</li>
        <li>or newspaper articles that cover the same topic (such as <a href="https://newspaperarchive.com/free-newspaper-archives/environmental-science/environment/climate-change-p-1/" target="blank">articles discussing climate change</a>).</li>
    </ul>
    <p>For another perspective, check out Penn State Libraries' <a href="https://guides.libraries.psu.edu/textanalysis" alt="definition of corpus">definition of "corpus"</a>.</p>
</div>

The *On The Books* team works with **3 corpora**: 
- [a corpus](https://onthebooks.lib.unc.edu/laws/all-laws/) of all laws passed in the state of North Carolina between 1866-1967;
- a corpus of laws used as a training set to train algorithms to identify possible Jim Crow laws (to be publicly released later in 2021);
- [a corpus](https://onthebooks.lib.unc.edu/laws/jim-crow-laws/) of all laws from the above corpus that have been identified by scholars and/or machine learning as laws that were or may have been created to enforce [Jim Crow](https://onthebooks.lib.unc.edu/teach/glossary/).

## Corpus Structures & Formats <a class="anchor" id="corpus-formats"></a>

<img src="images/06-corpus-01.jpeg" width="35%" alt="A list of files, each representing one text in a corpus. All files are in the .txt format."  style="float:right; padding-left:20px; margin-right:50px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" title="An example of a corpus" />

So, what does a digital corpus look like?

Typically, a digital corpus is 
- a collection (folder) of files;
- and each text or volume is stored in a separate file.

There are a few **best practices for creating a corpus** that we should keep in mind:
- All files should be of the *same file format*.
- The chosen file format should be *interoperable* (usable by many software programs and operating systems) and *stable* (changes rarely if ever).
- The file format should be *human and computer-readable*.
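As a minimal sketch of the first best practice, the check below confirms that every file in a corpus folder shares one file format. The folder and file names are hypothetical stand-ins created in a temporary directory; they are not the actual *On The Books* files.

```python
import tempfile
from pathlib import Path

# Build a small stand-in corpus in a temporary folder.
# These file names are hypothetical examples.
corpus_dir = Path(tempfile.mkdtemp())
for name in ["1887_laws.txt", "1889_laws.txt", "1891_laws.txt"]:
    (corpus_dir / name).write_text("placeholder text", encoding="utf-8")

# Collect the file extension of every file in the folder.
extensions = {path.suffix for path in corpus_dir.iterdir() if path.is_file()}

# A corpus that follows the best practice has exactly one extension.
same_format = len(extensions) == 1
print("Formats found:", extensions)
print("All files share one format:", same_format)
```

If a stray `.pdf` or `.docx` file ended up in the folder, `extensions` would contain more than one entry and the check would report `False`.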

<div class="alert alert-block alert-warning">
    <p>In the screenshot above, we show a selection from an <em>On The Books</em> corpus.</p>
    <ul>
        <li>What do you notice about each file's name?</li>
        <li>How is the corpus separated into individual texts?</li>
        <li>Which file format is being used? What are the benefits of that file format?</li>
    </ul>
</div>

In *On The Books*, our corpus is made up of volumes separated by year. Each file is named accordingly to keep the files in chronological order and to help researchers locate a specific volume. All files are stored in the [.txt (text)](https://en.wikipedia.org/wiki/Text_file) and [.xml (eXtensible Markup Language)](https://en.wikipedia.org/wiki/XML) file formats. 

**.txt** is a format that any text editor or word processor can read, regardless of which software version or operating system (Mac, Windows, etc.) is used. The .txt format does not apply any formatting (such as headings, bold or italics, font, or layout) to its content. It contains only **plain text**, like the text in the screenshot below.

<img src="images/06-corpus-02.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="A snippet from a plain text file in the On The Books corpus." title="A snippet from a plain text file in the On The Books corpus." />

**.xml** is a textual data format that we can use to structure and identify data within a document to make that text machine readable. In .txt files, a computer doesn't know the difference between text that forms a chapter title and text that forms section content. In XML, we can use tags to help a computer "read" the structure of a document. The tags below can be identified by their surrounding angle brackets `< >`. `<chapter_text>` is the tag for a chapter title. The title of a chapter is found between two opening and closing tags, `<chapter_text>...</chapter_text>`. Can you find the chapter title in the text below?

<img src="images/06-corpus-12.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="A snippet from an XML file in the On The Books corpus." title="A snippet from an XML file in the On The Books corpus." />
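To see how tags help a computer "read" structure, here is a minimal sketch using Python's built-in `xml.etree.ElementTree` module. Only the `<chapter_text>` tag comes from the explanation above; the `<chapter>` and `<section_text>` tags and the wording inside them are made up for illustration and may not match the actual *On The Books* files.

```python
import xml.etree.ElementTree as ET

# A hypothetical XML snippet. Only <chapter_text> is taken from the
# module text; the other tags and all of the wording are invented.
sample_xml = """
<chapter>
    <chapter_text>An act to incorporate a company.</chapter_text>
    <section_text>Section 1. Placeholder section content.</section_text>
</chapter>
""".strip()

root = ET.fromstring(sample_xml)

# Ask for the text inside the first <chapter_text> tag.
chapter_title = root.find("chapter_text").text
print(chapter_title)
```

Because the title sits inside its own tag, the computer can pull it out without guessing where the title ends and the section content begins.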

Most text and XML files (such as those you create on your computer today) will be encoded using the [UTF-8 (Unicode Transformation Format) standard](https://en.wikipedia.org/wiki/UTF-8). If you're working with text files created more than 10 years ago, these may be encoded using a different standard. 

**What do we mean by *encoded*?** You can learn more about this by [reading the Wikipedia page](https://en.wikipedia.org/wiki/UTF-8), but basically "encoded" or "encoding" refers to a standard that all software programs and operating systems use to identify characters in human alphabets and numeric systems. This is necessary so that programs such as Microsoft Word, Pages, TextEdit, Notepad, and Open Office all refer to the letter "A", for example, in the same way. 

Remember that at their very core, computers operate in [binary code](https://en.wikipedia.org/wiki/Binary). If you need a refresher on this, [listen to Admiral Grace Hopper explain the concept of binary in computing](http://libproxy.lib.unc.edu/login?url=https://fod.infobase.com/PortalPlaylists.aspx?wID=102632&xtid=70610).

So, in a nutshell, **UTF-8 encoding ensures that when you type the letter "A" into any of these programs, all of them read** `01000001` **and show on the screen or read aloud** `A`.

<div class="alert alert-block alert-warning">
    <p>If you're curious, you can</p>
        <ul>
            <li>view <a href="https://utf8-chartable.de/unicode-utf8-table.pl?utf8=bin" alt="A table of UTF codes and their binary equivalents.">a table of common Latin characters and their UTF and binary codes</a>, or</li>
            <li>type your name into a <a href="https://codebeautify.org/text-to-binary" alt="a text to binary converter">text to binary converter</a> to learn your name in binary.</li>
    </ul>
</div>

Here's some of the text above in binary:

`An act to incorporate Blackwells Durham Co-operative Tobacco Company.`

`01000001 01101110 00100000 01100001 01100011 01110100 00100000 01110100 01101111 00100000 01101001 01101110 01100011 01101111 01110010 01110000 01101111 01110010 01100001 01110100 01100101 00100000 01000010 01101100 01100001 01100011 01101011 01110111 01100101 01101100 01101100 01110011 00100000 01000100 01110101 01110010 01101000 01100001 01101101 00100000 01000011 01101111 00101101 01101111 01110000 01100101 01110010 01100001 01110100 01101001 01110110 01100101 00100000 01010100 01101111 01100010 01100001 01100011 01100011 01101111 00100000 01000011 01101111 01101101 01110000 01100001 01101110 01111001 00101110`
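If you'd like to reproduce this yourself, the short sketch below converts the first two words of that sentence into binary using Python's built-in string tools. It is one possible way to do the conversion, not the method used to produce the example above.

```python
text = "An act"

# Encode the text as UTF-8 bytes, then show each byte as 8 binary digits.
binary = " ".join(format(byte, "08b") for byte in text.encode("utf-8"))
print(binary)
```

The six groups printed should match the first six groups in the binary example above. Try changing `text` to your own name.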

There's more to this, but the main thing to know for now is that UTF-8 is the most widely used and most comprehensive encoding. Without a consistent encoding, the machine readable text we create may not *actually* be consistently legible to the various analytical and presentational tools we might want to apply to our corpus. It's especially important to make sure your corpus is encoded in UTF-8 if you are working in languages that use non-Latin alphabets and/or languages that use [diacritics](https://en.wikipedia.org/wiki/Diacritic) -- letters with accents such as `é`, `ñ`, `ü`, etc.
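Here is a small sketch of what can go wrong with diacritics when a file is read using a different standard than the one it was saved in. The word and the choice of Latin-1 as the "wrong" standard are just examples.

```python
word = "señor"

# Encode the word as UTF-8. The letter "ñ" takes two bytes,
# so five characters become six bytes.
utf8_bytes = word.encode("utf-8")
print(len(word), len(utf8_bytes))

# Decoding with the matching standard returns the original word.
print(utf8_bytes.decode("utf-8"))

# Decoding with a different standard (Latin-1 here) garbles the "ñ".
garbled = utf8_bytes.decode("latin-1")
print(garbled)
```

A search for "señor" in the garbled version would find nothing, which is why it pays to confirm the encoding of older files before analyzing them.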

## How can I find a corpus? <a class="anchor" id="create-corpus"></a>

Depending on your discipline and research question, there are [some corpora that already exist and are ready to use](https://www.english-corpora.org/). If you study linguistics, this [corpus survey](https://www.lancaster.ac.uk/fass/projects/corpus/cbls/corpora.asp#_Toc92298867) may be useful. If you are a scholar of English-speaking cultures in the Early Modern period, you are blessed with the riches of such projects as the [Old Bailey Online](https://www.oldbaileyonline.org/) or [EEBO](https://www.english-corpora.org/eebo/) (available via institutional access only). There are also corpora of scientific information such as [MEDLINE/PubMed](https://www.nlm.nih.gov/databases/download/pubmed_medline.html). More likely, though, you may need to search a bit to see if there is a corpus already created that you can work with. A good way to do this is to **ask your librarian** for help searching.

Even more likely, you'll need to **create your own corpus by gathering texts**. If these materials are in the [public domain](https://en.wikipedia.org/wiki/Public_domain) or licensed for [open access](https://en.wikipedia.org/wiki/Open_access), you may be able to find them in repositories such as the [Internet Archive](https://archive.org/), [HathiTrust](https://www.hathitrust.org/), [Digital Public Library of America](https://dp.la/), [Chronicling America](https://chroniclingamerica.loc.gov/), and [Project Gutenberg](http://www.gutenberg.org/).

**Whether a corpus of texts is *available* and *ready to use* will depend on several factors**, including whether
- the texts are in the public domain or are licensed for open access;
- there are other scholars in your field interested in your subject and in computational research practices;
- an institution has invested resources in digitizing the texts and making them accessible.

<div class="alert alert-block alert-danger">
<p><strong>This means that other scholars and institutions need to have considered the materials you are interested in <em>valuable enough</em>, and need to have had <em>enough resources</em>, to do all of the processing work that goes into creating a digital corpus ready for computational analysis.</strong></p>
<p>The subject of what is included in and excluded from archives, both digital and analog, is much discussed not only among librarians and archivists but also among digital humanists who seek to disrupt and rebuild "the canon" as a more inclusive, democratic space. If you'd like to read more about this, check out our bibliography in the <a href="#decolonizing">Resources section</a> below.</p>
</div>

This series of modules has been created assuming that some parts of the work to create a digital corpus have been done (e.g. your corpus has been digitized but not made computer readable) but that you need to complete the process (e.g. making the corpus computer readable for analysis). That said, we'll point to resources that can help you get started if you're beginning from paper.

Let's not sugar coat it...

### Going from print to fully computer readable text takes a lot of work.

<img src="images/LawBooks-feature.png" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="A photograph of North Carolina law volumes on a bookshelf. Source: https://library.unc.edu/page/26/?cat=non" title="A photograph of North Carolina law volumes on a bookshelf." />

When the *On The Books* project began in 2019, it relied on a [2009-11 digitization initiative](http://ncgovdocs.org/aboutcollection.html) that had digitized approximately [2,300 volumes published by the North Carolina state government in the 19th and 20th centuries](https://archive.org/details/ncgovdocs) and made them publicly available for viewing and downloading via the [Internet Archive](https://archive.org/). These texts are [available](https://archive.org/details/lawsresolutionso1887nort/page/776/mode/2up) as high-resolution image and PDF files as well as plain text: 

[<img src="images/06-corpus-03.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of a volume of North Carolina laws shown in PDF format on the Internet Archive" title="Screenshot of a volume of North Carolina laws shown in PDF format on the Internet Archive" />](https://archive.org/details/lawsresolutionso1887nort/page/776/mode/2up)

This sounds great, right? With all this work already done, surely the *On The Books* team could get on with their work of identifying Jim Crow laws? Unfortunately, it wasn't that simple. To attempt this work, the team needed fully computer-readable text files that a computer could use to return *reliable* results for both manual and algorithmic text searches. Let's take a look at why this wasn't possible:

<div class="alert alert-block alert-warning">
    <p><strong>If you haven't already, open a new browser window and navigate to the volume shown above:</strong> <a href="https://archive.org/details/lawsresolutionso1887nort/page/776/mode/2up" alt="URL for a sample volume on the Internet Archive">https://archive.org/details/lawsresolutionso1887nort/page/776/mode/2up</a>.</p>
    <p>When the page loads, <strong>scroll down below the view of the volume itself.</strong> Here, you'll find a bunch of information about the volume and its digitization (metadata) and a list of possible download options.</p> 
    <p><strong>Click the <a href="https://archive.org/stream/lawsresolutionso1887nort/lawsresolutionso1887nort_djvu.txt" alt="link to the plain text version of a volume in the Internet Archive">"FULL TEXT"</a> option below "Download Options" on the right</strong> to preview the plain text version of this volume.</p>
    <p><strong>What do you notice?</strong></p>
</div>

<img src="images/06-corpus-04.jpg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of metadata and download information for a volume of North Carolina laws in the Internet Archive" title="Screenshot of metadata and download information for a volume of North Carolina laws in the Internet Archive" />

Below is a comparison of the plain text available on the Internet Archive (left) and a corrected plain text version of the same text from the *On The Books* corpus (right). Neither of these texts is perfect (notice the red underlined typo "Nerth", which should be "North," in the corrected version), but the right version has noticeably fewer errors than the version on the left.

<div class="row">
    <div class="column">
<img src="images/06-corpus-05.jpeg" width="45%" style="float:left; margin:10px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of plain text from the sample volume in the Internet Archive" title="Screenshot of plain text from the sample volume in the Internet Archive" />
    </div>
    <div class="column">  
<img src="images/06-corpus-02.jpeg" width="45%" style="float:right; margin:10px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of plain text from the sample volume from the On The Books corpus" title="Screenshot of plain text from the sample volume from the On The Books corpus" />
    </div>
</div>

**Why are errors in plain text a problem?** If you still have the plain text from the Internet Archive open, try using `Command + F` or `Control + F` to search for the name "Blackwells Durham" and see what is returned. You may see some mentions of the name, but at least one mention shown in the example above, "BlacknelPs Darliam," will not be returned. So if you are looking for every mention of this company in the 1887 session laws, you might miss some mentions -- these are not *reliable* results.
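The same problem shows up when a script does the searching. In the sketch below, `ocr_line` is a stand-in sentence we wrote around the misread name quoted above; it is not copied from the Internet Archive file. The similarity score comes from Python's built-in `difflib` module and is just one way to compare two strings.

```python
from difflib import SequenceMatcher

# A stand-in for one line of uncorrected text, built around the
# misread name quoted above (the rest of the sentence is assumed).
ocr_line = "An act to incorporate BlacknelPs Darliam Co-operative Tobacco Company."
search_term = "Blackwells Durham"

# An exact search, like Command + F, finds nothing.
found_exactly = search_term in ocr_line
print("Exact match found:", found_exactly)

# A similarity score between 0 and 1 shows the two names are close,
# which is why a human reader can still recognize the company.
similarity = SequenceMatcher(None, search_term, "BlacknelPs Darliam").ratio()
print("Similarity:", round(similarity, 2))
```

A human reader spots the resemblance right away, but an exact search treats the two names as completely different text.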

The *On The Books* team recognized early in the project that they needed to create a much more reliable dataset in order to complete their project. To do this, they went back to the original image files scanned and uploaded to the Internet Archive, reproduced plain text from them, and undertook an extensive clean-up effort using both human and computer labor.

To begin this process, the team needed to download all of the corpus files from the Internet Archive. **The rest of this module will show you how to download files from the Internet Archive using the Python programming language.** Modules that follow will walk you through the processes of transforming image files into plain text and cleaning up that text to create a reliable dataset.

## Automated Downloading <a class="anchor" id="downloading"></a>

If you are working on a project that doesn't already have its own digitized, computer-readable corpus -- or maybe it does, but the corpus is available in a way that would make downloading everything manually very time consuming -- then you might want to try a programmatic approach to gathering materials in your corpus. 


Some platforms, such as the [Internet Archive](https://archive.org/) and [Project Gutenberg](http://www.gutenberg.org/), now offer custom tools that make these programming tasks more efficient. For an example of how to use IA's Python Library, see [Option 2](#option-2). Or take a look at [Project Gutenberg's Python package](https://pypi.org/project/Gutenberg/) for another example. Many platforms, like [Chronicling America](https://chroniclingamerica.loc.gov/about/api/), provide an **API (Application Programming Interface)** that lets researchers access a bunch of information at once without having to click through many webpages.


But, in some cases, you may find that you need to write a download script specific to your context: the files are stored on multiple webpages, you need to get file URLs out of a webpage's HTML, you need to convert the files to a usable format on download, and so on. There are a number of possible ways to create a script that fits each context, and there are a bunch of different Python modules that can help you make it happen. There are also [lots of resources](#resources) if you want to dig deeper into automated downloading and related practices such as web scraping and data mining with Python.

### What is automated downloading?

<div class="alert alert-block alert-danger">
<p>By "automated downloading," we mean using <strong>a script to programmatically download many files at once from the web</strong>--as opposed to manually clicking each file to select and download it. This method is related to <a href="https://en.wikipedia.org/wiki/Web_scraping" alt="Wikipedia page for web scraping">web scraping</a>.</p>
</div>
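To make the idea concrete, here is a minimal sketch of what such a script can look like. It is an illustration under our own assumptions, not the script used later in this module: the function names, the example URLs, and the one-second pause are all hypothetical. The demonstration at the bottom uses a stand-in `fake_fetch` function so that nothing is actually requested from the web.

```python
import tempfile
import time
from pathlib import Path


def fetch_with_requests(url):
    """Download one URL over the network and return its raw bytes."""
    import requests  # third-party; only needed for real downloads

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop if the server reports an error
    return response.content


def download_all(urls, folder, fetch=fetch_with_requests, pause_seconds=1.0):
    """Save each URL in `urls` to `folder`, one file at a time."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last part of the URL as the local file name.
        target = folder / url.rstrip("/").split("/")[-1]
        target.write_bytes(fetch(url))
        saved.append(target)
        # Pause between requests so we don't overload the server.
        time.sleep(pause_seconds)
    return saved


# Offline demonstration: a stand-in fetch function returns fake bytes,
# so nothing is actually requested from the web.
def fake_fetch(url):
    return ("contents of " + url).encode("utf-8")


example_urls = [
    "https://example.org/files/page_0001.txt",
    "https://example.org/files/page_0002.txt",
]
saved_files = download_all(
    example_urls, tempfile.mkdtemp(), fetch=fake_fetch, pause_seconds=0
)
print([path.name for path in saved_files])
```

The loop does the same thing you would do by hand (open a link, save the file, move to the next one), just without the clicking.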

### Why might you automate downloading files from the web?

Researchers use methods such as automated downloading to gather publicly available historical documents. Businesses use it to learn more about their market. Journalists use it to gather data as part of a story. There are many reasons why you might find that you need to download many files at once from a site that offers files for individual download.

<div class="alert alert-block alert-danger">
    <h3>STOP. Think before you download.</h3>
    <p>Before you jump ahead to learning how you can download many files at once from the Internet Archive, <strong>there are some important questions to ask yourself</strong>:</p>
    <ol>
        <li>Why, or for what purpose, am I seeking out this information?</li>
        <li>Who will be most impacted by my accessing and downloading this information?</li>
        <li>Would downloading many files at once from this, or any site, cause harm, especially to historically oppressed or vulnerable peoples?</li>
        <li>Would it perpetuate any form of oppression?</li>
        <li>Would it contribute to the recognition of and resistance against oppression?</li>
    </ol>
    <p>Think back to our previous discussion of <a href="01-AlgorithmsOfResistance-WhatIsAnAlgorithm.ipynb" target="blank">algorithms of resistance</a>. Likely, your answers to the above questions will not be simply "yes" or "no." There will be nuance, and it will be up to you and your collaborators, mentors, and supervisors, to determine whether gathering information at scale is an ethical approach to answering your research question.</p>
  
</div>

In addition to the **ethical** questions discussed above, there are some important **legal** questions:

1. Is the information I'm hoping to gather publicly available?
2. Is the information <a href="https://guides.lib.unc.edu/scomm/copyrightbasics" alt="UNC Libraries Scholarly Communications Office LibGuide on copyright">copyrighted</a> or <a href="https://guides.lib.unc.edu/scomm/creativecommons" alt="UNC Libraries Scholarly Communications Office LibGuide on Creative Commons licenses">licensed</a> in any way? Is it <a href="https://guides.lib.unc.edu/recman/sensitiverecords" alt="UNC LibGuide on Records Management">sensitive</a>?
3. Would my gathering and use of this information qualify as <a href="https://guides.lib.unc.edu/fair-use" alt="UNC Libraries Scholarly Communications Office LibGuide on Fair Use">fair use</a>?
4. Do I need to seek the information provider's permission?
5. Would my downloading activities violate an information provider's terms of service or privacy policy?

Answers to some of these legal questions may help you answer the ethical questions above.

In *On The Books'* case, the materials we are accessing are in the public domain (not copyrighted), and the Internet Archive has made them freely available for download. One of the institutions that made them available online is the same institution hosting the *On The Books* project. Our aim is to gather these texts for research and educational purposes with a particular focus on identifying laws that were used to oppress people of color for over a century. We used the files downloaded to create a computer-readable corpus that we'll then use to perform various analyses to better understand the corpus' contents.

When thinking about your own information gathering activities, **if you're not sure about any of the questions above, STOP. Contact your institution's <a href="https://library.unc.edu/scholarly-communications/" alt="UNC Libraries Scholarly Communications Office">scholarly communications</a> or <a href="https://library.unc.edu/hub/" alt="UNC Research Hub">digital research</a> experts for advice <em>before</em> you download.**

## How to download from the Internet Archive <a class="anchor" id="how-to-download"></a>

We're sharing 2 ways to use Python to download files from the Internet Archive (IA). These options show you different possible approaches to automated downloading. Which you choose will depend on the stability of your internet connection and the type of files you want to download. 


[**Option 1**](#option-1) demonstrates a method that could be applied to other sites *besides* the IA and gives a more in-depth look at the steps a computer needs to go through to complete a downloading or scraping task. This is *not* the harder option -- we walk you through each step.

[**Option 2**](#option-2) uses a Python tool built by the Internet Archive for this same purpose. It's simpler, but it doesn't show what's happening "under the hood," and it *doesn't* complete the task we need for these modules: to download .jpg image files from items in the *On The Books* corpus. This option is here in case you want to work with other materials or file types in the IA, but it won't help you complete these modules. *And* you need a stable internet connection to complete this option successfully.

### Option 1: Automated downloading from scratch <a class="anchor" id="option-1"></a>

**Go to the download page for files in the 1887 volume of North Carolina laws:** https://archive.org/download/lawsresolutionso1887nort. 

The scanning institution and the Internet Archive have made this volume available in a variety of formats. We need the image files stored in the file ending with `.zip`, a file format that is great if you want to share a lot of files with someone and don't want any of the files to get lost along the way. In this case, though, we want to be able to access individual images inside the .zip file.

Click on "View Contents" next to the .zip file, or click this link: [https://ia600300.us.archive.org/view_archive.php?archive=/11/items/lawsresolutionso1887nort/lawsresolutionso1887nort_jp2.zip](https://ia600300.us.archive.org/view_archive.php?archive=/11/items/lawsresolutionso1887nort/lawsresolutionso1887nort_jp2.zip). You should now see a list of .jp2 files with `.jpg` listed next to each file. The Internet Archive has made it possible for us to preview and download individual images (pages) from our item. These .jpg's are the files we need: 

<img src="images/06-corpus-11.jpeg" width="70%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot of a .jp2 download page on the Internet Archive."/>

We could now start clicking on each file and saving them to our computer, but if you scroll down the page you will see that there are 1210 image files that make up this volume. We may not need all of the images, but it will be easier to download them all and remove the ones we don't need later than to try to identify the ones we need one at a time online. **This is why we need to use a script: to get these files without spending hours clicking to download each file.**

#### 1. Let's import some modules.
We'll need some modules from Python's standard library and some from outside of it. These modules are going to help us tell Python to do the following things:
- Manage files and folders on our computer, including the folder where we'll save our downloads. ([os module](https://docs.python.org/3/library/os.html) -- part of Python's standard library)
- Contact a webpage and access the content we're looking for (in this case, image files). ([requests module](https://pypi.org/project/requests/) -- not part of Python's standard library)
- Read the information we receive, which is stored in bytes. ([io module](https://docs.python.org/3/library/io.html) -- part of Python's standard library)
- Work with each image file so that we can save it to a place of our choosing. ([Image module](https://pillow.readthedocs.io/en/stable/reference/Image.html) -- part of the PIL, or Python Imaging Library, which is not part of Python's standard library)

To get started, let's first look at how to run code in Jupyter Notebooks:

<div class="alert alert-block alert-warning">
    <p><strong>How to Run Code in Jupyter Notebooks</strong></p>
    <p>If you haven't run code before in Jupyter Notebooks, the steps are simple. Any code block is formatted differently from the text you've been reading: in a gray box with a <code>code-style font</code> to help you recognize it. Here's how to run the code below:</p>
    <ol>
        <li>Click your mouse inside the code block. A green box with a thick bar will appear around the code block.         <img src="images/06-corpus-13.jpeg" width="70%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screen capture showing a selected code block." /></li>
        <li>In the menu bar at the top of the page, click the "Run," or play button, icon. Alternatively, press <code>Shift + Enter</code> on your keyboard.<img src="images/06-corpus-14.jpeg" width="15%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screen capture showing the Run icon in the Jupyter Notebooks menu bar." /></li>       
        <li>The code will run. While code is running, an hourglass icon will appear on your browser tab, and you will see a <code>*</code>, or asterisk, in <code>In [*]:</code>. When it's finished, you'll see a number appear in <code>In [1]:</code> to the left of the code block.</li>
    </ol>    
    <p>If you run code that provides a specific output (e.g. the result of a math equation or a statement confirming a successful run of the script), you'll see that output below the code block.</p>
    <img src="images/06-corpus-15.jpg" width="70%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screen capture showing output below a code block." />
    <p>If it's helpful, <a href="images/06-corpus-runcode.mp4" target="blank">here's a short video clip showing the steps above</a>.</p>
</div>

Now, run the following code to import the needed libraries and modules. *To understand what each line of code is doing, read the green text beginning with a `#`.*

In [None]:
# os (Operating System) is a module that helps us manage files and folders
# on our computers. It's one of the most basic and most helpful modules.
import os

# Requests helps us call up a webpage and access the content on that page.
import requests

# The Image module in PIL, or the Python Imaging Library, helps us work with image files.
# Here we use "from" to tell the computer to look in a different library 
# (not the standard library). We then "import" only one of the many modules in that library.
from PIL import Image

# io, short for input/output, can help us read information stored in bytes. 
# Hang on -- you'll see why.
from io import BytesIO

print("Libraries successfully imported.")

#### 2. Move to a specific folder where we'll store the files we download.

First, we'll specify where the downloaded images will be saved. Right click on the Jupyter Notebook symbol at the top of the page, and select "Open in a New Tab" to view all of the files accompanying these modules. 

<img src="images/06-corpus-16.jpg" width="90%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screen capture showing the Jupyter Notebook menu with a red arrow pointing to the Jupyter logo." />

**If you are running Jupyter Notebooks locally** on your computer (not in Binder), you will see your computer account's home folder. You'll need to navigate to the folder where you stored these modules. You'll be in the right folder when you see all of the files and folders as they are shown in the image below.

**If you are running Jupyter Notebooks in Binder**, you should see all of the files in these modules as they are shown below:

<img src="images/06-corpus-18.jpg" width="90%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screen capture showing the Jupyter Notebook menu with a red arrow pointing to the Jupyter logo." />

Run the following code to get set up in the folder where we'll store all of the images we download.

In [None]:
# We're about to download a bunch of images.
# Let's move to a place where we can store them 
# together using the os library. This folder will show
# up wherever you've stored this Jupyter Notebook.
path = 'jpg_output'

# Let's also move ourselves into that folder 
# before we get going.
os.chdir(path)

# And let's get the current file path and print it
# so that we can be sure we've successfully moved:
new_path = os.getcwd()
print("I am currently in the", new_path, "file path.")
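
If the `jpg_output` folder does not exist yet, `os.chdir` will stop with a `FileNotFoundError`. Here is a minimal sketch of one way to guard against that, assuming you are happy for Python to create the folder for you. The helper name `ensure_folder` is our own invention, not part of any library.

```python
from pathlib import Path

def ensure_folder(path):
    """Create the folder if it is missing and return it as a Path."""
    folder = Path(path)
    # exist_ok=True means no error is raised if the folder is already there.
    folder.mkdir(parents=True, exist_ok=True)
    return folder

output_folder = ensure_folder('jpg_output')
print("Images can be saved in", output_folder.resolve())
```

If you try this, run it *before* moving into the folder; running it from inside `jpg_output` would create a second `jpg_output` nested within the first.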

#### 3. Tell the computer which Internet Archive item we want to collect image files from and where to begin in the file sequence.

We're going to set this up as if we might want to get the image files from multiple items on the IA. Assigning the identifier for our current item to a variable, `id`, will help make this easier. You can change the identifier later if you'd like to try running this script with another item.

<div class="alert alert-block alert-warning">
<p>On any item page (such as <a href="https://archive.org/details/lawsresolutionso1887nort/page/776/mode/2up" target="blank">the example we are using</a>), you can find the item's identifier, a unique set of characters that the Internet Archive uses to find and show the item, in several places:</p>
    <p>You can find it in the <em>metadata</em>, the information about the item.</p>
    <img src="images/06-corpus-17.jpg" width="90%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screen capture showing an Internet Archive identifier in the item metadata." />
    <p>You can also find it in the item's URL where it's bolded below:</p>
    <br/>
    <code>https://archive.org/details/<strong>sessionlawsresol1955nort</strong></code>
    <br/>
</div>
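
Because the identifier always follows `/details/` in an item's URL, you could also let Python pull it out for you. The following is a small sketch using the standard library's `urllib.parse`; the function name `identifier_from_url` is hypothetical, and it assumes the URL follows the `https://archive.org/details/...` pattern shown above.

```python
from urllib.parse import urlparse

def identifier_from_url(url):
    """Return the item identifier from an archive.org 'details' URL."""
    # The path looks like /details/<identifier>/optional/extra/parts
    parts = urlparse(url).path.strip('/').split('/')
    if len(parts) < 2 or parts[0] != 'details':
        raise ValueError("Not an Internet Archive item URL: " + url)
    return parts[1]

print(identifier_from_url('https://archive.org/details/sessionlawsresol1955nort'))
print(identifier_from_url('https://archive.org/details/lawsresolutionso1887nort/page/776/mode/2up'))
```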

Next, run the following to assign the volume's identifier to a variable ([a container for holding information in Python](https://www.w3schools.com/python/python_variables.asp)), called `id`.

The other variable we'll create, `fileCount`, is going to help us *iterate* through each image file in the list. It will help us start with the image at the top of the list and download each subsequent image without repeating or skipping ahead. `fileCount` must begin with 0 because the image files for this item are numbered starting from `0000` (like Python, the list begins counting at 0 instead of 1); if we began with 1, we would begin downloading at the *second* file in the list.

Note that this script will not return any output.

In [None]:
# Assign a variable to an IA item identifier. 
# Change this identifier to try this script with something else.
id = 'sessionlawsresol1955nort'

# Start from the first file in the list.
fileCount = 0

#### 4. Tell the computer to go to a specific webpage and get each .jpg file linked from that page.

There's a lot that's happening here, so we'll break it down in the comments:

In [None]:
# In order to get every .jpg file, we need to run a loop.
# Basically, we're telling the computer to run this code 
# once for every image in the list until our fileCount reaches 
# a certain number. 

# Right now, we're telling the computer to run the loop
# until the fileCount gets to 24. We've done this to avoid
# tying up your computer and filling your storage.

# So when you run this code, you'll only download the first 
# 25 image files. If you want to get more, change this number
# **and** change the fileCount number above. (For example,
# after you run this the first time, change fileCount to 25
# and change '25' below to '50' or higher.) This will ensure that
# you don't keep downloading the same files over and over, and
# it gives you some control over how many files you download at once.
while fileCount < 25:
    
    # We use a try statement here as a kind of test. 
    # We're telling the computer to 'try' to run the code below. 
    # If it doesn't work, it will jump down to the except statement 
    # below.
    try:
        
        # This next step uses the fileCount variable and the 
        # str.zfill function to create the 4-digit number of each 
        # image we will download. The str.zfill function pads 
        # strings (groups of characters) in a series by prepending 0s. 
        # The following will produce a number that is 4 characters 
        # long and that ends with the current fileCount value:
        fileNumber = str(fileCount).zfill(4)
        
        # The url variable below creates a url going to the 
        # .jpg file's location using both the id and fileNumber 
        # variables. We create it this way so that the url will 
        # change for each image. Here's an example of what one of 
        # those urls looks like (sorry it's so long):
        # https://archive.org/download/lawsresolutionso1887nort/lawsresolutionso1887nort_jp2.zip/lawsresolutionso1887nort_jp2%2Flawsresolutionso1887nort_0000.jp2&ext=jpg
        url = 'https://archive.org/download/' + id + '/' + id + '_jp2.zip/' + id + '_jp2%2F' + id + '_' + fileNumber + '.jp2&ext=jpg'
        
        # Here's where we use the requests module to call up
        # the webpage at the url we specified above.
        r = requests.get(url)
        
        # Next we use a combination of PIL, io, and requests 
        # to get access to the image and assign image data to the 
        # i variable.
        i = Image.open(BytesIO(r.content))
        
        # Now we need to create a file name for the image.
        # So that we know which image we've downloaded,
        # it should match the fileName on the IA webpage.
        # The statement below will produce file names that look like
        # sessionlawsresol1955nort_0001.jpg
        fileName = id + '_' + fileNumber + '.jpg'
        
        # Next, we'll save the image and 
        # name it using the fileName we created above.
        i.save(fileName)
        
        # We'll tell the computer to print a message when
        # all of the above is done so that we know it has
        # successfully downloaded an image.
        print(fileName, " downloaded.")
        
        # This little line might be the most important:
        # We now need to change fileCount to make it 1 larger
        # than it was (from 0 to 1 and so on). This will help
        # the computer know to look for the next file in the list.
        fileCount += 1
    
    # We've set up this except statement to handle an error.
    # In this case, the error will likely be that there are no more
    # images to download, but the loop hasn't finished.
    # When the 'UnidentifiedImageError' occurs (we reach it through
    # the Image module we imported earlier), the except statement 
    # tells the computer to print text letting you know that there 
    # aren't any more images available on the page. 'break' then stops 
    # the loop so that it doesn't keep retrying the same missing file.
    except Image.UnidentifiedImageError:
        print('No more files to download.')
        break

If you haven't already, run the above script. For each file that is downloaded you'll see a message above that includes the file name and "downloaded", e.g. `sessionlawsresol1955nort_0000.jpg  downloaded.` Take a stretch break. The script will be finished after it downloads the image numbered "24" and the hourglass has disappeared from your browser tab.
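
To see how the loop turns `fileCount` into a web address without downloading anything, here is a small sketch that repeats the URL-building step on its own. The function name `build_image_url` is ours; the URL pattern is the one used in the script above.

```python
def build_image_url(item_id, file_count):
    """Build the download URL for one page image of an IA item."""
    # zfill(4) pads with leading zeros: 7 becomes '0007'.
    file_number = str(file_count).zfill(4)
    return ('https://archive.org/download/' + item_id + '/' + item_id
            + '_jp2.zip/' + item_id + '_jp2%2F' + item_id + '_'
            + file_number + '.jp2&ext=jpg')

# Print the addresses for three different file numbers.
for count in (0, 7, 24):
    print(build_image_url('sessionlawsresol1955nort', count))
```

Printing the URLs like this is a handy way to check your work: paste one into your browser and confirm that it leads to the image you expect.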

<div class="alert alert-block alert-warning">
    <p><strong>Why is downloading one image at a time recommended over downloading an entire .zip file?</strong></p>
    <p>Imagine you attempt Option 2 below and try to download one big file all at once. Then 5 minutes after you started the script, your internet goes down. If the file wasn't finished downloading, then you'll find that you don't have a file yet (or you'll have a partial, unusable file), and you'll need to start the script over again. Imagine having to do that restarting a lot.</p>
    <p>When we break down such big files into smaller individual image files, we reduce the risk of data loss--if our internet goes down, we'll likely have some files already downloaded. Once our internet comes back up, we can start the process from where the computer left off--we don't have to go back to the beginning. It may take us a little longer with this specific use case, but it's a best practice you should take with you into all your future work with files in Python.</p>
</div>
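
Picking up "where the computer left off" can itself be automated. The sketch below is one possible approach, not part of the script above: a hypothetical helper, `next_file_count`, looks in a folder and returns the first file number that hasn't been saved yet. It assumes the files are named the way our script names them, and it uses a temporary folder with empty pretend files so that you can run it safely.

```python
import os
import tempfile

def next_file_count(folder, item_id, limit):
    """Return the first file number below `limit` that is not yet saved in `folder`."""
    for count in range(limit):
        file_name = item_id + '_' + str(count).zfill(4) + '.jpg'
        if not os.path.exists(os.path.join(folder, file_name)):
            return count
    # Every file up to the limit is already there.
    return limit

# Demonstration with a temporary folder holding three pretend downloads.
with tempfile.TemporaryDirectory() as demo_folder:
    for count in range(3):
        name = 'sessionlawsresol1955nort_' + str(count).zfill(4) + '.jpg'
        open(os.path.join(demo_folder, name), 'w').close()
    resume_at = next_file_count(demo_folder, 'sessionlawsresol1955nort', 25)
    print("Resume downloading at file number", resume_at)
```

The number this returns could be used as the starting value for `fileCount` instead of typing it in by hand.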

#### 5. Check your jpg_output folder. What do you find?
If you didn't change any of the code on the first run, you should see the first 25 images of our example item downloaded from the IA. Don't worry, you don't need to download any more--we've provided all the files you need to complete the steps in the next module.


#### 6. Choose your own adventure:

<div class="alert alert-block alert-success">
    <p><em>You're ready to move on to the next module, <a href="03-WhatIsOCR.ipynb" target="blank">What is OCR?</a>.</em></p> 

<p>OR, if you'd like to try downloading more images, go back to the code above and change the fileCount and number in the <code>while</code> statement to match the additional files you'd like to download.</p>

<p>OR read on to learn an alternative way to download files from the Internet Archive.</p>
    </div>

### Option 2: The Internet Archive's Python Library <a class="anchor" id="option-2"></a>

If you went through the steps in [Option 1](#option-1), you learned that a **library** in programming is a collection of functions often organized into packages and modules. Libraries that go beyond Python's standard library are created for many reasons, but most often they exist to do something that Python's creators didn't include in the standard library *or* they are created to work with a specific platform. 

We've already spent time looking at how to download individual files from the Internet Archive, and if you tried [Option 1](#option-1) you saw one way we can use Python to get many files far more quickly than if we were to do so manually (clicking each link to download 1 file at a time). The Internet Archive has, however, created its own [Python library](https://archive.org/services/docs/api/internetarchive/) to make programmatic access to its repository *even easier*. 

**There are a couple of caveats with attempting this option:**

- Only try this section if you have a stable internet connection. The .pdfs and .zip image folders can be very large (250-500MB) and may take 10-15 minutes or much longer to download. (Take a look at this [list](https://archive.org/download/lawsresolutionso1887nort) to see file sizes if you don't believe us.)


- All of the North Carolina State Law book image files uploaded to the Internet Archive are available via this Python Library *only* in a special image format, .jp2, that many computers cannot read. For this reason, we recommend following [Option 1](#option-1) to get the image files needed to complete these modules. 

That said, if you plan to continue working with the Internet Archive's repository, getting to know the IA Python Library could be useful. Try out the following to see how it works.

#### 1. First, we need to import a tool (module) from the Internet Archive's Python Library. Run the following line to get started:

In [None]:
# Find the Internet Archive's Python Library ("internetarchive") 
# and import (or "check out") the "download" function.
from internetarchive import download

#### 2. Next, we'll use the IA's [download function](https://archive.org/services/docs/api/internetarchive/quickstart.html#downloading) to find and download a specific file. 
To do this, we need a file's *identifier*, a string of letters and numbers *unique* to a specific file on the Internet Archive. As in [Option 1](#option-1), you can find the identifier in several places:

- In the file's URL, it will come between 2 forward slashes (/): **`sessionlawsresol1955nort`** in 


`https://archive.org/details/sessionlawsresol1955nort`


- On the file's webpage, scroll down to find the identifier in the file's *metadata*:

<img src="images/06-corpus-17.jpg" width="90%" style="margin:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot showing where to find the identifier for a volume of North Carolina laws in the Internet Archive." title="Screenshot showing where to find the identifier for a volume of North Carolina laws in the Internet Archive." />

That identifier, `sessionlawsresol1955nort`, goes into the following code that will start the download:

In [None]:
# Find the specific file and download just the PDF format. 
# Use 'verbose=True' to let us know if download process is successful.
download('sessionlawsresol1955nort', verbose=True, glob_pattern='*pdf')

#### And now, we wait!

You'll see a * to the left of the code you just ran, 
<img src="images/06-corpus-07.jpeg" width="15%" style="padding-top:20px; box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" alt="Screenshot showing the asterisk next to a code block that is running in Jupyter Notebooks" />


and an hour glass in your browser tab. <img src="images/06-corpus-08.jpeg" style="box-shadow: 25px 25px 20px -30px rgba(0, 0, 0);" width="10%" alt="Screenshot showing the hour glass at the top of a Jupyter Notebooks browser tab" />


When that process is finished, the hour glass will disappear and the * will be replaced with a number. You will also see a confirmation that the file was downloaded:

<img src="images/06-corpus-09.jpg" width="90%" style="padding-top:20px; box-shadow: 25px 25px 20px -25px rgba(0, 0, 0);" alt="Screenshot showing that a download has completed." />

#### Wait, where is my download?

You'll find the new .pdf file in its own folder titled `sessionlawsresol1955nort` either in `jpg_output` if you worked through Option 1 above *or* in the main modules folder.

#### Want a different file type?

We just downloaded a .pdf -- but if you want a different file type, you can change the extension in `glob_pattern='*pdf'` in the code above and rerun it: simply replace `pdf` with another extension, such as `txt` or, if available, `jpg`.
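
If you're wondering what the `*` in `'*pdf'` does, it is a wildcard that stands for "any characters." Here is a small illustration using `fnmatch` from Python's standard library. We are assuming that `glob_pattern` is matched against file names in the same shell-style way; the file names in the list are hypothetical examples, and `matching_files` is our own helper.

```python
from fnmatch import fnmatch

# Hypothetical file names, similar in shape to those attached to an IA item.
file_names = [
    'sessionlawsresol1955nort.pdf',
    'sessionlawsresol1955nort_djvu.txt',
    'sessionlawsresol1955nort_jp2.zip',
]

def matching_files(names, pattern):
    """Return the names that match a shell-style pattern such as '*pdf'."""
    return [name for name in names if fnmatch(name, pattern)]

print(matching_files(file_names, '*pdf'))
print(matching_files(file_names, '*txt'))
```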

#### Not sure which file types are available?

If you want to know which file types are available, you can use the following code to get a list of all files available. If you're using this code with a file other than our sample `sessionlawsresol1955nort` file, you may see an `IndexError` at the end of the code output. That's OK! It just means that the item has fewer than `17` files, which is the total number of files uploaded for `sessionlawsresol1955nort`. Try changing that number `17` in `range` below to see what happens.

In [None]:
# Find the Internet Archive's Python Library ("internetarchive") 
# and import ("check out") the "get_item" function.
from internetarchive import get_item

# Find the specific file.
item = get_item('sessionlawsresol1955nort')

# Loop through the list of filenames 17 times. The for loop
# sets "number" to 0, then 1, then 2, and so on up to 16,
# so we don't need to set up or increase a counter ourselves.
for number in range(17):
    
    # Use the current number to get one file name from the list.
    filename = item.files[number]['name']
    
    # Before going back to the beginning of the loop, 
    # print the current filename.
    print(filename)
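
A list of file names tells you which file *types* are available only if you read through every extension yourself. As a rough sketch, the code below counts the extensions for you. So that it runs without an internet connection, it uses a hypothetical stand-in list shaped like `item.files` (a list of dictionaries that each have a `'name'`); with a real item you would pass `item.files` instead. The helper `count_extensions` is our own.

```python
import os
from collections import Counter

# A stand-in for item.files: a list of dictionaries that each have a 'name'.
# These entries are hypothetical examples, not the item's real file list.
sample_files = [
    {'name': 'sessionlawsresol1955nort.pdf'},
    {'name': 'sessionlawsresol1955nort_djvu.txt'},
    {'name': 'sessionlawsresol1955nort_djvu.xml'},
    {'name': 'sessionlawsresol1955nort_meta.xml'},
]

def count_extensions(files):
    """Count how many files there are of each extension."""
    extensions = [os.path.splitext(f['name'])[1] for f in files]
    return Counter(extensions)

# Looping over the list directly means we never need to know its length.
for extension, total in count_extensions(sample_files).items():
    print(extension, total)
```

Because this loops over the list itself rather than over `range(17)`, it cannot run past the end of the list and cause an `IndexError`.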

#### What if I want to download files from multiple items in the Internet Archive?
The above code just shows how to get files from one item in the IA at a time. If we want to get files from multiple items, we can use a **list** of identifiers. 

You can replace the identifiers in `' '` below to get other files. You can also change the file type in `glob_pattern`:

In [None]:
# Here is a list of identifiers for items we might want to download from IA.
file_list = ['sessionlawsresol1955nort','sessionlawsresol1959nort','sessionlawsresol1961nort']

# Begin a loop that will download each file from the list above.
for file in file_list:
    # Find the specific file from the list above 
    # and download just the PDF format. 
    # Use 'verbose=True' to let us know if download process is successful.
    download(file, verbose=True, glob_pattern='*pdf')

**And now we make a cup of tea and wait a little longer.** Maybe take a look at some of the [resources](#resources) below.

If you are running these modules locally, you'll find the files downloaded by the code above stored in their own folders, one per item, either alongside this module or (if you worked through Option 1) in the `jpg_output` folder.

Remember, when the process finishes, you should no longer see an hourglass icon in your browser tab, and there should be a number to the left of the code block above. Open your file folder and check that the downloaded files are there, then open one to see what it looks like.
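
You can also ask Python to do this check. The following sketch assumes, as described above, that each item's files land in a folder named after its identifier. The helper `find_downloads` is hypothetical, and the demonstration uses a temporary folder with one empty pretend PDF so that it runs anywhere.

```python
import tempfile
from pathlib import Path

def find_downloads(base_folder, identifiers, extension='pdf'):
    """Map each identifier to the matching files found in its folder."""
    found = {}
    for identifier in identifiers:
        item_folder = Path(base_folder) / identifier
        if item_folder.is_dir():
            found[identifier] = sorted(p.name for p in item_folder.glob('*.' + extension))
        else:
            # No folder yet means nothing was downloaded for this item.
            found[identifier] = []
    return found

# Demonstration with a temporary folder and one pretend download.
with tempfile.TemporaryDirectory() as demo_folder:
    item_folder = Path(demo_folder) / 'sessionlawsresol1955nort'
    item_folder.mkdir()
    (item_folder / 'sessionlawsresol1955nort.pdf').touch()
    results = find_downloads(demo_folder, ['sessionlawsresol1955nort', 'sessionlawsresol1959nort'])
    print(results)
```

An empty list next to an identifier is a hint that its download didn't finish and may need to be run again.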

<div class="alert alert-block alert-success">
    <strong>Next Steps:</strong> 
    <p>If you've successfully completed the above step, then you're ready to move on to the next module, <a href="03-WhatIsOCR.ipynb" target="blank">What is OCR?</a>, which digs further into how we can take the images we've downloaded and convert them into text files for search, analysis, and other computational activities.</p>
</div>

## Resources <a class="anchor" id="resources"></a>

### Getting Started with Python

- ["Beginner's Guide to Python"](https://wiki.python.org/moin/BeginnersGuide). Python.org.
- ["Python Crash Courses"](https://unc-libraries-data.github.io/Python/). UNC Libraries.

### Automated Downloading & Web Scraping with Python

A selection of titles to get you started:

- Broucke, Seppe vanden, and Bart Baesens. 2018. *Practical Web Scraping for Data Science: Best Practices and Examples with Python*. https://catalog.lib.unc.edu/catalog/UNCb9211730
- Hajba, Gábor László. 2018. *Website scraping with Python: using BeautifulSoup and Scrapy*. https://catalog.lib.unc.edu/catalog/UNCb9383712
- Mitchell, Ryan E. 2015. *Web scraping with Python: collecting data from the modern web*. https://catalog.lib.unc.edu/catalog/UNCb8344855

### Ethics & Legality of Automated Downloading & Web Scraping

We didn't just copy and paste here. These resources dedicate whole sections to legal and ethical concerns surrounding web scraping. These are not to be overlooked:

- Broucke, Seppe vanden, and Bart Baesens. 2018. *Practical Web Scraping for Data Science: Best Practices and Examples with Python*. https://catalog.lib.unc.edu/catalog/UNCb9211730
- Mitchell, Ryan E. 2015. *Web scraping with Python: collecting data from the modern web*. https://catalog.lib.unc.edu/catalog/UNCb8344855

And here are one data scientist's principles:

- Densmore, James. 2017. "Ethics in Web Scraping." *towards data science.* https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01

### Python & the Internet Archive

- "The Internet Archive Python Library." https://archive.org/services/docs/api/internetarchive/

**>> Next module: [What is OCR?](03-WhatIsOCR.ipynb) >>**

*This module is licensed under the [GNU General Public License v3.0](https://github.com/UNC-Libraries-data/OnTheBooks/blob/master/LICENSE). Individual images and data files associated with this module may be subject to a different license. If so, we indicate this in the module text.*