# HathiTrust Research Center (HTRC)

The [HathiTrust Digital Library](https://www.hathitrust.org/) contains over 14 million volumes scanned from academic libraries around the world (primarily in North America). The [HathiTrust Research Center](https://analytics.hathitrust.org/) allows researchers to access almost all of those texts in a few different modes for computational text analysis. 

This notebook walks through getting set up to analyze [HTRC Extracted Features](https://wiki.htrc.illinois.edu/display/COM/Extracted+Features+Dataset) for volumes in HathiTrust in a Jupyter/Python environment. As of April 2017, *Extracted Features* are the most robust way to access in-copyright works from the HathiTrust Library for computational analysis.

For more information on HTRC: 
* [Library text mining guide page on HTRC](http://guides.lib.berkeley.edu/c.php?g=491766&p=3381443)
* [Programming Historian's Text Mining in Python through the HTRC Feature Reader](http://programminghistorian.org/lessons/text-mining-with-extracted-features)

## Installation

To start, we'll need to install a few things:
* Install the *HTRC Feature Reader* to work with Extracted Features: 
```
conda install -c htrc htrc-feature-reader
``` 
or
```
pip install htrc-feature-reader
pip install matplotlib jupyter
```
* Install Rsync to download Extracted Features from HathiTrust:

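  * For Linux (the command below assumes a distribution that uses `yum`, such as CentOS or RHEL; on other distributions, use your own package manager):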
```
yum -y install rsync
```
  * For Mac, you can use [Homebrew](https://brew.sh/) to install Rsync. To install Homebrew:
```
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
```
  * Then, to install Rsync on Mac using Homebrew:
```
brew tap homebrew/dupes
brew install rsync
```

## Add volumes from HTRC

### Finding Volume IDs in HathiTrust

To build your own corpus, you will need to find the volume ID for each volume you'd like to include from the [HathiTrust Library](https://www.hathitrust.org/).

* Search for your book, and copy the URL from the *Limited (Search Only)* or *Full View* links under the work. <img src="files/judith-butler-ht.png">
* The string of characters after the final `/` is your volume ID.
* For example, `mdp.39015070698322` is the volume ID for https://hdl.handle.net/2027/mdp.39015070698322.
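
If you are collecting many volumes, copying IDs by hand gets tedious. The sketch below pulls the volume ID out of a handle URL. It assumes URLs of the `https://hdl.handle.net/2027/...` form shown above and takes everything after `/2027/`, so that IDs which themselves contain a `/` (ark-style IDs, for instance) are not cut short. The function name `volume_id_from_handle` is made up for this example.

```python
def volume_id_from_handle(url):
    """Return the volume ID from a HathiTrust handle URL.

    Assumption: the URL looks like https://hdl.handle.net/2027/<volume ID>.
    """
    _, marker, volume_id = url.partition("/2027/")
    volume_id = volume_id.rstrip("/")
    if not marker or not volume_id:
        raise ValueError("not a HathiTrust handle URL: %r" % url)
    return volume_id


if __name__ == "__main__":
    # The example URL from the list above.
    print(volume_id_from_handle("https://hdl.handle.net/2027/mdp.39015070698322"))
    # prints: mdp.39015070698322

    # An ark-style ID, used here only for illustration.
    print(volume_id_from_handle("https://hdl.handle.net/2027/uc2.ark:/13960/t4qj7f46c"))
    # prints: uc2.ark:/13960/t4qj7f46c
```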

### Rsync the volumes

Now that you've identified the volumes you'd like to use, you can run Rsync to pull down their Extracted Features for use with the HTRC Feature Reader.

First, make your way to the directory where you plan to do your work.

#### Add a single volume
If you're planning to analyze only a few volumes, you can use the following command, replacing `{{volume_id}}` with your own volume ID:
```
htid2rsync {{volume_id}} | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```
#### Add a list of volumes
If you have a text file of volume IDs, one per line, pass it to `htid2rsync` with `--from-file filename`, or just `-f filename`:
```
htid2rsync -f volumeids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```
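
If your IDs start out in a Python list rather than a file, a small helper can write them out in the one-per-line layout that the command above expects. This is a minimal sketch: `write_volume_ids` is a hypothetical name, and the two IDs are taken from the sample volumes used later in this notebook.

```python
from pathlib import Path


def write_volume_ids(volume_ids, filename="volumeids.txt"):
    """Write unique, non-empty volume IDs to a text file, one per line."""
    seen = set()
    cleaned = []
    for volume_id in volume_ids:
        volume_id = volume_id.strip()
        # Skip blank entries and IDs we have already written.
        if volume_id and volume_id not in seen:
            seen.add(volume_id)
            cleaned.append(volume_id)
    Path(filename).write_text("".join(v + "\n" for v in cleaned), encoding="utf-8")
    return cleaned


if __name__ == "__main__":
    ids = ["mdp.39015027272619", "ku01.135", "ku01.135", "  "]
    print(write_volume_ids(ids))
    # prints: ['mdp.39015027272619', 'ku01.135']
```

The resulting `volumeids.txt` lands in the current directory, so run this from the folder where you plan to call `htid2rsync`.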

#### Congressional Record
I've added a text file, congressional_record_ids.txt, to this repo that includes the volume ID for every *Congressional Record* volume that HathiTrust could share with us. 
```
htid2rsync -f congressional_record_ids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```

#### Full 4TB library
It's also possible to work with the entire library (4TB, so beware):
```
rsync -rv data.analytics.hathitrust.org::features/ .
```

You can also use existing lists of public-domain [fiction](http://data.analytics.hathitrust.org/genre/fiction_paths.txt), [drama](http://data.analytics.hathitrust.org/genre/drama_paths.txt), and [poetry](http://data.analytics.hathitrust.org/genre/poetry_paths.txt) (Underwood 2014).

## Loading Extracted Features

In the example below, we have five volume IDs from HathiTrust, which are listed in the file *vol_ids_5.txt*. You can explore those volumes along with me for the rest of the tutorial, or modify the command to use your own list of volume IDs or a single volume ID of your choosing. (If you choose your own volumes here, you will also need to modify the file paths in the next step to point to those files.)


In [None]:
%%bash
htid2rsync -f vol_ids_5.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/

### Working with Extracted Features
All of the code examples below are taken directly or adapted from the [Programming Historian tutorial](http://programminghistorian.org/lessons/text-mining-with-extracted-features) or the [FeatureReader's README file](https://github.com/htrc/htrc-feature-reader).

You'll notice from the output above that the content for each volume is stored in a compressed JSON file, inside a deeply nested directory path. We can import and initialize FeatureReader with file paths pointing to those JSON files (using the full paths from the output above). If you chose to work with your own volumes in the previous step, edit the cell below to use the paths from your own output.

In [None]:
from htrc_features import FeatureReader
import os
paths = [os.path.join('local-folder', 'pst/pairtree_root/00/00/13/40/84/35/000013408435/pst.000013408435.json.bz2'), 
         os.path.join('local-folder', 'ku01/pairtree_root/13/5/135/ku01.135.json.bz2'), 
         os.path.join('local-folder', 'mdp/pairtree_root/39/01/50/27/27/26/19/39015027272619/mdp.39015027272619.json.bz2'), 
         os.path.join('local-folder', 'uc1/pairtree_root/$c/22/51/89/$c225189/uc1.$c225189.json.bz2'), 
         os.path.join('local-folder', 'mdp/pairtree_root/39/01/50/60/56/70/24/39015060567024/mdp.39015060567024.json.bz2')]
fr = FeatureReader(paths)
for vol in fr.volumes():
    print(vol.id, vol.title)
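
The long paths in the cell above follow a pairtree layout, which is why they can be derived from the volume ID alone. The sketch below rebuilds them in plain Python so you can see the pattern. It is an illustration rather than the Feature Reader's own code: `extracted_features_path` is a hypothetical helper, and the character substitutions for IDs containing `:`, `/`, or `.` are my reading of the pairtree convention, so check them against what `htid2rsync` prints for such IDs.

```python
import posixpath

# Assumption: pairtree swaps characters that are awkward in file names.
_PAIRTREE_CLEAN = str.maketrans({":": "+", "/": "=", ".": ","})


def extracted_features_path(volume_id):
    """Build the relative path of a volume's Extracted Features file."""
    # "mdp.39015027272619" -> library "mdp", local ID "39015027272619"
    library, _, local_id = volume_id.partition(".")
    clean_id = local_id.translate(_PAIRTREE_CLEAN)
    # Split the cleaned ID into two-character directory names.
    chunks = [clean_id[i:i + 2] for i in range(0, len(clean_id), 2)]
    filename = "%s.%s.json.bz2" % (library, clean_id)
    return posixpath.join(library, "pairtree_root", *chunks, clean_id, filename)


if __name__ == "__main__":
    for volume_id in ["pst.000013408435", "ku01.135", "uc1.$c225189"]:
        print(posixpath.join("local-folder", extracted_features_path(volume_id)))
```

For the sample volumes, the printed paths match the ones typed out by hand in the `paths` list above.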

Let's try to pull out some more metadata about these titles, using the [Volume object](http://htrc.github.io/htrc-feature-reader/htrc_features/feature_reader.m.html#htrc_features.feature_reader.Volume) in FeatureReader:

In [None]:
# Show the HathiTrust URL, year, and page count for each volume
for vol in fr.volumes():
    print("URL: %s Year: %s Page count: %s " % (vol.handle_url, vol.year, vol.page_count))

In [None]:
# Where were the volumes scanned?
for vol in fr.volumes():
    print("Source institution: %s " % (vol.source_institution))

In [None]:
# Let's focus on the first volume
vol = fr.first()
print(vol.title)
# and count the tokens on each page
tokens = vol.tokens_per_page()

# Show just the first few rows to see what the data looks like
tokens.head()

In [None]:
# We can easily plot the number of tokens on each page of the book
%matplotlib inline
tokens.plot()

Now let's look at some specific pages, using the [Page object in FeatureReader](http://htrc.github.io/htrc-feature-reader/htrc_features/feature_reader.m.html#htrc_features.feature_reader.Page).

In [None]:
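# Step through the volume and stop at the 200th page
# (or at the last page, if the volume is shorter).
for i, page in enumerate(vol, start=1):
    # Same as `for page in vol.pages()`
    if i >= 200:
        break
print(page)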
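
The counting loop above can also be written with `itertools.islice` from the standard library. The helper below is a sketch (the name `nth_item` is made up here) and is demonstrated on plain Python sequences. Like the loop, it returns the last item when the iterable has fewer than `n` items.

```python
from itertools import islice


def nth_item(iterable, n):
    """Return the n-th item (counting from 1), or the last item if there are fewer."""
    item = None
    # islice stops after n items, so `item` ends up holding the n-th one.
    for item in islice(iterable, n):
        pass
    return item


if __name__ == "__main__":
    print(nth_item(range(1, 1001), 200))  # prints: 200
    print(nth_item("abc", 200))           # prints: c
```

Since the cell above iterates over the volume directly with `for page in vol`, `page = nth_item(vol, 200)` should select the same page as the loop.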

In [None]:
print("The body has %s lines, %s empty lines, and %s sentences" % (page.line_count(),
                                                                   page.empty_line_count(),
                                                                   page.sentence_count()))


In [None]:
# Look at the tokens on the page
print(page.tokenlist())

What else can we do?