# HathiTrust Research Center (HTRC)

The [HathiTrust Digital Library](https://www.hathitrust.org/) contains over 14 million volumes scanned from academic libraries around the world (primarily in North America). The [HathiTrust Research Center](https://analytics.hathitrust.org/) allows researchers to access almost all of those texts in a few different modes for computational text analysis. 

This notebook will walk us through getting set up to analyze [HTRC Extracted Features](https://wiki.htrc.illinois.edu/display/COM/Extracted+Features+Dataset) for volumes in HathiTrust in a Jupyter/Python environment. *Extracted Features* are currently (as of August 2017) the most robust way to access in-copyright works from the HT Library for computational analysis.

For more information on HTRC: 
* [Library text mining guide page on HTRC](http://guides.lib.berkeley.edu/c.php?g=491766&p=3381443)
* [Programming Historian's Text Mining in Python through the HTRC Feature Reader](http://programminghistorian.org/lessons/text-mining-with-extracted-features)

## Installation

To start, we'll need to install the *HTRC Feature Reader*, which we'll use to work with Extracted Features:

In [None]:
!pip install htrc-feature-reader

## Adding volumes from HathiTrust

To build your own corpus, you first need to find the volumes you'd like to include in the [HathiTrust Library](https://www.hathitrust.org/). Alternatively, you can access volumes from existing [public HT collections](https://babel.hathitrust.org/cgi/mb?colltype=featured), or use one of the sample datasets included below under the *Sample datasets* heading. To access extracted features from HathiTrust:

* Install and configure [the HT + HTRC mashup](https://data.analytics.hathitrust.org/features/) browser extension.
* Once the extension is running, go to the [HathiTrust Library](https://www.hathitrust.org/), and search for the titles you want to include.
* You can manually download extracted features one result at a time by simply choosing the *Download Extracted Features* link for any item in your search results. Save the .json.bz2 file or files and skip to the next section, *Working with Extracted Features* below to load them into your workspace.
* If you plan to work with a large number of texts, you might choose instead to create a collection in HathiTrust, and then download the Extracted Features for the entire collection at once. This requires a valid CalNet ID. 

### To create a collection:

* [Log in to HathiTrust](https://www.hathitrust.org/shibboleth)
* Change the HathiTrust search tab to *Full-Text* or go to the [Advanced Full-Text search](https://babel.hathitrust.org/cgi/ls?a=page;page=advanced).
![image](files/ht-full-text.png)
* Check the boxes to the left of any search results you want to add to your collection (or select all), and use the *Select Collection* dropdown to *Add Selected* volumes to collections of your own design.
![image](files/judith-butler-ht.png)
![image](files/ht-add-selected.png)
* Choose *My Collections* from the top of the HathiTrust interface, choose your collection, and from the *Download Metadata* button/dropdown choose the TSV option.
![image](files/ht-json.png)
* Open the TSV file and delete every column except the first one, *htitem_id*. Delete the *htitem_id* header row as well, then save the file to your working directory. The result should be a plain list with one volume ID per line.
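
The manual spreadsheet edit in the last step can also be scripted. Here is a minimal sketch, assuming the downloaded TSV has a header row with an *htitem_id* column as described above. The helper `extract_volume_ids`, the sample rows, and the file names in the trailing comment are placeholders for illustration, not part of any HathiTrust tool:

```python
import csv
import io


def extract_volume_ids(tsv_text, column="htitem_id"):
    """Return the non-empty values of `column` from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[column].strip() for row in reader if (row.get(column) or "").strip()]


# Tiny made-up sample standing in for a downloaded collection file.
sample_tsv = (
    "htitem_id\ttitle\n"
    "mdp.39015004788835\tExample title A\n"
    "uc1.b000000000\tExample title B\n"
)

volume_ids = extract_volume_ids(sample_tsv)
print("\n".join(volume_ids))

# With a real download (hypothetical file names):
# with open("collection.tsv", encoding="utf-8") as f:
#     ids = extract_volume_ids(f.read())
# with open("volumeids.txt", "w", encoding="utf-8") as f:
#     f.write("\n".join(ids) + "\n")
```

Reading the column by name rather than by position means the script keeps working even if the column order in the export changes.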

## Loading Extracted Features

Go to the directory where you plan to do your work.

### Add a single volume
If you're planning to analyze only a few volumes, you can use the following command, replacing {{volume_id}} with your own volume ID:
```
htid2rsync {{volume_id}} | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```
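
In this pipeline, `htid2rsync` turns a volume ID into the relative path that `rsync` then fetches. The sketch below approximates that mapping, so that the long paths in the download output are easier to read. It assumes a pairtree-style directory layout and is not the tool's actual code: `id_to_pairtree_path` is a hypothetical helper, it handles only the most common character substitutions, and the layout may differ between releases of the dataset.

```python
def id_to_pairtree_path(volume_id):
    """Approximate the relative path of a volume's Extracted Features file."""
    # The part before the first dot names the contributing library namespace.
    library, local_id = volume_id.split(".", 1)
    # Pairtree-style cleaning of characters that are awkward in file names.
    cleaned = local_id.replace(":", "+").replace("/", "=").replace(".", ",")
    # Split the cleaned ID into two-character directory segments.
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    filename = "{}.{}.json.bz2".format(library, cleaned)
    return "/".join([library, "pairtree_root"] + segments + [cleaned, filename])


print(id_to_pairtree_path("mdp.39015004788835"))
```

For real downloads, rely on `htid2rsync` itself; the sketch is only meant to explain why the directories are nested so deeply.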

### Add multiple volumes
If you have a .txt file with one volume ID per line, pass it to `htid2rsync` with `--from-file filename`, or just `-f filename`:
```
htid2rsync -f volumeids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```

### Sample datasets

#### Complete Novels of Jane Austen (1 volume)
```
htid2rsync mdp.39015004788835 | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```
#### Nigerian Authors (30 volumes)
authors-nigerian.txt includes volume IDs for 30 texts with the Library of Congress subject heading *Authors, Nigerian*. 
```
htid2rsync -f authors-nigerian.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```

#### San Francisco (Calif.) - History (111 volumes)
sf-history.txt includes volume IDs for 111 texts with the Library of Congress subject heading *San Francisco (Calif.) - History*.
```
htid2rsync -f sf-history.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```

#### Congressional Record (1200 volumes)
congressional_record_ids.txt includes the volume ID for every *Congressional Record* volume that HathiTrust could share with us.
```
htid2rsync -f congressional_record_ids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```

#### Full 4TB library
It's also possible to work with the entire library (4TB, so beware):
```
rsync -rv data.analytics.hathitrust.org::features/ .
```

You can also use existing lists of public-domain [fiction](http://data.analytics.hathitrust.org/genre/fiction_paths.txt), [drama](http://data.analytics.hathitrust.org/genre/drama_paths.txt), and [poetry](http://data.analytics.hathitrust.org/genre/poetry_paths.txt) (Underwood 2014).

In the example below, we have five volume IDs on San Francisco history from HathiTrust, listed in the file *vol_ids_5.txt*. You can modify the command to use your own list of volume IDs or a single volume ID of your choosing. (If you choose your own volumes here, you will also need to modify the file paths in the next step to point to those files.)

In [None]:
!rm -rf local-folder/
download_output = !htid2rsync -f vol_ids_5.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
download_output

### Working with Extracted Features
All of the code examples below are taken directly, or adapted, from the [Programming Historian tutorial](http://programminghistorian.org/lessons/text-mining-with-extracted-features) or the [FeatureReader's Readme.md file](https://github.com/htrc/htrc-feature-reader).

You'll notice, from the output above, that the content for each volume is stored in a compressed JSON file, in a rather lengthy file directory. We can import and initialize FeatureReader with file paths pointing to those JSON files (using the full paths from the output above). If you chose to work with your own volumes in the previous step you can edit the cell above and re-run the cells below.
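
Because each file is bzip2-compressed JSON, it can also be inspected with the Python standard library alone, which is handy for a quick look before involving `FeatureReader`. The sketch below writes a tiny made-up volume to a temporary `.json.bz2` file and reads it back. The field names (`features`, `pages`, `seq`, `tokenCount`) are assumptions modelled on the Extracted Features schema, so check them against a real file before relying on them:

```python
import bz2
import json
import os
import tempfile

# Made-up miniature volume; field names are assumptions modelled on the
# Extracted Features schema, not copied from a real file.
sample_volume = {
    "id": "example.0001",
    "features": {
        "pages": [
            {"seq": "00000001", "tokenCount": 3},
            {"seq": "00000002", "tokenCount": 5},
        ]
    },
}


def read_compressed_json(path):
    """Decompress a .json.bz2 file and parse it into Python objects."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


# Write the sample to a temporary .json.bz2 file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.0001.json.bz2")
    with bz2.open(path, "wt", encoding="utf-8") as f:
        json.dump(sample_volume, f)
    volume = read_compressed_json(path)

# Map each page's sequence label to its token count.
page_token_counts = {p["seq"]: p["tokenCount"] for p in volume["features"]["pages"]}
print(volume["id"], page_token_counts)
```

`FeatureReader` does this decompression and parsing for us and adds convenient accessors on top, so the rest of this notebook uses it instead.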

First we'll get all the data filepaths from the output of our command above:

In [None]:
import os

# Keep only the lines of rsync output that name Extracted Features files.
suffix = '.json.bz2'
file_paths = [os.path.join('local-folder', path) for path in download_output if path.endswith(suffix)]
file_paths
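
The filter above works because `rsync -v` prints status lines and directory names alongside the files it transfers, and only the file lines end in `.json.bz2`. Here is a self-contained sketch of the same idea. The sample lines are illustrative only (real `rsync` output will differ in detail), and `feature_paths` is a hypothetical helper name:

```python
import os

SUFFIX = ".json.bz2"


def feature_paths(output_lines, folder="local-folder"):
    """Keep only lines naming Extracted Features files, prefixed with folder."""
    return [
        os.path.join(folder, line.strip())
        for line in output_lines
        if line.strip().endswith(SUFFIX)
    ]


# Illustrative lines only; real rsync output will differ in detail.
sample_output = [
    "receiving incremental file list",
    "mdp/",
    "mdp/pairtree_root/39/01/50/04/78/88/35/39015004788835/mdp.39015004788835.json.bz2",
    "",
    "sent 100 bytes  received 2000 bytes",
]

paths = feature_paths(sample_output)
print(paths)
```

If the resulting list is empty, the download most likely failed, and printing the raw output is the quickest way to see why.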

Now we'll pass these paths to the `FeatureReader` constructor, which will create a `FeatureReader` object:

In [None]:
from htrc_features import FeatureReader

fr = FeatureReader(file_paths)

We can now iterate over the volumes in the `FeatureReader` and print each one's ID and title:

In [None]:
for vol in fr.volumes():
    print(vol.id, vol.title)

Let's try to pull out some more metadata about these titles, using the [Volume object](http://htrc.github.io/htrc-feature-reader/htrc_features/feature_reader.m.html#htrc_features.feature_reader.Volume) in `FeatureReader`. We'll get the HT URL, year, and page count for each volume.

In [None]:
for vol in fr.volumes():
    print("URL: %s Year: %s Page count: %s " % (vol.handle_url, vol.year, vol.page_count))

The `source_institution` tells us which institution contributed each volume:

In [None]:
for vol in fr.volumes():
    print("Source institution: %s " % (vol.source_institution))

Let's take a look at the first volume:

In [None]:
vol = fr.first()
vol.title

The `tokens_per_page` method will give us the number of tokens (words) on each page of the volume:

In [None]:
tokens = vol.tokens_per_page()
tokens.head()

The `metadata` property gives us all the metadata information about the volume:

In [None]:
vol.metadata

We can easily plot the number of tokens on every page of the book:

In [None]:
%matplotlib inline
tokens.plot()

Now let's look at a specific page, using the [Page object in FeatureReader](http://htrc.github.io/htrc-feature-reader/htrc_features/feature_reader.m.html#htrc_features.feature_reader.Page). We'll step through the volume's pages and stop at the 200th (or at the last page, if the volume is shorter):

In [None]:
# Advance through the pages, stopping at the 200th (or the last page, if fewer).
for i, page in enumerate(vol.pages(), start=1):
    if i >= 200:
        break
page

In [None]:
print("The body has %s lines, %s empty lines, and %s sentences" % (page.line_count(),
                                                                   page.empty_line_count(),
                                                                   page.sentence_count()))
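
The loop above leaves `page` pointing at a single page. If the goal is instead to collect the first N pages, `itertools.islice` can cap any iterator. Here is a minimal sketch using a stand-in generator: `fake_pages` is hypothetical and only mimics an iterator like the one returned by `vol.pages()`.

```python
from itertools import islice


def fake_pages(total):
    """Stand-in for an iterator of pages; yields page numbers from 1."""
    for number in range(1, total + 1):
        yield number


# Take at most the first 200 items from the iterator.
first_pages = list(islice(fake_pages(350), 200))
print(len(first_pages), first_pages[0], first_pages[-1])

# islice stops early without error when there are fewer items than requested.
short = list(islice(fake_pages(12), 200))
print(len(short))
```

With a real volume, this would read `list(islice(vol.pages(), 200))`.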


We can get a list of the tokens with the `tokenlist` method:

In [None]:
print(page.tokenlist())
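
The table that `tokenlist` returns can be thought of, roughly, as a flattened view of per-page counts keyed by token and part of speech. The sketch below shows that flattening on a made-up page body using plain Python. The nested `tokenPosCount` layout is an assumption about how the raw JSON stores counts, and `flatten_token_counts` is a hypothetical helper, not a `FeatureReader` function:

```python
# Made-up page body; the nested token -> part-of-speech -> count layout is an
# assumption about how the raw JSON stores counts.
sample_body = {
    "tokenPosCount": {
        "the": {"DT": 4},
        "record": {"NN": 2, "VB": 1},
    }
}


def flatten_token_counts(section_name, section):
    """Turn nested token/POS counts into sorted (section, token, pos, count) rows."""
    rows = []
    for token, pos_counts in section["tokenPosCount"].items():
        for pos, count in pos_counts.items():
            rows.append((section_name, token, pos, count))
    return sorted(rows)


rows = flatten_token_counts("body", sample_body)
for row in rows:
    print(row)

# Summing the counts gives the total number of tokens in the section.
total_tokens = sum(count for _, _, _, count in rows)
print("total tokens:", total_tokens)
```

Keeping one row per token and part-of-speech pair is what makes it straightforward to group or sum the counts later, for example by token across a whole volume.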