# Introduction

Hey there! This notebook is a short walkthrough of extracting publication metadata from the HathiTrust Digital Library and illustrating geographic data in Python. As an exercise, we will examine and map the publication locations of a collection of Dadaist literature drawn from HathiTrust.

[Dada](https://en.wikipedia.org/wiki/Dada) was an art and literature movement that arose as a reaction to the physical and psychological trauma wrought by World War I, a conflict unmatched at the time in its scale, death toll, and devastation.

Below is a painting by [Max Ernst](https://en.wikipedia.org/wiki/Max_Ernst), a prominent German Dada artist.
<img src="ernst.jpg">

# Background

The [HathiTrust Digital Library](https://www.hathitrust.org/) contains over 14 million volumes scanned from academic libraries around the world (primarily in North America). The [HathiTrust Research Center](https://analytics.hathitrust.org/) allows researchers to access almost all of those texts in a few different modes for computational text analysis. 

For more information on HTRC: 
* [Library text mining guide page on HTRC](http://guides.lib.berkeley.edu/c.php?g=491766&p=3381443)
* [Programming Historian's Text Mining in Python through the HTRC Feature Reader](http://programminghistorian.org/lessons/text-mining-with-extracted-features)

# Installation

To start we'll need to install a few things:
* Install the *HTRC Feature Reader* to work with Extracted Features: 
```
conda install -c htrc htrc-feature-reader
``` 
or
```
pip install htrc-feature-reader
pip install matplotlib jupyter
```
* Install Rsync to download Extracted Features from HathiTrust:

  * For Linux (this example uses `yum`; substitute your distribution's package manager if it differs):
```
yum -y install rsync
```
  * For Mac you need to use [Homebrew](https://brew.sh/) to install Rsync. To install Homebrew:
```
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
``` 
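 and to install Rsync on Mac using Homebrew (recent Homebrew versions have retired the `homebrew/dupes` tap, so the `brew tap` line may fail and can be skipped):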
```
brew tap homebrew/dupes
brew install rsync
```
  * For Windows, you will need to use [Cygwin](https://cygwin.com/) to install Rsync.
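
Before moving on, it can help to confirm that the command-line tools are actually on your PATH. The sketch below is a convenience check, not part of the HTRC tooling; the helper name `find_tools` is made up for illustration, and it assumes that installing the Feature Reader put the `htid2rsync` command on your PATH.

```python
import shutil


def find_tools(names=("rsync", "htid2rsync")):
    """Return a dict mapping each tool name to its full path, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


# Report which of the required commands can be found.
for tool, location in find_tools().items():
    print(f"{tool}: {location or 'NOT FOUND'}")
```

If either tool is reported as not found, revisit the installation steps above before trying the download commands.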


## Adding volumes from HathiTrust

To build your own corpus, you first need to find the volumes you'd like to include in the [HathiTrust Library](https://www.hathitrust.org/). Alternately, you can access volumes from existing [public HT collections](https://babel.hathitrust.org/cgi/mb?colltype=featured), or use one of the sample datasets included below under the *Sample datasets* heading. To access extracted features from HathiTrust:

* Install and configure [the HT + HTRC mashup](https://data.analytics.hathitrust.org/features/) browser extension.
* Once the extension is running, go to the [HathiTrust Library](https://www.hathitrust.org/), and search for the titles you want to include.
* You can manually download extracted features one result at a time by simply choosing the *Download Extracted Features* link for any item in your search results. Save the .json.bz2 file or files and skip to the next section, *Working with Extracted Features* below to load them into your workspace.
* If you plan to work with a large number of texts, you might choose instead to create a collection in HathiTrust, and then download the Extracted Features for the entire collection at once. This requires a valid CalNet ID. 

### To create a collection:

* [Login to HathiTrust](https://www.hathitrust.org/shibboleth)
* Change the HathiTrust search tab to *Full-Text* or go to the [Advanced Full-Text search](https://babel.hathitrust.org/cgi/ls?a=page;page=advanced).
* Check the boxes to the left of any search results you want to add to your collection (or select all), and use the *Select Collection* dropdown to *Add Selected* volumes to collections of your own design.
* Choose *My Collections* from the top of the HathiTrust interface, choose your collection, and from the *Download Metadata* button/dropdown choose the TSV option.
* Open the TSV file and delete every column except the first, *htitem_id*. Delete the *htitem_id* header row as well, then save the file to your working directory.
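
If you would rather not edit the TSV by hand, the last step can be scripted. This is a minimal sketch that assumes the downloaded file has a header row containing an *htitem_id* column, as described above; the function name and the file names in the usage comment are hypothetical.

```python
import csv


def tsv_to_id_list(tsv_path, out_path, column="htitem_id"):
    """Copy one column of a TSV file into a text file, one value per line."""
    with open(tsv_path, newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src, delimiter="\t")
        # Skip rows where the column is missing or empty.
        ids = [row[column].strip() for row in reader if row.get(column)]
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(ids) + "\n")
    return ids


# Example usage (hypothetical file names):
# tsv_to_id_list("my_collection.tsv", "volumeids.txt")
```

The resulting text file has one volume ID per line and no header.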

## Loading Extracted Features

Go to the directory where you plan to do your work.

### Add a single volume
If you're planning to analyze only a few volumes you can use the following command, replacing {{volume_id}} with your own:
```
htid2rsync {{volume_id}} | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```

### Add multiple volumes
If you have a .txt file of volume IDs, with one ID per line, pass it to htid2rsync with --from-file filename, or just -f filename:
```
htid2rsync --f volumeids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```
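The same two-step pipeline can also be driven from Python, which may be convenient inside a notebook. This is only a sketch: the function names are invented for illustration, the host and flags are copied from the commands above, and `download_features` will only work if both `htid2rsync` and `rsync` are installed.

```python
import subprocess


def build_commands(id_file, dest="local-folder/",
                   host="data.sharc.hathitrust.org::features/"):
    """Return the two argument lists that make up the htid2rsync | rsync pipeline."""
    producer = ["htid2rsync", "--from-file", id_file]
    consumer = ["rsync", "-azv", "--files-from=-", host, dest]
    return producer, consumer


def download_features(id_file, dest="local-folder/"):
    """Run htid2rsync, then feed the paths it prints to rsync on standard input."""
    producer, consumer = build_commands(id_file, dest)
    listing = subprocess.run(producer, check=True, capture_output=True, text=True)
    subprocess.run(consumer, input=listing.stdout, check=True, text=True)


# Example usage (requires both tools and network access):
# download_features("volumeids.txt")
```

Keeping the command construction separate from execution makes it easy to print and inspect the commands before running them.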

### Sample datasets

#### Complete Novels of Jane Austen (1 volume)
```
htid2rsync mdp.39015004788835 | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```
#### Nigerian Authors (30 volumes)
authors-nigerian.txt includes volume IDs for 30 texts with the Library of Congress subject heading *Authors, Nigerian*. 
```
htid2rsync --f authors-nigerian.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```

#### San Francisco (Calif.) - History (111 volumes)
sf-history.txt includes volume IDs for 111 texts with the Library of Congress subject heading *San Francisco (Calif.) - History*. 
```
htid2rsync --f sf-history.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```

#### Congressional Record (1200 volumes)
congressional_record_ids.txt includes the volume ID for every *Congressional Record* volume that HathiTrust could share with us.
```
htid2rsync --f congressional_record_ids.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local-folder/
```

#### Full 4TB library
It's also possible to work with the entire library (4TB, so beware):
```
rsync -rv data.analytics.hathitrust.org::features/ .
```

You can also use existing lists of public-domain [fiction](http://data.analytics.hathitrust.org/genre/fiction_paths.txt), [drama](http://data.analytics.hathitrust.org/genre/drama_paths.txt), and [poetry](http://data.analytics.hathitrust.org/genre/poetry_paths.txt) (Underwood 2014).

In the example below, the volume IDs for our Dada collection are listed in the file *dada.txt*. You can modify the command to use your own list of volume IDs or a single volume ID of your choosing. (If you choose your own volumes here, you will also need to modify the file paths in the following cells to point to those files.)


# Extracting the Metadata

This cell downloads the Extracted Features files, which include the metadata, for each volume ID in dada.txt.

In [None]:
%%bash
htid2rsync --f dada.txt | rsync -azv --files-from=- data.sharc.hathitrust.org::features/ local_folder/

If the command above did not run properly (as often happens on Windows), run the next cell, which loads the Extracted Features files from the data folder into the feature reader.

In [36]:
from htrc_features import FeatureReader
import os

# File-name stems of the Extracted Features files stored in the data folder
file_ids = ['coo.31924015258274',
            'iau.31858037881137',
            'mdp.39015015261228',
            'mdp.39015005710549',
            'mdp.39015024530415',
            'mdp.39015029907717',
            'mdp.39015036938895',
            'mdp.39015040086376',
            'njp.32101058239425',
            'njp.32101071964223',
            'uc1.$b29521',
            'uc1.31175034880826',
            'uc1.$b466638',
            'uc1.b3053718',
            'uc1.b3132679',
            'uc2.ark+=13960=t1mg7h245',
            'wu.89094397676']

# Build the full path to each .json.bz2 file
paths = [os.path.join('data', file_id + '.json.bz2') for file_id in file_ids]

fr = FeatureReader(paths)

fr.volumes() yields Volume objects, each representing a unique book in our collection. Each [Volume object](http://htrc.github.io/htrc-feature-reader/htrc_features/feature_reader.m.html#htrc_features.feature_reader.Volume) has attributes we can access, such as title, author, and, importantly for our purposes, place of publication (pub_place).
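
To look at several attributes side by side, one option is to collect them into one dictionary per volume. The helper below is a hypothetical convenience, not part of the Feature Reader. It uses `getattr` with a default so that a missing attribute does not raise an error, and the stand-in `FakeVolume` objects (with placeholder values, not real records) exist only so the sketch runs without downloaded data.

```python
from collections import namedtuple


def summarize(volumes, fields=("title", "author", "pub_place")):
    """Build one dict per volume, using None for any attribute that is missing."""
    return [{field: getattr(vol, field, None) for field in fields} for vol in volumes]


# Stand-in objects with placeholder values (not real HathiTrust records).
FakeVolume = namedtuple("FakeVolume", ["title", "author", "pub_place"])
sample = [FakeVolume("Example title A", "Example author", "xx "),
          FakeVolume("Example title B", None, "yy ")]

for row in summarize(sample):
    print(row)

# With real data this would be: summarize(fr.volumes())
```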

In [35]:
#This cell lets us see the titles in the collection
for volume in fr.volumes():
    print(volume.title)

Maintenant.
Die Kugel.
Western art and the new era : an introduction to modern art / by Katherine S. Dreier.
The art of thought, by Graham Wallas.
An anthology of modern French poetry, by Gustave L. Van Roosbroeck.
American poetry since 1900 / by Louis Untermeyer.
A.L.A. catalog, 1926; an annotated basic list of 10,000 books.
Dada.
Vierzehn Briefe Christi : ein Geburtstagsgeschenk für seine Abteilung Ernst Haeckel vom Besitzer des Kabarets zur Blauen Milchstrasse.
La dernière Bohème; Verlaine et son milieu. Fantaisie-préface de Rachilde.  4 hors-texte, dessins de: Lita Besnard, G. Braun, F.-A. Cazals, Marie Cazals, Fernand Fau, Florian-Parmentier, Gallien, J. Hilly, Ibels, Jarry, Moréas, Ernest Raynaud, Verlaine.
American criticism, 1926,
Der Zeltweg.




Dai shisō ensaikuropejia.
Abstracts of theses, science series ... submitted to the faculties of the graduate schools of the University of Chicago for the degree of doctor of philosophy, June 1922-June1923, with abstracts of some theses submitted at an earlier date.
Books in black or red.
100 Poy cartoons : reprinted from the London "Evening News" and "Daily Mail".
Abstracts of theses, science series ... submitted to the faculties of the graduate schools of the University of Chicago for the degree of doctor of philosophy, June 1922-June1923, with abstracts of some theses submitted at an earlier date.


In [38]:
#This cell prints out the publication locations of books in the collection
for volume in fr.volumes():
    print(volume.pub_place)

it 
lh 
nyu
nyu
nyu
|||
ilu
fr 
gw 
fr 
nyu
fr 




ja 
ilu
nyu
enk
ilu


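It is not obvious which places the abbreviations above refer to. They appear to be MARC country codes, which identify a country or, in some cases such as the United States, a state (for example, nyu is New York State) rather than a city. We need a way to map each abbreviation to a place name.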
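
One simple approach is a lookup table. The sketch below assumes the values are MARC country codes and covers only the codes printed above; the dictionary and function names are made up for illustration. Values are stripped of padding spaces first, and empty values or values made up entirely of `|` fill characters are treated as unknown.

```python
from collections import Counter

# Names for the codes printed above, following the MARC country code list.
MARC_PLACES = {
    "it": "Italy",
    "lh": "Liechtenstein",
    "nyu": "New York (State)",
    "ilu": "Illinois",
    "fr": "France",
    "gw": "Germany",
    "ja": "Japan",
    "enk": "England",
}


def place_name(code):
    """Translate a raw pub_place value into a readable place name."""
    key = (code or "").strip()
    if not key or set(key) == {"|"}:
        return "Unknown"
    return MARC_PLACES.get(key, f"Unrecognized code: {key}")


# A few raw values in the form the feature reader printed them.
codes = ["it ", "lh ", "nyu", "|||", "", "enk"]
counts = Counter(place_name(code) for code in codes)
print(counts)

# With real data this would be: Counter(place_name(v.pub_place) for v in fr.volumes())
```

A `Counter` of place names gives a quick tally that can later feed a chart or map.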