# Introduction

Hi there! This notebook walks through extracting publication metadata from the HathiTrust Digital Library and illustrating geographic data in Python. As an exercise, we will examine and map the publication locations of a collection of Dadaist literature drawn from HathiTrust.

[Dada](https://en.wikipedia.org/wiki/Dada) was an art and literature movement that arose as a reaction to the physical and psychological trauma wrought by World War I, a conflict unmatched at the time in its scale, death toll, and devastation.

Below is a painting by [Max Ernst](https://en.wikipedia.org/wiki/Max_Ernst), a prominent German Dada artist.
<img src="ernst.jpg">

# Background

The [HathiTrust Digital Library](https://www.hathitrust.org/) contains over 14 million volumes scanned from academic libraries around the world (primarily in North America). The [HathiTrust Research Center](https://analytics.hathitrust.org/) allows researchers to access almost all of those texts in a few different modes for computational text analysis. 

For more information on HTRC: 
* [Library text mining guide page on HTRC](http://guides.lib.berkeley.edu/c.php?g=491766&p=3381443)
* [Programming Historian's Text Mining in Python through the HTRC Feature Reader](http://programminghistorian.org/lessons/text-mining-with-extracted-features)

# Extracting the Metadata

The cells in this section gather the metadata for each volume ID in our Dada collection. We start by silencing warnings so that the output stays readable.

In [24]:
import warnings
warnings.filterwarnings('ignore')

Credit to Alex Chan for the following two cells!

In [53]:
import json
import os

# Only pick up collection exports ending in .json (not, e.g., .json.bz2 feature files)
jsonFiles = [file for file in os.listdir('.') if file.endswith('.json')]

txts = []
for file in jsonFiles:
    with open(file) as f:
        data = json.load(f)
        
    texts = data['gathers']
    ids = [text['htitem_id'] for text in texts]
    
    filename = data['title'] + '.txt'
    txts.append(filename)
    
    #write each id into txt file
    with open(filename, 'w') as f:
        for textid in ids:
            f.write(textid + '\n')

print("Text files created")

Text files created


In [54]:
output = !htid2rsync --f "Dada Literature.txt"| rsync -azv --files-from=- data.sharc.hathitrust.org::features/ dada/

In [55]:
import os

# Keep only the feature files listed in the rsync output and prefix the local folder
suffix = ".json.bz2"
filePaths = [path for path in output if path.endswith(suffix)]
paths = [os.path.join("dada", path) for path in filePaths]

# Save the paths, one per line, so the download step does not need to be rerun
path_file = "paths.txt"
with open(path_file, "w") as f:
    for path in paths:
        f.write(path + "\n")

In [59]:
from htrc_features import FeatureReader

# Read the saved paths back, dropping the trailing newline and any blank lines
with open(path_file, "r") as f:
    paths = [line.rstrip("\n") for line in f if line.strip()]

dada = FeatureReader(paths)
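
If the rsync step skipped any volume, some of the saved paths will point at files that do not exist. A minimal sketch of a check that separates the files that arrived from those that did not is shown here; `split_existing` is a hypothetical helper, and a temporary directory stands in for the dada/ folder so the example runs on its own.

```python
import os
import tempfile


def split_existing(paths):
    """Split a list of paths into (found, missing) based on what is on disk."""
    found = [p for p in paths if os.path.isfile(p)]
    missing = [p for p in paths if not os.path.isfile(p)]
    return found, missing


# Demonstration with a temporary directory standing in for dada/ (assumption).
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "present.json.bz2")
    absent = os.path.join(tmp, "absent.json.bz2")
    open(present, "w").close()  # create an empty placeholder file

    found, missing = split_existing([present, absent])
    print(len(found), "found;", len(missing), "missing")
```

In the notebook itself, the same function could be called on `paths` before building the reader, and only the found files passed along.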

dada.volumes() yields Volume objects, each representing a unique work in our collection. Each [Volume object](http://htrc.github.io/htrc-feature-reader/htrc_features/feature_reader.m.html#htrc_features.feature_reader.Volume) has attributes we can access, such as title, author, and, importantly for our purposes, the location of publication.

In [None]:
# Materialize the volumes into a list so they can be reused
volumes_collection = list(dada.volumes())
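
Because each Volume exposes its metadata as attributes, one convenient way to inspect the collection is to gather the fields of interest into a list of dictionaries. The sketch below is only an illustration: `volume_rows` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real Volume objects so that the example runs without the downloaded files. It relies only on the `title` and `pub_place` attributes used in this notebook.

```python
from types import SimpleNamespace


def volume_rows(volumes, fields=("title", "pub_place")):
    """Collect selected attributes from each volume into a list of dicts.

    Attributes a volume lacks are recorded as None instead of raising an error.
    """
    rows = []
    for volume in volumes:
        rows.append({field: getattr(volume, field, None) for field in fields})
    return rows


# Stand-in objects (assumption): real code would pass dada.volumes() instead.
sample_volumes = [
    SimpleNamespace(title="Dada.", pub_place="fr "),
    SimpleNamespace(title="Dai shisō ensaikuropejia.", pub_place="ja "),
]

rows = volume_rows(sample_volumes)
for row in rows:
    print(row)
```

A list of dictionaries like this can be handed to other tools later, for example to build a table of titles and places.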

In [35]:
#This cell lets us see the titles in the collection
for volume in dada.volumes():
    print(volume.title)

Maintenant.
Die Kugel.
Western art and the new era : an introduction to modern art / by Katherine S. Dreier.
The art of thought, by Graham Wallas.
An anthology of modern French poetry, by Gustave L. Van Roosbroeck.
American poetry since 1900 / by Louis Untermeyer.
A.L.A. catalog, 1926; an annotated basic list of 10,000 books.
Dada.
Vierzehn Briefe Christi : ein Geburtstagsgeschenk für seine Abteilung Ernst Haeckel vom Besitzer des Kabarets zur Blauen Milchstrasse.
La dernière Bohème; Verlaine et son milieu. Fantaisie-préface de Rachilde.  4 hors-texte, dessins de: Lita Besnard, G. Braun, F.-A. Cazals, Marie Cazals, Fernand Fau, Florian-Parmentier, Gallien, J. Hilly, Ibels, Jarry, Moréas, Ernest Raynaud, Verlaine.
American criticism, 1926,
Der Zeltweg.




Dai shisō ensaikuropejia.
Abstracts of theses, science series ... submitted to the faculties of the graduate schools of the University of Chicago for the degree of doctor of philosophy, June 1922-June1923, with abstracts of some theses submitted at an earlier date.
Books in black or red.
100 Poy cartoons : reprinted from the London "Evening News" and "Daily Mail".
Abstracts of theses, science series ... submitted to the faculties of the graduate schools of the University of Chicago for the degree of doctor of philosophy, June 1922-June1923, with abstracts of some theses submitted at an earlier date.


In [38]:
#This cell prints out the publication locations of books in the collection
for volume in dada.volumes():
    print(volume.pub_place)

it 
lh 
nyu
nyu
nyu
|||
ilu
fr 
gw 
fr 
nyu
fr 




ja 
ilu
nyu
enk
ilu


It is not obvious which places the abbreviations above stand for, and some entries are blank or filled with '|||'. We need a way to map each abbreviation to a place name.
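
One way to do this mapping is a dictionary keyed by code. The sketch below assumes the abbreviations are MARC country codes (the full table is read from marc-codes.txt further down) and uses a small hand-typed subset for illustration. `SAMPLE_CODES` and `place_name` are hypothetical names, and treating '|||' as a placeholder with no place attached is an assumption.

```python
from collections import Counter

# Hand-typed subset of the MARC country code list (assumption: illustration only).
SAMPLE_CODES = {
    "fr": "France",
    "gw": "Germany",
    "it": "Italy",
    "ja": "Japan",
    "enk": "England",
    "ilu": "Illinois",
    "nyu": "New York (State)",
}


def place_name(code, codes):
    """Return the place name for a raw pub_place value, or None if unknown."""
    cleaned = (code or "").strip().lower()
    # Blank values and the '|||' placeholder do not name a place.
    if not cleaned or cleaned == "|||":
        return None
    return codes.get(cleaned)


raw_places = ["it ", "nyu", "|||", "fr ", "", "enk"]
names = [place_name(code, SAMPLE_CODES) for code in raw_places]
print(names)

# Tally how many volumes fall in each known place.
counts = Counter(name for name in names if name is not None)
print(counts.most_common())
```

Stripping whitespace before the lookup matters here because two-letter codes are printed with a trailing space ('fr ' rather than 'fr').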

In [2]:
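import json
import time
import requests


def get_coordinates(address, api_key=None):
    """Return {'lat': ..., 'lng': ...} for an address, or None if nothing matches."""
    url = 'https://maps.googleapis.com/maps/api/geocode/json'

    p = {"address": address}
    if api_key:
        # The Geocoding API may reject requests that do not include a key
        p["key"] = api_key

    res = requests.get(url, params=p)
    res.raise_for_status()

    # An empty 'results' list means the address could not be geocoded
    results = res.json().get('results', [])
    if not results:
        return None

    return results[0]['geometry']['location']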

In [27]:
get_coordinates("UC Berkeley")

{'lat': 37.8718992, 'lng': -122.2585399}
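
When many place names need coordinates, it helps to look up each distinct name only once and to pause between requests. The sketch below takes the geocoding function as a parameter (`geocode`), so `get_coordinates` could be supplied in the notebook; here a fake lookup stands in so that the example runs offline. `geocode_places`, `fake_geocode`, and the `delay` parameter are hypothetical, and the fake's coordinates reuse the result shown above.

```python
import time


def geocode_places(places, geocode, delay=0.0):
    """Geocode each distinct place once, pausing `delay` seconds between calls.

    Returns a dict mapping place name -> result of geocode(place), or None
    when the lookup raises an exception.
    """
    cache = {}
    for place in places:
        if place in cache:
            continue  # already looked up; avoid a repeat request
        try:
            cache[place] = geocode(place)
        except Exception:
            cache[place] = None  # record the failure and keep going
        time.sleep(delay)
    return cache


calls = []


def fake_geocode(place):
    """Offline stand-in for a real geocoder (assumption: illustration only)."""
    calls.append(place)
    if place == "UC Berkeley":
        return {"lat": 37.8718992, "lng": -122.2585399}
    raise LookupError(place)


coords = geocode_places(["UC Berkeley", "Nowhere", "UC Berkeley"], fake_geocode)
print(coords)
```

Passing the geocoder in as an argument keeps the caching and pacing logic separate from any particular web service.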

In [26]:
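with open('marc-codes.txt', 'r') as f:
    raw = f.read()

# Each line is "<code>\t<place name>"; skip blank or malformed lines
marc_codes = {}
for line in raw.split('\n'):
    parts = line.split('\t')
    if len(parts) >= 2:
        marc_codes[parts[0]] = parts[1]
marc_codes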

{'-ac': 'Ashmore and Cartier Islands',
 '-ai': 'Anguilla',
 '-air': 'Armenian S.S.R.',
 '-ajr': 'Azerbaijan S.S.R.',
 '-bwr': 'Byelorussian S.S.R.',
 '-cn': 'Canada',
 '-cp': 'Canton and Enderbury Islands',
 '-cs': 'Czechoslovakia',
 '-cz': 'Canal Zone',
 '-err': 'Estonia',
 '-ge': 'Germany (East)',
 '-gn': 'Gilbert and Ellice Islands',
 '-gsr': 'Georgian S.S.R.',
 '-hk': 'Hong Kong',
 '-iu': 'Israel-Syria Demilitarized Zones',
 '-iw': 'Israel-Jordan Demilitarized Zones',
 '-jn': 'Jan Mayen',
 '-kgr': 'Kirghiz S.S.R.',
 '-kzr': 'Kazakh S.S.R.',
 '-lir': 'Lithuania',
 '-ln': 'Central and Southern Line Islands',
 '-lvr': 'Latvia',
 '-mh': 'Macao',
 '-mvr': 'Moldavian S.S.R.',
 '-na': 'Netherlands Antilles',
 '-nm': 'Northern Mariana Islands',
 '-pt': 'Portuguese Timor',
 '-rur': 'Russian S.F.S.R.',
 '-ry': 'Ryukyu Islands, Southern',
 '-sb': 'Svalbard',
 '-sk': 'Sikkim',
 '-sv': 'Swan Islands',
 '-tar': 'Tajik S.S.R.',
 '-tkr': 'Turkmen S.S.R.',
 '-tt': 'Trust Territory of the Pacific Is