# END Data / Jupyter + Python Workshop
Nabil Kashyap (nkashya1@swarthmore.edu) / 2017-06-13


![marc to marc](https://pbs.twimg.com/media/CwsAx9nXAAQ0khZ.jpg:large)

This workshop is a first pass at using Jupyter notebooks as a tool for exploring END MARCXML directly. With hosted END data, the Python library pymarc, and a set of helper functions, we should be able to get a hands-on sense of how to parse MARC data and of how END records specifically are structured.

## STEPS

1. import pymarc library
2. import endmarcxml.py helper functions
3. explore basic Python loops and conditionals
4. extract subsets of data based on our sample criteria
5. further exploration

## Basic concepts

- loops
- conditionals
- data types
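
As a quick refresher, the sketch below touches all three concepts. The `years` list is made up for illustration and is not drawn from the END data.

```python
# Data types: a list of integers, plus a None standing in for a missing date
years = [1719, 1740, None, 1794, 1811]

early = []
for year in years:                          # loop: visit each item in turn
    if year is not None and year < 1790:    # conditional: keep only some items
        early.append(year)

print(early)            # [1719, 1740]
print(type(early))      # <class 'list'>
print(type(years[0]))   # <class 'int'>
```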

In [None]:
!python3 -m pip install pymarc --user ## install the pymarc library
!wget https://raw.githubusercontent.com/swat-ds/endmarcxml/master/endmarcxml.py ## download the endmarcxml.py helper functions

import site ## a little magic to get the locally installed library to import
import sys
sys.path.append(site.getusersitepackages()) ## per-user site-packages directory for whichever Python version is running

from pymarc import marcxml ## import pymarc and endmarcxml
import endmarcxml as emx

## XML is nested

This is painfully obvious, but it is a necessary starting point. XML consists of nested hierarchies: elements contain child elements. In this case, the XML document declares a MARC collection, which is the parent of the individual MARC records.
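
To make the nesting concrete, here is a minimal sketch that uses Python's built-in `xml.etree.ElementTree` rather than pymarc. The tiny record is invented for illustration and is not taken from the END dataset; only the element names and the namespace follow the MARCXML schema. The `outline` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal MARCXML document (not a real END record)
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">demo-0001</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title</subfield>
    </datafield>
  </record>
</collection>"""

def outline(element, depth=0):
    """Return one indented line per element, showing parent/child nesting."""
    name = element.tag.split('}')[-1]   # drop the '{namespace}' prefix
    lines = ['  ' * depth + name]
    for child in element:               # iterating an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(SAMPLE)
print('\n'.join(outline(root)))
```

Each level of indentation in the printed outline corresponds to one step down the hierarchy: collection, then record, then the fields, then subfields.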

In [None]:
collection = marcxml.parse_xml_to_array('https://raw.githubusercontent.com/earlynovels/end-dataset/041417-data/full-041417.xml',strict=True)

In [None]:
for record in collection:
    print(record.title())
    #     print(record['001'].value() + '\t' + record.title())

In [None]:
early_set = []

for record in collection:
    pub_date = emx.get_pub_date(record)
    if pub_date and 1700 <= pub_date <= 1789:
        early_set.append(record)
        
print(len(early_set))

In [None]:
for record in early_set: print(record.title())

In [None]:
for record in early_set:
    prefix = emx.get_pymarc_field_value('001',record) + '\t' ## for convenience so we can keep track of these records
    for field in record.get_fields('595'): ## get_fields returns an empty list when a record has no 595, so no separate check is needed
        print(prefix + field.value())

In [None]:
for record in early_set:
    prefix = emx.get_pymarc_field_value('001',record) + '\t' ## for convenience so we can keep track of these records
    subfields = emx.get_subfield_values('595','x',record)
    if subfields: print(prefix + str(subfields))

## Augmenting the data

One of the main benefits of working with the records programmatically is the ability to draw in other data sources, such as VIAF name authority data.

In [73]:
import re
import requests

In [82]:
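for record in collection:

    names = emx.get_persons('printed for',record)
    for name in names:
        ## pass the name via params so that spaces and characters such as '&' are URL-encoded
        r = requests.get('http://viaf.org/viaf/AutoSuggest', params={'query': name})
        data = r.json()
        
        if data['result']:
            term = data['result'][0]['term']
#             print(term)
            dates = re.findall(r'\d+',term) ## raw string so that \d is not treated as a string escape
#             print(dates)
            if dates: print(dates)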

T Tegg
Thomas Tegg, 1776-1845
['1776', '1845']
J. Freeman  Young, 1820-1885
['1820', '1885']
John Morphew, 16..-17.., libraire
['16', '17']
Henry Colburn British publisher
Henry Colburn British publisher
Sherwood, Neely, and Jones
Wilkie and Robinson
Cadell and Davies
Longman, Hurst, Rees, and Orme
Lackington, Allen and Co. business; booksellers and publisher in London, 1793-1812
['1793', '1812']
Cuthell and Martin
Vernor, Hood and Sharpe
R. Faulder and Son
Harris, J. Rendel , 1852-1941
['1852', '1941']
Jeffrey, E. R
Booker, J.R. , 1942-1998
['1942', '1998']
Scholey, Robert, fl. 1806-1831
['1806', '1831']
Asperne, James 1757-1820
['1757', '1820']
Nunn, James, 17..-18
['17', '18']
Lea, R.M., 1943-
['1943']
Richardson, J.P
Johnson, J.J., 1924-2001
['1924', '2001']
Walker, J.&C
Bradley, A.C. , 1851-1935
['1851', '1935']
Bradley, A.C. , 1851-1935
['1851', '1935']
W Cooper
W Cooper
W Cooper
W Cooper
Colburn, H., 17..-18.., imprimeur-libraire
['17', '18']
Colburn, Henry, ?-1855
['1855']
T As

KeyboardInterrupt: 
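
The output above shows a side effect of matching with `\d+`: approximate dates such as `16..-17..` come back as `['16', '17']`. One possible refinement, sketched here on strings copied from that output, is to accept only four-digit runs. The helper name `extract_years` is hypothetical and is not part of endmarcxml.

```python
import re

def extract_years(term):
    """Return the four-digit numbers found in a VIAF term, as integers."""
    # (?<!\d) and (?!\d) ensure the four digits are not part of a longer number
    return [int(y) for y in re.findall(r'(?<!\d)\d{4}(?!\d)', term)]

print(extract_years('Thomas Tegg, 1776-1845'))             # [1776, 1845]
print(extract_years('John Morphew, 16..-17.., libraire'))  # []
print(extract_years('Lea, R.M., 1943-'))                   # [1943]
```

Whether to discard partial dates or keep them as century hints depends on the question being asked of the data.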

## Making basic bar charts based on our subsets

Using NumPy and Matplotlib's pyplot module, we can begin exploring our data visually.

In [None]:
import numpy as np
import matplotlib.pyplot as plotter

In [None]:
x = np.arange(5)
y = [5, 10, 15, 35, 3]
width = 1

barChart = plotter.bar(x,
                       y,
                       width,
                       color=['red', 'blue'],
                       label='Returns')
plotter.xticks(x, ('2012', '2013', '2014', '2015', '2016'))

plotter.tight_layout()
plotter.show()
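
The toy values in the chart could be swapped for counts derived from a subset of records. The sketch below groups publication years by decade with `collections.Counter`. The `years` list is invented for illustration; in the notebook it could instead be built from `emx.get_pub_date(record)`, assuming that helper returns an integer year, as the 1700-1789 comparison earlier suggests. Both function names are hypothetical.

```python
from collections import Counter

def count_by_decade(years):
    """Map each decade (e.g. 1770) to the number of years falling in it."""
    return Counter((year // 10) * 10 for year in years if year is not None)

def plot_decades(counts):
    """Draw one bar per decade. matplotlib is imported here so that
    count_by_decade stays usable without it."""
    import matplotlib.pyplot as plotter
    decades = sorted(counts)
    positions = range(len(decades))
    plotter.bar(positions, [counts[d] for d in decades])
    plotter.xticks(positions, [str(d) for d in decades])
    plotter.tight_layout()
    plotter.show()

# Made-up publication years, not END data
years = [1771, 1778, 1785, None, 1789, 1792]
counts = count_by_decade(years)
print(sorted(counts.items()))  # [(1770, 2), (1780, 2), (1790, 1)]
```

Calling `plot_decades(counts)` in a notebook cell would then render the chart.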

