# END Data / Jupyter + Python Workshop
Nabil Kashyap (nkashya1@swarthmore.edu) / 2017-06-13


![marc to marc](https://pbs.twimg.com/media/CwsAx9nXAAQ0khZ.jpg:large)

This workshop is a first pass at using Jupyter notebooks to explore END MARCXML directly. Using hosted END data, the Python library pymarc, and a set of helper functions, we should get a hands-on sense of how to parse MARC data and of how END records specifically are structured.

## STEPS

1. import pymarc library
2. import endmarcxml.py helper functions
3. explore basic Python loops and conditionals
4. extract subsets of data based on our sample criteria
5. further exploration

In [None]:
!python3 -m pip install pymarc --user ## install pymarc library
!wget https://raw.githubusercontent.com/swat-ds/endmarcxml/master/endmarcxml.py ## download the endmarcxml.py helper functions

import site ## a little magic to get the locally installed library to import
import sys
user_site = site.getusersitepackages() ## user-level site-packages for whichever Python 3 is running
if user_site not in sys.path: sys.path.append(user_site)

from pymarc import marcxml ## import pymarc and endmarcxml
import endmarcxml as emx

## XML is nested

Painfully obvious, but necessary to start with: XML consists of nested hierarchies, i.e., there are always parent and child elements. In this case, the XML document declares a MARC collection, which is the parent of the individual MARC records; each record is in turn the parent of its own fields.
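To make the nesting concrete, here is a minimal sketch that walks a tiny, made-up MARCXML collection with Python's built-in `xml.etree.ElementTree`. The sample records and the `show` helper are illustrative assumptions, not END data or part of pymarc.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up MARCXML collection (not real END data).
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">demo-1</controlfield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title</subfield>
    </datafield>
  </record>
  <record>
    <controlfield tag="001">demo-2</controlfield>
  </record>
</collection>"""

def show(element, depth=0):
    """Print each element's tag, indented by how deeply it is nested."""
    name = element.tag.split('}')[-1]  # drop the '{namespace}' prefix
    print('  ' * depth + name)
    for child in element:              # iterating an element yields its children
        show(child, depth + 1)

root = ET.fromstring(SAMPLE)
show(root)
print(len(list(root)), 'records in the collection')
```

The indentation of the printed tags mirrors the hierarchy: the collection contains records, records contain fields, and data fields contain subfields.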

In [None]:
collection = marcxml.parse_xml_to_array('https://raw.githubusercontent.com/earlynovels/end-dataset/041417-data/full-041417.xml',strict=True)

In [None]:
for record in collection:
    print(record.title())
    #     print(record['001'].value() + '\t' + record.title())
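The cell above is a plain `for` loop: it visits every record and prints its title. Putting a conditional inside the loop lets us keep only some of the items. A minimal warm-up, using made-up titles and years rather than real END records (all names here are hypothetical):

```python
# Hypothetical (title, year) pairs standing in for real END records.
sample = [
    ("Title A", 1695),
    ("Title B", 1722),
    ("Title C", None),   # some records may have no usable date
    ("Title D", 1788),
    ("Title E", 1801),
]

in_range = []
for title, year in sample:             # loop: visit each pair in turn
    if year and 1700 <= year <= 1789:  # conditional: skip missing or out-of-range years
        in_range.append(title)

print(in_range)
```

Checking `year` first guards against comparing `None` with a number, which would raise an error in Python 3.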

In [None]:
early_set = []

for record in collection:
    pub_date = emx.get_pub_date(record)
    if pub_date and 1700 <= pub_date <= 1789:
        early_set.append(record)
        
print(len(early_set))

In [None]:
for record in early_set: print(record.title())

In [None]:
for record in early_set:
    prefix = emx.get_pymarc_field_value('001',record) + '\t' ## for convenience so we can keep track of these records
    for field in record.get_fields('595'): ## get_fields returns an empty list when there is no 595
        print(prefix + field.value())

In [None]:
for record in early_set:
    prefix = emx.get_pymarc_field_value('001',record) + '\t' ## for convenience so we can keep track of these records
    subfields = emx.get_subfield_values('595','x',record)
    if subfields: print(prefix + str(subfields))
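For a sense of what pulling one subfield out of a repeated field involves, here is a rough sketch that does the same kind of lookup on raw MARCXML with the standard library. It is not how `endmarcxml.py` is implemented; the `subfield_values` function, the sample record, and its values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = {'marc': 'http://www.loc.gov/MARC21/slim'}

# One made-up record with two 595 fields; the values are placeholders.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">demo-1</controlfield>
  <datafield tag="595" ind1=" " ind2=" ">
    <subfield code="a">first note</subfield>
    <subfield code="x">first x value</subfield>
  </datafield>
  <datafield tag="595" ind1=" " ind2=" ">
    <subfield code="x">second x value</subfield>
  </datafield>
</record>"""

def subfield_values(record, tag, code):
    """Return the text of every subfield `code` inside every data field `tag`."""
    path = "marc:datafield[@tag='{}']/marc:subfield[@code='{}']".format(tag, code)
    return [sf.text for sf in record.findall(path, NS)]

record = ET.fromstring(SAMPLE)
print(subfield_values(record, '595', 'x'))  # values from both 595 fields
print(subfield_values(record, '595', 'z'))  # no such subfield: empty list
```

Because a field can repeat within a record, the result is a list rather than a single value, and an empty list means the subfield is absent.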

## Making basic bar charts based on our subsets

Using numpy and matplotlib's pyplot module, we can begin exploring our data visually.

In [None]:
import numpy as np
import matplotlib.pyplot as plotter

In [None]:

x = np.arange(5)
y = [5,10,15,35,3]
width = 1

barChart = plotter.bar(x,
                       y,
                       width,
                       color = ['red','blue'],
                       label = 'Returns')
plotter.xticks(x, ('2012', '2013', '2014', '2015', '2016'))

plotter.tight_layout()
plotter.show()
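The chart above uses placeholder numbers. One way to get from a subset of records to bar heights is to tally publication years by decade. A minimal sketch, assuming the years have already been extracted into a plain list (the `years` values below are made up):

```python
from collections import Counter

# Made-up publication years standing in for the dates of a record subset.
years = [1705, 1719, 1722, 1724, 1751, 1758, 1759, 1788]

# Round each year down to its decade and count how many fall in each.
by_decade = Counter((year // 10) * 10 for year in years)

decades = sorted(by_decade)               # x-axis labels, oldest first
counts = [by_decade[d] for d in decades]  # bar heights, in the same order

print(decades)
print(counts)
```

The two lists line up one-to-one, so `counts` could serve as the bar heights and `decades` as the tick labels in a call like the one above.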

