# Parsing MARC records with Pymarc #

In this exercise, we're going to use the Pymarc library to extract information from a file of library catalog records in MARC format. 

This code is meant to provide some practical techniques for dealing with MARC data. It's not necessarily the most elegant Python (experienced coders may well see better ways to accomplish these things), but each cell will introduce a different coding technique that is more broadly applicable (variables, control flow, conditionals, etc.). Likewise, professional catalogers may well see nuances of MARC that I'm passing over or missing.

All of the code examples here target Python 2.7. Python 2.7 isn't the most recent version of Python, and the code would need some updating for compatibility with Python 3. But you'll find lots of tutorials and discussion board posts targeted at Python 2.7, so this code may provide a reasonable starting point for further learning.

## Opening a file and reading it ##

This first example handles the basic steps we'll need to perform to deal with a file of MARC records: 
* Importing the necessary library for reading MARC records (provided by the Pymarc package); 
* Opening our file of MARC records;
* Passing the contents of the file to Pymarc's MARCReader function;
* Initiating an operation on each record in the file.

MARC is a long-lived format, and is actually quite ingenious--but it's also kind of abstruse and a bit cumbersome to work with in its raw form. Pymarc is a library that handles all the mucky business of reading MARC data for us and makes it available for use in our Python scripts. By "importing" Pymarc's MARCReader function, our script gets access to everything that Pymarc "knows" about reading MARC: Pymarc reads MARC so we don't have to. 

Refer to the "MARC Crash Course" handout for a quick overview of some of the MARC fields we're most likely to be interested in. For complete documentation, see https://www.loc.gov/marc/bibliographic/

(In the cells below, you'll see that lines with an octothorpe [aka "pound sign" or "hashtag"] at the beginning appear in light green italics. Those lines are comments and not code that gets executed. I've used comments to provide, well, commentary on the code. But we can also "comment out" a line of code to prevent it from running. In some of the examples below, I've commented out some commands so that they won't execute when you run the code in the cell the first time, then asked you to "uncomment" the command and re-run the code to see what changes. To "comment out" a line of code, simply add an octothorpe at the beginning of the line. To "uncomment" a line, just delete the octothorpe from the beginning of the line.)

In [1]:
# Import the necessary components from the Pymarc library.
from pymarc import MARCReader

# This is a pretty standard way of working with a file in Python. We open the file by passing the filename
# to the open command and indicate that we want to open the file in "readable" mode. While we're at it, we
# create a variable name ("infile") to use as an alias for referring to our file. 

with open('data/Bowyer_from_ESTC-sample.mrc', 'r') as infile :
    # Note the colon at the end of the "with" statement, as well as the indentation of the subsequent code. 
    # White space matters in Python. The indented code is executed "inside" the statement
    # above it. (Other languages would use, say, curly braces, to indicate this sort of control flow.)
    
    # Next, we pass the contents of the file we just opened to the MARCReader function of the Pymarc library.
    # This will allow us to use all of the methods that that library offers for reading MARC data.
    reader = MARCReader(infile) 
    # We're going to want to do something with each of the records in our file, so we'll begin a "for" loop.
    # Just as we did for our "with" statement above, the for statement ends with a colon and the subsequent
    # code is indented. The code inside this loop will be run for each item that the loop encounters. 
    # (Note that the variable name "record" is arbitrary--we could use "x" or "imelda" and the computer would 
    # understand just fine. But we need our code to be human-readable, too, so let's stick with variable names
    #  that make sense to us.)
    for record in reader :
        # Let's just keep things simple at this point and print each record to prove that we really do have a 
        # set of MARC records to work with.
        print(record)

=LDR  01493nam a2200229   4500
=001  N10049
=003  CU-RivES\
=005  20090206223027.0
=008  840710s1758\\\\enk||||\\\\\\\00|\||eng\c
=009  006000111\
=035  \\$a(Uk-ES)006000111 
=040  \\$aCU-RivES$cCU-RivES$dCU-RivES$dCStRLIN$dCU-RivES
=100  1\$aDavila, Arrigo Caterino,$d1576-1631.
=240  10$aIstoria delle guerre civilé di Francia.$lEnglish
=245  14$aThe history of the civil wars of France.$bIn which are related, the most remarkable transactions that happened during the reigns of Francis the Second, Charles the Ninth, Henry the Third, and, Henry the Fourth, surnamed the Great. A new translation from the Italian of Henrico Caterino Davila. By Ellis Farneworth, M. A. ... 
=260  \\$aLondon :$bprinted for D. Browne, without Temple-Bar A. Millar, in the Strand J. Whiston and B. White, in Fleet-Street R. and J. Dodsley, in Pall-Mall and W. Sandby, in Fleet-Street,$cMDCCLVIII. [1758] 
=300  \\$a2v. ;$c4⁰. 
=533  \\$aMicrofilm.$bWoodbridge, Conn.:$cPrimary Source Media,$d1999.$e1 reel ; 35 mm.$f(

## Accessing MARC fields with Pymarc ##

There's not much point in using Python just to print formatted MARC records to the screen. We're probably more interested in the contents of the fields and subfields than we are with the MARC tags and subfield codes. This is where Pymarc's ability to interpret MARC for us comes into play: MARCReader allows us to access the contents of fields and subfields using a pretty straightforward syntax.

(In the next few examples, we're going to stick to non-repeatable MARC fields. Repeatable fields need to be treated a little differently. We'll get there.)

The next cell works through a couple of ways of accessing MARC data. Run the code in the cell, then comment out the first print command, uncomment the next ones and re-run the code. Repeat that process for the final set of print commands.

In [4]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    for record in reader :
        # We can access the content of fields like this:
        #print(record['245'])
        
        # More helpful, though, is our ability to access the content of subfields:
        #print(record['245']['a'])
        #print(record['245']['b'] + '\n')
        
        # Not all MARC fields have subfields. To access the contents of these control fields, use ".data"
        print(record['001'].data)
        print(record['008'].data + '\n')

N10049
840710s1758    enk||||       00| ||eng c

N10081
840710s1728    enk||||       00| ||eng c

N101
820917s1733    enk||||       00| ||eng c

N10263
840716s1767    enk||||       00||||eng c

N10280
840716s1710    enk||||       00| ||eng c



### Variables ###
Reading and printing the contents of MARC fields is all well and good, but we probably want to be able to hold on to that information and do something with it. That's much easier to do when we declare variables and assign values to them. For most people, something like "main_title" is easier to understand and remember than "record['245']['a']". It's certainly easier and faster to type.

In the cell below, I've created a variable to hold the value of MARC 245|a and then print that variable. After running the code, add your own variable to get the value of MARC 260|c, then uncomment the final print command and re-run the code in the cell.

In [5]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    for record in reader :
        # Now we'll create some variables and assign them the values of selected fields from our MARC records.
        main_title = record['245']['a']
        
        # >> Create a new variable called "imprint_year" and assign it the value of MARC 260|c. Then uncomment
        # >> the print(imprint_year) command by deleting the octothorpe at the beginning of the line and run the
        # >> code in this cell.
        
        print(main_title)
        #print(imprint_year)

The history of the civil wars of France.
Observations on the small pox: or, An essay to discover a more effectual method of cure.
The beau and the academick.
A new grammar of the Latin tongue, comprising all necessary for grammar-schools.
The new pretenders to prophecy re-examined:


### Conditionals ###
While every MARC record has a title field (245), not all records have a main author field (100). Texts created by corporate authors, for example, have 110 fields instead; works for which no author is known may not have any main author entry. If we were simply to create a variable for `author_name` and assign it the value of MARC 100|a, we'd run into an error when we came across a record that doesn't have a 100 field. 

In this next example, we'll check to see if the record *has* a 100 field before assigning the content of 100|a to a variable.

In [None]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    for record in reader :
        main_title = record['245']['a']
        # Check to see if the record has a 100 field. Python's syntax can seem a bit terse in comparison to 
        # some other languages, but we could think of this statement as saying something like "If there is a
        # 100 field..." As with other statements, note the colon at the end.
        if record['100'] :
            # The indented code will only be executed if the condition in the if statement is met. If there  
            # isn't a 100 field in the record, this code will be skipped.
            author_name = record['100']['a']
            print(author_name)
            print(main_title)
        # Now we can say what to do if our "if" condition *isn't* met. (In this small MARC sample, all of the
        # records have 100 fields, so we won't end up reaching this part of the loop in this example.)
        else :
            print('No author')
            print(main_title)       
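
Because every record in our small sample has a 100 field, the `else` branch above never actually runs. Here is a minimal sketch that exercises both branches. It assumes plain Python dictionaries as stand-ins for records (these are not Pymarc objects, and the second record is invented for illustration).

```python
# Hypothetical stand-ins for parsed records: plain dictionaries keyed by MARC
# tag, NOT real Pymarc objects. The second record is invented for illustration.
records = [
    {'100': {'a': 'Davila, Arrigo Caterino,'},
     '245': {'a': 'The history of the civil wars of France.'}},
    {'245': {'a': 'An anonymous pamphlet.'}},  # no 100 field
]

results = []
for record in records:
    main_title = record['245']['a']
    # dict.get returns None (which counts as false) when the key is missing,
    # so the else branch runs for the second record.
    if record.get('100'):
        author_name = record['100']['a']
    else:
        author_name = 'No author'
    results.append((author_name, main_title))
    print(author_name)
    print(main_title)
```

The point is only the shape of the `if`/`else`: check that the field exists first, and ask for the subfield only once we know it is there.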

## Data types and structures ##

As with any language, Python can work with different types of data, including (among others): strings of characters; various kinds of numbers (integers, or whole numbers, and "floats"--numbers with decimals); "lists" (simple collections of data); and "dictionaries" (mapped collections of information with values corresponding to keys). For a discussion of Python's native data types, see: https://docs.python.org/3/library/stdtypes.html

What follows is by no means an exhaustive tour of different data types, but just an introductory overview of some basics that you can start working with pretty quickly. The fact that we're working with MARC data means we need first to deal with strings: because a MARC record is simply a plaintext file, everything in it is actually a string of text, no matter what it looks like.

### Manipulating strings ###
If you take a look at the output of the last example, you'll see that the authors' names all end with commas and most of the titles end with periods. That kind of punctuation is included in the MARC specification, which ensures that records are formatted correctly when they are displayed by a library's ILS. Depending on what we were doing, we might want to get rid of those commas. 

Python gives us lots of ways to manipulate strings. This is far too big a topic to cover in much depth here (documentation for Python's native string methods is available here: https://docs.python.org/2/library/stdtypes.html#string-methods), but there are a few things we can do pretty quickly.

In the next few examples, we'll dispense with reading the MARC file and just work with an author's name we already know about from our sample.

#### Stripping #### 
There are several ways to clean up white space and text at the beginnings and ends of strings. The `strip` command cleans up unwanted characters from both the beginnings and ends of strings, while `lstrip` and `rstrip` remove unwanted characters from the left and right ends of a string, respectively. By default, all of these commands remove white space, but we can indicate the characters we want to remove instead.

In [6]:
## Stripping

## For the sake of convenience, we'll just set the variable of author_name to a string.
author_name = "Davila, Arrigo Caterino,"

## Let's get rid of that trailing comma by using "rstrip" to remove a defined character (the comma) from the right
## end of the string. 

author_name_stripped = author_name.rstrip(',')
print(author_name_stripped)

Davila, Arrigo Caterino
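
To compare the three commands side by side, here is a short sketch using static strings. The padded string and the "messy" name are made up for illustration. Note that the argument we pass is treated as a *set* of characters to remove, not as a fixed suffix.

```python
# A made-up string with white space on both ends.
padded = '  London :  '

stripped_both = padded.strip()     # 'London :'
stripped_left = padded.lstrip()    # 'London :  '
stripped_right = padded.rstrip()   # '  London :'

# With an argument, rstrip removes any run of the listed characters from the
# right end, in whatever order they occur.
pub_city = 'London :'.rstrip(' :')                    # 'London'
messy = 'Davila, Arrigo Caterino,.,'.rstrip(',.')     # 'Davila, Arrigo Caterino'

print(stripped_both)
print(pub_city)
print(messy)
```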


#### Partitioning ####
There are plenty of times when we might want to split a string of text whenever we encounter a certain separator character. The `partition` command is a little more esoteric, but can come in handy. Given a string and a separator, `partition` returns three pieces of information: the portion of the string before the first occurrence of the separator, the separator itself, and the portion of the string after the separator. These three pieces of information are returned as a tuple (a sequence that, like the lists we'll look at in a little bit, lets us retrieve items by their position). In Python, as in many other languages, we begin counting at zero rather than one (which can be tricky to remember at first).

In [7]:
# Partitioning

# If we partition author_name_stripped on the comma, we'll be able to rearrange the author's name in firstname 
# middlename lastname order by putting the third part of our partition before the first part. Keep in mind that, in 
# Python as with many other languages, we begin counting at 0 rather than at 1.

name_parts = author_name_stripped.partition(', ')
print(name_parts[2] + ' ' + name_parts[0])

Arrigo Caterino Davila
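
It can help to look at the three parts directly, and to see what happens when the separator isn't found (as with a single-word name like "Homer."). The helper name `display_name` below is hypothetical, just a sketch of one way to guard against that case.

```python
name_parts = 'Davila, Arrigo Caterino'.partition(', ')
print(name_parts)   # a tuple of three strings: before, separator, after

# If the separator is missing, we get the whole string followed by two
# empty strings.
no_separator = 'Homer.'.partition(', ')
print(no_separator)

def display_name(inverted_name):
    # Hypothetical helper: rearrange "Last, First" as "First Last", leaving
    # names without a comma untouched.
    last, separator, rest = inverted_name.partition(', ')
    if separator:
        return rest + ' ' + last
    return inverted_name

print(display_name('Davila, Arrigo Caterino'))
print(display_name('Homer.'))
```

Without the check on `separator`, the single-word name would come back with a stray leading space.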


#### Substrings ####
Sometimes we want to extract just a portion of a longer string. There are several different ways to do this, but the simplest case would be one where we know that we're looking for a substring of a certain length occurring at a certain known position in a longer string.

Let's consider the MARC 008 control field, which is a fixed-length field that reports structured information about the record, with different character positions reserved for different pieces of information. Characters 7-10 (and, in some cases, 11-14) provide regularized publication dates for the title described by the record. Those dates can mean different things, depending on the record (the nature of the dates is reported in character 6), but for this example we can overlook those nuances and say that characters 7-10 represent the publication date.

In the example below, we get the contents of the MARC 008 field by using the .data syntax, and then specify the portion of that string we want: `[7:11]`. This seems strange. If we want characters 7-10, why 11? What's happening here is that our result begins with the character at our starting index position and goes up to *but not including* the character at our ending position.

(It's worth pointing out here that a lot of this syntax will come up again when we begin working with lists. That's because, in a sense, a string is really just a list of characters.)

In [None]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    for record in reader :
        pub_year = record['008'].data[7:11]
        print(pub_year)
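
To check the slice arithmetic without opening the MARC file, we can work with a static copy of the 008 value from the first record in our sample. This is only a sketch: the string is assembled by hand so that the runs of blank spaces are explicit.

```python
# The 008 value from the first sample record, built piece by piece so the
# number of blanks in each run is unambiguous.
field_008 = '840710s1758' + ' ' * 4 + 'enk||||' + ' ' * 7 + '00| ||eng c'

date_type = field_008[6]     # a single character: the type of date
date_1 = field_008[7:11]     # characters 7, 8, 9 and 10
date_2 = field_008[11:15]    # characters 11-14 (blank in this record)

print(len(field_008))        # 40
print(date_type)             # s
print(date_1)                # 1758
# A handy check: the length of a slice is the end index minus the start index.
print(len(date_1) == 11 - 7)
```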

Let's look at a couple of other substring examples. Most of these will seem a little arbitrary, but there may be times when we'd need them. Comment and uncomment the various print commands in the cell below to see what they do.

In [None]:
# For this example, we'll forego reading the MARC file and just work with some static strings.
main_title = 'Observations on the small pox: or, An essay to discover a more effectual method of cure.'
author_name = 'Holland, Richard,'

# Retrieve the first 20 characters of the title (this is the same as saying [0:20])
print(main_title[:20])

# Retrieve the last 20 characters of the title (start at the 20th character from the end and continue to 
# the end of the string).
#print(main_title[-20:])

# We can combine string operations to do more obviously sensible things, like getting the portion of the
# title up to the colon...
#print(main_title[0:main_title.find(':')])

# ... or the author's last name, by getting the portion up to the comma...
#print(author_name[0:author_name.find(',')])

# ... or the author's first name, by getting the portion after the first comma and white space, and up to the 
# last comma. (Though, really, partition(',') would be easier for this.):
#print(author_name[author_name.find(', ')+2:author_name.rfind(',')])
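
One thing to watch for when combining slices with `find`: if the character we're looking for isn't in the string, `find` returns -1, and a slice ending at -1 quietly drops the last character. The sketch below shows the problem and one possible guard. The helper name `text_before` is hypothetical.

```python
title_with_colon = 'Observations on the small pox: or, An essay to discover a more effectual method of cure.'
title_without_colon = 'The beau and the academick.'

# find returns -1 when there is no colon, so this slice is really [0:-1]
# and the final period disappears.
naive = title_without_colon[0:title_without_colon.find(':')]
print(naive)

def text_before(text, separator):
    # Hypothetical helper: return the text before the separator, or the
    # whole text if the separator isn't present.
    position = text.find(separator)
    if position == -1:
        return text
    return text[:position]

print(text_before(title_with_colon, ':'))
print(text_before(title_without_colon, ':'))
```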

#### Converting strings to numbers ####
In an earlier example, we extracted four characters from the 008 control field to get a regularized publication year. But, crucially, while those are numerical characters, as far as the computer is concerned, they don't represent a number. Run the code in the cell below and study the error message that we get when we try to calculate how long ago a book was published by subtracting the publication date from the current year.

In [None]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    for record in reader :
        pub_year = record['008'].data[7:11]
        print(pub_year)
        years_ago = 2017 - pub_year
        print(years_ago)       

It turns out, you can't subtract a string from an integer. (In this case, we're dealing with a unicode string, since our file of MARC records is encoded as UTF-8.) Despite what it looks like to us, the value of `pub_year`--"1758"--isn't something that our script can recognize as a number. We can confirm the problem by inspecting the datatype of `pub_year`.

In [8]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    for record in reader :
        pub_year = record['008'].data[7:11]
        print(pub_year)
        print(type(pub_year))

1758
<type 'unicode'>
1728
<type 'unicode'>
1733
<type 'unicode'>
1767
<type 'unicode'>
1710
<type 'unicode'>


If we want to use the string of characters we extracted from the 008 control field as a year in calculations, we need to convert it from a unicode string to an integer. (Note, though, that in the last line I've had to convert that integer *back to a string* in order to combine it with the words " years ago."!)

In [11]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    for record in reader :
        pub_year = record['008'].data[7:11]
        print(pub_year)
        years_ago = 2017 - int(pub_year)
        print(str(years_ago) + ' years ago.')

1758
259 years ago.
1728
289 years ago.
1733
284 years ago.
1767
250 years ago.
1710
307 years ago.
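
The conversion works here because all five sample records have four digits in positions 7-10. That isn't guaranteed: MARC allows blanks in those positions and the letter "u" for unknown digits, and `int()` raises a `ValueError` on such values. Here is a minimal sketch of one way to cope. The function name `years_since` and the sample value '17uu' are made up for illustration.

```python
def years_since(year_text, current_year=2017):
    # Hypothetical helper: return the number of years between year_text and
    # current_year, or None if year_text can't be read as a whole number.
    try:
        return current_year - int(year_text)
    except ValueError:
        return None

print(years_since('1758'))   # 259
print(years_since('17uu'))   # None: "u" stands for an unknown digit
print(years_since('    '))   # None: four blanks
```

Returning `None` is just one choice; depending on the task, we might prefer to skip the record or to collect the problem values in a list for later review.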


#### Concatenating strings ####
There's probably lots more you could find yourself needing to do with strings, but my hope is that these examples will give you enough of a starting point to figure out the syntax explained in Python documentation or that you find online. That last example, though, does provide an occasion to talk about concatenating--i.e., chaining together--strings. In Python, this is done by simply using plus signs.

In [None]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    for record in reader :
        author_name = record['100']['a'].rstrip(',')
        main_title = record['245']['a']
        pub_city = record['260']['a'].rstrip(' :')
        imprint = record['260']['b']
        pub_year = record['008'].data[7:11]
        print(author_name + '. ' + main_title + ' (' + pub_city + ': ' + imprint + ' ' + pub_year + '.)')
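
Long chains of plus signs can get hard to read. As an alternative (not the approach used in the cell above), Python's `format` method lets us write the pattern once and fill in the blanks. The sketch below uses static values from the first sample record. The imprint is shortened here for illustration.

```python
author_name = 'Davila, Arrigo Caterino'
main_title = 'The history of the civil wars of France.'
pub_city = 'London'
imprint = 'printed for D. Browne,'   # shortened for illustration
pub_year = '1758'

# Concatenation with plus signs, as in the cell above.
with_plus = author_name + '. ' + main_title + ' (' + pub_city + ': ' + imprint + ' ' + pub_year + '.)'

# The same result with format: each {n} is replaced by the matching argument.
with_format = '{0}. {1} ({2}: {3} {4}.)'.format(author_name, main_title, pub_city, imprint, pub_year)

print(with_format)
print(with_plus == with_format)

# format also converts numbers for us, so no str() call is needed.
print('{0} years ago.'.format(2017 - int(pub_year)))
```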

### Working with lists ###

We've already gotten a bit of a preview of working with lists from some of our work with substrings, but let's take a closer look. In the next example, we'll create a list to hold the names of the authors for our records and add the authors' names to that list as we iterate through the MARC records in the file.

In [None]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    # Create a variable named "authors" that is an empty list. Note that this variable is created *outside* our for
    # loop...
    authors = []
    for record in reader :
        author_name = record['100']['a'].rstrip(',')
        # Now append each author_name to our list of authors.
        authors.append(author_name)
    # Note that we've left the "for" loop. In our earlier examples, the values of our variables were only available
    # inside the for loop--this time, we collected the values as we looped through the MARC records and stored
    # them in a list where we can get to them even after the loop has finished.
    print(authors)

We access the individual items in the list by using their index numbers (this should look familiar from our substring examples). Note that we can also check to see how many items are in a list using `len()`. Comment and uncomment the print commands below to see how different list indices work.

In [None]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    authors = []
    for record in reader :
        author_name = record['100']['a'].rstrip(',')
        authors.append(author_name)
    print(authors)
    print(len(authors))
    #print(authors[1])
    #print(authors[0])
    #print(authors[-1])
    #print(authors[0:3])
    #print(authors[0:-2])
    #print(authors[:2])
    #print(authors[3:])   
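
If you'd like to predict what each of those commands returns before running them, here is the same set of indices applied to a small static list. The first two names come from our sample. The last three are placeholders invented for illustration.

```python
# 'Author C', 'Author D' and 'Author E' are invented placeholders.
authors = ['Davila, Arrigo Caterino', 'Holland, Richard', 'Author C', 'Author D', 'Author E']

print(len(authors))    # 5
print(authors[1])      # the second item, because we count from 0
print(authors[0])      # the first item
print(authors[-1])     # the last item
print(authors[0:3])    # items 0, 1 and 2
print(authors[0:-2])   # everything except the last two items
print(authors[:2])     # items 0 and 1
print(authors[3:])     # item 3 through the end of the list
```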

We can iterate through the items in a list in the same way we've been iterating through the MARC records in our file.

In [None]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-sample.mrc','r') as infile :
    reader = MARCReader(infile)
    authors = []
    for record in reader :
        author_name = record['100']['a'].rstrip(',')
        authors.append(author_name)
    # Note that "author" is an arbitrary name. We could say "for i in authors" and it would work just as well
    for author in authors :
        print(author)

In our examples so far, we've been working with a very small sample set of just five MARC records, and those records all happen to be for titles written by different authors. If we were to work with a larger set of MARC records (as in this next example), we could expect to encounter multiple records with the same author. If we were to add all of the authors to our list as we've been doing so far, we'd end up with duplicates in our list. In some scenarios, that might not be a problem, but we could use an `if` statement to check and make sure that our `author_name` isn't already in the list before adding it. (While we're at it, let's add in a check to make sure that there's actually a 100 field in the record before we try to get a nonexistent 100|a subfield.)

In [None]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-full.mrc','r') as infile :
    reader = MARCReader(infile)
    authors = []
    for record in reader :
        if record['100'] :
            author_name = record['100']['a'].rstrip(',')
            if author_name not in authors :
                authors.append(author_name)
    print(len(authors))
    for author in authors :
        print(author)
    # If we wanted to, we could sort the list alphabetically:
    #authors.sort()
    #for author in authors :
        #print(author)
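
The `not in` check works well for a file of this size. Another option, swapped in here purely as an alternative, is Python's `set` type, which discards duplicates automatically. A set does not remember the order in which items were added, so this sketch sorts the result. The list of names is a short made-up sequence using authors from our data.

```python
# A made-up sequence of names with repeats, standing in for the values we'd
# collect from the 100|a subfields.
names = ['White, John', 'Nelson, Robert', 'White, John', 'Homer.', 'Nelson, Robert']

# The approach from the cell above: keep a name only if we haven't seen it.
unique_in_order = []
for name in names:
    if name not in unique_in_order:
        unique_in_order.append(name)

# The alternative: let a set drop the duplicates, then sort for display.
unique_sorted = sorted(set(names))

print(unique_in_order)
print(unique_sorted)
```

The list version preserves the order in which authors first appear in the file. The set version does not, but it avoids scanning the whole list for every record.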

### Working with dictionaries ###

While lists are useful for some kinds of work, other situations call for a different kind of data structure. A dictionary holds data as a series of key/value pairs. Note that the values of a dictionary can be any combination of strings, numbers, lists, and even other dictionaries, so a dictionary can represent very complex information (and can get correspondingly complex to deal with in a hurry!). 

We could imagine holding data about a person in a dictionary like this, for example:

`{'first_name': 'John', 'last_name': 'Doe', 'age': 73, 'hobbies': ['birding','stamp collecting','freestyle polka'], 'family_members': {'father': 'James Doe', 'mother': 'Mary Doe', 'sister': 'Ann Doe'}, 'eye_color': 'brown'}`
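
To make the nesting concrete, here is a short sketch of reading values back out of that dictionary. The key `'shoe_size'` is a made-up example of a key that isn't present.

```python
person = {'first_name': 'John', 'last_name': 'Doe', 'age': 73,
          'hobbies': ['birding', 'stamp collecting', 'freestyle polka'],
          'family_members': {'father': 'James Doe', 'mother': 'Mary Doe', 'sister': 'Ann Doe'},
          'eye_color': 'brown'}

print(person['first_name'])                 # a string
print(person['age'] + 1)                    # a number we can do arithmetic with
print(person['hobbies'][0])                 # first item of the list stored under 'hobbies'
print(person['family_members']['sister'])   # a value inside the nested dictionary

# Asking for a missing key with square brackets raises a KeyError; get()
# returns None instead.
print(person.get('shoe_size'))
```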

In the next example, we'll create a dictionary rather than a list. We'll store authors' names as keys and keep a running count of the number of records by each author as the value. Because we don't know in advance what author names we might encounter, we'll use the `setdefault` command to add new keys to our dictionary if they aren't already present.

In [13]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-full.mrc','r') as infile :
    reader = MARCReader(infile)
    # Create the variable authors, this time as an empty dictionary, rather than an empty list. (Note the 
    # curly braces in place of the square brackets we used in our earlier example.)
    authors = {}
    for record in reader :
        if record['100'] :
            author_name = record['100']['a'].rstrip(',')
            # We use setdefault: if author_name is not already present as a key in our dictionary, we'll
            # add a new key,value pair with the author_name as the key and 0 as the value.
            authors.setdefault(author_name,0)
            # Now we increment the value associated with the key for this author_name. If that wasn't already
            # present, we just created it with value 0, so we'll be making the value equal one. If that key
            # *was* already present, the setdefault command won't have done anything, so we'll be adding one
            # to the value that was there (e.g., increasing 1 to 2, 12 to 13, etc.)
            authors[author_name] += 1
    for author in authors.items() :
        print(author)

(u'Pagett, Thomas Catesby', 5)
(u'Nelson, Robert', 37)
(u'Gerard, Alexander', 1)
(u'Mauger, Claude.', 1)
(u'Gordon, Robert', 1)
(u'Whitefield, George', 24)
(u'Xenophon.', 3)
(u'Halket, Peter', 1)
(u'Wynne, William', 1)
(u'Underhill, Edward', 1)
(u'Foord, Ellen', 1)
(u'Drelincourt, Charles', 4)
(u'Kinnoull, George Hay', 2)
(u'Buckingham, John Sheffield', 2)
(u'Crowne', 1)
(u'Temple, William', 2)
(u'Fitz-Adam, Adam.', 1)
(u'Loughton, William.', 4)
(u'Awbrey, Timothy', 2)
(u'Craigengelt, Charles.', 1)
(u'White, John', 22)
(u'Mosheim, Johann Lorenz', 3)
(u'Homer.', 13)
(u'Florus, Lucius Annaeus.', 2)
(u'Solleysel, Jacques de', 1)
(u'Lobb, Theophilus', 2)
(u'Pococke, Richard', 2)
(u'Dunbar, Archibald.', 1)
(u'Langford', 2)
(u'Devonshire, William Cavendish', 1)
(u'Ribadeneyra, Pedro de', 1)
(u'Burke, William', 1)
(u'Burton, Thomas', 1)
(u'Ra\u0304zi\u0304, Abu\u0304 Bakr Muh\u0307ammad ibn Zakari\u0304ya', 1)
(u'Harris, Walter', 3)
(u'Morando, Bernardo', 2)
(u'Roper, William', 1)
(u'Gisbert,
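
The counting pattern is easier to follow on a handful of values. The sketch below runs the same `setdefault` logic over a short made-up list of names, then compares the result with `Counter` from Python's standard `collections` module, which is an alternative way to do this kind of tallying (not the approach used in the cell above).

```python
from collections import Counter

# A made-up sequence of names standing in for the 100|a values in a file.
names = ['White, John', 'Nelson, Robert', 'White, John', 'Homer.', 'White, John']

counts = {}
for name in names:
    # Add the key with a starting value of 0 if it isn't there yet...
    counts.setdefault(name, 0)
    # ...then add one for the record we're looking at.
    counts[name] += 1

print(counts['White, John'])

# Counter does the same tallying in one step.
tallied = Counter(names)
print(dict(tallied) == counts)
```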

Working with the contents of a dictionary is somewhat similar to working with a list, but with some twists because of the more complex key,value structure. Rather than accessing entries in the dictionary by index number, we use the key. 

In [None]:
from pymarc import MARCReader
with open('data/Bowyer_from_ESTC-full.mrc','r') as infile :
    reader = MARCReader(infile)
    authors = {}
    for record in reader :
        if record['100'] :
            author_name = record['100']['a'].rstrip(',')
            authors.setdefault(author_name,0)
            authors[author_name] += 1
    
    # Let's see how many distinct authors we found. Just as with a list, we can use len() directly on a
    # dictionary: it returns the number of keys.
    print(len(authors))
    
    # We access a dictionary entry not with an index number, as we did with lists, but with the key.
    #print(authors['Pagett, Thomas Catesby'])

    # Iterating over a dictionary directly gives us only its keys. To get each key together with its value,
    # we can use iteritems() (this is Python 2.7; in Python 3, use items() instead).
    #for author, count in authors.iteritems() :
     #   print(author + ': ' + str(count))
        
    # Dictionaries are, by nature, unsorted, but we can sort them at the time we display them, if we really want to:
    #for author in sorted(authors) :
     #   print(author + ': ' + str(authors[author]))
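
Sorting by key gives us an alphabetical list, but with counts we often want the most prolific authors first. Here is a minimal sketch of sorting by value instead, using a few of the author counts from the output above as a static dictionary.

```python
# A few author counts copied from the earlier output.
authors = {'Nelson, Robert': 37, 'Gerard, Alexander': 1, 'Whitefield, George': 24,
           'White, John': 22, 'Homer.': 13}

# items() gives (key, value) pairs; the key function tells sorted() to compare
# the second element of each pair (the count), and reverse=True puts the
# largest counts first.
by_count = sorted(authors.items(), key=lambda pair: pair[1], reverse=True)

for author, count in by_count:
    print(author + ': ' + str(count))
```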
    

## Conclusions ##

This notebook has touched on a lot of different techniques, but none of them in too much depth. It may, though, provide a starting point and reference for some of the things that can come up most commonly in working with bibliographic data in MARC format.

Another notebook in this directory (03-Living vs dead authors printed by Wm Bowyer (Case Study).ipynb) uses many of these techniques (and a few others) to take on a practical question from start to finish: of the titles printed by William Bowyer, how many were written by living authors and how many by dead ones? The next notebook in this directory (02-Parsing MARCXML with Python.ipynb) offers a quick look at a different flavor of MARC data--MARCXML, or MARC expressed in XML.