# Working with MARCXML in Python #

This notebook offers a very brief introduction to parsing XML using the BeautifulSoup package for Python. I don't know that BeautifulSoup is the most powerful tool for working with XML, but it has done everything I've needed when getting information out of XML files. It's easy to use; the developer offers good documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/); and you'll find lots of tips and tutorials online (on Stack Overflow and elsewhere). BeautifulSoup is generally used for parsing HTML--which makes it an excellent choice for web scraping--but it also works quite well with XML when the lxml package is installed, as the code in this notebook assumes.

As you'll see in the examples below, we're not quite using XPath to get to the parts of the document tree that interest us, but the thinking is pretty similar.

## Reading MARCXML ##

MARCXML is pretty much just MARC represented as XML. It's a nested hierarchy of tagged elements (so, lots of angle brackets), but you still have to know your MARC field codes to make any use of it. (For an amusing rant that's maybe not entirely fair to MARC, but pretty accurate on why MARCXML might not feel like "real" XML, see here: https://shelterit.blogspot.com/2008/09/marcxml-beast-of-burden.html.) 
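
To make that nesting concrete, here is a minimal sketch that outlines one record. The record is invented for illustration (it is not from the ESTC sample file), and I'm using Python's built-in xml.etree.ElementTree rather than BeautifulSoup so that the snippet runs without any extra packages. The namespace URI is the one MARCXML files normally declare.

```python
import xml.etree.ElementTree as ET

# Invented sample record for illustration only (not from the ESTC file).
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <leader>00000nam a2200000 a 4500</leader>
    <controlfield tag="001">X000001</controlfield>
    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Doe, Jane,</subfield>
      <subfield code="d">1700-1760.</subfield>
    </datafield>
    <datafield tag="245" ind1="1" ind2="0">
      <subfield code="a">An example title :</subfield>
      <subfield code="b">with a subtitle.</subfield>
    </datafield>
  </record>
</collection>"""

NS = {"marc": "http://www.loc.gov/MARC21/slim"}
root = ET.fromstring(SAMPLE)


def describe(record):
    """Return (MARC tag, list of subfield codes) for each field in a record."""
    fields = []
    for field in record:
        # ElementTree stores names as '{namespace}localname'; keep the local part.
        local_name = field.tag.split('}')[-1]
        if local_name in ('controlfield', 'datafield'):
            codes = [sf.get('code') for sf in field]
            fields.append((field.get('tag'), codes))
    return fields


outline = describe(root.find('marc:record', NS))
print(outline)
```

The element names (record, controlfield, datafield, subfield) only tell you the shape of the data. The meaning still lives in the tag and code attributes, which is why you need your MARC field codes.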

### Sticking with what you already know: parsing MARCXML with Pymarc ###
The Pymarc library we used to work with MARC21 files can also read MARCXML files and turn them into the same kind of record objects on the fly. So, if you're just working with MARCXML files, it's a simple matter to read them with the parse_xml_to_array function in Pymarc's marcxml module and use the same methods you used in the other notebook.

In [None]:
# Note we're importing the marcxml module rather than MARCReader
from pymarc import marcxml
# Parse the MARCXML into a list of Pymarc records so that we can work with it using Pymarc's usual methods.
reader = marcxml.parse_xml_to_array('/media/sf_RBSDigitalApproaches/data/0611-Sess1-data/Bowyer_from_ESTC-sample-marcxml.xml', strict=True)
# Now you're on familiar territory from the notebook on working with MARC21.
for record in reader :
    print(record['001'].data)
    # get_subfields() returns only the subfields that are present, so a title without a $b
    # doesn't break the string concatenation.
    print(' '.join(record['245'].get_subfields('a', 'b')))

### Introducing BeautifulSoup ###
While Pymarc is a great solution for working with MARCXML files, I wanted to introduce doing the same things with BeautifulSoup, because BeautifulSoup opens a way to working with XML more generally: you may run across data of many kinds distributed as XML, so it could be helpful to have a few tricks for getting information out.

This example is quite brief. It passes over questions about how MARC represents bibliographic data and doesn't comment too much on the Python code (both of which are covered in more detail in the notebook handling MARC21 files). I've focused the comments just on using BeautifulSoup to deal with XML in hopes that you can transfer what you've learned about MARC to this new expression of the same data structure.

In [None]:
# Import BeautifulSoup
from bs4 import BeautifulSoup
# Open the file
with open('/media/sf_RBSDigitalApproaches/data/0611-Sess1-data/Bowyer_from_ESTC-sample-marcxml.xml', 'r') as infile :
    # Hand the file off to BeautifulSoup. The 'xml' here tells BeautifulSoup that it should use the lxml package
    # as its parser. (Note that lxml needs to be installed for this to work.)
    soup = BeautifulSoup(infile, 'xml')
    # BeautifulSoup looks for all the "record" elements and holds them and their nested contents in the variable
    # "records." Calling soup('record') is shorthand for soup.find_all('record'). (Obviously, you need to know
    # something about the structure of the XML files you're working with. You'd need to have a look at the
    # file--preferably in an XML-savvy text editor--to know what the important tag names are.)
    records = soup('record')

    # Iterate through the "record" elements
    for record in records :
        # In a quasi-XPath-like way, we look for a "controlfield" element whose "tag" attribute has the value 001.
        # Work through the next few lines, commenting and uncommenting the successive commands to get a feel for
        # how we access the content of the elements.
        
        # Note how BeautifulSoup returns the result of its search of the document tree as a list--in this case, a list
        # with only one item.
        estc_num = record('controlfield', tag='001')
        
        # To get the actual field, we need to indicate its list index
        #estc_num = record('controlfield', tag='001')[0]
        
        # If we want to get rid of the tags and just get the content of the field, we need to use .string
        #estc_num = record('controlfield', tag='001')[0].string
        
        # Note the AttributeError we get, though, if we forget the list index and try to use .string without it:
        # the search result is a list of elements, and only an individual element has .string.
        #estc_num = record('controlfield', tag='001').string
        print(estc_num)
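
For comparison, here is roughly what the same kind of lookup looks like when it is written as an actual XPath-style expression. This is only a sketch: it uses the limited XPath support in Python's built-in xml.etree.ElementTree instead of BeautifulSoup, and the record is invented for illustration rather than taken from the ESTC sample file.

```python
import xml.etree.ElementTree as ET

# Invented one-record sample for illustration; not taken from the ESTC file.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <controlfield tag="001">X000001</controlfield>
    <datafield tag="500" ind1=" " ind2=" ">
      <subfield code="a">First invented note.</subfield>
    </datafield>
    <datafield tag="500" ind1=" " ind2=" ">
      <subfield code="a">Second invented note.</subfield>
    </datafield>
  </record>
</collection>"""

# ElementTree needs the MARCXML namespace spelled out via a prefix mapping.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}
record = ET.fromstring(SAMPLE).find("marc:record", NS)

# find() returns the first matching element (or None if there is no match).
control = record.find("marc:controlfield[@tag='001']", NS)
estc_num = control.text if control is not None else None

# findall() returns a list of every match, much like BeautifulSoup's result list.
notes = [
    sf.text
    for sf in record.findall(
        "marc:datafield[@tag='500']/marc:subfield[@code='a']", NS
    )
]
print(estc_num, notes)
```

The path expression packs the element name and the attribute test into one string, where BeautifulSoup takes them as separate arguments, but the thinking is the same: name the element, then filter on its attributes.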

As we saw when working with Pymarc, control fields differ from data fields in MARC: they hold a single value, with no indicators or subfields. In the following examples, we'll start working with fields that have subfields.

In [None]:
from bs4 import BeautifulSoup
with open('/media/sf_RBSDigitalApproaches/data/0611-Sess1-data/Bowyer_from_ESTC-sample-marcxml.xml', 'r') as infile :
    soup = BeautifulSoup(infile, 'xml')
    records = soup('record')
    for record in records :
        # Let's make sure there's actually a 100 field before we try to do anything with it.
        if record('datafield', tag='100') :
            # Here, we're getting the 100 datafield and all its nested contents
            author = record('datafield', tag='100')[0]
            print(author)
            
            # Having gotten the 100 element, we can then access its child elements: we're using the author
            # variable we just defined, and then looking for the elements that it contains. (I'm stripping
            # closing punctuation while I'm at it.)
            author_name = author('subfield', code='a')[0].string.rstrip(',')
            # Not every 100 field has a $d, so reset author_dates for each record. Otherwise the dates
            # from an earlier record could carry over, or the variable could be undefined.
            author_dates = None
            if author('subfield', code='d') :
                author_dates = author('subfield', code='d')[0].string.rstrip('.')
            #print(author_name + (' (' + author_dates + ')' if author_dates else ''))
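
Since the dates subfield is optional, it can help to put the formatting in a small function of its own. What follows is a sketch only: format_author is a hypothetical helper name of my own, and the sample strings are invented rather than taken from the ESTC file.

```python
def format_author(name, dates=None):
    """Combine the 100 $a string and an optional $d string for display."""
    # Strip trailing spaces and the comma that usually closes the $a subfield.
    label = name.rstrip(' ,')
    if dates:
        # Only add the parenthesized dates when a $d value was actually found.
        label += ' (' + dates.rstrip(' .') + ')'
    return label


with_dates = format_author('Doe, Jane,', '1700-1760.')
without_dates = format_author('Roe, Richard,')
print(with_dates)
print(without_dates)
```

In the loop above, you could pass the stripped or unstripped subfield strings to a function like this and get a sensible result whether or not the record has a $d.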

Note that, if we're interested in one particular nested element, we can go straight to it in one step--we don't have to capture the parent element and then work down to its children.

In [None]:
from bs4 import BeautifulSoup
with open('/media/sf_RBSDigitalApproaches/data/0611-Sess1-data/Bowyer_from_ESTC-sample-marcxml.xml', 'r') as infile :
    soup = BeautifulSoup(infile, 'xml')
    records = soup('record')
    for record in records :
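        # The chained lookup assumes every record has a 300 field with a $c; a record without one would raise
        # an IndexError. (The name book_format avoids shadowing Python's built-in format() function.)
        book_format = record('datafield', tag='300')[0]('subfield', code='c')[0].string.rstrip('. ')
        print(book_format)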
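
One detail in that cell is easy to misread: rstrip('. ') does not remove the two-character suffix "period, space". It removes any run of periods and spaces from the end of the string. Here is a quick sketch with invented sample values (they only imitate the look of a 300 $c and are not from the ESTC file):

```python
# rstrip() treats its argument as a set of characters, not as a literal suffix.
samples = ['8⁰. ', '4⁰.', 'fol. . ', '12⁰']

# Every trailing '.' and ' ' is removed, in whatever order they appear.
cleaned = [s.rstrip('. ') for s in samples]
print(cleaned)
```

That behavior is handy for tidying closing punctuation, but it also means an abbreviation that legitimately ends in a period loses it. If you only want to drop one exact suffix, str.removesuffix() (Python 3.9 and later) is the stricter tool.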

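So far, we've treated each field as if it occurred only once per record, taking just the first item from BeautifulSoup's list of results. Many MARC fields are repeatable, though, and for those we want *all* of the instances. BeautifulSoup already hands us a list of every match (calling record(...) is shorthand for record.find_all(...)), so we just have to iterate through that list.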

In [None]:
from bs4 import BeautifulSoup
with open('/media/sf_RBSDigitalApproaches/data/0611-Sess1-data/Bowyer_from_ESTC-sample-marcxml.xml', 'r') as infile :
    soup = BeautifulSoup(infile, 'xml')
    records = soup('record')
    
    for record in records :
        estc_num = record('controlfield', tag='001')[0].string
        notes = record.find_all('datafield', tag='500')
        #print(notes)
        # I'm concatenating a bunch of stuff together here. Note that I'm using len() to figure out the length
        # of the list of notes (i.e., how many notes there are). That yields an integer, so I have to turn it into 
        # a string in order to combine it with my other bits of text.
        print('\t' + estc_num + ' has ' + str(len(notes)) + ' general notes.\n')
        
        # Now let's iterate through those notes, getting the content of the a subfield :
        for note in notes :
            print('\t\t* ' + note('subfield', code='a')[0].string)
        print('\n')
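
If converting the integer with str() feels fiddly, an f-string is an alternative way to build the same lines of output. This is a minimal sketch with an invented identifier and invented notes standing in for the values from the loop above.

```python
estc_num = 'X000001'  # invented identifier, standing in for the 001 value
notes = ['First invented note.', 'Second invented note.']  # invented notes

# Inside an f-string, the integer from len() is converted to text automatically.
summary = f'\t{estc_num} has {len(notes)} general notes.\n'

# Build one bulleted line per note, then join them for printing.
bullets = [f'\t\t* {note}' for note in notes]
print(summary)
print('\n'.join(bullets))
```

The output is the same as with concatenation; the f-string just keeps the text and the values together in one place.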