This notebook walks through the process of generating a table of metadata about the figures in the translation of the manuscript.

Firstly, import the `manuscript.py` file from the `manuscript-object` repo.

In [1]:
import sys, os
sys.path.insert(0, os.path.abspath('../..')) # Put the manuscript-object root directory (two levels up) on the import path so `manuscript` and `utils` can be imported.
import manuscript

We will also use some convenient constants from the `utils.py` file.

In [3]:
import utils

For parsing the XML of each entry, we will use the Python library BeautifulSoup.

[Read the documentation here.](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

In [4]:
from bs4 import BeautifulSoup, Tag

Our task is essentially this:
- for each entry in the manuscript, obtain a list of figures in that entry
- for each of those figures, generate a row of spreadsheet data containing the following information:
    - entry ID
    - entry title
    - figure ID
    - figure size
    - figure link
    - figure margin data (if applicable)
    - entry categories
    - the terms present in the entry for each semantic tag
- write all of these rows to a CSV

We will begin with the smallest conceptual unit--the individual figure--and work up from there.

This first function `extract`s information from a figure element.

The actual data type being passed to this function is a `Tag` object parsed by BeautifulSoup. This kind of object is more intelligent than a string of XML; it possesses information about its own attributes and the structure of the XML document. We take advantage of this when searching for margin metadata, since figures in the margin may encode their position as an attribute of the `figure` tag or as an attribute of their enclosing `ab` tag.

For the rest of the metadata, it suffices to simply request the attribute from the element.

The `size` attribute is not included if the figure has the default size `medium`. We would like to see this explicitly, though, so we tell it to default to `medium` if no `size` attribute was found.

Finally we return this metadata as a Python dictionary keyed by what will eventually become the names of the columns of our spreadsheet.

_Note:_ while it is theoretically possible in XML that there could be an intermediate enclosing tag between the `ab` and the `figure` which would ruin our ability to request the margin attribute from the enclosing `ab` tag, in fact the manuscript is encoded such that this does not happen.

So, the following _never_ happens:
```xml
<ab margin='blah'>
<intermediate_parent_tag>
<figure></figure>
</intermediate_parent_tag>
</ab>
```
This means we can trust that the margin metadata is either in the attributes of the figure tag, the attributes of its parent, or it is not there at all.

See the end of this notebook for proof of this fact.

In [5]:
def extract(figure):
    size = figure.get('size') or 'medium'
    fig_id = figure.get('id')
    link = figure.get('link')
    
    # Use a short-circuiting 'or' to search for margin metadata first in the tag's attributes,
    # then in the attributes of the enclosing `ab` tag, and then finally return None if neither one yielded data.
    margin = figure.get('margin') or figure.parent.get('margin')
        
    return {'size':size,
            'fig_id':fig_id,
            'margin':margin,
            'link':link}
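
A quick way to see the fallback behaviour is to run `extract` on a small hand-written snippet. The sketch below repeats the function so that it runs on its own. The sample XML, its IDs, and its links are invented for illustration, and it uses the built-in `html.parser` rather than `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

def extract(figure):
    size = figure.get('size') or 'medium'
    margin = figure.get('margin') or figure.parent.get('margin')
    return {'size': size,
            'fig_id': figure.get('id'),
            'margin': margin,
            'link': figure.get('link')}

# Invented sample: three figures covering the three margin cases.
sample = """
<div>
<ab margin="left-middle"><figure id="fig_a" size="small" link="http://example.org/a"></figure></ab>
<ab><figure id="fig_b" margin="left-bottom" link="http://example.org/b"></figure></ab>
<ab><figure id="fig_c" link="http://example.org/c"></figure></ab>
</div>
"""

soup = BeautifulSoup(sample, 'html.parser')
sample_rows = [extract(fig) for fig in soup.find_all('figure')]

for row in sample_rows:
    print(row['fig_id'], row['size'], row['margin'])
```

The first figure inherits its margin from the enclosing `ab`, the second carries its own, and the third has none, so its `margin` is `None` and its `size` falls back to `medium`.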

The next conceptual stage to consider is the level of the entry. Each entry may contain several figures.

The actual data type being given as input to be `transform`ed into several rows of data is an `Entry` object as defined in `entry.py`. This custom class possesses information about its ID, title, categories, properties (i.e. semantic tags), and XML contents, among other things. This metadata combined with what we get from each individual figure tag accounts for each of the columns we wish to represent in our spreadsheet.

After we get the basic metadata information, we parse the XML content of the entry using BeautifulSoup. This gives us a "souped" XML element representing the root. We use the `find_all` method to recursively get every `figure` element in the entry as a list. These are exactly the objects we pass to `extract` in order to get their particular metadata.

We use a list comprehension to `extract` the metadata from all of the figures, giving us all of the rows we need from this entry.
Then it remains to add to each of these rows the metadata about the entry itself. We do this in a `for` loop that simply adds more key-value pairs to the dictionary. We have a nested `for` loop which adds all of the semantic tags and associated lists of terms as individual columns.

In [6]:
def transform(entry):
    entry_id = entry.identity
    categories = entry.categories
    title = entry.title
    
    soup = BeautifulSoup(entry.xml_string, 'lxml')
    figures = soup.find_all('figure')
    data = [extract(fig) for fig in figures]
    
    for row in data:
        row['title'] = title
        row['entry_id'] = entry_id
        row['categories'] = categories
        for prop, tag in utils.prop_dict.items():
            row[tag] = entry.properties[prop]
            
    return data
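
To illustrate how entry-level metadata is copied onto every figure row, here is a self-contained sketch. It does not use the real `Entry` class or `utils.prop_dict`: the `SimpleNamespace` object, the two-item `prop_dict`, and the XML string are all hypothetical stand-ins, and `html.parser` replaces `lxml` only so the sketch has fewer dependencies.

```python
from types import SimpleNamespace
from bs4 import BeautifulSoup

# Hypothetical stand-in for utils.prop_dict (property name -> column name).
prop_dict = {'animal': 'al', 'tool': 'tl'}

def extract(figure):
    return {'size': figure.get('size') or 'medium',
            'fig_id': figure.get('id'),
            'margin': figure.get('margin') or figure.parent.get('margin'),
            'link': figure.get('link')}

def transform(entry):
    soup = BeautifulSoup(entry.xml_string, 'html.parser')
    data = [extract(fig) for fig in soup.find_all('figure')]
    for row in data:
        # Every figure row receives the same entry-level metadata.
        row['title'] = entry.title
        row['entry_id'] = entry.identity
        row['categories'] = entry.categories
        for prop, tag in prop_dict.items():
            row[tag] = entry.properties[prop]
    return data

# Hypothetical entry with two figures; only the attributes used above are set.
fake_entry = SimpleNamespace(
    identity='170r6',
    title='Cleaning closed molds',
    categories=['casting'],
    properties={'animal': [], 'tool': ['chaple', 'molds']},
    xml_string='<div>'
               '<ab margin="left-bottom"><figure id="fig_1" size="x-small"></figure></ab>'
               '<ab><figure id="fig_2" size="x-small"></figure></ab>'
               '</div>',
)

entry_rows = transform(fake_entry)
print(len(entry_rows), [r['fig_id'] for r in entry_rows])
```

One entry with two figures yields two rows that differ only in the figure-level fields.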

With the ability to now transform an entry object into several rows of figure metadata, it remains to apply this transformation to each of the entries in the manuscript.

First, we must generate the manuscript object and all of the entry objects from the `ms-xml/tl` path.

In [8]:
%%capture
m = manuscript.Manuscript.from_dirs(utils.tl_path)

Then all we must do is iterate over the entries, transforming each one and extending our initially empty list of rows as we go.

In [14]:
rows = []
for entry in m.entries['tl'].values():
    rows.extend(transform(entry))

Since there are 162 figures in the manuscript, we should expect there to be 162 rows.

In [15]:
len(rows)

162

The most convenient way to write this data to a CSV file is with the `DataFrame` class from the pandas library.

In [11]:
from pandas import DataFrame

Here we define the order and names of our spreadsheet columns, which match the keys of our dictionaries.

Instead of painstakingly writing out each of the semantic tags, we cleverly use the `prop_dict` constant from `utils.py` and `*`-expand the abbreviated values into items in this list of column names.

In [16]:
columns=['entry_id', 'title', 'fig_id', 'link', 'size', 'margin', 'categories', *utils.prop_dict.values()]
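
As a small illustration of the `*`-expansion, the sketch below uses a three-item dictionary in place of `utils.prop_dict`. Its keys and values are hypothetical; only the mechanism is the point.

```python
# Hypothetical stand-in for utils.prop_dict (property name -> column name).
prop_dict = {'animal': 'al', 'body_part': 'bp', 'currency': 'cn'}

# `*` unpacks the dictionary's values, in insertion order, into the list literal.
demo_columns = ['entry_id', 'title', *prop_dict.values()]
print(demo_columns)
```

Because dictionaries preserve insertion order, the semantic-tag columns appear in the same order in which `prop_dict` defines them.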

The `DataFrame` class has a constructor method, `from_records`, designed to build a table from a list of dictionary records like ours. We provide it our data and our desired columns and it returns the table.

In [17]:
df = DataFrame.from_records(rows, columns=columns)

In [18]:
df

Unnamed: 0,entry_id,title,fig_id,link,size,margin,categories,al,bp,cn,...,tl,tmp,wp,de,el,it,la,fr,oc,po
0,3r3,Thick varnish for planks,fig_p003v_1,https://drive.google.com/open?id=0B9-oNrvWdlO5...,medium,,[varnish],[],[],"[sous, sous, sous]",...,"[vessel, vessel, vessel, oven, copper vessel, ...","[months, times past, hour]",[],[],[],[],[pix græca],[tou],[],[warp]
1,6v1,For cages,fig_p006v_1,https://drive.google.com/open?id=0B9-oNrvWdlO5...,medium,,[glass process],[],[hand],[],...,"[enamel cannules, cannules, cutting file, wood...",[],[],[],[],[],[],"[ga, cach]",[],[]
2,14r1,For walls of earth and rustic construction,fig_p014r_1,https://drive.google.com/open?id=0B9-oNrvWdlO5...,small,left-middle,[household and daily life],[Swallows],[foot],[],...,"[ditch-spade, measuring line, long <-ch-> pole...",[],[],[],[],[],[],"[arene, la, ch, m, a]",[],"[ditch-spade, tamp, tamps]"
3,16r1,Founding of soft iron,fig_p016r_1,https://drive.google.com/open?id=0B9-oNrvWdlO5...,large,left-middle,[metal process],[],[handfuls],[],...,"[small forges, iron pots, furnace, blast-pipe,...",[],[],[],[],[],[],"[d, pan, pan, pans, pan, pan, pan, pan, f, d, ...",[],[]
4,17r1,On the gunner,fig_p019r_1,https://drive.google.com/open?id=0B9-oNrvWdlO5...,x-small,,[arms and armor],"[horses, horses, horses, horses, horses, good ...",[thumb],"[lb, lb, lb, lb]",...,"[ladle, ladles, cauldrons, burin, burin, ladle...","[longer, day, by night]","[cannon, cannon-perrier, small, short cannons,...",[],[],[],[],"[gr, pans, pan, pan, pan, ce, pan, cano, metal...",[],[]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
157,167r1,Petards,fig_p168r_1,https://drive.google.com/open?id=0B9-oNrvWdlO5...,small,left-middle,[arms and armor],"[oxen, mules]","[thumbs breadth, fingers, saliva, hand]",[],...,"[crucible, forks, pegs, tools, thick canvas, t...",[],"[Petards, Petards, petard, petard, petard, pet...",[],[],[],[],"[metal, metal, ch, e, s, p, q, l, pulverin, s,...","[crucible, gimlet, gimlet, gimlets, gimlets]","[cake, presses, strap hinge]"
158,169v2,Reducing a round figure to a hollow form,fig_p169v_3,https://drive.google.com/open?id=0B9-oNrvWdlO5...,small,left-middle,[casting],[],[warm urine],[],...,"[clay slab, mold, mold, clay slab]",[],[],[],[],[],[],"[b, noyau, en noyau, en noyau, m, en, du]",[],[]
159,170r6,Cleaning closed molds,fig_p170r_1,https://drive.google.com/open?id=0B9-oNrvWdlO5...,x-small,left-bottom,[casting],[],[],[],...,"[chaple, molds, thin wire of latten, delicate ...",[],[],[],[],[],[],[chaple],[],[]
160,170r6,Cleaning closed molds,fig_p170r_2,https://drive.google.com/open?id=0B9-oNrvWdlO5...,x-small,,[casting],[],[],[],...,"[chaple, molds, thin wire of latten, delicate ...",[],[],[],[],[],[],[chaple],[],[]


Finally, this table can easily be written to a CSV.

In [19]:
df.to_csv('figures.csv', index=False)
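
One thing to keep in mind when reading the file back: assuming the list-valued cells are ordinary Python lists, as the displayed table suggests, `to_csv` writes each one as its string representation. The sketch below, using made-up rows and an in-memory CSV string rather than `figures.csv`, shows one possible way to recover the lists with `ast.literal_eval`.

```python
import ast
import csv
import io

from pandas import DataFrame

# Made-up rows with a list-valued column, mirroring the shape of the real data.
demo_rows = [
    {'fig_id': 'fig_a', 'categories': ['varnish', 'casting']},
    {'fig_id': 'fig_b', 'categories': []},
]

# With no path argument, to_csv returns the CSV content as a string.
csv_text = DataFrame.from_records(demo_rows, columns=['fig_id', 'categories']).to_csv(index=False)

# Each list was written as text such as "['varnish', 'casting']"; parse it back.
recovered = [ast.literal_eval(r['categories'])
             for r in csv.DictReader(io.StringIO(csv_text))]
print(recovered)
```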

The following query proves that no `ab` tag in the manuscript has an intermediate tag between itself and a descendant `figure` tag.

In [20]:
%%capture
a = m.generate_all_folios(method='xml', version='tl') # Generate one giant XML document to search representing the entire manuscript in translation.

In [21]:
soup = BeautifulSoup(a, 'lxml') # Soupify the document.

In [22]:
all_ab = soup.find_all('ab') # Get a list of all the `ab` tags.

Technically, you should never write a list comprehension as horrible as the one you see below.
However, this is merely for querying purposes.
Essentially, it asks the following:

**"Give me every `ab` tag in the manuscript for which any of its descendants themselves have child tags which are `figure` tags."**

Or, put another way:

**"List those `ab` tags which have grandchildren (or great-grandchildren, or great-great-grandchildren...) that are `figure` tags."**

See? Not that bad at all.
It returns an empty list, meaning there are no such tags. Every `figure` tag inside an `ab` tag is a direct child, with no intermediate parents.

In [23]:
[ab for ab in all_ab if any([len(list(c.children))>0 and any([d.name == 'figure' for d in c.children]) for c in ab.descendants if isinstance(c, Tag)])]

[]
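
For readers who want a more legible form of the same check, here is one possible rewrite as a sketch. It asks, for each `ab` tag, whether any `figure` found anywhere beneath it has a parent other than that `ab`. The function name and both sample documents are invented for illustration, and `html.parser` is used only to keep the sketch free of the `lxml` dependency.

```python
from bs4 import BeautifulSoup

def abs_with_nested_figures(soup):
    """Return the `ab` tags having a `figure` descendant that is not a direct child."""
    return [ab for ab in soup.find_all('ab')
            if any(fig.parent is not ab for fig in ab.find_all('figure'))]

# Invented samples: one well-formed, one with an intermediate wrapper tag.
good = BeautifulSoup('<div><ab><figure></figure></ab></div>', 'html.parser')
bad = BeautifulSoup('<div><ab><wrapper><figure></figure></wrapper></ab></div>',
                    'html.parser')

print(abs_with_nested_figures(good))       # no offending tags
print(len(abs_with_nested_figures(bad)))   # one offending tag
```

Including a deliberately malformed sample confirms that the query can detect a violation, so an empty result on the real manuscript is meaningful.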