Talk is to be timed for 22 minutes + 3 minutes for questions.

Include a pinch of mania. ;)

In [3]:
# useful for when kernel dies
import datreant.core as dtr
import seaborn as sns
import mdsynthesis as mds
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

In [4]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
    'start_slideshow_at': 'selected',
    'width': 1024,
    'height': 768,
})

{'height': 768, 'start_slideshow_at': 'selected', 'width': 1024}

# datreant

### persistent, Pythonic trees for heterogeneous data

**David L. Dotson**, Arizona State University  
Sean L. Seyler, Arizona State University  
Max Linke, Max Planck Institute for Biophysics  
Richard Gowers, University of Manchester  
Oliver Beckstein, Arizona State University
<br><br>
#### SciPy 2016, Austin

### the problem

Scientific research often proceeds organically.

[Need an image of a directory tree, perhaps randomly generated]

Though portions are planned, the process is largely messy; this is especially true for simulation work.

### possible solutions?

* relational databases?
* document databases?
* HDFS?

Rarely are these a good fit for the data one needs to store:

- simulation parameters
- system description

- raw trajectory
- intermediate data

Also, existing tools often expect their own established file formats.

### Example: molecular dynamics (MD) simulations give positions of atoms with time

In [33]:
import nglview as nv

s = mds.Sim('dims/r_01')
prot = s.universe.select_atoms('protein')
wg = nv.show_mdanalysis(prot)
wg.camera = 'orthographic'
wg

### but an MD simulation isn't just the trajectory

<center><img src="figs/tree-off.png" alt="tree" height="754" width="600"></center>

It's also all files that encode *how* it was made, parameter choices, etc.

We can't easily cram all of this information into a specialized database, which would likely be less flexible than the filesystem anyway.

### why not use the filesystem itself?

Cons:
* littered with irrelevant files
* hierarchical, but perhaps inconsistently structured

Pros:
* already stores anything we need (by definition)
* existing tools work with existing formats

**`datreant`** is an attempt to take advantage of the universality of the filesystem while minimizing its inconveniences

## Treants: discoverable directories with metadata

A ``Treant`` is a directory with a special **state file**:

In [6]:
import datreant.core as dtr

t = dtr.Treant('maple')
t.draw()

maple/
 +-- Treant.86faba27-9eba-4e86-af0e-0cb9f3378cf1.json


The state file:
1. serves to mark the directory as a ``Treant``.
2. includes a unique identifier (uuid) for that ``Treant``.
3. stores metadata elements to distinguish the ``Treant``, such as *tags* and *categories*.
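To make the three roles of the state file concrete, here is a minimal sketch using only the standard library. It is not `datreant`'s implementation: the helper names `make_state_file` and `read_state` are hypothetical, and the JSON layout (a `tags` list and a `categories` mapping) is an assumption chosen for illustration.

```python
import json
import tempfile
import uuid
from pathlib import Path


def make_state_file(directory, tags=(), categories=None):
    """Mark `directory` with a hypothetical 'Treant.<uuid>.json' state file."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    identifier = str(uuid.uuid4())
    # assumed layout: metadata stored as plain JSON
    state = {"tags": sorted(tags), "categories": dict(categories or {})}
    statefile = directory / "Treant.{}.json".format(identifier)
    statefile.write_text(json.dumps(state))
    return statefile


def read_state(directory):
    """Return (uuid, state) for a marked directory, or None if unmarked."""
    matches = sorted(Path(directory).glob("Treant.*.json"))
    if not matches:
        return None
    statefile = matches[0]
    # the filename is 'Treant.<uuid>.json'; the uuid is the middle part
    identifier = statefile.name[len("Treant."):-len(".json")]
    return identifier, json.loads(statefile.read_text())


with tempfile.TemporaryDirectory() as tmp:
    statefile = make_state_file(Path(tmp) / "maple", tags=["syrup"],
                                categories={"type": "deciduous"})
    identifier, state = read_state(Path(tmp) / "maple")
    unmarked = read_state(tmp)  # the parent directory has no state file
```

The presence of the file marks the directory, its name carries the uuid, and its contents carry the metadata.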

### so, Treants are walking, talking trees

<center><figure>
  <img src="figs/Treant.jpg" alt="A Treant" width="304" height="228" align="middle">
  <figcaption><center>A Treant with some data. <br><br><i>Dungeons and Dragons Monster Manual, Gary Gygax, 1977</i>.</center>
  </figcaption>
</figure></center>

**They can be uniquely identified even when moved, and they can speak through their metadata.**

Unlike regular directories, Treants are uniquely identified by uuid, not by path. They can be moved, and they can tell us more about themselves with descriptive metadata.
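The idea of identity by uuid rather than by path can be sketched with the standard library alone. This is an illustration under assumptions, not how `datreant` tracks Treants: `find_by_uuid` is a hypothetical helper, and the state file here is an empty JSON placeholder.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def find_by_uuid(root, identifier):
    """Search below `root` for the directory holding this uuid's state file."""
    hits = list(Path(root).rglob("Treant.{}.json".format(identifier)))
    return hits[0].parent if hits else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    identifier = str(uuid.uuid4())

    # mark a directory with an (empty) hypothetical state file
    original = root / "maple"
    original.mkdir()
    (original / "Treant.{}.json".format(identifier)).write_text("{}")

    before = find_by_uuid(root, identifier)

    # move and rename the directory; the state file travels with it
    (root / "archive").mkdir()
    shutil.move(str(original), str(root / "archive" / "old_maple"))

    after = find_by_uuid(root, identifier)
    before_rel = before.relative_to(root).as_posix()
    after_rel = after.relative_to(root).as_posix()
```

The path changes from `maple` to `archive/old_maple`, but a search on the uuid finds the same directory both times.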

### a Treant can introspect and manipulate its tree

We can use a ``Treant`` to examine its existing directory structure:

In [12]:
napa = dtr.Treant('NapA_0/')
napa.draw(depth=1)

NapA_0/
 +-- dists/
 +-- WORK/
 +-- Treant.f91a79df-c223-4789-b222-5744cf7gjfdk.json
 +-- setup/
 +-- angles/


In [18]:
napa['setup'].draw(depth=1)

setup/
 +-- build/
 +-- emin/
 +-- preproduction/


And we can manipulate directories and files with `Tree` and `Leaf` objects, respectively.

In [21]:
tree = napa['setup']
tree

<Tree: 'NapA_0/setup/'>

In [20]:
leaf = napa['setup/emin/em.gro']
leaf

<Leaf: 'NapA_0/setup/emin/em.gro'>

In [131]:
print(leaf.read(size=160))

NapA inward-facing
132506
    1GLY      N    1   5.627   5.740   3.317
    1GLY     H1    2   5.613   5.806   3.237
    1GLY     H2    3   5.567   5.769   3.398


``Tree`` and ``Leaf`` objects are lightly-wrapped ``pathlib`` paths; they need not point to existing directories or files:

In [22]:
leaf.exists

True

They always reflect the filesystem **as it currently is**.

In [136]:
tree.glob('*/*.itp').relpaths

['NapA_0/setup/build/topol_Protein_chain_A_ion.itp',
 'NapA_0/setup/build/topol_Protein_chain_B.itp']

In [147]:
gros = tree.trees.loc['md.gro']
gros.relpaths

['NapA_0/setup/build/md.gro',
 'NapA_0/setup/emin/md.gro',
 'NapA_0/setup/preproduction/md.gro']

In [148]:
gros.exists

[False, False, True]
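The same two properties, paths that need not exist and existence checked against the live filesystem, can be seen with plain `pathlib`. This sketch uses a temporary directory and placeholder files rather than real simulation data.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    setup = Path(tmp) / "setup"
    steps = ("build", "emin", "preproduction")
    for step in steps:
        (setup / step).mkdir(parents=True)

    # a path object can be built before the file it names exists
    candidates = [setup / step / "md.gro" for step in steps]
    before = [p.exists() for p in candidates]

    # existence is checked against the filesystem at call time
    candidates[2].write_text("placeholder\n")
    after = [p.exists() for p in candidates]
```

Here `before` is all `False`, and only the last entry of `after` flips to `True` once the file is written.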

### Treants are sanity-preserving

Using ``Treant``, ``Tree``, and ``Leaf`` objects, we can work with the filesystem Pythonically without giving much attention to *where* these objects live within that filesystem.

This is especially powerful when we have many directories/files we want to work with, possibly in many different places.

## Aggregating and splitting on Treant metadata

What makes a `Treant` distinct from a `Tree` is its **state file**. It stores metadata that can be used to filter and split `Treant` objects when treated in aggregate.

If we have many Treants, perhaps scattered around the filesystem:

In [154]:
arbor = dtr.Tree('arboretum')

for path in (arbor['maple/'], arbor['an/elm/'],
             arbor['the/oldest/oak/'], 
             arbor['the/oldest/tallest/sequoia/']):
    dtr.Treant(path)

We can gather Treants up from the filesystem with ``datreant.core.discover``:

In [155]:
b = dtr.discover('arboretum/')
b

<Bundle([<Treant: 'oak'>, <Treant: 'sequoia'>, <Treant: 'maple'>, <Treant: 'elm'>])>

A `Bundle` is an ordered set of ``Treant`` objects. It gives convenient mechanisms for working with Treants as a single logical unit.

In [156]:
b.relpaths

['arboretum/the/oldest/oak/',
 'arboretum/the/oldest/tallest/sequoia/',
 'arboretum/maple/',
 'arboretum/an/elm/']

In [157]:
b.names

['oak', 'sequoia', 'maple', 'elm']
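Discovery of this kind can be pictured as a recursive search for state files. The sketch below is an assumption about the general approach, not `datreant.core.discover` itself; `mark` and `discover_dirs` are hypothetical helpers, and the state files are empty placeholders.

```python
import tempfile
import uuid
from pathlib import Path


def mark(directory):
    """Create `directory` and drop a placeholder state file into it."""
    directory.mkdir(parents=True)
    (directory / "Treant.{}.json".format(uuid.uuid4())).write_text("{}")


def discover_dirs(root):
    """Return every directory at or below `root` that contains a state file."""
    found = {p.parent for p in Path(root).rglob("Treant.*.json")}
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    arbor = Path(tmp) / "arboretum"
    for rel in ("maple", "an/elm", "the/oldest/oak",
                "the/oldest/tallest/sequoia"):
        mark(arbor / rel)
    # a plain directory without a state file is ignored
    (arbor / "not/a/treant").mkdir(parents=True)

    names = [d.name for d in discover_dirs(arbor)]
```

Only the four marked directories come back, however deeply they are nested.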

A ``Bundle`` can subselect Treants in typical ways:

In [158]:
# integer indexing
b[1]

<Treant: 'sequoia'>

In [159]:
# slicing
b[1::2]

<Bundle([<Treant: 'sequoia'>, <Treant: 'elm'>])>

In [160]:
# fancy indexing
b[[3, 0, 1]]

<Bundle([<Treant: 'elm'>, <Treant: 'oak'>, <Treant: 'sequoia'>])>

In [161]:
# boolean indexing
b[[True, False, False, True]]

<Bundle([<Treant: 'oak'>, <Treant: 'elm'>])>

In [162]:
# indexing by name
b['oak']

<Bundle([<Treant: 'oak'>])>
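For readers curious how one container can accept all of these index types, here is a toy sketch. `MiniBundle` is hypothetical and holds plain name strings; it is not `datreant`'s `Bundle`, only an illustration of dispatching on the type of the index.

```python
class MiniBundle:
    """Toy ordered collection of names; not datreant's Bundle."""

    def __init__(self, names):
        self.names = list(names)

    def __getitem__(self, index):
        if isinstance(index, int):
            return self.names[index]              # single member
        if isinstance(index, slice):
            return MiniBundle(self.names[index])  # slicing
        if isinstance(index, str):
            # indexing by name: every member with that name
            return MiniBundle(n for n in self.names if n == index)
        index = list(index)
        if index and all(isinstance(i, bool) for i in index):
            # boolean mask: keep members where the mask is True
            return MiniBundle(n for n, keep in zip(self.names, index) if keep)
        # otherwise treat the index as a list of integer positions
        return MiniBundle(self.names[i] for i in index)

    def __repr__(self):
        return "<MiniBundle({})>".format(self.names)


b = MiniBundle(["oak", "sequoia", "maple", "elm"])
```

Note that only integer indexing returns a single member; every other form returns a new collection, which is why indexing by name above yields a one-member `Bundle`.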

### Treants can be filtered on their tags

Tags are individual strings that describe a `Treant`. Setting some tags for each of our Treants:

In [187]:
b['maple'].tags = ['syrup', 'furniture', 'plant']
b['sequoia'].tags = ['huge', 'plant']
b['oak'].tags = ['for building', 'plant', 'building']
b['elm'].tags = ['firewood', 'shady', 'paper', 
                 'plant', 'for building']

We can work with these tags in aggregate:

In [164]:
# will only show tags present in *all* members
b.tags

<AggTags(['plant'])>

In [177]:
# will show tags present among *any* member
b.tags.any

{'building',
 'firewood',
 'for building',
 'furniture',
 'huge',
 'paper',
 'plant',
 'shady',
 'syrup'}
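Conceptually, these two aggregate views are a set intersection and a set union over the members' tags. A plain-Python sketch, using a hypothetical `tags` dictionary that mirrors the tags set above:

```python
tags = {
    "maple": {"syrup", "furniture", "plant"},
    "sequoia": {"huge", "plant"},
    "oak": {"for building", "plant", "building"},
    "elm": {"firewood", "shady", "paper", "plant", "for building"},
}

# tags carried by *all* members: set intersection
common = set.intersection(*tags.values())

# tags carried by *any* member: set union
anywhere = set.union(*tags.values())
```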

And we can filter on them. For example, getting all Treants that are good for construction work:

In [178]:
# gives a boolean index for members with this tag
b.tags['building']

[True, False, False, False]

In [179]:
# we can use this to index the Bundle itself
b[b.tags['building']]

<Bundle([<Treant: 'oak'>])>

And since we tagged at least one ``Treant`` with "for building", we can do some fuzzy matching, too:

In [185]:
# index the Bundle with the tags that fuzzily match 'building'
b[b.tags[b.tags.fuzzy('building', scope='any')]]

We can also get back Treants that are both good for construction *and* used for making furniture by giving tags as a list:

In [168]:
# a list of tags serves as an *intersection* query
b[b.tags[['building', 'furniture']]]

<Bundle([])>

which in this case none of them are.

Other tag expressions can be constructed using tuples (for *union* operations) and sets (for *negated intersections*), and nesting of any of these works as expected:

In [169]:
# we can get a *union* by using a tuple instead of a list
b[b.tags['building', 'furniture']]

<Bundle([<Treant: 'oak'>, <Treant: 'maple'>])>

In [170]:
# and we can get a *negated intersection* by using a set
b[b.tags[{'building', 'furniture'}]]

<Bundle([<Treant: 'oak'>, <Treant: 'sequoia'>, <Treant: 'maple'>, <Treant: 'elm'>])>

Using tag expressions, we can filter to Treants of interest from a ``Bundle`` counting many, perhaps hundreds, of Treants as members.
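The semantics of nested tag expressions can be captured in a few lines of plain Python. This is a toy evaluator written for illustration, not `datreant`'s code; `matches`, `select`, and `members` are hypothetical names, with tags mirroring those set earlier.

```python
def matches(expr, tags):
    """Evaluate a nested tag expression against one member's set of tags."""
    if isinstance(expr, str):
        return expr in tags
    if isinstance(expr, list):               # intersection: all must match
        return all(matches(e, tags) for e in expr)
    if isinstance(expr, tuple):              # union: any may match
        return any(matches(e, tags) for e in expr)
    if isinstance(expr, (set, frozenset)):   # negated intersection
        return not all(matches(e, tags) for e in expr)
    raise TypeError("unsupported tag expression: {!r}".format(expr))


members = [
    ("oak", {"for building", "plant", "building"}),
    ("sequoia", {"huge", "plant"}),
    ("maple", {"syrup", "furniture", "plant"}),
    ("elm", {"firewood", "shady", "paper", "plant", "for building"}),
]


def select(expr):
    """Names of the members whose tags satisfy the expression."""
    return [name for name, tags in members if matches(expr, tags)]
```

Because each container type recurses into its elements, an expression such as `[("building", "for building"), "plant"]` reads as "(building or for building) and plant".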

### Splitting Treants on categories

Categories are key-value pairs that provide another mechanism for distinguishing Treants. We can add categories to each Treant:

In [171]:
# add categories to individual members
b['oak'].categories = {'age': 'adult',
                       'type': 'deciduous',
                       'bark': 'mossy'}
b['elm'].categories = {'age': 'young',
                       'type': 'deciduous',
                       'bark': 'smooth'}
b['maple'].categories = {'age': 'young',
                         'type': 'deciduous',
                         'bark': 'mossy'}
b['sequoia'].categories = {'age': 'old',
                           'type': 'evergreen',
                           'bark': 'fibrous',
                           'home': 'california'}

# add value 'tree' to category 'plant' for all members
b.categories.add({'plant': 'tree'})

And we can access categories for individual Treants:

In [172]:
seq = b['sequoia'][0]
seq.categories

<Categories({'plant': 'tree', 'bark': 'fibrous', 'type': 'evergreen', 'age': 'old', 'home': 'california'})>

or the aggregated categories of the `Bundle`, which include only the keys present in *all* members:

In [173]:
b.categories

<AggCategories({'plant': ['tree', 'tree', 'tree', 'tree'], 'bark': ['mossy', 'fibrous', 'mossy', 'smooth'], 'type': ['deciduous', 'evergreen', 'deciduous', 'deciduous'], 'age': ['adult', 'old', 'young', 'young']})>

When many Treants possess the same category keys, we can take a "split-apply-combine" approach to working with them using `groupby`:

In [188]:
b.categories.groupby('bark')

{'fibrous': <Bundle([<Treant: 'sequoia'>])>,
 'mossy': <Bundle([<Treant: 'oak'>, <Treant: 'maple'>])>,
 'smooth': <Bundle([<Treant: 'elm'>])>}

We can also group on more than one key:

In [175]:
b.categories.groupby(['bark', 'home'])

{('fibrous', 'california'): <Bundle([<Treant: 'sequoia'>])>}

By leveraging the `groupby` method, we can extract Treants by category without having to explicitly access each member.
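The grouping behavior shown above, including the way members lacking a requested key drop out, can be sketched in plain Python. This `groupby` function is a hypothetical stand-in written for illustration, operating on name and dictionary pairs rather than real Treants.

```python
from collections import defaultdict


def groupby(members, keys):
    """Group member names by the values of one or more category keys."""
    single = isinstance(keys, str)
    keys = [keys] if single else list(keys)
    groups = defaultdict(list)
    for name, categories in members:
        # members lacking any requested key are left out entirely
        if not all(k in categories for k in keys):
            continue
        values = tuple(categories[k] for k in keys)
        groups[values[0] if single else values].append(name)
    return dict(groups)


members = [
    ("oak", {"age": "adult", "type": "deciduous", "bark": "mossy"}),
    ("sequoia", {"age": "old", "type": "evergreen", "bark": "fibrous",
                 "home": "california"}),
    ("maple", {"age": "young", "type": "deciduous", "bark": "mossy"}),
    ("elm", {"age": "young", "type": "deciduous", "bark": "smooth"}),
]

by_bark = groupby(members, "bark")
by_bark_home = groupby(members, ["bark", "home"])
```

Grouping on several keys uses tuples of values as group labels, so only members that define every key can appear in a group.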

## Treants + the PyData stack

``datreant`` fundamentally serves as a Pythonic interface to the filesystem, bringing value to datasets and analysis results by making them easily accessible now and later.

As data structures and file formats change, ``datreant`` objects can always be used in the same way, complementing whichever PyData tools are used to work with the data itself.

### using distributed to get it done

In [1]:
from distributed import Executor, progress
ex = Executor()

In [7]:
b = mds.discover('sims/')
len(b)

49

In [None]:
def 

[subslide on using a workflow automation system like Fireworks,
showing how `datreant` makes this easier]

## Building domain-specific applications on datreant

[summary detail, probably a diagram, showing how Treants and Bundles
are able to work just fine with `Treant` subclasses]

## Leveraging molecular dynamics data with MDSynthesis

[steal example from paper, or perhaps make a simpler one? existing example isn't fast enough to do live (if that's desired)]

## A growing development community

## Where we are going

Things I didn't talk about much:

``datreant`` is a namespace package.

## Acknowledgments