Talk is to be timed for 22 minutes + 3 minutes for questions.

We should set up an env to run this all in Python 3.

In [8]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
              'start_slideshow_at': 'selected',
              'width': 1024,
              'height': 768,
})

{u'height': 768, u'start_slideshow_at': 'selected', u'width': 1024}

# datreant

### persistent, Pythonic trees for heterogeneous data

**David L. Dotson**, Sean L. Seyler, Max Linke,  
Richard Gowers, Oliver Beckstein

## the problem

Scientific research often proceeds organically.

[Need an image of a directory tree, perhaps randomly generated]

Though portions are planned, the process is largely messy; this is especially true for simulation work.

## possible solutions?

* RDBMS?
* document databases?
* HDFS?

These are rarely a good fit for the data one needs to store, such as simulation parameters and system descriptions. The tools we already use often expect their own file formats.

## why not use the filesystem itself?

Cons:
* littered with irrelevant files
* hierarchical, but perhaps inconsistently structured

Pros:
* already stores anything we need (by definition)
* existing tools work with existing formats

**`datreant`** is an attempt to take advantage of the universality of the filesystem while minimizing its inconveniences.

## Treants: discoverable directories with metadata

A ``Treant`` is a directory with a special **state file**:

In [12]:
import datreant.core as dtr

t = dtr.Treant('maple')
t.draw()

maple/
 +-- Treant.7da6e0a9-64db-431d-9141-e997945e05a6.json


The state file:
1. serves as a bookmark marking the directory as a ``Treant``.
2. stores metadata elements, such as *tags* and *categories*.
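The two roles of the state file can be sketched with the standard library alone. This is only an illustration of the idea, not how `datreant` does it: the helper `mark_directory` and the JSON layout (`tags`, `categories` keys) are assumptions made for this example, and only the `Treant.<uuid>.json` naming follows the listing above.

```python
import json
import tempfile
import uuid
from pathlib import Path


def mark_directory(path, tags=(), categories=None):
    """Create `path` and write a uniquely named JSON state file into it.

    Hypothetical helper: the schema below is an assumption for illustration.
    """
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    # The file's presence bookmarks the directory; its body holds the metadata.
    state_file = directory / f"Treant.{uuid.uuid4()}.json"
    state = {"tags": sorted(tags), "categories": dict(categories or {})}
    state_file.write_text(json.dumps(state))
    return state_file


with tempfile.TemporaryDirectory() as tmp:
    state_file = mark_directory(Path(tmp) / "maple", tags=["plant"])
    state_name = state_file.name
    state = json.loads(state_file.read_text())
```

Because the metadata lives in a plain file next to the data, it travels with the directory when that directory is moved or copied.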

## introspecting and manipulating a Treant's tree

We can use a `Treant` to create directory structures:

In [15]:
t['a/place/for/data/'].makedirs()
t['a/place/for/text/'].makedirs()

t.draw()

maple/
 +-- Treant.7da6e0a9-64db-431d-9141-e997945e05a6.json
 +-- a/
     +-- place/
         +-- for/
             +-- data/
             +-- text/


And we can manipulate directories and files with `Tree` and `Leaf` objects, respectively.

For example, we could store a `pandas` DataFrame somewhere in the tree for reference later:

In [22]:
import numpy as np
import pandas as pd

# pd.np is no longer available in recent pandas; use numpy directly
df = pd.DataFrame(np.random.randn(3, 2),
                  columns=['A', 'B'])

In [23]:
data = t['a/place/for/data/']
data

<Tree: 'maple/a/place/for/data/'>

In [24]:
df.to_csv(data['random_dataframe.csv'].abspath)
data.draw()

data/
 +-- random_dataframe.csv


And we can introspect the file directly:

In [25]:
csv = data['random_dataframe.csv']
csv

<Leaf: 'maple/a/place/for/data/random_dataframe.csv'>

In [26]:
print(csv.read())

,A,B
0,-0.574553718574,-0.516982117727
1,-2.26093891758,0.58054828901
2,-0.0669276516294,-0.956296412749



## Aggregating and splitting on Treant metadata

What makes a `Treant` distinct from a `Tree` is its **state file**. This file stores metadata that can be used to filter and split `Treant` objects when treated in aggregate.

If we have many more Treants, perhaps scattered about the filesystem:

In [27]:
for path in ('an/elm/', 'the/oldest/oak', 
             'the/oldest/tallest/sequoia'):
    dtr.Treant(path)

We can gather them up with ``datreant.core.discover``:

In [29]:
b = dtr.discover('.')
b

<Bundle([<Treant: 'oak'>, <Treant: 'sequoia'>, <Treant: 'maple'>, <Treant: 'elm'>])>
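Conceptually, discovery amounts to walking a directory tree and collecting every directory that holds a state file. The sketch below shows that idea with `os.walk`; the function `find_marked_dirs` and the placeholder file name `Treant.0000.json` are hypothetical, and `datreant`'s own implementation may differ.

```python
import os
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def find_marked_dirs(root, pattern="Treant.*.json"):
    """Return sorted relative paths of directories containing a state file."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if any(fnmatch(name, pattern) for name in filenames):
            found.append(Path(dirpath).relative_to(root).as_posix())
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    # Three marked directories and one unmarked one.
    for rel in ("maple", "an/elm", "the/oldest/oak"):
        directory = Path(tmp, rel)
        directory.mkdir(parents=True)
        (directory / "Treant.0000.json").write_text("{}")
    Path(tmp, "the/unmarked").mkdir(parents=True)

    found = find_marked_dirs(tmp)
```

Only directories that contain a matching file are returned, so unmarked directories are ignored.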

A `Bundle` is an ordered set of ``Treant`` objects. It gives convenient mechanisms for working with Treants as a single logical unit.

In [30]:
b.relpaths

['the/oldest/oak/', 'the/oldest/tallest/sequoia/', 'maple/', 'an/elm/']

In [31]:
b.names

['oak', 'sequoia', 'maple', 'elm']

A ``Bundle`` can subselect Treants in typical ways:

In [32]:
# integer indexing
b[1]

<Treant: 'sequoia'>

In [33]:
# slicing
b[1::2]

<Bundle([<Treant: 'sequoia'>, <Treant: 'elm'>])>

In [34]:
# fancy indexing
b[[3, 0, 1]]

<Bundle([<Treant: 'elm'>, <Treant: 'oak'>, <Treant: 'sequoia'>])>

In [35]:
# boolean indexing
b[[True, False, False, True]]

<Bundle([<Treant: 'oak'>, <Treant: 'elm'>])>

In [36]:
# indexing by name
b['oak']

<Bundle([<Treant: 'oak'>])>

### Treants can be filtered on their tags

Tags are individual strings that describe a `Treant`. Setting some tags for each of our Treants:

In [39]:
b['maple'].tags = ['syrup', 'furniture', 'plant']
b['sequoia'].tags = ['huge', 'plant']
b['oak'].tags = ['for building', 'plant', 'building']
b['elm'].tags = ['firewood', 'shady', 'paper', 'plant', 'building']

We can work with these tags in aggregate:

In [40]:
# will only show tags present in *all* members
b.tags

<AggTags([u'plant'])>

In [41]:
# will show tags present among *any* member
b.tags.any

{u'building',
 u'firewood',
 u'for building',
 u'furniture',
 u'huge',
 u'paper',
 u'plant',
 u'shady',
 u'syrup'}
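These two aggregate views correspond to a set intersection and a set union over the members' tags. A minimal sketch with plain Python sets, using the tags assigned above (the dictionary `tags_by_member` is just a stand-in for the `Bundle`):

```python
tags_by_member = {
    "oak": {"for building", "plant", "building"},
    "sequoia": {"huge", "plant"},
    "maple": {"syrup", "furniture", "plant"},
    "elm": {"firewood", "shady", "paper", "plant", "building"},
}

# Tags present in *all* members: intersection of every member's tag set.
common_tags = set.intersection(*tags_by_member.values())

# Tags present in *any* member: union of every member's tag set.
any_tags = set.union(*tags_by_member.values())
```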

And we can filter on them. For example, getting all Treants that are good for construction work:

In [42]:
# gives a boolean index for members with this tag
b.tags['building']

[True, False, False, True]

In [44]:
# we can use this to index the Bundle itself
b[b.tags['building']]

<Bundle([<Treant: 'oak'>, <Treant: 'elm'>])>

or getting back Treants that are both good for construction *and* used for making furniture by giving tags as a list:

In [45]:
# a list of tags serves as an *intersection* query
b[b.tags[['building', 'furniture']]]

<Bundle([])>

which in this case none of them are.

Other tag expressions can be constructed using tuples (for *union* operations) and sets (for *negated intersections*), and nesting of any of these works as expected:

In [46]:
# we can get a *union* by using a tuple instead of a list
b[b.tags['building', 'furniture']]

<Bundle([<Treant: 'oak'>, <Treant: 'maple'>, <Treant: 'elm'>])>

In [47]:
# and we can get a *negated intersection* by using a set
b[b.tags[{'building', 'furniture'}]]

<Bundle([<Treant: 'oak'>, <Treant: 'sequoia'>, <Treant: 'maple'>, <Treant: 'elm'>])>

Using tag expressions, we can filter down to the Treants of interest in a ``Bundle`` with many, perhaps hundreds, of members.
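The semantics of these expressions can be sketched as a small recursive evaluator: a string tests membership, a list requires all of its elements, a tuple requires any, and a set negates the intersection. The functions `matches` and `select` below are hypothetical names for illustration, not part of `datreant`.

```python
def matches(tags, expr):
    """Evaluate a nested tag expression against one member's tag set."""
    if isinstance(expr, str):
        return expr in tags
    if isinstance(expr, list):  # intersection: every element must match
        return all(matches(tags, item) for item in expr)
    if isinstance(expr, tuple):  # union: at least one element must match
        return any(matches(tags, item) for item in expr)
    if isinstance(expr, (set, frozenset)):  # negated intersection
        return not all(matches(tags, item) for item in expr)
    raise TypeError(f"unsupported tag expression: {expr!r}")


def select(tags_by_member, expr):
    """Return the names of members whose tags satisfy `expr`, in order."""
    return [name for name, tags in tags_by_member.items()
            if matches(tags, expr)]


tags_by_member = {
    "oak": {"for building", "plant", "building"},
    "sequoia": {"huge", "plant"},
    "maple": {"syrup", "furniture", "plant"},
    "elm": {"firewood", "shady", "paper", "plant", "building"},
}

both = select(tags_by_member, ["building", "furniture"])
either = select(tags_by_member, ("building", "furniture"))
not_both = select(tags_by_member, {"building", "furniture"})
nested = select(tags_by_member, ["plant", ("huge", "syrup")])
```

Nesting falls out of the recursion: the last query asks for plants that are also either huge or a source of syrup.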

### Splitting Treants on categories

Categories are key-value pairs that provide another mechanism for distinguishing Treants. We can add categories to each Treant:

In [49]:
# add categories to individual members
b['oak'].categories = {'age': 'adult',
                       'type': 'deciduous',
                       'bark': 'mossy'}
b['elm'].categories = {'age': 'young',
                       'type': 'deciduous',
                       'bark': 'smooth'}
b['maple'].categories = {'age': 'young',
                         'type': 'deciduous',
                         'bark': 'mossy'}
b['sequoia'].categories = {'age': 'old',
                           'type': 'evergreen',
                           'bark': 'fibrous',
                           'home': 'california'}

# add value 'tree' to category 'plant' for all members
b.categories.add({'plant': 'tree'})

And we can access categories for individual Treants:

In [52]:
seq = b['sequoia'][0]
seq.categories

<Categories({u'home': u'california', u'plant': u'tree', u'type': u'evergreen', u'age': u'old', u'bark': u'fibrous'})>

or the aggregated categories for all members in the `Bundle`:

In [53]:
b.categories

<AggCategories({u'plant': [u'tree', u'tree', u'tree', u'tree'], u'type': [u'deciduous', u'evergreen', u'deciduous', u'deciduous'], u'age': [u'adult', u'old', u'young', u'young'], u'bark': [u'mossy', u'fibrous', u'mossy', u'smooth']})>
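The aggregated view appears to keep only the keys shared by every member, which is consistent with `home` being absent from the output above. The sketch below reproduces that behavior with plain dictionaries and adds a simple split on one key. Both `aggregate` and `group_by` are hypothetical helpers written for this example, not `datreant` functions.

```python
categories_by_member = {
    "oak": {"age": "adult", "type": "deciduous", "bark": "mossy",
            "plant": "tree"},
    "sequoia": {"age": "old", "type": "evergreen", "bark": "fibrous",
                "home": "california", "plant": "tree"},
    "maple": {"age": "young", "type": "deciduous", "bark": "mossy",
              "plant": "tree"},
    "elm": {"age": "young", "type": "deciduous", "bark": "smooth",
            "plant": "tree"},
}


def aggregate(categories_by_member):
    """Map each key shared by all members to its values in member order."""
    members = list(categories_by_member.values())
    shared_keys = set(members[0]).intersection(*members[1:])
    return {key: [cats[key] for cats in members]
            for key in sorted(shared_keys)}


def group_by(categories_by_member, key):
    """Split member names into groups by their value for `key`."""
    groups = {}
    for name, cats in categories_by_member.items():
        if key in cats:  # members lacking the key are left out
            groups.setdefault(cats[key], []).append(name)
    return groups


aggregated = aggregate(categories_by_member)
by_type = group_by(categories_by_member, "type")
```

Splitting on `type` separates the deciduous trees from the lone evergreen, which is the kind of partition the section title refers to.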

## A growing development community