Split off MDSynthesis core into another package? #38

dotsdl · 2015-07-16T05:13:46Z

Since the basic structure of MDSynthesis isn't entirely specific to molecular dynamics simulation, I think it might make sense to break the core out into a separate package. I'd been considering this for a while, but today I was contacted by someone with a use-case outside of MD, and I think it would be of benefit to MDSynthesis' core development to make it more general.

I've started a repository for this work here: https://github.com/dotsdl/datreant

Is anyone opposed to this? One thing to note is that datreant is BSD 3-clause licensed, making its use more permissive than MDSynthesis, which is GPLv2 (same as MDAnalysis). I technically need permission from everyone that has contributed code so far to make this change.

P.S. Opinions on the name are welcome. I wanted something that included 'dat' to indicate 'data', and some kind of word to indicate 'trees' (as in directory trees). An old D&D woodland creature ('treant') came to mind. :D


richardjgowers · 2015-07-16T06:30:09Z

Yeah seems sensible, permission given etc. The only potential annoyance is that MDS requires working on 2 repos, and you have to worry about multiple apps on datreant. But you've structured stuff quite strictly, and I think that will pay off now

Name is pretty cool though!

dotsdl · 2015-07-16T16:16:50Z

Yeah, I agree that development (including issues, wiki, docs, etc.) will now be split two ways, which is annoying. However, I hope this can be offset by the work of new contributors since it would make the core relevant to more people. Of course this also means more diverse pressures from other domain-specific packages that come out of it, but I think this should be a tide that lifts all ships (including MDS).

I think in terms of raw development work, it should be fairly easy (famous last words) to split out the core, since everything is pretty strictly partitioned. Any thoughts, @orbeckst and @cing?

cing · 2015-07-16T17:37:09Z

I'm into it. To be honest, I don't even use the MDAnalysis specific functionality (universes or selections). I have a big mish-mash of analysis scripts, some in MDTraj, VMD, and GMX tools, so universes and selections aren't required in those cases, and I just load the data into DataFrames and append new data with MDSynthesis. Then I group/tag/assemble the data from multiple Sims for plots, etc.

Like Selections and Universes, I wonder if you could store any type of domain-specific objects with the "Sims". I guess it would need to be pickleable/remakeable.

David, did you get any feedback on the project at SciPy when you presented this?

dotsdl · 2015-07-16T18:30:07Z

> Like Selections and Universes, I wonder if you could store any type of domain-specific objects with the "Sims". I guess it would need to be pickleable/remakeable.

The only real MD-specific object is the Sim, and it's really just an MD-specific subclass of Container that includes interfaces (Aggregators) for handling MDAnalysis elements. I think it represents a model of how one could create useful domain-specific Containers that store the information needed for complex data structures that might not be themselves pickleable (or reasonable to pickle).

The most useful functionality is already included in Container, that being the data storage and metadata handling. And Group + Bundle give aggregation tools, which are also MD-agnostic.
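A minimal sketch of the "store the information needed, not the object itself" idea described above. It is not MDSynthesis' actual API: the `Container` and `Sim` classes, their method names, and the JSON state file (standing in for the HDF5 state file) are all assumptions made for illustration.

```python
import json
import os
import tempfile


class Container:
    """Hypothetical directory-backed container with a JSON state file."""

    def __init__(self, path):
        self.path = path
        os.makedirs(path, exist_ok=True)
        self._statefile = os.path.join(path, "state.json")
        if not os.path.exists(self._statefile):
            self._write({})

    def _read(self):
        with open(self._statefile) as f:
            return json.load(f)

    def _write(self, state):
        with open(self._statefile, "w") as f:
            json.dump(state, f)

    def set(self, key, value):
        state = self._read()
        state[key] = value
        self._write(state)

    def get(self, key, default=None):
        return self._read().get(key, default)


class Sim(Container):
    """Hypothetical domain-specific subclass: stores a recipe, not the object."""

    def define_universe(self, topology, trajectory):
        # Only the file paths are persisted; the heavy object is never pickled.
        self.set("universe", {"topology": topology, "trajectory": trajectory})

    def universe(self, loader):
        # `loader` is whatever callable rebuilds the object from its recipe.
        recipe = self.get("universe")
        return loader(recipe["topology"], recipe["trajectory"])


root = tempfile.mkdtemp()
sim = Sim(os.path.join(root, "moe"))
sim.define_universe("moe.gro", "moe.xtc")

# A fresh instance recovers the recipe from disk; the lambda is a stand-in loader.
rebuilt = Sim(os.path.join(root, "moe")).universe(lambda top, trj: (top, trj))
```

In real use the loader would be something like a trajectory reader; the point is only that the container needs to remember how to rebuild the object, so the object itself never has to be pickleable.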

At SciPy I got many interested folks asking about the problem it addresses. I think for many they were confused why I wasn't using fewer HDF5 files (such as one file per Sim, or even one file for everything), but explaining more of the context of MD (existing ecosystem of tools for existing trajectory formats, heterogeneity of the raw data, storage and compute on remote machines, organic approach to simulating with different parameters, etc.) seemed to help.

I think MDS approaches the problem of data management in a pragmatic-rather-than-pure way, similar to tools like pandas and others in the wild (such as blaze and dask). The main idea of MDS is that the filesystem is quite good for a lot of things, and doing things at the filesystem level carries many advantages over trying to shove everything into a single database. I think this approach is reasonable for many other areas in science. The individual that contacted me, for example, applies machine learning to neuroimaging. He needs to apply many different estimators to different datasets, and because this is time consuming, he needs a reasonable approach to storing and recalling the results.

cing · 2015-07-16T21:26:51Z

Right, I can see how you structured the code with this type of thing in mind (subclasses of Container). So it sounds like a logical move to me.

While we're talking about general things, I'll be honest, I feel a little uninformed on the real benefits of the filesystem approach for data management that MDS embodies. I think the docs for treant should really hammer this point home. Despite the reasons you describe about the "context of MD", I can't really figure out a real reason why the community loves dumping files of numbers on disk so much, rather than into a database. The main reason I could imagine is that scientists are just lazy and used to working with flat files of numbers and HDF5 is a faster/smaller version of that. I imagine that most MD researchers don't work on a big enough scale to gain anything from the benefits that a database approach offers. It seems to me that you're adding a lot of functionality in MDS on top of HDF5s to make them database-like, like file locking, and isn't tagging/grouping just a simplified version of database querying? Don't get me wrong, I'm all about MDS as it just fits perfectly into my workflow, I just think we should carve out the niche a bit.

Hmm, could MDS work with a database "under-the-hood" rather than HDF5?

dotsdl · 2015-07-17T00:17:54Z

@cing

These are fair questions, and I think it's important to lay all this out to determine the best approach going forward. My thoughts on this are open for revision, but here's what they are at the moment. Sorry if it's a bit long.

I think the benefits of MDS' current design include:

- It's daemonless. There is no server process required on the machine to read/write persistent objects. This is valuable when working with data on remote resources, where it might not be possible to set up one's own daemon to talk to.
- Containers are portable. Because they store their state in HDF5 (which itself is portable across systems, handling endianness, etc.), and because containers store their data in the filesystem, they are easy to move around piecemeal. If you want to use a container on a remote system, but don't want to drag all its stored datasets with it, you can copy only what you need.

  Contrast this with many database solutions, in which you either copy the whole database somehow, or slurp the pieces of data out that you want. Most database solutions can be rather slow to do this, to my knowledge.
- Containers are independent. Although Groups are aware of their members, containers work independently from one another. If you want to use only basic Containers or Sims, that works just fine. If you want to use Groups, that works, too. If you want to use the Coordinator (not yet implemented; think a thin database that containers share their info with so they can be quickly queried from one place in the filesystem), then you can, but you don't have to. You don't have to buy the whole farm to ride the horse, in a sense.
- Containers have a structure in the filesystem. This means that all the shell tools we know and love are available to work with their contents, which might include plaintext files, figures, topology files, trajectories, random pickles, ipython notebooks, html files, etc. Basically, containers are as versatile as the filesystem is, at least when it comes to storage.

There are, of course, disadvantages:

- Containers could be anywhere in the filesystem. This is mostly a problem for Groups, which allow aggregation of other containers. If a member is moved, the Group has no way of knowing where it went; we've built machinery to help it find its members, but these will always be limited to filesystem search methods (some quite good, but still). If these objects lived in a single database, this wouldn't be an issue.
- Queries on object metadata will be slower than a central database. We want Groups and Bundles (in-memory Groups, basically) to be able to run queries against their members' characteristics, returning subsets matching the query. Since these queries have to be applied against these objects and not against a single table somewhere, it will be relatively slow.

The Coordinator is an answer to this problem, albeit an imperfect one. The idea is that you can make a Coordinator, which is a small daemonless database (perhaps SQLite, but could be HDF5), and you can add containers for it to be aware of with something like:
```
import mdsynthesis as mds

co = mds.Coordinator('camelot')
co.add('moe', 'larry', 'curly')

# could also let the coordinator do a downward search and add all
# containers it finds
co.discover()
```
This awareness is bi-directional: a Coordinator is aware of its members, and its members are aware of the Coordinator, and where it lives. The Coordinator will store tables of member attributes for fast queries, and these tables will be updated by members as they themselves are updated. So whenever we have:
```
c = mds.Container('moe')
c.categories['bowlcut'] = True
```
the container updates both its state file and the Coordinator(s) it is affiliated with. This is in contrast to Groups, of which members are unaware. This is by design: the idea is that containers are likely to be members of many different Groups all over the filesystem; there would be comparatively fewer Coordinators in use, which have a performance hit to a container for each affiliation.

Obvious problem: there are probably a lot of ways for Coordinators and their members to get out-of-sync. A single database with everything inside avoids this entirely.
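Since the Coordinator is not yet implemented, the following is only a rough sketch of the write-through behavior described above, assuming SQLite as the daemonless store. Every class, method, and table name here is hypothetical, and a JSON file stands in for the HDF5 state file.

```python
import json
import os
import sqlite3
import tempfile
from contextlib import closing


class Coordinator:
    """Hypothetical daemonless index: one SQLite file, no server process."""

    def __init__(self, dbpath):
        self.dbpath = dbpath
        with closing(sqlite3.connect(dbpath)) as con, con:
            con.execute(
                "CREATE TABLE IF NOT EXISTS categories ("
                "container TEXT, key TEXT, value TEXT, "
                "PRIMARY KEY (container, key))"
            )

    def update(self, container, key, value):
        # Called by member containers whenever their metadata changes.
        with closing(sqlite3.connect(self.dbpath)) as con, con:
            con.execute(
                "INSERT OR REPLACE INTO categories VALUES (?, ?, ?)",
                (container, key, json.dumps(value)),
            )

    def query(self, key, value):
        # One table lookup instead of opening every member's state file.
        with closing(sqlite3.connect(self.dbpath)) as con:
            rows = con.execute(
                "SELECT container FROM categories "
                "WHERE key = ? AND value = ? ORDER BY container",
                (key, json.dumps(value)),
            ).fetchall()
        return [row[0] for row in rows]


class Container:
    """Hypothetical container that knows which Coordinators it belongs to."""

    def __init__(self, path, coordinators=()):
        self.path = path
        self.coordinators = list(coordinators)
        os.makedirs(path, exist_ok=True)
        self._statefile = os.path.join(path, "state.json")

    def set_category(self, key, value):
        state = {}
        if os.path.exists(self._statefile):
            with open(self._statefile) as f:
                state = json.load(f)
        state[key] = value
        with open(self._statefile, "w") as f:
            json.dump(state, f)
        # Write-through: each affiliation costs one extra database write.
        for co in self.coordinators:
            co.update(self.path, key, value)


root = tempfile.mkdtemp()
co = Coordinator(os.path.join(root, "camelot.db"))
stooges = {
    name: Container(os.path.join(root, name), [co])
    for name in ("moe", "larry", "curly")
}
stooges["moe"].set_category("bowlcut", True)
stooges["larry"].set_category("bowlcut", False)

matches = [os.path.basename(path) for path in co.query("bowlcut", True)]
```

The sketch also makes the sync problem visible: if the state file is edited or the directory is moved without going through `set_category`, the index silently goes stale.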
- File locking is less efficient for multiple-read/write under load than a smart daemon process/scheduler. The assumption we make is that containers are primarily read, and only occasionally written to. This is assumed for their data and their metadata. They are not designed to scale well if the same parts are being written to and read at the same time by many processes.

  Having containers exist as separate files (state files and data all separate) does mitigate this potential for gridlock, which is one reason we favor many files over few. But it's still something to be aware of.
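For readers unfamiliar with the read-mostly locking pattern being described, here is a minimal sketch using POSIX advisory locks via the standard library's `fcntl.flock`. It is not MDSynthesis' locking code: the helper names, the side-car `.lock` file, and the JSON state file are assumptions, and it only works on POSIX systems.

```python
import fcntl
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def locked(path, exclusive):
    """Hold an advisory lock on a side-car lock file for the duration."""
    with open(path + ".lock", "a") as handle:
        # Many readers may hold LOCK_SH at once; LOCK_EX excludes everyone.
        fcntl.flock(handle, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def read_state(path):
    with locked(path, exclusive=False):
        with open(path) as f:
            return json.load(f)


def update_state(path, **changes):
    with locked(path, exclusive=True):
        state = {}
        if os.path.exists(path):
            with open(path) as f:
                state = json.load(f)
        state.update(changes)
        with open(path, "w") as f:
            json.dump(state, f)


statefile = os.path.join(tempfile.mkdtemp(), "state.json")
update_state(statefile, bowlcut=True)
update_state(statefile, tags=["stooge"])
state = read_state(statefile)
```

Because every writer takes the exclusive lock for the whole read-modify-write cycle, heavy concurrent writing to one file serializes completely, which is the gridlock that keeping many small files helps to avoid.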

As to your question about using a database "under-the-hood" rather than HDF5, I think that is possible. The package is written with a clear separation between the front-end containers and the backend objects that talk to the files, and in principle backend objects could be written to talk to a database instead. Backends would have to be written for storing datasets, too, which is a bit trickier. If there is interest in more backends, I'm happy to help flesh them out.
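The front-end/backend separation might look something like the sketch below. This is an illustration of the general pattern, not the package's actual class layout: `StateBackend`, `JSONFileBackend`, `SQLiteBackend`, and this `Container` are all hypothetical names, and JSON again stands in for HDF5.

```python
import json
import os
import sqlite3
import tempfile
from abc import ABC, abstractmethod
from contextlib import closing


class StateBackend(ABC):
    """Hypothetical interface the front end talks to."""

    @abstractmethod
    def load(self):
        """Return the stored state as a dict."""

    @abstractmethod
    def save(self, state):
        """Persist the given state dict."""


class JSONFileBackend(StateBackend):
    def __init__(self, path):
        self.path = path

    def load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def save(self, state):
        with open(self.path, "w") as f:
            json.dump(state, f)


class SQLiteBackend(StateBackend):
    def __init__(self, dbpath, name):
        self.dbpath, self.name = dbpath, name
        with closing(sqlite3.connect(dbpath)) as con, con:
            con.execute(
                "CREATE TABLE IF NOT EXISTS state "
                "(name TEXT PRIMARY KEY, body TEXT)"
            )

    def load(self):
        with closing(sqlite3.connect(self.dbpath)) as con:
            row = con.execute(
                "SELECT body FROM state WHERE name = ?", (self.name,)
            ).fetchone()
        return json.loads(row[0]) if row else {}

    def save(self, state):
        with closing(sqlite3.connect(self.dbpath)) as con, con:
            con.execute(
                "INSERT OR REPLACE INTO state VALUES (?, ?)",
                (self.name, json.dumps(state)),
            )


class Container:
    """Front end: identical code regardless of where the state lives."""

    def __init__(self, backend):
        self._backend = backend

    def tag(self, *tags):
        state = self._backend.load()
        state["tags"] = sorted(set(state.get("tags", [])) | set(tags))
        self._backend.save(state)

    @property
    def tags(self):
        return self._backend.load().get("tags", [])


root = tempfile.mkdtemp()
on_disk = Container(JSONFileBackend(os.path.join(root, "moe.json")))
in_db = Container(SQLiteBackend(os.path.join(root, "stooges.db"), "moe"))
for container in (on_disk, in_db):
    container.tag("stooge", "bowlcut")
```

This covers only metadata. As noted above, a dataset backend is the harder part, since large arrays and tables don't map onto a key-value interface this neatly.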

There is discussion about HDF5 vs. database solutions in various corners of the internet, but I found these pretty good for shedding some light on the relevant topics:

discussion on the pydata (pandas) mailing list: https://groups.google.com/forum/#!topic/pydata/S3kLxyrizkI
Anthony Scopatz' tutorial on HDF5: https://github.com/scopatz/hdf5-is-for-lovers/blob/master/hdf5-is-for-lovers.ipynb

Thoughts?

cing · 2015-07-17T01:39:55Z

Wow, +1! You just wrote the exact documentation page I've been looking for, well said.

dotsdl · 2015-07-17T03:15:28Z

Good! I'll make sure it makes its way into the docs. I agree that datreant's docs have to do a good job of addressing these pros and cons in order to be relevant to a lot of the community. Thanks for prodding me for it. :D

orbeckst · 2015-07-17T08:07:03Z

Random late comments:

Can you then think of containers in a much broader sense as really any kind of URI — I am thinking of a framework that allows one to organize all kinds of resources such as online databases, local files, etc.
You need a new name ;-) – such as
1. AnySynthesis
2. *Synthesis (for glob fans)
3. .*Synthesis (for regexers)
4. DotStarSynthesis (because all Unixers would hate you for 2 and 3; and it almost spells "Dotson" ;-) )

orbeckst · 2015-07-17T08:10:07Z

Ah, the whole thread was already tl;dr – didn't realize that you already picked a name. If you need my permission for splitting out code and relicensing then it is hereby given.

dotsdl · 2015-07-17T17:31:37Z

Awesome. In that case I'll start the process of breaking things out. I'm not going to try any fancy splitting with filter-branch since the classes we need to separate live among each other in the same files. I'm going to just do a straight clone of MDS and start trimming. :)

One last question; which name do you like better?

pytreant
datreant

Both are short, reasonably easy to say, and have personality. Not sure how in vogue or advantageous it is to have "py" in the name of a python package these days. These are the tough decisions. :)

cing · 2015-07-17T20:55:47Z

I take names very seriously! I was dead set on making a replica exchange server with the name Lakitu a couple years ago. That's the name of the bad guy in Super Mario that launches spikey turtles all over the place to kill Mario, and he lives on a cloud. The server would basically float around the supercomputer running on different nodes and launch jobs as needed to meet some sampling goal. Interest in the project kinda dissolved, but I still stand by the coolness of that name!

Anyway, I see that "treant" is taken on the cheese shop. The py prefix seems too oldschool to me, and snakes in trees is too biblical. I'm voting for datreant. I'm imagining someone illustrating a crazy digital tree monster as the logo. Another option out of the blue would be, err... Cubby or DataCubby, in reference to those pieces of furniture from school. It kind of fits the whole container idea, but it's certainly not as cool as treant.

dotsdl · 2015-07-20T06:04:13Z

> Can you then think of containers in a much broader sense as really any kind of URI — I am thinking of a framework that allows one to organize all kinds of resources such as online databases, local files, etc.

@orbeckst I think this is possible, and I don't think it's a challenge for the front-end API. One could imagine having Containers in a PostgreSQL database somewhere and giving a full URI instead of a filesystem location with something like

    import datreant as dtr
    c = dtr.Container('postgresql://username:password@hostname:port')

Perhaps the way odo implements this would be of help if this is something we wish to pursue.
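One way such dispatch could work on the front end is to inspect the URI scheme and fall back to the filesystem when there is none. The sketch below uses only the standard library's `urllib.parse.urlsplit`; `resolve_backend` and the tuples it returns are hypothetical, and this is not a description of how odo or datreant do it.

```python
from urllib.parse import urlsplit


def resolve_backend(location):
    """Pick a (hypothetical) backend description from a path or URI."""
    parts = urlsplit(location)
    if parts.scheme in ("", "file"):
        # Plain paths have no scheme, so they keep working unchanged.
        return ("filesystem", parts.path)
    if parts.scheme == "postgresql":
        return ("postgresql", parts.hostname, parts.port)
    raise ValueError("no backend registered for scheme %r" % parts.scheme)


local = resolve_backend("moe")
remote = resolve_backend("postgresql://user:secret@db.example.org:5432")
```

A real implementation would return backend objects and not tuples, and would need special handling for Windows drive letters, which `urlsplit` parses as a scheme.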

@cing I remember those things in kindergarten, but not sure how I feel about 'cubby'. I agree that 'datreant' works, and I like it over 'pytreant', too. :D

Seeing no other complaints, and now having permission to split the core out with a different license, I'm happy now to call this issue closed!

dotsdl · 2015-07-20T06:08:02Z

BTW: we now have a mailing list here: https://groups.google.com/forum/#!forum/mdsynthesis

I can say confidently from its past history that it is low traffic. :P

The developers of MDSynthesis (datreant/MDSynthesis#38) have decided to split out its core into a new package called "datreant". This makes the unique way datreant handles the problem of heterogeneous data storage and recall available to users in fields outside of molecular dynamics work, hopefully spurring development in new and wonderful ways. MDSynthesis will live on as a package built around datreant with MDAnalysis-specific subclasses of datreant Treants.

One note of interest: to reduce blandness and avoid confusion with other common uses of the word, all instances of "Container" have been replaced with "Treant". For the uninitiated, a treant is a mythical walking, talking tree, originating in the land of Dungeons and Dragons. The idea is that datreant's data model relies on directory trees, and leverages their usefulness over database-only solutions. If the use of 'Treant' and 'treant' in the documentation results in too much confusion, we will consider other names.

The directory package of MDSynthesis has been reworked to accommodate this change. For one thing, it's been flattened. If this turns out to be confusing, we'll partition it up. One worry is that it removes the visual separation between the backend (File objects) and the frontend (Treants).

dotsdl · 2015-07-21T02:49:26Z

The deed is done: https://github.com/dotsdl/datreant

There is one issue (as far as I can tell) holding back support for Python 3.4 (datreant/datreant#10), but I think a workaround is possible.

dotsdl · 2015-07-21T02:50:47Z

The next task is removing the duplicate parts of MDSynthesis. This will happen before the 0.5.1 release.

dotsdl added the question label Jul 16, 2015

dotsdl closed this as completed Jul 20, 2015

dotsdl mentioned this issue Jul 21, 2015

Remove core elements of MDSynthesis that have moved to datreant #40

Closed
