Split off MDSynthesis core into another package? #38
Comments
Yeah seems sensible, permission given etc. The only potential annoyance is that MDS requires working on 2 repos, and you have to worry about multiple apps on datreant. But you've structured stuff quite strictly, and I think that will pay off now. Name is pretty cool though!
Yeah, I agree that development (including issues, wiki, docs, etc.) will now be split two ways, which is annoying. However, I hope this can be offset by the work of new contributors since it would make the core relevant to more people. Of course this also means more diverse pressures from other domain-specific packages that come out of it, but I think this should be a tide that lifts all ships (including MDS). I think in terms of raw development work, it should be fairly easy (famous last words) to split out the core, since everything is pretty strictly partitioned. Any thoughts, @orbeckst and @cing? |
I'm into it. To be honest, I don't even use the MDAnalysis specific functionality (universes or selections). I have a big mish-mash of analysis scripts, some in MDTraj, VMD, and GMX tools, so universes and selections aren't required in those cases, and I just load the data into DataFrames and append new data with MDSynthesis. Then I group/tag/assemble the data from multiple Sims for plots, etc. Like Selections and Universes, I wonder if you could store any type of domain-specific objects with the "Sims". I guess it would need to be pickleable/remakeable. David, did you get any feedback on the project at SciPy when you presented this? |
The only real MD-specific object is the The most useful functionality is already included in At SciPy I got many interested folks asking about the problem it addresses. I think for many they were confused why I wasn't using fewer HDF5 files (such as one file per I think MDS approaches the problem of data management in a pragmatic-rather-than-pure way, similar to tools like pandas and others in the wild (such as blaze and dask). The main idea of MDS is that the filesystem is quite good for a lot of things, and doing things at the filesystem level carries many advantages over trying to shove everything into a single database. I think this approach is reasonable for many other areas in science. The individual that contacted me, for example, applies machine learning to neuroimaging. He needs to apply many different estimators to different datasets, and because this is time consuming, he needs a reasonable approach to storing and recalling the results. |
Right, I can see how you structured the code with this type of thing in mind (subclasses of Container). So it sounds like a logical move to me. While we're talking about general things, I'll be honest, I feel a little uninformed on the real benefits of the filesystem approach for data management that MDS embodies. I think the docs for treant should really hammer this point home. Despite the reasons you describe about the "context of MD", I can't really figure out a real reason why the community loves dumping files of numbers on disk so much, rather than into a database. The main reason I could imagine is that scientists are just lazy and used to working with flat files of numbers, and HDF5 is a faster/smaller version of that. I imagine that most MD researchers don't work on a big enough scale to gain anything from the benefits that a database approach offers. It seems to me that you're adding a lot of functionality in MDS on top of HDF5s to make them database-like, like file locking, and isn't tagging/grouping just a simplified version of database querying? Don't get me wrong, I'm all about MDS as it just fits perfectly into my workflow, I just think we should carve out the niche a bit. Hmm, could MDS work with a database "under-the-hood" rather than HDF5?
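To make the "tagging is a simplified query" comparison concrete, here is a rough sketch, not MDSynthesis code. The `sims` mapping and both helper functions are hypothetical names invented for illustration. It shows the same "has all of these tags" filter written once as a plain set operation and once as a SQL query.

```python
import sqlite3

# Hypothetical example data: simulation name -> set of tags.
sims = {
    "run1": {"popc", "equilibrated"},
    "run2": {"popc"},
    "run3": {"dppc", "equilibrated"},
}


def with_tags(records, *tags):
    """Return names whose tag set contains every requested tag."""
    wanted = set(tags)
    return sorted(name for name, have in records.items() if wanted <= have)


def with_tags_sql(records, *tags):
    """Same filter as with_tags, expressed as a SQL query.

    Assumes at least one tag is given.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tags (name TEXT, tag TEXT)")
    conn.executemany(
        "INSERT INTO tags VALUES (?, ?)",
        [(name, tag) for name, have in records.items() for tag in have],
    )
    placeholders = ",".join("?" * len(tags))
    rows = conn.execute(
        f"SELECT name FROM tags WHERE tag IN ({placeholders}) "
        "GROUP BY name HAVING COUNT(DISTINCT tag) = ? ORDER BY name",
        (*tags, len(set(tags))),
    )
    return [row[0] for row in rows]
```

Both functions select the same records; the difference is only in where the tags live and who does the filtering.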
These are fair questions, and I think it's important to lay all this out to determine the best approach going forward. My thoughts on this are open for revision, but here's what they are at the moment. Sorry if it's a bit long. I think the benefits of MDS' current design include:
There are, of course, disadvantages:
As to your question about using a database "under-the-hood" rather than HDF5, I think that is possible. The package is written with a clear separation between the front-end containers and the backend objects that talk to the files, and in principle backend objects could be written to talk to a database instead. Backends would have to be written for storing datasets, too, which is a bit trickier. If there is interest in more backends, I'm happy to help flesh them out. There is discussion about HDF5 vs. database solutions in various corners of the internet, but I found these pretty good for shedding some light on the relevant topics:
Thoughts?
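The front-end/backend separation described above could look roughly like the sketch below. This is an illustration under assumptions, not the actual MDSynthesis or datreant code: `TagBackend`, `JSONFileBackend`, `SQLiteBackend`, and this toy `Container` are all hypothetical names, and a JSON file stands in for the real state file format.

```python
import json
import sqlite3
from pathlib import Path
from typing import Protocol, Set


class TagBackend(Protocol):
    """Hypothetical interface a front-end container talks to."""

    def load_tags(self) -> Set[str]: ...

    def save_tags(self, tags: Set[str]) -> None: ...


class JSONFileBackend:
    """Stores tags in a file on disk (stand-in for a real state file)."""

    def __init__(self, path):
        self.path = Path(path)

    def load_tags(self):
        if not self.path.exists():
            return set()
        return set(json.loads(self.path.read_text()))

    def save_tags(self, tags):
        self.path.write_text(json.dumps(sorted(tags)))


class SQLiteBackend:
    """Stores the same information in a database table instead."""

    def __init__(self, database, name):
        self.conn = sqlite3.connect(database)
        self.name = name
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tags "
            "(container TEXT, tag TEXT, UNIQUE(container, tag))"
        )

    def load_tags(self):
        rows = self.conn.execute(
            "SELECT tag FROM tags WHERE container = ?", (self.name,)
        )
        return {row[0] for row in rows}

    def save_tags(self, tags):
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "DELETE FROM tags WHERE container = ?", (self.name,)
            )
            self.conn.executemany(
                "INSERT INTO tags VALUES (?, ?)",
                [(self.name, tag) for tag in sorted(tags)],
            )


class Container:
    """Toy front-end: it never touches files or SQL directly."""

    def __init__(self, backend):
        self._backend = backend

    @property
    def tags(self):
        return self._backend.load_tags()

    def add_tags(self, *tags):
        self._backend.save_tags(self._backend.load_tags() | set(tags))
```

Because the front-end only calls `load_tags` and `save_tags`, swapping the file backend for a database one does not change user-facing code. Storing datasets this way would need a richer interface, which is the trickier part mentioned above.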
Wow, +1! You just wrote the exact documentation page I've been looking for, well said.
Good! I'll make sure it makes its way into the docs. I agree that datreant's docs have to do a good job of addressing these pros and cons in order to be relevant to a lot of the community. Thanks for prodding me for it. :D
Random late comments:
Ah, the whole thread was already tl;dr – didn't realize that you already picked a name. If you need my permission for splitting out code and relicensing then it is hereby given. |
Awesome. In that case I'll start the process of breaking things out. I'm not going to try any fancy splitting with filter-branch since the classes we need to separate live among each other in the same files. I'm going to just do a straight clone of MDS and start trimming. :) One last question; which name do you like better?
Both are short, reasonably easy to say, and have personality. Not sure how in vogue or advantageous it is to have "py" in the name of a python package these days. These are the tough decisions. :) |
I take names very seriously! I was dead set on making a replica exchange server with the name Lakitu a couple years ago. That's the name of the bad guy in Super Mario that launches spiky turtles all over the place to kill Mario, and he lives on a cloud. The server would basically float around the supercomputer running on different nodes and launch jobs as needed to meet some sampling goal. Interest in the project kinda dissolved, but I still stand by the coolness of that name! Anyway, I see that "treant" is taken on the cheese shop. The py prefix seems too oldschool to me, and snakes in trees is too biblical. I'm voting for datreant. I'm imagining someone illustrating a crazy digital tree monster as the logo. Another option out of the blue would be, err... Cubby or DataCubby, in reference to those pieces of furniture from school. It kind of fits the whole container idea, but it's certainly not as cool as treant.
@orbeckst I think this is possible, and I don't think it's a challenge for the front-end API. One could imagine having Containers in a PostgreSQL database somewhere and giving a full URI instead of a filesystem location with something like
    import datreant as dtr
    c = dtr.Container('postgresql://username:password@hostname:port')
Perhaps the way odo implements this would be of help if this is something we wish to pursue.
@cing I remember those things in kindergarten, but not sure how I feel about 'cubby'. I agree that 'datreant' works, and I like it over 'pytreant', too. :D Seeing no other complaints, and now having permission to split the core out with a different license, I'm happy now to call this issue closed!
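On the URI idea above: one way a front-end could tell a database URI from a filesystem location is to inspect the scheme. The sketch below is only an assumption about how that might work; `resolve_location` is a hypothetical helper and not part of datreant, and it uses only the standard library's `urllib.parse`.

```python
from urllib.parse import urlparse


def resolve_location(location):
    """Classify a location string as a database URI or a filesystem path.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(location)
    if parsed.scheme in ("postgresql", "postgres"):
        return {
            "kind": "database",
            "host": parsed.hostname,
            "port": parsed.port,  # None if the URI gives no port
            "user": parsed.username,
        }
    # Anything without a recognized scheme is treated as a path on disk.
    return {"kind": "filesystem", "path": location}
```

A constructor could then pick a file-based or database-based backend from the returned `kind`, leaving the rest of the front-end unchanged.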
BTW: we now have a mailing list here: https://groups.google.com/forum/#!forum/mdsynthesis I can say confidently from its past history that it is low traffic. :P |
The developers of MDSynthesis (datreant/MDSynthesis#38) have decided to split out its core into a new package called "datreant". This makes the unique way datreant handles the problem of heterogeneous data storage and recall available to users in fields outside of molecular dynamics work, hopefully spurring development in new and wonderful ways. MDSynthesis will live on as a package built around datreant with MDAnalysis-specific subclasses of datreant Treants. One note of interest: to reduce blandness and avoid confusion with other common uses of the word, all instances of "Container" have been replaced with "Treant". For the uninitiated, a treant is a mythical walking, talking tree, originating in the land of Dungeons and Dragons. The idea is that datreant's data model relies on directory trees, and leverages their usefulness over database-only solutions. If the use of 'Treant' and 'treant' in the documentation results in too much confusion, we will consider other names. The directory package of MDSynthesis has been reworked to accommodate this change. For one thing, it's been flattened. If this turns out to be confusing, we'll partition it up. One worry is that it removes the visual separation between the backend (File objects) and the frontend (Treants).
The deed is done: https://github.com/dotsdl/datreant There is one issue (as far as I can tell) holding back support for Python 3.4 (datreant/datreant#10), but I think a workaround is possible. |
The next task is removing the duplicate parts of MDSynthesis. This will happen before the 0.5.1 release. |
Since the basic structure of MDSynthesis isn't entirely specific to molecular dynamics simulation, I think it might make sense to break the core out into a separate package. I'd been considering this for a while, but today I was contacted by someone with a use-case outside of MD, and I think it would be of benefit to MDSynthesis' core development to make it more general.
I've started a repository for this work here: https://github.com/dotsdl/datreant
Is anyone opposed to this? One thing to note is that datreant is BSD 3-clause licensed, making its use more permissive than MDSynthesis, which is GPLv2 (same as MDAnalysis). I technically need permission from everyone that has contributed code so far to make this change.
P.S. Opinions on the name are welcome. I wanted something that included 'dat' to indicate 'data', and some kind of word to indicate 'trees' (as in directory trees). An old D&D woodland creature ('treant') came to mind. :D