New persistence package #755
A few questions that I'll post here for now. I somehow forgot about versioning in last week's discussion, so: is versioning of files/directories supported in the new version? If so, how? This may not be a concern (and in fact may be a burden) for Matthias's work, but it was one of the drivers in the original implementation and a reason we chose a VCS to build on. Also, we have no compression now, correct? So a directory in which one file differs is a completely new directory in the store? Building on that, what happens when you get the same output on a different run? Is that output annotated with the multiple executions it is tied to? Finally, are input files/dirs supported, or only output/intermediate files/dirs? If so, how are input files identified in the store?
Replying to comment 2 @dakoop:
The notion of version is a little weird, since there is no reason for the same workflow to generate a different file. This was previously tied to the uuid we assigned to a specific persistentfile module, now we can just set a tag (or any kind of key=value pair) on that module in the same way. We would still have different files generated for that tag, but they wouldn't be subsequent versions (although the system could still show a timestamp) but separate files which can exist independently.
This might be a problem. There is currently no compression: this allows for immediate access to the data, once the system gives you the filename, but might increase disk usage; Git has both de-duplication (at the file/object level) and delta compression (useful if files are small enough and similar). I'm adding [12].
This is a question I haven't answered yet. The current model assumes that you won't stumble on the exact same file again, or that if you do it's the same one (with the exact same metadata), see [10]. Deduplication is possible (i.e. store separate entries with hash(metadata + contents) but store the contents once).
We can have a configuration widget that allows the user to query, and once a file is selected, have a function be set for the hash (this is what currently happens).
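The deduplication idea mentioned above (separate entries keyed by hash(metadata + contents), with the contents stored once) could be sketched roughly as follows. This is only an illustration: `DedupStore` and its in-memory dictionaries are hypothetical and are not part of file_archive.

```python
import hashlib
import json


class DedupStore:
    """Toy in-memory store (hypothetical, not file_archive's API)."""

    def __init__(self):
        self.blobs = {}    # content hash -> bytes, each distinct content kept once
        self.entries = {}  # entry hash -> (metadata, content hash)

    def add(self, contents, metadata):
        content_hash = hashlib.sha1(contents).hexdigest()
        # Canonical JSON so that equal metadata always hashes identically.
        meta_bytes = json.dumps(metadata, sort_keys=True).encode("utf-8")
        # The entry hash covers both the metadata and the contents.
        entry_hash = hashlib.sha1(
            meta_bytes + b"\n" + content_hash.encode("ascii")).hexdigest()
        self.blobs.setdefault(content_hash, contents)
        self.entries[entry_hash] = (dict(metadata), content_hash)
        return entry_hash


store = DedupStore()
first = store.add(b"same output", {"run": 1})
second = store.add(b"same output", {"run": 2})
print(len(store.entries), len(store.blobs))
```

Two runs that produce identical output end up as two entries pointing at a single stored blob, so each execution keeps its own metadata without duplicating the data.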
* New module: persistent_archive
* Uses file_archive
  Vendor it in VisTrails? (recommend against submodule)
* Only provides caching for the moment
Pushed new branch
I now have:
Some names probably ought to be changed, and maybe some modules put into namespaces. I'll start working on the viewer part.
[Original comment by troyer] Thank you for the progress report. From looking at the description here this looks reasonable, but I see three more sets of features that are needed:
Replying to comment 6 by troyer:
This is already implemented, I just didn't list them.
That sounds easy enough, I will add it today.
Do you mean, from the workflow? I intend this to be only doable from the UI, and to not provide modules for this.
* 'most_recent' returns the result with the most recent timestamp (was: 'path').
* 'results' returns the list of all results (new)
* 'count' returns the number of results (new)
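As a rough illustration of how the three outputs listed above relate to one another, here is a minimal sketch. The `Entry` tuple and `query_outputs` function are hypothetical stand-ins, not the actual persistent_archive or file_archive code.

```python
from collections import namedtuple

# Hypothetical stand-in for a stored result; not file_archive's Entry class.
Entry = namedtuple("Entry", ["path", "timestamp"])


def query_outputs(entries):
    """Derive 'most_recent', 'results' and 'count' from the matching entries."""
    results = list(entries)
    # No match means there is no most recent result to report.
    most_recent = max(results, key=lambda e: e.timestamp) if results else None
    return {"most_recent": most_recent,
            "results": results,
            "count": len(results)}


outputs = query_outputs([
    Entry("a.dat", 10),
    Entry("b.dat", 30),
    Entry("c.dat", 20),
])
print(outputs["most_recent"].path, outputs["count"])
```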
add_*() now return an Entry (instead of the hash).
This means all QueryCondition and Metaclass subclasses are constants too. This allows conditions to be written as functions, and will be used by the UI.
Currently this is simply the one from file_archive.
This fixes the persistent_archive package for the new behavior introduced by install_package_requirements.
Tommy, note the dependence on file_archive for the binaries. This is just Python code.
file_archive can be installed through pip. I will test and make sure they are included in the binaries.
Needs #1010
Looks good, I added file_archive to build machines and merged #1010.
dcbd912 adds docstrings for some
This is a minor issue, but the "*Path" modules have docstrings while the corresponding "*File" and "*Dir" modules do not.
Yes. There may be some cases where the subclass changes the way things are done, but hopefully the inheritance will prompt developers to make sure the documentation is updated. We should check our own code on this.
Indeed, I assumed it would get inherited. I can see arguments for inheriting (like here) and not inheriting (different modules have different behaviors, even if that difference can be minor). It's not really any trouble for me to copy the docstring, so I'll probably do that. Python doesn't inherit docstrings either, so it's probably fine to stick to this.
Remember: docstrings are not inherited
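For reference, the plain-Python behavior referred to above can be checked with a small example. The class names here are made up for illustration and are not the actual persistent_archive modules.

```python
import inspect


class BasePath:
    """Base module: documented."""

    def compute(self):
        """Do the work."""


class FileVariant(BasePath):
    # No docstring of its own, and compute() is overridden without one.
    def compute(self):
        pass


print(FileVariant.__doc__)          # the subclass has no __doc__ of its own
print(FileVariant.compute.__doc__)  # neither does the overriding method
# inspect.getdoc() walks the class hierarchy on Python 3.5 and later.
print(inspect.getdoc(FileVariant))
```

Anything that reads `__doc__` directly sees nothing on the subclass, so copying the docstring (or looking it up through the base classes explicitly) is needed if the documentation should show up.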
After meeting with Matthias Troyer, we established the need for a new persistence package, that wouldn't be based on Git.
Meeting summary and design directions: http://vistrails.org/index.php/Archive
Project page on github: https://github.com/remram44/file_archive