Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New persistence package #755

Merged
merged 55 commits into from
Mar 18, 2015
Merged

New persistence package #755

merged 55 commits into from
Mar 18, 2015

Conversation

remram44
Copy link
Member

After meeting with Matthias Troyer, we established the need for a new persistence package, that wouldn't be based on Git.

Meeting summary and design directions: http://vistrails.org/index.php/Archive

Project page on github: https://github.com/remram44/file_archive

@dakoop
Copy link
Member

dakoop commented Oct 1, 2013

A few questions that I'll post here for now. I somehow forgot about versioning in last week's discussion so is versioning of files/directories supported in the new version? If so, how? This may not be a concern (and in fact may be a burden) for Matthias's work, but it was one of the drivers in the original implementation and a reason we chose a vcs to build on. Also, we have no compression now, correct? So a directory in which one file differs is a completely new directory in the store? Building on that, what happens when you get the same output on a different run? Is that output annotated with multiple executions it is tied to? Finally, are input files/dirs supported or only output/intermediate files/dirs? If so, how are input files identified in the store?

@remram44
Copy link
Member Author

remram44 commented Oct 1, 2013

Replying to comment 2 @dakoop:

I somehow forgot about versioning in last week's discussion so is versioning of files/directories supported in the new version?

The notion of version is a little weird, since there is no reason for the same workflow to generate a different file. This was previously tied to the uuid we assigned to a specific persistentfile module, now we can just set a tag (or any kind of key=value pair) on that module in the same way. We would still have different files generated for that tag, but they wouldn't be subsequent versions (although the system could still show a timestamp) but separate files which can exist independently.

Also, we have no compression now, correct? So a directory in which one file differs is a completely new directory in the store?

This might be a problem. There is currently no compression: this allows for immediate access to the data, once the system gives you the filename, but might increase disk usage; Git has both de-duplication (at the file/object level) and delta compression (useful if files are small enough and similar). I'm adding [12].

Building on that, what happens when you get the same output on a different run? Is that output annotated with multiple executions it is tied to?

This is a question I haven't answered yet. The current model assumes that you won't stumble on the exact same file again, or that if you do it's the same one (with the exact same metadata), see [10]. Deduplication is possible (i.e. store separate entries with hash(metadata + contents) but store the contents once).

Finally, are input files/dirs supported or only output/intermediate files/dirs? If so, how are input files identified in the store?

We can have a configuration widget that allows the user to query, and once a file is selected, have a function be set for the hash (this is what currently happens).

* New module: persistent_archive
* Uses file_archive
    Vendor it in VisTrails? (recommend against submodule)
* Only provides caching for the moment
@remram44
Copy link
Member Author

remram44 commented Oct 1, 2013

Pushed new branch persistent_archive.

@remram44
Copy link
Member Author

remram44 commented Oct 3, 2013

I now have:

  • !PersistedInputFile: gets a file from the store, from key=value pairs. If a local filename is provided and exists, it will be stored if different or nothing is stored (along with the key=value pairs).
  • !PersistedFile: replaces intermediate and output file, adds a file to the store along with metadata. If a file is in the store with the exact same module signature, it is returned without computing the upstream file.
  • I kept for now !CachedFile: lighter versions of !PersistedFile/PersistedDir which only keep the module signature, designed to be used for caching.
  • !QueriedInputFile: allows to use more complex queries to find a file (i.e. range queries). !PersistedInputFile is limited to equality tests since the metadata it receives must also be insertable.

Some names probably ought to be changed, and maybe some modules put into namespaces. I'll start working on the viewer part.

@VisTrailsBot
Copy link

[Original comment by troyer]

Thank you for the progress report. From looking at the description here this looks reasonable, but I see three more sets of features that are needed:

  • The same as above but for directories instead of files
  • Query modules that return a list of files or list of directories that match a query. I might then either use them with a Map module or just pass the list on to one of my modules that load data from a list of files.
  • The ability to remove the matching files (directories) from the file store

@remram44
Copy link
Member Author

remram44 commented Oct 4, 2013

Replying to comment 6 by troyer:

  • The same as above but for directories instead of files

This is already implemented, I just didn't list them.

  • Query modules that return a list of files or list of directories that match a query. I might then either use them with a Map module or just pass the list on to one of my modules that load data from a list of files.

That sounds easy enough, I will add it today.

  • The ability to remove the matching files (directories) from the file store

Do you mean, from the workflow? I intend this to be only doable from the UI, and to not provide modules for this.

* 'most_recent' returns the result with the most recent timestamp (was:
  'path').
* 'results' returns the list of all results (new)
* 'count' returns the number of results (new)
add_*() now return an Entry (instead of the hash).
This means all QueryCondition and Metaclass subclasses are constants
too. This allows to write condition as functions, and will be used by
the UI.
Currently this is simply the one from file_archive.
This fixes the persistent_archive package for the new behavior
introduced by install_package_requirements.
@dakoop
Copy link
Member

dakoop commented Mar 11, 2015

Tommy, note the dependence on file_archive for the binaries. This is just python code.

@rexissimus
Copy link
Member

file_archive can be installed through pip. I will test and make sure they are included in the binaries.

@remram44
Copy link
Member Author

Needs #1010

@rexissimus
Copy link
Member

Looks good, I added file_archive to build machines and merged #1010.
The subclassed Modules are missing docstrings, should we add that?

@remram44
Copy link
Member Author

dcbd912 adds docstrings for some QueryCondition subclasses, are these what you meant?

@rexissimus
Copy link
Member

This is a minor issue, but the "_Path" modules have docstrings but corresponding "_File" and "*Dir" modules does not.
The modules are very similar so maybe they could share docstrings.
Would it make sense if vistrails could take module docstrings from superclasses if it was missing?

@dakoop
Copy link
Member

dakoop commented Mar 17, 2015

Yes. There may be some cases where the subclass changes the way things are done, but hopefully the inheritance will prompt developers to make sure the documentation is updated. We should check our own code on this.

@remram44
Copy link
Member Author

Indeed I assumed it would get inherited.

I can see arguments for inheriting (like here) and not inheriting (different modules have different behaviors, even if that difference can be minor). It not really any trouble for me to copy the docstring so I'll probably do that.

Python doesn't inherit docstrings either, so it's probably fine to stick to this.

remram44 added a commit that referenced this pull request Mar 18, 2015
@remram44 remram44 merged commit 0500d30 into master Mar 18, 2015
@remram44 remram44 deleted the persistent_archive branch August 6, 2015 17:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants