
Setting up the archiving program

gcorbin edited this page Oct 16, 2018 · 6 revisions

Archive Structure

The archives are organized as follows. Each project has its own archive, and each archive is a separate folder in the same directory as the main program. The folder name is the archive name. The archiving program expects a file named 'project.ini' in the archive directory, which stores all project-specific settings and paths.

Let's say we have a project called 'some-project' that we want to archive. First, create a new folder 'some-project' in the program directory and in this folder create a 'project.ini' file. The directory should look like this:

ace.py               # The main executable
experimentarchiver/  # The experimentarchiver module
some-project/        # Archive for our project 
    project.ini      # Project specific settings go in here

Setting up the archive

Every project has a different structure. Because we probably do not want to change the whole structure of our projects just to be able to use this archiving program, the aim is to make it as generic as possible. Nevertheless, some basic assumptions on the project structure have been made.

An example project

But first, let's see how to set up the archive for a very simple example project.

This hypothetical example project is located in

/home/path/to/project/

Inside we see the following structure:

.git/                    # everything in here belongs to the project
build/                   # build system and executables
    build.sh             # builds the executable do-stuff.exe from src/, (could be a makefile or anything else)
    do-stuff.exe         # an executable, (there could be more)
out/                     # output of do-stuff goes in here
src/                     # source code
    do-stuff.cpp         # source code for the executable do-stuff.exe
param/                   # parameters to the executable (the program knows where to look for them)
    parameters.txt
    more-parameters.txt  

Setting up the project.ini file

The 'project.ini' file is interpreted with the configparser module.

Options

General options go in the 'options' section:

[options]
do-git-checkout = true
do-build = true
build-command = ./build.sh

Of course the source code is versioned with git and we set 'do-git-checkout' to 'true' accordingly.

Because the project is written in C++, we need to compile the executable every time the code has changed and therefore set 'do-build' to 'true'. Our 'build system' is a simple shell script. The setting 'build-command=./build.sh' tells the archiving program to execute it every time an experiment is restored. Even for much more complex projects, it is assumed that everything can be built with a single command. This is not a real restriction, because especially for larger projects it is a good idea to have a functioning build system.

Note how we only specified the path of the build command relative to the build directory. The program will look for this command (and also execute it) in the directory specified by 'build-path'.
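As a rough illustration of what "executing the build command in the build directory" can mean, here is a minimal sketch. The function name 'run_build' is hypothetical, and running the command through the shell is an assumption; the archiving program may invoke the build differently.

```python
import subprocess


def run_build(build_command, build_path):
    """Run build_command with build_path as working directory.

    Hypothetical helper: 'build_command' and 'build_path' stand for the
    values of the 'build-command' and 'build-path' settings.
    Returns the exit code of the build command.
    """
    # cwd= makes a relative command such as ./build.sh resolve inside build_path.
    completed = subprocess.run(build_command, shell=True, cwd=build_path)
    return completed.returncode
```

With the example settings, this would be called as run_build('./build.sh', '/home/path/to/project/build/').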

This brings us to the section 'paths' which holds all the structure information about the project.

Paths

We set up the 'paths' section in the 'project.ini' file to reflect the project structure.

[paths]
git-path=/home/path/to/project/
top-path=${git-path}
build-path=${top-path}build/
output-data-path=${top-path}out/

Note: All these paths should be given as absolute paths. The interpolation syntax of the configparser module can (and should) be used to make the project structure more apparent and to avoid copy-paste errors.

  • 'git-path': The archiving program will perform all git operations (checking out a version, checking if the repo is clean) in this directory.
  • 'top-path': This is set to the root of the project. In this example, it is identical to the 'git-path', but this need not be the case.
  • 'build-path': The build command is given relative to this directory and executed from here. The main program is also executed from here.
  • 'output-data-path': All program output lands in this directory.
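To see how the interpolated paths resolve, the 'paths' section can be read with Python's configparser. The '${...}' references suggest the extended interpolation mode, so this sketch assumes it; the helper name 'read_paths' is hypothetical and is not part of the archiving program.

```python
import configparser

# The 'paths' section from the example above, embedded as a string so the
# sketch runs without a file on disk.
PROJECT_INI = """
[paths]
git-path=/home/path/to/project/
top-path=${git-path}
build-path=${top-path}build/
output-data-path=${top-path}out/
"""


def read_paths(ini_text):
    """Return the 'paths' section as a dict with all references resolved."""
    # Assumption: ${name} references are handled by ExtendedInterpolation.
    parser = configparser.ConfigParser(
        interpolation=configparser.ExtendedInterpolation()
    )
    parser.read_string(ini_text)
    # Reading a value through the section proxy triggers interpolation.
    return dict(parser["paths"])


paths = read_paths(PROJECT_INI)
for name, value in paths.items():
    print(name, "->", value)
```

Because every path is derived from 'git-path', moving the project only requires changing that one line.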

We are still not quite ready to go. The archiving program needs to have a list of parameter files and a list of input data files.

Parameter files

Parameter files are typically simple text files or *.ini files that hold the program configuration, for example physical constants, program options, or paths to data files. While experimenting with our project, we probably change them a lot. Parameter files are small, on the order of a few KB. In the context of this archiving program, this means two things:

  1. Parameter files will be copied to and from the archive.
  2. Even if parameter files are under version control, they are not considered part of the source code and are therefore excluded when checking if the repository is clean. This prevents frequent changes to those files from cluttering the commit history.

To tell the archiving program which files are parameters, we create a file 'parameter-list.txt' in our project and put it under version control.
Actually, this file can be named anything we want and be placed anywhere in our project. We simply need to put another line in the 'paths' section of our 'project.ini':

[paths]
# ...
parameter-list = ${top-path}parameter-list.txt

The archiving program will now read a list of parameter files from this file. The list is interpreted as a newline-separated list of paths to parameter files, relative to the 'top-path'. Therefore it is probably best to put this file in the top-path of the project.

In our example project, we have two parameter files: 'parameters.txt' and 'more-parameters.txt'. Thus the contents of 'parameter-list.txt' are:

param/parameters.txt
param/more-parameters.txt
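The following sketch shows one way such a list could be read and the listed files copied while keeping their relative layout. Both function names are hypothetical, and skipping blank lines is an assumption; the archiving program's own handling may differ.

```python
import os
import shutil


def read_parameter_list(parameter_list_path):
    """Return the relative paths listed in the file, one per line.

    Assumption: blank lines are ignored and surrounding whitespace is stripped.
    """
    with open(parameter_list_path) as f:
        return [line.strip() for line in f if line.strip()]


def copy_parameter_files(relative_paths, source_root, target_root):
    """Copy each listed file from source_root to target_root.

    The relative path (e.g. param/parameters.txt) is preserved, so the same
    function works for copying into the archive and back into the project.
    """
    for rel in relative_paths:
        target = os.path.join(target_root, rel)
        # Create intermediate directories such as param/ if they are missing.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(os.path.join(source_root, rel), target)
```

Archiving would copy from the 'top-path' to the archive folder, and restoring would swap the two root arguments.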

Input data files

Input data files are potentially huge files on the order of GB that we do not want to be under version control. They are even too big to copy into the archive every time an experiment is archived. Instead, it is our responsibility to back up these files appropriately, e.g. on an external disk or a server. The archiving program will only store hashes for these files, and when an experiment is restored, it will verify the identity of those files by comparing hashes.

Any hash algorithm from the hashlib module can be used. We want to use the 'sha256' algorithm and therefore add

hash-algorithm = sha256

to the 'options' section of 'project.ini'. If in the future a weakness of this algorithm is exposed, we can adapt and use a more secure one.

Note: Because the algorithm is stored together with the hashes, changing the algorithm will not break previous experiments.
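A minimal sketch of this idea with hashlib is shown below. The function names and the (algorithm, digest) record are hypothetical and do not describe the archive's actual storage format; the point is only that keeping the algorithm name next to the digest lets old records be verified after the setting changes.

```python
import hashlib


def hash_file(path, algorithm="sha256", chunk_size=1 << 20):
    """Return (algorithm, hex digest) for the file at path.

    The file is read in chunks so that files of several GB do not have to
    fit into memory.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return algorithm, digest.hexdigest()


def verify_file(path, recorded):
    """Check a file against a record produced earlier by hash_file.

    The algorithm is taken from the record, not from the current settings.
    """
    algorithm, expected = recorded
    return hash_file(path, algorithm)[1] == expected
```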

One major assumption at this point is that all input data are located somewhere under one directory. Let's say that this directory is

/home/path/to/data/    
    large-file-a.dat
    large-file-b.dat

We add two lines to the 'paths' section of 'project.ini':

input-data-path = /home/path/to/data/
get-input-files = ${top-path}input-files.py

The first line simply defines the top level directory for all input data. The second line defines the location of a Python script. It is our responsibility to write this script for our project. This may be a bit ugly and complicated compared to simply reading a list from a file, as is done for parameters. But there is a good reason for it: Which data files the program needs can depend on the parameters. If there were only a simple list file, we would have to update it every time we changed the input data files. This is cumbersome and error-prone.

If all we want is a simple list, then the script is as simple as this:

def getFilesToHash():
    return ['large-file-a.dat', 'large-file-b.dat']

This can be modified to suit the project. The file must define a function 'getFilesToHash()' that returns a list of strings. Each string is interpreted as a relative path to a data file, relative to the 'input-data-path'.
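As an illustration of a list that depends on the parameters, the script could derive the file names from a parameter file. Everything specific here is an assumption made for the sketch: the 'datasets' key, the format of 'param/parameters.txt', and the optional argument (which defaults to a path relative to the current working directory). Only the name 'getFilesToHash' and its return type come from the requirements above.

```python
# Hypothetical location of the parameter file that selects the data sets.
DEFAULT_PARAMETER_FILE = "param/parameters.txt"


def read_dataset_names(parameter_file):
    """Return the names after 'datasets =' in the parameter file.

    Assumed format: a line such as 'datasets = a b'. Returns an empty list
    if no such line exists.
    """
    with open(parameter_file) as f:
        for line in f:
            key, sep, value = line.partition("=")
            if sep and key.strip() == "datasets":
                return value.split()
    return []


def getFilesToHash(parameter_file=DEFAULT_PARAMETER_FILE):
    """Return the input files as paths relative to 'input-data-path'."""
    names = read_dataset_names(parameter_file)
    return ["large-file-{}.dat".format(name) for name in names]
```

With 'datasets = a b' in the parameter file, this returns the same two files as the simple list, and changing the parameter changes the hashed files without touching the script.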

Portability of experiments and data

What happens when we try to reproduce an experiment on another computer? The paths may all be different, so we need to change the paths in 'project.ini' accordingly. There is one problem though: How does the project executable know where to look for the data files?

  • We cannot hard-code the data path into the source code, because restoring an experiment will check out the appropriate version of the source code, where the path was hard-coded to the original computer.
  • We cannot give the path as a recorded command line option to the project executable, because those options are also restored.

There are basically two solutions:

  1. The project is structured such that input data are always found in some constant path relative to the project. This works in principle, but only if the input data can be copied around (i.e. if they are not too big).

  2. The project executable takes a named argument for the data path. For example, if this option is named --data-path we add the line

    append-arguments = --data-path=/home/path/to/data/

to the 'options' section. With the append-arguments option any arguments can be appended to the argument list before running the project executable. If we now want to reproduce an experiment on another computer, where data are located in '/a/different/path/to/data/' we only need to change the line above to

append-arguments = --data-path=/a/different/path/to/data/

Of course any arguments can be appended with this option. But please be careful: Specifying more options here than absolutely needed can harm reproducibility of experiments.
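The effect of 'append-arguments' can be sketched as follows. The function name is hypothetical, and splitting the option value with shell-like rules is an assumption about how the archiving program treats it; '--steps=10' is only a made-up recorded argument.

```python
import shlex


def build_argument_list(recorded_arguments, append_arguments):
    """Return the recorded arguments followed by the appended ones.

    'recorded_arguments' stands for the arguments stored with the experiment,
    'append_arguments' for the value of the 'append-arguments' option.
    """
    # shlex.split handles quoting, so a path containing spaces can be quoted.
    return list(recorded_arguments) + shlex.split(append_arguments)


# The recorded part stays the same on every computer; only the appended
# data path comes from the local project.ini.
print(build_argument_list(["--steps=10"],
                          "--data-path=/a/different/path/to/data/"))
```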

Last command

There is one last thing to do. We put the line

last-command=${build-path}last-command

in the 'paths' section of the 'project.ini'. This option tells the archiving program where to store the arguments and return status for the last run command. Again, the file can be located anywhere and named anything. Here we decided to put it into the build directory and name it 'last-command'.

The complete 'project.ini' file

[options]
do-git-checkout = true
do-build = true
build-command = ./build.sh
hash-algorithm = sha256

[paths]
git-path=/home/path/to/project/
top-path=${git-path}
build-path=${top-path}build/
output-data-path=${top-path}out/
parameter-list=${top-path}parameter-list.txt
input-data-path=/home/path/to/data/
get-input-files=${top-path}input-files.py
last-command=${build-path}last-command
