# An introduction to Storage

### Introduction

All we need is contained in the `openpathsampling` package

In [1]:
import openpathsampling as paths

The storage itself is mainly a netCDF file and can also be used as such. Technically it is a subclass of `netCDF4.Dataset`, so we can use all of its functions if we want to add additional tables to the file besides what we save using stores. You can of course also add new stores to the storage. Using `Storage()` will automatically create the set of needed stores when a new file is created.

netCDF files are very generic, while our Storage is tuned to our needs. For example, it has native support for `simtk.units` and can recursively store nested objects using JSON pickling. But we will get to that.

Open the output from the 'alanine.ipynb' notebook to have something to work with

In [2]:
storage = paths.storage.Storage('trajectory.nc')
storage

Storage @ 'trajectory.nc'

and have a look at what stores are available

In [3]:
print storage.list_stores()

['trajectory', 'snapshot', 'configuration', 'momentum', 'sample', 'sample_set', 'collectivevariable', 'pathmover', 'movedetails', 'shootingpoint', 'shootingpointselector', 'dynamicsengine', 'calculation', 'volume', 'ensemble', 'movepath']


and we can access all of these using

In [4]:
snapshot_store = storage.snapshots

### Stores are lists

In general it is useful to think about the storage as a set of lists. Each of these lists contains objects of the same type, e.g. `Sample`, `Trajectory`, `Ensemble`, `Volume`, ... The class instances used to access elements from the storage are called stores. Imagine you go into a store to *buy* and *sell* objects (luckily our stores are free). All the stores share the same storage space, which is a netCDF file on disk.

Still, a store is not really a list or subclassed from a list, but it almost acts like one.

In [5]:
print 'We have %d snapshots in our storage' % len(storage.snapshots)

We have 116 snapshots in our storage


### Loading objects

In the same way we access lists, we can also access these stores using slicing, and even lists of indices.

Load by slicing

In [6]:
print storage.samples[2:4]

[<Sample @ 0x10f9daf50>, <Sample @ 0x1126634d0>]


Load by name

In [7]:
print storage.ensembles['Interface 1'].name

Interface 1


Load by list of indices

In [8]:
print storage.ensembles[[0,1,'Interface 3']]

[<openpathsampling.ensemble.SequentialEnsemble object at 0x112663b90>, <openpathsampling.ensemble.SequentialEnsemble object at 0x110076090>, <openpathsampling.ensemble.SequentialEnsemble object at 0x112663350>]
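
To make the "almost a list" behaviour concrete, here is a minimal sketch of how a store-like class could accept integers, slices, names and mixed lists in `__getitem__`. The classes `ToyStore` and `Named` are hypothetical stand-ins written for this illustration; they are not the openpathsampling implementation and keep everything in memory instead of a netCDF file.

```python
class ToyStore(object):
    """Hypothetical list-like store; not the openpathsampling implementation."""

    def __init__(self, objects):
        self._objects = list(objects)

    def __len__(self):
        return len(self._objects)

    def __iter__(self):
        return iter(self._objects)

    def _load_one(self, key):
        # Strings are treated as names, anything else as an integer index.
        if isinstance(key, str):
            for obj in self._objects:
                if getattr(obj, 'name', None) == key:
                    return obj
            raise KeyError(key)
        return self._objects[key]

    def __getitem__(self, key):
        if isinstance(key, slice):
            return self._objects[key]
        if isinstance(key, (list, tuple)):
            return [self._load_one(k) for k in key]
        return self._load_one(key)


class Named(object):
    """Hypothetical object that only carries a name."""

    def __init__(self, name):
        self.name = name


ensembles = ToyStore(Named('Interface %d' % i) for i in range(4))
print(len(ensembles))
print(ensembles['Interface 1'].name)
print([e.name for e in ensembles[[0, 1, 'Interface 3']]])
print([e.name for e in ensembles[2:4]])
```

Dispatching on the key type inside `__getitem__` is what lets one object answer to all three access styles shown above.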


### Indexing

Each loaded object has an `.idx` attribute, a dictionary that maps each storage to the object's index in that storage. This is necessary since we can, in theory, store an object in several different storages at once, and these might assign different indices. Note that `idx` is NOT a function but a dictionary, hence the square brackets.

In [9]:
print storage.samples[5].idx
print storage.samples[2].idx[storage]

{Storage @ 'trajectory.nc': 5}
2


### Saving

Saving is somewhat special, since we try to deal exclusively with immutable objects. That means that once an object is saved, it cannot be changed. This is not completely true, since the netCDF file allows changes, but we try not to make them. The only exception is collective variables: they can store their cached values, and because we want to store intermediate states, we add new values once we have computed them. This should be the only exception, and we use the `.sync` command to update the status of an already saved collective variable.

Saving is easy. Just use `.save` on the store 

In [10]:
# storage.samples.save(my_sample)

and it will add the object to the end of our store list, or do nothing if the object has already been stored. It is important to note that each object knows whether it has already been stored. This allows us to write nice recursive saving without worrying that we save the same object several times.

You can also store directly using the storage. Both are fine; the storage just delegates the task to the appropriate store.

In [11]:
# storage.save(my_sample)
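
The "save once, then do nothing" behaviour can be sketched as follows. This is only an illustration under simplifying assumptions: `ToyStorage` and `ToySample` are hypothetical classes that keep objects in a Python list, and the only thing borrowed from the text is the idea that every object carries an `idx` dictionary keyed by storage.

```python
class ToyStorage(object):
    """Hypothetical stand-in for a storage with a single store."""

    def __init__(self, filename):
        self.filename = filename
        self._objects = []

    def save(self, obj):
        # The object remembers its index per storage in obj.idx.
        if self in obj.idx:
            return obj.idx[self]      # already stored: do nothing
        self._objects.append(obj)
        obj.idx[self] = len(self._objects) - 1
        return obj.idx[self]


class ToySample(object):
    """Hypothetical object that can live in several storages."""

    def __init__(self):
        self.idx = {}  # maps storage -> index in that storage


storage_a = ToyStorage('a.nc')
storage_b = ToyStorage('b.nc')
first, second = ToySample(), ToySample()

storage_a.save(first)
storage_a.save(second)
storage_a.save(second)   # second call has no effect
storage_b.save(second)

print(second.idx[storage_a], second.idx[storage_b])
print(len(storage_a._objects))
```

Because the check happens inside `save`, calling code never has to track what was already written, which is what makes recursive saving safe.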

I mentioned recursive saving. It works as follows. Imagine an example object such as a `snapshot`, which itself has a `Configuration` and a `Momentum` object. If you store the snapshot, it also stores its contents using the appropriate stores. This can be arbitrarily complex. Most objects can either be stored in a special way or be converted into a JSON string that we can turn into an object again. Python has something like this built in (pickling), which works similarly, but we needed something that adds the recursive storage connection and uses JSON. If you are curious, the JSON string can be accessed for some objects using `.json`, but it is only available for loaded or saved objects. It is not computed unless it is used.

In [12]:
volume = storage.volumes[3]
volume.json

u'{"_cls": "UnionVolume", "_dict": {"volume1": {"_cls": "LambdaVolumePeriodic", "_dict": {"period_min": -3.14159, "lambda_min": -2.094393333333333, "collectivevariable": {"_idx": 0, "_cls": "CV_MD_Function"}, "lambda_max": -0.5235983333333332, "period_max": 3.14159}}, "volume2": {"_cls": "UnionVolume", "_dict": {"volume1": {"_cls": "LambdaVolumePeriodic", "_dict": {"period_min": -3.14159, "lambda_min": -2.094393333333333, "collectivevariable": {"_idx": 0, "_cls": "CV_MD_Function"}, "lambda_max": -0.5235983333333332, "period_max": 3.14159}}, "volume2": {"_cls": "LambdaVolumePeriodic", "_dict": {"period_min": -3.14159, "lambda_min": 1.7453277777777778, "collectivevariable": {"_idx": 0, "_cls": "CV_MD_Function"}, "lambda_max": -3.14159, "period_max": 3.14159}}}}}}'
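
The structure of the string above (nested `_cls` / `_dict` entries, with already stored objects replaced by an `_idx` reference) can be imitated with a few lines of plain Python. The sketch below is a simplified guess at the idea, not the library's serializer: `ToyObject`, its subclasses and the function `simplify` are hypothetical names invented for this example.

```python
import json


class ToyObject(object):
    """Hypothetical base class: attributes live in the instance __dict__."""

    def __init__(self, **attributes):
        self.__dict__.update(attributes)


class ToyCV(ToyObject):
    pass


class ToyVolume(ToyObject):
    pass


class ToyUnion(ToyObject):
    pass


def simplify(obj, stored):
    """Turn obj into JSON-compatible data.

    `stored` maps id(obj) -> index for objects that already live in a store;
    these are written as a short reference instead of being expanded.
    """
    if isinstance(obj, ToyObject):
        cls = type(obj).__name__
        if id(obj) in stored:
            return {'_cls': cls, '_idx': stored[id(obj)]}
        return {'_cls': cls,
                '_dict': {key: simplify(value, stored)
                          for key, value in vars(obj).items()}}
    return obj  # numbers and strings are kept as they are


cv = ToyCV(name='psi')
stored = {id(cv): 0}   # pretend the CV was saved first and got index 0
union = ToyUnion(
    volume1=ToyVolume(collectivevariable=cv, lambda_min=-2.0, lambda_max=-0.5),
    volume2=ToyVolume(collectivevariable=cv, lambda_min=1.7, lambda_max=3.1),
)
print(json.dumps(simplify(union, stored), sort_keys=True))
```

Writing a reference instead of the full object is what connects the JSON string back to the stores: the collective variable appears twice in the output but is only stored once.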

### Iterators

A list is iterable and so is a store. Let's load all ensembles and list their names.

In [13]:
print [ens.name for ens in storage.ensembles]

[u'Interface 0', u'Interface 1', u'Interface 2', u'Interface 3', u'Interface 4', u'Interface 5', u'Interface 6']


Maybe you have noticed that some commands run slower the first time. This is because we use caching: once an object is loaded, it stays in memory and can be accessed much faster.
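
A minimal sketch of this kind of caching, assuming a plain dictionary as the cache: `CachingStore` and its `loader` argument are hypothetical and only count how often the slow loading path would be taken.

```python
class CachingStore(object):
    """Hypothetical store that remembers every object it has loaded."""

    def __init__(self, loader):
        self._loader = loader   # function index -> object (the slow part)
        self._cache = {}
        self.disk_reads = 0

    def __getitem__(self, index):
        if index not in self._cache:
            self.disk_reads += 1
            self._cache[index] = self._loader(index)
        return self._cache[index]


store = CachingStore(lambda index: 'snapshot %d' % index)
first = store[3]
second = store[3]        # served from the cache
print(store.disk_reads)
print(first is second)
```

A side effect of this design is that loading the same index twice returns the very same Python object, which is why identity checks with `is` work in the examples that follow.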

### Searching for objects

One way to find objects is to use their name, as mentioned before. In general there are no dedicated search functions, but we can use the usual Python notation to load what we need. *List comprehensions* are the magic words.
Say we want to get all snapshots that are reversed. We could just load all of them and filter them by hand, but there is a more elegant way to do that, or let's say a more elegant way of writing it in Python, because the underlying code does just that.

In [14]:
reversed_snapshots = [snapshot for snapshot in storage.snapshots if snapshot.reversed]
print 'We found %d reversed snapshots among %d total ones' % (len(reversed_snapshots), len(storage.snapshots))

We found 58 reversed snapshots among 116 total ones


Let's do something more useful: for a TIS ensemble we want statistics on the path lengths of sampled trajectories, i.e. of the `Sample` objects that were sampled for a specific ensemble. And we only want samples that have been generated in our production runs and are present in a `SampleSet`.

> TODO: add a way to select only specific SampleSets

In [15]:
print storage.sample_sets[0]

<openpathsampling.samples.SampleSet object at 0x11211f710>


In [16]:
my_ensemble = storage.ensembles['Interface 2']
relevant_samples = [
    sample 
    for sample_set in storage.sample_set 
    for sample in sample_set 
    if sample.ensemble is my_ensemble
]
print len(relevant_samples)

14


and finally compute the average length

In [17]:
list_of_path_lengths = [
    len(sample.trajectory)
    for sample_set in storage.sample_set 
    for sample in sample_set 
    if sample.ensemble is my_ensemble
]
print list_of_path_lengths
if len(list_of_path_lengths) > 0:
    mean = float(sum(list_of_path_lengths))/len(list_of_path_lengths)
else:
    mean = 0.0 # actually, it is not defined, so we just set it to zero
print mean

[21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21]
21.0


All right, we loaded data from a bootstrapping sampling algorithm, so the analysis is pointless, but still the code is rather short considering what we just did.

### Generator expression

There is another very cool feature of Python that is worth noting: generator expressions. Before, we used list comprehensions to generate a list of everything we need, but what if we don't want the whole list at once? Maybe the list would not fit into memory, or building it is simply not desirable. We can do the same thing as above using a generator (although it would only be useful if we had to average over billions of samples). So assume the list of lengths is too large for memory. Summing works fine on one element at a time, so we construct an object that always gives us the next element. Such objects are called iterators, and there is a simple syntax to create them: instead of the square brackets of a list comprehension, use round brackets. So the above example would look like this

In [18]:
iterator_over_path_lengths = (
    len(sample.trajectory)
    for sample_set in storage.sample_set 
    for sample in sample_set 
    if sample.ensemble is my_ensemble
)
print iterator_over_path_lengths
total = float(sum(iterator_over_path_lengths))
print total

<generator object <genexpr> at 0x10fa7b1e0>
294.0


Note that we now have a generator and no computed values yet. If we iterate over it, it passes one value at a time, so we can use it in `sum` as we did before. There is one important thing to note: once an iterator has been used, it is consumed and cannot be run again, so to get the mean we need to change the code again. I assume there are other ways to do that, too.

In [19]:
iterator_over_path_lengths = (
    len(sample.trajectory)
    for sample_set in storage.sample_set 
    for sample in sample_set 
    if sample.ensemble is my_ensemble
)
total = 0
count = 0
for length in iterator_over_path_lengths:
    total += length
    count += 1
    
if count > 0:
    mean = float(total)/count
else:
    mean = 0.0 # actually, it is not defined, so we just set it to zero
print mean

21.0


Voilà, this time without computing all the lengths beforehand!
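
The single-pass loop can also be wrapped in a small helper. The function `streaming_mean` below is a hypothetical convenience written for this illustration (it is not part of openpathsampling); it returns `None` for an empty input, since the mean is undefined in that case, and the example also shows that a generator is empty after it has been consumed once.

```python
def streaming_mean(values):
    """Return the mean of an iterable in one pass, or None if it is empty."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
    if count == 0:
        return None   # the mean of nothing is undefined
    return total / count


lengths = (length for length in [21, 21, 21, 21])
print(streaming_mean(lengths))
print(streaming_mean(lengths))  # the generator is already consumed
```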

A last example that will be interesting is the statistics on acceptance. Each sample knows which mover was involved in its creation. This is stored in the `.details` attribute as `.details.mover`. Let us try to look at only forward moves.

In [20]:
ff_movers = filter(lambda self : type(self) == paths.ForwardShootMover, storage.pathmovers)

In [21]:
ff_movers

[<openpathsampling.pathmover.ForwardShootMover at 0x10fa9ecd0>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11210d810>,
 <openpathsampling.pathmover.ForwardShootMover at 0x11211f490>]

In [22]:
if len(ff_movers) > 2:
    mover = ff_movers[2]
    print "Use a '%s' for ensemble(s) '%s'" % ( mover.cls, ','.join(ens.name for ens in mover.ensembles) )

Use a 'ForwardShootMover' for ensemble(s) 'Interface 5'


I use a little trick here: what looks like a list comprehension inside a function call is actually a generator expression, and the resulting iterator is passed to the `.join` function.

Now to get statistics on acceptances

In [23]:
if len(ff_movers) > 1:
    mover = ff_movers[1]
    print "Use a '%s' for ensemble(s) '%s'" % ( mover.cls, ','.join(ens.name for ens in mover.ensembles) )
    acceptances = [
        1.0 if sample.details.accepted else 0.0
        for sample in storage.sample 
        if sample.details.mover is mover
    ]
    print acceptances 
    print 'Acceptance is about %d %%' % int(100 * float(sum(acceptances))/len(acceptances))

Use a 'ForwardShootMover' for ensemble(s) 'Interface 1'
[1.0]
Acceptance is about 100 %


In [24]:
if len(ff_movers) > 3:
    mover = ff_movers[3]
    print "Use a '%s' for ensemble(s) '%s'" % ( mover.cls, ','.join(ens.name for ens in mover.ensembles) )
    acceptances = [
        1.0 if sample.details.accepted else 0.0
        for sample in storage.sample 
        if sample.details.mover is mover
    ]
    print acceptances 
    print 'Acceptance is about %d %%' % int(100 * float(sum(acceptances))/len(acceptances))
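
The two cells above repeat the same filtering and averaging, and they would divide by zero if a mover had produced no samples. A small helper can cover both points. The sketch below uses toy data: `ToyDetails`, `ToySample` and `acceptance_ratio` are hypothetical names, and only the attribute layout (`sample.details.mover`, `sample.details.accepted`) is taken from the cells above.

```python
from collections import namedtuple

# Toy stand-ins that mimic the attribute layout used above.
ToyDetails = namedtuple('ToyDetails', ['mover', 'accepted'])
ToySample = namedtuple('ToySample', ['details'])


def acceptance_ratio(samples, mover):
    """Fraction of accepted samples for `mover`, or None without samples."""
    outcomes = [sample.details.accepted
                for sample in samples
                if sample.details.mover is mover]
    if not outcomes:
        return None
    return float(sum(outcomes)) / len(outcomes)


forward, backward, unused = object(), object(), object()
samples = [
    ToySample(ToyDetails(forward, True)),
    ToySample(ToyDetails(forward, False)),
    ToySample(ToyDetails(backward, True)),
]
print(acceptance_ratio(samples, forward))
print(acceptance_ratio(samples, unused))
```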

### Collective Variables

To be complete, you can update the storage to include the newest set of cached values. Just use

In [25]:
# my_collectivevariable.sync()

### Fun with movepaths

> This is experimental!

Okay, this is more of an idea to explain the intricacies of movepaths and how they differ from dealing with just a single mover that produces samples.

Let us start with a simple example, a `OneWayShootingMover`. Nice and convenient. It shoots either forward or backward, so its acceptance ratio is different from that of each sub-mover. But we can still ask the question: "What is the acceptance of this joined mover?" Now it gets tricky. Imagine we made an error and set up the forward and backward movers for different ensembles. Then the question we can ask is: "What is the acceptance of either shooting forward in ensemble #1 or shooting backward in ensemble #2?"
Let's make it more complicated and do two shooting moves in a row. What is our acceptance now? Is the question whether either of the two accepts a move, or both? How do we count if two samples are generated? What does it mean when a forward sample in ensemble #1 AND a backward sample in ensemble #2 are generated?

The point I want to make is that while acceptance makes sense for a sample, it does not necessarily make sense in the replica context. To be clear: we ALWAYS generate a sample in a Monte Carlo move, but it can be exactly the old one.
So what is the difference between accepting a `Shooting` and accepting a `PathMove`?

Rejection does not mean that we do not generate a sample; it means that the MC move does not move to a new point in our sample space, and so convergence will not improve. I assume that the main point is to move around in the space to be sampled. So I could propose (or look up in the literature) a few things. Here are some observations that may or may not be true:

1. When combining several moves into one, the probability should stay the same. So we should normalize according to the internal number of attempts.

2. We should check this per ensemble. So how often did a sub-move change the samples in a particular ensemble?

3. In complicated moves we might have to differentiate between likely and unlikely combinations. In a PartialAcceptance move, for example, it is unlikely that all moves pass, but some will, so the number of attempts is not fixed. Assume we attempt the same forward mover 5 times and continue if one succeeded. If we accept the first and reject the second, what is the acceptance probability? It is not 50%; it should be lower. But maybe that does not matter, since the first accepted move is relative to all other moves, and so we could say it has a speed of one change per step?

4. Can we handle moving back to the original state? For example, if we do the same RepEx move twice, does this still count as a move?

Many questions, which we can play around with.
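
To see why the counting convention matters, here is a small worked example with made-up outcomes. It is only a sketch of two possible conventions, not a statement about which one is correct or what openpathsampling computes: one counts a composite move as accepted if any of its sub-moves was accepted, the other normalizes by the number of attempted sub-moves, in the spirit of observation 1.

```python
# Each inner list holds the accept/reject outcome of the sub-moves that
# were attempted inside one composite move (toy data, made up).
composite_moves = [
    [True, False],           # first sub-move accepted, second rejected
    [False, False, False],   # nothing accepted
    [True],                  # single sub-move, accepted
]

# Convention A: a composite move counts as accepted if any sub-move was.
accepted_moves = sum(any(outcomes) for outcomes in composite_moves)
per_move = accepted_moves / float(len(composite_moves))

# Convention B: normalize by the number of attempted sub-moves.
attempts = sum(len(outcomes) for outcomes in composite_moves)
accepted = sum(sum(outcomes) for outcomes in composite_moves)
per_attempt = accepted / float(attempts)

print(per_move, per_attempt)
```

With this toy data the first convention gives two accepted moves out of three, the second two accepted attempts out of six, so the same run yields two quite different "acceptance" numbers.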

### Some more examples with the storage

In [26]:
print storage.movepath[2]

Restrict to last sample : True : 1 samples
 +- PartialAcceptanceMove : True : 3 samples
 |   +- RandomChoice :
 |   |   +- SampleMove : BackwardShootMover : True : 1 samples [<Sample @ 0x10fa9d910>]
 |   +- SampleMove : EnsembleHopMover : True : 1 samples [<Sample @ 0x11211ff90>]
 |   +- SampleMove : ReplicaIDChangeMover : True : 1 samples [<Sample @ 0x112116f10>]


Which sample uses which mover

In [27]:
print [sample.details.mover.idx[storage] for sample in storage.sample]

[0, 3, 2, 1, 6, 5, 4, 9, 8, 7, 12, 11, 10, 15, 14, 13, 16, 17, 18, 17, 16, 17, 16, 17, 18, 17, 16, 17, 16, 17, 16, 17, 16, 17, 18, 17]


How often was a HopMove accepted independent of the Ensemble?

In [28]:
results = [
    1 if sample.details.accepted else 0
    for sample in storage.sample 
    if type(sample.details.mover) == paths.EnsembleHopMover 
]
print float(sum(results))/len(results)

0.333333333333
