# Step 1: Ingest some data into a datastore

In [1]:
import pandas as pd

from pond.storage.file_datastore import FileDatastore
from pond import Activity

# The activity object is usually the only object you need to care about.
# It is used to read and write artifacts to the datastore, and it takes care of all the versioning and lineage tracking for you.
activity = Activity(
    source='010_initial_data.ipynb',  # This "source" will be used as the lineage for the artifacts that you write in this session
    datastore='./catalog',    # We use a filesystem datastore in a local directory called `catalog` (the directory needs to exist already)
    location='experiment1',   # Within the datastore, you can specify a default location (optional). It can be used to organize different 
    author='pietro',          # The author is also used in the lineage metadata
)
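
The comment on the `datastore` argument notes that the `catalog` directory has to exist before the activity is created. A minimal sketch of one way to guarantee that, using only the standard library; the helper name `ensure_catalog_dir` is hypothetical and not part of `pond`:

```python
from pathlib import Path


def ensure_catalog_dir(path="./catalog"):
    """Create the datastore directory if it is missing and return its resolved path.

    Hypothetical helper, not part of pond: it only prepares the local folder
    that is later passed as the `datastore` argument.
    """
    catalog = Path(path)
    # parents=True creates intermediate folders; exist_ok=True makes the call idempotent
    catalog.mkdir(parents=True, exist_ok=True)
    return catalog.resolve()
```

Calling `ensure_catalog_dir('./catalog')` before constructing the `Activity` should make the cell above safe to run on a fresh checkout.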

# "Load" two data frames and store them in the datastore as two different artifacts

In [2]:
condition1 = pd.DataFrame(
    data=[[3], [2], [1], [5], [3]],
    index=pd.Index([1, 3, 5, 7, 9], name='time'),
    columns=['Results'],
)

condition2 = pd.DataFrame(
    data=[[-3], [-2], [-1], [-5], [-3]],
    index=pd.Index([0, 2, 4, 6, 8], name='timf'),
    columns=['Results'],
)


In [3]:
condition1

| time | Results |
|-----:|--------:|
| 1 | 3 |
| 3 | 2 |
| 5 | 1 |
| 7 | 5 |
| 9 | 3 |


In [4]:
condition2

| timf | Results |
|-----:|--------:|
| 0 | -3 |
| 2 | -2 |
| 4 | -1 |
| 6 | -5 |
| 8 | -3 |


In [5]:
# Write the first dataframe with the name "condition1_results".
# The optional "metadata" argument can be used to add any user-defined metadata.
activity.write(condition1, name='condition1_results', metadata={'validated': True})
# Write the second dataframe with the name "condition2_results".
activity.write(condition2, name='condition2_results', metadata={'validated': False})

<pond.version.Version at 0x138edb310>

# Demonstrate write modes

`pond` supports several write modes, which control how the versioning of artifacts works.

In [6]:
from pond.conventions import WriteMode

## `ERROR_IF_EXISTS`

`ERROR_IF_EXISTS` is the default write mode. It always creates a new version of an artifact on write, so the history of your data is never lost by overwriting. If you explicitly specify a version name to write to and that version already exists, `pond` raises an error.

In [7]:
# Write the first data frame again, in `ERROR_IF_EXISTS` mode (default)
# Since no version name is specified, pond automatically creates a new version of the artifact.
version = activity.write(condition1, name='condition1_results', metadata={'validated': True}, write_mode=WriteMode.ERROR_IF_EXISTS)

In [8]:
print(version.version_name)

v2


In [9]:
# If we try to write over an older version, we get a loud complaint
version = activity.write(condition1, name='condition1_results', version_name='v1')

VersionAlreadyExists: Version already exists:  pond://catalog/experiment1/condition1_results/v1.

## `WRITE_ON_CHANGE`

With `WRITE_ON_CHANGE`, a new version is created only if the content of the data has changed since the last version. This saves some space on disk, but be careful: the metadata is overwritten, so some information might be lost! For example, if you had a plot that was created with this version of the data, looking at its metadata might show the wrong lineage if the metadata has been overwritten since.

In [10]:
version = activity.write(condition1, name='condition1_results', metadata={'validated': True}, write_mode=WriteMode.WRITE_ON_CHANGE)

In [11]:
# The data did not change, so the version remains the same
print(version.version_name)

v2


In [12]:
# If we change the data and write again....
condition1.loc[1] = 7
version = activity.write(condition1, name='condition1_results', metadata={'validated': True}, write_mode=WriteMode.WRITE_ON_CHANGE)

In [13]:
# ... the version name is bumped
print(version.version_name)

v3


## `OVERWRITE`

The `OVERWRITE` mode simply overwrites a previous version. Use it at your own risk! We never recommend it: it undermines all of your reproducibility and lineage-tracing efforts.

In [14]:
# Overwrite version 2
activity.write(condition1, name='condition1_results', metadata={'validated': True}, write_mode=WriteMode.OVERWRITE, version_name='v2')

<pond.version.Version at 0x139005390>
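
The behaviour of the three write modes can be summarized with a toy in-memory model. This is only a sketch of the semantics described in this notebook, not how `pond` is implemented: `ToyWriteMode` and `ToyStore` are hypothetical names, content is compared with plain `==`, metadata is ignored, and a built-in `FileExistsError` stands in for `VersionAlreadyExists`.

```python
from enum import Enum


class ToyWriteMode(Enum):
    """Hypothetical stand-in for the three write modes discussed above."""
    ERROR_IF_EXISTS = "error_if_exists"
    WRITE_ON_CHANGE = "write_on_change"
    OVERWRITE = "overwrite"


class ToyStore:
    """In-memory version history of a single artifact (illustration only, not pond code)."""

    def __init__(self):
        self.versions = {}  # version name -> content, kept in insertion order

    def _latest_name(self):
        # The most recently created version, or None if nothing was written yet
        return next(reversed(self.versions)) if self.versions else None

    def write(self, content, mode=ToyWriteMode.ERROR_IF_EXISTS, version_name=None):
        if mode is ToyWriteMode.WRITE_ON_CHANGE and version_name is None:
            latest = self._latest_name()
            if latest is not None and self.versions[latest] == content:
                return latest  # unchanged data: reuse the latest version
        # Without an explicit name, assume versions are named v1, v2, ...
        name = version_name or f"v{len(self.versions) + 1}"
        if name in self.versions and mode is not ToyWriteMode.OVERWRITE:
            raise FileExistsError(f"Version already exists: {name}")
        self.versions[name] = content
        return name


# Replay the sequence of writes from this notebook on plain lists
store = ToyStore()
first = store.write([3, 2, 1, 5, 3])                                       # new version
second = store.write([3, 2, 1, 5, 3])                                      # default mode: new version again
same = store.write([3, 2, 1, 5, 3], mode=ToyWriteMode.WRITE_ON_CHANGE)     # unchanged data
bumped = store.write([7, 2, 1, 5, 3], mode=ToyWriteMode.WRITE_ON_CHANGE)   # changed data
overwritten = store.write([7, 2, 1, 5, 3], mode=ToyWriteMode.OVERWRITE, version_name="v2")

print(first, second, same, bumped, overwritten)  # v1 v2 v2 v3 v2
```

The printed names follow the same pattern as the cells above: the default mode always adds a version, `WRITE_ON_CHANGE` adds one only when the data differs, and `OVERWRITE` silently replaces what was stored under `v2`.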

# What did we read / write?

If we want to know what artifacts we have read and written so far, we can interrogate the Activity object.

We expect to see several versions of the `condition1_results` artifact, since we wrote it several times while experimenting with write modes, and one version of `condition2_results`.

In [15]:
activity.read_history

set()

In [16]:
activity.write_history

{'pond://catalog/experiment1/condition1_results/v1',
 'pond://catalog/experiment1/condition1_results/v2',
 'pond://catalog/experiment1/condition1_results/v3',
 'pond://catalog/experiment1/condition2_results/v1'}
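
To get a per-artifact overview, the URIs in the write history can be grouped with plain string handling. The sketch below copies the set printed above into a literal and assumes every URI ends in `<artifact>/<version>`, as it does in this output; `group_versions` is a hypothetical helper, not a `pond` function.

```python
from collections import defaultdict

# Copied from the output of `activity.write_history` above
write_history = {
    'pond://catalog/experiment1/condition1_results/v1',
    'pond://catalog/experiment1/condition1_results/v2',
    'pond://catalog/experiment1/condition1_results/v3',
    'pond://catalog/experiment1/condition2_results/v1',
}


def group_versions(uris):
    """Map artifact name -> sorted version names, assuming URIs end in <artifact>/<version>."""
    grouped = defaultdict(list)
    for uri in uris:
        prefix, version = uri.rsplit('/', 1)      # split off the version name
        artifact = prefix.rsplit('/', 1)[1]       # last remaining segment is the artifact name
        grouped[artifact].append(version)
    # Plain string sort: fine for v1..v9, would misplace v10 and later
    return {name: sorted(versions) for name, versions in grouped.items()}


versions_by_artifact = group_versions(write_history)
print(versions_by_artifact['condition1_results'])  # ['v1', 'v2', 'v3']
print(versions_by_artifact['condition2_results'])  # ['v1']
```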