# Data Storage

Pyatoa stores data using [PyASDF ASDFDataSets](https://seismicdata.github.io/pyasdf/asdf_data_set.html), which are seismological data structures built upon the HDF5 file format. 

Datasets are hierarchical (tree-like), portable, compressible, and self-describing, meaning they contain both data and metadata. They are built around ObsPy objects, so no conversion is needed between data storage and data processing.

An `ASDFDataSet` can be passed directly to the `Manager` class. By default, gathered data and processed results will automatically be stored inside the dataset following a pre-defined naming convention. Naming schemes are set using parameters in the `Config` object. 

Below we show how data is saved throughout a workflow, and how it can be accessed using PyASDF and Pyatoa.

For a detailed tutorial on the `ASDFDataSet`, see: https://seismicdata.github.io/pyasdf/tutorial.html

In [1]:
import os
import obspy
from pyatoa import Config, Manager, logger
from pyasdf import ASDFDataSet

logger.setLevel("DEBUG")

# Load in the test data
inv = obspy.read_inventory("../tests/test_data/test_dataless_NZ_BFZ.xml")
cat = obspy.read_events("../tests/test_data/test_catalog_2018p130600.xml")
event = cat[0]
st_obs = obspy.read("../tests/test_data/test_obs_data_NZ_BFZ_2018p130600.ascii")
st_syn = obspy.read("../tests/test_data/test_syn_data_NZ_BFZ_2018p130600.ascii")

---
## Initializing 

First we must open a new `ASDFDataSet` file. We will fill it with data from the `Manager`.  
An `ASDFDataSet` can also be used as a context manager via the `with` statement, which ensures the file is closed after use.

In [2]:
# Make sure we aren't trying to write to a file that already exists.
# Only remove the file if it is present, otherwise os.remove() raises
# FileNotFoundError and no dataset is created.
ds_fid = "../tests/test_data/docs_data/test_ASDFDataSet.h5"
if os.path.exists(ds_fid):
    os.remove(ds_fid)

ds = ASDFDataSet(ds_fid)
print(ds)

We can pass the `ASDFDataSet` `ds` directly to the initialization of the `Manager` class.  
The string representation of the `Manager` class shows us that the `ASDFDataSet` has been attached, by showing the name of the dataset.

> **__NOTE__:** In Pyatoa, by convention, each event gets its own `ASDFDataSet`, and each `ASDFDataSet` should be named using a unique event identifier. This keeps files to a reasonable size and avoids the need for more complicated internal naming schemes. 
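
One way to follow the one-dataset-per-event convention is to derive the file name from the event identifier. The sketch below is only an illustration: `dataset_path` is a hypothetical helper (not part of Pyatoa or PyASDF), the `datasets` directory is an assumed location, and the identifier is taken from the test data file names used in this section.

```python
from pathlib import Path


def dataset_path(event_id: str, directory: str = ".") -> Path:
    """Return the HDF5 file path for a single event (hypothetical helper)."""
    if not event_id:
        raise ValueError("event_id must be a non-empty string")
    # One file per event, named after the unique event identifier
    return Path(directory) / f"{event_id}.h5"


# Event identifier taken from the test data file names; directory is assumed
path = dataset_path("2018p130600", "datasets")
print(path.name)  # 2018p130600.h5
```

The resulting path could then be passed to `ASDFDataSet` in place of a hard-coded file name.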

In [3]:
mgmt = Manager(ds=ds, config=Config(), inv=inv, event=event, st_obs=st_obs, st_syn=st_syn)
print(mgmt)


---
## Manually writing data

We can save the current `Manager` data using the `Manager.write()` function. 
The Pyatoa `Config` object can also be written to the `ASDFDataSet` using the `Config.write()` function.  

Once written, we see the `ASDFDataSet` has been populated with event and station metadata, waveform data, and Config information.

In [4]:
mgmt.write()
mgmt.config.write(write_to=ds)


In [5]:
ds


In [6]:
ds.events


In [7]:
ds.waveforms.list()


In [8]:
ds.auxiliary_data.Configs


---
## Automatically written data

During a Pyatoa workflow, individual functions will automatically write their outputs into the given `ASDFDataSet`.  
Here the log statements show the `Manager.window()` and `Manager.measure()` functions saving their outputs into the data set.

In [9]:
mgmt.standardize().preprocess();


In [10]:
mgmt.window();


In [11]:
mgmt.measure();


---
## Accessing saved data using PyASDF

All saved data can be accessed using `ASDFDataSet` attributes.  
For a more thorough explanation of accessing data with an `ASDFDataSet`, see: https://seismicdata.github.io/pyasdf/index.html

**Event metadata** is stored as an ObsPy `Catalog` object in the `ASDFDataSet.events` attribute.  

In [12]:
ds.events[0]


---
**Waveforms** are stored as ObsPy `Stream` objects, and **station metadata** is stored as ObsPy `Inventory` objects.  
They are stored together in the `ASDFDataSet.waveforms` attribute.  
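
Station codes elsewhere in this section are written with a dot (`NZ.BFZ`), while the attribute names under `ASDFDataSet.waveforms` use an underscore (`NZ_BFZ`). The sketch below shows the conversion for cases where the station is not known in advance. `station_attribute` is a hypothetical helper introduced here for illustration, not a Pyatoa or PyASDF function.

```python
def station_attribute(code: str) -> str:
    """Convert a 'NET.STA' station code into the 'NET_STA' attribute form."""
    # Raises ValueError if the code is not exactly 'network.station'
    network, station = code.split(".")
    return f"{network}_{station}"


name = station_attribute("NZ.BFZ")
print(name)  # NZ_BFZ

# With an open dataset `ds`, the station could then be reached dynamically:
# sta = getattr(ds.waveforms, name)
```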

In [13]:
ds.waveforms.NZ_BFZ.StationXML


In [14]:
ds.waveforms.NZ_BFZ.observed + ds.waveforms.NZ_BFZ.synthetic


-----
**Misfit windows**, **adjoint sources**, and **configuration parameters** are stored in the `ASDFDataSet.auxiliary_data` attribute.

In [15]:
ds.auxiliary_data


If no `iteration` or `step_count` attributes are provided to the `Config` object, auxiliary data will be stored using the `default` tag.

In [16]:
ds.auxiliary_data.MisfitWindows


In [17]:
ds.auxiliary_data.MisfitWindows['default']


In [18]:
ds.auxiliary_data.MisfitWindows.default.NZ_BFZ_E_0


In [19]:
ds.auxiliary_data.AdjointSources


In [20]:
ds.auxiliary_data.AdjointSources.default


In [21]:
ds.auxiliary_data.AdjointSources.default.NZ_BFZ_BXE


---
## Re-loading data using the Manager

Data previously saved into an `ASDFDataSet` can be loaded back into a `Manager` class using the `Manager.load()` function. The `load()` function will search for matching metadata, waveforms and configuration parameters, based on the `path` argument provided.

In [22]:
mgmt = Manager(ds=ds)
mgmt.load(code="NZ.BFZ", path="default")


Misfit windows and adjoint sources are not explicitly re-loaded. Windows can be loaded using optional arguments in the `Manager.window()` function.

---
## Saving data during an inversion

For each function evaluation, a new set of synthetic waveforms, misfit windows, adjoint sources and (potentially) configuration parameters is defined. Unique tags are therefore required to save and load this information reliably. 

Pyatoa uses the `Config.iteration` and `Config.step_count` attributes to define unique tags during an inversion.
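
The tags that appear in this section (`i01`, `s00`, `i01/s00`, `synthetic_i01s00`, and `default`) follow a simple zero-padded pattern. The sketch below reproduces that pattern for illustration only. `evaluation_tags` is a hypothetical helper, not Pyatoa's own implementation, and falling back to `default` whenever either value is missing is an assumption.

```python
from typing import Optional


def evaluation_tags(iteration: Optional[int] = None,
                    step_count: Optional[int] = None) -> dict:
    """Build the auxiliary data path and synthetic waveform tag (sketch)."""
    # Assumption: without both values, data falls under the 'default' tag
    if iteration is None or step_count is None:
        return {"path": "default", "synthetic_tag": "synthetic"}

    iter_tag = f"i{iteration:02d}"   # e.g. 1 -> 'i01'
    step_tag = f"s{step_count:02d}"  # e.g. 0 -> 's00'
    return {
        "path": f"{iter_tag}/{step_tag}",
        "synthetic_tag": f"synthetic_{iter_tag}{step_tag}",
    }


print(evaluation_tags(iteration=1, step_count=0))
# {'path': 'i01/s00', 'synthetic_tag': 'synthetic_i01s00'}
```

Zero-padding keeps the tags sortable, so evaluations list in the order they were run.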

In [23]:
# Set the config iteration and step_count parameters
cfg = Config(iteration=1, step_count=0)

# Remove the previously created dataset, if it exists
if os.path.exists(ds_fid):
    os.remove(ds_fid)
ds = ASDFDataSet(ds_fid)

cfg.write(write_to=ds)
mgmt = Manager(ds=ds, config=cfg, inv=inv, event=event, st_obs=st_obs, st_syn=st_syn)
mgmt.write()
mgmt.flow()

[2022-02-24 12:12:17] - pyatoa - DEBUG: Component list set to E/N/Z

The `ASDFDataSet` is now populated with appropriately tagged data, denoting which function evaluation it belongs to.

In [24]:
ds.waveforms.NZ_BFZ


In [25]:
ds.waveforms.NZ_BFZ.synthetic_i01s00


Auxiliary data will be tagged in a similar fashion, making it simple to re-access specific function evaluations.

In [26]:
ds.auxiliary_data.MisfitWindows


In [27]:
ds.auxiliary_data.MisfitWindows.i01


In [28]:
ds.auxiliary_data.MisfitWindows.i01.s00


Using the `Manager.load()` function, we can specify the unique `path` to determine which function evaluation we want to retrieve data from.
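
As a rough mental model of how a slash-separated `path` selects one function evaluation, the sketch below walks a plain nested dictionary that stands in for the `MisfitWindows` group. It does not use PyASDF or Pyatoa. `resolve`, the dictionary, and its placeholder string values are all hypothetical and only mirror the tag names shown in this section.

```python
def resolve(tree: dict, path: str):
    """Follow a slash-separated path through nested dictionaries."""
    node = tree
    for key in path.split("/"):
        node = node[key]  # raises KeyError if the evaluation is not stored
    return node


# Stand-in for the hierarchy under ds.auxiliary_data.MisfitWindows
misfit_windows = {
    "default": {"NZ_BFZ_E_0": "window from the default evaluation"},
    "i01": {"s00": {"NZ_BFZ_E_0": "window from iteration 1, step 0"}},
}

print(sorted(resolve(misfit_windows, "i01/s00")))  # ['NZ_BFZ_E_0']
```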

In [29]:
mgmt = Manager(ds=ds)
mgmt.load("NZ.BFZ", path="i01/s00", synthetic_tag="synthetic_i01s00")
mgmt.standardize().preprocess()


We can now load in previously retrieved windows from the dataset, using the `Manager.window()` function.  
Window misfit criteria will be re-evaluated using the current set of data. We can turn off automatic window saving using the optional `save` argument.

In [30]:
mgmt.window(fix_windows=True, iteration=1, step_count=0, save=False)


*easy peasy mate*