# Data gathering

Data gathering is an integral part of a seismic inversion workflow. Although the user can pass data directly to the `Manager` class, Pyatoa provides mid-level classes that automate data gathering from local directory structures, from `ASDFDataSet`s, or through queries to the International Federation of Digital Seismograph Networks (FDSN) web services using the ObsPy `Client` module.

---
## Event metadata

Event metadata can be gathered through the ObsPy FDSN web service client, which is selected with the `Config.client` parameter.

As an example we'll gather event information from the [M<sub>w</sub>7.8 Kaikoura Earthquake, New Zealand](http://ds.iris.edu/ds/nodes/dmc/tools/event/5197722).


In [None]:
from pyatoa import logger, Config, Manager
logger.setLevel("DEBUG")

In [None]:
cfg = Config(client="IRIS", event_id="5197722")
mgmt = Manager(config=cfg)
mgmt.gather(choice="event", try_fm=False)  # try_fm argument addressed next
mgmt.event

### GCMT focal mechanisms

When events are gathered through the IRIS web service, Pyatoa can also query the [Harvard GCMT moment tensor catalog](https://www.globalcmt.org/CMTsearch.html) for matching focal mechanism information. If the `try_fm` argument of the `gather` function is set to `True`, Pyatoa searches for matching moment tensor information using the `Event` origin time and magnitude.
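
To make the idea of matching on origin time and magnitude concrete, here is a minimal sketch that uses only the standard library. It is not Pyatoa's implementation: the function name `find_matching_entry`, the tolerances, and the catalog entries are all hypothetical and chosen only for illustration.

```python
from datetime import datetime


def find_matching_entry(origin_time, magnitude, catalog,
                        time_tol_s=60.0, mag_tol=0.5):
    """Return the catalog entry closest in time to the event, or None.

    Both tolerances are hypothetical defaults, not values used by Pyatoa.
    """
    candidates = [
        entry for entry in catalog
        if abs((entry["time"] - origin_time).total_seconds()) <= time_tol_s
        and abs(entry["mag"] - magnitude) <= mag_tol
    ]
    if not candidates:
        return None
    # Prefer the candidate whose time is closest to the event origin time
    return min(candidates,
               key=lambda e: abs((e["time"] - origin_time).total_seconds()))


# Made-up catalog entries, for illustration only
catalog = [
    {"name": "A", "time": datetime(2016, 11, 13, 11, 3, 10), "mag": 7.8},
    {"name": "B", "time": datetime(2016, 11, 13, 11, 32, 0), "mag": 6.5},
]
event_time = datetime(2016, 11, 13, 11, 2, 56)
match = find_matching_entry(event_time, 7.8, catalog)
print(match["name"] if match else "no match")
```

Filtering on both quantities before choosing the closest entry in time helps avoid picking up an aftershock that falls inside the time window.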

In [None]:
mgmt = Manager(config=cfg)
mgmt.gather(choice="event", try_fm=True)

In [None]:
mgmt.event

In [None]:
mgmt.event.preferred_focal_mechanism().moment_tensor

In [None]:
mgmt.event.plot();

### New Zealand event metadata from GeoNet

Pyatoa was originally designed for the New Zealand tomography problem, so functions are available for querying the [GeoNet regional moment tensor catalog](https://github.com/GeoNet/data/blob/master/moment-tensor/GeoNet_CMT_solutions.csv).

> **__NOTE__:** GeoNet moment tensors are automatically converted to GCMT convention, i.e. from XYZ to RTP (m_rr, m_tt, m_pp...) coordinates and into units of N*m.
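
The coordinate change mentioned in the note can be sketched as follows. This is an illustration rather than Pyatoa's own routine: it assumes the XYZ system is x=north, y=east, z=down and the RTP system is r=up, t=south, p=east, and it assumes for the sake of the example that the input components are in dyne*cm. The function names and the input values are made up.

```python
DYNE_CM_TO_N_M = 1e-7  # 1 dyne*cm equals 1e-7 N*m


def xyz_to_rtp(mt):
    """Rotate moment tensor components from XYZ to RTP coordinates.

    Assumes x=north, y=east, z=down and r=up, t=south, p=east, so that
    r = -z, t = -x and p = y.
    """
    return {
        "m_rr": mt["m_zz"],
        "m_tt": mt["m_xx"],
        "m_pp": mt["m_yy"],
        "m_rt": mt["m_xz"],
        "m_rp": -mt["m_yz"],
        "m_tp": -mt["m_xy"],
    }


def scale(mt, factor):
    """Multiply every component by a unit conversion factor."""
    return {key: value * factor for key, value in mt.items()}


# Made-up components, assumed here to be in dyne*cm
mt_xyz = {"m_xx": 1.0, "m_yy": 2.0, "m_zz": 3.0,
          "m_xy": 4.0, "m_xz": 5.0, "m_yz": 6.0}
mt_rtp = xyz_to_rtp(mt_xyz)
mt_rtp_nm = scale(mt_rtp, DYNE_CM_TO_N_M)
print(mt_rtp)
```

Only the two off-diagonal terms that involve exactly one sign-flipped axis pair (m_rp and m_tp) change sign; the diagonal terms are simply relabeled.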

Let's try to grab the same [M<sub>w</sub>7.8 Kaikoura Earthquake](https://www.geonet.org.nz/earthquake/2016p858000) using its unique GeoNet identifier.

In [None]:
cfg = Config(client="GEONET", event_id="2016p858000")
mgmt = Manager(config=cfg)
mgmt.gather(choice="event", try_fm=True)

In [None]:
mgmt.event

In [None]:
mgmt.event.preferred_focal_mechanism().moment_tensor

In [None]:
mgmt.event.plot();

---
## Station metadata from local file system

Station metadata can be gathered from local file systems following SEED response file naming conventions. The paths to response files can be specified in the `Config.paths['responses']` list.

### Naming convention

By default, responses are searched for using file name and directory structure templates that follow SEED formatting. This is defined as:

**Default Directory Template:** path/to/responses/SSS.NN  
**Default File ID Template:** RESP.NN.SSS.LL.CCC

* NN: The network code (e.g. NZ)  
* SSS: The station code (e.g. BFZ)  
* LL: The location code (e.g. 10)  
* CCC: The channel code (e.g. HHZ)  

An example file path for station NZ.BFZ: **path/to/responses/BFZ.NZ/RESP.NZ.BFZ.10.HHZ**
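
As a rough illustration of how the directory and file ID templates combine, the sketch below builds a response file path with plain string formatting. The template strings and the `response_path` helper are assumptions written for this example; Pyatoa's internal template syntax may differ.

```python
from pathlib import PurePosixPath

# Hypothetical format strings that mirror the templates described above
RESP_DIR_TEMPLATE = "{sta}.{net}"
RESP_FID_TEMPLATE = "RESP.{net}.{sta}.{loc}.{cha}"


def response_path(base, net, sta, loc, cha):
    """Build the expected response file path for a single channel."""
    codes = dict(net=net, sta=sta, loc=loc, cha=cha)
    return PurePosixPath(base,
                         RESP_DIR_TEMPLATE.format(**codes),
                         RESP_FID_TEMPLATE.format(**codes))


resp = response_path("path/to/responses", "NZ", "BFZ", "10", "HHZ")
print(resp)  # path/to/responses/BFZ.NZ/RESP.NZ.BFZ.10.HHZ
```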

In [None]:
cfg = Config(paths={"responses": ["../tests/test_data/test_seed"]})
mgmt = Manager(config=cfg)
mgmt.gather(code="NZ.BFZ.??.HH?", choice=["inv"]);

---
## Observed waveforms from local file system

Observed waveforms can be collected either from a local file system or through the ObsPy web service client. Waveform gathering is based on the event origin time, so an `Event` object must be present for data gathering to work properly.

### Naming convention

By default, observed waveforms are searched for using file name and directory structure templates that follow SEED formatting. This is defined as:

**Default Directory Template:** path/to/observed/YYYY/NN/SSS/CCC/  
**Default File ID Template:** NN.SSS.LL.CCC.YYYY.DDD  

* YYYY: The year with the century (e.g., 1987)  
* NN: The network code (e.g. NZ)  
* SSS: The station code (e.g. BFZ)  
* LL: The location code (e.g. 10)  
* CCC: The channel code (e.g. HHZ.D)  
* DDD: The Julian day of the year (January 1 is 001)

An example file path for station NZ.BFZ, for the day 2018-02-18: **path/to/observed/2018/NZ/BFZ/HHZ/NZ.BFZ.10.HHZ.D.2018.049**
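
The sketch below shows how such a path can be derived from a date, including the Julian day. It is an illustration, not Pyatoa's code: the `observed_path` helper and its `suffix` argument are hypothetical, and the `.D` suffix is split out from the channel code only so that the result reproduces the example path above.

```python
from datetime import datetime
from pathlib import PurePosixPath


def observed_path(base, net, sta, loc, cha, day, suffix="D"):
    """Build the expected waveform file path for one channel and day."""
    year = day.year
    jday = day.timetuple().tm_yday  # Julian day, where January 1 is 1
    directory = PurePosixPath(base, str(year), net, sta, cha)
    filename = f"{net}.{sta}.{loc}.{cha}.{suffix}.{year}.{jday:03d}"
    return directory / filename


obs = observed_path("path/to/observed", "NZ", "BFZ", "10", "HHZ",
                    datetime(2018, 2, 18))
print(obs)  # path/to/observed/2018/NZ/BFZ/HHZ/NZ.BFZ.10.HHZ.D.2018.049
```

February 18 is day 31 + 18 = 49 of the year, which is zero-padded to three digits in the file name.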

User-defined paths can be provided to the `Config.paths` attribute, which takes the form of a dictionary of lists. Multiple paths can be passed to each list, and data gathering routines will search each path in order until relevant data is found. 
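
The ordered search over multiple paths can be pictured with the following standard-library sketch. The `find_first` helper is hypothetical and only illustrates the "first path that contains the file wins" behaviour; it uses a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def find_first(paths, relative_fid):
    """Return the first existing file among the search paths, or None."""
    for base in paths:
        candidate = Path(base) / relative_fid
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    real_dir = Path(tmp) / "waveforms"
    real_dir.mkdir()
    (real_dir / "NZ.BFZ.10.HHZ.D.2018.049").write_text("placeholder")

    # The first path does not exist, so the search falls through to the second
    search_paths = [Path(tmp) / "dummy_path", real_dir]
    found = find_first(search_paths, "NZ.BFZ.10.HHZ.D.2018.049")
    found_in = found.parent.name if found else None
    missing = find_first(search_paths, "NZ.BFZ.10.HHN.D.2018.049")

print(found_in, missing)
```

A path that does not exist is simply skipped, which is why a dummy entry in the list does no harm.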

### Eketahuna example

Here we gather event metadata for the [M6.2 Eketahuna earthquake, New Zealand](https://www.geonet.org.nz/earthquake/2014p051675) and use its origin time to gather observed waveforms from the test data directory, for which a matching directory structure has already been created. We add a dummy path to show how multiple paths can be passed to the `paths` attribute. The logger output shows the location of the waveforms found, which matches the example path shown above.

In [None]:
cfg = Config(event_id="2018p130600", client="GEONET", paths={"waveforms": ["./dummy_path", "../tests/test_data/test_mseeds"]})
cfg.paths

In [None]:
mgmt = Manager(config=cfg)
mgmt.gather(code="NZ.BFZ.??.HH?", choice=["event", "st_obs"]);

---
## Waveforms and station metadata from FDSN

Observed waveforms and station metadata may also be fetched from FDSN web services using the [ObsPy Client module](https://docs.obspy.org/packages/obspy.clients.fdsn.html). If paths are provided to the `Config` class, local file systems are searched first; if no matching waveforms or metadata are found, gathering falls back to querying FDSN. Let's gather the same waveform data from the Eketahuna example.

In [None]:
cfg = Config(event_id="2018p130600", client="GEONET")
mgmt = Manager(config=cfg)
mgmt.gather(code="NZ.BFZ.??.HH?", choice=["event", "inv", "st_obs"]);

---
## Gathering synthetic waveforms

Pyatoa was designed around SPECFEM3D Cartesian, and so synthetic waveforms are expected in the ASCII outputs of SPECFEM. Synthetic waveforms can only be gathered from a local file system and are searched for using the `Config.paths['synthetics']` list.

Synthetic data will be read in as an ObsPy `Stream` object. Since SPECFEM ASCII files have no header information, an `Event` attribute is required to set the origin time of the synthetic data.

### Naming convention

The naming convention by default is set by ASCII output files of SPECFEM.

**Default File ID Template:** NN.SSS.CCC.EEEE

* NN: The network code (e.g. NZ)  
* SSS: The station code (e.g. BFZ)  
* CCC: The channel code, where the instrument code (second letter) is always 'X', to denote generated data, as per SEED convention (e.g. BXZ)  
* EEEE: The SEM extension which denotes the units of the synthetics. Usually something like 'semd', where 'd' stands for displacement. 

An example file path for station NZ.BFZ: **path/to/synthetics/NZ.BFZ.BXZ.semd**
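
To illustrate why an origin time is needed, the sketch below parses a two-column (time, amplitude) ASCII synthetic together with its file name. It is not Pyatoa's reader: the `parse_synthetic` helper is hypothetical, the sample values and origin time are made up, and it assumes the time column is given in seconds relative to the event origin time.

```python
from datetime import datetime, timedelta


def parse_synthetic(filename, text, origin_time):
    """Parse a two-column ASCII synthetic into a dictionary.

    Assumes the first column is time in seconds relative to the origin
    time and that the samples are evenly spaced.
    """
    net, sta, cha, ext = filename.split(".")
    times, data = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        t, amplitude = line.split()
        times.append(float(t))
        data.append(float(amplitude))
    return {
        "network": net,
        "station": sta,
        "channel": cha,
        "extension": ext,
        # The file has no header, so absolute time comes from the event
        "starttime": origin_time + timedelta(seconds=times[0]),
        "delta": times[1] - times[0],
        "data": data,
    }


# Made-up samples and origin time, for illustration only
sample = "-1.0 0.0\n-0.5 0.1\n0.0 0.2\n"
trace = parse_synthetic("NZ.BFZ.BXZ.semd", sample,
                        origin_time=datetime(2018, 2, 18, 0, 0, 0))
print(trace["starttime"], trace["delta"])
```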

> **__NOTE__:** An optional `syn_dir_template` can be passed to the `gather` function to prepend additional paths, e.g. if many synthetics have been generated and grouped by event.

In [None]:
mgmt.config.paths["synthetics"].append("../tests/test_data/")
mgmt.gather(code="NZ.BFZ.??.BX?", choice=["st_syn"], syn_dir_template="synthetics")

---
## One-time mass data gathering

It may be useful to do a one-time mass data gathering prior to a seismic inversion, for example to assess how many stations recorded a given event or which stations show good data quality. Pyatoa provides a multithreaded data gathering scheme to set up the ASDFDataSets that will be used in a future seismic inversion.

We need a few prerequisite pieces of data: 
* Event origin time
* ASDFDataSet
* Station codes for desired data

In [None]:
from pyasdf import ASDFDataSet

event_id = "2016p858000"
ds = ASDFDataSet(f"../tests/test_data/{event_id}.h5")
cfg = Config(client="GEONET", event_id=event_id)

mgmt = Manager(config=cfg, ds=ds)
mgmt.gather(choice="event", try_fm=False)

Now we can gather data en masse using the desired station codes. Wildcards are accepted and passed into the ObsPy web service client query. Gathered data will be saved to the ASDFDataSet in the Pyatoa format, which can be used for subsequent inversion efforts. The multithreaded process tells the user how many pieces of information were retrieved for each station, in this case 1 dataless file and 3 waveforms, 1 per component.
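
The general pattern behind a threaded gather can be sketched with the standard library, as below. This is not how Pyatoa implements it: `fetch_station` is a hypothetical stand-in that performs no web service query, and the counts it returns are placeholders that mirror the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_station(code):
    """Hypothetical stand-in for the per-station web service queries."""
    # A real implementation would request metadata and waveforms here;
    # the counts below are placeholders
    return {"code": code, "dataless": 1, "waveforms": 3}


def gather_threaded(codes, max_workers=4):
    """Run one fetch per station code in a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input codes
        return list(pool.map(fetch_station, codes))


codes = ["NZ.BFZ.??.HH?", "NZ.KNZ.??.HH?", "NZ.PUZ.??.HH?", "NZ.WEL.??.HH?"]
results = gather_threaded(codes)
for result in results:
    print(f"{result['code']}: {result['dataless']} dataless, "
          f"{result['waveforms']} waveforms")
```

Threads suit this kind of task because each request spends most of its time waiting on the network rather than computing.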

In [None]:
import warnings

station_codes = ["NZ.BFZ.??.HH?", "NZ.KNZ.??.HH?", "NZ.PUZ.??.HH?", "NZ.WEL.??.HH?"]

# We will ignore the UserWarning regarding ObsPy read versions
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    mgmt.gatherer.gather_obs_threaded(station_codes)

In [None]:
print(ds.waveforms.list())
ds.waveforms.NZ_BFZ

---
## From an ASDFDataSet

Once stored in an ASDFDataSet, data can be re-retrieved using the gather function. ASDFDataSet retrieval is prioritized above local file system recovery. See the 'Data Storage' section for some examples of reading/writing data from ASDFDataSets.
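
The retrieval priority described in this document (ASDFDataSet first, then local file system, then FDSN) can be pictured as a simple fallback chain. The sketch below is illustrative only: `gather_with_priority` and the lambda sources are hypothetical and do not call Pyatoa, pyasdf, or ObsPy.

```python
def gather_with_priority(sources):
    """Try each (name, fetch) pair in order; return the first non-None result."""
    for name, fetch in sources:
        data = fetch()
        if data is not None:
            return name, data
    return None, None


# Hypothetical sources: here only the web service has the data
sources = [
    ("dataset", lambda: None),
    ("local", lambda: None),
    ("fdsn", lambda: "waveforms"),
]
origin, data = gather_with_priority(sources)
print(origin)  # fdsn

# Once the data is stored in the dataset, later sources are never queried
stored = [("dataset", lambda: "waveforms")] + sources[1:]
origin_stored, _ = gather_with_priority(stored)
print(origin_stored)  # dataset
```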