# IRIS Web Services Data Quality Metrics Exercise
*:auth: Nate Stevens (Pacific Northwest Seismic Network)*

In this notebook we'll query data quality metrics from the MUSTANG measurements webservice  
and the FDSNWS availability webservice provided by EarthScope/SAGE to get a sense of data availability and usefulness BEFORE downloading a ton of data!  

What is MUSTANG? - A continually growing data quality statistics dataset  
for every seismic station stored on the Data Management Center!  

What does MUSTANG stand for? - The **M**odular **U**tility for **STA**tistical k**N**owledge **G**athering system  

Where can I go to learn more about MUSTANG? 
https://service.iris.edu/mustang/  

Dependencies for this Notebook:   
 - `ObsPy`  
 - `Pyrocko` (and potentially `PyQt5`)  
 - `ws_client` (`ws_client.py`)  


In [None]:
## IMPORT MODULES
import pandas as pd
from obspy import UTCDateTime
from obspy.clients.fdsn import Client
# Tools for data visualization
import matplotlib.pyplot as plt
from pyrocko import obspy_compat

# Custom-Built Clients for fetching data quality measurements from IRIS web services
from ws_client import MustangClient, AvailabilityClient



## Composing a MUSTANG query  
The `MustangClient` class's `measurements_request` method follows the `key=value` syntax of the MUSTANG measurements service interface  
(https://service.iris.edu/mustang/measurements/1/)  
where multiple values can be provided as a comma-delimited string.

This version of the `MustangClient` can also parse lists of metric names (see below).  

The full list of MUSTANG metrics and detailed descriptions of their meaning can be found at the link above.  
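
To make the `key=value` syntax concrete, here is a minimal sketch of how a list of metric names could be turned into a comma-delimited query URL. The helper `compose_mustang_url` is hypothetical and is not part of `ws_client`; the `/query` endpoint and the `format=text` option are assumptions based on the service documentation linked above, so check them there before relying on this.

```python
from urllib.parse import urlencode

# Assumed query endpoint of the MUSTANG measurements service
BASE_URL = "https://service.iris.edu/mustang/measurements/1/query"


def compose_mustang_url(metric, **params):
    """Hypothetical helper: build a measurements query URL (no request is sent)."""
    # Accept either a comma-delimited string or a list of metric names
    if not isinstance(metric, str):
        metric = ",".join(metric)
    fields = {"metric": metric, **params}
    # Leave commas and wildcards unescaped so the URL stays readable
    return f"{BASE_URL}?{urlencode(fields, safe=',*')}"


url = compose_mustang_url(
    ["sample_min", "num_gaps"], net="UW", sta="MBW", loc="*", cha="EHZ", format="text"
)
print(url)
```

Pasting the printed URL into a browser is a quick way to see what the service returns before wrapping it in a client.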

The metrics we'll use in this exercise are:  

 - *`sample_min`*: the minimum sample value observed in a 24 hour period  

 - *`max_range`*: the maximum range between any two samples in a 5 minute window within a 24 hour period  

 - *`percent_availability`*: the percent of a 24 hour period for which there are data  

 - *`sample_unique`*: number of unique sample values reported in a 24 hour window  

 - *`num_gaps`*: number of data gaps encountered within a 24 hour window
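
As a rough illustration of how these definitions can be combined, the sketch below flags days that look incomplete or flat-lined. The table is synthetic (not real UW.MBW values) and the thresholds are illustrative assumptions, not recommended values; `sample_min` is left out to keep the example short.

```python
import pandas as pd

# Synthetic daily metrics for illustration only (not real UW.MBW values)
df = pd.DataFrame(
    {
        "percent_availability": [100.0, 100.0, 62.5, 100.0],
        "num_gaps": [0, 0, 14, 0],
        "sample_unique": [812, 790, 455, 3],
        "max_range": [1500, 1420, 1610, 2],
    },
    index=pd.to_datetime(["2023-02-18", "2023-02-19", "2023-02-20", "2023-02-21"]),
)

# Illustrative thresholds; tune them for your own station and instrument
incomplete = (df["percent_availability"] < 100) | (df["num_gaps"] > 0)
flatlined = (df["sample_unique"] < 10) | (df["max_range"] < 10)

# A day is suspect if data are missing or the signal barely changes
df["suspect"] = incomplete | flatlined
print(df[df["suspect"]])
```

In this toy table the third day is flagged for missing data and the fourth because the signal barely varies, which is the kind of pattern to look for in the real metrics.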


The seismic station we're looking at is UW.MBW.01.EHZ, one of the longest-running stations in the PNSN. It  
monitored Mount Baker volcano until late 2023, when it was replaced by UW.MBW2.  

### UW.MBW was having some issues towards the end of its life. Can you find spots where the data might not be as useful?

In [None]:
# Initialize the client
mclient = MustangClient()
# Compose a query for MUSTANG metrics for an analog seismometer near Mount Baker (Washington, USA)
metric = ['sample_min', 'max_range', 'percent_availability', 'sample_unique', 'num_gaps']
query = {'metric': metric,
         'net': 'UW',
         'sta': 'MBW',
         'loc': '*',
         'cha': 'EHZ'}
# Run query
df_m = mclient.measurements_request(**query)

In [None]:
# What do we see?
display(df_m)

fig = plt.figure(figsize=(8,12))
gs = fig.add_gridspec(nrows=len(query['metric']), hspace=0)

for _e, _m in enumerate(query['metric']):
    ax = fig.add_subplot(gs[_e])
    ax.plot(df_m.index.get_level_values(0), df_m[_m].values, '.', label=_m)
    ax.set_ylabel(_m)
    ax.grid()


### Lots of gaps
Trying to bulk download gappy data from webservices can result in the entire request crashing.

If we compose our requests using information on data availability (and on which data seem meaningful), this job becomes easier.

Thankfully data availability is already documented by the NSF SAGE Facility FDSN Web Service!

## Running an FDSN Web Service query  

Use the custom-built `AvailabilityClient` class that follows the syntax of the  
related webservice: https://service.iris.edu/fdsnws/availability/1/  
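
For orientation, the sketch below shows one way a text-format availability response could be parsed into a DataFrame with timezone-aware timestamps. The response text is made up, and its whitespace-delimited layout with a `#Network ...` header row is an assumption about the service's text output; this is not necessarily how `AvailabilityClient` does it.

```python
import io

import pandas as pd

# Made-up response text in the assumed whitespace-delimited layout
sample_text = """\
#Network Station Location Channel Quality SampleRate Earliest Latest
UW MBW 01 EHZ M 100.0 2023-02-19T00:00:00.000000Z 2023-02-19T11:59:59.990000Z
UW MBW 01 EHZ M 100.0 2023-02-19T12:05:00.000000Z 2023-02-20T00:00:00.000000Z
"""

# Read location codes as strings so a code like "01" keeps its leading zero
df = pd.read_csv(io.StringIO(sample_text), sep=r"\s+", dtype={"Location": str})
df = df.rename(columns={"#Network": "Network"})

# Convert the segment bounds to timezone-aware pandas Timestamps
for col in ("Earliest", "Latest"):
    df[col] = pd.to_datetime(df[col], utc=True)

print(df.dtypes)
```

Keeping `Earliest` and `Latest` timezone-aware matters later, because comparisons against `pd.Timestamp(..., tz='UTC')` fail on naive timestamps.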

For this example we'll keep looking at station UW.MBW.

In [None]:
# Initialize the client
aclient = AvailabilityClient()
# Run a data availability request for everything UW.MBW.*.EHZ has to offer
df_a = aclient.availability_request(sta='MBW',net='UW',cha='EHZ')

In [None]:
# Take a look at the availability table
display(df_a)

### What is going on with the sampling rates?

In [None]:
_series = pd.Series(df_a.SampleRate.values, index=df_a.Earliest.values)
ax = _series.plot()
ax.set_ylabel('Sampling Rate [sps]')
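
A plot shows that the rate changes; a short table shows when. The sketch below lists the segments whose sampling rate differs from the segment before them. It runs on synthetic rows (not real UW.MBW values) that reuse the `SampleRate` and `Earliest` column names from the availability table.

```python
import pandas as pd

# Synthetic availability rows for illustration (not real UW.MBW values)
df = pd.DataFrame(
    {
        "SampleRate": [100.0, 100.0, 50.0, 50.0, 100.0],
        "Earliest": pd.to_datetime(
            ["2020-01-01", "2020-06-01", "2021-01-01", "2021-06-01", "2022-01-01"],
            utc=True,
        ),
    }
)

df = df.sort_values("Earliest").reset_index(drop=True)
# True wherever the rate differs from the previous segment's rate
changed = df["SampleRate"].ne(df["SampleRate"].shift())
changed.iloc[0] = False  # the first segment has nothing to compare against
print(df.loc[changed, ["Earliest", "SampleRate"]])
```

The same three lines after the synthetic table should work on `df_a` directly, assuming its columns are named as shown in the displayed output.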

## Now let's finally look at some data!

In [None]:
# IYKYK, or you're about to find out!
# (plant() adds Pyrocko methods such as snuffle() to ObsPy objects)
obspy_compat.plant()

In [None]:
# Get an obspy client for fetching waveforms
wclient = Client('IRIS')

In [None]:
# Subset available segments to a time window (currently uses pandas Timestamp objects)
t0 = pd.Timestamp('2023-02-19', tz='UTC')
t1 = pd.Timestamp('2023-02-21', tz='UTC')
# Keep only segments that fall entirely inside [t0, t1]
_df_a = df_a[(df_a.Earliest >= t0) & (df_a.Latest <= t1)]
display(_df_a)

In [None]:
# Compose a bulk request
bulk = []
for _, row in _df_a.iterrows():
    # Switch pandas Timestamp objects back to UTCDateTime objects for requests
    req = (row.Network, row.Station, row.Location, row.Channel, UTCDateTime(row.Earliest.timestamp()), UTCDateTime(row.Latest.timestamp()))
    bulk.append(req)

In [None]:
# Run bulk request
st = wclient.get_waveforms_bulk(bulk)
st.plot()

## It looks like we have continuous data...
### Are they continuous?
### Where do they stop being useful?
### What does this mean for your workflow?
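
One hedged way to start on the continuity question is to compare each availability segment's start time with the end of the segment before it. The sketch below does this on synthetic segments (not real UW.MBW values); the one-second tolerance is an illustrative assumption.

```python
import pandas as pd

# Synthetic segments for illustration (not real UW.MBW values)
df = pd.DataFrame(
    {
        "Earliest": pd.to_datetime(
            ["2023-02-19T00:00:00", "2023-02-19T12:05:00", "2023-02-20T06:00:00"],
            utc=True,
        ),
        "Latest": pd.to_datetime(
            ["2023-02-19T12:00:00", "2023-02-20T06:00:00", "2023-02-21T00:00:00"],
            utc=True,
        ),
    }
)

df = df.sort_values("Earliest").reset_index(drop=True)
# Time between the end of one segment and the start of the next
df["gap_before"] = df["Earliest"] - df["Latest"].shift()

tolerance = pd.Timedelta(seconds=1)  # illustrative threshold
gaps = df[df["gap_before"] > tolerance]
print(gaps[["Earliest", "gap_before"]])
```

On the downloaded stream itself, ObsPy's `st.print_gaps()` reports gaps and overlaps between traces, which is a useful cross-check against the availability table.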

## Let's take an interactive look at our waveform data

In [None]:
# Open the stream in Pyrocko's Snuffler viewer (needs a GUI session)
(exit_code, snuffler_pile) = st.snuffle()