# [WIP] Quality control - water level timeseries

Show how to quality control a water level timeseries using CoTeDe

WIP:
- Missing local noise test

## Objective:
Show how to use CoTeDe to quality control timeseries of water level records.

## Notes:
- This and other notebooks on quality control are available at https://cotede.castelao.net in /docs/notebooks/. There you can run the notebooks without installing anything on your machine.

In [None]:
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, CustomJS, Slider
from bokeh.plotting import figure
import pandas as pd
from scipy import stats

import cotede
from cotede import datasets, qctests

In [None]:
print("CoTeDe version: {}".format(cotede.__version__))

In [None]:
output_notebook()

## Data
We'll use a water level sample dataset from CoTeDe for this tutorial. This is the water level station: 8764227 LAWMA, Amerada Pass, LA, operated by NOAA / NOS / CO-OPS, and kindly provided by Lindsay Abrams. If curious about this dataset, check CoTeDe's documentation for more details and credits.

Fortunately, this data was already flagged by NOAA personnel, so let's take that as our ground truth and use it as a reference to verify whether we are doing a good job. But keep in mind that the idea is to pretend that we are analysing a raw dataset, i.e. we wouldn't know the answer a priori.

Let's load the data and check which variables are available.

In [None]:
data = cotede.datasets.load_water_level()

print("The variables are: ", sorted(data.keys()))
print("There is a total of {} observations.".format(len(data["epoch"])))

This data was previously quality controlled. Let's use that as our indexes of good and bad data to verify what we should be identifying.

In [None]:
idx_good = ~data["flagged"]
idx_bad = data["flagged"]

In [None]:
# A time series with the data
# x_axis_type='datetime'
p = figure(plot_width=750, plot_height=300, title="Water Level")
p.circle(data['epoch'][idx_good], data["water_level"][idx_good], size=8, line_color="orange", fill_color="orange", fill_alpha=0.5, legend_label="Good values")
p.triangle(data["epoch"][idx_bad], data["water_level"][idx_bad], size=12, line_color="red", fill_color="red", fill_alpha=0.8, legend_label="Bad values")

p.legend.location = "top_left"
show(p)

### Describing the Data
Based on the manual flagging, let's check the distribution of the good and bad data.

In [None]:
hist_good, edges_good = np.histogram(data["water_level"][idx_good], density=False, bins=50)
hist_bad, edges_bad = np.histogram(data["water_level"][idx_bad], density=False, bins=50)

p = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa", title="Data distribution")
p.quad(top=hist_good, bottom=0, left=edges_good[:-1], right=edges_good[1:],
           fill_color="green", line_color="white", alpha=0.5, legend_label="Good data")
p.quad(top=hist_bad, bottom=0, left=edges_bad[:-1], right=edges_bad[1:],
           fill_color="red", line_color="white", alpha=0.5, legend_label="Bad data")
show(p)

A large fraction of the bad data is clearly distinct from the typical values of good measurements, with magnitudes higher than 14.
Such a maximum is typically associated with the highest value a sensor can report, or with an unfeasible value used to indicate a "missing value", also called "Fill Value", but I don't know if this is the case for this dataset.
Raw measurements commonly include unfeasible values, and since this is probably the easiest error to identify, we shall address it right away.
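
One quick way to look for a possible fill value is to check whether the maximum repeats exactly several times, which is unlikely for a physical signal. The sketch below uses a small synthetic array, and the sentinel `99.0` is a hypothetical value chosen for illustration, not something taken from this dataset.

```python
import numpy as np

# Synthetic record: plausible levels plus a repeated sentinel.
# The value 99.0 is hypothetical, chosen only for illustration.
water_level = np.array([8.1, 8.3, 99.0, 8.2, 99.0, 8.4, 99.0])

# np.unique returns the sorted unique values and how often each occurs.
values, counts = np.unique(water_level, return_counts=True)
max_value = values[-1]
max_count = counts[-1]

# An exact maximum that repeats is a hint (not proof) of a fill value.
print(f"Maximum {max_value} occurs {max_count} times")
```

A repeated maximum is only a hint, so someone who knows the sensor should confirm it.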

Someone with experience with these sensors and this station should be able to suggest a limit for possible events.
This limit should be forgiving, since we usually don't want to risk flagging good values as bad ones.
For this tutorial, let's guess that 12 is the limit and anything higher than 12 wouldn't be feasible in normal conditions for this station. If you're not happy with the idea of this arbitrary choice, check the notebook of Anomaly Detection with sea level for a probabilistic criterion.

This QC check based on feasible values is traditionally called "Global Range" check.

## Global Range: Check for Feasible Values
Let's assume that the sea level at this station can be as low as 6 and as high as 12, even considering extreme conditions like a storm event.
At this point we don't want to eliminate good data by mistake.

In [None]:
idx_valid = (data["water_level"] > 6) & (data["water_level"] < 12)

p = figure(plot_width=750, plot_height=300, title="Water Level")
p.circle(data['epoch'][idx_valid], data["water_level"][idx_valid], size=8, line_color="orange", fill_color="orange", fill_alpha=0.5, legend_label="Good values")
p.triangle(data["epoch"][~idx_valid], data["water_level"][~idx_valid], size=12, line_color="red", fill_color="red", fill_alpha=0.8, legend_label="Bad values")

p.legend.location = "top_left"
show(p)

Great, we already identified a significant number of bad measurements.
The global range test is simple and cheap, so there is no reason not to always apply it in normal conditions, but it is usually not enough.
We will need to apply other tests to capture more bad measurements.

Several QC tests are already implemented in CoTeDe, so you don't need to code them again.
For instance, the global range test is available as `qctests.GlobalRange`, and we can use it as follows:

In [None]:
qc_global_range = qctests.GlobalRange(data, "water_level", cfg={"minval": 6, "maxval": 12})
qc_global_range.flags
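
To make the flag output more concrete, here is a minimal NumPy sketch of a global range check. It is not CoTeDe's implementation. The flag values (1 for good, 4 for bad, 9 for missing), the inclusive bounds, and the function name `global_range_flags` are assumptions made for this illustration.

```python
import numpy as np

# Assumed flag convention for this sketch: 1 good, 4 bad, 9 missing.
FLAG_GOOD, FLAG_BAD, FLAG_MISSING = 1, 4, 9


def global_range_flags(x, minval, maxval):
    """Flag each value by whether it lies inside [minval, maxval]."""
    x = np.asarray(x, dtype=float)
    # Start by assuming everything is missing, then fill in the rest.
    flags = np.full(x.shape, FLAG_MISSING, dtype="i1")
    valid = np.isfinite(x)
    in_range = valid & (x >= minval) & (x <= maxval)
    flags[in_range] = FLAG_GOOD
    flags[valid & ~in_range] = FLAG_BAD
    return flags


example_flags = global_range_flags([7.5, 14.2, np.nan, 5.0], minval=6, maxval=12)
print(example_flags)
```

Values outside the limits on either side get the bad flag, while a NaN is reported as missing rather than bad.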

The Global Range is trivial to implement, but other checks are more complex, and CoTeDe provides implementations of those as well.
For instance, let's consider another traditional procedure, the Spike check.

## Spike
The spike check is a quite traditional one and is based on the principle of comparing one measurement with the tendency observed from the neighbor values.
We could implement it as follows:

In [None]:
def spike(x):
    """Spike check as defined by GTSPP
    """
    y = np.nan * x
    y[1:-1] = np.abs(x[1:-1] - (x[:-2] + x[2:]) / 2.0) - np.abs((x[2:] - x[:-2]) / 2.0)
    return y

This is already implemented in CoTeDe as `qctests.spike`, and we could use it like:

In [None]:
sea_level_spike = qctests.spike(data["water_level"])

print("The largest spike observed was: {:.3f}".format(np.nanmax(np.abs(sea_level_spike))))
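
To build some intuition for these magnitudes, the sketch below applies the same formula as the `spike()` function defined earlier to two tiny synthetic series: one with an isolated outlier and one with a steady ramp. The series and the variable names are made up for this example.

```python
import numpy as np


def spike(x):
    """Same formula as the spike() defined earlier in this notebook."""
    x = np.asarray(x, dtype=float)
    y = np.full(x.shape, np.nan)
    y[1:-1] = (
        np.abs(x[1:-1] - (x[:-2] + x[2:]) / 2.0)
        - np.abs((x[2:] - x[:-2]) / 2.0)
    )
    return y


# An isolated outlier stands out from the mean of its two neighbors.
single_outlier = spike([1.0, 1.0, 5.0, 1.0, 1.0])
# A steady trend is discounted by the second term of the formula.
steady_ramp = spike([1.0, 2.0, 3.0, 4.0, 5.0])

print(single_outlier)
print(steady_ramp)
```

The outlier gets a large positive value, while the ramp yields negative values, so a consistent rise or fall in water level is not mistaken for a spike.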

The traditional way to use the spike check is to compare the "spikeness magnitude" with a threshold.
A measurement with a magnitude larger than that limit is considered bad.
Similar to the global range check, we could hence use `spike()` and compare its output with acceptable limits.
This procedure is already available in CoTeDe as `qctests.Spike`, and we can use it as follows:

In [None]:
y_spike = qctests.Spike(data, "water_level", cfg={"threshold": 2.0})
y_spike.flags

Like the Global Range, it provides the quality flags obtained from this procedure.
Note that the standard flagging follows the IOC recommendation (to customize the flags, check the manual), thus 1 means good data while 0 is no QC applied.
The spike check is based on the previous and following measurements, thus it can't evaluate the first or the last values, returning flag 0 for those two measurements.
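
The mapping from spike magnitude to flags can be sketched as below. This is an illustration, not CoTeDe's code: the bad flag value of 4 and the use of the absolute magnitude in the comparison are assumptions, and the input is a hand-written feature array rather than real data.

```python
import numpy as np

# Assumed flag convention for this sketch: 0 not evaluated, 1 good, 4 bad.
FLAG_NOT_EVALUATED, FLAG_GOOD, FLAG_BAD = 0, 1, 4

# Hand-written spike magnitudes; the ends are NaN because they have
# only one neighbor and therefore cannot be evaluated.
spike_feature = np.array([np.nan, 0.0, 4.0, 0.0, np.nan])
threshold = 2.0

flags = np.full(spike_feature.shape, FLAG_NOT_EVALUATED, dtype="i1")
evaluated = np.isfinite(spike_feature)
flags[evaluated & (np.abs(spike_feature) <= threshold)] = FLAG_GOOD
flags[evaluated & (np.abs(spike_feature) > threshold)] = FLAG_BAD

print(flags)
```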

Some procedures provide more than just the flags, but also include features derived from the original measurements.
For instance, if one was interested in the "spike intensity" of one measurement, that could be inspected as:

In [None]:
y_spike.features

## Multiple tests
QC checks are usually focused on specific characteristics of bad measurements, thus to cover a wider range of issues we typically combine a set of checks.
Let's apply the Gradient check:

In [None]:
y_gradient = qctests.Gradient(data, "water_level", cfg={"threshold": 10})
y_gradient.flags
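
For reference, the gradient feature is commonly defined as the distance between a measurement and the mean of its two neighbors, which is the first term of the spike formula. The sketch below assumes that definition. It is written for illustration and may differ in detail from what `qctests.Gradient` computes internally.

```python
import numpy as np


def gradient(x):
    """Distance of each point from the mean of its two neighbors."""
    x = np.asarray(x, dtype=float)
    y = np.full(x.shape, np.nan)
    y[1:-1] = np.abs(x[1:-1] - (x[:-2] + x[2:]) / 2.0)
    return y


gradient_feature = gradient([1.0, 1.0, 5.0, 1.0, 1.0])
print(gradient_feature)
```

Unlike the spike check, the neighbors of the outlier also get elevated values here, which is one reason the two checks usually need different thresholds.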

In [None]:
y_tukey53H = qctests.Tukey53H(data, "water_level", cfg={"threshold": 2.0})
y_tukey53H.flags

In [None]:
cfg = {
    "water_level": {
        "global_range": {"minval": 6, "maxval": 12},
        "gradient": {"threshold": 10.0},
        "spike": {"threshold": 2.0},
        "tukey53H": {"threshold": 1.5},
        "local_noise": {"threshold": 0.2},
        # "rate_of_change": {"threshold": 0.5}
    }
}

In [None]:
qc = cotede.TimeSeriesQC(data, cfg=cfg)

In [None]:
qc.flags.keys()

In [None]:
qc.flags["water_level"]

In [None]:
qc_good = qc.flags["water_level"]["overall"]
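
One way to think about the overall flag is as the worst outcome among the individual tests. The sketch below assumes the overall flag is the element-wise maximum of the per-test flags, with higher numbers meaning worse quality. The flag arrays are made up for illustration.

```python
import numpy as np

# Made-up per-test flags for four measurements
# (0 = not evaluated, 1 = good, 4 = bad).
flags_by_test = {
    "global_range": np.array([1, 1, 4, 1], dtype="i1"),
    "spike": np.array([0, 1, 1, 4], dtype="i1"),
    "gradient": np.array([0, 1, 1, 1], dtype="i1"),
}

# Assumption: the overall flag is the highest flag any test assigned.
overall = np.max(np.stack(list(flags_by_test.values())), axis=0)
print(overall)
```

Under this assumption, a single failed test is enough to mark a measurement as bad, and a measurement that some tests could not evaluate is still good if at least one test approved it and none rejected it.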

In [None]:
idx_valid = (qc_good <= 2)

p = figure(plot_width=750, plot_height=300, title="Water Level")
p.circle(data['epoch'][idx_valid], data["water_level"][idx_valid], size=8, line_color="orange", fill_color="orange", fill_alpha=0.5, legend_label="Good values")
p.triangle(data["epoch"][~idx_valid], data["water_level"][~idx_valid], size=12, line_color="red", fill_color="red", fill_alpha=0.8, legend_label="Bad values")

p.legend.location = "top_left"
show(p)
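
Since this dataset comes with reference flags, the automatic result can be compared against them. The sketch below counts agreements and disagreements between two boolean masks. The masks are synthetic. In this notebook, `reference_bad` would correspond to `idx_bad` and `qc_bad` to `~idx_valid`.

```python
import numpy as np

# Synthetic masks for illustration; True marks a bad measurement.
reference_bad = np.array([False, True, True, False, False, True])
qc_bad = np.array([False, True, False, False, True, True])

true_pos = int(np.sum(reference_bad & qc_bad))     # bad, and caught
false_neg = int(np.sum(reference_bad & ~qc_bad))   # bad, but missed
false_pos = int(np.sum(~reference_bad & qc_bad))   # good, but flagged
true_neg = int(np.sum(~reference_bad & ~qc_bad))   # good, and kept

print(f"caught: {true_pos}, missed: {false_neg}, "
      f"false alarms: {false_pos}, correctly kept: {true_neg}")
```

Tightening the thresholds usually reduces the missed cases at the cost of more false alarms, so these counts help in choosing a balance.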