# [WIP] Quality Controlling a CTD profile
Quality control of a CTD profile (Temperature & Salinity)

## Objective:
Show how to quality control temperature and salinity from a CTD profile.


In [None]:
from bokeh.io import output_notebook, show
from bokeh.layouts import row
from bokeh.plotting import figure
import numpy as np

import cotede
from cotede import datasets, qctests

In [None]:
output_notebook()

## Data
We'll use a CTD profile in the Tropical Atlantic for this tutorial.
If curious about this dataset, check CoTeDe's documentation for more details.

Let's load the data and check which variables are available.

In [None]:
data = cotede.datasets.load_ctd()

print("The variables are: ", sorted(data.keys()))
print("There is a total of {} observed depths.".format(len(data["TEMP"])))

This CTD was equipped with backup sensors to provide more robustness.
Measurements from the secondary sensor are identified by a 2 at the end of the name. Let's focus here on the primary sensors.

To visualize this profile we will use Bokeh which allows some interactivity.

In [None]:
p1 = figure(plot_width=420, plot_height=600)
p1.circle(data['TEMP'], -data['PRES'], size=8, line_color="seagreen", fill_color="mediumseagreen", fill_alpha=0.3)
p1.xaxis.axis_label = "Temperature [C]"
p1.yaxis.axis_label = "Depth [m]"

p2 = figure(plot_width=420, plot_height=600)
p2.y_range = p1.y_range
p2.circle(data['PSAL'], -data['PRES'], size=8, line_color="seagreen", fill_color="mediumseagreen", fill_alpha=0.3)
p2.xaxis.axis_label = "Salinity"
p2.yaxis.axis_label = "Depth [m]"

p = row(p1, p2)
show(p)

Considering the unusual magnitudes and variability near the bottom, there are clearly bad measurements in this profile.
Let's start with one of the most fundamental QC tests and restrict the data to feasible values.

## Global Range: Check for Feasible Values
Let's use the thresholds recommended by the GTSPP:
 - Temperature between -2 and 40 $^\circ$C
 - Salinity between 0 and 41

In [None]:
# ToDo: Include a shaded area for unfeasible values

idx_valid = (data['TEMP'] >= -2) & (data['TEMP'] <= 40)

p1 = figure(plot_width=420, plot_height=600, title="Global Range Check (-2 <= T <= 40)")
p1.circle(data['TEMP'][idx_valid], -data['PRES'][idx_valid], size=8, line_color="seagreen", fill_color="mediumseagreen", fill_alpha=0.3, legend_label="Good values")
p1.triangle(data['TEMP'][~idx_valid], -data['PRES'][~idx_valid], size=8, line_color="red", fill_color="red", fill_alpha=0.3, legend_label="Bad values")
p1.xaxis.axis_label = "Temperature [C]"
p1.yaxis.axis_label = "Depth [m]"


idx_valid = (data['PSAL'] >= 0) & (data['PSAL'] <= 41)

p2 = figure(plot_width=420, plot_height=600, title="Global Range Check (0 <= S <= 41)")
p2.y_range = p1.y_range
p2.circle(data['PSAL'][idx_valid], -data['PRES'][idx_valid], size=8, line_color="seagreen", fill_color="mediumseagreen", fill_alpha=0.3, legend_label="Good values")
p2.triangle(data['PSAL'][~idx_valid], -data['PRES'][~idx_valid], size=8, line_color="red", fill_color="red", fill_alpha=0.3, legend_label="Bad values")
p2.xaxis.axis_label = "Practical Salinity"
p2.yaxis.axis_label = "Depth [m]"

p = row(p1, p2)
show(p)

Great, we already identified a fair number of bad measurements.
The global range test is simple and cheap, so there is no reason not to apply it routinely, but it is usually not enough on its own.
We will need to apply more tests to capture the rest of the bad measurements.

Several QC tests are already implemented in CoTeDe, so you don't need to code them again.
For instance, the global range test is available as `qctests.GlobalRange` and we can use it as follows:

In [None]:
y = qctests.GlobalRange(data, 'TEMP', cfg={"minval": -2, "maxval": 40})
y.flags

Let's use that to check which temperature values are unfeasible.

In [None]:
flag = y.flags["global_range"]
data["TEMP"][flag==4]
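
To make the meaning of these flags concrete, here is a minimal sketch of a global range check written by hand. The function name `global_range_flags` and the example temperatures are hypothetical, used only for illustration, and missing values are not handled here; `qctests.GlobalRange` remains the reference.

```python
import numpy as np


def global_range_flags(x, minval, maxval):
    """Return flag 1 (good) inside [minval, maxval] and 4 (bad) outside.

    Simplified sketch: missing values (NaN) are not treated specially.
    """
    x = np.asarray(x, dtype=float)
    flags = np.ones(x.shape, dtype="i1")
    flags[(x < minval) | (x > maxval)] = 4
    return flags


# Hypothetical temperatures, not taken from the profile above
temp_example = np.array([25.3, 12.1, -3.0, 4.2, 45.8])
flags_example = global_range_flags(temp_example, minval=-2, maxval=40)

print(flags_example)  # [1 1 4 1 4]
print(temp_example[flags_example == 4])
```

Selecting with `flags_example == 4` is the same pattern used above to list the unfeasible temperatures.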

The Global Range is trivial to implement, but other checks are more complex, and CoTeDe provides those as well.
For instance, let's consider another traditional procedure, the Spike check.

## Spike
The spike check is a traditional test that compares each measurement with the tendency observed in its neighboring values.
We could implement it as follows:

In [None]:
def spike(x):
    """Spike check as defined by GTSPP
    
    Notes
    -----
    - Check CoTeDe's manual for more details.
    """
    y = np.nan * x
    y[1:-1] = np.abs(x[1:-1] - (x[:-2] + x[2:]) / 2.0) - np.abs((x[2:] - x[:-2]) / 2.0)
    return y

This is already implemented in CoTeDe as `qctests.spike`, and we could use it like:

In [None]:
temp_spike = qctests.spike(data["TEMP"])

print("The largest spike observed was: {:.3f}".format(np.nanmax(np.abs(temp_spike))))

The same could be done for salinity, such as: ``sal_spike = qctests.spike(data["PSAL"])``
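
To get a feeling for the numbers returned by the spike check, let's apply the same formula to a tiny synthetic series with a single obvious spike. The name `spike_sketch` and the input values are made up for this illustration.

```python
import numpy as np


def spike_sketch(x):
    """Same formula as the spike() function defined above."""
    x = np.asarray(x, dtype=float)
    y = np.full(x.shape, np.nan)
    # Distance to the mean of the two neighbors, discounted by
    # half the difference between those neighbors
    y[1:-1] = np.abs(x[1:-1] - (x[:-2] + x[2:]) / 2.0) - np.abs(
        (x[2:] - x[:-2]) / 2.0
    )
    return y


# Synthetic series: constant at 10 with one value jumping to 15
x_example = np.array([10.0, 10.0, 15.0, 10.0, 10.0])
spike_example = spike_sketch(x_example)

print(spike_example)
```

The center point gets a spike value of 5, its two neighbors get 0, and the first and last points stay undefined (NaN) because they have only one neighbor.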

The traditional approach to use the spike check is by comparing the "spikeness magnitude" with a threshold.
The measurement is considered bad (flag 4) if the spike was larger than that threshold.
As with the global range check, we could use `spike()` ourselves and compare its output with an acceptable limit.
This procedure is already available in CoTeDe as `qctests.Spike` and we can use it as follows,

In [None]:
y_spike = qctests.Spike(data, "TEMP", cfg={"threshold": 2.0})
y_spike.flags

Like the Global Range, it provides the quality flags obtained from this procedure.
Note that the default flagging follows the IOC recommendation (check the manual to customize the flags): 1 means good data, while 0 means that no QC was applied.
The spike check relies on the previous and following measurements, so it cannot evaluate the first or the last value and returns flag 0 for those two measurements.
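
The step from spike magnitudes to flags can be sketched as below. This is a simplified illustration, not CoTeDe's code: the function name `spike_flags` and the input values are hypothetical, and it only assumes the convention described above (0 where the test could not be evaluated, 1 for good, 4 for bad).

```python
import numpy as np


def spike_flags(spike_values, threshold):
    """Turn spike magnitudes into flags: 0 not evaluated, 1 good, 4 bad."""
    spike_values = np.asarray(spike_values, dtype=float)
    flags = np.zeros(spike_values.shape, dtype="i1")

    # Only points with a defined spike value were actually evaluated
    evaluated = np.isfinite(spike_values)
    flags[evaluated] = 1

    # Among the evaluated points, those above the threshold are bad
    bad = np.zeros(spike_values.shape, dtype=bool)
    bad[evaluated] = spike_values[evaluated] > threshold
    flags[bad] = 4
    return flags


# Hypothetical spike magnitudes, undefined at both ends
spike_values_example = np.array([np.nan, 0.0, 5.0, 0.0, np.nan])
flags_example = spike_flags(spike_values_example, threshold=2.0)

print(flags_example)  # [0 1 4 1 0]
```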

Some procedures provide more than just the flags, but also include features derived from the original measurements.
For instance, if one was interested in the "spike intensity" of one measurement, that could be inspected as:

In [None]:
y_spike.features

The magnitudes computed by each test are stored in `features`.
Let's check which features were saved by the spike check,

In [None]:
print("Features for temperature: {}\n".format(y_spike.features.keys()))

## More tests
QC checks are usually focused on specific characteristics of bad measurements, thus to cover a wider range of issues we typically combine a set of checks.
Let's apply the Gradient and the Tukey53H checks.

In [None]:
y_gradient = qctests.Gradient(data, "TEMP", cfg={"threshold": 10})
y_gradient.flags

In [None]:
y_tukey53H = qctests.Tukey53H(data, "TEMP", cfg={"threshold": 2.0})
y_tukey53H.flags
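
For reference, the quantity behind the gradient check can be sketched in the same style as the spike function. This assumes the GTSPP definition, where each value is compared with the mean of its two neighbors; the name `gradient_sketch` and the input values are made up for illustration, so check CoTeDe's manual for the exact definition it uses.

```python
import numpy as np


def gradient_sketch(x):
    """Gradient feature, assuming the GTSPP definition:
    |x[i] - (x[i-1] + x[i+1]) / 2|
    """
    x = np.asarray(x, dtype=float)
    y = np.full(x.shape, np.nan)
    y[1:-1] = np.abs(x[1:-1] - (x[:-2] + x[2:]) / 2.0)
    return y


# Synthetic series: constant at 10 with one value jumping to 15
x_example = np.array([10.0, 10.0, 15.0, 10.0, 10.0])
gradient_example = gradient_sketch(x_example)

print(gradient_example)
```

Unlike the spike formula, nothing is subtracted here, so the neighbors of the anomalous point also get a non-zero value (2.5 in this example).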

These ready-made tests are useful, but it could be even easier.
We usually don't apply one test at a time but a set of tests. We could do that by defining a QC configuration like

In [None]:
cfg = {
    "TEMP": {
        "global_range": {"minval": -2, "maxval": 40},
        "gradient": {"threshold": 10.0},
        "spike": {"threshold": 2.0},
        "tukey53H": {"threshold": 1.5},
    }
}

In [None]:
qc = cotede.ProfileQC(data, cfg=cfg)

That's it, the temperature from the primary sensor was evaluated with the four tests configured above.

First, note that the same variables from the input are available in the output object.

In [None]:
print("Variables available in data: {}\n".format(data.keys()))
print("Variables available in qc: {}\n".format(qc.keys()))

In [None]:
print("Flags available for temperature {}\n".format(qc.flags["TEMP"].keys()))
# Salinity is not part of this configuration yet, so there are no flags for "PSAL" at this point.

In [None]:
qc.flags.keys()

In [None]:
qc.flags["TEMP"]

In [None]:
flag = qc.flags["TEMP"]["overall"]
flag
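
How the individual tests end up in a single overall flag can be sketched as follows. This assumes that the overall flag is simply the highest (worst) flag among all tests for each measurement, which is a simplification; the flag values below are hypothetical, and CoTeDe may treat special values such as missing data differently.

```python
import numpy as np

# Hypothetical flags from three tests for five measurements
flags_by_test = {
    "global_range": np.array([1, 1, 4, 1, 1], dtype="i1"),
    "gradient": np.array([0, 1, 1, 4, 0], dtype="i1"),
    "spike": np.array([0, 1, 1, 1, 0], dtype="i1"),
}

# Assumption: a measurement is as bad as its worst individual flag
overall_sketch = np.max(np.stack(list(flags_by_test.values())), axis=0)

print(overall_sketch)  # [1 1 4 4 1]
```

With this rule, a measurement is only good overall if no test rejected it, and an unevaluated test (flag 0) does not hide a good result from another test.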

In [None]:
cfg = {
    "TEMP": {
        "global_range": {"minval": -2, "maxval": 40},
        "gradient": {"threshold": 10.0},
        "spike": {"threshold": 2.0},
        "tukey53H": {"threshold": 1.5},
    },
    "PSAL": {
        "global_range": {"minval": 0, "maxval": 41},
        "gradient": {"threshold": 10.0},
        "spike": {"threshold": 2.0},
        "tukey53H": {"threshold": 1.5},
    }

}

In [None]:
qc = cotede.ProfileQC(data, cfg=cfg)

In [None]:
qc.flags.keys()

## Using CoTeDe QC framework
CoTeDe automates many procedures for QC. Let's start using the standard procedure.

That's it, the temperature and salinity from the primary sensors were evaluated. First, note that the same variables from the input are available in the output object.

In [None]:
print("Variables available in data: {}\n".format(data.keys()))
print("Variables available in qc: {}\n".format(qc.keys()))

In [None]:
print("Flags available for temperature {}\n".format(qc.flags["TEMP"].keys()))
print("Flags available for salinity {}\n".format(qc.flags["PSAL"].keys()))

The flags follow the IOC standard, thus 1 means good while 4 means bad.
0 is used when no QC was applied. For instance, the spike test depends on the previous and following measurements, thus the first and last data points of the array will always have a spike flag equal to 0.
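
A quick way to summarize a set of flags is to count how many measurements fall in each class. The sketch below uses a hypothetical list of flags and a helper named `summarize_flags`, both made up for illustration. The meanings for 0, 1 and 4 are the ones stated above; the ones for 2, 3 and 9 are my reading of the IOC scale and should be checked against the manual.

```python
from collections import Counter

# Meaning of the flags. 0, 1 and 4 are described in the text above;
# 2, 3 and 9 are assumed from the IOC scale.
IOC_FLAG_MEANING = {
    0: "no QC performed",
    1: "good",
    2: "probably good",
    3: "probably bad",
    4: "bad",
    9: "missing value",
}


def summarize_flags(flags):
    """Count the measurements in each flag class."""
    counts = Counter(int(f) for f in flags)
    return {
        IOC_FLAG_MEANING.get(f, "other"): n for f, n in sorted(counts.items())
    }


# Hypothetical spike flags: first and last values were not evaluated
summary = summarize_flags([0, 1, 1, 1, 4, 1, 0])
print(summary)
```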

Let's check which temperature and salinity measurements passed all the tests:

In [None]:
# ToDo: Include a shaded area for unfeasible values

idx_valid = (qc.flags["TEMP"]["overall"] <= 2)

p1 = figure(plot_width=420, plot_height=600, title="Overall QC flag (temperature)")
p1.circle(data['TEMP'][idx_valid], -data['PRES'][idx_valid], size=8, line_color="seagreen", fill_color="mediumseagreen", fill_alpha=0.3, legend_label="Good values")
p1.triangle(data['TEMP'][~idx_valid], -data['PRES'][~idx_valid], size=8, line_color="red", fill_color="red", fill_alpha=0.3, legend_label="Bad values")
p1.xaxis.axis_label = "Temperature [C]"
p1.yaxis.axis_label = "Depth [m]"


idx_valid = (qc.flags["PSAL"]["overall"] <= 2)

p2 = figure(plot_width=420, plot_height=600, title="Overall QC flag (salinity)")
p2.y_range = p1.y_range
p2.circle(data['PSAL'][idx_valid], -data['PRES'][idx_valid], size=8, line_color="seagreen", fill_color="mediumseagreen", fill_alpha=0.3, legend_label="Good values")
p2.triangle(data['PSAL'][~idx_valid], -data['PRES'][~idx_valid], size=8, line_color="red", fill_color="red", fill_alpha=0.3, legend_label="Bad values")
p2.xaxis.axis_label = "Practical Salinity"
p2.yaxis.axis_label = "Depth [m]"

p = row(p1, p2)
show(p)

## More tests: GTSPP Spike and Gradient tests
OK, let's apply more tests beyond the global range.
Some common ones are the gradient and spike checks, and we could use CoTeDe to run them as follows:

In [None]:
y_gradient = qctests.Gradient(data, 'TEMP', cfg={"threshold": 10})
y_gradient.flags

In [None]:
y_spike = qctests.Spike(data, 'TEMP', cfg={"threshold": 2.0})
y_spike.flags

## The Easiest Way: High level
Let's evaluate this profile using EuroGOOS standard tests.

In [None]:
pqced = cotede.ProfileQCed(data, cfg='eurogoos')

In [None]:
p = figure(plot_width=500, plot_height=600)
p.circle(pqced['TEMP'], -pqced['PRES'], size=8, line_color="green", fill_color="green", fill_alpha=0.3)
show(p)
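
Presumably, the high-level object hides the measurements that did not pass the QC, so only the approved values reach the plot. The effect can be sketched with a NumPy masked array. This is an assumption about the behavior, not CoTeDe's actual code, and the values and flags below are hypothetical.

```python
import numpy as np

# Hypothetical measurements and their overall flags
values = np.array([25.3, 12.1, 45.8, 4.2])
overall = np.array([1, 2, 4, 1], dtype="i1")

# Assumption: anything not flagged as good (1) or probably good (2) is hidden
masked_values = np.ma.masked_where(overall > 2, values)

print(masked_values)
print(masked_values.compressed())  # only the values that passed
```

Masked values are ignored by most NumPy reductions, for instance `masked_values.mean()` only averages the approved measurements.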

## QC with more control: "medium" level

In [None]:
pqc = cotede.ProfileQC(data, cfg='eurogoos')

In [None]:
pqc.keys()

In [None]:
pqc.flags["TEMP"]

In [None]:
data.keys()

### Low level

In [None]:
from cotede import qctests
y = qctests.GlobalRange(data, 'TEMP', cfg={'minval': -4, "maxval": 45 })
y.flags

In [None]:
y = qctests.Tukey53H(data, 'TEMP', cfg={'threshold': 6, "l": 12})
y.features["tukey53H"]
p = figure(plot_width=500, plot_height=600)
p.circle(y.features["tukey53H"], -data['PRES'], size=8, line_color="green", fill_color="green", fill_alpha=0.3)
show(p)
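
To see what kind of quantity the Tukey 53H feature is, here is a minimal sketch following the textbook description of the filter: a running median of 5 points, then a running median of 3 points, then a Hanning smoother, and finally the difference between the data and that smooth series. The name `tukey53h_sketch` is made up, and CoTeDe's own implementation may differ in details such as edge handling and normalization.

```python
import numpy as np


def tukey53h_sketch(x):
    """Difference between x and its Tukey 53H smoothed version.

    The four points at each end can't be evaluated and are left as NaN.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    delta = np.full(n, np.nan)
    if n < 9:
        return delta

    # Running median of 5 points, defined for indices 2 .. n-3
    u1 = np.median(np.stack([x[i:n - 4 + i] for i in range(5)]), axis=0)
    # Running median of 3 points over u1, defined for indices 3 .. n-4
    m = u1.size
    u2 = np.median(np.stack([u1[i:m - 2 + i] for i in range(3)]), axis=0)
    # Hanning smoother over u2, defined for indices 4 .. n-5
    u3 = 0.25 * (u2[:-2] + 2.0 * u2[1:-1] + u2[2:])

    delta[4:-4] = x[4:-4] - u3
    return delta


# Synthetic linear profile with one anomalous value in the middle
x_example = np.arange(11, dtype=float)
x_example[5] = 100.0
delta_example = tukey53h_sketch(x_example)

print(delta_example)
```

Because medians are robust to isolated outliers, the smooth series barely notices the anomalous point, so the difference at that position is large while its neighbors stay close to zero.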

In [None]:
cfg = {'TEMP': {'global_range': {'minval': -4, 'maxval': 45}}}

pqc = cotede.ProfileQC(data, cfg)

pqc.flags['TEMP']
pqc.flags['TEMP']['overall']

idx_good = pqc.flags['TEMP']['overall'] <= 2
idx_bad = pqc.flags['TEMP']['overall'] >= 3

p = figure(plot_width=500, plot_height=600)
p.circle(data['TEMP'][idx_good], -data['PRES'][idx_good], size=8, line_color="green", fill_color="green", fill_alpha=0.3)
p.triangle(data['TEMP'][idx_bad], -data['PRES'][idx_bad], size=8, line_color="red", fill_color="red", fill_alpha=0.3)
show(p)

In [None]:
cfg['TEMP']['spike'] = {'threshold': 6}

pqc = cotede.ProfileQC(data, cfg)

pqc.flags['TEMP']
pqc.flags['TEMP']['overall']

idx_good = pqc.flags['TEMP']['overall'] <= 2
idx_bad = pqc.flags['TEMP']['overall'] >= 3

p = figure(plot_width=500, plot_height=600)
p.circle(data['TEMP'][idx_good], -data['PRES'][idx_good], size=8, line_color="green", fill_color="green", fill_alpha=0.3)
p.triangle(data['TEMP'][idx_bad], -data['PRES'][idx_bad], size=8, line_color="red", fill_color="red", fill_alpha=0.3)
show(p)

In [None]:
cfg['TEMP']['woa_normbias'] = {'threshold': 6}


pqc = cotede.ProfileQC(data, cfg)

pqc.flags['TEMP']
pqc.flags['TEMP']['overall']

idx_good = pqc.flags['TEMP']['overall'] <= 2
idx_bad = pqc.flags['TEMP']['overall'] >= 3

p = figure(plot_width=500, plot_height=600)
p.circle(data['TEMP'][idx_good], -data['PRES'][idx_good], size=8, line_color="green", fill_color="green", fill_alpha=0.3)
p.triangle(data['TEMP'][idx_bad], -data['PRES'][idx_bad], size=8, line_color="red", fill_color="red", fill_alpha=0.3)
show(p)
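
The idea behind a normalized bias test can be sketched as follows: each measurement is compared with a climatological mean, and the difference is scaled by the climatological standard deviation. This is only an illustration of the concept. The climatology values below are made up, while the real test looks them up in the World Ocean Atlas for the position and time of the profile.

```python
import numpy as np

# Hypothetical observations and climatology at the same depths
observed = np.array([20.0, 10.0, 30.0])
clim_mean = np.array([19.0, 10.5, 12.0])
clim_std = np.array([1.0, 0.5, 1.0])

# Number of standard deviations away from the climatology
normbias = (observed - clim_mean) / clim_std

# Same threshold as in the configuration above
threshold = 6
flags_sketch = np.where(np.abs(normbias) > threshold, 4, 1)

print(normbias)
print(flags_sketch)  # [1 1 4]
```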

In [None]:
cfg['TEMP']['spike_depthconditional'] = {"pressure_threshold": 500, "shallow_max": 6.0, "deep_max": 2.0}

pqc = cotede.ProfileQC(data, cfg)

pqc.flags['TEMP']
pqc.flags['TEMP']['overall']

idx_good = pqc.flags['TEMP']['overall'] <= 2
idx_bad = pqc.flags['TEMP']['overall'] >= 3

p = figure(plot_width=500, plot_height=600)
p.circle(data['TEMP'][idx_good], -data['PRES'][idx_good], size=8, line_color="green", fill_color="green", fill_alpha=0.3)
p.triangle(data['TEMP'][idx_bad], -data['PRES'][idx_bad], size=8, line_color="red", fill_color="red", fill_alpha=0.3)
show(p)
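
The depth conditional variant uses a more tolerant threshold near the surface, where strong variability is expected, and a stricter one at depth. A minimal sketch of that logic is shown below, reusing the parameter names from the configuration above. The pressures and spike magnitudes are hypothetical, and how CoTeDe treats a measurement exactly at the pressure threshold is an assumption here.

```python
import numpy as np

# Same parameters as in the configuration above
pressure_threshold = 500
shallow_max = 6.0
deep_max = 2.0

# Hypothetical pressures [dbar] and spike magnitudes
pressure = np.array([100.0, 400.0, 600.0, 900.0])
spike_values = np.array([1.0, 7.0, 3.0, 1.0])

# One threshold per measurement, chosen according to its pressure
# (assumption: the threshold level itself counts as shallow)
threshold = np.where(pressure <= pressure_threshold, shallow_max, deep_max)

flags_sketch = np.where(spike_values > threshold, 4, 1)

print(threshold)
print(flags_sketch)  # [1 4 4 1]
```

Note that the spike of 3 at 600 dbar is rejected, while the same magnitude would have been accepted above 500 dbar.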
