# Anomaly Detection concept with water level timeseries

This notebook walks through some of the principles behind the Anomaly Detection approach to the quality control of oceanographic data.

## Objective:
The Anomaly Detection technique to quality control oceanographic data (Castelao, 2020 (in review)) is based on the principle of understanding the typical behavior of the good data by considering not only the raw measurement but also its characteristics (also called features).
Here we will use a water level timeseries long enough to give a fair estimate of the typical water level behavior at this station from the dataset itself.

The original paper only illustrated temperature profiles, so here it is shown that the same Anomaly Detection principle is also valid for water level.

Note that although the code is explicitly included in this notebook so that anyone can follow the procedure step by step, it is not required to fully understand the code. The text together with the figures should make sense by themselves.

This and other notebooks on quality control are available at https://cotede.castelao.net in /docs/notebooks/.
There you can run the notebooks without installing anything on your machine.

## Notes:
- This notebook is focused on Anomaly Detection. If you're not familiar with CoTeDe, there is another notebook covering the basics of QC with sea level that might be worth checking first. Otherwise, this one is probably fine on its own if you're only interested in better understanding the concept of Anomaly Detection for QC.

In [None]:
import numpy as np
from bokeh.io import output_notebook, show
from bokeh.layouts import column, row
from bokeh.models import ColumnDataSource, CustomJS, Slider
from bokeh.plotting import figure
import pandas as pd
from scipy import stats

import cotede
from cotede import datasets, qctests

In [None]:
output_notebook()

In [None]:
print("CoTeDe version: {}".format(cotede.__version__))

## Data
We'll use a water level sample dataset from CoTeDe for this tutorial. This is the water level station: 8764227 LAWMA, Amerada Pass, LA, operated by NOAA / NOS / CO-OPS, and kindly provided by Lindsay Abrams. If curious about this dataset, check CoTeDe's documentation for more details and credits.

Fortunately, this data was already flagged by NOAA personnel, so let's take advantage of that and use those flags as a reference to verify whether we are doing a good job. Keep in mind that when applying Anomaly Detection we would not expect to have such labels, i.e. we wouldn't know the answer a priori.

Let's load the data and check which variables are available.

In [None]:
data = cotede.datasets.load_water_level()

print("The variables are: ", sorted(data.keys()))
print("There is a total of {} observations.".format(len(data["epoch"])))

This data was previously quality controlled (variable `flagged`). Let's use that to build our indices of good and bad data, so we can verify what we should be identifying.

In [None]:
idx_good = ~data["flagged"]
idx_bad = data["flagged"]

In [None]:
# A time series with the data
# x_axis_type='datetime'
p = figure(plot_width=750, plot_height=300, title="Water Level")
p.circle(data['epoch'][idx_good], data["water_level"][idx_good], size=8, line_color="orange", fill_color="orange", fill_alpha=0.5, legend_label="Good values")
p.triangle(data["epoch"][idx_bad], data["water_level"][idx_bad], size=12, line_color="red", fill_color="red", fill_alpha=0.8, legend_label="Bad values")
show(p)

### Describing the Data
Based on the manual flagging, let's check the distribution of the good and bad data.

The good data seems to be normally distributed.
Part of the bad values is quite distinct from the typical good data, showing a cluster of clear outliers above 14.

In [None]:
hist_good, edges_good = np.histogram(data["water_level"][idx_good], density=False, bins=50)
hist_bad, edges_bad = np.histogram(data["water_level"][idx_bad], density=False, bins=50)

p = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa", title="Data distribution")
p.quad(top=hist_good, bottom=0, left=edges_good[:-1], right=edges_good[1:],
           fill_color="green", line_color="white", alpha=0.5, legend_label="Good data")
p.quad(top=hist_bad, bottom=0, left=edges_bad[:-1], right=edges_bad[1:],
           fill_color="red", line_color="white", alpha=0.5, legend_label="Bad data")
show(p)

Let's estimate the mean and standard deviation.

In [None]:
mu_estimated, sigma_estimated = stats.norm.fit(data["water_level"])

print("Estimated mean: {:.3f}, and standard deviation: {:.3f}".format(mu_estimated, sigma_estimated))

In [None]:
x_ref = np.linspace(data["water_level"].min(), data["water_level"].max(), 1000)
pdf = stats.norm.pdf(x_ref, loc=mu_estimated, scale=sigma_estimated)

In [None]:
hist_good, edges_good = np.histogram(data["water_level"][idx_good], density=True, bins=50)
hist_bad, edges_bad = np.histogram(data["water_level"][idx_bad], density=True, bins=50)

p = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa")
p.quad(top=hist_good, bottom=0, left=edges_good[:-1], right=edges_good[1:],
           fill_color="green", line_color="white", alpha=0.5, legend_label="Good data")
p.quad(top=hist_bad, bottom=0, left=edges_bad[:-1], right=edges_bad[1:],
           fill_color="red", line_color="white", alpha=0.5, legend_label="Bad data")
p.line(x_ref, pdf, line_color="orange", line_width=6, alpha=0.7, legend_label="PDF fit")

show(p)

Our estimated PDF doesn't look great, and that is due to the outliers. We'd better use a robust estimator.

### Robust estimate of the mean and standard deviation
For a normally distributed dataset, the median is equal to the mean, with the difference that the median is more robust to outliers.
Like the mean, the standard deviation is also sensitive to outliers. An alternative is to use the pseudo-standard deviation (the interquartile range divided by 1.349), which is also equal to the standard deviation for a normally distributed dataset.
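
To see why the robust estimators matter, here is a minimal sketch on synthetic data (not the station record): a Gaussian sample contaminated with a few extreme values. The numbers used (`7.0`, `0.5`, `15.0`) and the variable names are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic example: 5000 "good" values plus 100 gross outliers (assumed values)
rng = np.random.default_rng(42)
good = rng.normal(loc=7.0, scale=0.5, size=5000)
outliers = np.full(100, 15.0)
sample = np.concatenate([good, outliers])

# Classic estimators, both sensitive to the outliers
mu_classic = sample.mean()
sigma_classic = sample.std()

# Robust estimators: median and pseudo-standard deviation (IQR / 1.349)
q25, q50, q75 = np.percentile(sample, [25, 50, 75])
mu_median = q50
sigma_pseudo = (q75 - q25) / 1.349

print("Classic: mean={:.3f}, std={:.3f}".format(mu_classic, sigma_classic))
print("Robust:  median={:.3f}, pseudo-std={:.3f}".format(mu_median, sigma_pseudo))
```

The factor 1.349 is the width of the interquartile range of a standard normal distribution, which is why the pseudo-standard deviation matches the standard deviation for Gaussian data.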

In [None]:
mu_robust = np.percentile(data["water_level"], 50)
sigma_robust = (np.percentile(data["water_level"], 75) - np.percentile(data["water_level"], 25)) / 1.349

print("Estimated robust mean: {:.3f}, and robust standard deviation: {:.3f}".format(mu_robust, sigma_robust))

In [None]:
x_ref = np.linspace(data["water_level"].min(), data["water_level"].max(), 1000)
pdf = stats.norm.pdf(x_ref, loc=mu_robust, scale=sigma_robust)

In [None]:
hist_good, edges_good = np.histogram(data["water_level"][idx_good], density=True, bins=50)
hist_bad, edges_bad = np.histogram(data["water_level"][idx_bad], density=True, bins=50)

p = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa", title="Probability Density Function")
p.quad(top=hist_good, bottom=0, left=edges_good[:-1], right=edges_good[1:],
           fill_color="green", line_color="white", alpha=0.5, legend_label="Good data")
p.quad(top=hist_bad, bottom=0, left=edges_bad[:-1], right=edges_bad[1:],
           fill_color="red", line_color="white", alpha=0.5, legend_label="Bad data")
p.line(x_ref, pdf, line_color="orange", line_width=6, alpha=0.7, legend_label="PDF fit")
# p.line(x_ref, sf, line_color="blue", line_width=4, alpha=0.7, legend_label="SF")

show(p)

Using the median and the pseudo-standard deviation works great if the amount of outliers is not too large. Otherwise, well, those wouldn't be outliers anyway, right?

## Survival Function
Once we can estimate the PDF, we can also obtain the Survival Function (SF).

In [None]:
cdf = stats.norm.cdf(x_ref, loc=mu_robust, scale=sigma_robust)
sf = stats.norm.sf(x_ref, loc=mu_robust, scale=sigma_robust)

In [None]:
# hist, edges = np.histogram(data["water_level"], density=True, bins=50)

p = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa", title="Survival Function")
# p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
#            fill_color="dodgerblue", line_color="white", alpha=0.5)
p.quad(top=hist_good, bottom=0, left=edges_good[:-1], right=edges_good[1:],
           fill_color="green", line_color="white", alpha=0.5, legend_label="Good data")
p.quad(top=hist_bad, bottom=0, left=edges_bad[:-1], right=edges_bad[1:],
           fill_color="red", line_color="white", alpha=0.5, legend_label="Bad data")
p.line(x_ref, cdf, line_color="lightseagreen", line_width=4, alpha=0.7, legend_label="CDF")
p.line(x_ref, sf, line_color="orange", line_width=4, alpha=0.7, legend_label="SF")

show(p)

For a given value $x_i$, the CDF($x_i$) gives the probability of sampling from this dataset a value equal to or smaller than $x_i$, while SF($x_i$) gives the probability of sampling a value equal to or greater than $x_i$.
For instance, values higher than 8 are only 5% of the dataset (you can check it by zooming in on the orange line).
Therefore, the SF can be used as a guide to how rare a measurement is from the upper bound perspective.
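
As a quick sanity check of these definitions, the sketch below uses a normal distribution with made-up parameters (`mu_demo` and `sigma_demo` are assumptions, not the values estimated above) to show that the SF is simply the complement of the CDF.

```python
import numpy as np
from scipy import stats

# Hypothetical parameters, for illustration only
mu_demo, sigma_demo = 7.0, 0.5

levels = np.array([6.0, 7.0, 8.0])
cdf_demo = stats.norm.cdf(levels, loc=mu_demo, scale=sigma_demo)
sf_demo = stats.norm.sf(levels, loc=mu_demo, scale=sigma_demo)

# For each level, CDF + SF adds up to one
for level, c, s in zip(levels, cdf_demo, sf_demo):
    print("x={:.1f}  CDF={:.4f}  SF={:.4f}  sum={:.4f}".format(level, c, s, c + s))
```

At the center of the distribution both functions equal 0.5, and the SF shrinks as the level moves toward the upper tail.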

### Cumulative Distribution Function versus Survival Function
Let's play with the good data to better understand the CDF and the SF. You don't need to fully understand the code in the next box; you can jump straight to the plot below and play with the slider for different water levels.

The top plot is a normalized histogram of the observed water level flagged as good, and the line is our estimated PDF. The second plot shows the area of the histogram that is greater or smaller than the water level selected with the slider. The bottom plot shows the CDF and SF, which are equivalent to the areas in the middle plot. Therefore, the higher the chosen water level, the less orange in the histogram (middle), and the smaller the value of the SF (bottom).

For the Anomaly Detection implemented to QC data, we use the SF as an index to quantify how rare an observation is with respect to the upper bound. If we were interested in how rare a small value is (i.e. the lower bound), we should use the CDF instead.

In [None]:
x = data["water_level"][idx_good]
hist, edges = np.histogram(x, density=True, bins=50)


x_ref = np.linspace(data["water_level"].min(), 8.3, 250)

pdf = stats.norm.pdf(x_ref, loc=mu_robust, scale=sigma_robust)
cdf = stats.norm.cdf(x_ref, loc=mu_robust, scale=sigma_robust)
sf = stats.norm.sf(x_ref, loc=mu_robust, scale=sigma_robust)

slider = Slider(title="water level", value=np.median(x), start=x.min(), end=x.max(), step=0.02, orientation="horizontal")


tmp = dict(
    x_ref=x_ref.copy(),
    pdf_ref=pdf.copy(),
    cdf_ref=cdf.copy(),
    sf_ref=sf.copy(),
    cdf=cdf.copy(),
    sf=sf.copy()
)
tmp["cdf"][x_ref > slider.value] = np.nan
tmp["sf"][x_ref < slider.value] = np.nan

dist_source = ColumnDataSource(data=tmp)


tmp = dict(
    hist=hist,
    left=edges[:-1],
    right=edges[1:],
    ch=hist.copy(),
    cl=edges[:-1].copy(),
    cr=edges[1:].copy(),
    sh=hist.copy(),
    sl=edges[:-1].copy(),
    sr=edges[1:].copy()
)

idx = edges[1:] < slider.value
tmp["sh"][idx] = np.nan
tmp["sl"][idx] = np.nan
tmp["sr"][idx] = np.nan
idx = edges[:-1] > slider.value
tmp["ch"][idx] = np.nan
tmp["cl"][idx] = np.nan
tmp["cr"][idx] = np.nan


source = ColumnDataSource(data=tmp)
# source = ColumnDataSource(data=tmp)
callback = CustomJS(args=dict(source=source, dist_source=dist_source), code="""
    var data = source.data;
    var f = cb_obj.value;
    var hist = data['hist'];
    var left = data['left'];
    var right = data['right'];
    var ch = data['ch'];
    var cl = data['cl'];
    var cr = data['cr'];
    var sh = data['sh'];
    var sl = data['sl'];
    var sr = data['sr'];
    for (var i = 0; i < hist.length; i++) {
        if (left[i] > f) {
            ch[i] = "NaN";
            cl[i] = "NaN";
            cr[i] = "NaN";
        } else {
            ch[i] = hist[i];
            cl[i] = left[i];
            cr[i] = right[i];
        }
        if (right[i] < f) {
            sh[i] = "NaN";
            sl[i] = "NaN";
            sr[i] = "NaN";
        } else {
            sh[i] = hist[i];
            sl[i] = left[i];
            sr[i] = right[i];
        }
    }
    var ddata = dist_source.data;
    var x_ref = ddata['x_ref'];
    var cdf_ref = ddata['cdf_ref'];
    var cdf = ddata['cdf'];
    var sf_ref = ddata['sf_ref'];
    var sf = ddata['sf'];
    for (var i = 0; i < x_ref.length; i++) {
        if (x_ref[i] > f) {
            cdf[i] = "NaN"
            sf[i] = sf_ref[i]
        } else {
            cdf[i] = cdf_ref[i]
            sf[i] = "NaN"
        }
    }
    dist_source.change.emit();
    source.change.emit();
""")


slider.js_on_change('value', callback)

p_top = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa", title="Probability Density Function")
p_top.quad(top="hist", bottom=0, left="left", right="right", source=source,
           fill_color="green", line_color="white", alpha=0.5)
p_top.line(x_ref, pdf, line_color="crimson", line_width=6, alpha=0.7, legend_label="PDF fit")


p1 = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa", title="Observations in respect to the threshold")
p1.quad(top="ch", bottom=0, left="cl", right="cr", source=source,
           fill_color="lightseagreen", line_color="white", alpha=0.5, legend_label="lower than")
p1.quad(top="sh", bottom=0, left="sl", right="sr", source=source,
           fill_color="orange", line_color="white", alpha=0.5, legend_label="greater than")
p1.line(x_ref, pdf, line_color="crimson", line_width=6, alpha=0.7, legend_label="PDF fit")

p2 = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa", title="Cumulative Distribution Function & Survival Function")
p2.x_range = p1.x_range
p2.line("x_ref", "cdf_ref", source=dist_source, line_color="lightgray", line_width=1, alpha=0.7, legend_label="CDF")
p2.line("x_ref", "cdf", source=dist_source, line_color="lightseagreen", line_width=4, alpha=0.7, legend_label="CDF")
p2.line("x_ref", "sf_ref", source=dist_source, line_color="lightgray", line_width=1, alpha=0.7, legend_label="SF")
p2.line("x_ref", "sf", source=dist_source, line_color="orange", line_width=4, alpha=0.7, legend_label="SF")

p = column(p_top,slider, p1, p2)
show(p)

How rare is it to observe a water level higher than 8 at this station? You can play with the plot above to find that value, or check the answer below. And higher than 10?

In [None]:
x1 = 8.1
x2 = 10
print("While the SF({})={:.3}, the SF({})={:.3e}".format(
    x1,
    stats.norm.sf(x1, loc=mu_robust, scale=sigma_robust),
    x2,
    stats.norm.sf(x2, loc=mu_robust, scale=sigma_robust)
    )
)

It is clear that a sea level of 10 is less common than 8.1, but how much less? The survival function is a way to scale that, by defining how frequently values equal to or higher than the one in question were observed.
For instance, approximately 2.5% of the observations were equal to or higher than 8.1.
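
The link between the SF and "how frequently it was observed" can be checked with a small sketch on synthetic data. Everything here (the sample, `mu_demo`, `sigma_demo`, and the chosen level) is an assumption for illustration, not the station data.

```python
import numpy as np
from scipy import stats

# Synthetic Gaussian "water level" sample (assumed parameters)
rng = np.random.default_rng(0)
mu_demo, sigma_demo = 7.0, 0.5
sample = rng.normal(mu_demo, sigma_demo, 100_000)

# A level two standard deviations above the center
level = mu_demo + 2 * sigma_demo

# SF from the fitted model versus the observed fraction of exceedances
sf_model = stats.norm.sf(level, loc=mu_demo, scale=sigma_demo)
sf_empirical = np.mean(sample >= level)

print("Model SF: {:.4f}, empirical fraction: {:.4f}".format(sf_model, sf_empirical))
```

When the assumed distribution describes the data well, the two numbers agree closely. A large mismatch is a hint that the chosen distribution is a poor fit for that tail.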

## Adding a new perspective: Spikiness using Tukey53H
Looking back at the distribution plots, it is clear that there are bad measurements within the range of valid measurements, i.e. between 7 and 8.
Those cannot be identified by looking at the magnitude alone, so we shall add more tests.
Let's start with Tukey53H (if curious, check CoTeDe's manual (https://cotede.castelao.net) about this procedure).
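
For readers who prefer code to a manual, the sketch below outlines the Tukey 53H idea as it is commonly described: a running median of 5 points, then a running median of 3 points, then a Hanning smoother, with the feature being the difference between the measurement and the smoothed series. This is an illustration only; the function name `tukey53h_sketch` is hypothetical, and CoTeDe's `qctests.tukey53H` may differ in details such as edge handling or normalization.

```python
import numpy as np

def tukey53h_sketch(x):
    """Illustrative Tukey 53H residual (not CoTeDe's implementation)."""
    x = np.asarray(x, dtype=float)
    n = x.size

    # Step 1: running median over 5 points
    u1 = np.full(n, np.nan)
    for i in range(2, n - 2):
        u1[i] = np.median(x[i - 2:i + 3])

    # Step 2: running median over 3 points of the previous result
    u2 = np.full(n, np.nan)
    for i in range(3, n - 3):
        u2[i] = np.median(u1[i - 1:i + 2])

    # Step 3: Hanning smoother with weights (1/4, 1/2, 1/4)
    u3 = np.full(n, np.nan)
    u3[4:n - 4] = 0.25 * (u2[3:n - 5] + 2 * u2[4:n - 4] + u2[5:n - 3])

    # Feature: measurement minus smoothed series (NaN near the edges)
    return x - u3

# A smooth ramp with a single artificial spike at index 10
series = np.linspace(0, 1, 21)
series[10] += 5.0
residual = tukey53h_sketch(series)
print(residual[5], residual[10])
```

The smooth part of the series yields a residual near zero, while the spike stands out clearly. The first and last four points cannot be evaluated in this sketch and are left as NaN.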

In [None]:
y_tukey53H = qctests.tukey53H(data["water_level"])

Let's take a look at the same timeseries, but projected as Tukey53H.


In [None]:
# A time series with the data
p1 = figure(plot_width=750, plot_height=300, title="Water Level")
p1.circle(data['epoch'][idx_good], data["water_level"][idx_good], size=8, line_color="orange", fill_color="orange", fill_alpha=0.5, legend_label="Good values")
p1.triangle(data["epoch"][idx_bad], data["water_level"][idx_bad], size=12, line_color="red", fill_color="red", fill_alpha=0.8, legend_label="Bad values")

p2 = figure(plot_width=750, plot_height=300, title="Tukey53H of the water level")
p2.x_range = p1.x_range
p2.circle(data['epoch'][idx_good], y_tukey53H[idx_good], size=8, line_color="orange", fill_color="orange", fill_alpha=0.5, legend_label="Good values")
p2.triangle(data["epoch"][idx_bad], y_tukey53H[idx_bad], size=12, line_color="red", fill_color="red", fill_alpha=0.8, legend_label="Bad values")
p = column(p1, p2)
show(p)

What is the distribution of the tukey53H feature?
In contrast to the raw data, the tukey53H of the good data is quite different from that of the bad data.
For that reason, let's use two plots with different scales.

In [None]:
hist_good

In [None]:
idx = np.isfinite(y_tukey53H)

hist_good, edges_good = np.histogram(y_tukey53H[idx & idx_good], density=False, bins=50)

p1 = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa", title="Tukey53H of the good data")
p1.quad(top=hist_good, bottom=0, left=edges_good[:-1], right=edges_good[1:],
           fill_color="green", line_color="white", alpha=0.5)

hist_bad, edges_bad = np.histogram(y_tukey53H[idx & idx_bad], density=False, bins=50)

p2 = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa", title="Tukey53H of the bad data")
p2.quad(top=hist_bad, bottom=0, left=edges_bad[:-1], right=edges_bad[1:],
           fill_color="red", line_color="white", alpha=0.5)

p = column(p1, p2)
show(p)

Let's use a robust estimate of the mean and standard deviation.

In [None]:
mu_tukey53H = np.percentile(y_tukey53H[idx], 50)
sigma_tukey53H = (np.percentile(y_tukey53H[idx], 75) - np.percentile(y_tukey53H[idx], 25)) / 1.349

print("Estimated robust mean: {:.3e}, and robust standard deviation: {:.3e}".format(mu_tukey53H, sigma_tukey53H))

How does that compare with a non-robust estimate?

In [None]:
mu_tukey53H, sigma_tukey53H = stats.norm.fit(y_tukey53H[idx])

print("Estimated mean: {:.3f}, and standard deviation: {:.3f}".format(mu_tukey53H, sigma_tukey53H))

In [None]:
print("While the SF(0.01)={:.3}, the SF(0.5)={:.3e}".format(
    stats.norm.sf(0.01, loc=mu_tukey53H, scale=sigma_tukey53H),
    stats.norm.sf(0.5, loc=mu_tukey53H, scale=sigma_tukey53H)
    )
)

## Combining tests into a multidimensional criterion
How could we aggregate information from multiple tests into a combined criterion? Extreme cases are easy to identify, such as the water elevation above 14, but as we get closer to the expected values it is harder to decide based on a single criterion without also flagging good data as bad.
One alternative is to combine multiple perspectives which alone are not conclusive but combined could make a clear case.

Here we used two features to evaluate each measurement: the water elevation itself and the Tukey53H of the water elevation.
Those have different scales (compare the histograms shown before), so they can't be combined without some sort of scaling.
A difference of 0.1 in the water level does not have the same effect as a difference of 0.1 in Tukey53H.
One way to normalize each feature is by using the survival function.
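
The scaling argument above can be sketched with two hypothetical Gaussian features. The parameters below (`wl_mu`, `wl_sigma`, `tk_mu`, `tk_sigma`) are made up for illustration and are not the values estimated from the station data.

```python
from scipy import stats

# Hypothetical parameters for two features with very different scales
wl_mu, wl_sigma = 7.0, 0.5     # water level
tk_mu, tk_sigma = 0.0, 0.02    # Tukey53H

# The same absolute departure of 0.1 from the center of each feature
step = 0.1
sf_wl = stats.norm.sf(wl_mu + step, loc=wl_mu, scale=wl_sigma)
sf_tk = stats.norm.sf(tk_mu + step, loc=tk_mu, scale=tk_sigma)

print("SF for +0.1 in water level: {:.3f}".format(sf_wl))
print("SF for +0.1 in Tukey53H:    {:.3e}".format(sf_tk))
```

Both results live on the same 0 to 1 scale, which is what makes them comparable: in this made-up case, a departure of 0.1 is ordinary for the water level but extremely rare for the spikiness feature.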

In [None]:
x_i = 8.1
pxi = stats.norm.sf(x_i, loc=mu_robust, scale=sigma_robust)

y_i = 0.1
pyi = stats.norm.sf(y_i, loc=mu_tukey53H, scale=sigma_tukey53H)

print("""The probability of observing a measurement higher than {} is {:.3e}.\n""".format(x_i, pxi))
print("""And the probability of observing a tukey53H value higher than {} is {:.3e}.\n""".format(y_i, pyi))

print("If we assume that these are independent processes, i.e. a spike is independent of the actual water level and thus can happen at any point, the probability of a tukey53H larger than {} together with a water level higher than {} is {:.3e}".format(y_i, x_i, pxi * pyi))

## Focus on the uncommon
Instead of assuming that all features have a normal distribution and trying to fit the whole dataset, let's focus on the extreme cases.
A good observing system typically has much less than 1% of its measurements invalid, so let's fit our PDF using only the top 5% of the values.

Let's reconsider the Tukey53H feature and assume it is symmetric, i.e. it doesn't matter if the spike is up or down. That is not always true. For instance, spikes in Chlorophyll measurements are not symmetric, thus positive spikes are different than negative ones.
Since we are looking at the tail, instead of a Gaussian, let's use an Exponential Weibull distribution, which will give us more degrees of freedom.
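
A minimal sketch of such a tail fit on synthetic data is shown below. The sample is made up, and, unlike the cells that follow, this sketch fits the exceedances above the threshold with the location fixed at zero (`floc=0`), an assumption made here only to keep the fit numerically stable.

```python
import numpy as np
from scipy.stats import exponweib

# Synthetic absolute "spikiness" values (assumed, not the station data)
rng = np.random.default_rng(1)
spikiness = np.abs(rng.normal(0.0, 0.01, 20_000))

# Keep only the top 5% and measure each value from the threshold
threshold_95 = np.percentile(spikiness, 95)
exceedance = spikiness[spikiness > threshold_95] - threshold_95

# Fit the two shape parameters and the scale; location fixed at zero
a, c, loc, scale = exponweib.fit(exceedance, floc=0)

# Survival function of the fitted tail
grid = np.linspace(0, exceedance.max(), 200)
sf_tail = exponweib.sf(grid, a, c, loc=loc, scale=scale)

print("Fitted parameters: a={:.3f}, c={:.3f}, scale={:.3e}".format(a, c, scale))
```

Note that the resulting SF is conditional on a value already being in the top 5%, so it describes how rare a value is among the uncommon ones.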

In [None]:
y_tukey53H = np.absolute(qctests.tukey53H(data["water_level"]))

In [None]:
tukey53H_top95 = np.percentile(y_tukey53H[np.isfinite(y_tukey53H)], 95)

print("The 95th percentile of the valid absolute Tukey53H is {:.4e}".format(tukey53H_top95))

Let's take only the top 5% of Tukey53H values and fit an Exponential Weibull distribution.

In [None]:
y = y_tukey53H[y_tukey53H > tukey53H_top95]

from scipy.stats import exponweib
param_tukey53H = exponweib.fit(y)
param_tukey53H

In [None]:
y_ref = np.linspace(y.min(), y.max(), 1000)
sf = exponweib.sf(y_ref, *param_tukey53H[:-2], loc=param_tukey53H[-2], scale=param_tukey53H[-1])

# An index where Tukey53H is valid. The tails can't be calculated (check CoTeDe's manual)
idx = np.isfinite(y_tukey53H)

hist_good, edges_good = np.histogram(y_tukey53H[idx & idx_good], density=False, bins=50)

p1 = figure(plot_width=750, plot_height=250, background_fill_color="#fafafa")
p1.quad(top=hist_good, bottom=0, left=edges_good[:-1], right=edges_good[1:],
           fill_color="green", line_color="white", alpha=0.5)

hist_bad, edges_bad = np.histogram(y_tukey53H[idx & idx_bad], density=False, bins=50)

p2 = figure(plot_width=750, plot_height=250, background_fill_color="#fafafa")
p2.quad(top=hist_bad, bottom=0, left=edges_bad[:-1], right=edges_bad[1:],
           fill_color="red", line_color="white", alpha=0.5)

p3 = figure(plot_width=750, plot_height=250, background_fill_color="#fafafa", title="Survival Function")
p3.line(y_ref, sf, line_color="orange", line_width=4, alpha=0.7)

p = column(p1, p2, p3)
show(p)

Let's do the same for the water level itself.

In [None]:
top95 = np.percentile(data["water_level"], 95)
print("The 95th percentile of the valid water level is {:.5f}".format(top95))

x = data["water_level"][data["water_level"] > top95]

param = exponweib.fit(x)

x1 = 8.1
x2 = 10
print("While the SF({})={:.3}, the SF({})={:.3}".format(
    x1,
    exponweib.sf(x1, *param[:-2], loc=param[-2], scale=param[-1]),
    x2,
    exponweib.sf(x2, *param[:-2], loc=param[-2], scale=param[-1])
    )
)

Let's combine both probabilities.
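
Before applying this to the real data, here is a small sketch of the combination step using hypothetical tail parameters (with both shape parameters set to 1, the Exponential Weibull reduces to a plain exponential). Working with the sum of log survival functions is equivalent to the log of the product of the probabilities. The parameter tuples, the three sample points, and the threshold of -5 are assumptions for illustration.

```python
import numpy as np
from scipy.stats import exponweib

# Hypothetical (a, c, loc, scale) for each feature
param_wl_demo = (1.0, 1.0, 8.0, 0.5)    # water level tail
param_tk_demo = (1.0, 1.0, 0.0, 0.05)   # absolute Tukey53H tail

# Three made-up measurements; the last has no valid Tukey53H
water_level_demo = np.array([8.1, 9.0, 8.2])
spikiness_demo = np.array([0.01, 0.30, np.nan])

# Log of the joint probability, assuming independent features
log_p = (
    exponweib.logsf(water_level_demo, *param_wl_demo[:2],
                    loc=param_wl_demo[2], scale=param_wl_demo[3])
    + exponweib.logsf(spikiness_demo, *param_tk_demo[:2],
                      loc=param_tk_demo[2], scale=param_tk_demo[3])
)

# Flag as bad whatever is rarer than the chosen threshold
threshold_demo = -5.0
flagged_demo = log_p < threshold_demo
print(log_p, flagged_demo)
```

A measurement without a valid feature results in NaN, which never satisfies the comparison and therefore is not flagged in this sketch. How to treat such cases is a separate decision.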

In [None]:
xx = np.arange(7, 14, 0.05)
yy = np.arange(0, 1, 0.005)
X, Y = np.meshgrid(xx, yy)

P = exponweib.sf(X, *param[:-2], loc=param[-2], scale=param[-1]) * exponweib.sf(Y, *param_tukey53H[:-2], loc=param_tukey53H[-2], scale=param_tukey53H[-1])

import matplotlib.pyplot as plt
plt.contour(X, Y, np.log(P))
plt.colorbar()
plt.show()

In [None]:
idx = np.isfinite(y_tukey53H)


p = figure(plot_width=450, plot_height=450, background_fill_color="#fafafa", title="A 2D space")
p.circle(data["water_level"][idx &idx_good], y_tukey53H[idx & idx_good], size=8, line_color="orange", fill_color="orange", fill_alpha=0.4, legend_label="Good values")
p.triangle(data["water_level"][idx & idx_bad], y_tukey53H[idx & idx_bad], size=12, line_color="red", fill_color="red", fill_alpha=0.4, legend_label="Bad values")

show(p)

In [None]:
p_valid = exponweib.sf(data["water_level"], *param[:-2], loc=param[-2], scale=param[-1]) * exponweib.sf(y_tukey53H, *param_tukey53H[:-2], loc=param_tukey53H[-2], scale=param_tukey53H[-1])


In [None]:
from bokeh.models import ColumnDataSource, CustomJS, Slider

threshold = Slider(title="threshold", value=-5.0, start=-15.0, end=0.0, step=0.5, orientation="horizontal")

tmp = dict(
    epoch=data["epoch"],
    wl=data["water_level"],
    wl_good=data["water_level"].copy(),
    wl_bad=data["water_level"].copy(),
    p=np.log(p_valid),
)
idx = np.log(p_valid) < threshold.value
tmp["wl_good"][idx] = np.nan
tmp["wl_bad"][~idx] = np.nan

source = ColumnDataSource(data=tmp)

callback = CustomJS(args=dict(source=source), code="""
    var data = source.data;
    var f = cb_obj.value;
    var p = data['p'];
    var wl = data['wl'];
    var wl_good = data['wl_good'];
    var wl_bad = data['wl_bad'];
    for (var i = 0; i < wl.length; i++) {
        if (p[i] < f) {
            wl_good[i] = "NaN"
            wl_bad[i] = wl[i]
        } else {
            wl_good[i] = wl[i]
            wl_bad[i] = "NaN"
        }
    }
    source.change.emit();
""")


threshold.js_on_change('value', callback)

In [None]:
np.nanmin(np.log(p_valid))

In [None]:
# A time series with the data
p1 = figure(plot_width=750, plot_height=300)
p1.circle("epoch", "wl_good", source=source, size=8, line_color="orange", fill_color="orange", fill_alpha=0.5, legend_label="Good values")
p1.triangle("epoch", "wl_bad", source=source, size=12, line_color="red", fill_color="red", fill_alpha=0.8, legend_label="Bad values")

p = column(threshold, p1)
show(p)

In [None]:
y_ref = np.linspace(np.nanmin(y_tukey53H), np.nanmax(y_tukey53H), 1000)

cdf_tukey53H = stats.norm.cdf(y_ref, loc=mu_tukey53H, scale=sigma_tukey53H)
sf_tukey53H = stats.norm.sf(y_ref, loc=mu_tukey53H, scale=sigma_tukey53H)

In [None]:
idx = np.isfinite(y_tukey53H)

hist_good, edges_good = np.histogram(y_tukey53H[idx & idx_good], density=True, bins=50)

p1 = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa")
p1.quad(top=hist_good, bottom=0, left=edges_good[:-1], right=edges_good[1:],
           fill_color="green", line_color="white", alpha=0.5)
#p1.line(y_ref, cdf_tukey53H, line_color="lightseagreen", line_width=4, alpha=0.7, legend_label="CDF")
#p1.line(y_ref, sf_tukey53H, line_color="orange", line_width=4, alpha=0.7, legend_label="SF")

hist_bad, edges_bad = np.histogram(y_tukey53H[idx & idx_bad], density=True, bins=50)

p2 = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa")
p2.quad(top=hist_bad, bottom=0, left=edges_bad[:-1], right=edges_bad[1:],
           fill_color="red", line_color="white", alpha=0.5)

p = column(p1, p2)
show(p)


In [None]:
mu_tukey53H, sigma_tukey53H = stats.norm.fit(y_tukey53H[np.isfinite(y_tukey53H)])

print("Estimated mean: {:.3f}, and standard deviation: {:.3f}".format(mu_tukey53H, sigma_tukey53H))

In [None]:
x_ref = y_tukey53H[np.isfinite(y_tukey53H)]
sf_tukey53H = stats.norm.sf(x_ref, loc=mu_tukey53H, scale=sigma_tukey53H)

hist, edges = np.histogram(y_tukey53H[np.isfinite(y_tukey53H)], density=True, bins=50)

p = figure(plot_width=750, plot_height=300, background_fill_color="#fafafa")
p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color="blue", line_color="white", alpha=0.5)
p.line(x_ref, sf_tukey53H, line_color="orange", line_width=4, alpha=0.7, legend_label="SF")

show(p)

In [None]:
data["water_level"][idx_bad]
np.nonzero(np.isnan(data["water_level"]))

In [None]:
def plot_hist(x):
    """Plot a histogram

    Create a histogram from the output of numpy.histogram().
    We will create several histograms in this notebook, so let's save this as a
    function to reuse the code.
    """
    x = x[~np.isnan(x)]
    hist, edges = np.histogram(x, density=True, bins=50)
    
    #title = 'test'
    # p = figure(title=title, tools='', background_fill_color="#fafafa")
    p = figure(plot_width=750, plot_height=300,
        tools='', background_fill_color="#fafafa")
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color="navy", line_color="white", alpha=0.5)
    # p.line(x, pdf, line_color="#ff8888", line_width=4, alpha=0.7, legend_label="PDF")
    # p.line(x, cdf, line_color="orange", line_width=2, alpha=0.7, legend_label="CDF")

    p.y_range.start = 0
    p.legend.location = "center_right"
    # p.legend.background_fill_color = "#fefefe"
    p.xaxis.axis_label = 'x'
    p.yaxis.axis_label = 'Pr(x)'
    p.grid.grid_line_color="white"
    return p
