Flexible WOA comparison (#57)
* WIP: different date sources

* feat: WOA comparison features are regular arrays

Conform with CoTeDe's new standard of exposing all features as regular
arrays.

Note: for the number of samples, which is an integer type, I'm using -1
as the fill value to keep the integer dtype. I might regret this decision.
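The convention described above can be sketched roughly as follows. This is a hypothetical helper, not CoTeDe's actual code: masked values become NaN for floating-point features, while a sentinel value keeps integer features (like the number of samples) as integers.

```python
import numpy as np
import numpy.ma as ma


def to_regular_array(feature, integer_fill=-1):
    """Convert a masked feature into a regular ndarray.

    Floating-point features use NaN for masked values; integer
    features keep their dtype by using a sentinel fill (here -1).
    """
    feature = ma.asarray(feature)
    if np.issubdtype(feature.dtype, np.integer):
        # Integers cannot hold NaN, so fall back to a sentinel.
        return np.asarray(feature.filled(integer_fill))
    return np.asarray(feature.filled(np.nan))
```

For example, a masked count `[5, 7, --]` would come out as `[5, 7, -1]` with an integer dtype preserved.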

* refactor: woa_normbias() in charge of WOA comparison

woa_normbias() was several years old and originally performed the whole
WOA QC comparison. The standard QC classes were developed and adopted
later. This commit is a step towards bringing the WOA comparison up to
the current standard, where woa_normbias() takes care of defining the
features, i.e. obtaining the WOA reference values as well as the bias
and normbias.
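The features mentioned above reduce to a simple computation; a minimal sketch with illustrative names, not the package's API:

```python
import numpy as np


def normalized_bias(measured, woa_mean, woa_std):
    """Bias relative to the climatology, normalized by its std.

    Returns (bias, normbias). Measurements far from the climatological
    mean, in units of standard deviation, are candidates for flagging.
    """
    bias = np.asarray(measured, dtype=float) - np.asarray(woa_mean, dtype=float)
    normbias = bias / np.asarray(woa_std, dtype=float)
    return bias, normbias
```

A QC test would then typically flag measurements whose absolute normbias exceeds a configured sigma threshold.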

* feat: extract_time(), search for time information

Different datasets with different structures might also follow different
vocabularies, so extract_time() searches several places for the time to
associate with the measurements.
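Such a lookup might be sketched like this, checking a few common vocabulary names in a few common places. This is an illustrative version only; the real helper may check other names and container types.

```python
def extract_time(data, attrs=None):
    """Search common locations and vocabulary names for the time.

    Checks explicitly passed attrs first, then the data container
    itself, under a few common variable names.
    """
    candidates = ("time", "datetime", "date")
    for source in (attrs or {}, data):
        for name in candidates:
            if name in source:
                return source[name]
    # Consistent with the commit: a LookupError signals missing time.
    raise LookupError("Could not find a time variable")
```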

* fix: typo

* style: PEP8

* fix: moving test_flex_time into tests

* feat: Extract DOY from several possible inputs

utils.day_of_year() is a WIP, but it is fine for now. It can obtain the
DOY as an integer from several possible input types. Once I normalize
other parts of the code I will come back here to finish and simplify
this function.
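A sketch of the idea, assuming the common input types are datetime, date, numpy.datetime64, and ISO strings (the actual helper may accept more):

```python
from datetime import date, datetime

import numpy as np


def day_of_year(t):
    """Day of year (1-366) as an int from several input types."""
    if isinstance(t, (datetime, date)):
        return int(t.strftime("%j"))
    if isinstance(t, np.datetime64):
        # Round-trip through datetime to reuse the %j formatting.
        return day_of_year(t.astype("datetime64[s]").astype(datetime))
    if isinstance(t, str):
        return day_of_year(datetime.fromisoformat(t))
    raise ValueError("Unsupported time type: {}".format(type(t)))
```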

* feat: Extract depth from several possible standards
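In the same spirit as the time extraction, depth lookup can be sketched as below. The names tried are illustrative; in the diff further down, pressure (PRES) is used as a proxy when an explicit DEPTH variable is absent.

```python
import numpy as np


def extract_depth(data):
    """Return the vertical coordinate under common vocabulary names.

    Pressure serves as a stand-in for depth when no explicit DEPTH
    variable is available.
    """
    for name in ("DEPTH", "depth", "PRES", "pres"):
        if name in data:
            return np.asarray(data[name])
    raise LookupError("Could not find a depth variable")
```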

* refactor: WOA_NormBias uses woa_normbias to create features

* feat: More validation tests for woa_normbias

* fix: Updating required version of OceansDB

There was a bug in OceansDB for invalid locations, such as points on land.

* refactor: Fails if can't obtain geolocation

Previously the approach was to create an empty features dictionary, but
now that woa_normbias is used, it fails if lat/lon cannot be obtained.

* feat: More logging information

* refactor: minimum samples for WOA comparison

Repositioning minimum samples control in WOA_NormBias class

* refactor: flag 0 if woa_normbias fails

* docs: typo

* docs: Comment on explicit target variables

* fix: Wrong path to support module - data

* fix: Require generic Anom. Detection to skip test

This new approach to the WOA comparison is incompatible with the old
Anomaly Detection approach. The new structure, in a parallel branch,
will be merged soon.

* Bump version: 0.23.2 → 0.23.3

* doc: Clarifying warning message
castelao committed Jan 2, 2021
1 parent 4dde93f commit c7315ba
Showing 9 changed files with 435 additions and 188 deletions.
2 changes: 1 addition & 1 deletion cotede/__init__.py
@@ -2,7 +2,7 @@

__author__ = 'Guilherme Castelao'
__email__ = 'guilherme@castelao.net'
__version__ = '0.23.2'
__version__ = '0.23.3'

from cotede import qc
from cotede.qc import ProfileQC, ProfileQCed
305 changes: 123 additions & 182 deletions cotede/qctests/woa_normbias.py
@@ -26,244 +26,185 @@
from oceansdb import WOA

from .qctests import QCCheckVar
from ..utils import extract_coordinates, extract_time, day_of_year, extract_depth

module_logger = logging.getLogger(__name__)


def woa_normbias(data, v, cfg):
def woa_normbias(data, varname, attrs=None, use_standard_error=False):
"""
FIXME: Move this procedure into a class to conform with the new system
and include a limit on the minimum amount of samples to trust it. For
example, consider as masked all climatological values estimated from
fewer than 5 samples.
Notes
-----
- Include arguments to overwrite target variable (timename=None, latname=None, lonname=None)
"""
try:
doy = day_of_year(extract_time(data, attrs))
except LookupError as err:
module_logger.error("Missing time")
raise

# 3 is the minimum possible to estimate the std, but I should use a higher value.
min_samples = 3
woa = None
try:
# Note that QCCheck fallback to self.data.attrs if attrs not given
lat, lon = extract_coordinates(data, attrs)
except LookupError:
module_logger.error("Missing geolocation (lat/lon)")
raise

kwargs = {"lat": lat, "lon": lon}

if (np.size(lat) > 1) | (np.size(lon) > 1):
dLmax = max(np.max(lat) - np.min(lat), np.max(lon) - np.min(lon))
if dLmax >= 0.01:
mode = "track"
# kwargs["alongtrack_axis"] = ['lat', 'lon']
else:
mode = "profile"
kwargs = {
"lat": np.mean(lat),
"lon": np.mean(lon),
}
module_logger.warning("Multiple lat/lon positions but too close to each other so it will be considered a single position for the WOA comparison. lat: {}, lon: {}".format(kwargs["lat"], kwargs["lon"]))
else:
mode = "profile"

depth = extract_depth(data)

db = WOA()
if v not in db.keys():
vtype = v[:-1]
# This must go away. This was a trick to handle Seabird CTDs, but
# now that seabird is a different package it should be handled there.
if isinstance(varname, str) and (varname[-1] == "2"):
vtype = varname[:-1]
else:
vtype = v

# Temporary solution while I'm not ready to handle tracks.
if (
("LATITUDE" in data)
and ("LONGITUDE" in data)
and ("LATITUDE" not in data.attributes)
and ("LONGITUDE" not in data.attributes)
):
if "datetime" in data.keys():
d = data["datetime"]
elif "datetime" in data.attributes:
d0 = data.attributes["datetime"]
if "timeS" in data.keys():
d = [d0 + timedelta(seconds=s) for s in data["timeS"]]
else:
d = ([data.attributes["datetime"]] * len(data["LATITUDE"]),)

# woa = woa_track_from_file(
# d,
# data['LATITUDE'],
# data['LONGITUDE'],
# cfg['file'],
# varnames=cfg['vars'])

module_logger.error("Sorry, I'm temporary not ready to handle tracks.")
# woa = db[vtype].get_track(var=['mean', 'standard_deviation'],
# doy=d,
# depth=[0],
# lat=data['LATITUDE'],
# lon=data['LONGITUDE'])

elif (
("LATITUDE" in data.attributes.keys())
and ("LONGITUDE" in data.attributes.keys())
and ("PRES" in data.keys())
):

woa = db[vtype].track(
var=["mean", "standard_deviation", "number_of_observations"],
doy=int(data.attributes["datetime"].strftime("%j")),
depth=data["PRES"],
lat=data.attributes["LATITUDE"],
lon=data.attributes["LONGITUDE"],
)
vtype = varname

woa_vars = [
"mean",
"standard_deviation",
"standard_error",
"number_of_observations",
]

# Eventually the case of some invalid depth levels will be handled by
# OceansDB and the following steps will be simplified.
valid_depth = depth
if (np.size(depth) > 0):
idx = ~ma.getmaskarray(depth) & (np.array(depth) >= 0) & np.isfinite(depth)
if not idx.any():
module_logger.error("Invalid depth(s) for WOA comparison: {}".format(depth))
raise IndexError
elif not idx.all():
valid_depth = depth[idx]
if mode == "track":
woa = db[vtype].track(var=woa_vars, doy=doy, depth=valid_depth, **kwargs)
else:
woa = db[vtype].extract(var=woa_vars, doy=doy, depth=valid_depth, **kwargs)

flag = np.zeros(data[v].shape, dtype="i1")
features = {}
if not np.all(depth == valid_depth):
for v in woa.keys():
tmp = ma.masked_all(depth.shape, dtype=woa[v].dtype)
tmp[idx] = woa[v]
woa[v] = tmp

try:
woa_bias = data[v] - woa["mean"]
woa_normbias = woa_bias / woa["standard_deviation"]
features = {
"woa_mean": woa["mean"],
"woa_std": woa["standard_deviation"],
"woa_nsamples": woa["number_of_observations"],
"woa_se": woa["standard_error"],
}

ind = np.nonzero(
(woa["number_of_observations"] >= min_samples)
& (np.absolute(woa_normbias) <= cfg["sigma_threshold"])
)
flag[ind] = 1 # cfg['flag_good']
ind = np.nonzero(
(woa["number_of_observations"] >= min_samples)
& (np.absolute(woa_normbias) > cfg["sigma_threshold"])
)
flag[ind] = 3 # cfg['flag_bad']
features["woa_bias"] = data[varname] - features["woa_mean"]

# Flag as 9 any masked input value
flag[ma.getmaskarray(data[v])] = 9
for v in features:
idx = ma.getmaskarray(features[v])
if idx.any():
if v == "woa_nsamples":
missing_value = -1
else:
missing_value = np.nan
features[v][idx] = missing_value
features[v] = np.array(features[v])

# If use_standard_error is True, the comparison with the climatology
# considers the standard error, i.e. the bias will be only the
# amount above the standard error range.
if use_standard_error is True:
standard_error = features["woa_std"]
idx = features["woa_nsamples"] > 0
standard_error[~idx] = np.nan
standard_error[idx] /= features["woa_nsamples"][idx] ** 0.5

idx = np.absolute(features["woa_bias"]) <= standard_error
features["woa_bias"][idx] = 0
idx = np.absolute(features["woa_bias"]) > standard_error
features["woa_bias"][idx] -= (
np.sign(features["woa_bias"][idx]) * standard_error[idx]
)

features = {
"woa_bias": woa_bias,
"woa_normbias": woa_normbias,
"woa_std": woa["standard_deviation"],
"woa_nsamples": woa["number_of_observations"],
"woa_mean": woa["mean"],
}
features["woa_normbias"] = features["woa_bias"] / features["woa_std"]

finally:
# self.logger.warnning("%s - WOA is not available at this site" %
# self.name)
return flag, features
return features


class WOA_NormBias(QCCheckVar):
"""Compares measurements with WOA climatology
Notes
-----
* Although using standard error is a good idea, the default is not to use
standard error to estimate the bias, following the traditional approach.
This can have a significant impact in the deep oceans and in regions
lacking extensive sampling.
FIXME: Move this procedure into a class to conform with the new system
and include a limit on the minimum amount of samples to trust it. For
example, consider as masked all climatological values estimated from
fewer than 5 samples.
"""

flag_bad = 3
use_standard_error = False
# 3 is the minimum possible to estimate the std, but I should use a higher value.
min_samples = 3

def __init__(self, data, varname, cfg=None, autoflag=True):
try:
self.use_standard_error = cfg["use_standard_error"]
except (KeyError, TypeError):
module_logger.debug("use_standard_error undefined. Using default value")
try:
self.min_samples = cfg["min_samples"]
except (KeyError, TypeError):
module_logger.debug("min_samples undefined. Using default value")
super().__init__(data, varname, cfg, autoflag)

def set_features(self):
try:
doy = int(self.data.attrs["date"].strftime("%j"))
except:
doy = int(self.data.attrs["datetime"].strftime("%j"))

if ("LATITUDE" in self.data.attrs.keys()) and (
"LONGITUDE" in self.data.attrs.keys()
):
mode = "profile"
kwargs = {
"lat": self.data.attrs["LATITUDE"],
"lon": self.data.attrs["LONGITUDE"],
}

if ("LATITUDE" in self.data.keys()) and ("LONGITUDE" in self.data.keys()):
mode = "track"
dLmax = max(
self.data["LATITUDE"].max() - self.data["LATITUDE"].min(),
self.data["LONGITUDE"].max() - self.data["LONGITUDE"].min(),
)
# Only use each measurement coordinate if it is spread.
if dLmax >= 0.01:
kwargs = {"lat": self.data["LATITUDE"], "lon": self.data["LONGITUDE"]}

if "DEPTH" in self.data.keys():
depth = self.data["DEPTH"]
elif "PRES" in self.data.keys():
depth = self.data["PRES"]

db = WOA()
# This must go away. This was a trick to handle Seabird CTDs, but
# now that seabird is a different package it should be handled there.
if isinstance(self.varname, str) and (self.varname[-1] == "2"):
vtype = self.varname[:-1]
else:
vtype = self.varname

woa_vars = [
"mean",
"standard_deviation",
"standard_error",
"number_of_observations",
]

idx = ~ma.getmaskarray(depth) & np.array(depth >= 0)
if idx.any():
if mode == "track":
woa = db[vtype].track(
var=woa_vars, doy=doy, depth=np.atleast_1d(depth[idx]), **kwargs
)
else:
woa = db[vtype].extract(
var=woa_vars, doy=doy, depth=np.atleast_1d(depth[idx]), **kwargs
)
else:
woa = {v: ma.masked_all(1) for v in woa_vars}

if idx.all() is not True:
for v in woa.keys():
tmp = ma.masked_all(depth.shape, dtype=woa[v].dtype)
tmp[idx] = woa[v]
woa[v] = tmp

self.features = {
"woa_mean": woa["mean"],
"woa_std": woa["standard_deviation"],
"woa_nsamples": woa["number_of_observations"],
"woa_se": woa["standard_error"],
}

self.features["woa_bias"] = self.data[self.varname] - self.features["woa_mean"]

# If use_standard_error is True, the comparison with the climatology
# considers the standard error, i.e. the bias will be only the
# amount above the standard error range.
if self.use_standard_error is True:
standard_error = (
self.features["woa_std"] / self.features["woa_nsamples"] ** 0.5
)
idx = np.absolute(self.features["woa_bias"]) <= standard_error
self.features["woa_bias"][idx] = 0
idx = np.absolute(self.features["woa_bias"]) > standard_error
self.features["woa_bias"][idx] -= (
np.sign(self.features["woa_bias"][idx]) * standard_error[idx]
)

self.features["woa_normbias"] = (
self.features["woa_bias"] / self.features["woa_std"]
)

self.features = woa_normbias(self.data, self.varname, self.attrs)
except LookupError:
self.features = {}

def test(self):

# 3 is the minimum possible to estimate the std, but I should use a higher value.
try:
min_samples = self.cfg["min_samples"]
except KeyError:
min_samples = 3

self.flags = {}

threshold = self.cfg["threshold"]
assert (np.size(threshold) == 1) and (threshold is not None)

flag = np.zeros(self.data[self.varname].shape, dtype="i1")

if "woa_normbias" not in self.features:
self.flags["woa_normbias"] = flag
return

normbias_abs = np.absolute(self.features["woa_normbias"])
ind = np.nonzero(
(self.features["woa_nsamples"] >= min_samples)
(self.features["woa_nsamples"] >= self.min_samples)
& np.array(normbias_abs <= threshold)
)
flag[ind] = self.flag_good
ind = np.nonzero(
(self.features["woa_nsamples"] >= min_samples)
(self.features["woa_nsamples"] >= self.min_samples)
& np.array(normbias_abs > threshold)
)
flag[ind] = self.flag_bad
