# Postprocessing: data quality check using the analysis
In this notebook we explain how the analysis itself can be used to get an indication of the data quality.

In [None]:
using DIVAnd
using Makie, CairoMakie, GeoMakie
using Dates
using Statistics
using ColorSchemes
include("../config.jl")

## Data reading
Salinity observations are read from a netCDF file based on the [World Ocean Database](https://www.nodc.noaa.gov/OC5/WOD/pr_wod.html).     
The data are read with the `loadobs` function.

In [None]:
varname = "Salinity"
filename = salinityprovencalfile
download_check(salinityprovencalfile, salinityprovencalfileURL)

In [None]:
obsval, obslon, obslat, obsdepth, obstime, obsid = loadobs(Float64, filename, varname);

### Topography and grid definition

See the [topography notebook](../2-Preprocessing/06-topography.ipynb) for more details.     
Here the code is simply repeated to obtain the topography that defines the land-sea mask.

In [None]:
dx = dy = 0.05
lonr = 2.5:dx:12.0
latr = 41.9:dy:44.6

mask, (pm, pn), (xi, yi) = DIVAnd_rectdom(lonr, latr)

bathname = gebco04file
download_check(gebco04file, gebco04fileURL)

bx, by, b = load_bath(bathname, true, lonr, latr)

mask = falses(size(b, 1), size(b, 2))
for j = 1:size(b, 2)
    for i = 1:size(b, 1)
        mask[i, j] = b[i, j] >= 1.0
    end
end
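
The double loop above can also be written as a single broadcast comparison. The sketch below uses a small made-up matrix `b` (not the GEBCO data) and assumes, as the loop implies, that sea points are those where `b` is at least 1:

```julia
# Made-up bathymetry values: positive in water, negative or small on land
b = [-5.0 0.5 20.0; 3.0 1.0 150.0]

# Explicit loop, as in the cell above
mask_loop = falses(size(b))
for j in axes(b, 2), i in axes(b, 1)
    mask_loop[i, j] = b[i, j] >= 1.0
end

# Equivalent broadcast form
mask_bcast = b .>= 1.0

@show mask_loop == mask_bcast
```

Both forms give the same mask; the broadcast version is simply shorter.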

### Data selection for example

Cross-validation, error calculations, etc. assume independent data. Therefore, do not use every point of high-resolution vertical profiles; restrict the selection to a narrow depth range instead. Here we keep the near-surface August data:

In [None]:
sel = (obsdepth .< 1) .& (Dates.month.(obstime) .== 8)

obsval = obsval[sel]
obslon = obslon[sel]
obslat = obslat[sel]
obsdepth = obsdepth[sel]
obstime = obstime[sel]
obsid = obsid[sel];   # keep the identifiers aligned with the selected observations
@show size(obsval)
checkobs((obslon, obslat, obsdepth, obstime), obsval, obsid)
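
To see how the selection mask works, here is a minimal sketch on made-up depths and dates (the names `depths` and `dates` are illustrative, not variables of this notebook). The two element-wise conditions are combined with `.&`, and the resulting Boolean vector is used to index every observation array in the same way:

```julia
using Dates

# Made-up depths and dates for five observations
depths = [0.5, 0.8, 12.0, 0.2, 0.0]
dates = [
    DateTime(2001, 8, 3),
    DateTime(1999, 7, 15),
    DateTime(2005, 8, 20),
    DateTime(2010, 8, 1),
    DateTime(2012, 1, 9),
]

# Near-surface observations taken in August, whatever the year
sel = (depths .< 1) .& (Dates.month.(dates) .== 8)

# Indices of the observations that are kept
kept = findall(sel)
@show kept
```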

### Analysis

Analysis parameters have been calibrated in the other notebook example.     
The analysis `fi` is computed on anomalies, using the mean of the observations as background.      
The structure `s` is stored for later use.

In [None]:
len = 0.5
epsilon2 = 5.0
fi, s = DIVAndrun(
    mask,
    (pm, pn),
    (xi, yi),
    (obslon, obslat),
    obsval .- mean(obsval),
    len,
    epsilon2,
);

Plot of the gridded field and the observations.

In [None]:
f = Figure()
ax = GeoAxis(
    f[1, 1],
    title = "Interpolated salinity field",
    dest = "+proj=merc",
    yticks = 40:1:45.0,
)
hm = heatmap!(
    ax,
    lonr,
    latr,
    fi .+ mean(obsval),
    colormap = ColorSchemes.haline,
    colorrange = [37.0, 38.5],
    interpolate = false,
)
Colorbar(f[2, 1], hm, vertical = false, label = "S", labelrotation = 0)
contourf!(ax, bx, by, b, levels = [-1e5, 0], colormap = Reverse(:binary))
scatter!(
    ax,
    obslon,
    obslat,
    color = obsval,
    markersize = 7,
    colormap = ColorSchemes.haline,
    colorrange = [37.0, 38.5],
)
rowgap!(f.layout, -80)
f

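Plot of the data residuals (observations minus the analysis at the data locations). The color range is set by the standard deviation of the observations.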

In [None]:
dataresiduals = DIVAnd_residualobs(s, fi)
rscale = sqrt(var(obsval))
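
As an illustration of what a residual is, the following sketch computes observation-minus-analysis differences on a made-up 1D grid. It is not how `DIVAnd_residualobs` is implemented: the helper `interp_linear`, the grid and all values are assumptions chosen for the example.

```julia
using Statistics

# Made-up 1D grid and gridded field (not DIVAnd output)
xgrid = 0.0:1.0:4.0
field = [37.0, 37.5, 38.0, 38.2, 38.1]

# Hypothetical helper: linear interpolation of the field at position x
function interp_linear(xgrid, field, x)
    i = clamp(searchsortedlast(xgrid, x), 1, length(xgrid) - 1)
    w = (x - xgrid[i]) / (xgrid[i+1] - xgrid[i])
    return (1 - w) * field[i] + w * field[i+1]
end

# Made-up observations and their positions
xobs = [0.5, 2.0, 3.5]
obs = [37.3, 37.2, 38.15]

# Residual: observation minus the field interpolated at the observation
residuals = obs .- interp_linear.(Ref(xgrid), Ref(field), xobs)

# Scale used for the color range, equivalent to sqrt(var(obs))
rscale = std(obs)
@show residuals rscale
```

The second observation stands out with a residual much larger in magnitude than the others, which is the kind of signal the quality check below exploits.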

In [None]:
f = Figure()
ax = GeoAxis(f[1, 1], title = "Residuals", dest = "+proj=merc", yticks = 40:1:45.0)
scat = scatter!(
    ax,
    obslon,
    obslat,
    color = dataresiduals,
    markersize = 7,
    colormap = ColorSchemes.balance,
    colorrange = [-rscale, rscale],
)
Colorbar(f[2, 1], scat, vertical = false)
contourf!(ax, bx, by, b, levels = [-1e5, 0, 1.0], colormap = Reverse(:binary))
rowgap!(f.layout, -80)
f

In [None]:
f = Figure()
ax = Axis(
    f[1, 1],
    title = "Residuals as function of value",
    xlabel = "Salinity",
    ylabel = "Residuals",
)
scatter!(ax, obsval, dataresiduals)
f

# Data quality check

As with cross-validation, the idea is in theory to take out data points and measure the difference between these unused points and the analysis. This can be done without actually removing the points. Three methods are implemented.
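
The statement that the points need not actually be removed can be checked on a toy problem. The sketch below is not DIVAnd's implementation: it assumes a 1D optimal interpolation with a Gaussian covariance `B`, uncorrelated observation errors of relative variance `epsilon2`, and made-up positions and values. For such a linear analysis `a = A * d`, the leave-one-out misfit at point `i` equals `(d[i] - a[i]) / (1 - A[i, i])`.

```julia
using LinearAlgebra

# Made-up 1D positions and anomalies; the third value is a deliberate outlier
x = [0.0, 0.3, 0.5, 0.9, 1.4, 2.0]
d = [0.1, 0.2, 2.5, 0.3, -0.1, 0.0]
len = 0.5        # correlation length
epsilon2 = 0.5   # error variance relative to the background variance

# Background covariance between data points (Gaussian kernel, an assumption)
B = [exp(-(xa - xb)^2 / (2 * len^2)) for xa in x, xb in x]
C = B + epsilon2 * I    # covariance of the observations

# Analysis at the data locations: a = A * d with A = B * inv(C)
A = B / C
a = A * d

# Shortcut: leave-one-out misfit obtained from a single analysis
cv_shortcut = (d .- a) ./ (1 .- diag(A))

# Brute force: remove point i, redo the analysis, compare with d[i]
cv_explicit = map(eachindex(d)) do i
    J = setdiff(eachindex(d), i)
    prediction = dot(B[i, J], (B[J, J] + epsilon2 * I) \ d[J])
    d[i] - prediction
end

@show maximum(abs.(cv_shortcut .- cv_explicit))
```

The two estimates agree to rounding error, so a single analysis is enough to evaluate the misfit of every point as if it had been left out.
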
## Define method used
The different methods are described in the output of 
```julia
?DIVAnd_qc
```
Here we use method 1: standard cross-validation.    
Data points with a QC value higher than 10 are considered suspect.

In [None]:
qcval = DIVAnd_qc(fi, s, 1)

# Find suspect points
sp = findall(x -> x > 10, qcval)

# Or sort the indicator
suspectindexes = sortperm(qcval, rev = true);
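
The two ways of picking suspect points can be compared on a made-up vector of QC values (the values and the name `qc` are illustrative only). A threshold returns the indices of all points above it, while sorting gives a ranking of all points from most to least suspect:

```julia
# Made-up QC values, one per observation
qc = [0.5, 12.0, 3.0, 25.0, 9.9]

# Indices of points above the threshold
above = findall(>(10), qc)

# Indices of all points, from the most to the least suspect
ranking = sortperm(qc, rev = true)

@show above ranking
```

The threshold depends on the absolute QC values, whereas the ranking only depends on their order.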

The suspect points are characterised by a lower salinity and are mostly located in the [Rhône estuary](https://en.wikipedia.org/wiki/Rh%C3%B4ne).

In [None]:
f = Figure()
ax = Axis(
    f[1, 1],
    title = "Residuals as function of value,\n colored by the QC value",
    xlabel = "Salinity",
    ylabel = "Residuals",
)
scat = scatter!(ax, obsval, dataresiduals, color = qcval, colorrange = [1, 10])
Colorbar(f[2, 1], scat, vertical = false)
f

In [None]:
f = Figure()
ax = GeoAxis(f[1, 1], title = "Positions of suspect points", dest = "+proj=merc")
hm = heatmap!(
    ax,
    lonr,
    latr,
    fi .+ mean(obsval),
    colormap = ColorSchemes.haline,
    colorrange = [37.0, 38.5],
    interpolate = false,
)
Colorbar(f[2, 1], hm, vertical = false, label = "S", labelrotation = 0)
contourf!(ax, bx, by, b, levels = [-1e5, 0, 1.0], colormap = Reverse(:binary))
scatter!(ax, obslon[sp], obslat[sp], color = :red, markersize = 7)
rowgap!(f.layout, -80)
f

### More information

In [None]:
?DIVAnd_qc

## Exercise

* Redo for different data.
* Possibly force the cross-validation method (use `?DIVAnd_cv`)
* Once optimized, try to redo the optimization with the first estimate as starting point
* Change the level of the QC value used for flagging
* Create non-uniform weights for data using the quality check parameter. Then redo the analysis and possibly the whole chain, from calibration to analysis
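
As a possible starting point for the last item, the sketch below turns QC values into a per-observation error variance. The inflation rule (scaling by the ratio of the QC value to the threshold) and the name `qcthreshold` are assumptions made for illustration, and the QC values are made up; other weighting rules are equally valid.

```julia
# Made-up QC values, one per observation (not real DIVAnd output)
qcval = [0.8, 2.5, 14.0, 35.0, NaN]
epsilon2 = 5.0       # scalar value used in the analysis
qcthreshold = 10.0   # hypothetical flagging level

# Assumed rule: suspect points get a larger error variance (lower weight),
# all other points, including those with a NaN QC value, are left unchanged
inflation = [(!isnan(q) && q > qcthreshold) ? q / qcthreshold : 1.0 for q in qcval]
epsilon2_vec = epsilon2 .* inflation

@show epsilon2_vec

# The vector can then replace the scalar in the analysis, for instance:
# fi2, s2 = DIVAndrun(mask, (pm, pn), (xi, yi), (obslon, obslat),
#                     obsval .- mean(obsval), len, epsilon2_vec);
```

This relies on `DIVAndrun` accepting a vector `epsilon2` with one value per observation; check `?DIVAndrun` for the exact behaviour in your version.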

<div class="alert alert-block alert-warning">
⚠️ Data quality check using analysis-data misfits is useful but should not replace <b>preprocessing</b> quality checks and proper quality flag exploitation.
</div>