# Automatic data downloading

In this example we will show how one can download data from a Jupyter notebook.

The [PhysOcean](https://github.com/gher-uliege/PhysOcean.jl) module provides ways to automatically download data from 
- the [World Ocean Database](https://www.nodc.noaa.gov/OC5/WOD/pr_wod.html) and
- the [CMEMS](http://marine.copernicus.eu/) In-Situ TAC.

This module can be installed by: 

```julia
using Pkg
Pkg.add("PhysOcean")
```
or, in the package REPL:
```julia
(@v1.11) pkg> add PhysOcean
```
To make sure that the latest version is installed, we add the package from its `master` branch.

In [None]:
using Pkg
Pkg.add(PackageSpec(name="PhysOcean", rev="master"))

In [None]:
using CairoMakie
using GeoMakie
using PhysOcean           # Download data from the World Ocean Database and Copernicus
using DIVAnd              # Observation checks, bathymetry extraction and duplicate detection
using Dates
using Statistics
include("../config.jl")

## Settings
Define the time range and the geographical bounding box used for downloading the data.

In [None]:
# resolution (the resolution is only used for DIVAnd analyses)
dx = dy = 0.25   # medium size test 

# vectors defining the longitude and latitudes grids
# Here longitude and latitude correspond to the Mediterranean Sea
lonr = -7:dx:37
latr = 30:dy:46

# time range of the in-situ data
timerange = [Date(2016,1,1),Date(2016,12,31)]
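
As a quick sanity check of these settings, the sketch below computes the size of the grid and the length of the time window. It only relies on the Julia standard library; the names `nlon`, `nlat` and `ndays` are introduced here for the illustration.

```julia
using Dates

# Same settings as in the cell above
dx = dy = 0.25
lonr = -7:dx:37
latr = 30:dy:46
timerange = [Date(2016,1,1), Date(2016,12,31)]

# Number of grid points in each direction (illustrative names)
nlon = length(lonr)
nlat = length(latr)

# Length of the time window in days, both end dates included
ndays = Dates.value(timerange[2] - timerange[1]) + 1

@info("Grid size: $(nlon) x $(nlat), time window: $(ndays) days")
```

With the values used here, the grid has 177 x 65 points and the time window covers 366 days (2016 is a leap year).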

In [None]:
# Name of the variable
varname = "Salinity"

Please use your own email address (!) 😉     
It is only used to notify you by email once the dataset is ready.

In [None]:
# Email for downloading the data: it is read from `email.txt` if that file exists,
# otherwise the address below is used (indicate here your own email address)
if isfile("email.txt")
    email = String(strip(read("email.txt", String)))
    @info("Getting email address from email.txt")
else
    @warn("Create a file 'email.txt' if you want to query data from the World Ocean Database")
    email = "ctroupin@uliege.be"
end
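
One possible way to make the handling of the email address more robust is to wrap it in a small helper. The function `read_email` below is hypothetical (it is not part of PhysOcean) and the address is a placeholder. It returns `nothing` when the file is missing or when its content does not contain an `@`, so that the caller can decide what to do.

```julia
# Hypothetical helper, not part of PhysOcean: read an email address from a text file.
# Returns `nothing` if the file does not exist or if its content has no "@" character.
function read_email(filename::AbstractString)
    isfile(filename) || return nothing
    address = String(strip(read(filename, String)))
    return occursin("@", address) ? address : nothing
end

# Demonstration with a temporary file and a placeholder address
tmpfile = tempname()
write(tmpfile, "  first.last@example.org\n")
@show read_email(tmpfile)
@show read_email(tmpfile * ".missing")
rm(tmpfile)
```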

## Download the data

* World Ocean Database: the bulk data access is performed by simulating a web user.
* Downloading can take several tens of minutes.
* SeaDataNet will provide a dedicated machine-to-machine interface during the SeaDataCloud project.

In [None]:
?WorldOceanDatabase.download

Define the directory where the results will be saved. This directory must exist and must be empty.   
The command `mkpath` creates this path, including any missing parent directories.

In [None]:
basedir = joinpath(datadir, "WOD-temporary-dir2")
isdir(basedir) && rm(basedir, recursive=true)   # remove the results of a previous run
mkpath(basedir)                                 # (re)create the empty directory
WorldOceanDatabase.download(lonr, latr, timerange, varname, email, basedir);
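
The requirement that the directory exists and is empty can be captured in a small helper. The sketch below is only an illustration: `fresh_dir` is a hypothetical function name, and the demonstration runs in a temporary location so that no real data are touched.

```julia
# Hypothetical helper: make sure that `dir` exists and is empty
function fresh_dir(dir::AbstractString)
    # remove a previous directory (and its content) if there is one
    isdir(dir) && rm(dir, recursive=true)
    # (re)create the directory, including missing parent directories
    mkpath(dir)
    return dir
end

# Demonstration in a temporary location
tmproot = mktempdir()
testdir = joinpath(tmproot, "a", "b")
fresh_dir(testdir)                         # created, together with the parent "a"
touch(joinpath(testdir, "old-file.txt"))   # simulate the result of a previous download
fresh_dir(testdir)                         # emptied again
@show isdir(testdir), isempty(readdir(testdir))
```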

Downloading from the World Ocean Database can take some time. The result of a previous request can be downloaded directly:

In [None]:
download_check(WODdatafile, WODdatafileURL)
extractcommand = `tar -C $(datadir) -xzf $(WODdatafile)`
run(extractcommand);

## Load data
Load the data into memory and, if needed, perform an additional sub-setting.

In [None]:
# load all data under basedir as a double-precision floating point variable
obsval, obslon, obslat, obsdepth, obstime, obsid =
    WorldOceanDatabase.load(Float64, basedir, "Salinity");
@info("Number of data points: $(length(obsval))")

Check some observation IDs

In [None]:
@show obsid[1];
@show obsid[2];

With `checkobs` we get an overview of the extremal values of each dimension and variable.

In [None]:
checkobs((obslon, obslat, obsdepth, obstime), obsval, obsid)
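
To give an idea of what such an overview contains, the sketch below prints the minimum and maximum of each coordinate and of the values for a small synthetic dataset. It is not the implementation of `checkobs`: the helper `print_ranges` is hypothetical and the sample values are made up for the illustration.

```julia
using Dates

# Hypothetical helper (not DIVAnd's checkobs): print the range of each array
function print_ranges(coords, labels, values)
    for (c, label) in zip(coords, labels)
        println(rpad(label, 10), extrema(c))
    end
    println(rpad("value", 10), extrema(values))
end

# Small synthetic dataset (made-up values)
lon = [3.5, 12.1, 25.0]
lat = [36.2, 41.7, 33.9]
depth = [0.0, 10.0, 150.0]
times = [DateTime(2016,1,15), DateTime(2016,2,3), DateTime(2016,7,21)]
val = [37.2, 38.1, 38.9]

print_ranges((lon, lat, depth, times), ("longitude", "latitude", "depth", "time"), val)
```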

## Additional sub-setting
We select a subset of the data based on time and depth for plotting.     
For instance, the month can be extracted from the observation times using:

In [None]:
Dates.month.(obstime)

In [None]:
# depth range levels
depthr = [0., 20.]

# month range (January to March)
timer = [1, 3]

# additional sub-setting, discarding non-physical (zero or negative) salinities
sel = ((obsval .> 0 )
       .& (minimum(depthr) .<= obsdepth .<= maximum(depthr))
       .& (minimum(timer) .<= Dates.month.(obstime) .<= maximum(timer)));

@show typeof(sel);
@show size(sel);
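
The same selection can be tried on a handful of synthetic observations to see how the Boolean mask behaves. The values below are made up for the illustration; only the Julia standard library is needed.

```julia
using Dates

# Synthetic observations (made-up values, for illustration only)
val   = [38.2, -99.0, 37.9, 38.5, 36.8]
depth = [5.0, 10.0, 50.0, 15.0, 0.0]
times = [DateTime(2016,1,10), DateTime(2016,2,1), DateTime(2016,3,5),
         DateTime(2016,6,20), DateTime(2016,3,31)]

depthr = [0., 20.]   # depth range
timer = [1, 3]       # month range (January to March)

# each condition yields one Boolean per observation; `.&` combines them element-wise
sel = ((val .> 0)
       .& (minimum(depthr) .<= depth .<= maximum(depthr))
       .& (minimum(timer) .<= Dates.month.(times) .<= maximum(timer)))

@show sel
@show val[sel]
```

Only the first and the last observations satisfy the three conditions: the second has a negative value, the third is too deep and the fourth was measured in June.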

The new variables (ending in `sel`) are a sub-selection based on the previous criteria.

In [None]:
valsel = obsval[sel]
lonsel = obslon[sel]
latsel = obslat[sel]
depthsel = obsdepth[sel]
timesel = obstime[sel]
idssel = obsid[sel];

Let's perform the check again.

In [None]:
checkobs((lonsel,latsel,depthsel,timesel),valsel,idssel)

Number of selected data points

In [None]:
length(valsel)

## Bathymetry download 
For plotting purposes. See [06-topography](06-topography.ipynb) for details.

In [None]:
bathname = gebco16file
download_check(gebco16file, gebco16fileURL)
bathisglobal = true

# Extract the bathymetry for plotting
bx, by, b = DIVAnd.extract_bath(bathname, bathisglobal, lonr, latr);

Create a simple plot to show the domain.

In [None]:
plot_bathy(bx, by, b, xticks=-7.:4.:37, yticks=30.:3.:48.)

## Data plotting
The bathymetry is used to display a land-sea mask using the `contourf` function with 2 levels.      
The data are shown as colored circles using `scatter`.

In [None]:
f = GeoMakie.Figure()
ax = GeoAxis(f[1,1], title="Observations")
GeoMakie.contourf!(ax, bx, by, b, levels = [-1e5,0,1e5], colormap=Reverse("binary"))
GeoMakie.scatter!(ax, obslon[sel],obslat[sel]; color = obsval[sel])
f

## Check for duplicates

There are two ways to call the function `checkduplicates`:

In [None]:
?DIVAnd.Quadtrees.checkduplicates

We load a small ODV file containing data in the same domain to test the duplicate detection.     
We use the function `ODVspreadsheet.load` available within `DIVAnd.jl`.

In [None]:
download_check(smallODVsamplefile, smallODVsamplefileURL)

In [None]:
# load the file downloaded in the previous cell
obsval_ODV, obslon_ODV, obslat_ODV, obsdepth_ODV, obstime_ODV, obsid_ODV =
    ODVspreadsheet.load(Float64, [smallODVsamplefile],
                        ["Water body salinity"]; nametype = :localname);

In [None]:
length(obsval_ODV)

In [None]:
checkobs((obslon_ODV,obslat_ODV,obsdepth_ODV,obstime_ODV),obsval_ODV,obsid_ODV)

In [None]:
f = GeoMakie.Figure()
ax = GeoAxis(f[1,1], title="Observations from ODV file")
GeoMakie.contourf!(ax, bx, by, b, levels = [-1e5,0,1e5], colormap=Reverse("binary"))
GeoMakie.scatter!(ax, obslon_ODV, obslat_ODV; color = obsval_ODV)
f

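Look for duplicates, i.e. pairs of observations that are
* within 0.01 degree (about 1 km) in longitude and latitude,
* within 0.01 m in depth,
* within 1 minute in time (written as `1/(24*60)`, i.e. in days),

and with a difference in value within 0.01 psu.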

In [None]:
dupl = DIVAnd.Quadtrees.checkduplicates((obslon_ODV,obslat_ODV,obsdepth_ODV,obstime_ODV),
    obsval_ODV,(obslon,obslat,obsdepth,obstime),
    obsval,(0.01,0.01,0.01,1/(24*60)),0.01);
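
To make the criteria explicit, here is a brute-force sketch of such a search on synthetic data. It is not the quadtree-based algorithm used by DIVAnd: the function `naive_duplicates` is hypothetical, the sample values are made up, and we assume (as the value `1/(24*60)` suggests) that the time tolerance is expressed in days.

```julia
using Dates

# Hypothetical brute-force search (not DIVAnd's implementation):
# dupl[i] lists the indices j of the first dataset that are close to
# observation i of the second dataset in every dimension and in value.
function naive_duplicates(x1, v1, x2, v2, delta, deltavalue)
    lon1, lat1, depth1, time1 = x1
    lon2, lat2, depth2, time2 = x2
    dupl = [Int[] for _ in eachindex(v2)]
    for i in eachindex(v2), j in eachindex(v1)
        # time difference converted from milliseconds to days (assumed unit)
        dt_days = abs(Dates.value(time2[i] - time1[j])) / (24 * 60 * 60 * 1000)
        if abs(lon2[i] - lon1[j]) <= delta[1] &&
           abs(lat2[i] - lat1[j]) <= delta[2] &&
           abs(depth2[i] - depth1[j]) <= delta[3] &&
           dt_days <= delta[4] &&
           abs(v2[i] - v1[j]) <= deltavalue
            push!(dupl[i], j)
        end
    end
    return dupl
end

# First dataset (2 points) and second dataset (3 points), made-up values
x1 = ([10.0, 20.0], [35.0, 40.0], [5.0, 10.0],
      [DateTime(2016,1,1), DateTime(2016,2,1)])
v1 = [38.0, 37.5]
x2 = ([10.005, 15.0, 20.0], [35.0, 36.0, 40.0], [5.0, 5.0, 10.0],
      [DateTime(2016,1,1,0,0,30), DateTime(2016,1,1), DateTime(2016,2,1,1,0,0)])
v2 = [38.005, 38.0, 37.5]

dupl = naive_duplicates(x1, v1, x2, v2, (0.01, 0.01, 0.01, 1/(24*60)), 0.01)
@show dupl
```

In this toy case only the first point of the second dataset is flagged: the second point is too far away, and the third one matches in position and value but was measured one hour later.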

In [None]:
size(obsval) == size(dupl)

* `dupl` is an array of the same length as `obsval`.
* If the i-th element of `dupl` is an empty list, then the i-th element of `obsval` is probably not a duplicate.
* Otherwise, the i-th element of `obsval` is probably a duplicate of the element(s) of `obsval_ODV` with the indices `dupl[i]`.
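
The toy example below shows how such an array can be read and used to merge two datasets. All values, including the content of `dupl`, are made up for the illustration, and the variable names `val_WOD`, `val_ODV` and `val_combined` are hypothetical.

```julia
# Toy result: one entry per WOD observation (made-up values)
dupl = [Int[], [3], Int[], [1, 2]]

val_WOD = [38.1, 37.5, 38.4, 36.9]
val_ODV = [36.9, 36.9, 37.5]

# WOD observations that have at least one duplicate candidate in the ODV data
index = findall(.!isempty.(dupl))

# WOD observations without any duplicate candidate
newpoints = findall(isempty.(dupl))

for i in index
    println("WOD value ", val_WOD[i], " matches ODV value(s) ", val_ODV[dupl[i]])
end

# keep all the ODV data and only the new WOD points
val_combined = [val_ODV; val_WOD[newpoints]]
@show val_combined
```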

In [None]:
dupl[1]

To get a list of possible duplicates, we check for the elements of `dupl` that are not empty.

In [None]:
index = findall(.!isempty.(dupl))

Number of duplicate candidates

In [None]:
length(index)

Check the first reported duplicate

In [None]:
if length(index) > 0
    index_WOD = index[1]
else
    @info("No duplicate detected")
end

Show its coordinates and value in the WOD dataset:

In [None]:
obslon[index_WOD],obslat[index_WOD],obsdepth[index_WOD],obstime[index_WOD],obsval[index_WOD]

They are quite close to the data point(s) of the ODV file with the index:

In [None]:
dupl[index_WOD]

In [None]:
index_ODV = dupl[index_WOD][1]

In [None]:
obslon_ODV[index_ODV],obslat_ODV[index_ODV],
obsdepth_ODV[index_ODV],obstime_ODV[index_ODV],
obsval_ODV[index_ODV]

Indeed, it is quite likely that they are duplicates.

Combine the two datasets, retaining only the new points from WOD.

In [None]:
newpoints = findall(isempty.(dupl));
@show length(newpoints)

In [None]:
obslon_combined   = [obslon_ODV;   obslon[newpoints]];
obslat_combined   = [obslat_ODV;   obslat[newpoints]];
obsdepth_combined = [obsdepth_ODV; obsdepth[newpoints]];
obstime_combined  = [obstime_ODV;  obstime[newpoints]];
obsval_combined   = [obsval_ODV;   obsval[newpoints]];
obsids_combined   = [obsid_ODV;   obsid[newpoints]];

## CMEMS data download
The function `CMEMS.download` works in a similar way.

In [None]:
?CMEMS.download

## Exercise
1. Download data from CMEMS in the same domain and for the same time period.
2. Plot the data location on a map along with the WOD observations.
3. Check for the duplicates between the two datasets.