# How to read data files?
This notebook describes or points to modules for reading data in different file formats and from different sources.       
The $^\star$ symbol denotes functions or tools available in `DIVAnd.jl`.

| Format        | Tool           | 
| ------------- |:-------------:| 
| Delimiter-separated values | [readdlm](https://docs.julialang.org/en/stable/stdlib/io-network/#Base.DataFmt.readdlm-Tuple{Any,Char,Type,Char}), [CSV](https://juliadata.github.io/CSV.jl/stable/) |
| NetCDF        | [NCDatasets.jl](https://github.com/Alexander-Barth/NCDatasets.jl) | 
| ODV  $^\star$ | [ODVspreadsheet.jl](https://github.com/gher-uliege/DIVAnd.jl/blob/master/src/ODVspreadsheet.jl) |
| ODV netCDF $^\star$   | [NCODV.jl](https://github.com/gher-uliege/DIVAnd.jl/blob/master/src/NCODV.jl) | 
| GEBCO bathymetry $^\star$ | [load_bath.jl](https://github.com/gher-uliege/DIVAnd.jl/blob/master/src/load_bath.jl)|
| Big files $^\star$    | [loadbigfile](https://github.com/gher-uliege/DIVAnd.jl/blob/master/src/load_obs.jl) |
| NetCDF WOD $^\star$   | [loadobs](https://github.com/gher-uliege/DIVAnd.jl/blob/master/src/load_obs.jl) |
| Mat files     | [MAT.jl](https://github.com/JuliaIO/MAT.jl)| 
| GRIB          | [GRIB.jl](https://github.com/weech/GRIB.jl)|
| GeoJSON       | [GeoJSON.jl](https://github.com/JuliaGeo/GeoJSON.jl)|
| GeoTIFF       | [GeoArrays.jl](https://github.com/evetion/GeoArrays.jl)|

In [23]:
using DIVAnd
using DelimitedFiles
using CSV
include("../config.jl")

"https://github.com/weech/GRIB.jl/raw/master/test/samples/regular_latlon_surface.grib2"

## Delimiter-separated values files 
These include comma-separated values (CSV) and tab-separated values, among others.  
We show an example with the NAO indices that we obtain from the [Climate Data Guide](https://climatedataguide.ucar.edu/) website.

In [2]:
download_check(naodatafile, naodatafileURL)

"../data/nao_station_annual.txt"

If we use the function without options, the number of columns is deduced from the header, which leads to empty data columns:

In [4]:
dataNAO = DelimitedFiles.readdlm(naodatafile)

155×5 Matrix{Any}:
     "Hurrell"    "Station-Based"  "Annual"  "NAO"  "Index"
 1865           -0.66              ""        ""     ""
 1866           -0.2               ""        ""     ""
 1867           -3.04              ""        ""     ""
 1868            4.14              ""        ""     ""
 1869            0.42              ""        ""     ""
 1870           -2.77              ""        ""     ""
 1871           -0.85              ""        ""     ""
 1872           -0.83              ""        ""     ""
 1873            0.17              ""        ""     ""
 1874            2.32              ""        ""     ""
 1875           -2.1               ""        ""     ""
 1876           -1.85              ""        ""     ""
    ⋮                                               
 2007            1.35              ""        ""     ""
 2008            1.72              ""        ""     ""
 2009            0.72              ""        ""     ""
 2010           -5.96              ""      

We therefore skip the header line using the option `skipstart`:

In [5]:
dataNAO = DelimitedFiles.readdlm(naodatafile, skipstart=1);
dataNAO

154×2 Matrix{Float64}:
 1865.0  -0.66
 1866.0  -0.2
 1867.0  -3.04
 1868.0   4.14
 1869.0   0.42
 1870.0  -2.77
 1871.0  -0.85
 1872.0  -0.83
 1873.0   0.17
 1874.0   2.32
 1875.0  -2.1
 1876.0  -1.85
 1877.0  -0.24
    ⋮    
 2007.0   1.35
 2008.0   1.72
 2009.0   0.72
 2010.0  -5.96
 2011.0   2.95
 2012.0  -0.25
 2013.0   0.9
 2014.0   2.97
 2015.0   4.09
 2016.0   1.7
 2017.0   1.14
 2018.0   2.83
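
The effect of `skipstart` can be reproduced without downloading anything. The following is a minimal sketch that writes a small file laid out like the NAO file (the header and the first three rows shown above) and then splits the resulting matrix into two vectors. The names `tmpfile`, `data`, `years` and `nao` are chosen here for illustration only.

```julia
using DelimitedFiles

# Write a tiny two-column file with a one-line header,
# laid out like the NAO file used above
tmpfile = tempname()
open(tmpfile, "w") do io
    println(io, "Hurrell Station-Based Annual NAO Index")
    println(io, "1865 -0.66")
    println(io, "1866 -0.20")
    println(io, "1867 -3.04")
end

# Skip the header so that every remaining field parses as a number
data = readdlm(tmpfile, skipstart=1)

# Split the matrix into a vector of years and a vector of index values
years = Int.(data[:, 1])
nao = data[:, 2]
@show years nao
```

Because all remaining fields are numeric, `readdlm` returns a `Matrix{Float64}` instead of a `Matrix{Any}`, which is why the years need an explicit conversion to integers.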

**Note:** if you have to process files in which the decimal separator is a comma instead of a dot, specific options are available in the module [`CSV`](https://juliadata.github.io/CSV.jl/stable/).
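
As a hedged illustration of such options, the sketch below parses a small in-memory text with `CSV.File`, using its `delim` and `decimal` keywords. The semicolon-delimited layout and the column names `year` and `value` are assumptions made for this example. The two data rows reuse values from the NAO file, rewritten with decimal commas.

```julia
using CSV

# Semicolon-delimited text with decimal commas
# (layout and column names are assumptions for this example)
content = """
year;value
1865;-0,66
1868;4,14
"""

# `delim` sets the field separator, `decimal` the decimal mark
f = CSV.File(IOBuffer(content); delim=';', decimal=',')
@show f.year f.value
```
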

## NetCDF

The two main modules available for reading and writing netCDF files are:
1. [NetCDF.jl](https://github.com/JuliaGeo/NetCDF.jl)
2. [NCDatasets.jl](https://github.com/Alexander-Barth/NCDatasets.jl)

For this workshop we will mainly use `NCDatasets.jl`, described in this [notebook](../1-Intro/03-netCDF.ipynb).

### Bathymetry
The General Bathymetric Chart of the Oceans ([GEBCO](https://www.gebco.net/)), distributed in netCDF, can be read directly with `DIVAnd` using the function `load_bath`.  

First make sure we have a bathymetry file.

In [6]:
bathname = gebco16file
download_check(gebco16file, gebco16fileURL)

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mFile already downloaded


Then we have to define the grid on which we need the bathymetry and apply the function.

In [7]:
lonr = -10:0.5:36.
latr = 37:0.5:48
bx,by,b = load_bath(bathname,true,lonr,latr);

`bx` and `by` are the same as `lonr` and `latr`.  
`b` contains the bathymetry values.

A complete example is provided in the notebook [06-topography.ipynb](./06-topography.ipynb). 

## ODV spreadsheet
ODV spreadsheets constitute one of the standard formats defined in [SeaDataNet](https://www.seadatanet.org/).        
In `DIVAnd`, we provide:
* [ODVspreadsheet.jl](https://github.com/gher-uliege/DIVAnd.jl/blob/master/src/ODVspreadsheet.jl) designed to read such format and
* [NCODV.jl](https://github.com/gher-uliege/DIVAnd.jl/blob/master/src/NCODV.jl) to read the ODV netCDF files.

An example is provided in the notebook [09-ODV-data-import.ipynb](./09-ODV-data-import.ipynb).

## Big files
The so-called big files are intermediate files used by DIVA and DIVAnd. The format is rather simple: a tab-separated file containing the following variables:
1. longitude,
2. latitude,
3. field value (e.g., temperature, salinity, chlorophyll concentration, ...), 
4. depth,
5. time,
6. measurement identifier.

In the module [load_obs.jl](https://github.com/gher-uliege/DIVAnd.jl/blob/master/src/load_obs.jl), the function `loadbigfile` reads this file format.  
In the next cell we download a *big file* containing salinity measurements (also used in other examples) and read it using `loadbigfile`.

In [8]:
fname = salinitybigfile
download_check(salinitybigfile, salinitybigfileURL)


obsval,obslon,obslat,obsdepth,obstime,obsid = loadbigfile(fname);
@show(length(obsval));

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mFile already downloaded
[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mLoading data from 'big file' ../data/Salinity.bigfile


length(obsval) = 139230
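
Once loaded, the vectors can be subset with ordinary broadcasting. The sketch below uses short synthetic vectors (hypothetical values, not taken from the salinity file) in place of the output of `loadbigfile`. It assumes that `obstime` holds `DateTime` values and that depths are expressed in metres.

```julia
using Dates

# Synthetic stand-ins for the vectors returned by `loadbigfile`
# (hypothetical values, for illustration only)
obsval = [38.1, 38.4, 37.9, 38.6]
obsdepth = [5.0, 50.0, 8.0, 200.0]
obstime = [DateTime(2001, 1, 15), DateTime(2001, 7, 2),
           DateTime(2002, 8, 20), DateTime(2003, 2, 1)]

# Keep the observations shallower than 10 m and taken between June and August
sel = (obsdepth .< 10) .& in.(Dates.month.(obstime), Ref(6:8))
@show obsval[sel]
```

The same boolean vector `sel` can be applied to the other vectors (`obslon`, `obslat`, ...) so that they stay aligned.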


## Mat files
We use the same .mat file as in [04-OI-variational-analysis-introduction](../1-Intro/04-OI-variational-analysis-introduction.ipynb).

In [11]:
using MAT

In [16]:
download_check(danfile, danfileURL)
mf = matopen(danfile);

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mFile already downloaded


We can get a list of the variables stored in the file:

In [17]:
varnames = keys(mf)

KeySet for a Dict{String, Int64} with 3 entries. Keys:
  "f"
  "Fe"
  "F"

To load one of them, use `read`:

In [18]:
var1 = read(mf, "f");
@show sizeof(var1);

sizeof(var1) = 20000


When we're done, don't forget to close the file (especially if we process a large number of files).

In [19]:
close(mf)
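
An alternative that avoids a forgotten `close` is the do-block form of `matopen`, or `matread`, which loads all variables in one call. The following is a minimal sketch under the assumption that a small synthetic file is acceptable for illustration. The file name `tmpmat` and its content are made up here and are unrelated to `danfile`.

```julia
using MAT

# Write a small synthetic .mat file so that the example is self-contained
tmpmat = tempname() * ".mat"
matwrite(tmpmat, Dict("f" => [1.0 2.0; 3.0 4.0]))

# The file is closed when the block exits, even if `read` throws
fmat = matopen(tmpmat) do mf
    read(mf, "f")
end

# `matread` loads every variable into a Dict in a single call
vars = matread(tmpmat)
@show fmat keys(vars)
```
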

## GRIB files

In [21]:
using Pkg
Pkg.add("GRIB")
using GRIB

[32m[1m   Resolving[22m[39m package versions...
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Project.toml`
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.11/Manifest.toml`


In [24]:
download_check(gribfile, gribfileURL)

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mFile already downloaded


The module `GRIB.jl` works only on Linux and Mac.

In [25]:
if !Sys.iswindows()
    GribFile(gribfile) do f
       # Get the first message from f
       msg = Message(f)
       lons, lats, values = data(msg)
       @info(length(lons))
    end
end

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39m496


## GeoJSON
The sample file has been generated and downloaded from https://geojson.io.

In [32]:
using GeoJSON

In [33]:
download_check(geojsonfile, geojsonfileURL)

jsonbytes = read(geojsonfile);
fc = GeoJSON.read(jsonbytes)
@show typeof(fc)

typeof(fc) = GeoJSON.FeatureCollection{2, Float32}

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mFile already downloaded





GeoJSON.FeatureCollection{2, Float32}

The GeoJSON file contains two features, each of them consisting of a 2D Polygon, from which we can extract the coordinates.

In [43]:
polygon1 = fc[1]
polygon1.geometry[1]

11-element Vector{Tuple{Float32, Float32}}:
 (27.363369, 41.221752)
 (30.815773, 40.19586)
 (34.206352, 40.975384)
 (41.713684, 39.8079)
 (43.453205, 41.83671)
 (39.93689, 44.866375)
 (40.599613, 48.618225)
 (34.362835, 47.074516)
 (30.482286, 48.15362)
 (25.860664, 45.09467)
 (27.363369, 41.221752)
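
The ring printed above can be split into separate longitude and latitude vectors with plain broadcasting. The sketch below copies the eleven vertices by hand so that it runs without `GeoJSON.jl`. It assumes the usual GeoJSON ordering (longitude first, latitude second), and the names `ring`, `lon`, `lat` and `isclosed` are chosen here for illustration.

```julia
# Vertices of the first polygon, copied from the output above
ring = Tuple{Float32, Float32}[
    (27.363369, 41.221752), (30.815773, 40.19586), (34.206352, 40.975384),
    (41.713684, 39.8079), (43.453205, 41.83671), (39.93689, 44.866375),
    (40.599613, 48.618225), (34.362835, 47.074516), (30.482286, 48.15362),
    (25.860664, 45.09467), (27.363369, 41.221752),
]

# Longitude is the first element of each tuple, latitude the second
lon = first.(ring)
lat = last.(ring)

# A polygon ring is closed when its first and last vertices coincide
isclosed = first(ring) == last(ring)
@show extrema(lon) extrema(lat) isclosed
```
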

In [44]:
include("../config.jl")

"https://dox.uliege.be/index.php/s/tz9lCANaNIj3iG2/download"

## GeoTIFF
The image was extracted from https://worldview.earthdata.nasa.gov/

In [56]:
using GeoArrays

In [49]:
download_check(geotifffile, geotifffileURL)

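geoarray = GeoArrays.read(geotifffile)
# Coordinates of every pixel of the raster
coordinates = collect(GeoArrays.coords(geoarray))
# Latitudes vary along the second dimension; their order is reversed,
# consistent with the row flip applied to `img` below
lats = [cc[2] for cc in reverse(coordinates[1,:])]
# Longitudes vary along the first dimension
lons = [cc[1] for cc in coordinates[:,1]]
# First band, transposed and flipped along its first dimension
img = reverse(geoarray.A[:,:,1]', dims=1);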

[36m[1m[ [22m[39m[36m[1mInfo: [22m[39mFile already downloaded
