# Generate your input dataset for OSnet

This notebook guides you step by step through generating the input dataset for your study area.

To adapt the notebook to your study area, you need to edit the "configuration.yaml" file. 
In this file, you set the paths to the input data as well as the output paths where the data generated by the notebook will be saved. 

This notebook is already connected to all the other notebooks needed to produce your dataset. You don't need to modify anything outside the configuration file.

## Prerequisites

Before you start processing data for your OSnet study, this is what you will need:

### Software
The OSnet machinery is written in Python 3.8 and mainly consists of Jupyter notebooks to run, so make sure that you have an appropriate Python environment. 
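
A quick way to confirm that your interpreter is recent enough is sketched below. This is only an illustration: the helper name `python_is_supported` is hypothetical, and treating 3.8 as a minimum version (rather than an exact requirement) is an assumption.

```python
import sys

# Version mentioned in the text; treating it as a minimum is an assumption.
REQUIRED = (3, 8)


def python_is_supported(version_info=sys.version_info, required=REQUIRED):
    """Return True when the (major, minor) version is at least `required`."""
    return tuple(version_info[:2]) >= tuple(required)


print("Python", sys.version.split()[0], "supported:", python_is_supported())
```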


### Data
If you want to be able to study any region, you should download all the global datasets listed below. This is not mandatory, but it ensures that you have all the necessary datasets on the appropriate grid and in the appropriate format for the OSnet procedure to work smoothly.
If you can't download all these files, please contact Ifremer so that we can find a solution to assemble your training set.

- **ETOPO Bathymetry**:  
    We use the cell-registered bedrock netCDF file ``ETOPO1_Bed_c_gmt4.grd`` (about 890 MB)  
    *source*: https://www.ngdc.noaa.gov/mgg/global
- **DUACS Altimetry**:  
    We use global daily altimetric files (10,016 files, 7.3 MB each, about 9.5 GB in total), stored on disk in the form:
    ``./<YYYY>/<DAYOFYEAR>/dt_global_allsat_phy_l4_<YYYYMMDD>_20190101.nc``  
    *source*: https://resources.marine.copernicus.eu/product-detail/SEALEVEL_GLO_PHY_L4_MY_008_047/INFORMATION
- **CNES-CLS Mean Dynamic Topography**:  
    We use the CNES-CLS18 MDT file (95 MB)  
    *source*: https://resources.marine.copernicus.eu/product-detail/SEALEVEL_GLO_PHY_MDT_008_063/INFORMATION
- **ESA CCI and C3S Satellite Sea Surface Temperature**:  
    We use global daily SST files (14,444 files, 16 MB each, about 13.45 GB in total), stored on disk in the form:
    ``./<YYYY>/<DAYOFYEAR>/<YYYYMMDD>120000-ESACCI-L4_GHRSST-SSTdepth-OSTIA-GLOB_CDR2.1-v02.0-fv01.0.nc``  
    *source*: https://resources.marine.copernicus.eu/product-detail/SST_GLO_SST_L4_REP_OBSERVATIONS_010_024/INFORMATION
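
To check that your local copy follows the layouts listed above, you can build the expected path of a daily file from its date. The sketch below is only an illustration: the helper names `duacs_path` and `sst_path` and the root folder names are hypothetical, and it assumes that `<DAYOFYEAR>` is the zero-padded day of the year (001 to 366). The trailing `_20190101` is copied as-is from the DUACS pattern.

```python
from datetime import date
from pathlib import PurePosixPath


def duacs_path(root, day):
    """Expected path of the DUACS daily file for `day`."""
    name = f"dt_global_allsat_phy_l4_{day:%Y%m%d}_20190101.nc"
    # Assumption: <DAYOFYEAR> is the zero-padded day of year, i.e. strftime's %j.
    return PurePosixPath(root) / f"{day:%Y}" / f"{day:%j}" / name


def sst_path(root, day):
    """Expected path of the ESA CCI / C3S daily SST file for `day`."""
    name = (
        f"{day:%Y%m%d}120000-ESACCI-L4_GHRSST-SSTdepth-OSTIA-GLOB_CDR2.1-v02.0-fv01.0.nc"
    )
    return PurePosixPath(root) / f"{day:%Y}" / f"{day:%j}" / name


# "duacs" and "sst" are placeholder root folders.
print(duacs_path("duacs", date(2010, 2, 1)))
print(sst_path("sst", date(2010, 2, 1)))
```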

## Presentation of configuration.yaml

The configuration file is composed of 9 sections: study boxes and ocean boxes, Years, SLA, BATHYMETRY, SST, PI, CORA, MDT and global configuration. 

We will configure each of these sections step by step.

### Study box and time period 


The first step is to configure two boxes:

1. The "study" box is the actual region you want to study.

1. The "ocean" box is a larger box than the *study* box. The goal is to be able to adjust the *study* box without having to regenerate all the input data (SLA, SST and bathymetry).

Here is an illustration to better understand the difference:

![image](img/global_box_osnet.PNG)
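
The relation between the two boxes can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the `Box` class, its field names and the coordinates are hypothetical and are not taken from ``configuration.yaml``, it assumes the *study* box should lie entirely inside the *ocean* box, and it does not handle boxes that cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A longitude/latitude box in degrees (hypothetical helper)."""
    lon_min: float
    lon_max: float
    lat_min: float
    lat_max: float

    def contains(self, other):
        """True when `other` lies entirely inside this box (edges included)."""
        return (
            self.lon_min <= other.lon_min
            and other.lon_max <= self.lon_max
            and self.lat_min <= other.lat_min
            and other.lat_max <= self.lat_max
        )


# Illustrative coordinates only.
ocean = Box(lon_min=-80.0, lon_max=-30.0, lat_min=20.0, lat_max=55.0)
study = Box(lon_min=-75.0, lon_max=-40.0, lat_min=25.0, lat_max=50.0)

if not ocean.contains(study):
    raise ValueError("the study box should fit inside the ocean box")
```

With such a check, you can move the *study* box freely and only regenerate the inputs when it no longer fits inside the *ocean* box.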

We then define the study period in the YEARS section. 
The attribute ``YEARS_1`` is the start date of the dataset and ``YEARS_2`` is the end date. 
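
To get a feeling for how many daily files a given period involves, you can enumerate the days it covers. The sketch below assumes that ``YEARS_1`` and ``YEARS_2`` are integer years and that both are included in the period; the helper name `daily_dates` is hypothetical.

```python
from datetime import date, timedelta


def daily_dates(year_start, year_end):
    """Yield every day from 1 January of year_start to 31 December of year_end."""
    if year_start > year_end:
        raise ValueError("YEARS_1 must not be later than YEARS_2")
    day = date(year_start, 1, 1)
    last = date(year_end, 12, 31)
    while day <= last:
        yield day
        day += timedelta(days=1)


# Illustrative period; replace with the values from your configuration file.
n_days = sum(1 for _ in daily_dates(2010, 2011))
print("daily files expected per product:", n_days)
```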

### OSnet input datasets

We are now going to generate the files needed by the neural networks.

To do so, we will follow the order of the sections in the "configuration.yaml" file.

#### Bathymetry

We have 3 elements to configure in ``configuration.yaml``. The first one, ``BATHYMETRIE_path``, is the path to the product (https://www.ngdc.noaa.gov/mgg/global/). The second one is the folder where the files generated by the ``01_Bathymetrie.ipynb`` notebook will be saved. The last one is the name you want to give to the output file. Don't forget the ``.nc`` extension at the end.
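
A forgotten extension is easy to catch before running anything. The sketch below shows one possible check; the helper name `output_file` and the folder and file names are hypothetical.

```python
from pathlib import Path


def output_file(folder, file_name):
    """Join the output folder and file name, insisting on the .nc extension."""
    if not file_name.endswith(".nc"):
        raise ValueError(f"{file_name!r} should end with .nc")
    return Path(folder) / file_name


# Placeholder names; use the values from your configuration file.
print(output_file("output/bathymetry", "bathymetry.nc"))
```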

We are now ready to generate the first file! Just launch the cell below. 

Check that the file has been created in the path you have configured.

#### SLA

SLA also requires 3 elements in the configuration file. The first element to configure is the path to the product (https://resources.marine.copernicus.eu/product-detail/SEALEVEL_GLO_PHY_L4_MY_008_047/INFORMATION). 
The other two attributes define where the generated files are saved. 

Processing SLA requires the creation of many files. Be sure to configure the path before launching the notebook. 
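
If the output folder does not exist yet, you can create it beforehand. This is a minimal sketch: the helper name `ensure_output_dir` and the folder names are hypothetical, and the demonstration runs in a temporary directory so that nothing is written to your real data folders.

```python
import tempfile
from pathlib import Path


def ensure_output_dir(path):
    """Create `path` (and any missing parents) if needed and return it as a Path."""
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    return out


# Demonstration in a throwaway directory; "osnet/sla" is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    sla_dir = ensure_output_dir(Path(tmp) / "osnet" / "sla")
    created = sla_dir.is_dir()

print("output folder created:", created)
```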

We are now ready to generate the SLA files! Just launch the cell below. 

Note: if you get a ``PerformanceWarning``, you can ignore it; it is only a warning. 
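
If the message clutters the output, Python's standard `warnings` module can silence one warning category around a call. In the sketch below, the `PerformanceWarning` class and the `noisy_step` function are local stand-ins (assumptions): in practice you would import the warning class from whichever library emits it.

```python
import warnings


class PerformanceWarning(Warning):
    """Local stand-in for the library's warning class (hypothetical)."""


def noisy_step():
    """Placeholder for a processing step that emits the warning."""
    warnings.warn("example performance warning", PerformanceWarning)
    return "done"


# Ignore only this category, and only inside the block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=PerformanceWarning)
    result = noisy_step()

print(result)
```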

The notebook displays a map that lets you view the SLA over a year. Note that the SLA is generated from the ocean box. 

#### SST

Let's go back to the configuration file to modify the attribute values according to your situation. 

Once again we have to configure 3 elements. The first one is the path to the product (https://resources.marine.copernicus.eu/product-detail/SST_GLO_SST_L4_REP_OBSERVATIONS_010_024/INFORMATION). As before, the other two attributes are used for the output path. 

Processing the SST requires the creation of many files. Be sure to configure the path before launching the notebook. 

We are now ready to generate the SST files! Just launch the cell below. 

Just like in the SLA notebook, you will get a visualization of the region extracted from the ocean box. 

### Preprocessing of T/S profiles 

To preprocess the T/S profiles, you must configure the 3 attributes of the CORA section, following the same logic as the other sections. The last element is the name of the file that will be generated.

You will also have to configure the MDT section, which contains only one attribute: the path to the product (https://cds.climate.copernicus.eu/cdsapp#!/dataset/satellite-sea-level-global?tab=overview).

To finish the configuration, you must choose the name of the file that will hold the dataset. To do so, go to the global configuration section and modify the attribute ``OUTPUT_FILE_NAME``.
Don't forget the ``.nc`` extension at the end.

Now you can use the cell below to generate the dataset file.

***

**Warning**

During the generation of the dataset file, you may get NaNs. If this is the case, you will get two graphs like the ones below showing the distribution of the NaNs and where they are located.

![image](img/Nan.png)

The NaNs depend on the area you are in, so you must resolve them before continuing. If the dataset contains NaNs, no prediction will be reliable. 

***
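
To make the warning above concrete, counting NaNs boils down to the check sketched here. It uses plain Python lists as a stand-in for the dataset variables (an assumption: the notebook itself works on the generated file), and the helper name `nan_report` is hypothetical.

```python
import math


def nan_report(values):
    """Return (count, fraction) of NaN entries in a flat sequence of floats."""
    values = list(values)
    count = sum(1 for v in values if math.isnan(v))
    fraction = count / len(values) if values else 0.0
    return count, fraction


# Illustrative values only.
sst_sample = [18.2, float("nan"), 17.9, float("nan")]
count, fraction = nan_report(sst_sample)
print(f"{count} NaN values ({fraction:.0%} of the sample)")
```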

### View a part of your dataset

A notebook is provided to visualize, in map form, the 4 main parameters of your dataset: bathymetry, SLA, SST and MDT.

To get the visualization run the cell below. 

Congratulations, your dataset is ready!