<style type="text/css">.tg  {border-collapse:collapse;border-spacing:0;}.tg td{border-color:rgb(16, 137, 182);border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;}.tg th{border-color:rgb(16, 137, 182);border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}.tg .tg-73oq{border-color:rgb(10, 89, 162);text-align:left;vertical-align:middle}.tg .tg-42lt{border-color:rgb(10, 89, 162);text-align:center;vertical-align:middle}.tg .tg-5qt9{font-size:small;text-align:left;vertical-align:middle}</style><table class="tg"><thead>  <tr>    <th class="tg-73oq"><img src="https://raw.githubusercontent.com/euroargodev/argopy/master/docs/_static/argopy_logo_long.png" alt="Argopy logo" width="120" height="60"></th>    <th class='tg-42lt'><h1>GDAC File system</h1></th>  </tr></thead><tbody>  <tr>    <td class="tg-5qt9" colspan="2"><span style="font-weight:bold">Author :</span> <a href="//annuaire.ifremer.fr/cv/17182" target="_blank" rel="noopener noreferrer">G. Maze</a></td>  </tr>  <tr>    <td class="tg-5qt9" colspan="2">🏷️ This notebook is compatible with Argopy versions &gt;= <a href="https://argopy.readthedocs.io/en/v1.3.1" target="_blank" rel="noopener noreferrer">v1.3.1</a></td>  </tr>  <tr>    <td class="tg-5qt9" colspan="2">© <a href="https://github.com/euroargodev/argopy-training/blob/main/LICENSE" target="_blank" rel="noopener noreferrer">European Union Public Licence (EUPL) v1.2</a>, see at the bottom of this notebook for more.</td>  </tr></tbody></table>
**Description:**

This notebook describes how to use the generic GDAC file system provided by Argopy: [gdacfs (↗)](https://argopy.readthedocs.io/en/v1.3.1/generated/argopy.gdacfs.html)

This is an illustration of the [documentation section on gdacfs (↗)](https://argopy.readthedocs.io/en/v1.3.1/advanced-tools/stores/gdac_filesystem.html).


**Table of Contents**
  - [Why a GDAC file system in Argopy ?](#why-a-gdac-file-system-in-argopy-?)
  - [Creating a GDAC file system](#creating-a-gdac-file-system)
    - [🔍 Pro tip](#🔍-pro-tip)
  - [Usage: low-level methods](#usage:-low-level-methods)
    - [🛟 NOTE](#🛟-note)
    - [🔍 Pro tip](#🔍-pro-tip)
    - [Opening files](#opening-files)
  - [Usage: higher-level methods](#usage:-higher-level-methods)
    - [✏️ EXERCISE](#✏️-exercise)
    - [🔍 Pro tip](#🔍-pro-tip)
    - [✏️ EXERCISE](#✏️-exercise)
***

Let's start with the filesystem import:

In [None]:
# Shut down some warning messages for clarity
import warnings
warnings.filterwarnings("ignore")

# Import the Argopy features we want to work with in this notebook:
from argopy.stores import gdacfs

To keep cell outputs from getting too large, we won't display xarray object attributes:

In [None]:
import xarray as xr
xr.set_options(display_expand_attrs=False)

### Why a GDAC file system in Argopy ?

**The problem**  
In an Argo data analysis workflow, opening, loading or reading a file from a given GDAC requires handling the file system protocol of that GDAC location, or `host`. Reaching the http GDAC requires handling HTTP requests, the ftp GDAC requires an FTP connection, and so on. This logic can be complex and calls for low-level expertise that may be cumbersome to acquire, since it is probably outside the scope of your Argo data analysis.

Typically, to open a local file in Python, we use the `open` function like this:
```python
with open('some_file.txt', 'r') as f:
    f.readlines()
```    

but what if the file is located on an HTTP server?

Similarly, to `ls` or `glob` files in a given location, the Python implementation will depend on the file location, local or remote.
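
To make the dependency on location concrete, here is a minimal sketch (not part of Argopy) that only identifies which protocol a location implies. The helper name `protocol_of` and the local path are assumptions made for illustration; hand-written code would then need a different client for each result.

```python
from urllib.parse import urlsplit


def protocol_of(location: str) -> str:
    """Return the URL scheme of a location, or 'file' for a plain path."""
    scheme = urlsplit(location).scheme
    return scheme or "file"


# Each protocol would call for a different client library in hand-written code
for host in (
    "https://data-argo.ifremer.fr",
    "ftp://ftp.ifremer.fr/ifremer/argo",
    "/home/me/argo_copy",  # hypothetical local copy
):
    print(f"{host} -> {protocol_of(host)}")
```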

Even at a higher level, typically with xarray, reading a remote netcdf file is not trivial. For instance, this command:
```python
xr.open_dataset('https://data-argo.ifremer.fr/dac/coriolis/6903091/6903091_prof.nc')
```
will raise an error.

**Bricks for a solution**  
From the very beginning, **Argopy** has implemented a *separation of concerns* approach to file access. Backed by the very powerful [fsspec (↗)](https://filesystem-spec.readthedocs.io/en/latest) library, **Argopy** supports multiple protocols for the methods used repeatedly in an Argo procedure (e.g. `open_dataset`, `read_csv`, `open_json`, etc.).

On the other hand, the Argo dataset, which is a collection of files, is organised the same way on every GDAC host, independently of the protocol (file for your own local copy, http or ftp for the French or US servers, and more recently the s3 server).

This means, for instance, that all GDAC hosts have a `dac` folder, and that netcdf file paths follow similar patterns, like `dac/<DAC>/<WMO>/<WMO>_prof.nc`. This is the reason why Argo indexes contain *relative* paths toward files.
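
As a small illustration of this layout, the sketch below builds the relative path of a multi-profile file from a DAC name and a WMO number, then joins it to two of the host roots used in this notebook. The function name `profile_file_path` is hypothetical and is not an Argopy API.

```python
def profile_file_path(dac: str, wmo: int) -> str:
    """Relative path of a float multi-profile file: dac/<DAC>/<WMO>/<WMO>_prof.nc"""
    return f"dac/{dac}/{wmo}/{wmo}_prof.nc"


relative = profile_file_path("coriolis", 6903091)

# The same relative path is valid whatever the GDAC root
for root in ("https://data-argo.ifremer.fr", "ftp://ftp.ifremer.fr/ifremer/argo"):
    print(f"{root}/{relative}")
```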

**Implementation**  
To take advantage of the smart GDAC architecture and to relieve you of the burden of handling multiple protocols, **Argopy** provides a low-level GDAC store: a prefixed directory file system powered by the **Argopy** internal file systems available for each of the GDAC protocols (file, http, ftp, s3).

Therefore, you can separate file access logic from host protocol handling in your workflow, **making your procedure agnostic to the GDAC host**.
This means that you can develop locally and move to production by simply changing one option in your procedure, to point to the appropriate GDAC host.


We call this feature the [gdacfs (↗)](https://argopy.readthedocs.io/en/v1.3.1/generated/argopy.gdacfs.html) store.

This class will return one of the **Argopy** file systems (stores.filestore, stores.httpstore, stores.ftpstore or stores.s3store) with a prefixed directory, so that you don’t have to include the GDAC root path to access files.
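
To illustrate what "prefixed directory" means, here is a deliberately simplified sketch for local files only. The class `PrefixedLocalStore` is hypothetical and is not how Argopy implements its stores; it only shows the idea of storing the root once and resolving every relative path against it.

```python
import tempfile
from pathlib import Path


class PrefixedLocalStore:
    """Hypothetical stand-in for a prefixed store (local files only)."""

    def __init__(self, root: str):
        self.root = Path(root)

    def full_path(self, path: str) -> str:
        # Resolve a GDAC-relative path against the stored root
        return str(self.root / path)

    def exists(self, path: str) -> bool:
        return (self.root / path).exists()

    def open(self, path: str, mode: str = "rb"):
        return open(self.root / path, mode)


# Build a tiny fake GDAC tree in a temporary folder to try it out
with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "dac" / "coriolis" / "6903091"
    target.mkdir(parents=True)
    (target / "6903091_prof.nc").write_bytes(b"not a real netcdf file")

    store = PrefixedLocalStore(root)
    found = store.exists("dac/coriolis/6903091/6903091_prof.nc")
    with store.open("dac/coriolis/6903091/6903091_prof.nc") as f:
        content = f.read()

print(found, content)
```

The calling code only ever mentions `dac/...` paths, so pointing to another root changes a single argument.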

### Creating a GDAC file system

To create a [gdacfs (↗)](https://argopy.readthedocs.io/en/v1.3.1/generated/argopy.gdacfs.html) instance, you just need to indicate which GDAC host to use. 
By default, it points toward the Ifremer https server, which we pass explicitly here:

In [None]:
fs = gdacfs("https://data-argo.ifremer.fr")
fs

The output shows that we get a store implementation for the http protocol.

#### 🔍 Pro tip

The mapping from shortcut names to GDAC server addresses is accessible with the `shortcut2gdac` utility:

In [None]:
from argopy.utils import shortcut2gdac
shortcut2gdac()

<br>

It is therefore just as easy to point a [gdacfs (↗)](https://argopy.readthedocs.io/en/v1.3.1/generated/argopy.gdacfs.html) toward any GDAC host:

In [None]:
fs = gdacfs("ftp") # points toward ftp://ftp.ifremer.fr/ifremer/argo
fs

### Usage: low-level methods

All store implementations provide a unified Pythonic interface to local or remote file systems.

The available low-level methods are: ``open``, ``ls``, ``exists``, ``info`` and ``glob``.

So, for instance, one can glob some mono-profile files like this:

In [None]:
fs.glob("dac/coriolis/6903091/profiles/BR*_00*D.nc")

<br>

and this `glob` will work, whatever the location of the files, which is determined by the `host` used to instantiate the [gdacfs (↗)](https://argopy.readthedocs.io/en/v1.3.1/generated/argopy.gdacfs.html):

In [None]:
gdacfs('s3').glob("dac/coriolis/6903091/profiles/BR*_00*D.nc")
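
<br>

To see which file names the pattern `BR*_00*D.nc` selects, here is a small offline sketch using the standard library `fnmatch` module on base names only. This is an approximation of shell-style matching, not the store's own `glob` implementation, and all file names except `BR6903091_001D.nc` are hypothetical.

```python
from fnmatch import fnmatchcase

pattern = "BR*_00*D.nc"
names = [
    "BR6903091_001D.nc",  # used elsewhere in this notebook
    "BR6903091_002D.nc",  # hypothetical
    "BR6903091_010D.nc",  # hypothetical: "_01" does not match "_00"
    "BR6903091_003.nc",   # hypothetical: no trailing "D"
    "R6903091_001D.nc",   # hypothetical: no "BR" prefix
]

# Keep only the names matching the pattern (case-sensitive)
matches = [name for name in names if fnmatchcase(name, pattern)]
print(matches)
```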

#### 🛟 NOTE

The content of the `ls` output depends on the protocol:

In [None]:
gdacfs('s3').ls("dac/coriolis/6903091/profiles/BR6903091_001D.nc")

In [None]:
gdacfs('http').ls("dac/coriolis/6903091/profiles/BR6903091_001D.nc")

<br>

The ``info`` method returns more consistent results across protocols:

In [None]:
gdacfs('ftp').info("dac/coriolis/6903091/profiles/BR6903091_001D.nc")

In [None]:
gdacfs('http').info("dac/coriolis/6903091/profiles/BR6903091_001D.nc")

#### 🔍 Pro tip

You can get the absolute path of a given file with the ``full_path`` method:

In [None]:
fs.full_path("dac/coriolis/6903091/profiles/BR6903091_001D.nc")

<br> 

and if you lose track of which protocol your file system relies on, you can check it with the ``protocol`` attribute:

In [None]:
fs.protocol

#### Opening files

If you want a very low-level approach to reading file content, you can use the ``open`` method:

In [None]:
with gdacfs('http').open("dac/coriolis/6903091/profiles/BR6903091_001D.nc", 'rb') as f:
    print(f.full_name)
    print(f.read(10))  # Read the first 10 bytes

In [None]:
with gdacfs('ftp').open("dac/coriolis/6903091/profiles/BR6903091_001D.nc", 'rb') as f:
    print(f.full_name)
    print(f.read(10))  # Read the first 10 bytes

<br>

What is interesting here is that the `f.read` logic of the procedure does not depend on the GDAC file location 🎉
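
As an example of what can be done with these first bytes, the sketch below identifies the netCDF flavour from a file header. The function name `sniff_netcdf_format` is hypothetical; the signatures are the standard ones (`CDF` followed by a version byte for the classic formats, the HDF5 signature for netCDF-4), and the sample headers are made up for the demonstration.

```python
def sniff_netcdf_format(header: bytes) -> str:
    """Guess the netCDF flavour from the first bytes of a file."""
    if header.startswith(b"\x89HDF\r\n\x1a\n"):
        return "netCDF-4 / HDF5"
    if header[:3] == b"CDF" and header[3:4] in (b"\x01", b"\x02", b"\x05"):
        return "netCDF classic"
    return "unknown"


# Made-up headers, 10 bytes each, standing in for the output of f.read(10)
print(sniff_netcdf_format(b"CDF\x01\x00\x00\x00\x00\x00\x00"))
print(sniff_netcdf_format(b"\x89HDF\r\n\x1a\n\x00\x00"))
```

Since the function only receives bytes, it works unchanged whichever GDAC host the file was opened from.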

### Usage: higher-level methods

If you are mostly interested in loading netcdf, csv or json files, **Argopy** has higher-level methods than the above.

For instance, opening a netcdf file is as direct as:

In [None]:
fs = gdacfs('http')
ds = fs.open_dataset('dac/coriolis/6903091/profiles/BR6903091_001D.nc', engine='argo')
ds

🛟 **NOTE**

Above, we used the `engine='argo'` argument to ensure that all data arrays are cast to appropriate data types, which is not the case otherwise.

<br>

Again, this method is totally agnostic to the file location: it works the same way with the ftp server, for instance:

In [None]:
fs = gdacfs('ftp')
ds = fs.open_dataset('dac/coriolis/6903091/profiles/BR6903091_001D.nc', engine='argo')
ds

#### ✏️ EXERCISE

Use the `read_csv` method of a [gdacfs (↗)](https://argopy.readthedocs.io/en/v1.3.1/generated/argopy.gdacfs.html) instance to load one of the Argo index files.

💡 Code hint:
```python
csvfile = 'argo_bio-traj_index.txt'
fs = gdacfs()
fs.read_csv(csvfile, header=0, ...)  # kwargs are passed to pandas.read_csv
```

In [None]:
# Your code

#### 🔍 Pro tip

The [gdacfs (↗)](https://argopy.readthedocs.io/en/v1.3.1/generated/argopy.gdacfs.html) class also has methods to open multiple files, like ``open_mfdataset`` and ``open_mfjson``.

Each of these methods can process more than one file, sequentially or in parallel, and apply some processing to each individual file.

⚠️ By default, ``open_mfdataset`` will try to concatenate all datasets. To disable this behavior, use the `concat=False` argument.

Here is an example.

We first create a [gdacfs (↗)](https://argopy.readthedocs.io/en/v1.3.1/generated/argopy.gdacfs.html) instance and then make a list of some netcdf files:

In [None]:
fs = gdacfs('http')
ncfiles = fs.glob('dac/coriolis/6903091/profiles/BR6903091_00*D.nc')
ncfiles

<br>

Then we open all these netcdf files, using the `engine='argo'` option for appropriate data type casting:

In [None]:
ds_list = fs.open_mfdataset(ncfiles, concat=False, open_dataset_opts={'engine': 'argo'}, progress=True)
len(ds_list), ds_list[0]

#### ✏️ EXERCISE

Use the `preprocess` argument of `open_mfdataset` to open the netcdf files above and return, for each file, the maximum pressure of the primary profile.

💡 Code hint:  
`preprocess` is a function that will receive one netcdf file as an xarray dataset.

In [None]:
# Your code


## 🏁 End of the notebook

***
#### 👀 Useful argopy commands
```python
argopy.reset_options()
argopy.show_options()
argopy.status()
argopy.clear_cache()
argopy.show_versions()
```
#### ⚖️ License Information
This Jupyter Notebook is licensed under the **European Union Public Licence (EUPL) v1.2**.

| Permissions      | Limitations     | Conditions                     |
|------------------|-----------------|--------------------------------|
| ✔ Commercial use | ❌ Liability     | ⓘ License and copyright notice |
| ✔ Modification   | ❌ Trademark use | ⓘ Disclose source              |
| ✔ Distribution   | ❌ Warranty      | ⓘ State changes                |
| ✔ Patent use     |                  | ⓘ Network use is distribution  |
| ✔ Private use    |                  | ⓘ Same license                 |

For more details, visit: [EUPL v1.2 Full Text (↗)](https://github.com/euroargodev/argopy-training/blob/main/LICENSE).

#### 🤝 Sponsor
![logo (↗)](https://raw.githubusercontent.com/euroargodev/argopy-training/refs/heads/main/for_nb_producers/disclaimer_argopy_EAONE.png)
***
