<img src="https://raw.githubusercontent.com/euroargodev/argopy/master/docs/_static/argopy_logo_long.png" alt="argopy logo" width="200"/>

# Training Camp - Sept 22<sup>nd</sup> 2025

***

## Notebook Title : How to handle large data selection

**Author contact : [G. Maze](https://annuaire.ifremer.fr/cv/17182)**

**Description:**

A large data selection will often fail with default options and arguments. This notebook provides tips and tricks to help a large data selection go through, in particular with:
- caching, i.e. saving your request locally so that you don’t have to fetch it again,
- parallelisation, i.e. fetching chunks of independent data simultaneously (e.g. with a Dask cluster).

All information about Argo data fetching performance can be found in the [dedicated section of the documentation](https://argopy.readthedocs.io/en/v1.3.0/advanced-tools/performances/index.html).

🏷️ This notebook was developed with [Argopy version *1.3.0*](https://argopy.readthedocs.io/en/v1.3.0)

©  [European Union Public Licence (EUPL) v1.2](https://github.com/euroargodev/argopy-training/blob/main/LICENSE), see at the bottom of this notebook for more.

**Table of Contents**
- [Use data caching](#use-data-caching)
    - [🔍 Pro tip](#🔍-pro-tip)
- [Use parallelisation](#use-parallelisation)
    - [🔍 Pro tip](#🔍-pro-tip)
    - [✏️ EXERCICE](#✏️-exercice)
    - [🛟 Note](#🛟-note)
- [🏁 End of the notebook](#🏁-end-of-the-notebook)
    - [👀 Useful argopy commands](#👀-useful-argopy-commands)
    - [⚖️ License Information](#⚖️-license-information)
    - [🤝 Sponsor](#🤝-sponsor)
***

Let's start with the usual import:

In [None]:
from argopy import DataFetcher

And to prevent cell outputs from being too large, we won't display xarray object attributes:

In [None]:
import xarray as xr
xr.set_options(display_expand_attrs=False)

## Use data caching

If you want to avoid retrieving the same data several times during a working session, especially if you fetch a large amount of data, you may want to temporarily save data in a cache file.

All details are given in [this section of the documentation](https://argopy.readthedocs.io/en/v1.3.0/advanced-tools/performances/caching.html).

You can cache fetched data with the [DataFetcher](https://argopy.readthedocs.io/en/v1.3.0/generated/argopy.fetchers.ArgoDataFetcher.html#argopy.fetchers.ArgoDataFetcher) option `cache`.

In [None]:
box = [15, 23, 35, 39, 0, 500, '2024-01', '2025-01']
f = DataFetcher(cache=True).region(box)
f

<br> 

At this point, the data are not loaded yet, so let's trigger the download:

In [None]:
%%time
ds = f.to_xarray()
ds.argo

<br>

This took some time (about 40 seconds) because it was the first download.

If we trigger the data download again, it should be faster thanks to the `cache` option:

In [None]:
%%time
ds = f.to_xarray()
ds.argo
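
The principle behind caching can be sketched in plain Python. This is only an illustration of the idea, not argopy's actual cache implementation: `cached_fetch`, `slow_fetch` and `CACHE_DIR` are hypothetical names introduced for this sketch, and the 24-hour expiry delay simply mirrors the one mentioned in this notebook.

```python
import hashlib
import json
import pickle
import tempfile
import time
from pathlib import Path

# Hypothetical cache folder for this sketch only (not argopy's cache location)
CACHE_DIR = Path(tempfile.mkdtemp(prefix="demo_cache_"))
MAX_AGE = 24 * 3600  # expiry delay in seconds (24 hours)


def cached_fetch(request, fetch, max_age=MAX_AGE):
    """Return fetch(request), reusing a local copy if a fresh one exists."""
    # Build a stable file name from the content of the request
    key = hashlib.sha256(json.dumps(request, sort_keys=True).encode()).hexdigest()
    path = CACHE_DIR / f"{key}.pkl"
    if path.exists() and time.time() - path.stat().st_mtime < max_age:
        return pickle.loads(path.read_bytes())  # cache hit: nothing is downloaded
    result = fetch(request)  # cache miss: do the slow work once
    path.write_bytes(pickle.dumps(result))
    return result


calls = []  # keeps track of how many times the slow function really runs


def slow_fetch(request):
    """Hypothetical stand-in for a slow remote download."""
    calls.append(request)
    return {"n_points": 42, "box": request}


box = [15, 23, 35, 39, 0, 500, "2024-01", "2025-01"]
first = cached_fetch(box, slow_fetch)   # runs slow_fetch and saves the result
second = cached_fetch(box, slow_fetch)  # reads the saved result instead
print(len(calls))
```

The second call returns the same content without running `slow_fetch` again, which is why a repeated request is faster once the cache is enabled.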

#### 🔍 Pro tip

Cached data expire after 24 hours. To clear the cache of this [DataFetcher](https://argopy.readthedocs.io/en/v1.3.0/generated/argopy.fetchers.ArgoDataFetcher.html#argopy.fetchers.ArgoDataFetcher), you can use:

In [None]:
f.clear_cache()

And if you want to clear all Argopy cache data:

In [None]:
import argopy
argopy.clear_cache()

## Use parallelisation

You can let argopy chunk your request into smaller pieces and fetch them in parallel for you. This is done with the data fetcher argument, or global option, `parallel`.

#### 🔍 Pro tip

- Parallelisation can be tuned using the arguments `chunks` and `chunks_maxsize`.
- Use the argument `progress` to monitor how your data fetching is going.
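
To picture what chunking a request means, here is a minimal sketch in plain Python. It is not argopy's internal code: `split_box_by_depth` and `fake_fetch` are hypothetical helpers, and we assume for illustration that a maximum chunk size is a maximum extent along one axis of the box (here depth). A thread pool stands in for whatever parallelisation backend is actually used.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def split_box_by_depth(box, maxsize):
    """Split a region box into sub-boxes no thicker than `maxsize` along depth.

    Box layout, as used in this notebook:
    [lon_min, lon_max, lat_min, lat_max, dpt_min, dpt_max, date_min, date_max]
    """
    dpt_min, dpt_max = box[4], box[5]
    n_chunks = max(1, math.ceil((dpt_max - dpt_min) / maxsize))
    step = (dpt_max - dpt_min) / n_chunks
    chunks = []
    for i in range(n_chunks):
        sub = list(box)  # copy, so the original box is left untouched
        sub[4] = dpt_min + i * step
        sub[5] = dpt_min + (i + 1) * step
        chunks.append(sub)
    return chunks


def fake_fetch(sub_box):
    """Hypothetical stand-in for a remote request on one chunk."""
    return (sub_box[4], sub_box[5])


box = [15, 23, 35, 39, 0, 500, '2015-01', '2025-01']
chunks = split_box_by_depth(box, maxsize=100)

# Independent chunks can be requested simultaneously; map() keeps their order
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_fetch, chunks))

for depth_range in results:
    print(depth_range)
```

Each sub-box is an independent request covering a slice of the original box, so the pieces can be fetched at the same time and merged afterwards.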

<br>

To get started, let's make sure we have a data request that can't get through without optimisation, i.e. one for which a default fetcher will fail:

In [None]:
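import argopy

# Select the ERDDAP server used for this demonstration
argopy.set_options(erddap='https://erddap-test1.ifremer.fr/erddap')

# Same region as before, but over 10 years instead of 1
box = [15, 23, 35, 39, 0, 500, '2015-01', '2025-01']
f = DataFetcher().region(box)
try:
    ds = f.to_xarray()
    print(ds.argo)
except Exception as e:
    # Report the error type instead of silently hiding it
    print(f'❌ This fails: {type(e).__name__}')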

<br>

We can now try parallelisation:

In [None]:
f = DataFetcher(parallel=True, progress=True).region(box)
f

<br>

And see that data fetching is now possible (this can take up to 3 minutes):

In [None]:
%%time
ds = f.to_xarray()
ds.argo

#### ✏️ EXERCICE

Reduce the size of the box for demonstration purposes, and increase the number of chunks to see how performance changes.

💡 Code hint:
```python
f = DataFetcher(parallel=True, progress=True,
                chunks_maxsize={'dpt': 100})
```

In [None]:
# Your code here

#### 🛟 Note

Parallelisation may require some tuning because a balance must be found between the chunking overhead, the size and number of chunks, and the response time of the Argo GDAC server.
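
One simple way to approach this tuning is to time the same request with a few candidate settings. The sketch below assumes nothing about argopy itself: `time_call` is a hypothetical helper, and `fake_fetch` is a placeholder that only sleeps, to be replaced by a real fetch as indicated in the comment. The candidate values are arbitrary examples.

```python
import time


def time_call(func, repeat=1):
    """Return the best wall-clock time (in seconds) over `repeat` calls of func()."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def fake_fetch(chunks_maxsize):
    """Hypothetical stand-in for a real fetch; it only sleeps briefly."""
    time.sleep(0.01)


# Arbitrary candidate settings to compare
candidates = [{'dpt': 500}, {'dpt': 250}, {'dpt': 100}]
timings = {}
for maxsize in candidates:
    # In the notebook, replace fake_fetch(maxsize) with something like:
    #   DataFetcher(parallel=True, chunks_maxsize=maxsize).region(box).to_xarray()
    timings[maxsize['dpt']] = time_call(lambda: fake_fetch(maxsize))

for dpt, seconds in sorted(timings.items(), reverse=True):
    print(f"chunks_maxsize={{'dpt': {dpt}}}: {seconds:.3f} s")
```

When timing real requests, keep caching disabled, otherwise repeated calls would measure the cache rather than the download.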

There are limits to how much the fetching of Argo data can be improved. They are explained in [this section of the documentation](https://argopy.readthedocs.io/en/v1.3.0/advanced-tools/performances/index.html#limitations).

## 🏁 End of the notebook

***
#### 👀 Useful argopy commands
```python
argopy.reset_options()
argopy.show_options()
argopy.status()
argopy.clear_cache()
argopy.show_versions()
```
#### ⚖️ License Information
This Jupyter Notebook is licensed under the **European Union Public Licence (EUPL) v1.2**.

| Permissions      | Limitations     | Conditions                     |
|------------------|-----------------|--------------------------------|
| ✔ Commercial use | ❌ Liability     | ⓘ License and copyright notice |
| ✔ Modification   | ❌ Trademark use | ⓘ Disclose source              |
| ✔ Distribution   | ❌ Warranty      | ⓘ State changes                |
| ✔ Patent use     |                  | ⓘ Network use is distribution  |
| ✔ Private use    |                  | ⓘ Same license                 |

For more details, visit: [EUPL v1.2 Full Text](https://github.com/euroargodev/argopy-training/blob/main/LICENSE).

#### 🤝 Sponsor
![logo](https://raw.githubusercontent.com/euroargodev/argopy-training/refs/heads/main/for_nb_producers/template_argopy_training_EAONE.png)
***
