In [None]:
# HIDDEN CELL
import sys, os

import numpy as np

# Importing argopy in dev mode:
on_rtd = os.environ.get('READTHEDOCS', None) == 'True'
if not on_rtd:
    sys.path.insert(0, "/Users/gmaze/git/github/euroargodev/argopy")
    import git
    import argopy
    from argopy.options import OPTIONS
    print("argopy:", argopy.__version__, 
          "\nsrc:", argopy.__file__, 
          "\nbranch:", git.Repo(search_parent_directories=True).active_branch.name, 
          "\noptions:", OPTIONS)
else:
    sys.path.insert(0, os.path.abspath('..'))

import xarray as xr
# xr.set_options(display_style="html");
xr.set_options(display_style="text");

In [None]:
import argopy
from argopy import DataFetcher as ArgoDataFetcher

# Performance

To improve ``argopy`` data fetching performance (in terms of retrieval time), two solutions are available:

- Cache fetched data, i.e. save your request locally so that you don't have to fetch it again,
- Fetch data by chunks in parallel, i.e. fetch independent pieces of data simultaneously.

These solutions are explained below.

Note that another solution from standard big data strategies would be to fetch data lazily. But since (i) *argopy* post-processes raw Argo data on the client side and (ii) none of the data sources are cloud/lazy compatible, this solution is not possible (yet).

## Cache

### Caching data

If you want to avoid retrieving the same data several times during a working session, or if you fetched a large amount of data, you may want to temporarily save data in a cache file.

You can cache fetched data with the fetcher option ``cache``.

**Argopy** cached data are persistent: they are stored locally in files and survive across sessions, so they are still available the next time you run your script.
**Cached data have an expiration time of one day**, since this is the update frequency of most data sources. This ensures that you always have the latest version of Argo data.
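To make the persistence and expiration rules concrete, here is a minimal sketch of a file-based cache with a one-day lifetime. It is not argopy's implementation: the helper names (`cache_path`, `is_fresh`), the use of SHA-256 to name files, and the example URI are all assumptions made for illustration.

```python
import hashlib
import os
import tempfile
import time

ONE_DAY = 24 * 3600  # expiration delay in seconds (the one-day rule described above)


def cache_path(cachedir, uri):
    """Map a request URI to a file named after its SHA-256 hex digest (assumed scheme)."""
    return os.path.join(cachedir, hashlib.sha256(uri.encode("utf-8")).hexdigest())


def is_fresh(path, now=None, max_age=ONE_DAY):
    """Return True if the file exists and was modified less than max_age seconds ago."""
    if not os.path.exists(path):
        return False
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) < max_age


cachedir = tempfile.mkdtemp()
path = cache_path(cachedir, "https://example.org/data?wmo=6902746&cyc=34")  # hypothetical URI
with open(path, "w") as f:
    f.write("fake payload")

fresh_now = is_fresh(path)                                   # file was just written
fresh_later = is_fresh(path, now=time.time() + 2 * ONE_DAY)  # same check, two days later
print(fresh_now, fresh_later)
```

Because the file lives on disk, a new Python session pointing at the same folder would find it again; only its age decides whether it is reused or fetched anew.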

All data and meta-data (index) fetchers have a caching system.

The argopy default cache folder is under your home directory at ``~/.cache/argopy``. 

But you can specify the path you want to use in several ways:

- with **argopy** global options:

```python
argopy.set_options(cachedir='mycache_folder')
```

- in a temporary context:

```python
with argopy.set_options(cachedir='mycache_folder'):
    ds = ArgoDataFetcher(cache=True).profile(6902746, 34).to_xarray()
```

- when instantiating the data fetcher:

```python
ds = ArgoDataFetcher(cache=True, cachedir='mycache_folder').profile(6902746, 34).to_xarray()
```

### Clearing the cache

If you want to manually clear your cache folder, and/or make sure your data are newly fetched, you can do it at the fetcher level with the ``clear_cache`` method.

Start by fetching data and storing them in cache:

```python
fetcher = ArgoDataFetcher(cache=True, cachedir='mycache_folder').profile(6902746, 34)
fetcher.to_xarray();
```

Fetched data are in the local cache folder:
```python
os.listdir('mycache_folder')
```
```bash
['cache', 
 'c5c820b6aff7b2ef86ef00626782587a95d37edc54120a63ee4699be2b0c6b7c']
```

where we see one hash entry for the newly fetched data and the cache registry file ``cache``.

We can then fetch something else using the same cache folder:

```python
fetcher2 = ArgoDataFetcher(cache=True, cachedir='mycache_folder').profile(1901393, 1)
fetcher2.to_xarray();
```

All fetched data are cached:

```python
os.listdir('mycache_folder')
```
```bash
['cache',
 'c5c820b6aff7b2ef86ef00626782587a95d37edc54120a63ee4699be2b0c6b7c',
 '58072df8477157c194449a2e6dff8d69ca3c8fded01eebdd8a5fc446f2f7f9a7']
```

Note the new hash file with the ``fetcher2`` data.

It is important to note that we can safely clear the cache of the first ``fetcher`` data: this won't remove the ``fetcher2`` data:

```python
fetcher.clear_cache()
os.listdir('mycache_folder')
```
```bash
['cache', 
 '58072df8477157c194449a2e6dff8d69ca3c8fded01eebdd8a5fc446f2f7f9a7']
```

By clearing the cache at the fetcher level, you make sure that only data fetched with this fetcher are removed, while other fetched data (with other fetchers for instance) stay in place.
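The idea of a fetcher-level clear can be sketched with a toy class that only removes the file it wrote itself. This is an illustration under assumptions, not argopy's code: `ToyFetcher`, its `cachepath` property and the request strings are hypothetical, and the cache registry file is left out.

```python
import hashlib
import os
import tempfile


class ToyFetcher:
    """Hypothetical stand-in for a fetcher that knows which cache file belongs to it."""

    def __init__(self, cachedir, uri):
        self.cachedir = cachedir
        self.uri = uri

    @property
    def cachepath(self):
        # File name derived from the request, so two requests never share a file
        name = hashlib.sha256(self.uri.encode("utf-8")).hexdigest()
        return os.path.join(self.cachedir, name)

    def fetch(self):
        # Pretend to download, then store the result under the hashed name
        with open(self.cachepath, "w") as f:
            f.write("payload for " + self.uri)

    def clear_cache(self):
        # Remove only the file belonging to this fetcher's request
        if os.path.exists(self.cachepath):
            os.remove(self.cachepath)


cachedir = tempfile.mkdtemp()
f1 = ToyFetcher(cachedir, "profile/6902746/34")
f2 = ToyFetcher(cachedir, "profile/1901393/1")
f1.fetch()
f2.fetch()

f1.clear_cache()
remaining = os.listdir(cachedir)
print(len(remaining))  # only the file written by f2 is left
```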

If you want to clear the entire cache folder, whatever the fetcher used, do it at the package level with:

```python
argopy.clear_cache()
```

If we now check the cache folder, we see that it has been deleted:

```python
os.listdir('mycache_folder')
```
```bash
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-13-6726e674f21f> in <module>
----> 1 os.listdir('mycache_folder')

FileNotFoundError: [Errno 2] No such file or directory: 'mycache_folder'
```

## Parallel data fetching

Sometimes you may find that your request takes a long time to fetch, or simply does not even succeed. This is probably because you're trying to fetch a large amount of data.

In this case, you can let argopy chunk your request into smaller pieces and fetch them in parallel for you. This is done with the argument ``parallel`` of the data fetcher and can be tuned using the options ``chunks`` and ``chunks_maxsize``.

By default, this goes as follows:

In [None]:
# Define a box to load (large enough to trigger chunking):
box = [-60, -30, 40.0, 60.0, 0.0, 100.0, "2007-01-01", "2007-04-01"]

# Instantiate a parallel fetcher:
loader_par = ArgoDataFetcher(src='erddap', parallel=True).region(box)

You can also use the option ``progress`` to display a progress bar during fetching:

In [None]:
loader_par = ArgoDataFetcher(src='erddap', parallel=True, progress=True).region(box)
loader_par

Then, you can fetch data as usual:

In [None]:
%%time
ds = loader_par.to_xarray()

### Number of chunks

To check how many chunks your request has been split into, you can look at the ``uri`` property of the fetcher. It gives you the list of paths to the data:

In [None]:
# Display only the relevant part of each URL in the URI list:
for uri in loader_par.uri:
    print("http: ... ", "&".join(uri.split("&")[1:-2])) 

To control chunking, you can use the **``chunks``** option, which specifies the number of chunks in each *direction*:

- ``lon``, ``lat``, ``dpt`` and ``time`` for **region** fetching,
- ``wmo`` for **float** and **profile** fetching.

In [None]:
# Create a large box:
box = [-60, 0, 0.0, 60.0, 0.0, 500.0, "2007", "2010"]

# Init a parallel fetcher:
loader_par = ArgoDataFetcher(src='erddap', 
                             parallel=True, 
                             chunks={'lon': 5}).region(box)
# Check number of chunks:
len(loader_par.uri)

This creates 195 chunks, including 5 along the longitudinal direction, as requested.

When the ``chunks`` option is not specified for a given *direction*, argopy relies on auto-chunking using predefined maximum chunk sizes (see below).
In the case above, this explains why we have 195 chunks and not only 5.
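The figure of 195 can be reproduced with a short calculation. The sketch below is one reading that is consistent with this number, not a description of argopy's internals: it assumes that auto-chunking uses the smallest number of chunks that keeps each chunk under the default maximum size (20 degrees, 500 db, 90 days), and that the dates ``"2007"`` and ``"2010"`` are interpreted as January 1st of each year. The names `MAXSIZE` and `n_chunks` are hypothetical.

```python
import math
from datetime import date

# Default maximum chunk sizes (lon/lat in degrees, dpt in db, time in days)
MAXSIZE = {"lon": 20, "lat": 20, "dpt": 500, "time": 90}


def n_chunks(extent, maxsize):
    """Smallest number of equal chunks such that none exceeds maxsize (assumed rule)."""
    return max(1, math.ceil(extent / maxsize))


# Extents of the box [-60, 0, 0.0, 60.0, 0.0, 500.0, "2007", "2010"]
extent = {
    "lon": 0 - (-60),
    "lat": 60.0 - 0.0,
    "dpt": 500.0 - 0.0,
    "time": (date(2010, 1, 1) - date(2007, 1, 1)).days,  # assumes Jan 1st for both years
}

chunks = {"lon": 5}  # user request; the other directions fall back to auto-chunking
counts = {d: chunks.get(d, n_chunks(extent[d], MAXSIZE[d])) for d in extent}
total = math.prod(counts.values())
print(counts, total)  # {'lon': 5, 'lat': 3, 'dpt': 1, 'time': 13} 195
```

Under these assumptions, the 5 requested longitude chunks are multiplied by 3 latitude chunks, 1 depth chunk and 13 time chunks.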

To chunk the request along a single direction, explicitly set all the other directions to ``1``:

In [None]:
# Init a parallel fetcher:
loader_par = ArgoDataFetcher(src='erddap', 
                             parallel=True, 
                             chunks={'lon': 5, 'lat':1, 'dpt':1, 'time':1}).region(box)

# Check number of chunks:
len(loader_par.uri)

We now have 5 chunks along longitude. Check out the URL parameters in the list of URIs:

In [None]:
for uri in loader_par.uri:
    print("&".join(uri.split("&")[1:-2])) # Display only the relevant URL part

### Size of chunks

The default maximum chunk sizes for each access point dimension are:

| Access point dimension | Maximum chunk size |
|------------------------|:------------------:|
| region / **lon**       |       20 deg       |
| region / **lat**       |       20 deg       |
| region / **dpt**       |      500 m or db   |
| region / **time**      |       90 days      |
| float / **wmo**        |          5         |
| profile / **wmo**      |          5         |

These default values are used to chunk data along any direction where the ``chunks`` parameter key is set to ``auto``.

But you can modify the maximum chunk size allowed in each of the possible directions. This is done with the option **``chunks_maxsize``**.

For instance, if you want to make sure that your chunks are not larger than 100 meters (db) in depth (pressure), you can use:

In [None]:
# Create a large box:
box = [-60, -10, 40.0, 60.0, 0.0, 500.0, "2007", "2010"]

# Init a parallel fetcher:
loader_par = ArgoDataFetcher(src='erddap', 
                             parallel=True, 
                             chunks_maxsize={'dpt': 100}).region(box)
# Check number of chunks:
len(loader_par.uri)

Since this creates a large number of chunks, let's do this again and combine it with the option ``chunks`` to easily see what's going on:

In [None]:
# Init a parallel fetcher with chunking along the vertical axis alone:
loader_par = ArgoDataFetcher(src='erddap', 
                             parallel=True, 
                             chunks_maxsize={'dpt': 100},
                             chunks={'lon':1, 'lat':1, 'dpt':'auto', 'time':1}).region(box)

for uri in loader_par.uri:
    print("http: ... ", "&".join(uri.split("&")[1:-2])) # Display only the relevant URL part

You can see that the ``pres`` arguments in this list of erddap URLs define layers no thicker than the requested 100 db.
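How a pressure range ends up as layers of bounded thickness can be sketched as follows. This is an illustration only: `split_range` is a hypothetical helper, it assumes equal-width layers, and the printed constraint strings merely imitate the kind of ``pres`` bounds visible in the URLs rather than the exact text argopy generates.

```python
import math


def split_range(start, stop, maxsize):
    """Split [start, stop] into equal contiguous intervals no wider than maxsize."""
    n = max(1, math.ceil((stop - start) / maxsize))
    step = (stop - start) / n
    edges = [start + i * step for i in range(n)] + [stop]
    return list(zip(edges[:-1], edges[1:]))


# Pressure range of the box above (0 to 500 db), with a 100 db maximum thickness
layers = split_range(0.0, 500.0, 100)
for top, bottom in layers:
    print(f"pres>={top}&pres<={bottom}")  # illustrative constraint format
```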

With the ``profile`` and ``float`` access points, you can use the ``wmo`` keyword to control the number of WMOs in each chunk.

In [None]:
WMO_list = [6902766, 6902772, 6902914, 6902746, 6902916, 6902915, 6902757, 6902771]

# Init a parallel fetcher with chunking along the list of WMOs:
loader_par = ArgoDataFetcher(src='erddap', 
                             parallel=True, 
                             chunks_maxsize={'wmo': 3}).float(WMO_list)

for uri in loader_par.uri:
    print("http: ... ", "&".join(uri.split("&")[1:-2])) # Display only the relevant URL part

You can see here that this request for 8 floats is split into chunks with no more than 3 floats each.
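Splitting a list of WMOs under a maximum chunk size can be sketched in a few lines. The helper `chunk_list` is hypothetical and the way the remainder is distributed is an assumption: argopy may group the floats differently, as long as no chunk exceeds the limit.

```python
import math


def chunk_list(items, maxsize):
    """Split items into the fewest contiguous chunks holding at most maxsize items each."""
    n = max(1, math.ceil(len(items) / maxsize))
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first chunks
        chunks.append(items[start:start + size])
        start += size
    return chunks


WMO_list = [6902766, 6902772, 6902914, 6902746, 6902916, 6902915, 6902757, 6902771]
wmo_chunks = chunk_list(WMO_list, 3)
print([len(c) for c in wmo_chunks])  # [3, 3, 2]
```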

### Parallelization methods

There are 3 methods available to set up your data fetching requests in parallel:
    
1. [Multi-threading](https://en.wikipedia.org/wiki/Multithreading_(computer_architecture))
1. [Multi-processing](https://en.wikipedia.org/wiki/Multiprocessing)
1. a [Dask distributed client](https://distributed.dask.org/en/latest/client.html)

The first two options use a pool of [threads](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor) or [processes](https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor) managed with the [concurrent futures module](https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures). The last option is a much higher level method to leverage the computing power of a Dask distributed framework ([see here for examples on how to create a dask cluster](https://nbviewer.jupyter.org/github/obidam/ds2-2020/blob/master/practice/environment/01-Launch_Dask_Cluster.ipynb)). 

The parallelization method is set with the ``parallel_method`` option of the fetcher, which can take as values the strings ``thread`` or ``process`` or a *client* object.
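To illustrate what a pool of threads does with a list of chunks, here is a minimal sketch based on the standard ``concurrent.futures`` module. It does not call argopy: `fetch_chunk` is a hypothetical stand-in that sleeps instead of downloading, and the chunk identifiers are made up.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_chunk(uri):
    """Hypothetical stand-in for one remote request; sleeps instead of downloading."""
    time.sleep(0.05)
    return {"uri": uri, "n_points": len(uri)}


uris = [f"chunk-{i}" for i in range(8)]  # made-up chunk identifiers

# Threads suit I/O-bound work such as waiting for server responses;
# executor.map returns results in the same order as the input
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch_chunk, uris))

print([r["uri"] for r in results])
```

Swapping in ``ProcessPoolExecutor`` follows the same pattern, but the function and its arguments must then be picklable, and on platforms that spawn new processes the pool should be created under an ``if __name__ == "__main__":`` guard.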

In [None]:
%%time
# Define the box to load:
box = [-60, -30, 40.0, 60.0, 0.0, 100.0, "2007-01-01", "2007-04-01"]

print("BOX=", box)
ds = ArgoDataFetcher(src='argovis', parallel=True, progress=True, parallel_method='thread').region(box).to_xarray()

### Comparison of performances

Note that to compare performance with and without the parallel option, we need to make sure that data are not cached on the server side.
To do this, we add a very small random perturbation to the box definition, here on the maximum latitude.

In [None]:
def this_box():
    return [-60, 0, 
           20.0, 60.0 + np.random.randint(0,100,1)[0]/1000, 
           0.0, 500.0, 
           "2007", "2010"]

In [None]:
%%time
ds = ArgoDataFetcher(src='argovis', parallel=False).region(this_box()).to_xarray()

In [None]:
%%time
ds = ArgoDataFetcher(src='argovis', parallel=True).region(this_box()).to_xarray()

This simple comparison shows that the parallel request is significantly faster than the standard one.