# Read CAFE dataset directly from AWS datastore

The CAFE dataset produced by CSIRO is stored as an Open Dataset on AWS. The data can be read directly from the AWS store in Python. Here is an example of how to do so.

The data is organised by realm: atmos_hybrid, atmos_isobaric, ice, land, ocean, ocean_bgc, ocean_force, ocean_scalar.

The list of variables in each realm is given in the [CAFE documentation](https://data.csiro.au/dap/ws/v2/collections/49803/support/4029).

The AWS storage service is called S3. Several Python packages have been developed to access it as a filesystem. Here we are going to use s3fs.

In [1]:
import s3fs
import xarray as xr
import climtas.nci
import kerchunk.hdf
import kerchunk.combine
import dask
import ujson
from glob import glob

## Start with an example

First, you will need to open an anonymous connection to the filesystem. The `anon` option stands for anonymous.

In [2]:
fs = s3fs.S3FileSystem(anon=True)

Let's find the path to one of the CAFE files using [the AWS Explorer](https://cafe60-reanalysis-dataset-aws-open-data.s3.amazonaws.com/index.html) and see how to open it. You can see the path in the top banner of the Explorer.

Xarray can open the files for you, but instead of passing the file path as usual you must pass a file object, such as the one returned by `fs.open()`.

In [3]:
file_path = "cafe60-reanalysis-dataset-aws-open-data/atmos_isobaric/temp.atmos_isobaric.daily.CAFE60.19600101-19600131.nc"
file_obj = fs.open(file_path)   # We use fs.open() as the file is on S3
ds = xr.open_dataset(file_obj)
ds

You can then use the data as usual, but operations will take longer than with local files because the data is read over the network:

In [4]:
ds["temp"].mean(dim="time")

## Generalisation
From the path to the CAFE file used previously, we see that all the files follow this naming pattern:

`"cafe60-reanalysis-dataset-aws-open-data/<realm>/<varname>.<realm>.<temporal resolution>.CAFE60.<starttime>-<endtime>.nc"`
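As a quick check of this pattern, here is a minimal sketch that splits a path into its components with a regular expression. The helper name `parse_cafe_path` and the regular expression are illustrative assumptions inferred from the single example path above; in particular, it assumes variable names contain no dots.

```python
import re

# Pattern inferred from the example path; an assumption, not an official specification.
CAFE_PATTERN = re.compile(
    r"^(?P<bucket>[^/]+)/(?P<realm>[^/]+)/"
    r"(?P<varname>[^.]+)\.(?P=realm)\.(?P<time_res>[^.]+)\.CAFE60\."
    r"(?P<start>\d{8})-(?P<end>\d{8})\.nc$"
)


def parse_cafe_path(path):
    """Split a CAFE file path into its named components (hypothetical helper)."""
    match = CAFE_PATTERN.match(path)
    if match is None:
        raise ValueError(f"Path does not follow the CAFE naming pattern: {path}")
    return match.groupdict()


example = (
    "cafe60-reanalysis-dataset-aws-open-data/atmos_isobaric/"
    "temp.atmos_isobaric.daily.CAFE60.19600101-19600131.nc"
)
parts = parse_cafe_path(example)
print(parts["realm"], parts["varname"], parts["time_res"], parts["start"], parts["end"])
```

The realm appears twice in the path (as the folder and inside the file name), so the pattern uses a back-reference to require that both occurrences agree.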

With this information, we can write a function that will return the paths to all the files for a given variable, realm and temporal resolution.

In [5]:
def find_cafe_files(realm, varname, time_res):
    """Return a list of all the files for a given variable and temporal resolution in a realm.

    realm: str, one of the realms for the CAFE variables
    varname: str, name of one of the CAFE variables
    time_res: 'daily'|'month', temporal resolution for the variable
    """
    root = "cafe60-reanalysis-dataset-aws-open-data"
    path = f"{root}/{realm}"

    # Include the temporal resolution in the pattern so only matching files are returned
    my_list = sorted(fs.glob(f"{path}/{varname}.{realm}.{time_res}.*"))

    # Stop with an error message if no files are found.
    if not my_list:
        raise FileNotFoundError(
            "There is no such variable in that realm or for that temporal resolution. "
            "Please check the CAFE documentation."
        )
    return my_list

In [6]:
temp_files = find_cafe_files("atmos_isobaric","temp","daily")

Let's try to read in 100 files and then 250 files and see how long each takes. This will tell us whether this simple approach can work in this case.

In [None]:
%%time
# Open all the files and then read them with open_mfdataset
temp_files_ob = [ fs.open(tt) for tt in temp_files ]
ds = xr.open_mfdataset(temp_files_ob[0:100],
                       combine='nested',
                       concat_dim='time', 
                       join='override',
                       coords='minimal',
                       compat='override',
                       chunks={"time":1}, 
#                        parallel=True
                      )

In [None]:
%%time
# Open all the files and then read them with open_mfdataset
temp_files_ob = [ fs.open(tt) for tt in temp_files ]
ds = xr.open_mfdataset(temp_files_ob[0:250],
                       combine='nested',
                       concat_dim='time', 
                       join='override',
                       coords='minimal',
                       compat='override',
                       chunks={"time":1}, 
#                        parallel=True
                      )


We see that reading 250 files takes much more than 2.5 times as long as reading 100 files. The cost grows faster than the number of files, so this simple approach will not work for the whole time series of 726 files.
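The comparison can be made explicit by extrapolating linearly from the smaller run and comparing with the larger one. This is a minimal sketch: the timings below are placeholder values chosen for illustration, not measurements from this notebook, and the function names are hypothetical. Replace the placeholders with the wall times reported by `%%time`.

```python
def seconds_per_file(n_files, total_seconds):
    """Average wall time spent per file."""
    return total_seconds / n_files


def linear_estimate(n_ref, t_ref, n_target):
    """Extrapolate the read time, assuming cost proportional to the number of files."""
    return t_ref * n_target / n_ref


# Placeholder timings in seconds, NOT measurements from this notebook.
t_100 = 60.0
t_250 = 400.0

expected_250 = linear_estimate(100, t_100, 250)
slowdown = t_250 / expected_250

print(f"per file: {seconds_per_file(100, t_100):.2f} s for 100 files, "
      f"{seconds_per_file(250, t_250):.2f} s for 250 files")
print(f"250 files took {slowdown:.1f} times the linear estimate")
```

A ratio well above 1 means the time per file increases with the number of files, which is the behaviour described above.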

## Let's try to optimise this
Cloud object storage is much better suited to a format called Zarr than to netCDF. Luckily, it's possible to make a netCDF file look like a Zarr store without converting it. Let's see if it speeds things up. We can use what was done here: https://gist.github.com/rsignell-usgs/ef435a53ac530a2843ce7e1d59f96e22

For this, we need to save some information about the chunking of each netCDF file in a JSON file.
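To give an idea of what such a JSON file contains, here is a simplified, self-contained illustration of the principle. It is not kerchunk's implementation: it uses a small local binary file as a stand-in for a netCDF file, and the keys `temp/0` and `temp/1` and the helper `read_chunk` are made up for the example. Each chunk key maps to a `[url, offset, length]` triple, so a reader can fetch exactly the bytes of one chunk. Real reference files also carry the Zarr metadata entries.

```python
import json
import os
import tempfile

# Stand-in for a netCDF file: a 16-byte header followed by two 8-byte "chunks".
payload = b"HEADER".ljust(16, b"_") + b"chunk-00" + b"chunk-01"


def read_chunk(ref_file, key):
    """Read one chunk using only the byte range recorded in the reference file."""
    with open(ref_file) as f:
        url, offset, length = json.load(f)["refs"][key]
    with open(url, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "example.bin")
    with open(data_path, "wb") as f:
        f.write(payload)

    # Simplified reference set: chunk key -> [url, offset, length]
    refs = {
        "version": 1,
        "refs": {
            "temp/0": [data_path, 16, 8],
            "temp/1": [data_path, 24, 8],
        },
    }
    ref_path = os.path.join(tmp, "example.json")
    with open(ref_path, "w") as f:
        json.dump(refs, f)

    first = read_chunk(ref_path, "temp/0")
    second = read_chunk(ref_path, "temp/1")

print(first, second)
```

The original file is never rewritten: only the small JSON index is created, which is why this approach avoids converting the data.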

In [7]:
client = climtas.nci.Client()
client

Client: connection method: Cluster object, cluster type: distributed.LocalCluster
Dashboard: /node/ood-vn9/39373/proxy/8787/status
Workers: 8, total threads: 8, total memory: 22.46 GiB
Status: running, using processes: True
Each worker: 1 thread, 2.81 GiB memory


In [8]:
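# Options passed to fs.open() when reading the remote files
so = dict(mode='rb', anon=True, default_fill_cache=False, default_cache_type='first')

def gen_json(u, json_dir):
    """Write the kerchunk reference JSON for the netCDF file at URL `u` into `json_dir`."""
    with fs.open(u, **so) as infile:
        # Scan the HDF5/netCDF4 file and record where each chunk is stored
        h5chunks = kerchunk.hdf.SingleHdf5ToZarr(infile, u, inline_threshold=300)
        # Use the last path component as the file name, whatever the depth of the URL
        fname = u.split('/')[-1]
        outf = f'{json_dir}/{fname}.json'
        with open(outf, 'wb') as f:
            f.write(ujson.dumps(h5chunks.translate()).encode())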

In [9]:
# Let's define the URLs and a path to save the JSON files
urls = ["s3://" + f for f in temp_files]
json_path = "/g/data/w35/ccc561/CAFE60/json"

In [10]:
%%time
dask.compute(*[dask.delayed(gen_json)(u, json_path) for u in urls], retries=10);

CPU times: user 1min 37s, sys: 14.4 s, total: 1min 52s
Wall time: 7min 56s



In [11]:
# Match only the per-file references ("<file>.nc.json"), not the combined JSON written later
json_list = sorted(glob(f"{json_path}/temp*.nc.json"))

In [14]:
mzz = kerchunk.combine.MultiZarrToZarr(json_list[0:50], 
    remote_protocol='s3',
    remote_options={'anon': True},   # anonymous access to the netCDF files on S3
    xarray_open_kwargs={
        'decode_cf' : False,
        'mask_and_scale' : False,
        'decode_times' : False,
        'use_cftime' : False,
        'drop_variables': ['time_bounds'],
        'decode_coords' : False
    },
    xarray_concat_args={
        "data_vars": "minimal",
        "coords": "minimal",
        "compat": "override",
        "join": "override",
        "combine_attrs": "override",
        "dim": "time"
    }
)


In [15]:
mzz.translate(f"{json_path}/temp.atmos_isobaric.daily.CAFE60.json")

IndexError: list index out of range

In [None]:
temp_files[0]

In [None]:
len(json_list)
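When debugging this step, it can help to confirm that every source file produced a reference JSON. Below is a minimal sketch, assuming the naming used by `gen_json` (`<file name>.json` in the JSON directory). The helper name `missing_references` is hypothetical, and the demonstration runs on a temporary directory with made-up file names rather than on the real data.

```python
import os
import tempfile
from glob import glob


def missing_references(source_files, json_dir):
    """Return the names of source files that have no '<name>.json' in json_dir."""
    existing = {os.path.basename(p) for p in glob(os.path.join(json_dir, "*.json"))}
    return [
        os.path.basename(src)
        for src in source_files
        if os.path.basename(src) + ".json" not in existing
    ]


# Demonstration with made-up names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sources = ["bucket/realm/a.nc", "bucket/realm/b.nc"]
    open(os.path.join(tmp, "a.nc.json"), "w").close()   # only 'a.nc' has a reference
    missing = missing_references(sources, tmp)

print(missing)
```

In the notebook, the equivalent call would be `missing_references(temp_files, json_path)`; an empty list means every file has its JSON.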