<a href="https://colab.research.google.com/github/Vizzuality/copernicus-climate-data/blob/master/prepare_hourly_pet_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prepare data for the copernicus-climate project

https://github.com/Vizzuality/copernicus-climate-data

`Edward P. Morris (vizzuality.)`

## Description
[Describe what the notebook does.] 

## TODO
+ Check dtypes and convert to simpler types; converting float64 to float32 halves the array size. See https://xarray.pydata.org/en/stable/io.html#writing-encoded-data
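
A minimal sketch of the dtype conversion mentioned in the TODO, assuming the encoding is supplied per variable when writing (xarray accepts an `encoding` dict on write). The helper name `float32_encoding` and the input mapping are illustrative and not part of this notebook.

```python
import struct

def float32_encoding(var_dtypes):
    """Build a per-variable encoding dict that stores float64 variables as float32.

    var_dtypes maps variable names to dtype strings (hypothetical input,
    e.g. {name: str(ds[name].dtype) for name in ds.data_vars}).
    """
    return {name: {"dtype": "float32"}
            for name, dtype in var_dtypes.items()
            if dtype == "float64"}

# Bytes per value: C double (float64) vs C float (float32)
bytes_f64 = struct.calcsize("d")
bytes_f32 = struct.calcsize("f")

# Only the float64 variable receives an encoding entry
encoding = float32_encoding({"heatwave_alarms": "float64", "tasmax": "float32"})
print(encoding, bytes_f64 // bytes_f32)
# The dict is meant to be passed on write, e.g. ds.to_zarr(store, encoding=encoding)
```

Variables that are already float32 are left out of the dict, so their existing encoding is not touched.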

```
MIT License

Copyright (c) 2020 Vizzuality

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

# Setup

Instructions for setting up the computing environment.

In [0]:
%%bash
# Remove sample_data
rm -r sample_data

## Linux dependencies

Instructions for adding Linux system packages (including Node, etc.).

``` 
!apt install -q -y <package-name>
!npm install -g <package-name>
```

In [0]:
# Packages for projections and geospatial processing
!apt install -q -y libspatialindex-dev libproj-dev proj-data proj-bin libgeos-dev

Reading package lists...
Building dependency tree...
Reading state information...
proj-data is already the newest version (4.9.3-2).
proj-data set to manually installed.
The following additional packages will be installed:
  libspatialindex-c4v5 libspatialindex4v5
Suggested packages:
  libgdal-doc
The following NEW packages will be installed:
  libgeos-dev libproj-dev libspatialindex-c4v5 libspatialindex-dev
  libspatialindex4v5 proj-bin
0 upgraded, 6 newly installed, 0 to remove and 25 not upgraded.
Need to get 860 kB of archives.
After this operation, 5,014 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libgeos-dev amd64 3.6.2-1build2 [73.1 kB]
Get:2 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libspatialindex4v5 amd64 1.8.5-5 [219 kB]
Get:3 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libspatialindex-c4v5 amd64 1.8.5-5 [51.7 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic/universe amd64 libproj-dev amd64 4

## Python packages

Consider pinning package versions so that the environment stays reproducible.

`!pip install -q <package-name>`
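
One way to check that an environment still matches a set of pins is sketched below. The function name `find_version_drift`, the pinned versions, and the `snapshot` dict are hypothetical placeholders; when no snapshot is passed, the installed versions are looked up with `importlib.metadata`.

```python
from importlib import metadata

def find_version_drift(pins, installed=None):
    """Return {package: (pinned, found)} for packages whose version differs.

    found is None when the package is not installed.
    """
    drift = {}
    for name, pinned in pins.items():
        if installed is not None:
            found = installed.get(name)
        else:
            try:
                found = metadata.version(name)
            except metadata.PackageNotFoundError:
                found = None
        if found != pinned:
            drift[name] = (pinned, found)
    return drift

# Hypothetical pins and a hypothetical snapshot of the environment
pins = {"xarray": "0.15.1", "zarr": "2.4.0"}
snapshot = {"xarray": "0.15.1", "zarr": "2.3.2"}
print(find_version_drift(pins, snapshot))
```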

In [0]:
# connect to Google cloud storage
!pip install -q gcsfs 

In [0]:
# geospatial tools
!pip install -q country-converter rtree geopandas shapely fiona

  Building wheel for country-converter (setup.py) ... done
  Building wheel for rtree (setup.py) ... done


In [0]:
# netcdf, xarray, xclim, and Zarr tools
!pip install -q cftime netcdf4 nc-time-axis zarr xarray xclim rioxarray regionmask sparse

  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done

In [0]:
# Show python package versions
!pip list

Package                  Version        
------------------------ ---------------
absl-py                  0.9.0          
affine                   2.3.0          
alabaster                0.7.12         
albumentations           0.1.12         
altair                   4.1.0          
asciitree                0.3.3          
asgiref                  3.2.7          
astor                    0.8.1          
astropy                  4.0.1.post1    
astunparse               1.6.3          
atari-py                 0.2.6          
atomicwrites             1.3.0          
attrs                    19.3.0         
audioread                2.1.8          
autograd                 1.3            
Babel                    2.8.0          
backcall                 0.1.0          
beautifulsoup4           4.6.3          
bleach                   3.1.4          
blis                     0.4.1          
bokeh                    1.4.0          
boltons                  20.1.0         
boto            

## Authorisation

Setting up connections and authorisation to cloud services.

### Google Cloud

This can be done in the URL or via adding service account credentials.

If you do not share the notebook, you can mount your Drive and transfer the credentials to disk. Note that if the notebook is shared, you always need to authenticate via URL.

In [0]:
# Set Google Cloud information
gc_project = "skydipper-196010"
gc_creds = "skydipper-196010-f842645fd0f3.json"
gc_user = "edward-morris@skydipper-196010.iam.gserviceaccount.com"
gcs_prefix = "gs://copernicus-climate"

In [0]:
# For auth WITHOUT service account
# https://cloud.google.com/resource-manager/docs/creating-managing-projects
#from google.colab import auth
#auth.authenticate_user()
#!gcloud config set project {gc_project}

In [0]:
# If the notebook is shared
#from google.colab import drive
#drive.mount('/content/drive')

In [0]:
# If Drive is mounted, copy GC credentials to home (place in your GDrive, and connect Drive)
!cp "/content/drive/My Drive/{gc_creds}" "/root/.{gc_creds}"

In [0]:
# Auth WITH service account
!gcloud auth activate-service-account {gc_user} --key-file=/root/.{gc_creds} --project={gc_project}


Activated service account credentials for: [edward-morris@skydipper-196010.iam.gserviceaccount.com]


To take a quick anonymous survey, run:
  $ gcloud survey



In [0]:
# Test GC auth
!gsutil ls {gcs_prefix}

gs://copernicus-climate/heatwaves_historical_Basque.zip
gs://copernicus-climate/heatwaves_longterm_Basque.zip
gs://copernicus-climate/spain.zarr.zip
gs://copernicus-climate/coldsnaps/
gs://copernicus-climate/data_for_PET/
gs://copernicus-climate/dataset/
gs://copernicus-climate/european-nuts-lau-geometries.zarr/
gs://copernicus-climate/heatwaves/
gs://copernicus-climate/pet/
gs://copernicus-climate/spain.zarr/
gs://copernicus-climate/tasmax/
gs://copernicus-climate/tasmin/
gs://copernicus-climate/to_delete/


# Utils

Generic helper functions used in the subsequent processing. For easy navigation, each function is placed in its own section, titled with the function name.

## copy_gcs

In [0]:
import os
import subprocess

def copy_gcs(source_list, dest_list, opts=""):
  """
  Use gsutil to copy each corresponding item in source_list
  to dest_list.

  Example:
  copy_gcs(["gs://my-bucket/data-file.csv"], ["."])

  """
  for s, d in zip(source_list, dest_list):
    cmd = f"gsutil -m cp -r {opts} {s} {d}"
    print(f"Processing: {cmd}")
    # subprocess.call returns the exit code of the shell command
    r = subprocess.call(cmd, shell=True)
    if r == 0:
      print("Task created")
    else:
      print("Task failed")
  print("Finished copy")
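
As a possible alternative to building a shell string, the same command can be assembled as an argument list, which avoids quoting problems when paths contain spaces. This is only a sketch: `build_copy_command` is a hypothetical helper, and the bucket path is the one from the docstring example.

```python
import shlex

def build_copy_command(source, dest, opts=""):
    """Return the gsutil copy command as an argument list."""
    # shlex.split turns an option string such as "-n" into separate arguments
    return ["gsutil", "-m", "cp", "-r", *shlex.split(opts), source, dest]

cmd = build_copy_command("gs://my-bucket/data-file.csv", ".", opts="-n")
print(cmd)
# To execute: subprocess.run(cmd, check=True) raises CalledProcessError on failure
```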

## list_paths

In [0]:
import glob
import subprocess

def list_paths(uri_prefix, dir_path, file_pattern="*", gsutil=True, return_dir_path=True):
        ''' Creates a list of full paths.
    
        Uses glob wildcard rules (not regular expressions), allowing flexible patterns.
    
        Parameters
        ----------
        uri_prefix : str
            The (GCS) uri prefix.
        dir_path : str
            Directory path, can use glob wildcards.
        file_pattern : str
            File pattern for glob searching.
        gsutil : bool
            Use gsutil, default is True.
        return_dir_path : bool
            Return directory path relative to uri_prefix, default is True.        
    
        Returns
        -------
        List of path strings.
        
        Examples
        --------
        # Requires authentication
        #list_paths("gs://skydipper-water-quality", "cloud-masks/*", "*.tif", True, False)
        '''
        p = f"{uri_prefix}/{dir_path}/{file_pattern}"
        print(f"Searching {p}")
        if gsutil:
          # gsutil prints one path per line; drop the trailing empty string
          cmd = f"gsutil ls {p}"
          out = subprocess.check_output(cmd, shell=True).decode('utf8').split('\n')
          out.pop(-1)
        else:
          out = glob.glob(p)
        if return_dir_path:
          out = [f.split(uri_prefix)[1] for f in out]  
        print(f"Found {len(out)} path(s)")
        return out

#path_list = list_paths("gs://skydipper-water-quality", "cloud-masks/*", "*.tif", True, True)
#print(path_list[0])
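
The local (non-gsutil) branch relies on `glob` wildcards. The sketch below illustrates how `*` behaves in the directory and file parts of a pattern, using a temporary directory with made-up file names rather than the project data.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Create a small hypothetical directory tree
    for sub, name in [("pet", "a.nc"), ("pet", "b.nc"),
                      ("pet", "notes.txt"), ("tasmax", "c.nc")]:
        os.makedirs(os.path.join(root, sub), exist_ok=True)
        open(os.path.join(root, sub, name), "w").close()

    base = glob.escape(root)  # protect special characters in the temp path
    # "*" in the file part matches any file name within one directory
    one_dir = sorted(os.path.relpath(p, root) for p in glob.glob(f"{base}/pet/*.nc"))
    # "*" in the directory part matches any single directory level
    all_dirs = sorted(os.path.relpath(p, root) for p in glob.glob(f"{base}/*/*.nc"))

print(one_dir)
print(all_dirs)
```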



## mkdirs

In [0]:
from pathlib import Path

def mkdirs(dirs_list, exist_ok=True):
  """ Create nested directories
  """
  for p in dirs_list:
    Path(p).mkdir(parents=True, exist_ok=exist_ok)


## unzip_to_dir

In [0]:
from zipfile import ZipFile 
  
def unzip_to_dir(source_list, dest_list, dry_run=False, view_contents=False):
  for s,d in zip(source_list, dest_list):
    # opening the zip file in READ mode 
    with ZipFile(s, 'r') as archive: 
      if dry_run:
        print(f"Dry run Extracting {s} to {d}")
        if view_contents:
          # printing all the contents of the zip file 
          archive.printdir()
      else:
        # extracting all the files 
        print(f"Extracting {s} to {d}") 
        archive.extractall(path=d) 
      print('Done!') 
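
The inspect-then-extract pattern used above can be tried without the project archives. The following sketch builds a tiny archive in a temporary directory (the file names are made up), lists its members, and then extracts it.

```python
import os
import tempfile
from zipfile import ZipFile

with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "example.zip")
    # Write a one-file archive to work with
    with ZipFile(archive_path, "w") as archive:
        archive.writestr("data/values.csv", "a,b\n1,2\n")

    with ZipFile(archive_path, "r") as archive:
        # Listing members does not write anything to disk (comparable to a dry run)
        members = archive.namelist()
        archive.extractall(path=os.path.join(tmp, "out"))

    extracted = os.path.exists(os.path.join(tmp, "out", "data", "values.csv"))

print(members, extracted)
```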


## extract_string

In [0]:
import os
def extract_string(file_path, split, index):
  """
  Get string by splitting path
  
  @arg file_path The file path string to split
  @arg split Character to split path on
  @arg index Index integer to retain

  @return A string
  """ 
  return os.path.splitext(file_path)[0].split(split)[index]
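
A short usage sketch, applied to one of the archive paths that appears later in this notebook. The function body is repeated only so the example runs on its own; the variable names `subject` and `period` are illustrative.

```python
import os

def extract_string(file_path, split, index):
    # Same logic as the helper above, repeated so the example is self-contained
    return os.path.splitext(file_path)[0].split(split)[index]

fp = "/content/dataset/pet/pet_woman_walking_2005-2019.zip"
subject = extract_string(fp, "_", 1)   # second "_"-separated token
period = extract_string(fp, "_", -1)   # last token, with the extension removed
print(subject, period)
```

Note that the split is applied to the whole path, so a separator that also occurs in a directory name (for example `data_for_PET`) shifts the indices.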

## set_acl_to_public

In [0]:
import subprocess

# Set asset permissions to public for https read
def set_acl_to_public(gs_path):
  """ 
  Set all Google Storage assets to public read access.

  Requires GS authentication

  Parameters
  ----------
  gs_path : str
    The Google Storage path. Note the "-r" option is used, which sets the ACL of all assets below this path.
  """
  cmd = f"gsutil -m acl -r ch -u AllUsers:R {gs_path}"
  print(cmd)
  r = subprocess.call(cmd, shell=True)
  # Compare the exit code by value; "is" tests identity and is unreliable for ints
  if r == 0:
    print("Set acl(s) successful")
  else:
    print("Set acl(s) failed")

#set_acl_to_public("gs://skydipper-water-quality/cloud-masks")

## unchunk_dataset

In [0]:
# unchunk coords
# from xcube
import json
import os.path
from typing import List, Sequence

import numpy as np
import xarray as xr
import zarr


def unchunk_dataset(dataset_path: str, var_names: Sequence[str] = None, coords_only: bool = False):
    """
    Unchunk dataset variables in-place.
    :param dataset_path: Path to ZARR dataset directory.
    :param var_names: Optional list of variable names.
    :param coords_only: Un-chunk coordinate variables only.
    """

    is_zarr = os.path.isfile(os.path.join(dataset_path, '.zgroup'))
    if not is_zarr:
        raise ValueError(f'{dataset_path!r} is not a valid Zarr directory')

    with xr.open_zarr(dataset_path) as dataset:
        if var_names is None:
            if coords_only:
                var_names = list(dataset.coords)
            else:
                var_names = list(dataset.variables)
        else:
            for var_name in var_names:
                if coords_only:
                    if var_name not in dataset.coords:
                        raise ValueError(f'variable {var_name!r} is not a coordinate variable in {dataset_path!r}')
                else:
                    if var_name not in dataset.variables:
                        raise ValueError(f'variable {var_name!r} is not a variable in {dataset_path!r}')

    _unchunk_vars(dataset_path, var_names)


def _unchunk_vars(dataset_path: str, var_names: List[str]):
    for var_name in var_names:
        var_path = os.path.join(dataset_path, var_name)

        # Optimization: if "shape" and "chunks" are equal in ${var}/.zarray, we are done
        var_array_info_path = os.path.join(var_path, '.zarray')
        with open(var_array_info_path, 'r') as fp:
            var_array_info = json.load(fp)
            if var_array_info.get('shape') == var_array_info.get('chunks'):
                continue

        # Open array and remove chunks from the data
        var_array = zarr.convenience.open_array(var_path, 'r+')
        if var_array.shape != var_array.chunks:
            # TODO (forman): Fully loading data is inefficient and dangerous for large arrays.
            #                Instead save unchunked to temp and replace existing chunked array dir with temp.
            # Fully load data and attrs so we no longer depend on files
            data = np.array(var_array)
            attributes = var_array.attrs.asdict()
            # Save array data
            zarr.convenience.save_array(var_path, data, chunks=False, fill_value=var_array.fill_value)
            # zarr.convenience.save_array() does not seem to save user attributes (file ".zattrs" not written),
            # therefore we must modify attrs explicitly:
            var_array = zarr.convenience.open_array(var_path, 'r+')
            var_array.attrs.update(attributes)

## write_to_remote_zarr

In [0]:
import gcsfs
import zarr
import xarray as xr

def write_to_remote_zarr(
    ds,
    group,
    root,
    unchunk_coords = True,
    project_id = gc_project,
    token=f"/root/.{gc_creds}",
    show_tree = True
    ):
  
  # Connect to GS
  gc = gcsfs.GCSFileSystem(project=project_id, token=token)
  store = gc.get_mapper(root, check=False, create=True)
  
  # consolidate metadata at root
  zarr.consolidate_metadata(store)
  
  # Write to zarr group
  ds.to_zarr(store=store, group=group, mode="w", consolidated=True)
  
  # consolidate metadata at root
  zarr.consolidate_metadata(store)
  c = gc.exists(f"{root}/.zmetadata")
  print(f"{root} is consolidated? {c}")
  # unchunk coordinates
  # TODO: optimise this for remote ZARR
  #if unchunk_coords:
  #  unchunk_dataset(store, coords_only = True)
  if show_tree:
    with zarr.open_consolidated(store, mode='r') as z:
      print(z.tree())




## rmv_remote_zarr

In [0]:
import gcsfs
import zarr

def rmv_remote_zarr(
    group,
    root,
    project_id = gc_project,
    token=f"/root/.{gc_creds}",
    show_tree=False):
  
  # Connect to GS
  gc = gcsfs.GCSFileSystem(project=project_id, token=token)
  store = gc.get_mapper(root, check=False, create=True)
  # Remove zarr group
  print(f"Removing {root}/{group}")
  zarr.storage.rmdir(store, path=group)
  # consolidate metadata at root
  zarr.consolidate_metadata(store)
  # Only print the hierarchy when requested
  if show_tree:
    with zarr.open_consolidated(store, mode='r') as z:
      print(z.tree())

In [0]:
%%time
#rmv_remote_zarr('pet-minmax-monthly-era5', root = "copernicus-climate/spain.zarr")

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 6.68 µs


## get_size_remote_zarr

In [0]:
import os
import subprocess

def get_size_remote_zarr(
    group,
    root):
  
  # Get size using gsutil
  if group:
    p = f"gs://{root}/{group}"
  else:
    p = f"gs://{root}"  
  cmd = f"gsutil -m du -sh {p}"
  print(f"Processing: {cmd}")
  r = subprocess.getoutput(cmd)
  print(r)

## get_cached_remote_zarr

In [0]:
import gcsfs
import zarr
import xarray as xr



def get_cached_remote_zarr(
    group,
    root,
    project_id = gc_project,
    token=f"/root/.{gc_creds}"
    ):
  
  # Connect to GS
  gc = gcsfs.GCSFileSystem(project=project_id, token=token)
  store = gc.get_mapper(root, check=False, create=True)
  # Check zarr is consolidated
  consolidated = gc.exists(f'{root}/.zmetadata')
  # Cache the zarr store
  #store = zarr.ZipStore(store, mode='r')
  cache = zarr.LRUStoreCache(store, max_size=None)
  # Return cached zarr group
  return xr.open_zarr(cache, group=group, consolidated=consolidated)

## cold_spell_frequency

In [0]:
# Coldsnap events
from xclim.indices import run_length as rl

def cold_spell_frequency(da_tasmin, da_tasmin_reference, quantile=0.05, windows=[2,4,5], names=['warnings', 'alerts', 'alarms'], freq='MS'):
  # Create quantile for reference array
  da_quantile = da_tasmin_reference.quantile(quantile, dim=["time"])
  syear = da_tasmin_reference.time.dt.strftime('%Y').values[0]
  eyear = da_tasmin_reference.time.dt.strftime('%Y').values[-1]
  # Create bool array
  ba = da_tasmin < da_quantile
  # Resample to freq (use the freq argument rather than a hard-coded 'MS')
  group = ba.resample(time=freq)
  # Calculate sum of events per frequency per window
  da_list = [group.map(rl.windowed_run_events, window=window, dim="time") for window in windows]
  da_dict = dict(zip(names, da_list))
  # Rename and add metadata
  if freq == 'MS': freq = 'Monthly'
  if freq == 'YS': freq = 'Yearly'
  for name, window in zip(names, windows):
    attrs = {
        'units':"",
        'standard_name': "cold_snap_events",
        'long_name': f"Number of cold snap events (Tmin < Tmin_q{quantile} for >= {window} days)",
        'description': f"{freq} number of cold snap events. "
        "An event occurs when the minimum daily "
        "temperature is lower than a specific threshold per cell : "
        f"(Tmin < Tmin_q{quantile}) "
        f"over a minimum number of days ({window}), where "
        f"Tmin_q{quantile} is calculated for the reference time-interval "
        f"{syear}--{eyear}.",
    }
    da_dict[name].attrs = attrs
    da_dict[name].name = f"cold_snap_{name}"
  # Combine into a dataset
  return xr.merge(list(da_dict.values())).drop('quantile')
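
To illustrate what is being counted, here is a pure-Python sketch of event counting on a single cell. It assumes, in line with the `long_name` attribute above, that an event is a run of consecutive flagged days at least `window` days long; `count_run_events` is a hypothetical stand-in and not the xclim implementation, and the daily flags are made up.

```python
from itertools import groupby

def count_run_events(flags, window):
    """Count runs of consecutive True values that are at least `window` long."""
    return sum(
        1
        for is_true, run in groupby(flags)
        if is_true and len(list(run)) >= window
    )

# Hypothetical daily flags: True where Tmin is below the reference quantile
below_threshold = [True, True, False, True, True, True, True, False, True]

# Same windows as the defaults above (warnings, alerts, alarms)
events = {w: count_run_events(below_threshold, w) for w in (2, 4, 5)}
print(events)
```

The series contains runs of length 2, 4 and 1, so a shorter window counts more events than a longer one.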


## heat_wave_frequency

In [0]:
# Heatwave events
from xclim.indices import run_length as rl

def heat_wave_frequency(da_tasmax,
                        da_tasmin,
                        ds_reference,
                        quantiles={'tasmin':0.90, 'tasmax':0.95},
                        windows=[2,4,5],
                        names=['warnings', 'alerts', 'alarms'],
                        freq='MS'):
  # Create quantiles for reference array
  da_tasmax_quantile = ds_reference.tasmax.quantile(quantiles['tasmax'], dim=["time"])
  da_tasmin_quantile = ds_reference.tasmin.quantile(quantiles['tasmin'], dim=["time"])
  syear = ds_reference.time.dt.strftime('%Y').values[0]
  eyear = ds_reference.time.dt.strftime('%Y').values[-1]
  # Create bool array
  ba = (da_tasmin > da_tasmin_quantile) & (da_tasmax > da_tasmax_quantile)
  # Resample to freq (use the freq argument rather than a hard-coded 'MS')
  group = ba.resample(time=freq)
  # Calculate sum of events per frequency per window
  da_list = [group.map(rl.windowed_run_events, window=window, dim="time") for window in windows]
  da_dict = dict(zip(names, da_list))
  # Rename and add metadata
  if freq == 'MS': freq = 'Monthly'
  if freq == 'YS': freq = 'Yearly'
  for name, window in zip(names, windows):
    attrs = {
        'units':"",
        'standard_name': "heat_wave_events",
        'long_name': f"Number of heat wave events (Tmin > Tmin_q{quantiles['tasmin']} & Tmax > Tmax_q{quantiles['tasmax']} for >= {window} days)",
        'description': f"{freq} number of heat wave events. "
        "An event occurs when the minimum and maximum daily "
        "temperatures are higher than specific thresholds per cell : "
        f"(Tmin > Tmin_q{quantiles['tasmin']}) & (Tmax > Tmax_q{quantiles['tasmax']}) "
        f"over a minimum number of days ({window}), where "
        f"Tmin_q{quantiles['tasmin']} and Tmax_q{quantiles['tasmax']} are calculated for the reference time-interval "
        f"{syear}--{eyear}.",
    }
    da_dict[name].attrs = attrs
    da_dict[name].name = f"heatwave_{name}"
  # Combine into a dataset
  return xr.merge(list(da_dict.values()))#.drop('quantile') 



# Processing

Data processing organised into sections.

## Get datasets

In [0]:
# Create directory structure
ds_dir = "dataset"
mkdirs([ds_dir])
# Add pet to process hourly pet
# "pet"
data_dirs = ["pet", "data_for_PET"]

In [0]:
# Get datasets
dest_list = [f"{ds_dir}" for p in data_dirs]
source_list = [f"{gcs_prefix}/{p}" for p in data_dirs]
copy_gcs(source_list, dest_list)

Processing: gsutil -m cp -r  gs://copernicus-climate/pet dataset
Task created
Processing: gsutil -m cp -r  gs://copernicus-climate/data_for_PET dataset
Task created
Finished copy


In [0]:
import glob
# Unzip datasets
source_paths = [glob.glob(f"/content/{ds_dir}/{d}/*.zip") for d in data_dirs]
#print(source_paths)
dest_paths = [[f"/content/{ds_dir}/{d}" for s in sp] for d, sp in zip(data_dirs, source_paths)]
#print(dest_paths)
[unzip_to_dir(sp, dp, dry_run=False) for sp, dp in zip(source_paths, dest_paths)]

Extracting /content/dataset/pet/pet_girl_walking_2015-2019.zip to /content/dataset/pet
Done!
Extracting /content/dataset/pet/pet_woman_lying_2005-2019.zip to /content/dataset/pet
Done!
Extracting /content/dataset/pet/pet_old_woman_walking_2015-2019.zip to /content/dataset/pet
Done!
Extracting /content/dataset/pet/pet_woman_walking_summer_2015-2019.zip to /content/dataset/pet
Done!
Extracting /content/dataset/pet/pet_woman_sport_2015-2019.zip to /content/dataset/pet
Done!
Extracting /content/dataset/pet/pet_woman_walking_winter_2015-2019.zip to /content/dataset/pet
Done!
Extracting /content/dataset/pet/pet_woman_walking_2005-2019.zip to /content/dataset/pet
Done!
Extracting /content/dataset/data_for_PET/t2m_ERA5.zip to /content/dataset/data_for_PET
Done!
Extracting /content/dataset/data_for_PET/mrt_ERA5-HEAT.zip to /content/dataset/data_for_PET
Done!
Extracting /content/dataset/data_for_PET/RH_ERA5.zip to /content/dataset/data_for_PET
Done!
Extracting /content/dataset/data_for_PET/sp_ER

[None, None]

## Prepare Hourly PET

In [0]:
# translate terms
ages = ['girl', 'woman', 'old']
ages2 = ['child', 'adult', 'senior']
age_dict = dict(zip(ages,ages2))

# Combine datasets
import xarray as xr
import pandas as pd
import datetime

def prep_dataset(fp):
  ds = xr.open_dataset(fp)
  d = ds.attrs['description'].split(",")
  mal = float(d[1].split(' Metabolic activity level = ')[1])
  cl = float(d[2].split(' Clothing level = ')[1])
  age = d[0].lower().split('pet ')[1].split(' ')[0]
  ds['gender'] = ['female']
  ds['age_cat'] = [age_dict[age]]
  ds['met'] = [mal]
  ds['clo'] = [cl]
  ds = ds.set_coords(['time', 'longitude', 'latitude', 'gender', 'age_cat', 'met', 'clo'])
  return ds.chunk({'age_cat':1, 'met':1, 'clo':1})

def prep_hourly_pet(ds_dir):
  ds_list = list()
  fps = list_paths(ds_dir, "pet", file_pattern="*.nc", gsutil=False, return_dir_path=False)
  for fp in fps:
    ds_list.append(prep_dataset(fp))
  ds_pet = xr.combine_by_coords(ds_list)
  ds_pet = ds_pet.set_index(lat='latitude', lon='longitude', append=True)    

  # Update metadata
  ds_pet.attrs = {
    'description' : 'Physiologically Equivalent Temperature (PET) calculated for a range of genders, age categories, clothing insulations, and metabolic rates.',
    'history' : '',
    'source' : 'ERA5 available from the Copernicus Climate Data Service'
  }
  ds_pet.pet.time.attrs = {
    'standard_name': 'time',
    'long_name': 'Time',
    'bounds': 'time_bnds',
    'axis': 'T',
    'stored_direction': 'increasing',
    'type': 'double',
    #'units': 'days since 1900-01-01'  
  }
  ds_pet.pet.lat.attrs = {
      'axis' : 'Y',
      'long_name' : 'latitude',
      'standard_name' : 'latitude',
      'stored_direction' : 'increasing',
      'type' : 'double',
      # Latitude is measured in degrees north, within [-90, 90]
      'units' : 'degrees_north',
      'valid_max' : 90.0,
      'valid_min' : -90.0
      }
  ds_pet.pet.lon.attrs = {
      'axis' : 'X',
      'long_name' : 'longitude',
      'standard_name' : 'longitude',
      'stored_direction' : 'decreasing',
      'type' : 'double',
      # Longitude is measured in degrees east
      'units' : 'degrees_east',
      'valid_max' : 360.0,
      'valid_min' : -180.0
      }
  ds_pet.age_cat.attrs = {
    'long_name' : 'Age category',
    'standard_name' : 'age_category'
  }
  ds_pet.met.attrs = {
    'long_name' : 'Metabolic rate',
    'standard_name' : 'metabolic_rate',
    'description' : 'The rate of transformation of chemical energy into heat and mechanical work by metabolic activities of an individual, per unit of skin surface area.',
    'units' : 'W/m2'
  }
  ds_pet.clo.attrs = {
    'long_name' : 'Clothing insulation',
    'standard_name' : 'clothing_insulation',
    'description' : 'Resistance to sensible heat transfer provided by a clothing ensemble, expressed in units of clo, where clo = 0.155 (m2·ºC)/W.',
    'units' : '1'
  }
  ds_pet.gender.attrs = {
    'long_name' : 'Gender',
    'standard_name' : 'gender'
  }
  ds_pet.pet.attrs = {
   'long_name' : 'Physiologically Equivalent Temperature (PET)',
   'standard_name' : 'physiologically_equivalent_temperature',
   'units' : '1' 
  }
  return ds_pet

ds_pet = prep_hourly_pet(ds_dir)

Searching dataset/pet/*.nc
Found 1246 path(s)
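
The string handling in `prep_dataset` can be checked in isolation. The sketch below repeats the same split logic on a hypothetical `description` attribute; the real attribute values are not shown in this notebook, so the example string and the function name `parse_description` are assumptions.

```python
# Mapping from the terms in the file descriptions to age categories (as above)
age_dict = {"girl": "child", "woman": "adult", "old": "senior"}

def parse_description(description):
    """Split a PET description into (age_cat, met, clo), as prep_dataset does."""
    parts = description.split(",")
    # First word after "PET " identifies the age term
    age = parts[0].lower().split("pet ")[1].split(" ")[0]
    met = float(parts[1].split(" Metabolic activity level = ")[1])
    clo = float(parts[2].split(" Clothing level = ")[1])
    return age_dict[age], met, clo

# Hypothetical description string in the layout the parsing code expects
example = "PET woman walking, Metabolic activity level = 2.3, Clothing level = 0.9"
parsed = parse_description(example)
print(parsed)
```

The parsing depends on the exact wording and spacing of the attribute, so a description that deviates from this layout raises an `IndexError` or `KeyError`.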


In [0]:
ds_pet = ds_pet.chunk({'time':24*2, 'age_cat':-1, 'met':-1, 'clo':-1, 'lat':-1, 'lon':-1})
ds_pet

| | Array | Chunk |
|---|---|---|
| Bytes | 31.69 GB | 11.70 MB |
| Shape | (3, 3, 3, 130008, 37, 61) | (3, 3, 3, 48, 37, 61) |
| Count | 37736 Tasks | 2709 Chunks |
| Type | float32 | numpy.ndarray |
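
The sizes reported above follow directly from the array shape, the chunk shape, and the 4-byte float32 item size. A quick arithmetic check, using only the numbers from the output:

```python
import math

shape = (3, 3, 3, 130008, 37, 61)   # full array shape from the output above
chunks = (3, 3, 3, 48, 37, 61)      # chunk shape: 48 hourly steps (2 days)
itemsize = 4                        # bytes per float32 value

array_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunks) * itemsize
# Chunks per dimension are rounded up, so the last time chunk is partial
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(f"{array_bytes / 1e9:.2f} GB, {chunk_bytes / 1e6:.2f} MB, {n_chunks} chunks")
```

Only the time dimension is split, so the chunk count equals the number of 48-hour blocks needed to cover 130008 hourly steps.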


### Write to ZARR

In [0]:
# Use this to rmv any failed runs
#%%time
#rmv_remote_zarr('pet-hourly-era5', root = "copernicus-climate/spain.zarr")
!gsutil -m rm -r gs://copernicus-climate/spain.zarr/pet-hourly-era5/*

Removing gs://copernicus-climate/spain.zarr/pet-hourly-era5/.zgroup#1588547032623067...
/ [1/1 objects] 100% Done                                                       
Operation completed over 1 objects.                                              


In [0]:
%%time
# Create remote ZARR
write_to_remote_zarr(ds_pet, 'pet-hourly-era5', root = "copernicus-climate/spain.zarr")

copernicus-climate/spain.zarr is consoldiated? True
/
 ├── future-longterm-yr10
 │   ├── coldsnap_alarms (8, 5, 2, 91, 151) float64
 │   ├── coldsnap_alerts (8, 5, 2, 91, 151) float64
 │   ├── experiment (2,) object
 │   ├── heatwave_alarms (8, 5, 2, 91, 151) float64
 │   ├── heatwave_alerts (8, 5, 2, 91, 151) float64
 │   ├── lat (91,) float64
 │   ├── lon (151,) float64
 │   ├── model (5,) object
 │   ├── tasmax (8, 5, 2, 91, 151) float32
 │   ├── tasmin (8, 5, 2, 91, 151) float32
 │   └── time (8,) int64
 ├── future-seasonal-monthly
 │   ├── coldsnap_alarms (6, 6, 91, 151) float64
 │   ├── coldsnap_alerts (6, 6, 91, 151) float64
 │   ├── heatwave_alarms (6, 6, 91, 151) float64
 │   ├── heatwave_alerts (6, 6, 91, 151) float64
 │   ├── lat (91,) float64
 │   ├── lon (151,) float64
 │   ├── model (6,) object
 │   ├── tasmax (6, 6, 91, 151) float32
 │   ├── tasmin (6, 6, 91, 151) float32
 │   └── time (6,) int64
 ├── historical-monthly
 │   ├── cold_snap_alarms (468, 127, 211) float64
 

In [0]:
#%%time
get_size_remote_zarr('pet-hourly-era5', root = "copernicus-climate/spain.zarr")

NameError: ignored