## Cleaning up Atlas data - ETHZ ClimWIP weights
**Function**      : Preprocess netCDF files and restructure the dataset<br>
**Description**   : This notebook cleans up Atlas data, which is provided in netCDF format, and aggregates it into one file per variable.<br>
**Return Values** : .nc files<br>
**Note**          : All data is saved in netCDF4 format. Note that data from different models may differ in resolution and coordinates.<br>

In [1]:
import os
from pathlib import Path
import xarray as xr

### Path
Specify the path to the dataset and the place to save the outputs. <br>

In [2]:
# please specify data path
datapath = Path("./AtlasData/raw/weights")

# please specify output path
output_path = Path("./AtlasData/preprocess/weights")
os.makedirs(output_path, exist_ok=True)

In [3]:
output_file_name = {
    "prefix": "atlas",
    "activity": "EUCP",  # project name e.g. EUCP
    "institution_id": "ETHZ",  # ETHZ
    "source": "CMIP6",  # e.g. CMIP6 or CMIP5
    "method": "ClimWIP",  # e.g. ClimWIP
    "cmor_var": "tas",  # e.g. tas or pr
    "suffix": "weights",
}
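The components above are later joined with underscores to form the output file name, with `cmor_var` replaced by the variable being processed. A minimal sketch of that step, where `build_file_name` is a hypothetical helper introduced only for illustration:

```python
def build_file_name(parts, cmor_var):
    """Join the naming components with underscores, overriding the variable."""
    # Overriding an existing key keeps its position in the dict,
    # so the order of the name components is unchanged.
    fields = {**parts, "cmor_var": cmor_var}
    return "_".join(fields.values()) + ".nc"


parts = {
    "prefix": "atlas",
    "activity": "EUCP",
    "institution_id": "ETHZ",
    "source": "CMIP6",
    "method": "ClimWIP",
    "cmor_var": "tas",
    "suffix": "weights",
}

print(build_file_name(parts, "pr"))
```

Because the helper builds a new dictionary, the original `parts` is left untouched, whereas the notebook loop updates `output_file_name["cmor_var"]` in place.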

Define the metadata. Here, we follow the CF conventions as much as possible.

In [4]:
attrs_vars = {
    "tas": {
        "description": "Change in Air Temperature",
        "standard_name": "Change in Air Temperature",
        "long_name": "Change in Near-Surface Air Temperature",
        "units": "K",
        "cell_methods": "time: mean changes over 20 years 2041-2060 vs 1995-2014",
    },
    "pr": {
        "description": "Relative precipitation",
        "standard_name": "Relative precipitation",
        "long_name": "Relative precipitation",
        "units": "%",
        "cell_methods": "time: mean changes over 20 years 2041-2060 vs 1995-2014",
    },
    "weight": {
        "description": ("ClimWIP has a single weighting for a model for the whole domain. "
                        "ClimWIP weights depend (to a degree) on the composition of the ensemble as they include an independence criterion."),
        "standard_name": "weight",
        "long_name": "weight",
        "units": "1",
    },
    "latitude": {"units": "degrees_north", "long_name": "latitude", "axis": "Y"},
    "longitude": {"units": "degrees_east", "long_name": "longitude", "axis": "X"},
    "time": {
        "climatology": "climatology_bounds",
        "long_name": "time",
        "axis": "T",
        "climatology_bounds": ["2050-6-1", "2050-9-1", "2050-12-1", "2051-3-1"],
        "description": "mean changes over 20 years 2041-2060 vs 1995-2014. The mid point 2050 is chosen as the representative time.",
    },
    "model": {
        "units": "1",
        "long_name": "model",
        "axis": "Z",
        "description": "the model dimension uses the convention: <model name>_<number of ensemble members OR ensemble member ID if only one>_<model generation>",
    },
}
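The `model` description above defines a three-part label. As a rough sketch of how such labels could be split back into their parts, the hypothetical `parse_model_label` below assumes that the last two underscore-separated fields are always the ensemble information and the model generation. The example labels are illustrative and not taken from the data files.

```python
from typing import NamedTuple, Optional


class ModelLabel(NamedTuple):
    name: str
    n_members: int
    member_id: Optional[str]  # only set when a single member is used
    generation: str


def parse_model_label(label: str) -> ModelLabel:
    """Split '<model name>_<members or member ID>_<generation>'."""
    # Split from the right so that underscores inside the model name survive.
    name, members, generation = label.rsplit("_", 2)
    if members.isdigit():
        # A plain number is read as the count of ensemble members.
        return ModelLabel(name, int(members), None, generation)
    # Anything else is read as the ID of the single member used.
    return ModelLabel(name, 1, members, generation)


# Illustrative labels (assumptions, not read from the raw files)
print(parse_model_label("CanESM5_50_CMIP6"))
print(parse_model_label("ACCESS-CM2_r1i1p1f1_CMIP6"))
```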

### Load and process raw data
Load the data, clean it, and save it. Each variable gets one output file that contains the weights, the data, and the model names.

In [5]:
institution_id = output_file_name["institution_id"]
method = output_file_name["method"]

TIMES = {
    "JJA": "2050-7-16",
    "DJF": "2051-1-16",
}  # "0000-4-16", "0000-7-16", "0000-10-16", "0000-1-16" MAM JJA SON DJF
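The season keys in `TIMES` also determine which raw files are opened, since the processing loop builds each input file name from the variable and the lower-cased season. A small sketch for listing the expected inputs and spotting missing ones before running the loop; `expected_inputs` is a hypothetical helper, not part of the notebook:

```python
from pathlib import Path

TIMES = {"JJA": "2050-7-16", "DJF": "2051-1-16"}


def expected_inputs(datapath, institution_id, method, variables=("tas", "pr")):
    """Return {(variable, season): path} for every raw file the loop opens."""
    folder = Path(datapath) / f"{institution_id}_{method}"
    return {
        (variable, season): folder / f"eur_{variable}_41-60_{season.lower()}_cmip6.nc"
        for variable in variables
        for season in TIMES
    }


paths = expected_inputs("./AtlasData/raw/weights", "ETHZ", "ClimWIP")
missing = [str(path) for path in paths.values() if not path.exists()]
print(f"{len(paths)} files expected, {len(missing)} missing")
```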

In [10]:
for variable in ["tas", "pr"]:
    seasons = []
    file_names = []
    for season in TIMES:
        file_name = f"eur_{variable}_41-60_{season.lower()}_cmip6.nc"
        file_names.append(file_name)
        ds = xr.open_dataset(datapath / f"{institution_id}_{method}/{file_name}")
        
        # calculate relative precipitation
        if variable == "pr":
            ds_rel = ds.assign(pr=(ds["pr_mean"] / ds["pr_clim_mean"]) * 100)

            # drop the intermediate variables, rename variables correctly, add time dimension
            renamed_var = {
                "weights_mean": "weight",
                "lon": "longitude",
                "lat": "latitude",
            }
            # drop_vars replaces the deprecated Dataset.drop
            new_ds = (
                ds_rel.drop_vars(["pr_clim_mean", "pr_mean"])
                .rename(renamed_var)
                .expand_dims({"time": [TIMES[season]]})
            )

        else:
            # drop clim_mean, rename variables correctly, add time dimension
            renamed_var = {
                "weights_mean": "weight",
                "tas_mean": "tas",
                "lon": "longitude",
                "lat": "latitude",
            }
            # drop_vars replaces the deprecated Dataset.drop
            new_ds = (
                ds.drop_vars("tas_clim_mean")
                .rename(renamed_var)
                .expand_dims({"time": [TIMES[season]]})
            )
        
        # Fix attributes of each variable
        for key in new_ds.keys():
            new_ds[key].attrs = attrs_vars[key]

        # collect the dataset of each season
        seasons.append(new_ds)

    # concatenate the seasons along the time dimension
    data_variable = xr.concat(seasons, dim="time")

    # Fix attributes of dataset
    attrs_ds = {
        "description": f"Contains modified {institution_id} {method} data used for Atlas in EUCP project.",
        "history": (
            f"original {institution_id} {method} data files: "
            f"{file_names}"
        ),
    }
    data_variable.attrs = attrs_ds
    
    # save the data
    output_file_name["cmor_var"] = variable
    file_name = f"{'_'.join(output_file_name.values())}.nc"
    data_variable.to_netcdf(output_path / file_name)
    print(f"One dataset is saved to {file_name}")

One dataset is saved to atlas_EUCP_ETHZ_CMIP6_ClimWIP_tas_weights.nc
One dataset is saved to atlas_EUCP_ETHZ_CMIP6_ClimWIP_pr_weights.nc
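To make the per-season steps easier to follow without the raw files, here is a minimal sketch on synthetic data. The layout (dimensions `model`, `lat`, `lon` and variables `pr_mean`, `pr_clim_mean`, `weights_mean`) is an assumption based on the names used in the loop; all values and model labels are made up.

```python
import numpy as np
import xarray as xr

TIMES = {"JJA": "2050-7-16", "DJF": "2051-1-16"}


def make_season(pr_change):
    """Build a tiny synthetic dataset mimicking the assumed raw layout."""
    dims = ("model", "lat", "lon")
    return xr.Dataset(
        {
            "pr_mean": (dims, np.full((2, 2, 3), pr_change)),
            "pr_clim_mean": (dims, np.full((2, 2, 3), 2.0)),
            "weights_mean": (("model",), np.array([0.25, 0.75])),
        },
        coords={
            "model": ["A_1_CMIP6", "B_3_CMIP6"],  # made-up labels
            "lat": [40.0, 50.0],
            "lon": [0.0, 10.0, 20.0],
        },
    )


seasons = []
for season, pr_change in zip(TIMES, [0.5, 1.0]):
    ds = make_season(pr_change)
    # relative precipitation in percent of the climatological mean
    ds = ds.assign(pr=ds["pr_mean"] / ds["pr_clim_mean"] * 100)
    ds = ds.drop_vars(["pr_mean", "pr_clim_mean"]).rename(
        {"weights_mean": "weight", "lon": "longitude", "lat": "latitude"}
    )
    # add a length-one time dimension labelled with the season's date
    seasons.append(ds.expand_dims({"time": [TIMES[season]]}))

# stack the seasons along the new time dimension
combined = xr.concat(seasons, dim="time")
print(combined)
```

Note that `expand_dims` adds the `time` dimension to every data variable, so in this sketch `weight` ends up with the dimensions (`time`, `model`) as well.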


### Check input and output

In [11]:
# load one of the inputs for tas
ds = xr.open_dataset(datapath / f"{institution_id}_{method}" / "eur_tas_41-60_djf_cmip6.nc")
ds

In [12]:
# load output for tas
ds = xr.open_dataset(output_path / "atlas_EUCP_ETHZ_CMIP6_ClimWIP_tas_weights.nc")
ds

In [13]:
# load one of the inputs for pr
ds = xr.open_dataset(datapath / f"{institution_id}_{method}" / "eur_pr_41-60_djf_cmip6.nc")
ds

In [14]:
# load output for pr
ds = xr.open_dataset(output_path / "atlas_EUCP_ETHZ_CMIP6_ClimWIP_pr_weights.nc")
ds