## Cleaning up Atlas data - ICTP REA weights
**Function**      : Preprocess netCDF files and restructure the dataset<br>
**Description**   : This notebook cleans up Atlas data, which is provided in netCDF format, and aggregates it into a single file per variable.<br>
**Return Values** : .nc files<br>
**Note**          : All the data is saved in netCDF4 format. Note that data from different models may differ in resolution and coordinates.<br>

In [1]:
import os
from pathlib import Path
import xarray as xr

Specify the path to the dataset and the place to save the outputs. <br>

In [2]:
# please specify data path
datapath = Path("./AtlasData/raw/weights")

# please specify output path
output_path = Path("./AtlasData/preprocess/weights")
os.makedirs(output_path, exist_ok=True)

Components used to create the output file names. Here, only `institution_id` and `cmor_var` are based on the CMIP DRS conventions.

In [3]:
output_file_name = {
    "prefix": "atlas",
    "activity": "EUCP",  # project name e.g. EUCP
    "institution_id": "ICTP",  # ICTP
    "source": "CMIP6",  # e.g. CMIP6 or CMIP5
    "method": "REA",  # e.g. REA
    "cmor_var": "tas",  # e.g. tas or pr
    "suffix": "weights",
}
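
As a minimal sketch of how these components end up in a file name: the save step further down joins the dictionary values with underscores and appends `.nc`. The helper `build_file_name` below is hypothetical (it is not part of the notebook) and works on a copy so that the original dictionary is left untouched.

```python
output_file_name = {
    "prefix": "atlas",
    "activity": "EUCP",
    "institution_id": "ICTP",
    "source": "CMIP6",
    "method": "REA",
    "cmor_var": "tas",
    "suffix": "weights",
}


def build_file_name(parts, variable):
    """Join the name components with underscores for a given variable.

    Hypothetical helper: it copies the dictionary instead of mutating it.
    Updating an existing key keeps its position, so the order is preserved.
    """
    parts = {**parts, "cmor_var": variable}
    return "_".join(parts.values()) + ".nc"


for variable in ["tas", "pr"]:
    print(build_file_name(output_file_name, variable))
```

The two printed names should match the ones reported by the save step of this notebook.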

Make some metadata. Here, we follow CF-conventions as much as possible.

In [4]:
attrs_vars = {
    "tas": {
        "description": "Change in Air Temperature",
        "standard_name": "Change in Air Temperature",
        "long_name": "Change in Near-Surface Air Temperature",
        "units": "K",
        "cell_methods": "time: mean changes over 20 years 2041-2060 vs 1995-2014",
    },
    "pr": {
        "description": "Relative precipitation",
        "standard_name": "Relative precipitation",
        "long_name": "Relative precipitation",
        "units": "%",
        "cell_methods": "time: mean changes over 20 years 2041-2060 vs 1995-2014",
    },
    "weight": {
        "description": ("The weights depend on the chosen ensemble, "
                        "the convergence criterion considers the degree to which a model-simulated change is an outlier compared to the other models. "
                        "So each weight can be associated to each model always considering the model as part of an ensemble."),
        "standard_name": "weight",
        "long_name": "weight",
        "units": "1",
    },
    "latitude": {"units": "degrees_north", "long_name": "latitude", "axis": "Y"},
    "longitude": {"units": "degrees_east", "long_name": "longitude", "axis": "X"},
    "time": {
        "climatology": "climatology_bounds",
        "long_name": "time",
        "axis": "T",
        "climatology_bounds": ["2050-6-1", "2050-9-1", "2050-12-1", "2051-3-1"],
        "description": "mean changes over 20 years 2041-2060 vs 1995-2014. The mid point 2050 is chosen as the representative time.",
    },
    "model": {"units": "1", "long_name": "model", "axis": "Z"},
}


Load the data, clean it, and save it. There is one output file per variable, containing the weights, the data, and the model names.

In [5]:
institution_id = output_file_name["institution_id"]
method = output_file_name["method"]

model_names = [
    'AWI-CM-1-1-MR', 'BCC-CSM2-MR', 'CanESM5', 'CESM2',
    'CESM2-WACCM', 'CNRM-CM6-1', 'CNRM-ESM2-1', 'EC-Earth3',
    'EC-Earth3-Veg', 'GFDL-CM4', 'IPSL-CM6A-LR',
    'MIROC6', 'MRI-ESM2-0', 'NESM3', 'UKESM1-0-LL',
] 

TIMES = {
    "JJA": "2050-7-16",
    "DJF": "2051-1-16",
}  # "0000-4-16", "0000-7-16", "0000-10-16", "0000-1-16" MAM JJA SON DJF
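
The loop in the next cell reads one input file per variable and season. As a small sketch, assuming the directory layout and file naming pattern used in that cell, the expected input paths can be listed and checked before any data is loaded. The names `expected` and `missing` are illustrative only.

```python
from pathlib import Path

datapath = Path("./AtlasData/raw/weights")
institution_id, method = "ICTP", "REA"
seasons = ["JJA", "DJF"]

# One file per (variable, season) pair, following the pattern used in the loop.
expected = [
    datapath / f"{institution_id}_{method}" / f"cat_both_{variable}_rcp85_{season}_box.nc"
    for variable in ["tas", "pr"]
    for season in seasons
]

# Files that are not present on disk; an empty list means all inputs were found.
missing = [path for path in expected if not path.exists()]

for path in expected:
    status = "missing" if path in missing else "found"
    print(f"{path.as_posix()} ({status})")
```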

In [6]:
for variable in ["tas", "pr"]:
    seasons = []
    file_names = []
    for season in TIMES:
        file_name = f"cat_both_{variable}_rcp85_{season}_box.nc"
        ds = xr.open_dataset(datapath / f"{institution_id}_{method}/{file_name}")
        file_names.append(file_name)

        # drop time_bnds, rename variables correctly, add time dimension
        renamed_var = {
            "erre": "weight",
            "time": "model",
            "lon": "longitude",
            "lat": "latitude",
        }
        ds_with_new_dims = (
            ds.drop_vars("time_bnds")
            .rename(renamed_var)
            .expand_dims({"time": [TIMES[season]]})
        )

        # use the models names for model dimension
        new_ds = ds_with_new_dims.assign({"model": model_names})

        # Fix attributes of each variable
        for key in new_ds.keys():
            new_ds[key].attrs = attrs_vars[key]

        # collect the cleaned dataset for this season
        seasons.append(new_ds)

    # concatenate the two seasons along the time dimension
    data_variable = xr.concat(seasons, dim="time")

    # Fix attributes of dataset
    attrs_ds = {
        "description": f"Contains modified {institution_id} {method} data used for Atlas in EUCP project.",
        "history": (
            f"original {institution_id} {method} data files: "
            f"{file_names}"
        ),
    }
    data_variable.attrs = attrs_ds
    
    # save the data
    output_file_name["cmor_var"] = variable
    file_name = f"{'_'.join(output_file_name.values())}.nc"
    data_variable.to_netcdf(output_path / file_name)
    print(f"one dataset is saved to {file_name}")

one dataset is saved to atlas_EUCP_ICTP_CMIP6_REA_tas_weights.nc
one dataset is saved to atlas_EUCP_ICTP_CMIP6_REA_pr_weights.nc
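
To make the restructuring above easier to follow, here is a minimal sketch on synthetic data. It assumes, as the renaming in the loop suggests, that the raw files store the model axis under the name `time`. The function `make_raw`, the random values, the grid, and the placeholder model names are all invented for illustration. The sketch uses `assign_coords` to label the model axis, which may differ slightly from the `assign` call used in the notebook.

```python
import numpy as np
import xarray as xr


def make_raw(n_models, seed):
    """Build a hypothetical stand-in for one raw file (not real Atlas data)."""
    rng = np.random.default_rng(seed)
    return xr.Dataset(
        {
            "erre": (("time", "lat", "lon"), rng.random((n_models, 2, 3))),
            "time_bnds": (("time", "bnds"), np.zeros((n_models, 2))),
        },
        coords={
            "time": np.arange(n_models),
            "lat": [40.0, 45.0],
            "lon": [0.0, 5.0, 10.0],
        },
    )


model_names = ["model-A", "model-B", "model-C"]  # placeholder names
times = {"JJA": "2050-7-16", "DJF": "2051-1-16"}

seasons = []
for i, (season, stamp) in enumerate(times.items()):
    raw = make_raw(len(model_names), seed=i)
    cleaned = (
        raw.drop_vars("time_bnds")  # bounds are not needed in the output
        .rename({"erre": "weight", "time": "model",
                 "lat": "latitude", "lon": "longitude"})
        .expand_dims({"time": [stamp]})  # new length-1 time axis per season
        .assign_coords(model=model_names)  # label the model axis by name
    )
    seasons.append(cleaned)

# Stack the seasonal datasets along the new time axis.
combined = xr.concat(seasons, dim="time")
print(combined["weight"].dims)
print(dict(combined.sizes))
```

Each seasonal dataset gains a length-one `time` dimension, so concatenating along `time` yields one entry per season while the model, latitude, and longitude axes stay aligned.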


### Check input and output

In [7]:
# load one of the input for tas
ds = xr.open_dataset(datapath / f"{institution_id}_{method}" / "cat_both_tas_rcp85_DJF_box.nc")
ds

In [8]:
# load output for tas
ds = xr.open_dataset(output_path / "atlas_EUCP_ICTP_CMIP6_REA_tas_weights.nc")
ds

In [9]:
# load one of the input for pr
ds = xr.open_dataset(datapath / f"{institution_id}_{method}" / "cat_both_pr_rcp85_DJF_box.nc")
ds

In [10]:
# load output for pr
ds = xr.open_dataset(output_path / "atlas_EUCP_ICTP_CMIP6_REA_pr_weights.nc")
ds