### Writing data to an existing zarr store (INCA)

Writing data to an existing store should normally be done automatically with [Argo Workflows](ArgoWorkflows.ipynb). This notebook describes how to do it manually.

#### Necessary imports

In [None]:
import xarray as xr
import numpy as np
import zarr
import pandas as pd
import os

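This helper returns the start index and the exclusive end index of `array2` within `array1`. It is used later to locate the spatial extent of a file inside the store.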

In [None]:
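def get_idx(array1, array2):
    # Position of the first and last value of array2 within array1.
    # The end index is exclusive, so the pair can be used directly as a slice.
    idx_min = np.where(array1 == array2[0])[0][0]
    idx_max = np.where(array1 == array2[-1])[0][0] + 1
    return idx_min, idx_max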
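
As a quick illustration of what the helper returns, the sketch below applies it to two small, made-up coordinate arrays (the values are hypothetical, not real INCA coordinates). It assumes that the values of the second array appear in the first one as a single contiguous run in the same order.

```python
import numpy as np

def get_idx(array1, array2):
    # Same logic as the helper above, repeated so the example runs on its own
    start = np.where(array1 == array2[0])[0][0]
    stop = np.where(array1 == array2[-1])[0][0] + 1
    return start, stop

# Hypothetical coordinates: a full axis and a subset of it
full_axis = np.array([0, 1000, 2000, 3000, 4000, 5000])
subset = np.array([2000, 3000, 4000])

start, stop = get_idx(full_axis, subset)

# The slice start:stop selects exactly the subset from the full axis
recovered = full_axis[start:stop]
```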

#### Writing data

Define the path to your store

In [None]:
store_path = "INCA.zarr"

In a [previous step](INCA_download.ipynb), the necessary data was downloaded from the [GeoSphere Austria Data Hub](https://data.hub.geosphere.at/dataset/inca-v1-1h-1km) and saved to the folder `INCA_data`. Each parameter in the dataset is written separately; the example below uses `T2M`.

In [None]:
param = "T2M"

# Get the paths to each file
folder_path = f'INCA_data/{param}'
filepaths = []
for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)
    if os.path.isfile(file_path):
        filepaths.append(file_path)
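
`os.listdir` returns entries in arbitrary order. The order does not affect where each file is written, but a sorted list makes the progress output easier to follow. A minimal sketch of an alternative, assuming the file names sort in a meaningful (for example chronological) order; `list_files_sorted` is a hypothetical helper name, not part of the original notebook:

```python
from pathlib import Path

def list_files_sorted(folder):
    """Return the paths of all regular files in `folder`, sorted by file name."""
    # Sub-directories are skipped, just like in the os.listdir loop above
    return sorted(str(p) for p in Path(folder).iterdir() if p.is_file())
```

With the variables from the cell above, this could be called as `filepaths = list_files_sorted(folder_path)`.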

The writing is done by looping through the files

In [None]:
# Open the store and read the metadata needed for writing
store = zarr.storage.LocalStore(store_path)
group = zarr.group(store=store)

dtype = group[param].dtype
fill_value = group[param].attrs.get('_FillValue')
freq = group.attrs.get('freq')

# Get the x and y extent of the group
x_extent = group["x"][:]
y_extent = group["y"][:]

# Get the origin time value from the zarr store
origin = xr.open_zarr(store_path).time[0].values

# Loop through the files
for i, file in enumerate(filepaths):
    
    # Reading the data
    data = xr.open_dataset(file, chunks={}, mask_and_scale=False) # Set mask_and_scale to False to get the raw data values
    data = data.load()

    # If the loaded data has a smaller spatial extent than the whole extent of the zarr store, you need to know the indexes where to write the data.
    x_min, x_max = get_idx(x_extent, data["x"].values)
    y_min, y_max = get_idx(y_extent, data["y"].values)

    # To write at the correct time coordinates you need the start time and the (exclusive) end time of your data
    time_min, time_max = data.time.values[0].astype("datetime64[h]"), data.time.values[-1].astype("datetime64[h]")+1
    
    # And the corresponding indexes in the zarr store, counted in hours from the origin.
    # Dividing by one hour avoids getting nanosecond counts when origin has nanosecond resolution.
    one_hour = np.timedelta64(1, "h")
    time_delta_min, time_delta_max = int((time_min - origin) // one_hour), int((time_max - origin) // one_hour)

    ### If you have a missing timestep in your data this snippet will add the timestep filled with FillValue ###
    
    # First create a range where all timesteps are present (time_max is exclusive, so the right edge is left out)
    full_range = pd.date_range(time_min, time_max, freq=freq, inclusive="left").values.astype("datetime64[ns]")
    
    # Check whether every timestep of the full range is present in the data
    if not np.isin(full_range, data.time.values).all():
        # If there are missing timesteps, a dataset is created with all timesteps and only FillValues
        print(f"{file} Data incomplete")
        empty_array = np.full((full_range.shape[0], data["x"].values.shape[0], data["y"].values.shape[0]),
                              fill_value=fill_value, dtype=dtype)

        template = xr.Dataset({param: (("time", "x", "y"), empty_array)},
                              coords={
                                  "time": full_range,
                                  "x": data["x"].values,
                                  "y": data["y"].values
                              }
                              )

        # The data is then combined to form a dataset with all timesteps present
        data = data.combine_first(template)
        print(f"{file} Data gaps filled with no data values")
    ### end of snippet ###

    # Write the data into the zarr array at the correct indices in the spatial and temporal domain
    group[param][time_delta_min:time_delta_max, y_min:y_max, x_min:x_max] = data[param].values

    print(f"{file} written to zarr store. {i + 1}/{len(filepaths)} complete💌")
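
The time handling in the loop can be hard to follow, so here is a small standalone sketch of the same idea using only NumPy. It shows how timestamps are turned into integer positions relative to the store origin, and how a missing timestep can be filled with a no-data value. The origin, timestamps, values and fill value are all made up for illustration, and an hourly frequency is assumed.

```python
import numpy as np

one_hour = np.timedelta64(1, "h")

# Hypothetical origin of the store's time axis (nanosecond resolution)
origin = np.datetime64("2020-01-01T00:00:00", "ns")

# Hypothetical file: hourly data from 2020-01-02 00:00 to 03:00, with 02:00 missing
file_times = np.array(
    ["2020-01-02T00", "2020-01-02T01", "2020-01-02T03"], dtype="datetime64[h]"
)
file_values = np.array([10, 11, 13], dtype="int16")
fill_value = -9999  # assumed no-data value

# Start and exclusive end of the file, and their positions on the store's time axis
time_min = file_times[0]
time_max = file_times[-1] + one_hour
idx_min = int((time_min - origin) // one_hour)
idx_max = int((time_max - origin) // one_hour)

# Complete hourly range; its exclusive end matches the slice idx_min:idx_max
full_range = np.arange(time_min, time_max, one_hour)
present = np.isin(full_range, file_times)

# Start from an array of fill values and copy the available steps into place
filled = np.full(full_range.shape, fill_value, dtype=file_values.dtype)
filled[present] = file_values
```

The length of `filled` equals `idx_max - idx_min`, which is the condition for the slice assignment into the zarr array to succeed.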

#### Inspecting the zarr store

After writing to the store you should check if the data in the store matches the original data.

In [None]:
original_data = xr.open_dataset(filepaths[0]).load()

In [None]:
time_min = original_data.time.values[0]
time_max = original_data.time.values[-1]
zarr_data = xr.open_zarr(store_path)[param].sel(time=slice(time_min, time_max)).load()
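
The cells above only load the two datasets; the comparison itself could look like the following sketch. `arrays_match` is a hypothetical helper, and the toy arrays stand in for `original_data[param].values` and `zarr_data.values`. It assumes that both arrays use the same dimension order and that exact equality is expected; if the two sources are decoded with different floating point precision, `np.allclose` may be the better choice.

```python
import numpy as np

def arrays_match(original, stored):
    """Return True if both arrays have the same shape and values (NaN counts as equal to NaN)."""
    return bool(np.array_equal(np.asarray(original), np.asarray(stored), equal_nan=True))

# Toy stand-ins for the original file and the data read back from the store
a = np.array([[1.0, np.nan], [3.0, 4.0]])
b = a.copy()      # identical copy
c = a.copy()
c[1, 1] = 5.0     # one value differs

same = arrays_match(a, b)
different = arrays_match(a, c)
```

With the variables from the cells above, the call would be `arrays_match(original_data[param].values, zarr_data.values)`.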