Adding caravan forcing (#407)
* adding caravan forcing class.

* forgot to pass basin_id correctly

* wrong path

* specify output path for extraction

* wrong var_name path

* replacing geopandas and wget

* adding lat/lon to dimensions based on shape

* issue with to_array(), should be to_numpy()

* wrong dataset

* wrong dataset ref

* move basin properties to coordinates

* Visual tweaks based on sonarcloud feedback

* change intersection to union

* back to intersection, change order

* add rename era5 dict for now

* also rename variables tuple

* add example

* running pre-commit & refactoring according to suggestions

* refactor name, calling `generate` throws exception

* Fix CI problems

* Debug failing test

* Couldn't get temp file to work ('only' 30 MB)

* change config, remove old test file

* re-add files

* should be `types-requests`

* swap requests with urllib3

* pre-commit changes

* recheck tests

* re-add failing line

* Check if both pass without failing test

* Adding test & test data

* Run pre-commit, fix formatting

* still failing black-jupyter linter

* Fix import sorting, formatting

* Remove "end-of-file" hook: unconfigurable to ignore shapefiles

* Pin sonar workflow instead of using master

* mocking get_dataset correctly

* Re-adding failing test test_util.test_merge_esmvaltool_datasets

* adjust & run pre-commit

* fixing indent in .pre-commit

* test new generate object

* change path in test

* add link to example notebook

* fix failing pre-commit tests

* issues with tests/setting variables correctly after refactor

* re-add failing test

* ensure history is also copied

* final tweaks & changelog

* Update CHANGELOG.md

Co-authored-by: Bart Schilperoort <b.schilperoort@gmail.com>

* catch missing basin_id correctly

Co-authored-by: Bart Schilperoort <b.schilperoort@gmail.com>

* make basin_ids more findable

---------

Co-authored-by: daafip <daafip@gmail.com>
Co-authored-by: Bart Schilperoort <b.schilperoort@esciencecenter.nl>
Co-authored-by: Bart Schilperoort <b.schilperoort@gmail.com>
4 people committed May 2, 2024
1 parent d4fd99e commit 12d3672
Showing 19 changed files with 404 additions and 1 deletion.
2 changes: 1 addition & 1 deletion .github/workflows/sonar.yml
@@ -33,7 +33,7 @@ jobs:
- name: Correct coverage paths
run: sed -i "s+$PWD/++g" coverage.xml
- name: SonarCloud Scan
uses: sonarsource/sonarcloud-github-action@master
uses: sonarsource/sonarcloud-github-action@v2.1.1
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
4 changes: 4 additions & 0 deletions .pre-commit-config.yaml
@@ -8,6 +8,10 @@ repos:
- id: check-merge-conflict
- id: debug-statements
- id: end-of-file-fixer
exclude:
(?x)^(
tests/src/base/forcing_files/camels_03439000.*
)$
- id: trailing-whitespace
args: [--markdown-linebreak-ext=md]
- repo: https://github.com/adrienverge/yamllint
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ Formatted as described on [https://keepachangelog.com](https://keepachangelog.com)
- Added the `input_var_names` property to the eWaterCycle model class, to accompany the existing `output_var_names` property ([#403](https://github.com/eWaterCycle/ewatercycle/pull/403)).
- Added the `var_units` method to the eWaterCycle model class, to mirror PyMT ([#403](https://github.com/eWaterCycle/ewatercycle/pull/403)).
- Added a note on the time units that eWaterCycle expects models to provide to the documentation ([#403](https://github.com/eWaterCycle/ewatercycle/pull/403)).
- Added `CaravanForcing`, which allows for the retrieval of single basins from the Caravan dataset rather than downloading the whole dataset ([#407](https://github.com/eWaterCycle/ewatercycle/pull/407)).

### Changed
- If a model plugin raises an exception during the loading of the model entry points, a more clear exception is raised which guides the users on how to solve the error ([#404](https://github.com/eWaterCycle/ewatercycle/pull/404)).
1 change: 1 addition & 0 deletions setup.cfg
@@ -85,6 +85,7 @@ dev =
twine
types-python-dateutil
types-PyYAML
types-requests

[options.package_data]
ewatercycle =
307 changes: 307 additions & 0 deletions src/ewatercycle/_forcings/caravan.py
@@ -0,0 +1,307 @@
import shutil
import zipfile
from pathlib import Path
from typing import Type

import fiona
import numpy as np
import pandas as pd
import urllib3
import xarray as xr
from cartopy.io import shapereader

from ewatercycle.base.forcing import DefaultForcing
from ewatercycle.esmvaltool.schema import Dataset
from ewatercycle.util import get_time

COMMON_URL = "ca13056c-c347-4a27-b320-930c2a4dd207"
OPENDAP_URL = f"https://opendap.4tu.nl/thredds/dodsC/data2/djht/{COMMON_URL}/1/"
SHAPEFILE_URL = (
f"https://data.4tu.nl/file/{COMMON_URL}/bbe94526-cf1a-4b96-8155-244f20094719"
)

PROPERTY_VARS = [
"timezone",
"name",
"country",
"lat",
"lon",
"area",
"p_mean",
"pet_mean",
"aridity",
"frac_snow",
"moisture_index",
"seasonality",
"high_prec_freq",
"high_prec_dur",
"low_prec_freq",
"low_prec_dur",
]

RENAME_ERA5 = {
"total_precipitation_sum": "pr",
"potential_evaporation_sum": "evspsblpot",
"temperature_2m_mean": "tas",
"temperature_2m_min": "tasmin",
"temperature_2m_max": "tasmax",
"streamflow": "Q",
}


class CaravanForcing(DefaultForcing):
"""Retrieves a specified part of the Caravan dataset from the OpenDAP server.
Examples:
The Caravan dataset is an already prepared dataset by Frederik Kratzert
(see https://doi.org/10.1038/s41597-023-01975-w).
This class retrieves it from the 4TU OpenDAP server
(see https://doi.org/10.4121/bf0eaf7c-f2fa-46f6-b8cd-77ad939dd350.v4).
Forcing for a single basin can be retrieved by specifying its basin ID:
.. code-block:: python
from pathlib import Path
from rich import print
from ewatercycle.forcing import sources
path = Path.cwd()
forcing_path = path / "Forcing"
experiment_start_date = "1997-08-01T00:00:00Z"
experiment_end_date = "2005-09-01T00:00:00Z"
HRU_id = 1022500
camels_forcing = sources['CaravanForcing'].generate(
start_time = experiment_start_date,
end_time = experiment_end_date,
directory = forcing_path / "Camels",
basin_id = f"camels_0{HRU_id}"
)
which gives something like:
.. code-block:: python
CaravanForcing(
start_time='1997-08-01T00:00:00Z',
end_time='2005-09-01T00:00:00Z',
directory=PosixPath('/home/davidhaasnoot/eWaterCycle-WSL-WIP/Forcing/Camels'),
shape=PosixPath('/home/davidhaasnoot/eWaterCycle-WSL-WIP/Forcing/Camels/shapefiles/camels_01022500.shp'),
filenames={
'tasmax':
'camels_01022500_1997-08-01T00:00:00Z_2005-09-01T00:00:00Z_tasmax.nc',
'tasmin':
'camels_01022500_1997-08-01T00:00:00Z_2005-09-01T00:00:00Z_tasmin.nc',
'evspsblpot':
'camels_01022500_1997-08-01T00:00:00Z_2005-09-01T00:00:00Z_evspsblpot.nc',
'pr': 'camels_01022500_1997-08-01T00:00:00Z_2005-09-01T00:00:00Z_pr.nc',
'tas': 'camels_01022500_1997-08-01T00:00:00Z_2005-09-01T00:00:00Z_tas.nc',
'Q': 'camels_01022500_1997-08-01T00:00:00Z_2005-09-01T00:00:00Z_Q.nc'
}
)
A more in-depth example notebook can be found here:
https://gist.github.com/Daafip/ac1b030eb5563a76f4d02175f2716fd7
"""

@classmethod
def generate( # type: ignore[override]
cls: Type["CaravanForcing"],
start_time: str,
end_time: str,
directory: str,
variables: tuple[str, ...] = (),
shape: str | Path | None = None,
dataset: str | Dataset | dict = "unused",
**kwargs,
) -> "CaravanForcing":
"""Retrieve Caravan forcing for a model.
Args:
start_time: Start time of forcing in UTC and ISO format string e.g.
'YYYY-MM-DDTHH:MM:SSZ'.
end_time: End time of forcing in UTC and ISO format string e.g.
'YYYY-MM-DDTHH:MM:SSZ'.
directory: Directory in which forcing should be written.
If not given will create timestamped directory.
variables: Variables which are needed for model,
if not specified will default to all.
shape: (Optional) Path to a shape file.
If none is specified, will be downloaded automatically.
dataset: Unused
**kwargs:
basin_id: str containing the wanted basin ID. Datasets can be explored
using the `get_dataset` or `get_basin_id` functions in this module.
More explanation can be found in the example notebook mentioned above.
"""
if "basin_id" not in kwargs:
msg = "You have to specify a basin ID to be able to generate forcing from Caravan."
raise ValueError(msg)
basin_id = kwargs["basin_id"]

dataset = basin_id.split("_")[0]
ds = get_dataset(dataset)
ds_basin = ds.sel(basin_id=basin_id.encode())
ds_basin_time = crop_ds(ds_basin, start_time, end_time)

if shape is None:
shape = get_shapefiles(Path(directory), basin_id)

if variables == ():
variables = ds_basin_time.data_vars.keys()

# split the requested variables into basin properties and time series variables
properties = set(variables).intersection(PROPERTY_VARS)
non_property_vars = set(variables) - properties
variable_names = non_property_vars.intersection(
RENAME_ERA5.keys()
) # only take the vars also in Rename dict

for prop in properties:
ds_basin_time.coords.update({prop: ds_basin_time[prop].to_numpy()})

ds_basin_time = ds_basin_time.rename(RENAME_ERA5)
variables = tuple([RENAME_ERA5[var] for var in variable_names])

for temp in ["tas", "tasmin", "tasmax"]:
ds_basin_time[temp].attrs.update({"height": "2m"})

start_time_name = start_time[:10]
end_time_name = end_time[:10]

history_attrs = ds_basin_time.attrs
for var in variables:
ds_basin_time[var].attrs.update(history_attrs)
ds_basin_time[var].to_netcdf(
Path(directory)
/ f"{basin_id}_{start_time_name}_{end_time_name}_{var}.nc"
)

forcing = cls(
directory=Path(directory),
start_time=start_time,
end_time=end_time,
shape=Path(shape),
filenames={
var: f"{basin_id}_{start_time_name}_{end_time_name}_{var}.nc"
for var in variables
},
)
forcing.save()
return forcing
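# Illustrative sketch (not part of the module): how the variable selection in
# `generate` behaves for a mixed request, assuming these names exist in the
# opened dataset:
#
#     variables = ("total_precipitation_sum", "streamflow", "area", "name")
#     properties = {"area", "name"}  # basin properties, moved to coordinates
#     non_property_vars = {"total_precipitation_sum", "streamflow"}
#     variable_names = non_property_vars & RENAME_ERA5.keys()
#     # -> written out as "pr" and "Q" NetCDF files after renaming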


def get_dataset(dataset: str) -> xr.Dataset:
"""Opens the specified dataset from the data.4tu.nl OPeNDAP server.
Args:
dataset (str): name of dataset, choose from:
'camels',
'camelsaus',
'camelsbr',
'camelscl',
'camelsgb',
'hysets',
'lamah'
"""
return xr.open_dataset(f"{OPENDAP_URL}{dataset}.nc")
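# Example usage (a sketch; requires network access to the 4TU OPeNDAP server):
#
#     ds = get_dataset("camels")
#     print(ds.data_vars)  # time series variables plus basin properties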


def get_basin_id(dataset: str) -> list[str]:
"""Gets a list of all basin IDs in the provided dataset.
Args:
dataset (str): name of dataset, choose from:
'camels',
'camelsaus',
'camelsbr',
'camelscl',
'camelsgb',
'hysets',
'lamah'
Note:
a zip with shapefiles is available at
https://doi.org/10.4121/ca13056c-c347-4a27-b320-930c2a4dd207.v1 which also
allows exploration of the dataset.
"""
return [val.decode() for val in get_dataset(dataset).basin_id.values]
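# For example, to list the available CAMELS (US) basin IDs (network access
# assumed):
#
#     basin_ids = get_basin_id("camels")
#     print(basin_ids[:3])  # e.g. ['camels_01022500', ...]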



def get_shapefiles(directory: Path, basin_id: str) -> Path:
"""Retrieve shapefiles from data.4tu.nl."""
zip_path = directory / "shapefiles.zip"
output_path = directory / "shapefiles"
shape_path = directory / f"{basin_id}.shp"

if not shape_path.is_file():
combined_shapefile_path = output_path / "combined.shp"
if not combined_shapefile_path.is_file():
timeout = urllib3.Timeout(connect=10.0, read=300)
http = urllib3.PoolManager(timeout=timeout)
with http.request(
"GET", SHAPEFILE_URL, preload_content=False
) as r, zip_path.open("wb") as out_file:
shutil.copyfileobj(r, out_file)

with zipfile.ZipFile(zip_path) as myzip:
myzip.extractall(path=directory)

extract_basin_shapefile(basin_id, combined_shapefile_path, shape_path)

return shape_path
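# Example usage (a sketch; the first call downloads and unpacks the full
# shapefile archive into `directory`):
#
#     shape_path = get_shapefiles(Path("Forcing/Camels"), "camels_01022500")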


def extract_basin_shapefile(
basin_id: str,
combined_shapefile_path: Path,
shape_path: Path,
) -> None:
"""Extract single polygon from multipolygon shapefile."""
shape_obj = shapereader.Reader(combined_shapefile_path)
list_records = []
for record in shape_obj.records():
list_records.append(record.attributes["gauge_id"])

df = pd.DataFrame(
data=list_records, index=range(len(list_records)), columns=["basin_id"]
)
basin_index = df[df["basin_id"] == basin_id].index.array[0]

with fiona.open(combined_shapefile_path) as src:
dst_schema = src.schema # Copy the source schema
# Create a sink for processed features with the same format and
# coordinate reference system as the source.
with fiona.open(
shape_path,
mode="w",
layer=basin_id,
crs=src.crs,
driver="ESRI Shapefile",
schema=dst_schema,
) as dst:
for i, feat in enumerate(src):
# select only the feature matching the requested basin index
if i == basin_index:
geom = feat.geometry
assert geom.type == "Polygon"

# Copy the source feature's properties to the new feature.
props = fiona.Properties.from_dict(
**feat.properties,
)

dst.write(fiona.Feature(geometry=geom, properties=props))
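# Illustrative call (hypothetical paths; `combined.shp` is produced when
# get_shapefiles unpacks the archive):
#
#     extract_basin_shapefile(
#         "camels_01022500",
#         Path("Forcing/Camels/shapefiles/combined.shp"),
#         Path("Forcing/Camels/camels_01022500.shp"),
#     )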


def crop_ds(ds: xr.Dataset, start_time: str, end_time: str) -> xr.Dataset:
"""Crops dataset based on time."""
get_time(start_time), get_time(end_time)  # validate the time strings
# remove the trailing "Z" so the UTC timestamps parse as np.datetime64
start, end = np.datetime64(start_time[:-1]), np.datetime64(end_time[:-1])
return ds.isel(
time=(ds["time"].to_numpy() >= start) & (ds["time"].to_numpy() <= end)
)
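# crop_ds keeps only the timestamps within the requested window, e.g.:
#
#     cropped = crop_ds(ds_basin, "1997-08-01T00:00:00Z", "2005-09-01T00:00:00Z")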
2 changes: 2 additions & 0 deletions src/ewatercycle/forcing.py
Expand Up @@ -3,6 +3,7 @@
from typing import Any, Type

from ewatercycle import shared
from ewatercycle._forcings.caravan import CaravanForcing
from ewatercycle._forcings.makkink import (
DistributedMakkinkForcing,
LumpedMakkinkForcing,
@@ -65,6 +66,7 @@ def __getitem__(self, key) -> Type[DefaultForcing]:
"LumpedUserForcing": LumpedUserForcing,
"DistributedMakkinkForcing": DistributedMakkinkForcing,
"LumpedMakkinkForcing": LumpedMakkinkForcing,
"CaravanForcing": CaravanForcing,
}
_forcings.update(
{
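With this entry in the registry, the new forcing class can be looked up by name. A minimal sketch (assuming an installation that includes this change):

    from ewatercycle.forcing import sources

    forcing_class = sources["CaravanForcing"]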
9 changes: 9 additions & 0 deletions tests/src/base/forcing_files/README.md
@@ -0,0 +1,9 @@
# License
The shapefile and NetCDF file hosted here are used to test downloading data from the Caravan dataset hosted by [4TU.nl](https://doi.org/10.4121/ca13056c-c347-4a27-b320-930c2a4dd207.v1) on their [OpenDAP server](https://opendap.4tu.nl/thredds/catalog/data2/djht/ca13056c-c347-4a27-b320-930c2a4dd207/1/catalog.html).
The data includes only a year of forcing for one catchment.

For your own use, please download the data from the original source and cite it correctly. The Caravan dataset is itself a combination of data from separate sources.

The Caravan dataset was originally obtained from https://zenodo.org/records/7944025 and is described in the paper by Kratzert, F. et al.: 'Caravan - A global community dataset for large-sample hydrology', https://doi.org/10.1038/s41597-023-01975-w

Distributed under Creative Commons Attribution 4.0 International.
1 change: 1 addition & 0 deletions tests/src/base/forcing_files/camels_03439000.cpg
@@ -0,0 +1 @@
ISO-8859-1
Binary file added tests/src/base/forcing_files/camels_03439000.dbf
1 change: 1 addition & 0 deletions tests/src/base/forcing_files/camels_03439000.prj
@@ -0,0 +1 @@
GEOGCS["GCS_WGS_1984",DATUM["D_WGS_1984",SPHEROID["WGS_1984",6378137.0,298.257223563]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]]
Binary file added tests/src/base/forcing_files/camels_03439000.shp
Binary file added tests/src/base/forcing_files/camels_03439000.shx
The remaining changed test-data files (additional .cpg, .prj, and binary shapefile components) are collapsed in this view.
