diff --git a/README.md b/README.md index a7e9a68..f421e85 100644 --- a/README.md +++ b/README.md @@ -1,22 +1,24 @@ # asdi-examples + Jupyter Notebook examples related to the datasets in the [Amazon Sustainability Data Initiative (ASDI)](https://sustainability.aboutamazon.com/environment/the-cloud/amazon-sustainability-data-initiative) program. In particular, these notebooks demonstrate how to query and access data from a [SpatioTemporal Asset Catalog (STAC)](https://stacspec.org/) of the open datasets that have been cataloged. -The notebooks are designed to work with [AWS SageMaker Studio Lab](https://studiolab.sagemaker.aws/) which you can use for free[^1], but should also work with other Jupyter Notebook environments. For some datasets an AWS account for requester pays charges may be required. See *Advanced Usage* below for more details. +The notebooks are designed to work with [AWS SageMaker Studio Lab](https://studiolab.sagemaker.aws/), which you can use for free[^1], but they should also work with other Jupyter Notebook environments. For some datasets, an AWS account may be required to cover requester-pays charges. See *Advanced Usage* below for more details. [^1]: Free allocation is time-limited. *Advanced Usage* As a reminder, these notebooks are for demonstration purposes and use an experimental API. For large-scale data access and analysis, we recommend using SageMaker Studio or deploying JupyterHub to an EC2 instance. In both cases, you can optimize data access and minimize costs by running your instances in the same region as the data you intend to access.
- ## Datasets * Copernicus Digital Elevation Model (DEM) as Cloud Optimized GeoTIFF, [Glo-30 (30m) AWS Open Data](https://registry.opendata.aws/copernicus-dem/) Open In SageMaker Studio Lab * Sentinel-1 GRD as Cloud Optimized GeoTIFF, [Sentinel-1 GRD AWS Open Data](https://registry.opendata.aws/sentinel-1/) Open In SageMaker Studio Lab +* NOAA High Resolution Rapid Refresh (HRRR) 48-hour forecast visualization, [NOAA HRRR AWS Open Data](https://registry.opendata.aws/noaa-hrrr-pds/) Open In SageMaker Studio Lab + * NOAA Sea Surface Temperature - Optimum Interpolation via Zarr-accessible Kerchunk index of NetCDF, [NOAA OISST AWS Open Data](https://registry.opendata.aws/noaa-cdr-oceanic/) Open In SageMaker Studio Lab * Amazonia-1 WFI as Cloud Optimized GeoTIFF, [Amazonia EO satellite on AWS](https://registry.opendata.aws/amazonia/) Open In SageMaker Studio Lab diff --git a/noaa-hrrr/noaa-hrrr.ipynb b/noaa-hrrr/noaa-hrrr.ipynb new file mode 100644 index 0000000..113c985 --- /dev/null +++ b/noaa-hrrr/noaa-hrrr.ipynb @@ -0,0 +1,583 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "6d416be0-b98e-4525-bce2-86aeab6c2180", + "metadata": {}, + "source": [ + "# NOAA High Resolution Rapid Refresh (HRRR)" + ] + }, + { + "cell_type": "markdown", + "id": "ed56d0d2-1359-4dfb-ad95-34dd8800c2e3", + "metadata": {}, + "source": [ + "## Using the Notebook\n", + "First we install any necessary packages. Note that the following code block only needs to be run once per new workspace (for example, once per AWS SageMaker Studio Lab profile), and that you may need to restart the kernel to pick up the updated packages."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d939b1f5-a18b-4eba-b39b-866da39e707f", + "metadata": {}, + "outputs": [], + "source": [ + "%pip -q install dask folium geopandas imageio ipywidgets matplotlib mapclassify pystac-client rioxarray tqdm" + ] + }, + { + "cell_type": "markdown", + "id": "917d80ca-baa2-4f9a-a846-30bd8ba9d89b", + "metadata": {}, + "source": [ + "## HRRR Background (excerpt from the [NOAA HRRR stactools package docs](https://github.com/stactools-packages/noaa-hrrr))\n", + "\n", + "The [NOAA HRRR dataset](https://www.nco.ncep.noaa.gov/pmb/products/hrrr/#CO)\n", + "is a continuously updated atmospheric forecast data product.\n", + "\n", + "### Data structure\n", + "\n", + "- There are two regions: CONUS and Alaska\n", + "- Every hour, new hourly forecasts are generated for many atmospheric attributes\n", + " for each region\n", + " - All hours (`00-23`) get an 18-hour forecast in the `conus` region\n", + " - Forecasts are generated every three hours (`00`, `03`, `06`, etc.) in the\n", + " `alaska` region\n", + " - On hours `00`, `06`, `12`, `18` a 48-hour forecast is generated\n", + " - One of the products (`subh`) gets 15-minute forecasts (four per hour per\n", + " attribute), but the sub-hourly forecasts are stored as layers within a\n", + " single GRIB2 file for the forecast hour rather than in separate files.\n", + "- The forecasts are broken up into four products (`sfc`, `prs`, `nat`, `subh`)\n", + "- Each GRIB2 file has hundreds to thousands of variables\n", + "- Each .grib2 file is accompanied by a .grib2.idx file, which has variable-level\n", + " metadata including the starting byte for the data in that variable (useful\n", + " for making range requests instead of reading the entire file) and some\n", + " other descriptive metadata\n", + "\n", + "### Summary of Considerations for Organizing STAC Metadata\n", + "\n", + "After extensive discussions, we decided to organize the STAC metadata with\n", + "the following structure:\n", + "\n", + "1. **Collections**: Separate collections for each region-product combination\n", + " - regions: `conus` and `alaska`\n", + " - products: `sfc`, `prs`, `nat`, and `subh`\n", + "\n", + "2. **Items**: Each GRIB file in the archive is represented as an item with two assets:\n", + " - `\"grib\"`: Contains the actual data.\n", + " - `\"index\"`: The .grib2.idx sidecar file.\n", + "\n", + " Each GRIB file contains the forecasts for all of a product's variables for a\n", + " particular forecast hour from a reference time, so you need to combine data\n", + " from multiple items to construct a time series for a forecast. Each item uses the\n", + " [`forecast` extension](https://github.com/stac-extensions/forecast) to\n", + " organize the forecast attributes, notably the `forecast:horizon` and\n", + " `forecast:reference_time` properties.\n", + "\n", + "3. **`grib:layers`**: Within each `\"grib\"` asset, a `grib:layers` property details\n", + " each layer's information, including description, units, and byte ranges.\n", + " This enables applications to access specific parts of the GRIB2 files without\n", + " downloading the entire file.\n", + "\n", + " - We intend to propose a `GRIB` STAC extension with the `grib:layers` property\n", + " for storing byte ranges after testing this specification out on other GRIB2\n", + " datasets.\n", + " - The layer-level metadata is worth storing in STAC because you can construct\n", + " URIs for specific layers that GDAL can read using either `/vsisubfile` or\n", + " `vrt://`:\n", + " - `/vsisubfile/{start_byte}_{byte_size},/vsicurl/{grib_href}`\n", + " - `vrt:///vsicurl/{grib_href}?bands={grib_message}`, where `grib_message` is\n", + " the index of the layer within the GRIB2 file.\n", + " - Under the hood, GDAL's `vrt` driver is reading the sidecar .grib2.idx file\n", + " and translating it into a `/vsisubfile` URI.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": 
"45712eab-1215-4d4b-8a93-5aaed738fab8", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "from datetime import datetime, timedelta\n", + "\n", + "import geopandas as gpd\n", + "import pandas as pd\n", + "import pystac_client\n", + "import rioxarray\n", + "import xarray as xr\n", + "from pystac import Asset, Collection, ItemCollection\n", + "from tqdm.notebook import tqdm, trange\n", + "\n", + "# disable some pesky GDAL warnings/messages\n", + "os.environ[\"CPL_LOG\"] = \"OFF\"" + ] + }, + { + "cell_type": "markdown", + "id": "a09c0124-5177-480e-a9d9-7d62b0ed9de9", + "metadata": {}, + "source": [ + "### Querying the AWS STAC API with Pystac Client\n", + "Using the [Pystac Client](https://pystac-client.readthedocs.io/en/stable/) library, we can query the AWS STAC API to access NOAA HRRR STAC items.\n", + "\n", + "The data are organized into eight separate collections: one per **product** (`sfc`, `prs`, `nat`, `subh`) and **region** (`conus` or `alaska`) combination.\n", + "\n", + "For this demonstration, we will look at variables in the `\"noaa-hrrr-sfc-conus\"` collection."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "22d1eab2-d595-46ed-9ff5-4d9f189acecc", + "metadata": {}, + "outputs": [], + "source": [ + "API_ROOT_URL = \"https://dev.asdi-catalog.org/\"\n", + "COLLECTION_ID = \"noaa-hrrr-sfc-conus\"\n", + "\n", + "client = pystac_client.Client.open(API_ROOT_URL)\n", + "\n", + "collection = client.get_collection(COLLECTION_ID)\n", + "collection" + ] + }, + { + "cell_type": "markdown", + "id": "fcbad6b7-d0af-404f-b0b9-b1c86fb930b0", + "metadata": {}, + "source": [ + "### Forecast queries\n", + "A new forecast is generated every hour, and the STAC item metadata is uploaded to the STAC as the data land in the AWS bucket (`s3://noaa-hrrr-bdp-pds/`).\n", + "There is usually a ~one-hour lag between a forecast model run and the data getting uploaded to S3.\n", + "\n", + "An 18-hour forecast is generated every hour, but every 6 hours (00, 06, 12, 18) a 48-hour forecast is generated.\n", + "To get the most recent available 48-hour forecast, we can apply a filter to our STAC search that limits the results to items where the forecast horizon is 48 hours (`\"forecast:horizon\" == \"PT48H\"`).\n", + "\n", + "Once we have found the item for the most recent 48-hour forecast, we can load the full 48-hour forecast by running a search that limits the results to items where the forecast reference time matches the reference time from the item retrieved in the first query."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bef83324-8df4-494d-b041-be8e6ad9128a", + "metadata": {}, + "outputs": [], + "source": [ + "# perform a search to get the full 48-hour forecast (plus the zero-hour forecast) for a point in time\n", + "from datetime import timezone\n", + "\n", + "# HRRR reference times are in UTC\n", + "now = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)\n", + "\n", + "search = client.search(\n", + " collections=COLLECTION_ID,\n", + " sortby=\"datetime\",\n", + " filter={\n", + " \"op\": \"and\",\n", + " \"args\": [\n", + " {\n", + " \"op\": \"eq\",\n", + " \"args\": [{\"property\": \"forecast:horizon\"}, \"PT48H\"]\n", + " },\n", + " {\n", + " \"op\": \"gte\",\n", + " \"args\": [\n", + " {\"property\": \"forecast:reference_time\"},\n", + " (now - timedelta(hours=6)).strftime(\"%Y-%m-%dT%H:%M:%SZ\")\n", + " ],\n", + " },\n", + " ],\n", + " }\n", + ")\n", + "\n", + "# get the forecast that was generated most recently\n", + "most_recent_48_hour_forecast_item = sorted(\n", + " search.item_collection(),\n", + " key=lambda item: datetime.strptime(\n", + " item.properties[\"forecast:reference_time\"], \"%Y-%m-%dT%H:%M:%SZ\"\n", + " ),\n", + " reverse=True\n", + ")[0]\n", + "\n", + "# now search for all items with that reference time\n", + "search = client.search(\n", + " collections=COLLECTION_ID,\n", + " filter={\n", + " \"op\": \"eq\",\n", + " \"args\": [\n", + " {\"property\": \"forecast:reference_time\"},\n", + " most_recent_48_hour_forecast_item.properties[\"forecast:reference_time\"]\n", + " ]\n", + " },\n", + ")\n", + "\n", + "item_collection = search.item_collection()\n", + "print(f\"retrieved {len(item_collection)} items\")" + ] + }, + { + "cell_type": "markdown", + "id": "7e75fa66-57ca-4cbe-9d28-c276b00bd383", + "metadata": {}, + "source": [ + "The HRRR predictions are distributed in the GRIB2 format. GRIB2 files have many different layers encoded as \"messages\" that are strung together.
Fortunately, each GRIB2 file is accompanied by a .grib2.idx index file that describes the byte ranges for each of the messages/layers. GDAL can perform targeted reads of the GRIB2 files if we use the `vrt://` syntax with a band index." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6ae305b7-9fc8-4928-af28-2429a5ff5724", + "metadata": {}, + "outputs": [], + "source": [ + "# identify the grib:layers key for the wind speed variable\n", + "{key: info for key, info in collection.ext.item_assets[\"grib\"].properties[\"grib:layers\"].items() if \"WIND\" in key}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2dc8828b-56ab-4382-8ad1-7f1d2d133bfa", + "metadata": {}, + "outputs": [], + "source": [ + "# function for generating a URI that GDAL can open using the vrt:// syntax\n", + "def format_vrt_path(grib_href: str, grib_message: int) -> str:\n", + " return f\"vrt:///vsicurl/{grib_href}?bands={grib_message}\"\n", + "\n", + "# function to load a particular layer from all items into xarray and stack along the time dimension\n", + "def xarray_from_item_collection(item_collection: ItemCollection, grib_layer_key: str):\n", + " item_layers = {}\n", + " for item in item_collection:\n", + " layer_info = item.assets[\"grib\"].extra_fields[\"grib:layers\"].get(grib_layer_key)\n", + " if layer_info:\n", + " vrt_path = format_vrt_path(item.assets[\"grib\"].href, layer_info[\"grib_message\"])\n", + " arr = (\n", + " rioxarray.open_rasterio(\n", + " vrt_path,\n", + " chunks={\n", + " \"band\": 1,\n", + " \"x\": -1, # keep all x/y coordinates in the same chunk because GRIB files\n", + " \"y\": -1, # do not have any spatial partitioning\n", + " },\n", + " use_idx=\"NO\",\n", + " )\n", + " .squeeze()\n", + " .assign_coords(\n", + " time=item.properties[\"datetime\"],\n", + " reference_datetime=item.properties[\"forecast:reference_time\"],\n", + " ).expand_dims(\"time\")\n", + " )\n", + " item_layers[item.id] = arr\n", + "\n", + " return xr.concat(item_layers.values(), dim=\"time\").sortby(\"time\").chunk(time=1)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "04ea7159-73a7-4e52-b41c-70f509d65617", + "metadata": {}, + "outputs": [], + "source": [ + "wind_ts = xarray_from_item_collection(item_collection, \"WIND__10_m_above_ground__periodic_max\")\n", + "wind_ts" + ] + }, + { + "cell_type": "markdown", + "id": "e9e15e78-875e-46a7-a17d-c55a943b90ad", + "metadata": {}, + "source": [ + "## Visualizing 48-hour forecast data\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "812b7b29-7c1a-45c9-a61f-ce4de6442607", + "metadata": {}, + "outputs": [], + "source": [ + "# code for rendering GIFs from xarray data\n", + "import matplotlib.pyplot as plt\n", + "import imageio.v2 as imageio\n", + "import io\n", + "from IPython.display import Image\n", + "\n", + "\n", + "# Function to plot a single frame and return as a BytesIO object\n", + "def plot_frame(data, time_idx, title, vmin, vmax, cmap, units_label):\n", + " fig, ax = plt.subplots(figsize=(7,5))\n", + " im = ax.imshow(data.isel(time=time_idx), cmap=cmap, vmin=vmin, vmax=vmax)\n", + " ax.set_axis_off()\n", + "\n", + " # Adjust layout to reduce whitespace\n", + " plt.subplots_adjust(left=0.1, right=0.9, top=0.85, bottom=0.25)\n", + " \n", + " # Main title at the top\n", + " fig.suptitle(title, fontsize=12, y=0.92)\n", + "\n", + " # reference time coordinate at the bottom left\n", + " fig.text(0.1, 0.05, f\"reference time: {data.reference_datetime.values}\", ha=\"left\", fontsize=10)\n", + " \n", + " # forecast time coordinate at the bottom right\n", + " fig.text(0.95, 0.05, f\"forecast time: {data.time.values[time_idx]}\", ha=\"right\", fontsize=10)\n", + " \n", + " # Horizontal colorbar below the image\n", + " cbar_ax = fig.add_axes([0.15, 0.15, 0.7, 0.03])\n", + " cbar = plt.colorbar(im, cax=cbar_ax, orientation=\"horizontal\")\n", + " cbar.set_label(units_label, fontsize=10, labelpad=3)\n", + " cbar.ax.xaxis.set_label_position(\"top\")\n", + " \n", + " # Progress bar at the very bottom\n", + " total_frames = data.sizes[\"time\"]\n", + " progress = (time_idx + 1) / total_frames\n", + " progress_bar = plt.Rectangle((0.1, 0.02), progress * 0.8, 0.02, transform=fig.transFigure, color=\"blue\")\n", + " fig.patches.append(progress_bar)\n", + " \n", + " buf = io.BytesIO()\n", + " plt.savefig(buf, format=\"png\", bbox_inches=\"tight\", pad_inches=0.1)\n", + " plt.close(fig)\n", + " buf.seek(0)\n", + " return buf\n", + "\n", + "def render_gif(data: xr.DataArray, title: str, cmap: str, units_label: str, vmin: float = None, vmax: float = None):\n", + " if vmin is None:\n", + " vmin = data.quantile(0.01).item()\n", + " if vmax is None:\n", + " vmax = data.quantile(0.99).item()\n", + "\n", + " # Create frames in memory\n", + " frames = []\n", + " for t in trange(data.sizes[\"time\"]):\n", + " frame = plot_frame(data, t, title, vmin, vmax, cmap, units_label)\n", + " frames.append(imageio.imread(frame))\n", + " \n", + " # Create a GIF in memory\n", + " gif_bytes = io.BytesIO()\n", + " imageio.mimsave(gif_bytes, frames, format=\"GIF\", duration=0.5, loop=0)\n", + " gif_bytes.seek(0)\n", + " \n", + " return gif_bytes.read()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b2be2cfe-be8c-4297-ab9c-b222ada96be4", + "metadata": {}, + "outputs": [], + "source": [ + "Image(\n", + " render_gif(\n", + " data=wind_ts,\n", + " title=\"NOAA HRRR: wind speed (10 m above ground) forecast\",\n", + " cmap=\"viridis\",\n", + " units_label=\"wind speed m/s\",\n", + " vmin=0,\n", + " vmax=15\n", + " )\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": 
"1dd3ca20-9a55-41d1-8417-a023379a61cb", + "metadata": {}, + "source": [ + "## More visualization examples\n", + "To identify a particular variable of interest you can browse the item-assets metadata to find the specific `grib:layers` key that you need." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "66bd335c-0ac2-4908-8441-1e1e2c6e757c", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'DSWRF__surface__point_in_time': {'unit': 'W/m^2',\n", + " 'level': 'surface',\n", + " 'variable': 'DSWRF',\n", + " 'description': 'Downward Short-Wave Radiation Flux',\n", + " 'forecast_layer_type': 'point_in_time'}}" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "{key: info for key, info in collection.ext.item_assets[\"grib\"].properties[\"grib:layers\"].items() if \"DSWRF\" in key}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e4f25c68-adbe-4205-9353-5a27b4b7a368", + "metadata": {}, + "outputs": [], + "source": [ + "solar_ts = xarray_from_item_collection(item_collection, \"DSWRF__surface__point_in_time\")\n", + "Image(\n", + " render_gif(\n", + " data=solar_ts,\n", + " title=\"NOAA HRRR: downward short-wave radiation flux\",\n", + " cmap=\"rainbow\",\n", + " units_label=\"W/m^2\",\n", + " vmin=0,\n", + " vmax=1000\n", + " )\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "8d37121e-436c-424f-86a2-4f11a526b0e0", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'PRATE__surface__point_in_time': {'unit': 'kg/m^2/s',\n", + " 'level': 'surface',\n", + " 'variable': 'PRATE',\n", + " 'description': 'Precipitation Rate',\n", + " 'forecast_layer_type': 'point_in_time'}}" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "{key: info for key, info in collection.ext.item_assets[\"grib\"].properties[\"grib:layers\"].items() if \"PRATE\" in key}" + ] 
+ }, + { + "cell_type": "code", + "execution_count": null, + "id": "29a1f138-84fc-4f4a-96ee-aa7b4638014d", + "metadata": {}, + "outputs": [], + "source": [ + "precip_rate_ts = xarray_from_item_collection(item_collection, \"PRATE__surface__point_in_time\")\n", + "Image(\n", + " render_gif(\n", + " data=precip_rate_ts,\n", + " title=\"NOAA HRRR: precipitation rate\",\n", + " cmap=\"PuBu\",\n", + " units_label=\"kg / m^2 / s\",\n", + " vmin=0,\n", + " vmax=0.001\n", + " )\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "55ca95c9-233e-4056-8b7a-a7aaa2271de4", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{'APCP__surface__periodic_acc': {'unit': 'kg/m^2',\n", + " 'level': 'surface',\n", + " 'variable': 'APCP',\n", + " 'description': 'Total Precipitation',\n", + " 'forecast_layer_type': 'periodic_summary'},\n", + " 'APCP__surface__cumulative_acc': {'unit': 'kg/m^2',\n", + " 'level': 'surface',\n", + " 'variable': 'APCP',\n", + " 'description': 'Total Precipitation',\n", + " 'forecast_layer_type': 'cumulative_summary'}}" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "{key: info for key, info in collection.ext.item_assets[\"grib\"].properties[\"grib:layers\"].items() if \"APCP\" in key}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1dbb86c5-4f7e-4bf5-8ffd-8039435b80c2", + "metadata": {}, + "outputs": [], + "source": [ + "precip_acc_ts = xarray_from_item_collection(item_collection, \"APCP__surface__cumulative_acc\")\n", + "Image(\n", + " render_gif(\n", + " data=precip_acc_ts,\n", + " title=\"NOAA HRRR: precipitation accumulation\",\n", + " cmap=\"Blues\",\n", + " units_label=\"kg / m^2\",\n", + " vmin=0,\n", + " vmax=60\n", + " )\n", + ")" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + 
"name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}
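Reviewer note on the `.grib2.idx` sidecar files discussed in the notebook: the `grib:layers` byte ranges come from exactly this kind of index, where each line records a message number and its starting byte, and a layer's size is the gap to the next line's starting byte. A minimal sketch of that parsing logic, assuming the standard colon-delimited idx line format (the helper name and the sample lines below are illustrative, not part of this PR):

```python
# Sketch: derive per-layer byte ranges from a .grib2.idx sidecar file.
# Assumed line format: "message:start_byte:d=YYYYMMDDHH:VARIABLE:level:forecast:"
def parse_grib_idx(idx_text: str) -> list[dict]:
    rows = [line.split(":") for line in idx_text.strip().splitlines()]
    layers = []
    for i, row in enumerate(rows):
        start = int(row[1])
        # the last message runs to the end of the file, so its size is unknown here
        size = int(rows[i + 1][1]) - start if i + 1 < len(rows) else None
        layers.append(
            {
                "grib_message": int(row[0]),
                "variable": row[3],
                "level": row[4],
                "start_byte": start,
                "byte_size": size,
            }
        )
    return layers


# hypothetical sample index contents for illustration only
sample = """1:0:d=2024010100:REFC:entire atmosphere:anl:
2:381178:d=2024010100:WIND:10 m above ground:anl:
3:762356:d=2024010100:APCP:surface:0-0 day acc fcst:"""

for layer in parse_grib_idx(sample):
    print(layer["variable"], layer["start_byte"], layer["byte_size"])
```

The byte ranges produced this way are what an application would plug into a range request (or a `/vsisubfile` URI) to read a single layer without downloading the whole GRIB2 file.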
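The two GDAL URI templates described in the notebook's `grib:layers` discussion can be sketched as plain string helpers (function names are illustrative; the templates themselves come from the notebook text):

```python
def vsisubfile_uri(grib_href: str, start_byte: int, byte_size: int) -> str:
    # /vsisubfile exposes a byte range of a remote file fetched via /vsicurl
    return f"/vsisubfile/{start_byte}_{byte_size},/vsicurl/{grib_href}"


def vrt_uri(grib_href: str, grib_message: int) -> str:
    # vrt:// with bands= selects one GRIB message by index; under the hood GDAL
    # reads the .grib2.idx sidecar and builds the /vsisubfile URI itself
    return f"vrt:///vsicurl/{grib_href}?bands={grib_message}"


# hypothetical href for illustration only
href = "https://example.com/hrrr.t00z.wrfsfcf00.grib2"
print(vsisubfile_uri(href, 381178, 381178))
print(vrt_uri(href, 2))
```

The `vrt://` form is what the notebook's `format_vrt_path` helper produces; the `/vsisubfile` form is useful when the byte range is already known from STAC metadata and no `.idx` lookup is wanted.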