opendata-coawst

Scripts and code related to the USGS COAWST US East and Gulf Coast forecast model archive dataset in the AWS Open Data Program. The model archive can be explored using the COAWST_explore.ipynb notebook.

Launch in SageMaker Studio Lab

If you have an AWS SageMaker Studio Lab account, you can open this repository in Studio Lab using the button below; when prompted, choose to download the whole repo and to build the conda environment. If you don't have an account, you can sign up for free (no AWS account required).

Open In SageMaker Studio Lab

Run with Coiled

You can run this notebook on a JupyterLab instance in AWS us-west-2 using Coiled. For example:

You can set up software environments in Coiled:

coiled env create --name pangeo-notebook --workspace esip-lab --conda coiled_pangeo_notebook_env.yml
coiled env create --name esip-pangeo-arm --workspace esip-lab --conda pangeo_env.yml --architecture aarch64

Then, after creating an API token, users can run:

conda create -n coiled -c conda-forge coiled
conda activate coiled
export DASK_COILED__TOKEN=18b82ee62e6c4c8dbfb81d8646xxxxxxxxxxxxxx

coiled notebook start --region us-west-2 --vm-type m5.xlarge --software pangeo-notebook --name unconf --workspace esip-lab
<open a terminal in JupyterLab>
git clone https://github.com/fs-jbzambon/opendata-coawst.git
<run the COAWST_explore notebook!>

Data Processing Steps

Rechunking the NetCDF files

The official USGS Data Publication for these files lists the Globus endpoint from which the original NetCDF files can be obtained. These NetCDF files contain 12 or 13 hourly time steps each, and were rechunked to perform better in the cloud and to better support a variety of use cases.

The scripts here were run on an HPC system with the files residing under /proj/usgs-share/Projects/COAWST, following the same directory structure as the Globus endpoint (for example, /proj/usgs-share/Projects/COAWST/2009/coawst_us_20090821_01.nc).

The first script, coawst2zarr.py, runs in parallel with Dask using all CPUs available on the node (a rough sketch of the workflow follows the list below):

  • Creates an empty Zarr dataset exactly one week long (168 hourly time steps)
  • Finds NetCDF files with data within this time period
  • Writes each time step of data found into the proper location in the Zarr dataset
  • After all the data has been written, uses rechunker to create a new Zarr dataset with chunk sizes nt = 168, ny = 168, nx = 224.
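
To illustrate, here is a rough Python sketch of such a workflow using xarray, Zarr, and rechunker. It is not the actual coawst2zarr.py; the file paths, the example variable (zeta), the dimension names, the template chunking, and the week chosen are all assumptions.

# Illustrative sketch only; not the actual coawst2zarr.py script.
# Paths, the "zeta" variable, dimension names, and the example week are assumptions.
import glob

import pandas as pd
import xarray as xr
from rechunker import rechunk

week_start = pd.Timestamp("2009-08-17")                    # assumed week start
times = pd.date_range(week_start, periods=168, freq="h")   # 168 hourly time steps

# 1. Create an empty week-long Zarr template from one source file (metadata only).
src = xr.open_dataset("/proj/usgs-share/Projects/COAWST/2009/coawst_us_20090821_01.nc")
template = (
    src[["zeta"]]                       # example variable; the real script handles many
    .isel(ocean_time=0, drop=True)
    .expand_dims(ocean_time=times)
    .chunk({"ocean_time": 1})
)
template.to_zarr("week.zarr", mode="w", compute=False)

# 2. Write each hourly time step found in the NetCDF files into its slot in the store.
for path in sorted(glob.glob("/proj/usgs-share/Projects/COAWST/2009/*.nc")):
    ds = xr.open_dataset(path)[["zeta"]]
    for t in ds.ocean_time.values:
        ts = pd.Timestamp(t)
        if times[0] <= ts <= times[-1]:
            i = int((ts - times[0]) / pd.Timedelta("1h"))
            step = ds.sel(ocean_time=[t])
            step = step.drop_vars(list(step.coords))   # the template already holds the coordinates
            step.to_zarr("week.zarr", region={"ocean_time": slice(i, i + 1)})

# 3. Rechunk to nt = 168, ny = 168, nx = 224 with rechunker.
source = xr.open_zarr("week.zarr")
target_chunks = {name: None for name in source.variables}   # leave other arrays unchanged
target_chunks["zeta"] = {"ocean_time": 168, "eta_rho": 168, "xi_rho": 224}
plan = rechunk(
    source,
    target_chunks=target_chunks,
    max_mem="4GB",
    target_store="week_rechunked.zarr",
    temp_store="rechunk_tmp.zarr",
)
plan.execute()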

The second script, zarr2nc.py, runs serially using only one CPU (a sketch follows the list below):

  • Converts a zarr dataset into a NetCDF4 file, specifying compression settings.
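
A minimal sketch of this step with xarray; the file names and compression settings here are assumptions, not necessarily the values used for the archive.

# Illustrative sketch only; not the actual zarr2nc.py script.
import xarray as xr

ds = xr.open_zarr("week_rechunked.zarr")

# Apply zlib compression to every data variable when writing NetCDF4.
encoding = {var: {"zlib": True, "complevel": 4} for var in ds.data_vars}
ds.to_netcdf("coawst_us_week.nc", engine="netcdf4", encoding=encoding)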

We process all the weeks of the archive using a SLURM job array. This allows weeks to be processed in parallel, subject to the availability of nodes (a sketch of how an array task picks its week follows the list below):

  • run_zarr2coawst.sh is submitted, which creates all the rechunked week-long Zarr datasets
  • run_zarr2nc.sh is submitted, which converts the week-long rechunked Zarr datasets into week-long NetCDF files
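
The actual submission scripts are in this repository; as a rough illustration of the job-array pattern, each array task can derive the week it is responsible for from the SLURM_ARRAY_TASK_ID environment variable. The archive start date below is an assumption.

# Hypothetical sketch: map a SLURM array task index to the week it should process.
import os

import pandas as pd

ARCHIVE_START = pd.Timestamp("2009-08-17")           # assumed first day of the archive
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])     # set by SLURM for each array element

week_start = ARCHIVE_START + pd.Timedelta(days=7 * task_id)
week_times = pd.date_range(week_start, periods=168, freq="h")
print(f"Task {task_id}: processing week starting {week_start:%Y-%m-%d}")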

Creating references for the rechunked NetCDF files

The Jupyter notebook coawst_open_data_create_refs.ipynb reads the remote NetCDF files on the AWS Open Data bucket and creates a reference file using Kerchunk. This reference file can then be used to open the entire collection of NetCDF files as a single Xarray Dataset via the Zarr library.
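
A rough sketch of the Kerchunk workflow is shown below; the bucket path, output file name, and concatenation dimension are assumptions, so see the notebook for the actual settings.

# Sketch of building and using a Kerchunk reference file; paths and dimension names are assumptions.
import json

import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr
from kerchunk.hdf import SingleHdf5ToZarr

fs = fsspec.filesystem("s3", anon=True)
urls = ["s3://" + u for u in fs.glob("s3://usgs-coawst/useast-archive/*.nc")]  # hypothetical bucket/prefix

# Scan each NetCDF4 file and translate its internal layout into Zarr-style references.
refs = []
for url in urls:
    with fs.open(url) as f:
        refs.append(SingleHdf5ToZarr(f, url).translate())

# Combine the per-file references along the time dimension into a single reference set.
combined = MultiZarrToZarr(refs, concat_dims=["ocean_time"]).translate()
with open("coawst_refs.json", "w") as f:
    json.dump(combined, f)

# Open the whole collection lazily as a single Xarray Dataset through the Zarr engine.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "coawst_refs.json",
            "remote_protocol": "s3",
            "remote_options": {"anon": True},
        },
    },
)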
