🌎 A curated list of Earth system datasets for the machine learning and weather, climate, ice, ocean, etc. community.

blutjens/awesome-earth-system-ml

Awesome-earth-system-ml

Awesome-earth-system-ml is a curated list of datasets from dynamic Earth system models for the climate-interested machine learning community. The list targets data from climate, weather, atmosphere, ocean, flood, cryosphere, or other models and sciences.

Getting started with data science in Earth system modeling is challenging. The lack of accessible datasets and the plethora of evaluation options are two reasons why. This list of datasets and benchmarks is intended to get you started with building machine learning models for analysing dynamical Earth systems.

This list represents an inclusive community. We would very much appreciate it if you added your favorite datasets via a pull request or by emailing lutjens at mit [dot] edu.

Photo of climate activists holding a THERE IS NO PLANET B sign by Jessica Girvan on Shutterstock

Content

Air Quality

  • [TOAR: Tropospheric Ozone Assessment Report] todo

Atmosphere

  • SEVIR : A Storm Event Imagery Dataset for Deep Learning Applications in Radar and Satellite Meteorology (Veillette et al., 21)
    ML-ready dataset for forecasting (nowcasting) storm events. US dataset of 10,000 weather events, each consisting of 384 km x 384 km image sequences spanning 4 hours. Contains 5 bands: 3x GOES-16 Advanced Baseline Imager, NEXRAD vertically integrated liquid, and GOES-16 Geostationary Lightning Mapper. Used in 1.

  • CUMULO : A Dataset for Learning Cloud Classes (Zantedeschi et al., 19)
    ML-ready dataset for classification of clouds. Global dataset at 1km spatial and daily resolution for 2008, 2009 and 2016. Includes 300K annotated images with multispectral image (MODIS), radar (CloudSat), and lidar (CLDCLASS and CALIOP). Used in 1.

  • RainNet: A Large-Scale Imagery Dataset and Benchmark for Spatial Precipitation Downscaling (X. Chen and K. Feng et al., 22)
    ML-ready dataset for superresolution of precipitation. US East dataset at 12 and 4km spatial and hourly resolution for 17 years creating >60K snapshot images at 208x333 and 624x999 resolution totaling 360GB. Includes StageIV and NLDAS data assimilation projects from gauges, radar, and satellite. Contains precipitation (mm/hr) as in- and output variable. Evaluates domain-informed reconstruction (MPPE, HRRE, CPMSE, AMMD, RMSE) and dynamic metrics (HRTS and CMD). Used in 1.

  • Fast and accurate learned multiresolution dynamical downscaling for precipitation (J. Wang et al., 20)
    ML-ready dataset for superresolution of precipitation. US dataset at 50 and 12km spatial and 3-hourly resolution for 2005 creating ~3K snapshot images at 128x64 and 512x256. Includes WRF RCM simulation from NCEP-R2 climate model (RainNet in comparison contains observational data). Contains output variables (high-res. precipitation) and input variables (low-res. precipitation, vertically integrated water vapor, sea level pressure, 2m air temperature, and high-res. topography). Evaluates MSE, Jensen-Shannon distance of probability density functions, and extreme precipitation occurrences on global and local scale. Used in 1.

  • [SP-CAM]

  • [NCAR CAM]
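
The Jensen-Shannon distance listed above as an evaluation metric for the multiresolution downscaling dataset compares two probability density functions, for example histograms of predicted and reference precipitation. The following is a minimal sketch of one common definition (base-2 logarithms, so the result lies in [0, 1]). The function name and the assumption that both histograms share the same bins are illustrative; the paper's exact binning and normalization may differ.

```python
import math


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance between two histograms defined on the same bins."""
    # Normalize the raw counts so that each histogram sums to 1.
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence in bits; bins with a_i = 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # The distance is the square root of the divergence.
    return math.sqrt(max(divergence, 0.0))
```

Identical histograms give a distance of 0 and histograms with no overlapping bins give 1, which makes the metric easy to sanity-check before applying it to model output.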

Climate

  • ClimateBench (Watson-Parris et al., 22)
    ML-ready dataset for forecasting the climate response to aerosols. Global dataset at 2° spatial and yearly resolution creating videos of 96x144 images, totaling approx. 2GB of storage. Includes carbon dioxide, methane, sulfur dioxide, and soot forcings and temperature, diurnal temperature range and precipitation predictors. Includes CMIP6's AerChemMIP, NorESM2, ScenarioMIP, and DAMIP data. Evaluates RMSE. Used in 1, 2, 3, and 4.

  • ClimART: A Benchmark Dataset for Emulating Atmospheric Radiative Transfer in Weather and Climate Models (Cachay and Ramesh et al., 22)
    ML-ready dataset for forecasting atmospheric radiative transfer parametrizations. Contains 10M samples from present, pre-industrial, and future climate conditions, based on the Canadian Earth System Model. Used in 1.

  • PaleoJump: A database for abrupt transitions in past climates (Bagniewski et al., 2022)
    Raw dataset for forecasting climatic shocks and analyzing paleoclimate. Global dataset from 123 sites with point data from 4M years ago until present. Includes PANGAEA and NCEI/NOAA datasets. Contains 49 marine-sediment cores, 29 speleothems, 18 lake sediment cores, 16 terrestrial records, and 11 ice cores. Used in 1.

  • CMIP6 (WCRP, 2019)
    Raw comprehensive dataset of 100+ climate models under various emission scenarios.

Climate Risk

  • todo

Cryosphere

  • [MAR]
    todo

Flooding

  • [NEMO: Digital Twin Earth]
    todo

Land surface, forest, and biodiversity

  • EarthNet2021x: Forecasting High-Resolution Earth Multispectral Imagery (Requena-Mesa et al., 2021)
    ML-ready dataset for video prediction of land surface dynamics. Germany-centric dataset at 20m spatial and 5-day resolution creating 32K samples of size 128x128 with 30 time steps, totaling 218GB storage. Contains multispectral satellite imagery, cloud masks, elevation, land cover, rainfall, pressures, and temperatures. Includes Sentinel-2, EU-DEM, E-OBS data. Evaluates MAD, OLS, EMD, and SSIM. Used in 1.

  • [CESM CLM]
    Raw dataset from a go-to model for global land surface dynamics.

  • see Awesome-awesome for more forest data.

Ocean

  • todo

Renewables wind and solar

  • WiSoSuper: Benchmarking Super-Resolution Methods on Wind and Solar Data (Kurinchi-Vendhan et al., 2022)
    ML-ready dataset for superresolution of wind and solar data. US dataset at 10 and 2km spatial and 4-hourly resolution (wind) and 20 and 4km spatial and hourly resolution (solar) from 2007 to 2018. Includes NREL WIND and NSRDB solar data. Contains output variables (high-res. westward wind velocity, southward wind velocity, direct normal irradiance, diffuse horizontal irradiance) and input variables (low-res. bilinearly interpolated version of HR variables). Evaluates RMSE, kinetic energy spectrum, and solar semivariogram. Used in 1, 2.
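
The WiSoSuper entry above describes its input variables as bilinearly interpolated versions of the high-resolution fields. As a rough illustration of what bilinear interpolation does to a gridded field, here is a minimal pure-Python sketch. The function name `bilinear_resize` and the corner-aligned coordinate mapping are assumptions made for this example; the benchmark's own preprocessing may use a different convention or library.

```python
def bilinear_resize(grid, out_rows, out_cols):
    """Resize a 2D grid (list of rows) with corner-aligned bilinear interpolation."""
    in_rows, in_cols = len(grid), len(grid[0])

    def source_coord(i, n_out, n_in):
        # Map an output index to a fractional input index so that the
        # first and last points of both grids coincide.
        return 0.0 if n_out == 1 else i * (n_in - 1) / (n_out - 1)

    out = []
    for r in range(out_rows):
        y = source_coord(r, out_rows, in_rows)
        y0 = int(y)
        y1 = min(y0 + 1, in_rows - 1)
        wy = y - y0
        row = []
        for c in range(out_cols):
            x = source_coord(c, out_cols, in_cols)
            x0 = int(x)
            x1 = min(x0 + 1, in_cols - 1)
            wx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = (1 - wx) * grid[y0][x0] + wx * grid[y0][x1]
            bottom = (1 - wx) * grid[y1][x0] + wx * grid[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out
```

For example, `bilinear_resize([[0, 2], [4, 6]], 3, 3)` returns `[[0.0, 1.0, 2.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]`. Going from the 10km to the 2km wind grid corresponds to a factor of 5 in each spatial dimension.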

Scientific machine learning and numerical methods

  • PDEBench: An Extensive Benchmark for Scientific Machine Learning (Takamoto et al., 2022)
    ML-ready dataset for forecasting various PDEs from hydromechanics. Includes 6 basic and 3 advanced problems. The basic PDEs are 1D advection, Burgers, Diffusion-Reaction, Diffusion-Sorption equations and 2D Diffusion-Reaction and Darcy Flow. The advanced PDEs are incompressible Navier-Stokes equations (NSE) and compressible NSE, and shallow-water equations. Evaluates RMSE, normalized RMSE, RMSE on boundary, RMSE of conserved value, RMSEs in low-, mid- or high-pass Fourier space. Used in 1.
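
Several entries in this list evaluate RMSE, and PDEBench additionally lists a normalized RMSE. The sketch below shows both on flat lists of values, assuming the normalization divides by the root-mean-square of the reference field. That normalization is one common choice and is an assumption here, so check the benchmark's own definition before comparing numbers.

```python
import math


def rmse(pred, true):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def normalized_rmse(pred, true):
    """RMSE divided by the root-mean-square of the reference values."""
    rms_true = math.sqrt(sum(t ** 2 for t in true) / len(true))
    return rmse(pred, true) / rms_true
```

With this normalization, a prediction of all zeros scores exactly 1, so values well below 1 indicate that the model explains most of the signal's magnitude.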

Weather

  • RainBench: Towards Data-Driven Global Precipitation Forecasting from Satellite Imagery (Schroeder de Witt et al., 2021)
    ML-ready dataset for forecasting precipitation. Global dataset at 1.4° and 5.625° spatial and hourly resolution creating images of size 32x64 with 3 vertical grid points from 2016-2019. Includes ERA5 reanalysis, SimSat simulated satellite data, and IMERG global precipitation estimates. Contains output variables (ERA5 total precipitation, IMERG precipitation), dynamic input variables (geopotential, temperature, humidity, cloud liquid water content, cloud ice water content, surface pressure, 2-meter temperature, and cloud-brightness temperatures), and static input variables (lat, lon, land-sea mask, orography, soil type). Evaluates RMSE. Used in 1.

  • WeatherBench: A benchmark dataset for data-driven weather forecasting (Rasp et al., 2020)
    ML-ready dataset for forecasting weather. Global dataset at 1.4° and 5.625° spatial and hourly resolution creating images of size 128x256 to 32x64 with 13 vertical grid points, totaling 191GB storage. Includes geopotential, temperature, humidity, wind, potential vorticity, solar radiation, and others. Includes ERA5 and CMIP-MPI-ESM-HR data. Evaluates RMSE. Used in 1, 2, 3, and 4.

  • ERA5 (ECMWF, 2020)
    Raw hourly reanalysis estimate of atmospheric, land and oceanic variables. Global, 30km grid, with 137 vertical nodes in the atmosphere, including uncertainties, 1959-present. Used in FourCastNet and Keisler, 22.
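
The image sizes quoted for RainBench and WeatherBench follow from the grid spacing, and RMSE on a regular latitude-longitude grid is commonly weighted by the cosine of latitude so that the small cells near the poles do not dominate. The sketch below illustrates both points. It assumes the "1.4°" grid is 1.40625°, which is the spacing that yields 128x256, and that latitudes sit at cell centers. The function names are hypothetical, and the weighting shown is one common convention rather than a statement of how every benchmark above computes its score.

```python
import math


def grid_shape(resolution_deg):
    """Number of (latitude, longitude) cells of a regular global grid."""
    return round(180 / resolution_deg), round(360 / resolution_deg)


def cell_center_latitudes(resolution_deg):
    """Latitudes of cell centers from south to north (assumed layout)."""
    n_lat, _ = grid_shape(resolution_deg)
    return [-90 + resolution_deg * (j + 0.5) for j in range(n_lat)]


def latitude_weighted_rmse(pred, true, lats_deg):
    """RMSE over a lat-lon field, weighting each row by cos(latitude)."""
    cos_lat = [math.cos(math.radians(lat)) for lat in lats_deg]
    mean_cos = sum(cos_lat) / len(cos_lat)
    # Weights are normalized so that they average to 1 over all latitudes.
    weights = [c / mean_cos for c in cos_lat]
    total, count = 0.0, 0
    for w, pred_row, true_row in zip(weights, pred, true):
        for p, t in zip(pred_row, true_row):
            total += w * (p - t) ** 2
            count += 1
    return math.sqrt(total / count)
```

Here `grid_shape(5.625)` gives `(32, 64)` and `grid_shape(1.40625)` gives `(128, 256)`, matching the image sizes in the entries above.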

Wildfire

Awesome-awesome

Acknowledgements

  • This list was only possible to assemble thanks to extensive input from Duncan Watson-Parris, Paula Harder, and Fabrizio Falasca.
