Calculate SST from 1940-Present #36

Open
alxmrs opened this issue Mar 7, 2024 · 4 comments

Comments

Owner

alxmrs commented Mar 7, 2024

https://www.threads.net/@earthlyeducation/post/C4KP3-wxv83

I notice this only goes back to 1980. How bad is the global average sea surface temperature (SST) this year compared to years past? I think we could answer this with a SQL query.

Owner Author

alxmrs commented Mar 12, 2024

The query will be something like:


SELECT
  DATE("time") as date,
  AVG("sea_surface_temperature") as daily_avg_sst
FROM 
  "era5" 
GROUP BY
  DATE("time")
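
To check the shape of this query without touching the real data, here is a minimal sketch that runs it against an in-memory SQLite table. The table layout (one row per timestamp and grid cell), the sample values, and the added ORDER BY are assumptions for illustration only; the real "era5" table and query engine may behave differently.

```python
import sqlite3

# In-memory stand-in for the "era5" table; all rows are made-up values in kelvin.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "era5" ("time" TEXT, "sea_surface_temperature" REAL)')
conn.executemany(
    'INSERT INTO "era5" VALUES (?, ?)',
    [
        ("1940-01-01 00:00:00", 290.0),
        ("1940-01-01 12:00:00", 292.0),
        ("1940-01-01 12:00:00", None),  # hypothetical land cell with no SST
        ("1940-01-02 00:00:00", 288.0),
        ("1940-01-02 12:00:00", 289.0),
    ],
)

# Same query as above, plus an ORDER BY so the rows come back in date order.
QUERY = """
SELECT
  DATE("time") AS date,
  AVG("sea_surface_temperature") AS daily_avg_sst
FROM "era5"
GROUP BY DATE("time")
ORDER BY DATE("time")
"""

daily_avg = conn.execute(QUERY).fetchall()
for date, sst in daily_avg:
    print(date, sst)
```

AVG skips NULL values, so cells without an SST value do not drag the mean down. Note that a plain AVG weights every grid cell equally; on a regular latitude-longitude grid, an area-weighted mean (for example, weighting by the cosine of latitude) is probably needed for a true global average.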

Owner Author

alxmrs commented Mar 26, 2024

Lessons on how to scale this up, taken from this article:

https://blog.coiled.io/blog/coiled-xarray.html

  • Use memory-optimized VMs with large amounts of RAM (e.g. 64).
  • Use larger chunk sizes (around 800 MiB).
  • Use Coiled Run to launch the job from a remote VM (presumably to reduce idle VM time, given a large task graph).

Other principles:

  • Understand how to minimize the Dask task graph. The size of the graph largely determines the performance of the job, because every task costs CPU time in scheduler overhead.
  • Prefer a task graph that minimizes IO overhead, i.e. try to load bigger batches of data into RAM for the core computation.
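
As a rough illustration of why chunk size matters for graph size, the sketch below counts how many chunks cover an array at two chunk sizes. The 1 TiB total and the 100 MiB baseline are hypothetical numbers, and "one task per chunk per operation" is only a rule of thumb, not an exact model of the Dask scheduler.

```python
MIB = 2**20
TIB = 2**40


def chunk_count(total_bytes: int, chunk_bytes: int) -> int:
    """Number of chunks needed to cover total_bytes (the last one may be partial)."""
    return -(-total_bytes // chunk_bytes)  # ceiling division on integers


# Hypothetical 1 TiB array, split at two different chunk sizes.
total = 1 * TIB
small = chunk_count(total, 100 * MIB)
large = chunk_count(total, 800 * MIB)

print(f"100 MiB chunks: {small} chunks")
print(f"800 MiB chunks: {large} chunks")
print(f"reduction: ~{small / large:.0f}x fewer chunks to schedule")
```

Under these assumptions, the larger chunks cut the number of chunks (and so, roughly, the tasks per operation) by about 8x, at the price of needing more RAM per worker.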

Owner Author

alxmrs commented Mar 28, 2024

According to this:

import xarray as xr

# Lazily open the public ARCO-ERA5 Zarr store, 240 time steps per chunk.
era5_ds = xr.open_zarr(
    'gs://gcp-public-data-arco-era5/ar/'
    'full_37-1h-0p25deg-chunk-1.zarr-v3/',
    chunks={'time': 240, 'level': 1},
)
dsl0 = era5_ds.sel(level=1000)  # keep one level only; we only care about surface SST
dsl0.nbytes / 2**40  # size in TiB
# 1119.3059158361393

It looks like the full surface-level ERA5 corpus is ~1120 TiB, i.e., ~4.5x larger than the demo that Coiled performed.

🤞🏽 Hopefully this means it will only cost ~5 times as much?
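
For comparison with the ~800 MiB chunk guideline, here is a back-of-the-envelope estimate of what `chunks={'time': 240}` means for a single surface variable. The grid shape (721 x 1440 points at 0.25 degrees) and the 4-byte float dtype are assumptions that are not stated in this thread, so treat the result as approximate.

```python
MIB = 2**20

# Assumptions (not stated in the thread): a 0.25-degree global grid of
# 721 x 1440 points, stored as 4-byte floats.
N_LAT, N_LON, BYTES_PER_VALUE = 721, 1440, 4
TIME_STEPS_PER_CHUNK = 240  # from chunks={'time': 240} above

bytes_per_time_step = N_LAT * N_LON * BYTES_PER_VALUE
chunk_mib = TIME_STEPS_PER_CHUNK * bytes_per_time_step / MIB

print(f"{bytes_per_time_step / MIB:.2f} MiB per time step")
print(f"{chunk_mib:.1f} MiB per chunk")
```

If those assumptions hold, each chunk of one variable is a little under 1 GiB, which is in the same ballpark as the chunk size recommended in the article.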

Owner Author

alxmrs commented Mar 28, 2024

Wait! We only care about SST! So, this is only ~4.1 TiB!

sst = dsl0.sea_surface_temperature
sst.nbytes / 2**40
# 4.100021640770137
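
To put the two sizes side by side, here is a small sketch of the arithmetic. It uses only the numbers quoted in this thread; the size of the Coiled demo is inferred from the "~4.5x" estimate rather than measured, and assuming cost scales linearly with data volume is a guess.

```python
full_tib = 1119.3059158361393  # full surface-level dataset, from above
sst_tib = 4.100021640770137    # SST only, from above
DEMO_RATIO = 4.5               # "~4.5x larger than the demo", from above

# Implied size of the demo workload, if the ~4.5x estimate is right.
demo_tib = full_tib / DEMO_RATIO

# How SST alone compares with the full dataset and with the demo.
full_over_sst = full_tib / sst_tib
sst_over_demo = sst_tib / demo_tib

print(f"implied demo size: ~{demo_tib:.0f} TiB")
print(f"full dataset is ~{full_over_sst:.0f}x the size of SST alone")
print(f"SST is ~{sst_over_demo:.1%} of the demo's data volume")
```

So, if cost really does scale with bytes processed, the SST-only job should be a small fraction of the demo's cost rather than ~5 times as much.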
