Skip to content

Lambda function based orchestration for horizontal scaling of aggregations#1

Merged
espg merged 60 commits into
mainfrom
lambda
Mar 6, 2026
Merged

Lambda function based orchestration for horizontal scaling of aggregations#1
espg merged 60 commits into
mainfrom
lambda

Conversation

@espg
Copy link
Copy Markdown
Contributor

@espg espg commented Dec 10, 2025

Rough attempt at making an embarrassingly parallel version of what demo_s3_xdggs.ipynb does. Lots of room for improvement.

What currently works: scaling to ~2000 workers. Generally, I set this slightly lower (1,700) for two reasons:

  1. The wall time is dominated by the longest running process, so increasing workers past a certain point does actually speed anything up.
  2. We start run into scaling errors around 2000+ workers (ssl handshake errors with the dispatcher and other hard to fix issues), so it's good to stay slightly under that limit.

In order to minimize wall time compute, I sort the runs so we start with the furthest south cells first, which due to orbital convergence tend to have the most observations (or zero obs, if we're trying to calculate a cell within the pole hole).

The default aws cap on concurrent lambda functions is 1,000, but they'll immediately approve 2,000 on request:

> aws service-quotas request-service-quota-increase \                    
                                             --service-code lambda \
                                             --quota-code L-B99A9384 \
                                             --desired-value 2000 \
                                             --region us-west-2

Here's the tldr of what all we get out of this pipeline:

======================================================================
Production Lambda Orchestrator - Catalog-Based
======================================================================

[1/5] Loading granule catalog from granule_catalog_cycle22_order6_stereo.json...
      Cycle: 22
      Parent order: 6
      Total cells in catalog: 2395
      Total granules: 4153
      Processing 2395 cells

[2/5] Authenticating with NASA Earthdata...
      Credentials expire: 2025-12-10 05:46:36+00:00

[3/5] Invoking 2395 Lambda functions (max 1700 concurrent)...
      Output: s3://jupyterhub-englacial-scratch-429435741471/atl06/production/
[4/5] Cost Calculation
----------------------------------------------------------------------
      Total Lambda execution time: 89,229.4s (24.79 hours)
      Memory: 2048MB (2.0GB)
      GB-seconds: 178,458.8
      Cost: $2.9743

[5/5] Summary
======================================================================
      Total cells:          2395
      With data:            1739
      Empty (no granules):  0
      Empty (filtered):     656
      Errors:               0
      Total observations:   1,370,185,727
----------------------------------------------------------------------
      Wall clock time:      463.2s (7.7m)
      Lambda compute time:  89,229.4s (1487.2m)
      Throughput:           5.2 cells/sec
      Estimated cost:       $2.9743
----------------------------------------------------------------------
      Output location:      s3://jupyterhub-englacial-scratch-429435741471/atl06/production/
======================================================================

That's 1.37 billion observations that are aggregated to a grid in a little under 8 minutes for about $3. I would love to see it happen for $2 and take under 5 minutes.

Here's the relevant speed operations to make this work:

  1. Catalog based query. We hit the NASA CMR once to download for a time range (one orbital cycle above, 92 days) and a large spatial area (everything south of -60 latitude), and then hit that catalog locally for splitting work up to workers. Way faster (which makes it cheaper), with less to debug.
  2. h5coro hyperslices to do spatial subsetting in place for the ICESat-2 data
  3. Aggressive metadata culling via item number 1 above... this is also where there's much more room for improvement (see comment below)
  4. Server-less horizontal scaling with lambda orchestration (possibly switching to cubed in the future)
  5. Right-sized worker processes

Number 5 above means 1 vcpu with 2048 MB of ram. For lambda, AWS charges by the GB-second, where a GB refers to the amount of ram that you provision the process, and vpcus which scale at a rate proportional to the ram allocated at 1 vcpu per 2 GB of ram. Our horizontal processes are pretty light on ram (under 768 MB), so it seems like dropping the memory allocation would be a way to make things cheaper. However, dropping memory also drops your cpu allocation, and the scaling on a portion of a vcpu vs a fully dedicated vcpu is abysmal... so much so that it takes way more than twice as long with 0.5 vcpus than with 1 vcpu, which means something more expensive with a longer wall time wait.

I suspect that this may get re-implemented using DevSeed's recommendation of cubed at some point; the only reason that it isn't used now is because I already had started building the lambda layers in AWS and wanted to benchmark there prior to switching to a new library.

@espg
Copy link
Copy Markdown
Contributor Author

espg commented Dec 10, 2025

By far the biggest hassle, and place where there's the most room for improvement has been on both sides of the spatial subsetting. By both, I mean:

  1. Defining the the worker 'cells' that encompass and cover all of Antarctica
  2. Determining which of those cells actually have data

I've been using morton indexing which makes the second problem slightly easier at the cost of making the first problem more annoying. The mortie library that I maintain is built on top of healpix, which is wrapped via healpy, and the entire stack has pretty lackluster support for converting complex polygons into coverage maps of healpix indices. Here's some of the (mostly failed) attempts at solving this:

  1. Building a minimum spanning tree of indices based on coverage of polygon vertices. This is complicated by the fact that the spanning tree isn't restricted to contiguous cells, so it's easy to have 'holes' appear in the geometry.
  2. Using the built in coverage function in healpix to convert from a polygon to healpix index coverage. This only works for convex geometries, which seems to not be any of the geometries that I would like to use.
  3. Trying above, but using qhull to force a convex geometry. Miserable and abject failure; geometries that are convex in polar stereographic aren't necessarily convex in spherical coordinates.

... and many more that are are equally frustrating. Currently, the way that we solve this is by a pretty reliable but jenky solution that does the following:

  1. Take each of the 27 Antarctic drainage basins, and then calculate the a.) the centroid , and b.) the distance to the furthest point in the drainage polgyon to that centroid.
  2. Use healpix disc query with that distance and center point to grab a set of healpix indices at desired cell resolution.
  3. Merge all of those basin disc coverage's together, and then take the np.unique set of that.

You could do it for the entire Antarctic polygon, and it would massively overestimate the cell indices that you want, but doing it per basin and merging gets relatively close to proper coverage with a slight over provision.

Determining which cells have data from the NASA CMR is also a hassle. Bounding boxes are useless in the polar regions, so instead, we try to use the simplified geometries as query objects. It's relatively easy to get to a '90%' solution-- we take the point / line geometries from the NASA CMR , calculate the morton indices for the vertices in those geometries, and use that provision our worker nodes. It almost works beautifully:

======================================================================
Production Lambda Orchestrator - Catalog-Based
======================================================================

[1/5] Loading granule catalog from granule_catalog_cycle22_order6.json...
      Cycle: 22
      Parent order: 6
      Total cells in catalog: 1742
      Total granules: 4153
      Processing 1742 cells

[2/5] Authenticating with NASA Earthdata...
      Credentials expire: 2025-12-10 05:07:33+00:00

[3/5] Invoking 1742 Lambda functions (max 1700 concurrent)...
      Output: s3://jupyterhub-englacial-scratch-429435741471/atl06/production/
      [  50/1742] empty (filtered) | 3.8 cells/s, ETA 7.4m
      [ 100/1742] OK (139 cells, 17,605 obs) | 6.8 cells/s, ETA 4.0m
      [ 150/1742] OK (773 cells, 73,345 obs) | 9.0 cells/s, ETA 3.0m
      [ 200/1742] OK (488 cells, 38,105 obs) | 10.9 cells/s, ETA 2.4m
      [ 250/1742] OK (876 cells, 65,582 obs) | 12.2 cells/s, ETA 2.0m
      [ 300/1742] OK (610 cells, 70,907 obs) | 13.1 cells/s, ETA 1.8m
      [ 350/1742] OK (1468 cells, 193,102 obs) | 14.3 cells/s, ETA 1.6m
      [ 400/1742] OK (1344 cells, 146,479 obs) | 15.7 cells/s, ETA 1.4m
      [ 450/1742] OK (2551 cells, 400,688 obs) | 16.9 cells/s, ETA 1.3m
      [ 500/1742] OK (1773 cells, 215,345 obs) | 18.2 cells/s, ETA 1.1m
      [ 550/1742] OK (2694 cells, 373,129 obs) | 19.6 cells/s, ETA 1.0m
      [ 600/1742] OK (2036 cells, 272,797 obs) | 20.8 cells/s, ETA 0.9m
      [ 650/1742] OK (2839 cells, 477,125 obs) | 21.9 cells/s, ETA 0.8m
      [ 700/1742] OK (2268 cells, 264,217 obs) | 23.1 cells/s, ETA 0.8m
      [ 750/1742] OK (3082 cells, 513,850 obs) | 24.1 cells/s, ETA 0.7m
      [ 800/1742] OK (2868 cells, 465,596 obs) | 25.0 cells/s, ETA 0.6m
      [ 850/1742] OK (2469 cells, 348,257 obs) | 25.9 cells/s, ETA 0.6m
      [ 900/1742] OK (3114 cells, 443,050 obs) | 26.8 cells/s, ETA 0.5m
      [ 950/1742] OK (2644 cells, 349,372 obs) | 27.6 cells/s, ETA 0.5m
      [1000/1742] OK (3029 cells, 564,727 obs) | 28.1 cells/s, ETA 0.4m
      [1050/1742] OK (3520 cells, 794,216 obs) | 28.5 cells/s, ETA 0.4m
      [1100/1742] OK (3421 cells, 661,995 obs) | 28.5 cells/s, ETA 0.4m
      [1150/1742] OK (3664 cells, 785,075 obs) | 28.8 cells/s, ETA 0.3m
      [1200/1742] OK (3081 cells, 487,608 obs) | 28.6 cells/s, ETA 0.3m
      [1250/1742] OK (3759 cells, 972,599 obs) | 28.3 cells/s, ETA 0.3m
      [1300/1742] OK (3507 cells, 870,284 obs) | 28.3 cells/s, ETA 0.3m
      [1350/1742] OK (3907 cells, 940,720 obs) | 28.3 cells/s, ETA 0.2m
      [1400/1742] OK (3268 cells, 709,374 obs) | 27.9 cells/s, ETA 0.2m
      [1450/1742] OK (3890 cells, 883,258 obs) | 27.0 cells/s, ETA 0.2m
      [1500/1742] OK (3386 cells, 1,205,016 obs) | 25.4 cells/s, ETA 0.2m
      [1550/1742] OK (3809 cells, 1,192,299 obs) | 23.3 cells/s, ETA 0.1m
      [1600/1742] OK (3720 cells, 1,521,734 obs) | 21.4 cells/s, ETA 0.1m
      [1650/1742] OK (3973 cells, 866,088 obs) | 18.0 cells/s, ETA 0.1m
      [1700/1742] OK (4094 cells, 2,804,549 obs) | 13.4 cells/s, ETA 0.1m

[4/5] Cost Calculation
----------------------------------------------------------------------
      Total Lambda execution time: 62,289.5s (17.30 hours)
      Memory: 2048MB (2.0GB)
      GB-seconds: 124,579.0
      Cost: $2.0763

[5/5] Summary
======================================================================
      Total cells:          1742
      With data:            1727
      Empty (no granules):  0
      Empty (filtered):     15
      Errors:               0
      Total observations:   1,107,767,160
----------------------------------------------------------------------
      Wall clock time:      377.4s (6.3m)
      Lambda compute time:  62,289.5s (1038.2m)
      Throughput:           4.6 cells/sec
      Estimated cost:       $2.0763
----------------------------------------------------------------------
      Output location:      s3://jupyterhub-englacial-scratch-429435741471/atl06/production/

@espg
Copy link
Copy Markdown
Contributor Author

espg commented Dec 10, 2025

Compare the above output to what's in the description of this PR. We note a few things:

  1. We're missing 200 million observations
  2. Those 200 million observations are from 12 cells that aren't processed
  3. We're doing a way better job with false positives-- of the 1742 cells we check, 99% have data. That compares against the production solution where only 72.6% of the 2395 we check have data.
  4. The above cost about a third less to process and finishes a minute faster

The wall clock time difference is due to the extra points we need to process...but majority of the cost difference is because of the extra ~600 cells we check to make sure we aren't missing any data. The reason that we're missing the 12 cells in the first place is because the geometries in the NASA CMR are simplified, and we tend to miss the edges of some cells near the pole hole. We fix this by densifying the geometries... which at times means crossing the pole hole and adding in many, many extra cells. But, we don't want to miss any data (certainly not 200 million observations).

A quick intermediate fix here is probably calling a buffer operation... but it would be nice to have a better exact solution rather than a marginally improved heuristic if possible.

- Use Docker AL2023 container for builds
- Add git-lfs for layer zip files
- Pin numpy<2.3 to avoid Lambda issues
- Patch astropy/__init__.py to remove pytest dependency
- Remove boto3/botocore from layer (Lambda provides)
- Disable ARM64 build (healpy compilation issues)
- Include working v14 layer (73MB, matches AWS)
@espg
Copy link
Copy Markdown
Contributor Author

espg commented Dec 10, 2025

Results from the above run(s):

results_lambda

(from lambda/visualize_production_results.ipynb)

@espg espg mentioned this pull request Dec 10, 2025
@espg
Copy link
Copy Markdown
Contributor Author

espg commented Dec 10, 2025

Todo:

  • Fix / diagnose spacial sub-setting issue (i.e., fix 90% solution to work 100%)
  • Finalize build pipeline for working x86 layer (working version is hand tuned)
  • Create working arm layer (lower priority, requires building healpy wheels for py 3.11 on arm)
  • Better output and visualization pipeline / notebook (lots of work to be done here!)
  • Define modular library functions and calls that aren't specific to atl06 (feedback appreciated)

Notes

This lambda layer from commit 522185c is what you want to use to reproduce; the more recent one is still being debugged and isn't known to work correctly yet.

Our infrastructure for the cloud deployment is here ; the terraform / open tofu template is mostly for setting up jupyterhub, but the most recent commits also patch in our lambda permissions. For now, they're specific to the prototype, but we'll update them to allow arbitrary lambda function calls from the notebook environment in the future.

@espg
Copy link
Copy Markdown
Contributor Author

espg commented Dec 10, 2025

HEAD version of the lambda zip is working now; there was an issue in bumping from either h5coro 0.8.0 to 1.0.3 , or a regression in pandas from version 2.3.2 to 2.3.3

@espg
Copy link
Copy Markdown
Contributor Author

espg commented Dec 10, 2025

h5coro version 1.0.3 is working now; older versions (i.e., 0.8.0) didn't have an explicit close() method when opening h5 datasets, but would close / release memory automatically as they were iterated over (reference / variable name overwrite would trigger release and garbage collection). Version 1.0.3 not only added the close() method, but requires it to de-reference and release memory.

@espg
Copy link
Copy Markdown
Contributor Author

espg commented Dec 11, 2025

arm64 drops the cost and time a bit:

----------------------------------------------------------------------
      Total Lambda execution time: 87,059.8s (24.18 hours)
      Memory: 2048MB (2.0GB)
      Architecture: arm64
      GB-seconds: 174,119.5
      Cost: $2.3216

[5/5] Summary
======================================================================
      Total cells:          2395
      With data:            1739
      Empty (no granules):  0
      Empty (filtered):     656
      Errors:               0
      Total observations:   1,370,185,727
----------------------------------------------------------------------
      Wall clock time:      409.7s (6.8m)
      Lambda compute time:  87,059.8s (1451.0m)
      Throughput:           5.8 cells/sec
      Estimated cost:       $2.3216
----------------------------------------------------------------------
      Output location:      s3://xagg/atl06/production/
======================================================================

With a better working spatial metadata subset, we can expect close to 6 min and under $2. Output (and notebook) updated to public bucket

@espg espg closed this Feb 18, 2026
@espg espg reopened this Feb 18, 2026
@espg
Copy link
Copy Markdown
Contributor Author

espg commented Feb 26, 2026

one of the more brutal bugs that I've had to track down. I was assuming that the mismatch in morton cell coverage was tied to the mismatch in data observations. This is not the case; here's the 12 'missing' morton cells at order 6:

mortMisses

The actual problem is that we were missing granules within the order 6 cells, even though we had the correct geospatial coverage. When we tried to fix it (by doing things like densify the search polygons), we accidentally solved the underlying issue by pulling in more granules.

The updated catalog (and code for catalog generation) fixes this by doing two passes--

  1. Get the coverage of the order6 cells use for processing.
  2. Intersect those cells with the granule polygons inside of the downloaded CMR catalog

Item 2 is obviously what's new. Earlier we were checking for vertex intersections in the CMR polygons with the morton cells...but this didn't account for geometries where edges spanned over a cell. Now, we use shapely polygon intersection-- we cast the morton indices to polygons, and build a shapely.STRtree for the intersections (all in polar stereo coordinates), and then use that output for building our json catalog that maps CMR records to lambda functions.

Here's the out from all this:

======================================================================
Production Lambda Orchestrator - Catalog-Based
======================================================================

[1/7] Loading granule catalog from granule_catalog_cycle22_order6_polygon.json...
      Cycle: 22
      Parent order: 6
      Total cells in catalog: 1742
      Total granules: 4153
      Processing 1742 cells
      Sorted by granule count (descending)
      Range: 396 → 1 granules

[2/7] Authenticating with NASA Earthdata...
      Credentials expire: 2026-02-26 03:07:13+00:00

[3/7] Creating template Zarr store...
      Output: s3://xagg/atl06/morton_aggregation.zarr

[4/7] Invoking 1742 Lambda functions (max 1700 concurrent)...
      Architecture: arm64 ($0.0000133334/GB-sec)
      [  50/1742] empty (filtered) | 2.7 cells/s, ETA 10.6m
      [ 100/1742] OK (93 cells, 8,878 obs) | 4.7 cells/s, ETA 5.8m
      [ 150/1742] OK (232 cells, 23,524 obs) | 6.4 cells/s, ETA 4.1m
      [ 200/1742] OK (1100 cells, 95,094 obs) | 8.0 cells/s, ETA 3.2m
      [ 250/1742] OK (923 cells, 90,943 obs) | 9.2 cells/s, ETA 2.7m
      [ 300/1742] OK (1642 cells, 205,329 obs) | 10.1 cells/s, ETA 2.4m
      [ 350/1742] OK (844 cells, 99,245 obs) | 11.2 cells/s, ETA 2.1m
      [ 400/1742] OK (2146 cells, 268,954 obs) | 12.2 cells/s, ETA 1.8m
      [ 450/1742] OK (1920 cells, 234,683 obs) | 13.2 cells/s, ETA 1.6m
      [ 500/1742] OK (2704 cells, 396,563 obs) | 14.4 cells/s, ETA 1.4m
      [ 550/1742] OK (2742 cells, 490,900 obs) | 15.4 cells/s, ETA 1.3m
      [ 600/1742] OK (1371 cells, 163,537 obs) | 16.3 cells/s, ETA 1.2m
      [ 650/1742] OK (2368 cells, 328,609 obs) | 17.3 cells/s, ETA 1.0m
      [ 700/1742] OK (2760 cells, 481,800 obs) | 18.3 cells/s, ETA 0.9m
      [ 750/1742] OK (2448 cells, 362,290 obs) | 19.3 cells/s, ETA 0.9m
      [ 800/1742] OK (3502 cells, 611,855 obs) | 20.1 cells/s, ETA 0.8m
      [ 850/1742] OK (3389 cells, 622,335 obs) | 20.9 cells/s, ETA 0.7m
      [ 900/1742] OK (3307 cells, 593,266 obs) | 21.6 cells/s, ETA 0.6m
      [ 950/1742] OK (3265 cells, 598,722 obs) | 22.2 cells/s, ETA 0.6m
      [1000/1742] OK (3077 cells, 589,407 obs) | 22.6 cells/s, ETA 0.5m
      [1050/1742] OK (3530 cells, 680,971 obs) | 23.1 cells/s, ETA 0.5m
      [1100/1742] OK (3283 cells, 552,654 obs) | 23.3 cells/s, ETA 0.5m
      [1150/1742] OK (3553 cells, 703,267 obs) | 23.0 cells/s, ETA 0.4m
      [1200/1742] OK (3501 cells, 908,437 obs) | 22.5 cells/s, ETA 0.4m
      [1250/1742] OK (4010 cells, 917,096 obs) | 19.3 cells/s, ETA 0.4m
      [1300/1742] OK (3134 cells, 555,841 obs) | 18.7 cells/s, ETA 0.4m
      [1350/1742] OK (4094 cells, 1,378,540 obs) | 18.6 cells/s, ETA 0.4m
      [1400/1742] OK (4065 cells, 1,040,883 obs) | 18.2 cells/s, ETA 0.3m
      [1450/1742] OK (4096 cells, 1,243,165 obs) | 17.8 cells/s, ETA 0.3m
      [1500/1742] OK (3821 cells, 1,134,885 obs) | 17.0 cells/s, ETA 0.2m
      [1550/1742] OK (3845 cells, 2,021,774 obs) | 15.8 cells/s, ETA 0.2m
      [1600/1742] OK (4086 cells, 1,188,083 obs) | 14.8 cells/s, ETA 0.2m
      [1650/1742] OK (4096 cells, 2,691,035 obs) | 12.9 cells/s, ETA 0.1m
      [1700/1742] OK (4096 cells, 2,984,062 obs) | 8.1 cells/s, ETA 0.1m

[5/7] Consolidating Zarr metadata...
/home/espg/software/magg/.venv/lib64/python3.13/site-packages/zarr/api/asynchronous.py:247: ZarrUserWarning: Consolidated metadata is currently not part in the Zarr format 3 specification. It may not be supported by other zarr implementations and may change in the future.
  warnings.warn(

[6/7] Cost Calculation
----------------------------------------------------------------------
      Total Lambda execution time: 87,151.7s (24.21 hours)
      Memory: 2048MB (2.0GB)
      Architecture: arm64
      GB-seconds: 174,303.5
      Cost: $2.3241

[7/7] Summary
======================================================================
      Total cells:          1742
      With data:            1727
      Empty (no granules):  0
      Empty (filtered):     15
      Errors:               0
      Total observations:   1,370,172,290
----------------------------------------------------------------------
      Wall clock time:      408.8s (6.8m)
      Lambda compute time:  87,151.7s (1452.5m)
      Throughput:           4.3 cells/sec
      Estimated cost:       $2.3241
----------------------------------------------------------------------
      Output location:      s3://xagg/atl06/morton_aggregation.zarr/
======================================================================

Disappointingly, we don't get any real increase in speed or reduction in cost-- but we're at least doing things correctly now, and only trying to process valid cells (and not missing any data).

All of this is probably more complicated than it needs to be. We're doing a ad hoc query against a polygon, but we don't have a proper polygon to morton spanning tree algorithm, so instead we're doing a few casts back and forth from vertex data to polygon intersection. Looking at the above plot, it's pretty obvious that we could just ask for all data south of something (65?), rather than bother with the orbital segment query and basins. A true fix would be to cast from the drainage polygon to coverage span at whatever order we're splitting by, and then do the same polygon intersection that the last few commits have applied. That would get rid of the Falkin islands / Maldives, and the tierra del fuego-- it would also be much more general, since the only reason that the vertex based approach has worked is because the data is roughly compatible (i.e., it wouldn't work if we did too small of a tiling split).

The cast from a polygon to a span coverage is doable, but also beyond claude's abilities to just code up-- it's not something that exists in it's training data, so it will do it wrong. I'll need to pen and paper design it and then pass the pseudo code to the agent for this to work. This worked well for minimum spanning tree / polygon that I coded this week... so it will still make the process much faster once I have time to properly implement it.

@espg espg merged commit 859f847 into main Mar 6, 2026
10 checks passed
@espg espg deleted the lambda branch March 6, 2026 02:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants