Skip to content

Add NASA HLS v2.0 data sources and integration tests#623

Merged
robmarkcole merged 21 commits intomasterfrom
add-nasa-hls
Apr 28, 2026
Merged

Add NASA HLS v2.0 data sources and integration tests#623
robmarkcole merged 21 commits intomasterfrom
add-nasa-hls

Conversation

@robmarkcole
Copy link
Copy Markdown
Collaborator

Address #622

@robmarkcole robmarkcole marked this pull request as draft April 24, 2026 10:07
…oads and add integration tests for fallback behavior
@robmarkcole robmarkcole marked this pull request as ready for review April 24, 2026 11:47
Comment thread rslearn/data_sources/nasa_hls.py Outdated
Comment thread rslearn/data_sources/nasa_hls.py Outdated
Comment thread rslearn/data_sources/nasa_hls.py Outdated
Comment thread rslearn/data_sources/nasa_hls.py Outdated
Comment thread rslearn/data_sources/nasa_hls.py Outdated
Comment thread rslearn/utils/stac.py Outdated
Comment thread rslearn/utils/stac.py
@favyen2
Copy link
Copy Markdown
Collaborator

favyen2 commented Apr 27, 2026

@claude can you review this PR?

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Claude finished @favyen2's task in 3m 32s —— View job


PR Review: Add NASA HLS v2.0 data sources and integration tests

  • Read CLAUDE.md for repo guidelines
  • Review rslearn/data_sources/nasa_hls.py
  • Review rslearn/utils/stac.py
  • Review test files
  • Review documentation
  • Summarize findings

This PR adds three new data source classes (Hls2S30, Hls2L30, Hls2) backed by LP DAAC/CMR STAC, plus pagination support in StacClient. The implementation is solid overall and well-tested, but there are several architectural and code-quality issues worth addressing — many already flagged by @favyen2's review comments.


Design Issues

Hls2.__init__ bypasses _NasaHlsBase.__init__ (nasa_hls.py:659)

Hls2.__init__ manually re-implements everything that _NasaHlsBase.__init__ does (sets self.timeout, self.auth, self.client, self.query, self.sort_by, self.sort_ascending, self.limit, self.properties_to_record) rather than calling super().__init__(). This is fragile: future changes to _NasaHlsBase.__init__ won't automatically propagate to Hls2. The Hls2 class should call the base class initializer or the base class needs to be refactored to accommodate multi-collection searches.

Hls2.get_item_by_name and Hls2.get_items overrides exist only for multi-collection (nasa_hls.py:771, 784)

As noted in the review comments, these overrides exist almost entirely to pass self.collection_names (a list) instead of self.collection_name (a string) to StacClient.search. Modifying StacDataSource.__init__ to accept collection_name: str | list[str] (internally normalizing to a list) would eliminate both overrides and the code duplication. Fix this →


Code Quality Issues

Dual environment variable names (nasa_hls.py:74)

Supporting EARTHDATA_TOKEN and NASA_EARTHDATA_TOKEN (via _first_set_env) adds complexity for minimal benefit. Pick one canonical name (EARTHDATA_TOKEN) and remove the _first_set_env helper. The documentation currently lists both, which is confusing. Fix this →

Band alias complexity (nasa_hls.py:489, 509, 556, 621)

The ASSET_KEY_TO_COMMON_NAME + EXTRA_BAND_ALIASES system in Hls2S30/Hls2L30 and BAND_ALIASES in Hls2 creates a multi-layer alias system. As the reviewer noted:

  • B01/B02 are already the native asset keys and match existing conventions in this codebase — no renaming needed.
  • Extra aliases like nir_broad, nir08, qa, FMASK, saa, sza, vaa, vza add surface area to the API without strong justification.

Suggest stripping aliases down to just the canonical names and a minimal set of genuinely useful aliases (e.g., fmaskFmask capitalisation normalisation). Fix this →

Overly broad exception catch in _download_asset_with_fallback (nasa_hls.py:337)

except Exception:
    http_asset_url = self._get_http_asset_url(item, asset_key)
    if not asset_url.startswith("s3://") or http_asset_url is None:
        raise
    ...

This catches every exception (disk full, network timeout, etc.) and falls back to HTTP. The condition asset_url.startswith("s3://") filters non-S3 cases, but for S3 it would fall back even on non-permission errors. A more specific exception type (e.g., botocore.exceptions.ClientError) would be safer. The _read_raster_for_item method is tighter (except RasterioIOError) — the download fallback should be similarly scoped.


stac.py Issues

next_method default (stac.py:176)

As flagged in review: link.get("method", "GET") should use cur_method as the default so that a "next" link without an explicit method field inherits the calling convention, not always GET. Fix this →

Missing comment on GET pagination (stac.py:161)

As flagged in review: add a comment on the GET branch of the pagination loop explaining that request_data is not passed because this only happens for GET next-links where the query parameters are already embedded in the URL. Fix this →


Documentation Error

timeout_seconds parameter doesn't exist (docs/data_sources/nasa_hls_Hls2.md:38, 153)

The JSON config examples in all three data source docs show "timeout_seconds": 30, but the actual __init__ parameter is timeout: timedelta. There is no timeout_seconds parameter. This will silently be ignored (or error depending on the config loader). Change docs to use "timeout" with the appropriate timedelta representation, or add a timeout_seconds: float convenience parameter to __init__ that converts to a timedelta internally. Fix this →


Minor Notes

  • Hls2.COLLECTION_NAME = "HLS2" is a placeholder that doesn't correspond to a real STAC collection. A class-level docstring or comment explaining this sentinel value would help future readers.
  • The integration tests (in tests/integration/) use mocked HTTP servers rather than live calls, which is appropriate for CI but worth noting that they don't exercise the actual CMR STAC API.
  • Test coverage is solid — the unit tests cover band aliases, S3/HTTPS fallback, credential caching, and merged Hls2 sorting.

Comment thread rslearn/data_sources/stac.py Outdated
Comment thread tests/unit/data_sources/test_nasa_hls.py Outdated
@robmarkcole robmarkcole merged commit 158e4c7 into master Apr 28, 2026
3 checks passed
@robmarkcole robmarkcole deleted the add-nasa-hls branch April 28, 2026 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants