start sketching the CMR mosaic backend #10

Merged
vincentsarago merged 8 commits into develop from vincents/sketch-mosaicbackend
Feb 1, 2024


vincentsarago
Member

from titiler.cmr.backend import CMRBackend

with CMRBackend("C1996881146-POCLOUD") as src:
    assets = src.assets_for_tile(10,10,10)

Granules found: 7903
print(assets)
['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 ...
]

@vincentsarago
Member Author

Data access

I'm not sure how we should configure the data access.

In earthaccess there are multiple ways: https://earthaccess.readthedocs.io/en/latest/howto/edl/

Right now I've set .data_links(access="direct") for the DataGranule, which (if I understand correctly) will return an S3 URL.

To access this S3 URL we need AWS credentials. earthaccess provides a way to fetch credentials using auth.get_s3_credentials, but we need to have the DAAC name 🤷

Should we assume titiler-cmr will run in an environment where we have direct access to the S3 files?

cc @sharkinsspatial @abarciauskas-bgse

@sharkinsspatial
Member

@vincentsarago This is the concern I was discussing in our call with @abarciauskas-bgse a while back. Does earthaccess have an authentication "escape hatch" that we can use?

The initial goal for us would be to deploy titiler-cmr to a Lambda which will be executed using a role that has direct access credentials for the DAAC. This is the situation we use now with the VEDA titiler instances. To use this approach, I think earthaccess will need an "escape hatch" so that it just assumes that it doesn't need to pass any credentials to boto3 and will just use the execution role. @abarciauskas-bgse can you confirm that the direct access role we have in VEDA can support multiple DAACs (or is it just LPDAAC at the moment)?

This will handle our initial NASA-controlled deployments. But we should also consider eventually having another option for non-NASA users to use temporary s3_credentials. Because the Cumulus EDL s3_credentials endpoints are slow, we've previously used an external credential rotation service that periodically fetches temporary S3 credentials for each DAAC and stores them in an SSM parameter that the tiling Lambda can read at runtime.
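
A minimal sketch of the read side of that rotation pattern, assuming the rotation job stores the JSON returned by a DAAC's s3_credentials endpoint in one SSM parameter per DAAC. The parameter path and the helper name are hypothetical; `ssm_client` is expected to behave like `boto3.client("ssm")`, and the returned keys match the `key`/`secret`/`token` arguments of `s3fs.S3FileSystem`.

```python
import json
from typing import Any, Dict

# Hypothetical naming scheme; the thread does not specify the parameter path.
PARAMETER_TEMPLATE = "/titiler-cmr/s3-credentials/{daac}"


def read_rotated_credentials(ssm_client: Any, daac: str) -> Dict[str, str]:
    """Read temporary S3 credentials that a rotation job stored in SSM.

    `ssm_client` is expected to behave like boto3.client("ssm").
    """
    response = ssm_client.get_parameter(
        Name=PARAMETER_TEMPLATE.format(daac=daac),
        WithDecryption=True,  # assumes the value is stored as a SecureString
    )
    # Assumed payload: the JSON document with accessKeyId, secretAccessKey
    # and sessionToken fields.
    creds = json.loads(response["Parameter"]["Value"])
    # Rename the fields to the keyword arguments s3fs.S3FileSystem accepts.
    return {
        "key": creds["accessKeyId"],
        "secret": creds["secretAccessKey"],
        "token": creds["sessionToken"],
    }
```

With this shape, the tiling Lambda only pays for one SSM read per request (or per warm container if the result is kept in memory) instead of a full EDL round trip.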

@vincentsarago
Member Author

With the last commit I've implemented two ways to obtain credentials:

  • "auto": CMRBackend will use earthaccess.auth.Auth().get_s3_credentials to get the AWS S3 credentials, then pass them to s3fs.S3FileSystem.
  • "iam": (likely to be renamed to something else 😅) will assume the backend has all the credentials set in the environment.
import earthaccess
from earthaccess.daac import find_provider
cmr_auth = earthaccess.login(strategy="netrc")

from titiler.cmr.backend import CMRBackend
from titiler.cmr.reader import ZarrReader

with CMRBackend("C1996881146-POCLOUD", cmr_auth, reader=ZarrReader, reader_options={"variable": "analysed_sst"}) as src:
    img = src.tile(4, 4, 4, cmr_query={"temporal": ("2020-02-01", "2020-02-01")})

PermissionError: Forbidden

😬 not sure why it doesn't work for now

@vincentsarago
Member Author

vincentsarago commented Jan 26, 2024

import earthaccess
cmr_auth = earthaccess.login(strategy="netrc")


from titiler.cmr.backend import aws_s3_credential
import rasterio

s3_credentials = aws_s3_credential(cmr_auth, "POCLOUD")
aws_session = rasterio.session.AWSSession(
   aws_access_key_id=s3_credentials["accessKeyId"],
   aws_secret_access_key=s3_credentials["secretAccessKey"],
   aws_session_token=s3_credentials["sessionToken"],
)

with rasterio.Env(aws_session):
   with rasterio.open("s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200201090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc") as src:
      print(src.meta)

RasterioIOError: Access Denied

same with rasterio, so I'm not sure what the credentials are for 🤷

@sharkinsspatial
Member

@vincentsarago Where are you trying to execute this code from? Unfortunately, not all DAACs document this when describing their s3_credentials endpoints, but the temporary credentials can only be used "in-region", so that S3 operations do not incur egress costs. To run this code you need an environment in us-west-2; otherwise you'll receive access errors. https://earthaccess.readthedocs.io/en/latest/tutorials/getting-started/#accessing-the-data
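
A small sketch of a guard for that in-region constraint, assuming the runtime exposes its region through the `AWS_REGION` or `AWS_DEFAULT_REGION` environment variables (AWS Lambda sets both; a plain EC2 instance or laptop may not). The helper name is hypothetical.

```python
import os
from typing import Mapping, Optional

# Region named in this thread for in-region Earthdata Cloud access.
EARTHDATA_REGION = "us-west-2"


def running_in_region(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the environment reports the Earthdata Cloud region."""
    env = os.environ if env is None else env
    # Prefer AWS_REGION, fall back to AWS_DEFAULT_REGION when it is unset.
    region = env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")
    return region == EARTHDATA_REGION
```

A check like this can turn an opaque "Access Denied" into an explicit message at startup; it says nothing about whether the credentials themselves are valid.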

@abarciauskas-bgse
Contributor

abarciauskas-bgse commented Jan 26, 2024

@sharkinsspatial

can you confirm that the direct access role we have in VEDA can support multiple DAACs (or is it just LPDAAC at the moment).

I believe the veda-date-reader-dev role in the smce-veda account has the same access as the nasa-veda-prod role of the VEDA JHub. Using the nasa-veda-prod role, I confirmed that it can access PO.DAAC, GESDISC, LPDAAC, ORNL, and NSIDC.

I tested access to GHRC (s3://ghrcw-protected), LADS (s3://prod-lads), and ASF (s3://asf-ngap2w-p-s1-ocn-1e29d408/) and got access denied.

I can't find anything for ASDC, CDDIS, OB.DAAC or SEDAC in Earthdata Cloud, but my search may be incomplete: I used the organization filter and the Earthdata Cloud filter for those DAACs and nothing came up.

I do think we want to use the veda-data-read-dev role or one of the other roles we have whitelisted for access in the deployment so that we don't have to handle rotating credentials and fetching credentials from different endpoints for now.

However, I have low confidence that role-based access works with earthaccess at this time. We should help fix this if we can, and I can continue to look into it next week, but we can also use earthaccess to find the data and then xarray + s3fs to open it ourselves.

See nsidc/earthaccess#431

@sharkinsspatial
Member

@abarciauskas-bgse For clarification, the issue that @vincentsarago is experiencing here is unrelated to direct role-based access (he hasn't used the iam credentials option yet). The first comment #10 (comment) is s3fs throwing an access error from the use of:

if protocol == "s3":
    s3_credentials = s3_credentials or {}
    s3_filesystem = s3fs.S3FileSystem(**s3_credentials)
    return (
        s3_filesystem.open(src_path)
        if xr_engine == "h5netcdf"
        else s3fs.S3Map(root=src_path, s3=s3_filesystem)
    )

The second comment #10 (comment) is from rasterio's underlying boto3 calls throwing an access error. I suspect that these are probably both related to attempting to use the creds outside of us-west-2.

@vincentsarago
Member Author

Yes I was running this locally not on AWS 😅

@sharkinsspatial
Member

@vincentsarago A few questions on filesystem instantiation

Rather than manage our own fsspec filesystem, should we defer to just using the earthaccess convenience open method? I think @abarciauskas-bgse and I can implement an IAM-based escape hatch for earthaccess to support the alternative authorization method nsidc/earthaccess#431 (comment).

There are a few possible issues here:

  1. We won't have direct access to the fsspec filesystem instantiation, so if we wanted to continue with @abarciauskas-bgse's filesystem-level caching work we'd need to add more config options to earthaccess (🤔 maybe not a bad thing).
  2. The other area where my understanding is limited is the possible performance differences in rasterio when using an fsspec filesystem rather than vsicurl or vsis3 for access. Have we done any benchmarking to understand fine-grained differences in the Range requests generated when using different file access mechanisms (vsis3, s3fs) in rasterio? Maybe we can write up an issue for testing this as part of this development.

@vincentsarago
Member Author

Rather than manage our own fsspec filesystem should we defer to just using the earthaccess convenience open method?

👍, I think I wanted to reuse the code @abarciauskas-bgse worked on for titiler-xarray, believing there was some optimization done!

I think @abarciauskas-bgse and I can implement an IAM-based escape hatch for earthaccess to support the alternative authorization method nsidc/earthaccess#431 (comment).

TBH, I don't think we need to do this; we can simply require this project to be deployed on AWS in the same region as the data 🤷

The other area where my understanding is limited is the possible performance differences in rasterio when using an fsspec filesystem rather than vsicurl or vsis3 for access. Have we done any benchmarking to understand fine-grained differences in the Range requests generated when using different file access mechanisms (vsis3, s3fs) in rasterio? Maybe we can write up an issue for testing this as part of this development.

I don't think there will be much difference between the two TBH; fsspec and rasterio might not merge range requests (this would be interesting to validate; more info on multi-range requests can be found in rasterio/rasterio#2969).

The main issue is that rasterio does not support multi-dimensional data, so it will be easier to use fsspec + xarray + rioxarray.

@vincentsarago
Member Author

I think the easiest way to test everything is to deploy this to AWS 🚢

I've opened #11 and will update the CI (#13) in another PR

@asynccontextmanager
async def lifespan(app: FastAPI):
    """FastAPI Lifespan."""
    app.state.cmr_auth = earthaccess.login(strategy="netrc")
Contributor

I'm wondering if we should keep this. We are only using earthaccess to search for data, not access it, at the moment. earthaccess.search_data works without login.

If we decide to use earthaccess to open the data, which seems reasonable, I would guess we want 2 options for accessing the data:

  1. (preferred) the titiler-cmr instance is using a role with access to the data.
  2. titiler-cmr has environment variables (Earthdata username and password) that are used to fetch credentials.

Should this be strategy="environment" for the second case?

As far as I know, accessing Earthdata Cloud data won't work outside us-west-2, so tiling can't be tested locally unless you run it on an instance in us-west-2. Environment variables therefore seem a more appropriate option.

Member Author

I'm wondering if we should keep this. We are only using earthaccess to search for data, not access it, at the moment. earthaccess.search_data works without login.

I'll make the login optional 🙏

Should this be strategy="environment" for the second case?

👍

Member

Just wanted to note that there could be significant overhead with fetching s3 temporary tokens via EDL for every tiling request. Since our function is stateless we’ll need to make the full EDL request cycle every time. This was the rationale for employing external credential rotation on other projects https://github.com/NASA-IMPACT/edl-credential-rotation

Member Author

Sure, I've added a cache layer on top of the credential getter, but yeah, it won't be super performant.
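
For illustration only, a minimal sketch of what a time-based cache around a credential getter could look like. This is not the code added in this PR: the class name, the 30-minute default TTL, and the injectable `fetch` callable are all assumptions. The cache lives in process memory, so in a Lambda it only helps across warm invocations of the same container.

```python
import time
from typing import Callable, Dict, Tuple

Credentials = Dict[str, str]


class CachedCredentials:
    """Cache credentials per provider for a fixed time-to-live (TTL)."""

    def __init__(
        self,
        fetch: Callable[[str], Credentials],
        ttl_seconds: float = 1800.0,  # assumed value, shorter than token lifetime
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # provider -> (expiry time, credentials)
        self._store: Dict[str, Tuple[float, Credentials]] = {}

    def get(self, provider: str) -> Credentials:
        now = self._clock()
        cached = self._store.get(provider)
        if cached is not None and cached[0] > now:
            # Still fresh: skip the slow EDL round trip.
            return cached[1]
        creds = self._fetch(provider)
        self._store[provider] = (now + self._ttl, creds)
        return creds
```

The `fetch` callable would wrap whatever performs the EDL request; passing the clock in makes expiry easy to test without sleeping.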

@asynccontextmanager
async def lifespan(app: FastAPI):
    """FastAPI Lifespan."""
    if auth_config.strategy == "environment":
Member Author

Updated the setting to be either environment or iam.

When using environment, we will use earthaccess to get S3 credentials.
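
A hedged sketch of how the two settings could map to filesystem arguments. The function name and the `get_credentials` callable are hypothetical (the callable stands in for the earthaccess credential call mentioned earlier in this thread); the returned keys are the `key`/`secret`/`token` arguments accepted by `s3fs.S3FileSystem`.

```python
from typing import Callable, Dict, Optional


def s3_filesystem_kwargs(
    strategy: str,
    provider: str,
    get_credentials: Optional[Callable[[str], Dict[str, str]]] = None,
) -> Dict[str, str]:
    """Build keyword arguments for s3fs.S3FileSystem from the auth strategy."""
    if strategy == "iam":
        # No explicit credentials: s3fs falls back to the default AWS
        # credential chain, e.g. the role the Lambda is executed with.
        return {}
    if strategy == "environment":
        if get_credentials is None:
            raise ValueError("the 'environment' strategy needs a credential getter")
        creds = get_credentials(provider)
        return {
            "key": creds["accessKeyId"],
            "secret": creds["secretAccessKey"],
            "token": creds["sessionToken"],
        }
    raise ValueError(f"Unknown auth strategy: {strategy!r}")
```

The result would be used as `s3fs.S3FileSystem(**s3_filesystem_kwargs(...))`; keeping the branch in one place means the readers do not need to know which strategy is active.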

@vincentsarago vincentsarago changed the base branch from main to develop January 29, 2024 20:31
@vincentsarago vincentsarago marked this pull request as ready for review February 1, 2024 11:08
@vincentsarago vincentsarago merged commit 8158c98 into develop Feb 1, 2024
5 checks passed
@vincentsarago vincentsarago deleted the vincents/sketch-mosaicbackend branch February 1, 2024 11:08
3 participants