start sketching the CMR mosaic backend #10
vincentsarago
commented
Jan 23, 2024
Data access
I'm not sure how we should configure the data access. In earthaccess there are multiple ways: https://earthaccess.readthedocs.io/en/latest/howto/edl/ right now I've set Then to access this S3 URL we need AWS credentials. earthaccess provides a way to return credentials using should we assume titiler-cmr to be run in an environment where we have direct access to the S3 files?
@vincentsarago This is the concern I was discussing in our call with @abarciauskas-bgse a while back. Does The initial goal for us would be to deploy This will handle our initial NASA controlled deployments. But we should also consider eventually having another option for non-NASA users to use temporary |
With the last commit I've implemented two ways to obtain credentials:
import earthaccess
from earthaccess.daac import find_provider
cmr_auth = earthaccess.login(strategy="netrc")
from titiler.cmr.backend import CMRBackend
from titiler.cmr.reader import ZarrReader
with CMRBackend("C1996881146-POCLOUD", cmr_auth, reader=ZarrReader, reader_options={"variable": "analysed_sst"}) as src:
    img = src.tile(4, 4, 4, cmr_query={"temporal": ("2020-02-01", "2020-02-01")})
PermissionError: Forbidden 😬 not sure why it doesn't work for now |
import earthaccess
cmr_auth = earthaccess.login(strategy="netrc")
from titiler.cmr.backend import aws_s3_credential
import rasterio
s3_credentials = aws_s3_credential(cmr_auth, "POCLOUD")
aws_session = rasterio.session.AWSSession(
    aws_access_key_id=s3_credentials["accessKeyId"],
    aws_secret_access_key=s3_credentials["secretAccessKey"],
    aws_session_token=s3_credentials["sessionToken"],
)
with rasterio.Env(aws_session):
    with rasterio.open("s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200201090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc") as src:
        print(src.meta)
RasterioIOError: Access Denied same with rasterio, so I'm not sure what the credentials are for 🤷 |
@vincentsarago Where are you trying to execute this code from? Though unfortunately not all the DAACs seem to document this when describing their |
The veda-date-reader-dev role in the smce-veda account has, I believe, the same access as the nasa-veda-prod role of the VEDA JHub. Using the nasa-veda-prod role, I confirmed it can access PO.DAAC, GESDISC, LPDAAC, ORNL, and NSIDC. I also tested access to GHRC (s3://ghrcw-protected), LADS (s3://prod-lads), and ASF (s3://asf-ngap2w-p-s1-ocn-1e29d408/) and got access denied. I can't find anything for ASDC, CDDIS, OB.DAAC, or SEDAC in Earthdata Cloud, but my search may be incomplete: I used the organization filter and the Earthdata Cloud filter for those DAACs and nothing came up.
I do think we want to use the veda-data-read-dev role, or one of the other roles we have whitelisted for access, in the deployment so that we don't have to handle rotating credentials and fetching credentials from different endpoints for now. However, I have low confidence that role-based access works with earthaccess at this time. We should help fix this if we can, and I can continue to look into it next week, but we can also use earthaccess to find the data and then xarray + s3fs to open it ourselves.
@abarciauskas-bgse For clarification, the issue that @vincentsarago is experiencing here is unrelated to direct role-based access (he hasn't titiler-cmr/titiler/cmr/reader.py Lines 69 to 75 in 4b383a5
The second comment #10 (comment) is from rasterio's underlying boto3 calls throwing an access error. I suspect that these are probably both related to attempting to use the creds outside of us-west-2.
Yes I was running this locally not on AWS 😅 |
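For reference, a minimal sketch of a guard that would surface this earlier than a `PermissionError` or `RasterioIOError`. It assumes the Earthdata Cloud buckets discussed here are only readable from us-west-2 and that the runtime advertises its region through the standard `AWS_REGION` or `AWS_DEFAULT_REGION` environment variables. The function names are hypothetical and not part of titiler-cmr.

```python
import os
from typing import Mapping, Optional

# Assumption taken from this thread: in-region access means us-west-2.
EARTHDATA_REGION = "us-west-2"


def current_aws_region(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the region advertised by the environment, or None when unset."""
    env = os.environ if env is None else env
    # AWS_REGION is set by AWS Lambda; AWS_DEFAULT_REGION is read by boto3 and the AWS CLI.
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")


def in_region_access_possible(env: Optional[Mapping[str, str]] = None) -> bool:
    """True when the advertised region matches the region of the data."""
    return current_aws_region(env) == EARTHDATA_REGION
```

This only inspects environment variables, so a laptop that exports `AWS_REGION=us-west-2` would still pass the check and then fail on the actual S3 request. It is a cheap early warning, not proof of location.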
@vincentsarago A few questions on filesystem instantiation Rather than manage our own fsspec filesystem should we defer to just using the There are a few possible issues here
👍, I think I wanted to reuse the code @abarciauskas-bgse worked on for titiler-xarray, believing there was some optimization done!
TBH, I don't think we need to do this; we can simply require this project to be deployed on AWS in the same region as the data 🤷
I don't think there will be much difference between the two TBH. fsspec and rasterio might not merge range requests (this would be interesting to validate; more info on multi_range requests can be found in rasterio/rasterio#2969). The main issue is that rasterio does not support multidimensional data, so it will be easier to use fsspec + xarray + rioxarray.
titiler/cmr/main.py
Outdated
@asynccontextmanager
async def lifespan(app: FastAPI):
    """FastAPI Lifespan."""
    app.state.cmr_auth = earthaccess.login(strategy="netrc")
I'm wondering if we should keep this. We are only using earthaccess to search for data, not access it, at the moment. earthaccess.search_data works without login.
If we decide to use earthaccess to open the data, which seems reasonable, I would guess we want 2 options for accessing the data:
- (preferred) the titiler-cmr instance is using a role with access to the data.
- titiler-cmr has environment variables, earthdata username and password which are used to fetch credentials.
Should this be strategy="environment" for the second case?
As far as I know, accessing Earthdata cloud data won't work outside us-west-2, so testing tiling locally won't work unless you are running it on an instance in us-west-2, so environment variables seem a more appropriate option.
I'm wondering if we should keep this. We are only using earthaccess to search for data, not access it, at the moment. earthaccess.search_data works without login.
I'll make the login optional 🙏
Should this be strategy="environment" for the second case?
👍
Just wanted to note that there could be significant overhead with fetching s3 temporary tokens via EDL for every tiling request. Since our function is stateless we’ll need to make the full EDL request cycle every time. This was the rationale for employing external credential rotation on other projects https://github.com/NASA-IMPACT/edl-credential-rotation
Sure, I've added a cache layer on top of the credential getter but yeah it won't be super performant
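To make the idea concrete, here is a minimal sketch of a time-based cache around a credential getter. It is not the code added in this PR: `CachedCredentials`, its `fetch` callable, and the 50-minute default TTL are all hypothetical, and the TTL assumes the temporary S3 credentials stay valid for roughly an hour.

```python
import time
from typing import Callable, Dict, Tuple

Credentials = Dict[str, str]


class CachedCredentials:
    """Cache credentials per provider and refetch them once they are stale."""

    def __init__(
        self,
        fetch: Callable[[str], Credentials],
        ttl_seconds: float = 3000.0,  # assumption: refresh well before a ~1 hour expiry
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # provider -> (time after which the entry is stale, credentials)
        self._cache: Dict[str, Tuple[float, Credentials]] = {}

    def get(self, provider: str) -> Credentials:
        now = self._clock()
        entry = self._cache.get(provider)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh, skip the EDL round trip
        creds = self._fetch(provider)
        self._cache[provider] = (now + self._ttl, creds)
        return creds
```

In this thread's terms, `fetch` could wrap something like `aws_s3_credential(cmr_auth, provider)`. Because the cache lives in process memory, it only helps while the same worker or warm container serves consecutive requests, which is why external credential rotation can still be the better fit for a stateless deployment.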
…into vincents/sketch-mosaicbackend
@asynccontextmanager
async def lifespan(app: FastAPI):
    """FastAPI Lifespan."""
    if auth_config.strategy == "environment":
Updated the setting to be either environment or iam. When using environment, we will use earthaccess to get S3 credentials.
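A rough sketch of what such a two-valued setting could look like, using only the standard library. The `AuthConfig` class, the `TITILER_CMR_AUTH_STRATEGY` variable name, and the `iam` default are assumptions made for illustration; the actual settings object in titiler-cmr may be defined differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_STRATEGIES = ("environment", "iam")


@dataclass(frozen=True)
class AuthConfig:
    """Hypothetical settings object holding the chosen auth strategy."""

    strategy: str = "iam"  # assumed default: rely on the role attached to the deployment

    def __post_init__(self) -> None:
        if self.strategy not in VALID_STRATEGIES:
            raise ValueError(
                f"strategy must be one of {VALID_STRATEGIES}, got {self.strategy!r}"
            )

    @property
    def needs_earthaccess_login(self) -> bool:
        # Only the "environment" strategy fetches S3 credentials through earthaccess.
        return self.strategy == "environment"


def load_auth_config(env: Optional[Mapping[str, str]] = None) -> AuthConfig:
    """Build the config from environment variables (variable name is hypothetical)."""
    env = os.environ if env is None else env
    return AuthConfig(strategy=env.get("TITILER_CMR_AUTH_STRATEGY", "iam"))
```

With this shape, the lifespan handler would call `earthaccess.login(strategy="environment")` only when `needs_earthaccess_login` is true, and skip the login entirely for `iam`.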